
Why Does AI Come For Programming First?

Published: 2024-02-01

Profit and data: that's probably the most straightforward answer.

The detailed answer

Wait, don't we have much more data in copywriting, customer service, medical care, and law? That's true, and there are plenty of AI/LLM startups focused on those fields, like Harvey, Intercom, or Cera. Yet code generation is still the top use case for LLMs right now. Why? Likely for two reasons: market size and ease of data collection.

The market size

Let's go back to before LLMs were introduced to the world. In 2022, half of the 10 most valuable companies in the US were software/hardware companies, which shows that the tech sector is where the money is. Today, in 2024, 7 of the 10 most valuable US companies are tech-oriented, and the technical skillset that powers all of them is programming. So when every LLM innovation and benchmark is aimed at programming tasks, it's simply because we're trying to create profit for the most dominant and valuable industry in the US and the world.

Secondly, software and hardware are highly interdisciplinary fields. Healthcare requires them. Education requires them. Real estate requires them. Basically every industry you can think of requires them, and software/hardware aren't just small components in those industries either: they're the backbone and management system of the world. Solving programming therefore drastically impacts every other industry too. There's an old saying on Twitter:

Every company is a tech company.
In the future, it's likely that every company is going to be an AI company.

Data collection

If we set aside the text data in blogs, literature, and copywriting, code is probably easier to scrape than legal or medical records. Yes, you can scrape an entire bar-exam prep book and train your LLM on it, but that wouldn't make the LLM more capable at law-specific tasks. The LLM may be exceptional at recalling an obscure statute in Kentucky, but it cannot win your insurance lawsuit. These more advanced tasks require proprietary data from different law firms or court transcripts. The same goes for medical use cases: the LLM can recite the correct names of different diseases, but when a patient presents with five conflicting symptoms, I wouldn't trust an LLM. Doctors' medical notes or hospital records could help here, but again, that data is highly confidential and scattered. It might even live in a 20-year-old notebook instead of a real database.

This is where programming differs from many of these fields: its openness, open source in particular. The innate nature and culture of open source mean all kinds of programming patterns, designs, and use cases are shared publicly online: a gold mine of data for LLMs to train on for free. This doesn't guarantee high-quality code generation, though, because open-source, public code isn't necessarily quality code; it could even drag generation quality down. Despite that, the scale of the code data we have is still considerable for training LLMs.
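To illustrate just how low the barrier is, here is a minimal sketch of a toy code collector that walks a cloned public repository and gathers source files by extension. Everything here is an assumption for illustration: the extension list, the function name, and the approach are hypothetical, and real training pipelines use far richer language detection, deduplication, and quality filtering.

```python
import os

# Extensions we treat as "code" for this toy collector (an assumption;
# real pipelines detect languages much more carefully).
CODE_EXTENSIONS = {".py", ".js", ".go", ".rs", ".java", ".c", ".cpp"}

def collect_code_files(root: str) -> list[str]:
    """Return paths of code files under `root`, skipping VCS metadata."""
    collected = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip git internals so we only gather actual project sources.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            if os.path.splitext(name)[1] in CODE_EXTENSIONS:
                collected.append(os.path.join(dirpath, name))
    return collected
```

Point a loop like this at a few million public repositories and you have a training corpus. Contrast that with medical or legal data, where each record sits behind institutional walls and privacy law.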

This doesn't mean that programming is solved by LLMs. Far from it. Companies are still spending billions of dollars on talent and infrastructure to solve this. However, it does seem like the first technical task that AI will crack, based on the amount of investment we're throwing at it.