One of the traditional applications of artificial intelligence is in translating between languages. It’s so common that it’s boring: Google Translate has been around forever. You can use your iPhone to take a picture of a street sign in a foreign language, get an instant translation, and that’s not a miracle, but ordinary. I know lots of people excited about technology, but nobody who’s excited about translation as the next big thing — like, don’t we already have it?
In this essay, I’m going to make exactly that case: translation is the next big thing. Language translation as a concept has been so normalized for so long that progress in it and related fields is hugely underrated, especially in its future economic impact. While we are early on in an artificial intelligence boom and there are many open questions as to how and where exactly it will play out, I believe that one of the highest-confidence AI bets you can make is on ubiquitous language translation, which will have a significant globalizing economic and social impact, perhaps on par with the invention of the smartphone. Let’s dig in.
(1) The Tech
While traditional, text-based language-to-language translation has gotten much better in recent years,1 the key is that several other technologies have become really good in parallel, and are coming together productively:
Speech-to-text (i.e. automated transcription)
Text-to-speech (i.e. machine-generated voice)
Optical character recognition (OCR)
Audio and visual style transfer
When you put them together, magic emerges. For example, you can get perfectly translated videos,2 even with all the lip movements adjusted to match the sound.
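The composition of these pieces can be sketched in a few lines. The stub functions below are illustrative placeholders, not a real API — in practice each stage would be a separate model (a speech recognizer, a translation model, a voice synthesizer) — but the shape of the pipeline is the point: transcribe, translate, re-synthesize.

```python
def speech_to_text(audio: bytes) -> str:
    """Stub: transcribe audio into source-language text (pretend it heard Spanish)."""
    return "hola mundo"

def translate(text: str, target_lang: str) -> str:
    """Stub: translate text into the target language via a toy lookup table."""
    lookup = {("hola mundo", "en"): "hello world"}
    return lookup.get((text, target_lang), text)

def text_to_speech(text: str) -> bytes:
    """Stub: synthesize audio for the translated text (bytes stand in for a waveform)."""
    return text.encode("utf-8")

def translate_audio(audio: bytes, target_lang: str = "en") -> bytes:
    """Chain the three stages: transcribe, translate, re-synthesize."""
    transcript = speech_to_text(audio)
    translated = translate(transcript, target_lang)
    return text_to_speech(translated)
```

Each stage is independently replaceable — swap in a better transcription model and the rest of the chain is untouched — which is part of why open-source tinkerers can assemble working versions so easily.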
Importantly, this is not some highly proprietary, secretive R&D effort: it’s so simple and accessible that solo researchers can put together open-source versions that get the job done. Part of the reason why these technologies will be so disruptive is that their building blocks are often easy-to-work-with pieces of open-source software. The technology is already here: we are on the cusp of it going mainstream.
From dubbing videos, it’s only a short leap to doing it in real time: instantaneous language translation. Imagine walking around a foreign city with your AirPods in, and the chatter of another language becomes English in your ears. It’s been a long time coming: Bose and Google tried it back in 2018, and Timekettle seems to have gotten it to work today.
It’s a safe assumption that not too far in the future, you will have automatic, instantaneous translation always on. If you’re on a Zoom call with folks who don’t speak your language, then Zoom will translate it in real time. If you’re walking around a city where you don’t speak the language, but you’re wearing a pair of inconspicuous smart glasses (like the Meta Ray-Bans below), they will overlay the street signs with English. You could go on a date with someone who doesn’t speak your language — and so long as you’re both wearing translation earphones, it could go just fine.
In general, I believe that people will interact with the world around them more and more through digital interfaces. Many use cases have been cited for smart glasses and earphones — navigation, digital assistant, recording, etc. — but it seems clear that translation is at the top of the list, both in terms of usefulness and ease of implementation. It is too easy and the benefits are too large for this not to become the case.
(2) The Impact
The key term for thinking about the internet-age economy and society is innervation:3 to create a nervous system for a large organism. Historically, humanity was a large set of mostly-independent actors, and it took a very long time for information to spread from one group of humans to another. What the telegraph and radio started to do — and the internet kicked into high gear — is to give humanity a nervous system: a centralized, instantaneous way to transmit information across all people.
In this framing, ubiquitous translation is an extremely powerful agent of innervation. Most people do not speak one another’s languages, which means that the true extent of internet-enabled connectivity today is smaller than it seems. While it is true that 5.3 out of 8.1 billion people have been connected via the internet, they congregate in language-specific groups. For example, I don’t speak Russian, Arabic, or Japanese, so I’m mostly disconnected from those pockets of the internet. The table below shows how small these language groups are, relative to the size of the internet’s full population.
From the mid-90s through the early 2000s, the story of internet adoption was mostly about desktop users in developed nations getting broadband access, amounting to about two billion people. From about 2010 onward, the story of the next three billion people coming online was mostly about smartphones: the great agent of global innervation was the inexpensive smartphone with data, bringing the internet to mobile-first communities all over the developing world.
It is in this sense that ubiquitous language translation could have an impact on par with the smartphone: not by bringing more folks online, but by greatly increasing the connectivity of people on the internet. Simply put, given instant, always-on translation, everyone will be able to communicate better,4 with far more people than before.
Our cultural melting pot is about to become much larger, and the globalizing effects will be massive. There will be gains: for example, in exposure to a greater diversity of cultures and viewpoints, a higher degree of economic participation and exchange, and an easier ability to engage with one’s interests. There will be losses: for example, astroturfing and viral misinformation will be easier to spread, and language-specific internet subcultures will be lost as their users are amalgamated into larger communities.5
But as far as I can see, the largest impact will be on the global labor market. We are still very early when it comes to remote work. The tools for it have become better, and the pandemic greatly increased employers’ willingness to hire remote employees. But there are two very important considerations:
Remote work is still hampered by language barriers. While there are lots of folks all over the globe who speak English at a professional level, far more do not. Additionally, even fluent speakers may face bias against their accents, which likely creates a subtle reluctance to hire remotely in general. Real-time translation can remove these barriers: not just by translating between languages, but also by localizing accents to the listener6 — thereby removing subtle but powerful sources of workplace harassment and discrimination.
Implicitly, most of this discussion is about English-speaking firms hiring internationally. But that’s a biased perspective: most firms are not English-speaking. Suppose that you’re a Greek entrepreneur who speaks only Greek, looking to hire internationally: it will be difficult. It doesn’t help you if international applicants speak English, while you and your office don’t.
You may be thinking about always-on, real-time translation expanding the English-speaking internet, and US firms doing more international hiring. Those are true, but translation would provide a much bigger relative benefit — leveling the playing field — to those who are currently outside that ecosystem. Countries like Turkey, Brazil, and Bangladesh7 are the biggest winners from this kind of development.
In sum, the global labor market will become radically more efficient, and its many current national-linguistic subsectors will blend into one. The promise has been made for a long time, but at last, hiring someone five thousand miles away might become literally indistinguishable from hiring someone fifty miles away. How supply and demand will play out is hard to predict, but I suspect that citizens of developing nations will gain hugely in employment opportunities, while elite knowledge-work positions in wealthy enclaves will go remote at lower cost.
(3) Investing
We are on the precipice of a large-scale set of changes that, as a matter of technology and economics, seems inevitable. It raises questions about capital allocation: how do you best bet on this? That’s not easy to answer. The challenges are:
Second-order consequences are hard to predict;
Greater efficiency does not always create opportunities to capture profits;
This is a long-term secular trend that may not play out overnight, and positioning today for a five-, ten-, or fifteen-year trend is not easy.
Regardless, I’ll provide some thoughts below.
Invest At the Fundamental Model Layer?
Probably not for me. My first hunch is that future generations of LLMs will outperform state-of-the-art machine translation systems. My second hunch is that open-source LLMs will probably perform just as well as proprietary LLMs in this respect in the long run. While I see some fields where proprietary fundamental models may maintain some perpetual utility edge, I don’t think that’s the case here.8
At the Fine-Tuned Layer?
Translations are contextual! As I suggested in footnote 1, one of the great perks of the architectures at play is the ability to apply context and stylize the output: the desired style of translation will differ depending on whether the input is, for example, a contract, a patent, a sales email, a poem, or a novel. If I’m listening to a Spanish podcast translated into English, I might want it narrated in the voice of Brian Blessed; or if I’m reading shareholder letters from Japanese companies, I might like them written in the crisp prose of Steve Jobs. While there will certainly be general translation models, you may expect to see thousands of fine-tuned stylistic models. I am uncertain whether this yields a single venture-scale opportunity or an artisanal cottage industry.
At the Interface Layer?
To me, this is the more likely bet. I mentioned that I expect to see more and more digital interfaces that people use to navigate the world around them. While that includes things like smart glasses and earphones, it also includes regular desktop and mobile software. Imagine an application that is always on and translates any foreign words that pop up on your screen, before you even see them. Imagine an audio plugin that translates any language coming out of your speakers.
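At its core, a screen-overlay translator of the kind imagined here is a selective rewrite of on-screen text. The sketch below is a toy version: `looks_foreign` is a deliberately crude placeholder heuristic (a real product would use a language-identification model), and `translate` can be any function mapping a word to its English equivalent.

```python
def looks_foreign(word: str) -> bool:
    """Toy heuristic: flag words containing non-ASCII characters.
    A real overlay would use an actual language-identification model."""
    return not word.isascii()

def overlay_translate(screen_text: str, translate) -> str:
    """Rewrite only the foreign words, leaving the rest of the screen text untouched."""
    words = screen_text.split()
    return " ".join(translate(w) if looks_foreign(w) else w for w in words)
```

With a tiny dictionary stub standing in for a real translator, `overlay_translate("the Übersetzung is ready", ...)` yields "the translation is ready" — the English words pass through unmodified, which is exactly the "before you even see them" behavior an always-on overlay needs.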
Software: there is a lot of diverse opportunity for these translation layers, and the field is fresh. Some competition is starting to crop up in the video translation space, but it’s still early. Due to the accessibility of the technology, there is some danger of commoditization, so startups seeking to sustain significant profit margins long-term will need to think about network effects and moats. Regardless, there certainly exists some opportunity.
Hardware: this gets a little more interesting. The space is wide-open, and the user experience offered by these physical devices will be paramount. Surely it’s possible to make a nicer product than Timekettle. This may be a good arena for a talented product designer to build something excellent and establish a quality moat. It’s noteworthy that while Apple and Meta should be seen as formidable long-term competitors, Apple hasn’t really managed to integrate AI with its hardware productively. It is surprising that Siri is so far behind nowadays; perhaps this is an area of institutional weakness for Apple that can be attacked.
At the Geographic Level?
Another approach is to focus on the downstream consequences of translation. The effects on the labor market, accelerating globalization, etc. lend themselves well to betting on labor marketplaces, high-skilled education targeted at geographies with strong labor/wage arbitrages,9 remote work enablement, collaboration software, and so on. The main question on these opportunities is whether they are more ripe for new entrants, or for incumbents with existing network effects.
Finally, while we have discussed labor markets at length, I also suggested earlier that the largest relative benefit might accrue to entrepreneurs who are currently outside the English-speaking internet ecosystem. There are many geographies that currently lack globally significant entrepreneurship, but have the talent and the regulatory environment for it. They just need global language access. For example, look at how successful South Korea has been in exporting culture (film, television, music), punching far above its weight on a global scale: but not yet in software. As language barriers fade, this too shall come.
Thanks to Evan and Gavin for their comments and feedback on this piece.
There’s some debate as to whether the current generation of Large Language Models are superior to state-of-the-art machine translators, such as DeepL. You can read some of the discussion about which is better in which context here and here. To me, it seems clear that right now the field is divided: in some contexts, machine translators are better, and in other contexts, LLMs are better. I think there is good reason to believe that future, higher-parameter LLMs will be better than the current ones, and I expect they will eventually (stochastically) dominate traditional machine translators.
An additional important perk of LLMs, as opposed to machine translators, is that they can be supplied with additional instructions. For example, I might submit an English text to Google Translate, and ask for it to be translated into German. I’ll get a result. However, I could prompt an LLM with the same translation request, and ask it to translate in the style of, for example, the translator Michael Hofmann. Or I could ask for a translation that leans into the style of the German-language novelists Stefan Zweig or Hermann Hesse. The ability to perform not just translation, but translation in a particular style, or more generally translation with additional meta-context, seems like a valuable point in favor of LLM translators.
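Mechanically, this meta-context is just extra instruction text assembled into the request before the text to translate. A minimal sketch of how such a prompt might be built (the wording is illustrative, not a recommended template):

```python
def build_translation_prompt(text: str, target_lang: str, style: str = "") -> str:
    """Assemble a translation request, optionally with stylistic meta-context."""
    instruction = f"Translate the following text into {target_lang}."
    if style:
        instruction += f" Render the translation in the style of {style}."
    return f"{instruction}\n\n{text}"
```

Calling it with `style="Michael Hofmann"` produces the styled request described above; omitting `style` falls back to a plain translation request, which is roughly what a traditional machine translator receives.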
Some of these forms of better communication are not immediately obvious. For example, voice messages are far more common in some cultures and language groups than others; I think that has something to do with how well those languages map onto a digital keyboard. Again, a multi-modal, always-on digital interface that handles translation could greatly lower the accessibility barriers currently faced by folks whose language doesn’t use a Latin alphabet.
Similar to how tens of thousands of niche online forum communities died as discussion migrated to mass platforms like Reddit and Facebook. It will be a brave new world for any internet anthropologist.
For a related contemporary example, see this speech by Javier Milei, translated by HeyGen. What’s noteworthy about this speech is that HeyGen not only translated Milei to English, but then applied Milei’s accent back to the translation for authenticity.
For this example, I have picked countries that have low rates of English proficiency and are by far the globally primary speaker of their national language.
I mean “here” in the general sense: not just language-to-language translation, but also speech-to-text, text-to-speech, speech style transfer, etc. To me, this whole class of applications looks like something where capabilities eventually max out for practical purposes, and open-source models will get there.
I completely agree with the premise. Translation is an enormous and under-appreciated opportunity.
Re: fundamental model layer, though...
Traditional and fine-tuned LLMs *will* be excellent at this, eventually. But real-time translation during conversation may be a little like AR/VR: we need it to be fast enough to be fluid, and the challenge of making it that fast could be really high.
In that case, general LLMs and fine-tuned ones might not be good enough for a long time. Translation differs because...
1) You might need shallow inference from a wide context window with focused depth in a narrow one. (It's critical to know previously stated names, and where you are. But not every detail about what is going on.)
2) The audio in language A is obviously an incredibly important input into any model to predict the word in language B.
In particular, (1) and (2) suggest training a different neural network. Which isn't a huge deal, but... it is incredibly expensive and needs effort. And once you've designed this, you might have a different amount of data to manage (much less?!) which leads to a slightly different ideal hardware solution. In theory, someone like Nuance *should* be eating this problem up, but I would bet against them.
I'm not from the world of computer science but my interest in AI was sparked by watching 'The A.I. Dilemma' presentation from Tristan Harris and Aza Raskin. In describing the advent of transformers, the 2017 gamechanger for AI, they remarked "The sort of insight was that you can start to treat absolutely everything as language, but it turns out you don't just have to do that with text. This works for almost anything."
This is what made me realise the potential of AI and why it implicates everybody. Lawyers communicating and collaborating with musicians and mathematicians. It's something people should be exploring and reckoning with as soon as possible.