In 2012, Dr Tomas Mikolov received his PhD in Artificial Intelligence at the Brno University of Technology in the Czech Republic with a thesis titled ‘Statistical Language Models Based on Neural Networks’. A year later, while working at Google Research, he published two highly influential papers in which he introduced the Continuous Bag of Words (CBOW) and skip-gram algorithms, together known as Word2Vec. As a result, words could be numerically represented in a dense continuous space following a simple training procedure. This was one of the first numerical approaches that effectively captured word semantics, and it allowed much larger vocabularies to be processed. The technique outperformed the state of the art on many NLP tasks, and Word2Vec’s successors still play an important role in today’s state-of-the-art language models. Dr Mikolov believes complex systems might be the next step towards intelligent language models. However, to reach such models, the scientific paradigm needs to change to create a level playing field and allow for novelty.
A decade after defending his PhD thesis, he has been cited more than 125 000 times and has an h-index of 49 and an i10-index of 85, according to Google Scholar. In 2014, he moved to Facebook, and in 2020 he returned to the Czech Republic. At the Czech Institute of Informatics, Robotics and Cybernetics, he is building a team to develop a system that could gradually evolve into strong artificial intelligence.
Your Word2Vec algorithm was revolutionary in the natural language processing domain. Can you describe the publications leading up to your work?
A very influential research group, led by psychologist David Rumelhart, was already working on similar concepts in the 80s. One of Rumelhart’s students was Geoff Hinton, famous for his work on neural networks. Already in the 80s, they used neural networks and distributed representations to represent words and showed interesting properties. In the 90s, Jeff Elman used recurrent neural networks to model language. He used artificial data generated from simple grammars written by hand. Hence, his work had many simplifications and was not nearly as complex as today’s; it was not even close to our current state of the art. But it was a very inspiring and forward-looking approach to representing language. His very influential 1990 publication, ‘Finding Structure in Time’, discusses approaches to representing time in connectionist models. This was inspiring work for me when I worked on language models as a student. Yoshua Bengio published an influential neural language modeling paper around 2002 in which he outperformed standard language modeling baselines on small datasets. Later, I published several papers with Yoshua and visited his group for half a year. Lastly, the first person I discovered to use neural networks for general sequence prediction with state-of-the-art performance on challenging benchmarks was Matt Mahoney – his PAQ algorithms were basically neural language models used for data compression, with amazing performance.
Who influenced you the most?
The most influential person for me was Matt Mahoney in the very beginning, and then later Holger Schwenk; I found his papers simpler to read than Yoshua’s. They contained quickly implementable methods instead of unnecessarily complex approaches. Hence, I was trying to do something similar myself. When I started my master’s thesis in 2006, one of the first models I implemented was a recurrent neural language model. I did not know any prior work then, and I was very excited about this recurrent network idea I had just invented. But it did not work well at first – it was better than the n-gram model but worse than a simple feedforward neural network. At the time, it was very challenging to make such models work because we did not know how to handle exploding and vanishing gradients. The community’s interest in the learned memory in recurrent networks was high in the 80s and 90s. However, nobody knew whether stochastic gradient descent could work, and some papers claimed it could not. Furthermore, while people had limited success on small datasets, nobody could successfully train recurrent networks on large datasets, at least not without sacrificing most of their performance. Nowadays, we know they can, and these stories from the past are hard to grasp.
This was an exciting story for me. I came up with the idea of generating text from neural language models in the summer of 2007 and compared that to text generated from n-gram models (taking inspiration from the SRILM toolkit). The increase in fluency was incredibly high, and I knew immediately that this was the future. It felt very cool to observe these results while knowing I was the first person ever to see them – like discovering an unknown island full of strange animals.
When I started working on RNNs, I did not know about vanishing and exploding gradients. After some time, I managed to make RNNs work very well on small datasets. Even evaluating and comparing various language models was a challenge in its own right, because at the time all published models were typically evaluated on private data, and the code was not released. Luckily, I managed to get one dataset during my internship at Johns Hopkins University in the group of Fred Jelinek in 2010. After a few small tweaks, I published it on my website, and that is how the now very well-known Penn Treebank language modeling benchmark came into existence. It has nothing to do with treebanks at all – I simply meant it to be used for comparing different language modeling techniques while remaining compatible with results previously published by JHU researchers.
I also published my RNNLM code in 2010, including the text generation part, so that other researchers could reproduce my results easily. This was crucial: the improvements over n-grams I was obtaining were so high that pretty much nobody at the time believed my results could be correct.
However, the chance that my recurrent networks would not converge grew with dataset size. This messy behavior on large datasets was unpredictable. While some 90% of models trained on Penn Treebank converged to good performance, on larger datasets this dropped to something like 10%. Because RNNs are difficult to implement from scratch, I assumed there must be a bug in my code – that I had computed the gradients incorrectly or had some numerical issues.
I looked for the bug for days. Ultimately, I isolated the place where the entropy spiked and things went wrong: some gradients became massive and wrecked the training by overwriting the model’s weights.
What did you do to resolve this?
My solution was crude. I truncated the gradient values whenever they exceeded a threshold. Any mathematician seeing this trick would think it is horrible. Still, the main thing was that the gradients exploded very rarely, and thus anything that prevented the explosions was a good enough solution. This heuristic worked and allowed recurrent neural language models to be scaled to larger datasets. Nowadays, it is much easier to debug code because you know what results you are aiming for, with standard models evaluated on standard datasets. But in my time, this was different. I was getting new state-of-the-art results but did not know how much further one could go. It was exciting; I was climbing a mountain that no one had visited before me, and I had no idea how high it was. Eventually, I managed to get the perplexity on Penn Treebank to around 70, which was about half of what n-grams achieved. This remained the state-of-the-art result for quite a few years. Although here I can complain that language modeling results started being reported incorrectly around 2014: with the invention of dropout, researchers began focusing on achieving the best possible results with a single model. But then all my results, which were model ensembles, were discarded. However, the dropout technique is itself an ensemble in disguise.
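The trick described here is now standard practice under the name ‘gradient clipping’. A minimal sketch of the element-wise truncation variant (the threshold value below is an arbitrary assumption, not the one used in the original RNNLM code):

```python
def clip_gradients(grads, threshold=5.0):
    # Truncate each gradient component to the range [-threshold, threshold].
    # Explosions are rare, so this crude cap is enough to keep training
    # from being destroyed by a single huge update.
    return [max(-threshold, min(threshold, g)) for g in grads]

# One component has "exploded"; clipping tames it without touching the rest.
print(clip_gradients([0.1, -0.3, 250.0, -1e6]))  # -> [0.1, -0.3, 5.0, -5.0]
```

Modern frameworks more commonly clip by the norm of the whole gradient vector rather than element by element, but the effect is the same: bounding the size of any single update.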
Following this experience, I found another inaccuracy in the deep learning narrative. In 2014–2016, when deep learning’s popularity skyrocketed, explanations emerged for why the boom happened at that point and not before. Many attribute the rise in popularity to increasing computing power and large datasets. But this is not the whole story. The reason it started working is that we* figured out how to use the algorithms correctly. For example, you could take my RNNLM code and run it on hardware and datasets from the 90s – and you would obtain state-of-the-art results by a large margin compared to the techniques of that time.
Obviously, having more computing power never hurts, and it was crucial for adoption by the industry. However, the deciding factor for the popularity in the research community was knowing how to use these algorithms correctly; the increasing computational power was secondary. I would also rate open sourcing and overall reproducibility as very important factors. Deep learning’s history is much richer and longer than many people think nowadays.
*This was, of course, not just me; Alex Krizhevsky made CNNs work for image classification, George Dahl, Abdel-rahman Mohamed and others figured out how to use deep neural networks for speech recognition, and many other PhD students of our generation contributed as well.
Did you already think about representing words differently during your PhD?
Indeed, I did not come up with Word2Vec when I worked at Google; I had already done so before then. The first thing I did that was similar to Word2Vec was during my master’s thesis in 2006. I didn’t know much about neural networks at the time. I saw a paper from Yoshua Bengio that used a projection and a hidden layer. I didn’t know how to work with neural networks with more than one hidden layer, so I decided to split the model into two parts. The first was just like Word2Vec – it learned the word representations from the training set. The second network then used these concatenated representations at the input to represent the context and predict the next word. Both networks had just one hidden layer, and the results were pretty good – similar to Yoshua’s paper.
During my PhD, my first paper at an international conference was about this model. While it wasn’t all that impressive, I knew that good word vectors could be learned using rather simple models. Later, I saw several papers that used more complex neural architectures to learn word vectors. It did seem rather silly to me – people would train a full neural language model, then throw it away and keep just the first weight matrix. But the research community interested in this was tiny, and I thought it was not worth publishing anything on this topic. Later, when finishing my PhD, I interned at Microsoft Research with Geoff Zweig. He was an amazing supervisor, but sometimes he would express doubts about neural networks being the future of language modeling – so I was thinking of how to impress him.
What did you do to convince him?
This is a funny story. I made some calculations and double-checked the outcome before approaching him. Then, I asked him whether it would be possible to apply simple additions and subtractions to word vectors. I asked him what the closest vector would be after subtracting ‘man’ from ‘king’ and adding ‘woman’ (other than the input words, or else you would often end up where you started).
He told me this was a rather silly idea and that nothing sensible could come out of it. So, I immediately took him to my computer and showed him the experiment – it returned ‘queen’. He was amazed and started playing around. He tried past tenses of verbs, plurals, etcetera. Some ideas worked, and some did not. But it was much better than random guessing. It was fascinating to see these analogies for the first time. It raises fundamental questions. Why does it show these regularities? Why is this fully linear? Why don’t you multiply vectors instead of adding and subtracting them?
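The experiment can be reproduced in miniature. The toy 2-dimensional vectors below are hand-crafted purely for illustration (real Word2Vec vectors are learned from data and have hundreds of dimensions), but the procedure is the one described above: take the nearest neighbour of king − man + woman under cosine similarity, excluding the input words.

```python
from math import sqrt

# Hand-crafted toy vectors; the two dimensions stand roughly for
# "royalty" and "gender". These are illustrative assumptions,
# not vectors learned by Word2Vec.
vectors = {
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "apple": [-1.0, 0.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b))
    return dot / norm

def analogy(a, b, c):
    # "a is to b as c is to ?": nearest neighbour of b - a + c,
    # excluding the three input words, as described in the interview.
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("man", "king", "woman"))  # -> queen
```

Excluding the input words matters: king − man + woman usually lies closest to ‘king’ itself, which is the parenthetical caveat above.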
Were your Google colleagues as skeptical as your supervisor?
I wouldn’t call Geoff Zweig skeptical but rather careful. He was actually very supportive, and it was easy to convince him that some ideas were worth pursuing. I’ve had much more trouble earlier in my career. When I started working on neural language models, I received extremely negative reviews from a local linguist at the Brno University of Technology. He would go as far as to say that the whole idea of using neural networks for modeling language is complete bullshit and that my results must be fake. He almost managed to get me kicked out of the PhD program.
When I joined Google Brain, several colleagues were already trying to learn word representations. However, they tried to train large language models to get the word vectors. In large language models, 99.9% of the training time is spent updating parameters that are irrelevant to the word vectors. From my master’s thesis in 2006, I knew that if the final task was not language modeling, such large language models were unnecessary. Instead, a simpler model suffices to compute the word vectors.
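To illustrate how little is needed, here is a minimal skip-gram model with negative sampling in the spirit of Word2Vec, written in plain Python. The toy corpus, dimensionality and hyperparameters are arbitrary assumptions for illustration; the original implementation is far more optimized (for example, it draws negative samples from a unigram-based table rather than uniformly).

```python
import math
import random

random.seed(0)

corpus = "the king rules the land the queen rules the land".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
dim, lr, window, negatives = 8, 0.05, 2, 3

# Two weight tables: input ("projection") vectors and output vectors.
vec_in = [[random.uniform(-0.5, 0.5) / dim for _ in range(dim)] for _ in vocab]
vec_out = [[0.0] * dim for _ in vocab]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_pair(center, context, label):
    # One SGD step on a (center, context) pair; label 1 = observed, 0 = noise.
    v, u = vec_in[center], vec_out[context]
    g = lr * (label - sigmoid(sum(a * b for a, b in zip(v, u))))
    for i in range(dim):
        v[i], u[i] = v[i] + g * u[i], u[i] + g * v[i]

for epoch in range(200):
    for pos, word in enumerate(corpus):
        for off in range(-window, window + 1):
            ctx = pos + off
            if off == 0 or not 0 <= ctx < len(corpus):
                continue
            train_pair(idx[word], idx[corpus[ctx]], 1)
            for _ in range(negatives):  # uniform negative sampling
                train_pair(idx[word], random.randrange(len(vocab)), 0)
```

After training, `vec_in` holds one dense vector per vocabulary word. On a corpus of hundreds of millions of words, this kind of model is what produces the vectors discussed in this interview, with no full language model in sight.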
I shared this insight with some colleagues. However, no one really listened. Some were following a Stanford paper, which was complicated and contained many unnecessary components. Having just started at Google Brain, my first goal was to show how the problem could be solved efficiently. I started playing around, and it worked well within a few weeks. Using an ordinary desktop computer, I could train models on hundreds of millions of words in a few hours. My model beat an internal Google model that had been trained over weeks on many machines.
What happened then?
Yoshua had just organized a new conference, ICLR, and asked if I could submit a paper about the word analogies, as it was quite a surprising result back then. He thought this would make a cool paper. He reached out to me halfway through December; the deadline was early January. So I spent my Christmas holidays in California writing the Word2Vec paper. It was not very well written, but I cared more about the implementation and results than the paper. Supported by my colleagues, I submitted the paper to ICLR. But unfortunately, the reviews were quite negative (it was an open review, so it should still be accessible). One reviewer complained that the model did not consider word order. Another reviewer tried to force me to cite certain papers more extensively – papers I had already cited, and which were published after my own papers on the model from my master’s thesis (which already contained the main idea).
Here is a funny detail. Although it is a well-known conference now, this was ICLR’s first edition, and it was small. The acceptance rate was around 70%, so pretty much everything that was not totally awful was accepted. But the Word2Vec paper got rejected, although today it is probably cited more than all the accepted papers of ICLR 2013 together. Then, I decided to write another paper with several extensions. That paper was finally accepted to NIPS.
You have never published your first paper anywhere else, have you?
The first paper was accepted to a workshop after it was rejected from the ICLR conference. But I don’t think a workshop counts as a publication. It was also published on arXiv, and I was happy people could read it. When I published it, I knew it was much better than anything then available – at least in the aspects I cared about. The algorithm was not overly complex and provided very good results in practice.
Did you expect the paper to become this well-cited?
The neural language modeling community was small when I published the paper. However, I was very optimistic and expected that at least fifty people would use it within a year. Six months after the paper was published, it still went unnoticed. This was because Google did not approve my open-sourcing of the code. Initially, they perceived the code as a competitive advantage. However, I kept pushing to open-source it. The senior people around me told me to stop trying, as I would never get the approval. Luckily, I knew people at Google Brain in even higher positions, and they managed to bypass the blockade. Finally, Google approved open-sourcing the code around August 2013. That is also why the code was somewhat over-optimized: while waiting for approval, I tweaked it to make it shorter and faster. Once the code was open-sourced, interest skyrocketed. Many people were interested in Google’s machine-learning activities and liked that Google was open-sourcing its code. This helped immensely. I was, in fact, surprised how many people started using the code and the pre-trained models for all kinds of purposes, in some cases even outside the area of modeling words and language.
Why are you advocating open source?
As a student, I found it difficult – often impossible – to compare different algorithms from peers. Some fifteen years ago, it was normal to publish a language modeling paper evaluated on a private dataset without any open-source implementation. That was, in my view, the main reason why language modeling research did not show much progress over the previous decades. I reached out to several researchers asking for their datasets, but without success. At some point, nobody could validate published results, and the community was dead. I found that certain people were even cheating when reporting results, for example by using weak baselines or reporting the best results after tuning hyperparameters on the test set (or even training models on the test set, which was rare but not unheard of). I was inspired by Matt Mahoney’s work in the data compression community and wanted to revive interest in statistical language modeling, so I published both my code and the data whenever possible. Of course, one big aspect was that when I started publishing my results with large-scale neural language models, my improvements were so big that pretty much the whole research community did not believe my results could be correct. But since nobody could find any mistake in my code (and many tried – I received a number of emails from people who thought they had finally found a “bug” in my code), my RNNLM toolkit became used in several big companies, and language modeling research finally took off. That was the beginning of deep learning in NLP.
Are there downsides to open-sourcing?
I think so. When new students join the AI community, they should try to develop their own models and discover new ideas. It is, however, very difficult because they end up competing against state-of-the-art models which were incrementally optimized over the years by many researchers.
Alternatively, a student can download someone else’s code, and even pre-trained models, which are usually complex and which they probably don’t fully understand. Then they tweak it, make incremental changes and publish the results in a paper. This approach is much easier. Meanwhile, it is a dangerous development for science, as it locks us in a local optimum. Several dominant ideas are being over-explored, and very few people are thinking about novel approaches that can bring new paradigm shifts. Open-sourcing and publish-or-perish have together helped to create an environment where taking risks is not rewarded.
So there are benefits to open-sourcing code. Meanwhile, the adverse effects are apparent. Is there an ‘optimal’ approach in the middle?
Given the importance of computational power, the group with the most GPUs has a major advantage over everyone else. This discourages people in academia and creates unfair competition; not everyone has the same starting conditions. It is as if you went to the Olympic Games to compete in running, but during the race found yourself competing against people on bicycles. Regardless of how good you are, you will lose. The same happens when students with limited resources compete against tech giants. They might have better ideas but still get their papers rejected for not being state-of-the-art. This issue needs to be addressed by the community.
A simple solution to this problem is to impose limits on the computation allowed for papers published in certain benchmark tracks. Following this approach, papers would be submitted together with code that can train in X hours on a standardized machine. Still, one could exhaustively explore the search space beforehand and submit code with the best hyperparameters, so people with more computing power would still have an advantage. But at least the competition would be fairer. By the way, when Matt Mahoney proposed his compression challenge, he had already considered this.
What are other issues in the machine learning community?
As the AI community doubles in size every few years, dominant scientists easily take over the mindsets of junior researchers. However, the dominant scientists who post tons of tweets and Facebook posts and have strong opinions on everything are not always the ones whose contributions are as strong as they like to pretend. A community where a few dominant senior scientists lead masses of junior researchers starts to look like some sort of cult. That means some ideas, techniques or models are blindly pushed forward without real evidence that they are worth all the effort. For example, generative adversarial networks (GANs) seemed incredibly over-hyped. This is not a new phenomenon. When I was a student, I remember being confused by the popularity of Latent Dirichlet Allocation – it also did not seem to work any better than simpler baselines. But nowadays I see this as a bigger problem, as information spreads much faster.
The over-emphasis on results achieved through brute force is illustrative of this problem. Many believe good models are complex-looking and full of hyper-parameters and little tweaks. If a simple idea that works is proposed, a reviewer often argues that anyone could have done it and that it is thus not worth publishing. I have seen this several times and consider it completely dumb. In fact, I believe the opposite: simple ideas that work in practice are the most valuable and the most difficult to discover. Sort of like in physics, where scientists try to develop ever more general theories that explain as many phenomena as possible.
In fact, this happened to Word2Vec, and also to some of my language modeling work. When a poor reviewer sees two papers with similar ideas, but one of them adds a dozen unnecessary tweaks, the reviewer will choose the complicated paper as the better one, since it seems more work went into it. In reality, the opposite is often true – if you can get state-of-the-art results with a simple idea, then the idea is probably really good.
How can we get better reviewers?
Machine learning could take inspiration from physics. Over centuries of research, physicists have aimed to create simple theories that explain as much as possible. Meanwhile, the opposite happens within machine learning. We should abandon the emphasis on state-of-the-art results and complex models and focus on discovering interesting new ideas. Of course, this is highly subjective, and it would be better if we could turn machine learning into an Olympic discipline with clear rules that decide who is better. But as I mentioned before, I don’t think this is easily achievable. Today, you can propose an amazing new idea that could potentially be the next state of the art and still get discouraged and rejected by the community for not being state-of-the-art on some large benchmark. PhD students don’t get enough time to develop their own methods and approaches. We should change this and start rewarding novelty and simplicity, even if these are not easy to measure.
Perhaps you’ve heard of the reviewing experiment at NIPS, in which multiple groups of reviewers reviewed the same papers to see how much their accept/reject decisions correlated. It was found that there was a strong correlation only for the very poor papers. In other words, the reviewing system is very random.
We should aim to create a better reviewing system. Currently, there is no quality feedback in the reviewing system; it allows reviewers to be constantly wrong and still review more papers. We should have databases of reviewers that automatically track their performance, with their quality computed from their ability to predict successful papers. Then, for example, papers with excellent ideas but poor English would get accepted.
In the IEEE SMC conference plenary talk, you described your focus on complex systems as the next step in artificial intelligence. Is this an approach to elegantly simplify the rules for computer scientists?
Complex systems are systems with simple rules in which complexity arises through emergent, evolutionary mechanisms that you don’t even specify. Let’s take the ‘Game of Life’ as an example. You start with something simple and then simulate the system until all kinds of complex structures emerge. This has always been my view of the universe. Many things around us seem complex. However, these complexities can be seen as by-products of evolution. Natural intelligence is a product of evolution. If we want to mimic this with artificial intelligence, we should follow a similar approach – allow AI to evolve and have the potential to increase its complexity spontaneously.
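The Game of Life makes the point concretely: its rules fit in a dozen lines and are fully deterministic and local, yet moving structures such as the ‘glider’ emerge from them. A minimal sketch (the glider coordinates below are a standard example pattern):

```python
def life_step(cells):
    # One step of Conway's Game of Life. `cells` is the set of (x, y)
    # coordinates of live cells. The rules are purely local: a cell is
    # alive in the next step iff it has 3 live neighbours, or has 2 live
    # neighbours and is currently alive.
    def neighbours(x, y):
        return [(x + dx, y + dy)
                for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                if (dx, dy) != (0, 0)]
    candidates = cells | {n for c in cells for n in neighbours(*c)}
    def alive(c):
        n = sum(1 for nb in neighbours(*c) if nb in cells)
        return n == 3 or (n == 2 and c in cells)
    return {c for c in candidates if alive(c)}

# A glider: after 4 steps the same shape reappears, shifted diagonally.
glider = {(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)}
state = glider
for _ in range(4):
    state = life_step(state)
print(state == {(x + 1, y + 1) for (x, y) in glider})  # -> True
```

Nothing in `life_step` mentions gliders; the moving pattern is an emergent property of the two rules, which is exactly the kind of unspecified complexity described above.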
How does this compare with evolutionary algorithms?
One could use evolutionary algorithms to approach this. However, I don’t think these algorithms capture evolution well. They perform stochastic optimization: if the fitness function improves, you follow that random direction. Hence, the gradients are randomly selected instead of being computed. But this is not evolution, in my view – after all, evolutionary algorithms tend to stagnate rather quickly. Real evolution can be found in complex systems, even deterministic ones. Nothing is stochastic in the Game of Life; you don’t have to roll a die. And still, you can see novel patterns arising. My goal is to create systems that can evolve spontaneously based on the emergence of complexity. I feel that discovering machine learning models that can implicitly grow in complexity has the potential to make our AI models much more powerful. It could be a way to make machine learning truly creative.
How will you create such systems?
I suspect understanding emergence is needed to solve AI. However, we don’t understand this direction well. When I started working on recurrent neural networks, I hoped these could be a shortcut towards interesting, complex systems where emergence happens within the model’s memory. But typical recurrent network architectures have somewhat limited memory capacity. We need to design novel machine-learning models, training algorithms and evaluation metrics. I am working on this with my students.
How would this contribute to the machine-learning community?
The community has developed a herding culture. We are all going in the same direction, building on the existing pile. This mindset might be strengthened by all the open-sourcing and public benchmarks I advocated quite heavily. However, the approaches we are extending might be wrong. Then everyone builds on flawed assumptions. If this is the case, it needs to be fixed. As I mentioned, we should explore different ideas and reward novelty within the research community.
This sounds like a reason not to open-source more.
Open sourcing is, in my opinion, great, and we should continue. Remember that when almost nobody published code and datasets were private, researchers generally did not trust each other’s results. The language modeling community was pretty much dead.
At the same time, we should avoid the dangers of open-sourcing: too much incremental work, minor tweaks that provide tiny improvements (and only sometimes), and discouragement of exploring novel ideas.
This marks the end of the interview. Is there a last remark you would like to make?
We should be more open to originality and new directions. Yet, this is very difficult to judge. Do we want to see seemingly insane ideas at conferences? As a community, we need to make conferences more interesting so that you don’t just see hundreds of modifications of Transformers or their applications to hundreds of datasets. Let’s be more ambitious and more exploratory.