Training a next-word prediction model from scratch
In the last post we played around with BERT and saw that it predicted words pretty well. To show how much BERT improves word prediction over previous state-of-the-art models, we'll train our own word prediction model using an older architecture: an LSTM. We're not going to delve into what these models are or how they work in this post (you can read more about them here).