There are many methods for training a model to predict an unknown word. Today let's take a look at Continuous Bag-of-Words (CBOW) embeddings. CBOW works by predicting a single word from a fixed-size window of context words. For example, if our sentence is “The boy walked the dog in the park”, we use the context words [“boy”, “walked”, “dog”, “in”] to predict “the”. The original paper shows CBOW to be a fast-converging algorithm that works well at learning syntactic relationships between words.
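To make the windowing concrete, here is a minimal sketch in Python of how (context, target) training pairs could be built from that sentence, assuming a symmetric window of two words on each side; the pairing logic is an illustration, not the exact code behind this page:

```python
# Build (context, target) pairs for CBOW with a symmetric window.
sentence = "the boy walked the dog in the park".split()
WINDOW = 2  # context words taken from each side of the target

pairs = []
for i, target in enumerate(sentence):
    # Up to WINDOW words on each side; slicing handles the sentence edges.
    context = sentence[max(0, i - WINDOW):i] + sentence[i + 1:i + 1 + WINDOW]
    pairs.append((context, target))

print(pairs[3])  # (['boy', 'walked', 'dog', 'in'], 'the')
```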
Quick guide for interaction: hold and select to pick a particular region to zoom in on. Double-click to zoom back out. The top-right toolbar shows additional options such as panning and selection modes. Use Control-F to search for a particular word if you’re not sure where to start.
For the visualization, I used WikiText-2, which is formed from a collection of Wikipedia articles. To construct the CBOW model, I have an embedding layer followed by one hidden layer and then an output layer. After successfully training the model on the WikiText-2 dataset, we can extract the weights from the embedding layer and project them with t-SNE, a technique for embedding high-dimensional data in a low-dimensional space for visualization. The visualization shows the thousand most frequent words. Panning through it, you may see many clusters that make sense, such as time-related words (“hours”, “minutes”, “century”) sitting close together. There are also large gaps between other similar words, such as the months of the year, which are scattered far and wide. At a high level, words with similar meanings should cluster together; however, when reducing dimensionality it is unclear exactly which factors are chosen to represent the words, so some strange effects will be present.
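As a rough sketch of that pipeline, the model below is written in PyTorch with scikit-learn’s t-SNE; EMBED_DIM, HIDDEN_DIM, and the vocab_size value are placeholder assumptions, not the actual hyperparameters behind this visualization:

```python
import torch
import torch.nn as nn
from sklearn.manifold import TSNE

EMBED_DIM, HIDDEN_DIM = 100, 128  # illustrative sizes, not the trained values

class CBOW(nn.Module):
    """Embedding layer -> one hidden layer -> output layer over the vocab."""
    def __init__(self, vocab_size):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, EMBED_DIM)
        self.hidden = nn.Linear(EMBED_DIM, HIDDEN_DIM)
        self.output = nn.Linear(HIDDEN_DIM, vocab_size)

    def forward(self, context_ids):
        # context_ids: (batch, 2 * window) -> average the context embeddings,
        # then score every vocabulary word as the candidate center word.
        averaged = self.embeddings(context_ids).mean(dim=1)
        return self.output(torch.relu(self.hidden(averaged)))

# After training (cross-entropy loss against the true center word),
# pull out the learned embedding matrix and project it to 2-D.
model = CBOW(vocab_size=1000)
weights = model.embeddings.weight.detach().numpy()    # (vocab_size, EMBED_DIM)
coords = TSNE(n_components=2).fit_transform(weights)  # (vocab_size, 2)
```

Each row of coords becomes one word’s position in the plot; for this visualization, only the rows for the thousand most frequent words are kept for display.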
One important note: t-SNE is non-deterministic, so results may vary slightly on each run, even when generated from the same model weights.
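For what it’s worth, scikit-learn’s TSNE accepts a random_state seed, which pins down a single reproducible run (different seeds can still give qualitatively different layouts):

```python
from sklearn.manifold import TSNE

# Fixing random_state makes one run reproducible; it does not make
# the layout canonical across seeds.
coords = TSNE(n_components=2, random_state=0).fit_transform(weights)
```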
DISCLAIMER
t-SNE does not preserve distances between clusters after dimensionality reduction, and it collapses widely stretched, sparse regions while expanding dense ones. For these reasons, among others, it is not recommended to draw conclusions from distances alone. Clusters could be real or simply artifacts of the Gaussian and t-distributions the algorithm uses. t-SNE CAN produce “fake” patterns, and it is meant for visual exploration, not for classification. After all, squishing hundreds or thousands of dimensions down to two can result in heavy distortions.