Natural Language Processing


By Adrian Zidaritz

Original: 02/01/20
Revised: no

$$Rome + France - Italy = \space ?$$

What do you think the answer should be? And more to the point, what are we doing here!? Adding and subtracting words as if they were numbers!? The fact that this equation makes sense at all is one of the most remarkable successes of neural networks in Natural Language Processing (NLP). Of all the successes of AI, this one has had the most visible impact on our lives.

The majority of AI applications have so far been based on supervised learning, i.e., an AI system is trained on a large set of human-labeled data and learns from that data how to label other data it has not seen before (case in point: we show it 60,000 pictures of handwritten digits and it learns to recognize handwritten digits it has never seen). One of the largest data sets that humanity has at its disposal is in text form. How can this data be seen as labeled? The answer turns out to be of great importance.

The main working assumption in modern NLP is that of distributional semantics, namely that words only have meaning relative to the contexts in which they are used. The popular expression of this way of looking at words is the linguist J.R. Firth's dictum: "You shall know a word by the company it keeps." Writers are fully aware of the fact that the meaning of words is mainly contextual. When we look at textual data, we can treat each word as being labeled by the contexts in which it appears. That is why the learning we do on text can be treated as supervised learning; because the labels come from the text itself rather than from human annotators, it is usually called self-supervised learning.
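
To make this "context as label" idea concrete, below is a minimal sketch; the helper name skipgram_pairs and the window size are illustrative choices, not a standard API. It turns a sentence into (center word, context word) pairs, and pairs of exactly this kind are the self-labeled training data for the skip-gram variant of Word2Vec discussed below.

```python
# A minimal sketch: every word gets "labeled" by its surrounding context.
# These (center, context) pairs are the training data a skip-gram model
# such as Word2Vec learns from.
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs within a fixed window of each word."""
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]

sentence = "you shall know a word by the company it keeps".split()
for center, context in skipgram_pairs(sentence):
    print(center, "->", context)
```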

The modern treatment of this distributional semantics approach is based on machine learning with neural nets. A neural network is used to learn a representation of each word as a vector, called a word embedding. The learning process does this in such a way that two words are semantically close if the vectors representing them are similar (similarity here is a precise linear algebra operation, cosine similarity, but we do not need to worry about that now). Below is a presentation of Word2Vec, the best known such representation algorithm.




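As a complement, here is a minimal training sketch using the open-source gensim library (assuming its 4.x API); the toy corpus is my own and far too small to yield meaningful vectors, so it only illustrates the mechanics.

```python
# A minimal sketch of training Word2Vec with gensim (4.x API assumed).
from gensim.models import Word2Vec

corpus = [
    "paris is the capital of france".split(),
    "rome is the capital of italy".split(),
    "berlin is the capital of germany".split(),
]

model = Word2Vec(
    sentences=corpus,
    vector_size=50,  # dimension of the embedding vectors
    window=2,        # context window on each side of the center word
    min_count=1,     # keep every word, even ones appearing only once
    sg=1,            # 1 = skip-gram, 0 = continuous bag-of-words
)

vector = model.wv["paris"]                   # the learned vector for "paris"
print(model.wv.similarity("paris", "rome"))  # cosine similarity of two words
```
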
So questions about words become linear algebra questions about the vectors that represent them. Remarkably, the equation we showed above can be solved by such a representation: $$Rome + France - Italy = Paris$$
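
With a large pretrained model loaded, the equation becomes a one-line query. The sketch below assumes gensim's downloadable word2vec-google-news-300 vectors (a large download on first use), in which city and country names appear capitalized.

```python
# A sketch of the word-arithmetic query against pretrained vectors,
# assuming gensim's downloader and its word2vec-google-news-300 model.
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")

# Rome + France - Italy = ?
result = wv.most_similar(positive=["Rome", "France"], negative=["Italy"], topn=1)
print(result)  # "Paris" is expected at or near the top of the ranking
```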

The easiest way for AI developers to become familiar with this vector approach to NLP is to download pre-trained models. Moreover, most NLP libraries that developers use in their applications ship with such pre-trained models, not only for English but for other languages as well. Google makes a number of such models available, and depending on the corpus of data that makes the most sense for your application (Wikipedia, Google News, Freebase, etc.), you may find one of them useful.
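
For instance, gensim bundles a small downloader that lists and fetches several of these pre-trained models; a quick way to see what is on offer (assuming the gensim.downloader module) is sketched below.

```python
# List the pre-trained models gensim can download, then load one of them.
import gensim.downloader as api

print(sorted(api.info()["models"]))   # e.g. 'glove-wiki-gigaword-100',
                                      # 'word2vec-google-news-300', ...

wv = api.load("glove-wiki-gigaword-100")  # smaller Wikipedia-based vectors
print(wv.most_similar("france", topn=3))
```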

Let's now build a bridge from textual data to the data which is of most interest to us. It will be helpful for the reader to keep this bridge in mind and refer to this section when necessary. The data at the center of our attention is graph data. We will see in the dedicated article Graphs of Data that graph processing borrows from the embedding technique we discussed above. We established that word embeddings are based on the idea that words live in neighborhoods (their various contexts) and that these neighborhoods determine their meaning. But a node in a graph also has neighborhoods, namely the various subgraphs containing the other nodes that can be reached by traversing the graph starting from the node of interest. So one could look at a graph as labeled data as well, and embed its nodes in a vector space in such a way that similarity between vectors corresponds to how close the nodes are in the graph.
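
As a sketch of how this carries over to graphs (this is essentially the DeepWalk/node2vec idea, with illustrative parameter choices): generate random walks through the graph, treat each walk as a sentence of node names, and hand those sentences to Word2Vec.

```python
# A minimal DeepWalk-style sketch: random walks over a graph become
# "sentences", and Word2Vec then embeds nodes the way it embeds words.
# networkx's built-in karate-club graph serves purely as a stand-in.
import random
import networkx as nx
from gensim.models import Word2Vec

def random_walk(G, start, length=10):
    """One random walk from `start`, returned as a list of node names."""
    walk = [start]
    for _ in range(length - 1):
        neighbors = list(G.neighbors(walk[-1]))
        if not neighbors:
            break
        walk.append(random.choice(neighbors))
    return [str(node) for node in walk]

G = nx.karate_club_graph()
walks = [random_walk(G, node) for node in G.nodes() for _ in range(20)]

# Word2Vec does not care that the "words" are node ids.
model = Word2Vec(sentences=walks, vector_size=32, window=3, min_count=1, sg=1)
print(model.wv.most_similar("0", topn=5))  # nodes embedded nearest to node 0
```

The point of the sketch is that nothing in Word2Vec is specific to natural language: any stream of tokens whose co-occurrences encode a notion of neighborhood can be embedded the same way.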