ELMo can do better Information Retrieval than traditional static word embeddings
Table of contents:
- Introduction of Information Retrieval
- Traditional embedding techniques
- Why ELMo
- Code along
Over the last decade, people have come to depend on Google to search for anything they need, and most of the time Google returns relevant results. As of 2021, Google is the best search engine. Its success comes from its desire and ability to provide higher-quality results for each user. Understanding search intent and finding the most accurate and relevant websites for each query have allowed Google to stand out from the competition.
Underneath, this is the ability to retrieve the information matching a user’s query from the billions and billions of documents available on the internet. This process is called information retrieval.
It essentially checks the similarity between the query and the documents on the internet. Then, based on the similarity scores, Google presents the documents in descending order.
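The similarity-then-sort idea can be sketched in a few lines. This is a minimal illustration with made-up query and document vectors, using cosine similarity as the scoring function:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy query/document vectors (values are illustrative only).
query = np.array([1.0, 0.0, 1.0])
docs = {
    "doc_a": np.array([1.0, 0.1, 0.9]),
    "doc_b": np.array([0.0, 1.0, 0.2]),
}

# Rank documents by similarity to the query, in descending order.
ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)
print(ranked)  # ['doc_a', 'doc_b']
```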
Traditional embedding techniques
Information Retrieval can be considered an NLP task: we need to convert the text into numeric form (more precisely, into vectors) and then compare documents for similarity. In the world of NLP, there are multiple ways to do this conversion: BOW, TF-IDF, the hashing trick, and word embeddings such as word2vec and GloVe. Some of them can capture semantic similarity and some cannot. Semantics-aware techniques like word2vec and GloVe give a better text transformation, as they are able to understand that similar words have similar meanings.
But there is also a problem: all the traditional embedding techniques are static. In real life, the same word can carry different meanings in different sentences, depending on its part of speech and its position. A static embedding, however, returns the same vector for that word in every sentence. For example: 1. I will visit the river bank in the evening. 2. I went to the bank to withdraw some money. Both sentences contain the word ‘bank’, and the word has completely different meanings in each. A traditional embedding method will still return a single vector for ‘bank’ in both sentences, when it should really be two different vectors, because the meanings differ.
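The problem is easy to see in code. The toy lookup table below (with made-up vector values) stands in for any static embedding: the lookup never consults the sentence around the word, so ‘bank’ gets the same vector in both contexts:

```python
import numpy as np

# Toy static embedding table; the values are invented for illustration.
static_embeddings = {
    "bank": np.array([0.2, -0.5, 0.7]),
    "river": np.array([0.9, 0.1, -0.3]),
    "money": np.array([-0.4, 0.8, 0.2]),
}

def embed(word):
    """A static lookup: the surrounding sentence is never consulted."""
    return static_embeddings[word]

# 'bank' in two very different contexts...
v1 = embed("bank")  # "I will visit the river bank in the evening"
v2 = embed("bank")  # "I went to the bank to withdraw some money"

# ...yet the vectors are identical.
print(np.array_equal(v1, v2))  # True
```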
This is where dynamic (contextual) word embeddings come into the picture.
ELMo is a contextual word embedding technique built by AllenAI in 2018, shortly after the Transformer was introduced by Google Brain.
These word embeddings help achieve state-of-the-art (SOTA) results on several NLP tasks.
What makes ELMo’s performance better?
ELMo is built on top of a two-layer bidirectional language model (biLM). The biLM has two layers stacked together, and each layer makes two passes: a forward pass and a backward pass. “ELMo learned its language understanding from being trained to predict the next word in a sequence of words, a task called Language Modeling.” (Analytics Vidhya)
Because ELMo uses a bidirectional model for learning, whenever you pass it text it can understand the meaning and generate a dynamic word embedding that depends on the word’s POS and position. That is why ELMo is a better choice for converting text into vectors. BERT word embeddings play a similar role.
I found a wonderful blog where document ranking with weighted word2vec (a static embedding) is nicely explained; I have borrowed a few concepts from its author, with thanks. Here I used a small dataset for the experiment. First I used word2vec embeddings, then ELMo embeddings, and compared the two using mAP. There is no single standard accuracy metric for information retrieval; I used mAP just to compare the models’ performance.
Now what is mAP?
Suppose we have a model that labels some samples as positive and some as negative. Precision tells us how effective our model is at labeling samples positive. Mathematically, it is given by:

Precision = True Positives / (True Positives + False Positives)
Here, true positives are the positive samples that the model labeled as positive, and false positives are the actually negative samples that the model labeled as positive. So, false positives are the samples the model mislabeled. Precision can therefore also be seen as the percentage of true positives among all the samples labeled positive.
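As a quick sanity check, the definition above is a one-liner:

```python
def precision(true_positives, false_positives):
    """Precision = TP / (TP + FP): the share of predicted positives
    that are actually positive."""
    return true_positives / (true_positives + false_positives)

# 3 correctly labeled positives and 1 negative mislabeled as positive.
print(precision(3, 1))  # 0.75
```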
I am sure that you have read about it before in classification problems. But, there is a twist in the way it is used in information retrieval. Let’s understand it with an example.
Suppose our information retrieval model ranks the documents according to their relevance and returns the top 5 documents. According to our model, all the five documents returned are relevant to the query, but when we checked the ground truth, we found documents at rank 2 and 4 as non-relevant. Now, let’s try to measure this using precision.
In information retrieval, the meaning of precision still remains the same, but the way we draw results from it changes. Here, we calculate precision at a specific rank. This precision is denoted by P@K, where K is the rank at which precision was calculated.
Let’s calculate P@K for the above example. At rank 1, we have a relevant document, so our precision (P@1) is 1, because there are no false positives yet. If we had a non-relevant document at this position, P@1 would have been 0.
Now, let’s move forward and calculate precision at rank 2 (P@2). Here, we consider both documents, at ranks 1 and 2. Since one is relevant and the other is non-relevant, P@2 is 0.5. Looking at rank 3, there are 2 relevant and 1 non-relevant documents up to that rank, i.e., 2 true positives and 1 false positive, so P@3 comes out to 0.67.
Similarly, you can calculate P@4 and P@5, which will be 0.5 and 0.6, respectively. I am sure by now, you have understood the way we calculate P@K. Now, we have covered precision. Let’s now understand what average precision is.
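The whole P@K walkthrough above can be reproduced with a small helper, using a 1/0 relevance list in rank order:

```python
def precision_at_k(relevance, k):
    """P@K: fraction of the top-K returned documents that are relevant.
    `relevance` lists 1 (relevant) / 0 (non-relevant) in rank order."""
    return sum(relevance[:k]) / k

# Example from the text: the documents at ranks 2 and 4 are non-relevant.
relevance = [1, 0, 1, 0, 1]
for k in range(1, 6):
    print(f"P@{k} = {precision_at_k(relevance, k):.2f}")
# P@1 = 1.00, P@2 = 0.50, P@3 = 0.67, P@4 = 0.50, P@5 = 0.60
```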
Average precision is the average of the precision values, but while calculating it we only take the precision at those ranks where there is a relevant document. Let’s understand it by calculating the average precision (AP@K) for the previous example. Here, the documents at ranks 2 and 4 are non-relevant, so we skip their precisions. Our average precision for this query is:

AP = (P@1 + P@3 + P@5) / 3 = (1 + 0.67 + 0.6) / 3 ≈ 0.76
This is how AP is calculated for document ranking/information retrieval. mAP is then simply the mean of the AP values over all queries.
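Putting the pieces together, AP and mAP look like this (the second query in the mAP example is hypothetical, just to show averaging over more than one query):

```python
def precision_at_k(relevance, k):
    """P@K over a 1/0 relevance list in rank order."""
    return sum(relevance[:k]) / k

def average_precision(relevance):
    """Average of P@K over only those ranks K holding a relevant document."""
    hits = [precision_at_k(relevance, k + 1) for k, r in enumerate(relevance) if r]
    return sum(hits) / len(hits) if hits else 0.0

def mean_average_precision(relevance_per_query):
    """mAP: mean of the per-query average precisions."""
    return sum(average_precision(r) for r in relevance_per_query) / len(relevance_per_query)

# Worked example from the text: relevant documents at ranks 1, 3 and 5.
print(round(average_precision([1, 0, 1, 0, 1]), 4))  # 0.7556
# mAP over two queries (the second one is made up):
print(round(mean_average_precision([[1, 0, 1, 0, 1], [1, 1, 0]]), 4))  # 0.8778
```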
Let’s jump into the code. I will not go through all the steps here; if you are interested, you can visit my notebook. You need to do the basic text preprocessing before jumping into building the system.
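The word2vec baseline builds one vector per document as a TF-IDF-weighted average of its word vectors. The sketch below uses a tiny made-up embedding table as a stand-in for a pretrained word2vec model (the real experiment would load one, e.g. gensim KeyedVectors), so only the weighting scheme is shown:

```python
import math
from collections import Counter

import numpy as np

# Stand-in for a pretrained word2vec model; the vectors are invented.
w2v = {
    "river": np.array([0.9, 0.1]),
    "bank":  np.array([0.5, 0.5]),
    "money": np.array([0.1, 0.9]),
    "loan":  np.array([0.2, 0.8]),
}

docs = [["river", "bank"], ["bank", "money", "loan"]]

def idf(term, docs):
    """Inverse document frequency; zero for terms present in every document."""
    n_containing = sum(term in d for d in docs)
    return math.log(len(docs) / n_containing)

def doc_vector(doc, docs):
    """TF-IDF-weighted average of the word vectors in one document."""
    tf = Counter(doc)
    weights = {t: (tf[t] / len(doc)) * idf(t, docs) for t in tf if t in w2v}
    total = sum(weights.values())
    if total == 0:  # e.g. every term appears in every document
        return np.mean([w2v[t] for t in doc if t in w2v], axis=0)
    return sum(weights[t] * w2v[t] for t in weights) / total

doc_vecs = [doc_vector(d, docs) for d in docs]
```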
The output is:
Now, using ELMo embedding:
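A sketch of how the ELMo sentence embeddings can be produced. The TF Hub module URL and the “default” (mean-pooled) signature are assumptions about the published `elmo/3` module; the call needs tensorflow, tensorflow_hub and a one-off model download, so the heavy imports are kept inside the function:

```python
import numpy as np

def embed_with_elmo(sentences):
    """Mean-pooled 1024-d ELMo embedding per sentence (sketch).

    Assumes the TF Hub 'elmo/3' module and its 'default' signature;
    requires tensorflow + tensorflow_hub and a network download.
    """
    import tensorflow as tf       # imported lazily so the helpers
    import tensorflow_hub as hub  # below run without TensorFlow
    elmo = hub.load("https://tfhub.dev/google/elmo/3")
    out = elmo.signatures["default"](tf.constant(sentences))
    return out["default"].numpy()  # shape: (len(sentences), 1024)

def cosine_similarity(a, b):
    """Score a query embedding against a document embedding."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```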
The output is:
So, you can see that with ELMo the mean average precision is higher on the same dataset.
Now it’s time to create a function that takes a user query and retrieves the most similar documents.
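Such a retrieval function can be sketched as below. The toy vectors stand in for the ELMo embeddings of the query and the documents; in the real system they would come from the embedding step above:

```python
import numpy as np

def retrieve(query_vec, doc_vecs, doc_texts, top_k=3):
    """Return the top_k (score, text) pairs ranked by cosine similarity."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    scores = [(cos(query_vec, v), text) for v, text in zip(doc_vecs, doc_texts)]
    scores.sort(key=lambda s: s[0], reverse=True)
    return scores[:top_k]

# Toy stand-ins for embedded documents and an embedded user query.
doc_texts = ["river bank walk", "bank loan offer", "evening stroll"]
doc_vecs = [np.array([1.0, 0.2]), np.array([0.1, 1.0]), np.array([0.9, 0.4])]
query_vec = np.array([1.0, 0.3])

for score, text in retrieve(query_vec, doc_vecs, doc_texts, top_k=2):
    print(f"{score:.3f}  {text}")
```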
The output is the following:
Nice!! Similarly, you can build your own IR system.