Discover the secrets of Google search engine and learn how to create your own search engine using math and coding.
You don't have to be a mathematician to have a feel for numbers.” - John Forbes Nash, Jr. | ||||||||||||||
Introduction | ||||||||||||||
Have you ever imagined how Google search engine works? How can their system ranks million of thousands of documents and return very accurate results to every user every day? Do you know that we can create a simple text search engine using just math and some coding? That’s exactly the topic I’m going to cover in this article and at the end of the post, you will be capable to implement your own version of the search engine and understand the concepts behind a Vector Space powered Search Engine. | ||||||||||||||
Vector space model | ||||||||||||||
Imagine that you are searching for information about airplanes and you want the result to be ordered by relevance, all the information that you have about airplanes is raw stored inside thousands of documents in a database, this documents are completely raw and unstructured, all you have are the text itself, so, you can’t search the term “airplane” on a category relation from your tables, then, how can the computer find what documents are relevant to you and order it? This can be done using the vector space algebraic model. | ||||||||||||||
The algebraic vector space model can represent objects(a text in our case) as vectors of identifiers, like index terms. This model will be a n-dimensions plan, where each dimension corresponds to a unique term from our index. So, if we have a text “airplane fly” we would have two indexes: “airplane” and “fly”, this two indexes can be represented in a plan like this: | ||||||||||||||
Now, let’s get this very simple scenario for the explanation, with a simple one word query and two documents with no new words(new words would update our index and our plain, adding new dimensions to it): | ||||||||||||||
| ||||||||||||||
Here the magic starts to appear, now that we have both a query and documents where to research, we need a way to find a relation between them. Remember the plane above? He will be the material to understand how it will work, but first, lets change our scenario to a more usable form: | ||||||||||||||
If this image remember you about something, congrats, you have a good memory and remember about vectors representation. Our strings were converted into groups of numbers, that could represent coordinates on a plan, and, for our lucky, we have a plan. Let’s draw the resultant vectors: | ||||||||||||||
We have all our documents and our query plotted into the plan, using both indexes “airplane” and “fly” as XY axis. For humans, it’s pretty easy to say that the doc1 is the most relevant and should be returned as the result, but for the computer, it needs a logical way to discover that, and now, we created a simple way with algebra to do that. Take a closer look to the angles between the query and both documents 1 and 2(blue and orange curves), the doc1 have a smaller angle than doc2 and it’s closer to the query vector, in summary, it’s way more similar than the doc2. This entire process can be represented and performed by calculating the cosine similarity between the docs and the query using this formula: | ||||||||||||||
Let’s break down the formula: | ||||||||||||||
Dot Product | ||||||||||||||
This image above represents the euclidean dot product formula, that can be defined as the sum of the products of the components of each vector: | ||||||||||||||
Basically, it represents the sum of the product of all the vector components of our query and the document we are comparing. If we applied it to the query and doc1, the result would be 1 as shown below: | ||||||||||||||
Magnitude | ||||||||||||||
In other hand, the image above represents the Magnitude formula, that can be defined as a property which determines the size/length of a mathematical object, in our case, vectors. Can be represented as: | ||||||||||||||
Basically, it’s used to describe the “size” of our document or query vector. At the bottom part of our formula, we calculate the product of both magnitudes. If we applied it as we did earlier, the result would be 1.414213562373095 as shown below: | ||||||||||||||
Final result | ||||||||||||||
When combining both parts that we covered above, the result would be 0.7071067811865475 as shown below: | ||||||||||||||
This result represents the cosine similarity between doc1 and the query, if we put it in percentage, we would have 70.71% similarity between the query and the document, meaning that this is probably the correct result! | ||||||||||||||
Remember that using documents or querys with more unique words and longer contents don’t change anything, it will basically represents new dimensions and components to be calculated by the computer! | ||||||||||||||
Coding it | ||||||||||||||
Now that we have covered all the mathematical part, we just need to transform all of it into code. The first part I’m going to show, is how to create the index terms and terms frequency from each text that we are going to process, to do this, I’m going to first remove the noise from the text(numbers, non-ascii, punctuation etc) and then, create a dictionary object with the keys as the index terms and the value as the frequency. | ||||||||||||||
| ||||||||||||||
If we execute this method on the texts we saw before, the result would be this: | ||||||||||||||
| ||||||||||||||
The next step would be convert the formula. As we know, we need two main calculations to find our result, the dot product and magnitude, both this methods are already implemented on numpy, but, I’m going to implement it anyway, so you can understand it better by seeing real code. Below, you can check both functions and the respective counterparts from numpy: | ||||||||||||||
| ||||||||||||||
| ||||||||||||||
Now, we just need to wrap it all together, to create a function that returns the cosine similarity between two text objects. Below you can check the final function to calculate the cosine similarity, this function already expects inputs as terms frequency dictionary: | ||||||||||||||
| ||||||||||||||
If we executed it with the same input we used to perform the formulas above, the result would be exactly 0.7071067811865475 as expected. Of course this piece of code was made to just compare two objects, a real use case would need to iterate over all the documents. Below you will find an example how it would looks like(with a simple modification to avoid divisions by 0): | ||||||||||||||
| ||||||||||||||
This method would process multiple documents and return the top ten results, ordered by relevance. | ||||||||||||||
Conclusion | ||||||||||||||
In a world where we always need better ways to discover and find relevant data, search engines are a really important topic to master and study, with this simple solution, you can already process thousands of documents and find the top relevant results! I really recommend that real world implementations should be improved to include some performance related changes, like filters and edge cases ! | ||||||||||||||
If you want to use a ready solution for a search engine with a more robust implementation or just want to see how it looks like, I really recommend that you check pylexitext, a python library I have created and have been working on, there you can find a collection of text related engines and NLP stuff, created to be more usable to anyone. Feel free to contribute if you want, it would be a pleasure work with you! | ||||||||||||||
References & auxiliary material | ||||||||||||||
| ||||||||||||||