How to Create a Simple Search Engine using Math and Coding

Discover the secrets of Google search engine and learn how to create your own search engine using math and coding.

Victor Bona

6 min


## Introduction

Have you ever imagined how the Google search engine works? How can their system rank millions of documents and return accurate results to every user, every day? Did you know that we can create a simple text search engine using just math and some coding? That’s exactly the topic I’m going to cover in this article, and by the end of the post, you will be able to implement your own version of a search engine and understand the concepts behind it.

## Vector space model

Imagine that you are searching for information about airplanes and you want the results ordered by relevance. All the information you have about airplanes is stored as raw text inside thousands of documents in a database. These documents are completely unstructured; all you have is the text itself, so you can’t search for the term “airplane” in a category column of your tables. How, then, can the computer find which documents are relevant to you and order them? This can be done using the vector space algebraic model.

The algebraic vector space model can represent objects (texts, in our case) as vectors of identifiers, such as index terms. This model forms an n-dimensional space, where each dimension corresponds to a unique term from our index. So, every time a text introduces a new unique term, our index (and our space) gains a new dimension.

Now, let’s take a very simple scenario for the explanation: a one-word query and two documents with no new words (new words would update our index and our plane, adding new dimensions to it).


Here the magic starts to appear: now that we have both a query and documents to search through, we need a way to find a relation between them. Remember the plane above? It will be the material for understanding how this works, but first, let’s change our scenario into a more usable form.
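Since the original illustration is not available, here is a hypothetical sketch of that conversion in Python. The texts, the index terms, and the `to_vector` helper are all invented for illustration:

```python
def to_vector(text, index):
    """Count how many times each index term appears in the text."""
    words = text.lower().split()
    return [words.count(term) for term in index]

# Hypothetical one-word query and two small documents over a shared index.
index = ["airplane", "fly"]
query = to_vector("airplane", index)               # [1, 0]
doc1 = to_vector("the airplane can fly", index)    # [1, 1]
doc2 = to_vector("birds fly", index)               # [0, 1]
```

Each string becomes a list of term counts: one number per dimension of our index.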

If this image reminds you of something, congrats, you have a good memory: it’s the familiar representation of vectors. Our strings were converted into groups of numbers that can represent coordinates on a plane, and, luckily for us, we have a plane. Let’s draw the resulting vectors:

We have all our documents and our query plotted on the plane, using the index terms as the axes. Notice that the closer a document vector points to the query vector, the more related the two texts are; this closeness is measured by the cosine of the angle between them, known as the cosine similarity.

## Let’s break down the formula:
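The formula in question is the standard cosine similarity between a query vector $q$ and a document vector $d$:

```latex
\cos(\theta) = \frac{q \cdot d}{\|q\| \, \|d\|}
             = \frac{\sum_{i=1}^{n} q_i d_i}{\sqrt{\sum_{i=1}^{n} q_i^2}\,\sqrt{\sum_{i=1}^{n} d_i^2}}
```

The numerator is the dot product and the denominator is the product of the magnitudes; the next two sections cover each part.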

## Dot Product

The image above represents the Euclidean dot product formula, which can be defined as the sum of the products of the components of each vector:

Basically, it represents the sum of the products of all the vector components of our query and the document we are comparing. Applied to the query and doc1, it gives us the numerator of our similarity score.
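As a sketch, the dot product can be written in Python like this (the example vectors are hypothetical, since the article’s original numbers are not shown):

```python
def dot_product(a, b):
    """Sum of the products of corresponding vector components."""
    return sum(x * y for x, y in zip(a, b))

dot_product([1, 0], [1, 1])  # -> 1: only the shared first dimension contributes
```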

## Magnitude

On the other hand, the image above represents the magnitude formula, which can be defined as a property that determines the size/length of a mathematical object, in our case, vectors. It can be represented as:

Basically, it’s used to describe the “size” of our document or query vector. In the denominator of the formula, the magnitudes of the query and document vectors are multiplied together, which normalizes the score by the vectors’ lengths.
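A minimal sketch of the magnitude calculation, using Python’s standard library:

```python
import math

def magnitude(v):
    """Euclidean length of a vector: square root of the sum of squared components."""
    return math.sqrt(sum(x * x for x in v))

magnitude([3, 4])  # -> 5.0
```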

## Final result

When we combine both parts covered above, dividing the dot product by the product of the magnitudes, we get the similarity score between the query and doc1.

This result represents the cosine similarity between doc1 and the query; put into percentage terms, it tells us how similar the two texts are.
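Putting the two pieces together gives a complete sketch of the similarity calculation; the example vectors are again hypothetical:

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    mags = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / mags if mags else 0.0

cosine_similarity([1, 0], [1, 1])  # ~0.707, i.e. about 70.7% similar
```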

Remember that using documents or queries with more unique words and longer content doesn’t change anything: it just adds new dimensions and components for the computer to calculate!

## Coding it

Now that we have covered the mathematical part, we just need to turn it into code. The first part I’m going to show is how to create the index terms and term frequencies from each text we process. To do this, I first remove the noise from the text (numbers, non-ASCII characters, punctuation, etc.) and then create a dictionary object with the index terms as keys and the frequencies as values.
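A minimal sketch of that preprocessing step; the regex-based cleaning and the `term_frequencies` name are my assumptions, not the article’s original code:

```python
import re
from collections import Counter

def term_frequencies(text):
    """Strip noise (numbers, punctuation, non-ASCII), then count each remaining term."""
    ascii_text = text.encode("ascii", "ignore").decode()   # drop non-ASCII characters
    cleaned = re.sub(r"[^a-z\s]", " ", ascii_text.lower()) # keep only letters and spaces
    return dict(Counter(cleaned.split()))

term_frequencies("The airplane can fly, fly, fly!")
# -> {'the': 1, 'airplane': 1, 'can': 1, 'fly': 3}
```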


If we execute this method on the texts we saw before, the result is a term-frequency dictionary for each one.


The next step is to convert the formula into code. As we know, we need two main calculations to find our result: the dot product and the magnitude.
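Both calculations can be translated to work directly on term-frequency dictionaries; this is one possible sketch (the function names are my own):

```python
import math

def dot_product(tf_a, tf_b):
    """Sum the products of frequencies for terms that appear in both texts."""
    return sum(freq * tf_b[term] for term, freq in tf_a.items() if term in tf_b)

def magnitude(tf):
    """Euclidean length of the term-frequency vector."""
    return math.sqrt(sum(freq * freq for freq in tf.values()))
```

Terms missing from one of the dictionaries contribute zero to the dot product, so we only need to iterate over the shared keys.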


Now, we just need to wrap it all together into a function that returns the cosine similarity between two text objects. Below you can check the final function for calculating the cosine similarity; it expects its inputs as term-frequency dictionaries:
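A self-contained sketch of that final function (the original code is not shown, so the details here are my reconstruction):

```python
import math

def cosine_similarity(tf_a, tf_b):
    """Cosine similarity between two texts given as term-frequency dictionaries."""
    dot = sum(freq * tf_b[term] for term, freq in tf_a.items() if term in tf_b)
    mags = (math.sqrt(sum(f * f for f in tf_a.values()))
            * math.sqrt(sum(f * f for f in tf_b.values())))
    return dot / mags if mags else 0.0

cosine_similarity({"airplane": 1}, {"the": 1, "airplane": 1, "can": 1, "fly": 1})  # -> 0.5
```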


If we execute it with the same input we used when working through the formulas above, the result is exactly the value we calculated by hand.


To finish, we can wrap everything into a search method that processes multiple documents and returns the top ten results, ordered by relevance.
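Here is one way such a search method could look; the helper functions, names, and simple cleaning are assumptions for illustration, not the article’s original code:

```python
import math
import re
from collections import Counter

def term_frequencies(text):
    """Keep only letters and spaces, then count each term."""
    cleaned = re.sub(r"[^a-z\s]", " ", text.lower())
    return Counter(cleaned.split())

def cosine_similarity(tf_a, tf_b):
    """Cosine similarity between two term-frequency dictionaries."""
    dot = sum(freq * tf_b[term] for term, freq in tf_a.items() if term in tf_b)
    mags = (math.sqrt(sum(f * f for f in tf_a.values()))
            * math.sqrt(sum(f * f for f in tf_b.values())))
    return dot / mags if mags else 0.0

def search(query, documents, top_n=10):
    """Score every document against the query and return the best matches first."""
    query_tf = term_frequencies(query)
    scored = [(cosine_similarity(query_tf, term_frequencies(doc)), doc)
              for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]
```

For example, `search("airplane", corpus)` would return up to ten `(score, document)` pairs, with the most similar document first.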


In a world where we always need better ways to discover and find relevant data, search engines are a really important topic to master and study. With this simple solution, you can already process thousands of documents and find the most relevant results! For real-world implementations, I really recommend adding performance-related improvements, such as filters and edge-case handling!

If you want a ready-made search engine with a more robust implementation, or just want to see how one looks, I really recommend checking out pylexitext, a Python library I created and have been working on. There you can find a collection of text-related engines and NLP tools, built to be usable by anyone. Feel free to contribute if you want; it would be a pleasure to work with you!

## References & auxiliary material | ||||||||||||||

- https://boyter.org/2010/08/build-vector-space-search-engine-python/
- https://www.youtube.com/watch?v=ainS3BBn7rs
- https://en.wikipedia.org/wiki/Vector_space_model
- https://en.wikipedia.org/wiki/Cosine_similarity
- https://en.wikipedia.org/wiki/Magnitude_(mathematics)#Euclidean_vector_space
- https://en.wikipedia.org/wiki/Dot_product
- https://numpy.org/doc/stable/reference/generated/numpy.inner.html
- https://numpy.org/doc/stable/reference/generated/numpy.linalg.norm.html
- https://github.com/vicotrbb/Pylexitext