Malhar: An experiment to improve text search on many fronts!
I have been working on a (fuzzy) text-search engine on and off for about a year now. Earlier I tested a proof of concept with some real-life queries on about a million Hacker News stories, and found it a promising avenue to pursue. The current version improves upon almost every front to enable a very fast and robust fuzzy (+keyword) search on real-life datasets/domains! This post is intended to describe the current state of the project, with some of the hows and whys of how it came to be, rather than a deep technical dive.
Why:
Why even try to make a new text search engine? There are lots of excellent text-search engines/libraries like Lucene, Tantivy, and Typesense, on top of which huge ecosystems like ElasticSearch, Solr, and Quickwit have been developed. So where does this engine fit in? There are a few scenarios where I think many of these engines fall short, which I discuss below. The initial motivation was to improve performance for at least some of those scenarios. I know performance is actually a subjective term, particularly so with search; only the end user can be the true judge of that!
- Fuzzy Search: Some search engines do provide some sort of fuzzy-search capability, generally using a combination of a trie data-structure and Levenshtein distance. But for almost all of them this is an afterthought, something added only after the initial keyword-search capabilities. I will discuss how Malhar tries to handle this by keeping fuzzy-search as a fundamental feature, with keyword-search as just a special case of it.
- Much better suggestions: Not being able to provide suggestions (even bad ones) is a no-go for me. Humans cannot keep all the information in their heads, and computing/search systems are supposed to help with that. Giving a bad suggestion may jog my memory, and maybe I can work towards the correct spelling. Give me some parameter to make an engine spit out more suggestions. This is really useful when searching for a non-trivial word, where you remember a jumble of characters rather than the exact word. Technically this would increase the percentage of resolved queries. Some engines do return suggestions based on substring matching, but it feels very limited and is rarely integrated with fuzzy-search. This becomes really useful when working with real-life digitized/OCR-ed documents, where the underlying system may interpret a character incorrectly or spelling mistakes are present in the documents! Most engines don't return more possible guesses, particularly when the query exactly matches some text in the dataset.
- Duplication: Since most such systems provide full-text search as a feature/service on already indexed data, you may have to migrate a subset of your data to that new engine/server to get better search. This migration may not be straightforward, and data marshalling may be required. I think in terms of borrowing the original content for some time to do an isolated indexing pass that enables fuzzy search. The actual data could reside anywhere, from a database to a CSV file or S3 storage, as long as the user can retrieve it using some unique key! The new search index then remains fully unaware of the semantics of the stored original data, and only minimal duplication is performed in the form of extracted tokens.
- Minimal: Minimal in terms of RAM and minimal in terms of dependencies. Not encumbered by licenses and dependencies; run it on any chipset or OS as long as there is a C compiler! Malhar is written from the ground up in Nim and Python using standard libraries only, hence allowing possible ports to other languages with no dead weight. I have come to appreciate self-contained code, as it invites more users to hack on and improve the fundamentals more quickly. I don't think stuffing a JVM into PyLucene to make it work with Python should be the only way possible! I feel it is really underestimated how much cognitive pressure complex build systems put on someone just trying to add a feature or improve an existing one; those complex build scripts rarely work the first time. Python is mainly used for experimenting with new ideas; if an idea seems solid, it is ported to Nim/C for performance reasons. These iterations take time, as at each stage I have to keep the source code self-contained and performant, but it gets easier as fundamental knowledge of data-structures and their trade-offs improves.
How:
The broad answer is to use a vector for each unique word, and during a query do a nearest-neighbor search to return the top-k best matches. Assuming we can generate such a vector, it would solve most of the issues I discussed above. Keyword search is equivalent to returning the top suggestion (i.e. the one with minimum or zero distance). We would be able to return a possible set of suggestions, and even if an exact keyword is found, we can still return suggestions as possible variations of the query. This makes it immensely useful for a user by providing a glimpse of the stored information akin to the query.
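As a rough illustration of that idea (a toy sketch, not Malhar's actual code; `signature` here is just a stand-in for whatever per-word hash the engine really uses), a fuzzy lookup can be expressed as a brute-force top-k nearest-neighbor scan over fixed-size per-word signatures:

```python
# Toy sketch: fuzzy lookup as a top-k nearest-neighbor search over
# fixed-size per-word signatures. Keyword search is the special case
# where the best match has distance zero.
import heapq

def signature(word: str, size: int = 16) -> bytes:
    # Stand-in for the real per-word vector: bucket character bigrams
    # into a fixed-size byte vector.
    vec = bytearray(size)
    for a, b in zip(word, word[1:]):
        h = (ord(a) * 31 + ord(b)) % size
        vec[h] = min(vec[h] + 1, 255)
    return bytes(vec)

def distance(a: bytes, b: bytes) -> int:
    # L1 distance between two fixed-size signatures.
    return sum(abs(x - y) for x, y in zip(a, b))

def top_k(query: str, vocab: dict[str, bytes], k: int = 5) -> list[tuple[int, str]]:
    # Brute-force scan over all unique words.
    q = signature(query)
    return heapq.nsmallest(k, ((distance(q, v), w) for w, v in vocab.items()))

vocab = {w: signature(w) for w in ["heat", "heated", "head", "wheat", "tomato"]}
print(top_k("heta", vocab))  # a misspelling still surfaces "heat" near the top
```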
It works, and I have written about it in earlier posts. Basically it re-purposes an LSH hash originally developed to identify patterns in binaries for malware detection. Lots of small improvements have gradually been added to use it as an out-of-the-box robust spelling checker! I think this hash could be improved further by adapting it for pure text-search, away from its initial purpose, but that would take some experiments and time.
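The hash from those earlier posts isn't reproduced here, but as a hedged illustration of the general LSH idea, a SimHash-style hash over character trigrams gives nearby spellings signatures that differ in only a few bits:

```python
# Illustration only (not the hash used by Malhar): a SimHash-style
# locality-sensitive hash over character trigrams, compared with
# Hamming distance. Similar spellings agree on most bits.
import hashlib

def lsh64(word: str) -> int:
    counts = [0] * 64
    grams = [word[i:i + 3] for i in range(max(1, len(word) - 2))]
    for g in grams:
        h = int.from_bytes(hashlib.blake2b(g.encode(), digest_size=8).digest(), "big")
        for bit in range(64):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if counts[bit] > 0)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

print(hamming(lsh64("tomato"), lsh64("tomatoe")))   # small distance
print(hamming(lsh64("tomato"), lsh64("zucchini")))  # larger distance
```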
It could also help in mitigating the problem of hard-coded stemmers and lemmatisation code, which feels like a catch-up game, as there is no pattern to it and it varies vividly on a language-by-language basis. By treating a stemmed/lemmatised variation of a word as a mis-spelled one, we could use the same fuzzy-search framework to match such words to their root. This approach seems more generic, as it does away with any language-specific knowledge!
It may not be so straightforward though: for a query like heat, a user may expect records containing heated as the desired keyword, but head may be a better (and perfectly valid) fuzzy suggestion according to the engine! Setting a high recall leads to a much higher probability of including heated as a suggestion, though all the returned suggestions are good ones according to the underlying engine. We can also let the user provide heat as a prefix that must be included in suggestions, as sketched below. I feel it is all about providing tools to refine a search with a high degree of flexibility.
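A minimal sketch of that prefix constraint (using Python's difflib as a stand-in for the engine's own fuzzy matcher; the `suggest` function and its parameters are hypothetical, not Malhar's API):

```python
# Fuzzy suggestions with an optional user-supplied prefix that must be
# present, so "head" is dropped while "heated" survives for query "heat".
import difflib

def suggest(query: str, vocab: list[str], k: int = 10, required_prefix: str = "") -> list[str]:
    # difflib stands in for the engine's fuzzy matcher.
    candidates = difflib.get_close_matches(query, vocab, n=k, cutoff=0.4)
    if required_prefix:
        candidates = [w for w in candidates if w.startswith(required_prefix)]
    return candidates

vocab = ["heat", "heated", "heater", "head", "bread", "wheat"]
print(suggest("heat", vocab))                          # "head" may rank highly too
print(suggest("heat", vocab, required_prefix="heat"))  # "head" filtered out
```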
Being minimalistic still required a lot of effort while remaining fast enough to return (fuzzy) results in milliseconds for millions of documents!
Since it involves fixed-size hashes, and a loop over all those hashes for each query without any heuristics or alternate paths depending upon the query, memory and latency requirements are much easier to predict. Also, since the number of threads is provided as a part of the API, and not as some environment variable, it should be easier to scale and visualize the effect of the number of cores on any system. But the actual benefits of parallel processing may be felt only after some work threshold! POSIX threads are used for threading, so any system supporting that standard should be able to leverage multi-threading support during a query!
A lot of time has also been spent on a better tokenization process: the better (more relevant) the tokens I can extract from the underlying text, the higher the probability of matching a user query. This also leads to much better non-prefix matching performance, unlike almost all of the engines I have encountered. It is a very difficult problem to correctly segment a longer word without blowing up indexing times. Most engines would fail while searching for a non-prefix substring like 12p3 in a word like lambda12p3131. Ours could also fail, but with a much lower probability, and it further provides an escape hatch which could eventually resolve the query.
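To make the idea concrete (a hedged sketch, not Malhar's actual segmenter), one cheap approach is to index a long token both as-is and as sub-tokens split on letter/digit boundaries, plus a few sliding character n-grams, so a non-prefix query like 12p3 still has something to land on:

```python
# Toy segmenter: emit the token itself, its letter/digit runs, and
# short sliding character n-grams as extra indexable sub-tokens.
import re

def segment(token: str, min_gram: int = 4) -> set[str]:
    parts = {token.lower()}
    parts.update(re.findall(r"[a-z]+|[0-9]+", token.lower()))   # class boundaries
    for i in range(len(token) - min_gram + 1):                   # sliding grams
        parts.add(token[i:i + min_gram].lower())
    return parts

print(segment("lambda12p3131"))
# "12p3" now appears as an indexed sub-token, so a lookup can hit it.
```

The obvious trade-off is that every extra sub-token inflates the index and the indexing time, which is exactly the balance described above.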
Technically it follows the same flow as most engines: we get some text, the text is tokenized into tokens, and the tokens are put into an inverted-index-like data-structure during the indexing stage. During a query, we collect the best matching tokens for each user keyword and the corresponding document-ids. In the end a bucket-sort can be done to boost the common documents. The algorithms behind search-engines are unexpectedly simple; it is the data-distribution (in amount and variation) that makes or breaks a search engine!
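In toy Python (the real engine is in Nim, and the fuzzy token match is reduced to a plain prefix check here), that flow looks roughly like this:

```python
# tokenize -> inverted index -> per-keyword token match -> count hits per
# document; a bucket-sort by hit count boosts documents common to all keywords.
from collections import defaultdict

def tokenize(text: str) -> list[str]:
    return text.lower().split()

def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for tok in tokenize(text):
            index[tok].add(doc_id)
    return index

def query(index: dict[str, set[int]], keywords: list[str]) -> list[tuple[int, int]]:
    hits: dict[int, int] = defaultdict(int)
    for kw in keywords:
        # Stand-in for the fuzzy token match: a plain prefix check.
        for tok, doc_ids in index.items():
            if tok.startswith(kw):
                for doc_id in doc_ids:
                    hits[doc_id] += 1
    buckets: dict[int, list[int]] = defaultdict(list)   # bucket-sort by hit count
    for doc_id, count in hits.items():
        buckets[count].append(doc_id)
    ranked: list[tuple[int, int]] = []
    for count in sorted(buckets, reverse=True):
        ranked.extend((doc_id, count) for doc_id in buckets[count])
    return ranked

docs = {1: "spicy tomato soup", 2: "tomato and basil salad", 3: "chocolate cake"}
print(query(build_index(docs), ["tomato", "soup"]))  # doc 1 matches both, ranked first
```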
Show me the stats!
I tested on a ~2.2 million recipes dataset from https://github.com/Glorf/recipenlg. It is an interesting dataset, as each word is generated character by character, leading to minor variations of the original/correct spelling. Just matching a query against the exact keyword may miss the actual desired record, and this is where the capability of returning more possible matches really shines! Most engines treat exact matching as the only correct solution.
The engine extracts about ~180K unique tokens from these documents. This count is not that large given the number of documents, as the titles are combinations of a limited number of NER tokens and are generated by an NLP model! The ratio between the number of unique tokens and the number of documents (here roughly 180K / 2.2M ≈ 0.08) can also provide a rough idea of the compressibility of the dataset.
Some stats (on my Intel i5 8th-generation CPU using a single core):
* Number of documents: ~2.2 million
* Capacity (engine): 200k
* Unique tokens: ~180k
* Memory occupied: ~108 MB
* Mean latency: <10 ms
The occupied RAM is for the index alone; the original dataset could be stored anywhere, as long as the user can get the corresponding record given a unique key. The matching unique keys/document-ids, along with the matched tokens for each key/document, are returned by the engine; the user can then collect the corresponding original records given those keys. This way the engine remains fully unaware of the original-data storage semantics. The engine, as a library, could be embedded inside any application that implements any sort of search interface. I only started working on it when I needed a robust (minimal) search interface for my multi-modal search-engine.
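A hedged usage sketch of that separation (the `engine_query` function and its return shape are hypothetical, not Malhar's actual API): the engine hands back only matching keys and matched tokens, and the application fetches the original records from wherever it keeps them.

```python
# The "original store" could equally be a database, CSV file, or S3 bucket;
# the engine never sees it.
original_store = {
    "r1": {"title": "spicy tomato soup", "url": "https://example.org/r1"},
    "r2": {"title": "tomato basil salad", "url": "https://example.org/r2"},
}

def engine_query(q: str) -> list[tuple[str, list[str]]]:
    # Placeholder for the real engine call; returns (key, matched_tokens) pairs.
    return [("r1", ["tomato", "soup"]), ("r2", ["tomato"])]

for key, matched in engine_query("tomato soup"):
    record = original_store[key]          # fetched by the application, not the engine
    print(record["title"], "matched on", matched)
```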
Comparison:
I compare it with the Typesense engine, as it has one of the best fuzzy-search capabilities amongst all the mature search engines. They also host a bunch of demos for various datasets, recipe search being one of them, which makes it easier to compare oranges with oranges. Search is a very subjective concept, and I think it is very difficult to quantify it with a not-so-useful benchmark or some score.
For the record, I think Typesense is a great (fuzzy) search engine, with fully open-source code, and I'm glad they are making money with their cloud offering. It is very competitive with Algolia in performance while being transparent and fast even with millions of documents! I thought a direct comparison of search speed and performance on the exact same dataset would be a nice litmus test!
Our engine demo is hosted on 2 shared virtual CPUs (i.e. a single core). It runs behind a Caddy server as a single Flask application. This is the cheapest configuration I could find, as I don't have much ~money~ resources to host it, but it is actually faster than my CPU while handling 20-30 queries per second!
Malhar Demo: https://recipe-demo.eagledot.xyz
Type-sense Demo: https://recipe-search.typesense.org
Current Status:
Most of the indexing and querying code works; indexing for now is single-threaded. Some obvious optimizations are still to be implemented, along with saving/loading of the index.
More debug code is needed to better handle possible edge cases and to return stats. For now I develop on a Windows machine (which is not ideal) and cross-compile to Linux. The code should work on macOS out of the box, but I don't have a machine to test it on.
Final words:
I think progress can be made on every code base, as new platforms/systems/hardware and ideas keep being produced and there exists a huge action space to try them out. But various factors must be in alignment to actually traverse that space and come out with a significant improvement on the other side. Even armed with a good theory, it can be really difficult to actually produce working, fast-enough code to complement it. I think these trade-offs are what make computer science such an interesting field, as we can optimize for the domain/problem at hand. Sometimes we get lucky and have a Pareto-improvement moment, and that is simply La Vida.