Malhar: An experiment to improve text search on many fronts!
I have been working on a (fuzzy) text-search engine on and off for about a year now. Earlier I tested a proof of concept with some real-life queries on about a million Hacker News stories, and found it a promising avenue to pursue. The current version improves upon almost every front to enable a very fast and robust fuzzy (+keyword) search on real-life datasets/domains! This post is intended to describe the current state of the project, with some of the hows and whys of how it came to be, rather than a deep technical dive.
Why:
Why even try to make a new text search engine? There are lots of excellent text-search engines/libraries like Lucene, Tantivy, and Typesense, on top of which huge ecosystems like ElasticSearch, Solr, and Quickwit have been developed. So where does this engine fit in? There are a few scenarios where I think many of these engines fall short, which I discuss below. The initial motivation was to improve performance for at least some of those scenarios. I know performance is actually a subjective term, particularly so with search; only the end user can be the true judge of that!
- Fuzzy Search: Some search engines do provide some sort of fuzzy-search capability, generally using a combination of a trie data-structure and Levenshtein distance. But for almost all of them this is an afterthought, something added only after the initial keyword-search capabilities. I will discuss how Malhar tries to handle this by keeping fuzzy-search as a fundamental feature, with keyword-search as just a special case of it.
- Much better suggestions: Not being able to provide suggestions (even bad ones) is a no-go for me. Humans cannot keep all the information in their heads, and computing/search systems are supposed to help with that. Giving a bad suggestion may jog my memory, and maybe I can work towards the correct spelling. Give me some parameter to make an engine spit out more suggestions. This is really useful when searching for a non-trivial word, where you remember a jumble of characters rather than the exact word. Technically this would increase the percentage of resolved queries. Some engines do return suggestions based on substring matching, but it feels very limited and is rarely integrated with fuzzy-search. This becomes really useful when working with real-life digitized/OCR-ed documents, where the underlying system may interpret a character incorrectly or spelling mistakes are present in the documents! Most engines don't return more possible guesses, particularly when the query exactly matches some text in the dataset.
- Duplication: Since most such systems provide full-text search as a feature/service on already indexed data, you may have to migrate a subset of your data to that new engine/server to get better search. This migration may not be straightforward, and data marshalling may be required. I think in terms of borrowing the original content for some time to do an isolated indexing pass that enables fuzzy search. The actual data could reside anywhere, from a database to a CSV file or S3 storage, as long as the user can retrieve it using some unique key! The new search index then remains fully unaware of the semantics of the stored original data, and only minimal duplication is performed in the form of extracted tokens.
- Minimal: Minimal in terms of RAM and minimal in terms of dependencies. Not encumbered by licenses and dependencies; run it on any chipset or OS as long as there is a C compiler! Malhar is written from the ground up in Nim and Python using standard libraries only, hence allowing possible ports to other languages with no dead weight. I have come to appreciate self-contained code, as it invites more users to hack on and improve the fundamentals more quickly. I don't think stuffing a JVM into PyLucene to make it work with Python should be the only way possible! I feel it is really underestimated how much cognitive pressure complex build systems put on someone just trying to add a feature or improve an existing one; those complex build scripts rarely work the first time. Python is mainly used for experimenting with new ideas; if an idea seems solid, it is ported to Nim/C for performance reasons. These iterations take time, as at each stage I have to keep the source code self-contained and performant, but it gets easier as fundamental knowledge of data-structures and their trade-offs improves.
How:
The broad answer is to use a vector for each unique word, and during a query do a nearest-neighbor search to return the top-k best matches. Assuming we can generate such a vector, it would solve most of the issues I discussed above. Keyword search is equivalent to returning the top suggestion (i.e. the one with minimum or zero distance). We would be able to return a possible set of suggestions, and even if an exact keyword is found, we can still return suggestions as possible variations of the query. This makes it immensely useful for a user by providing a glimpse of the stored information akin to the query.
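As a rough illustration of that idea (a toy sketch, not Malhar's actual code; `signature` here is just a stand-in for whatever per-word hash the engine really uses), a fuzzy lookup can be expressed as a brute-force top-k nearest-neighbor scan over fixed-size per-word signatures:

```python
# Toy sketch: fuzzy lookup as a top-k nearest-neighbor search over
# fixed-size per-word signatures. Keyword search is the special case
# where the best match has distance zero.
import heapq

def signature(word: str, size: int = 16) -> bytes:
    # Stand-in for the real per-word vector: bucket character bigrams
    # into a fixed-size byte vector.
    vec = bytearray(size)
    for a, b in zip(word, word[1:]):
        h = (ord(a) * 31 + ord(b)) % size
        vec[h] = min(vec[h] + 1, 255)
    return bytes(vec)

def distance(a: bytes, b: bytes) -> int:
    # L1 distance between two fixed-size signatures.
    return sum(abs(x - y) for x, y in zip(a, b))

def top_k(query: str, vocab: dict[str, bytes], k: int = 5) -> list[tuple[int, str]]:
    # Brute-force scan over all unique words.
    q = signature(query)
    return heapq.nsmallest(k, ((distance(q, v), w) for w, v in vocab.items()))

vocab = {w: signature(w) for w in ["heat", "heated", "head", "wheat", "tomato"]}
print(top_k("heta", vocab))  # a misspelling still surfaces "heat" near the top
```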
It works, and I have written about it in earlier posts. Basically it re-purposes an LSH hash originally developed to identify patterns in binaries for malware detection. Lots of small improvements have gradually been added to use it as an out-of-the-box robust spelling checker! I think this hash could be improved further by adapting it for pure text-search, away from its initial purpose, but that would take some experiments and time.
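The hash from those earlier posts isn't reproduced here, but as a hedged illustration of the general LSH idea, a SimHash-style hash over character trigrams gives nearby spellings signatures that differ in only a few bits:

```python
# Illustration only (not the hash used by Malhar): a SimHash-style
# locality-sensitive hash over character trigrams, compared with
# Hamming distance. Similar spellings agree on most bits.
import hashlib

def lsh64(word: str) -> int:
    counts = [0] * 64
    grams = [word[i:i + 3] for i in range(max(1, len(word) - 2))]
    for g in grams:
        h = int.from_bytes(hashlib.blake2b(g.encode(), digest_size=8).digest(), "big")
        for bit in range(64):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if counts[bit] > 0)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

print(hamming(lsh64("tomato"), lsh64("tomatoe")))   # small distance
print(hamming(lsh64("tomato"), lsh64("zucchini")))  # larger distance
```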
It could also help in mitigating the problem of hard-coded stemmers and lemmatisation code, which feels like a catch-up game, as there is no pattern to it and it varies vividly on a language-by-language basis. By treating a stemmed/lemmatised variation of a word as a mis-spelled one, we could use the same fuzzy-search framework to match such words to their root. This approach seems more generic, as it does away with any language-specific knowledge!
It may not be so straightforward though: for a query like heat, a user may expect records containing heated as the desired keyword, but head may be a better (and perfectly valid) fuzzy suggestion according to the engine! Setting a high recall leads to a much higher probability of including heated as a suggestion, though all the returned suggestions are good ones according to the underlying engine. We can also let the user provide heat as a prefix that must be included in suggestions, as sketched below. I feel it is all about providing tools to refine a search with a high degree of flexibility.
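A minimal sketch of that prefix constraint (using Python's difflib as a stand-in for the engine's own fuzzy matcher; the `suggest` function and its parameters are hypothetical, not Malhar's API):

```python
# Fuzzy suggestions with an optional user-supplied prefix that must be
# present, so "head" is dropped while "heated" survives for query "heat".
import difflib

def suggest(query: str, vocab: list[str], k: int = 10, required_prefix: str = "") -> list[str]:
    # difflib stands in for the engine's fuzzy matcher.
    candidates = difflib.get_close_matches(query, vocab, n=k, cutoff=0.4)
    if required_prefix:
        candidates = [w for w in candidates if w.startswith(required_prefix)]
    return candidates

vocab = ["heat", "heated", "heater", "head", "bread", "wheat"]
print(suggest("heat", vocab))                          # "head" may rank highly too
print(suggest("heat", vocab, required_prefix="heat"))  # "head" filtered out
```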
Being minimalistic still required a lot of effort while remaining fast enough to return (fuzzy) results in milliseconds for millions of documents!
Since it involves fixed-size hashes, and a loop over all those hashes for each query without any heuristics or alternate paths depending upon the query, memory and latency requirements are much easier to predict. Also, since the number of threads is provided as a part of the API, and not as some environment variable, it should be easier to scale and visualize the effect of the number of cores on any system. But the actual benefits of parallel processing may be felt only after some work threshold! POSIX threads are used for threading, so any system supporting that standard should be able to leverage multi-threading support during a query!
A lot of time has also been spent on a better tokenization process: the better (more relevant) the tokens I can extract from the underlying text, the higher the probability of matching a user query. This also leads to much better non-prefix matching performance, unlike almost all of the engines I have encountered. It is a very difficult problem to correctly segment a longer word without blowing up indexing times. Most engines would fail while searching for a non-prefix substring like 12p3 in a word like lambda12p3131. Ours could also fail, but with a much lower probability, and it further provides an escape hatch which could eventually resolve the query.
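To make the idea concrete (a hedged sketch, not Malhar's actual segmenter), one cheap approach is to index a long token both as-is and as sub-tokens split on letter/digit boundaries, plus a few sliding character n-grams, so a non-prefix query like 12p3 still has something to land on:

```python
# Toy segmenter: emit the token itself, its letter/digit runs, and
# short sliding character n-grams as extra indexable sub-tokens.
import re

def segment(token: str, min_gram: int = 4) -> set[str]:
    parts = {token.lower()}
    parts.update(re.findall(r"[a-z]+|[0-9]+", token.lower()))   # class boundaries
    for i in range(len(token) - min_gram + 1):                   # sliding grams
        parts.add(token[i:i + min_gram].lower())
    return parts

print(segment("lambda12p3131"))
# "12p3" now appears as an indexed sub-token, so a lookup can hit it.
```

The obvious trade-off is that every extra sub-token inflates the index and the indexing time, which is exactly the balance described above.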
Technically it follows the same flow as most engines: we get some text, the text is tokenized into tokens, and the tokens are put into an inverted-index-like data-structure during the indexing stage. During a query, we collect the best matching tokens for each user keyword and the corresponding document-ids. In the end a bucket-sort can be done to boost the common documents. The algorithms behind search-engines are unexpectedly simple; it is the data-distribution (in amount and variation) that makes or breaks a search engine!
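In toy Python (the real engine is in Nim, and the fuzzy token match is reduced to a plain prefix check here), that flow looks roughly like this:

```python
# tokenize -> inverted index -> per-keyword token match -> count hits per
# document; a bucket-sort by hit count boosts documents common to all keywords.
from collections import defaultdict

def tokenize(text: str) -> list[str]:
    return text.lower().split()

def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for tok in tokenize(text):
            index[tok].add(doc_id)
    return index

def query(index: dict[str, set[int]], keywords: list[str]) -> list[tuple[int, int]]:
    hits: dict[int, int] = defaultdict(int)
    for kw in keywords:
        # Stand-in for the fuzzy token match: a plain prefix check.
        for tok, doc_ids in index.items():
            if tok.startswith(kw):
                for doc_id in doc_ids:
                    hits[doc_id] += 1
    buckets: dict[int, list[int]] = defaultdict(list)   # bucket-sort by hit count
    for doc_id, count in hits.items():
        buckets[count].append(doc_id)
    ranked: list[tuple[int, int]] = []
    for count in sorted(buckets, reverse=True):
        ranked.extend((doc_id, count) for doc_id in buckets[count])
    return ranked

docs = {1: "spicy tomato soup", 2: "tomato and basil salad", 3: "chocolate cake"}
print(query(build_index(docs), ["tomato", "soup"]))  # doc 1 matches both, ranked first
```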
Show me the stats!
I tested on a ~2.2 million recipes dataset from https://github.com/Glorf/recipenlg. It is an interesting dataset, as each word is generated character by character, leading to minor variations of the original/correct spelling. Just matching a query against the exact keyword may miss the actual desired record, and this is where the capability of returning more possible matches really shines! Most engines treat exact matching as the only correct solution.
The engine extracts about ~180K unique tokens from these documents. This count is not that large given the number of documents, as the titles are combinations of a limited number of NER tokens and are generated by an NLP model! The ratio between the number of unique tokens and the number of documents (here roughly 180K / 2.2M ≈ 0.08) can also provide a rough idea of the compressibility of the dataset.
Some stats (on my Intel i5 8th-generation CPU using a single core):
* Number of documents: ~2.2 million
* Capacity (engine): 200k
* Unique tokens: ~180k
* Memory occupied: ~108 MB
* Mean latency: <10 ms
The occupied RAM is for the index alone; the original dataset could be stored anywhere, as long as the user can get the corresponding record given a unique key. The matching unique keys/document-ids, along with the matched tokens for each key/document, are returned by the engine; the user can then collect the corresponding original records given those keys. This way the engine remains fully unaware of the original-data storage semantics. The engine, as a library, could be embedded inside any application that implements any sort of search interface. I only started working on it when I needed a robust (minimal) search interface for my multi-modal search-engine.
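A hedged usage sketch of that separation (the `engine_query` function and its return shape are hypothetical, not Malhar's actual API): the engine hands back only matching keys and matched tokens, and the application fetches the original records from wherever it keeps them.

```python
# The "original store" could equally be a database, CSV file, or S3 bucket;
# the engine never sees it.
original_store = {
    "r1": {"title": "spicy tomato soup", "url": "https://example.org/r1"},
    "r2": {"title": "tomato basil salad", "url": "https://example.org/r2"},
}

def engine_query(q: str) -> list[tuple[str, list[str]]]:
    # Placeholder for the real engine call; returns (key, matched_tokens) pairs.
    return [("r1", ["tomato", "soup"]), ("r2", ["tomato"])]

for key, matched in engine_query("tomato soup"):
    record = original_store[key]          # fetched by the application, not the engine
    print(record["title"], "matched on", matched)
```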
Comparison:
I compare it with the Typesense engine, as it has one of the best fuzzy-search capabilities amongst all the mature search engines. They also host a bunch of demos for various datasets, recipe search being one of them, which makes it easier to compare oranges with oranges. Search is a very subjective concept, and I think it is very difficult to quantify it with a not-so-useful benchmark or some score.
For the record, I think Typesense is a great (fuzzy) search engine, with fully open-source code, and I'm glad they are making money with their cloud offering. It is very competitive with Algolia in performance while being transparent and fast even with millions of documents! I thought a direct comparison of search speed and performance on the exact same dataset would be a nice litmus test!
Our engine demo is hosted on 2 shared virtual CPUs (i.e. a single core). It runs behind a Caddy server as a single Flask application. This is the cheapest configuration I could find, as I don't have much ~money~ resources to host it, but it is actually faster than my CPU while handling 20-30 queries per second!
Malhar Demo: https://recipe-demo.eagledot.xyz
Type-sense Demo: https://recipe-search.typesense.org
Current Status:
Most of the indexing and querying code works; indexing for now is single-threaded. Some obvious optimizations are still to be implemented, along with saving/loading of the index.
More debug code is needed to better handle possible edge cases and to return stats. For now I develop on a Windows machine (which is not ideal) and cross-compile to Linux. The code should work on macOS out of the box, but I don't have a machine to test it on.
Final words:
I think progress can be made on every code base, as new platforms/systems/hardware and ideas keep being produced and there exists a huge action space to try them out. But various factors must be in alignment to actually traverse that space and come out with a significant improvement on the other side. Even armed with a good theory, it can be really difficult to actually produce working, fast-enough code to complement it. I think these trade-offs are what make computer science such an interesting field, as we can optimize for the domain/problem at hand. Sometimes we get lucky and have a Pareto-improvement moment, and that is simply La Vida.