How to Build a Simple In-Browser Full Text Search Engine

Because sometimes bureaucracy gets in the way.

Jefferey Cave
9 min read · Mar 6, 2022

A couple of years ago, I was asked to create a file repository for an air-gapped conda environment. The idea was that, for security reasons, we would maintain a list of allowed libraries inside the very expensive secure environment, and only those libraries. It was a reasonably simple setup: it required little more than a large amount of storage and a simple static web server to deliver the packages.

The simple filtered mirror allowed us to respond to user needs much more rapidly, and to add or remove libraries from the secure environment very quickly. This meant the security team was much more willing to allow packages into the environment, knowing that a package could be removed in hours rather than months or years.

  • They were so thrilled, we started using patterns to allow packages.
  • We quickly went from dozens of packages to thousands.
  • Users developed a new question to phone in and ask:

which libraries are available?

The list of available libraries was smaller than the official repositories, but still far larger than a human could reasonably be expected to read through, and it was spread across several folders. All in all, it was just difficult to keep track of.

What users required was a search engine, like Anaconda.org, but limited to only our organisationally available packages.

An Aside

It’s at this moment in our story that I find myself away from home on business. Having attended an evening lecture, in a pub in Gatineau, Quebec, on using technical systems to effect social change, and having had a few drinks, I found myself walking back to my hotel, alone, in the dead of a Canadian winter. Trying not to freeze to death, and to figure out where my hotel was, I started thinking about search problems. By the time I got back to my hotel, there was no way I could sleep, so I began implementing the solution.

Ahhh… the life of a programmer.

Unfortunately, bureaucracy got in the way. Getting the static HTTP server in the first place had been a feat of negotiation. In an effort to get the filesystem shared via HTTP, I had explicitly stated that it would only serve static content. This allowed us to skip months of security evaluations: no server-side processing meant no new security risk to the network.

In order to accommodate both the need for a search engine and the lack of server-side processing, I built a simple search engine inside the browser.

Implementation

I have actually done this a couple of times before. There have been a few instances in my career where getting tools and computers approved was a significant barrier to building a simple data processing engine. The key is that you can drop a simple HTML file into your served location, and it can then look up the data stored on the server.

For the reader’s benefit, I have a simple working example that indexes the CIA World Factbook and then offers a search of the content. Search results link to the actual page on the CIA’s website.

Auto-index and Loading

The key is to ensure that some form of auto-indexing is on. The server must advertise the available datasets so that the client can discover what needs processing. There are a couple of ways to achieve this; the simplest is to enable directory listings on the server.
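For local experimentation, Python’s built-in static server is one easy way to get such a listing, since it auto-generates directory indexes (this is an illustration only; any server with directory listings enabled will do):

python3 -m http.server 8080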

In similar systems, I have also used `bash` or `PowerShell` scripts to simply list all the datasets present: no information about them, just that they exist.

ls --format=single-column ./data/* > index.txt

The location of that file can then be passed to the script as its starting point for gathering all the information.

// Resolve index.txt relative to the folder the page is served from
const basePath = new URL('.', window.location.href);
const datasets = new URL('index.txt', basePath).href;

Once we have the list of data to be aggregated, we can begin the process of downloading, parsing, and storing.

Since we are using PouchDB as the local store, we can immediately begin loading the text into the local database.

const db = new PouchDB('searcher');
async function LoadDB(){
  // Fetch the list of datasets advertised by the server
  let recs = await fetch('./index.txt');
  recs = await recs.text();
  recs = recs.split('\n').filter(loc => loc.length > 0);
  for(let loc of recs){
    // Download each dataset and store its full text locally
    let content = await fetch(loc);
    let nRec = {
      _id: loc,
      text: await content.text()
    };
    // Only write the record if it is new or has changed
    let oRec = await db.get(loc).catch(() => null);
    if(RecordsDiffer(nRec, oRec)){
      if(oRec){ nRec._rev = oRec._rev; }
      await db.put(nRec);
    }
  }
}

The full text is now stored in the database for future use by the user. It is recommended that this be scanned periodically to see if any changes have occurred. (loader.js)
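The RecordsDiffer check above lives in loader.js; a minimal version, assuming we only care about whether the stored text has changed, might look like this:

function RecordsDiffer(nRec, oRec){
  // A minimal sketch; the real helper is in loader.js.
  // oRec is null when the document has never been stored before.
  if(!oRec){ return true; }
  return nRec.text !== oRec.text;
}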

Sanitising the Text

In database development, and searching, an index can be thought of as a quick reference. In the days of manual searching, a card catalogue offered an alphabetic search ordered by subject, that allowed a quick lookup. Rather than having to look through every book in an entire building, you can look through a smaller listing in a single box.

Less is more, smaller is faster.

For the purposes of this discussion, it is useful to have a couple of documents we want to search:

Fact 21: For every 25 percent increase in problem complexity, there is a 100 percent increase in complexity of the software solution. That’s not a condition to try to change (even though reducing complexity is always a desirable thing to do); that’s just the way it is.

Fact 22: Eighty percent of software work is intellectual. A fair amount of it is creative. Little of it is clerical.

— Robert L. Glass, Facts and Fallacies of Software Engineering

So our sample for discussion will revolve around indexing the text of Glass’s Facts to allow for rapid lookup. This might be useful to attach to a microphone in the office which then displays an appropriate “fact” depending on what is being discussed.

Each fact can be placed in a separate record for our indexing to pick up later. This will allow us to treat each fact as a distinct entity.

./data/fact-01.txt
./data/fact-02.txt
./data/fact-03.txt

One of the issues with text such as this is that much of it is actually meaningless, or at least carries little meaning. Some level of transformation should be performed to remove needless information.

function sanitize(text){
  text = text.toLowerCase();
  text = text.replace(/[^a-z]/g, '.');
  text = text.split('.');
  text = text.filter(d => d.length > 3);
  return text;
}
let sanitized = sanitize(facts[21]);

As part of indexing fact number 21, we remove

  • case
  • non-text
  • short words (three characters or fewer, a crude stand-in for stop-word removal)

resulting in a sanitised list of words. (crepo.js)

[ 'every', 'percent', 'increase', 'problem', 'complexity', 'there', 'percent', 'increase', 'complexity', 'software', 'solution', 'that', 'condition', 'change', 'even', 'though', 'reducing', 'complexity', 'always', 'desirable', 'thing', 'that', 'just' ] 

The list can be further reduced by noting the duplicate words. A word count is a useful way to weight the value of a term in a search.

function WordCount(list){
  let count = {};
  for(let word of list){
    count[word] = count[word] || 0;
    count[word]++;
  }
  return count;
}
let count = WordCount(sanitized);
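Running the word count over the sanitised list highlights which terms carry the most weight; for Fact 21 the (abridged) result looks like:

{ every: 1, percent: 2, increase: 2, problem: 1, complexity: 3, there: 1, software: 1, solution: 1, that: 2, ... }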

Creating the index

For the purpose of this search, I chose to do a match on each word. This was done for simplicity, and because it allows for searching of words out of order.

Examples of the kinds of searches I want a user to be able to use:

  • complexity: a whole-word match (the document might only contain complex)
  • ompl: a word that is a partial match
  • software olut: multiple words
  • solution software: words out of order

These examples represent a couple of different cases that we need to handle. The most interesting is the “partial match” scenario.

To create an index, we build a list of partial keys for each word, with a score that drops as characters are trimmed from the front or the back of the word:

| key      | score | document |
| -------- | ----- | -------- |
| database | 100 | <url> |
| atabase | 99 | <url> |
| tabase | 98 | <url> |
| abase | 97 | <url> |
| base | 96 | <url> |
| ase | 95 | <url> |
| databas | 99 | ... |
| databa | 98 | ... |
| ... | ... | ... |
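
The table above can be generated with a small key-expansion routine. Below is a minimal sketch of that idea; the names (IndexWord, MIN_KEY) are illustrative and not taken from the actual search.js.

const MIN_KEY = 3; // shortest partial key worth indexing
function IndexWord(word, docId){
  // Start with the whole word at full score
  const rows = [{ key: word, score: 100, document: docId }];
  // Trim characters from the front: "database" -> "atabase" -> "tabase" ...
  for(let i = 1; word.length - i >= MIN_KEY; i++){
    rows.push({ key: word.slice(i), score: 100 - i, document: docId });
  }
  // Trim characters from the back: "database" -> "databas" -> "databa" ...
  for(let i = 1; word.length - i >= MIN_KEY; i++){
    rows.push({ key: word.slice(0, word.length - i), score: 100 - i, document: docId });
  }
  return rows;
}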

When you search for “data search”, it finds

document 1
  data    97
  search  100

That gives the document a total of 197 points; results are then sorted with the highest-scoring documents first. (Code: search.js)
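A rough sketch of that lookup, assuming the index rows produced above and reusing the sanitize function from earlier (the real logic lives in search.js):

function Search(index, query){
  // Normalise the query the same way the documents were normalised
  const terms = sanitize(query);
  const totals = {};
  for(const term of terms){
    for(const row of index){
      if(row.key === term){
        // Accumulate each matching key's score against its document
        totals[row.document] = (totals[row.document] || 0) + row.score;
      }
    }
  }
  // Highest-scoring documents first
  return Object.entries(totals)
    .map(([doc, score]) => ({ document: doc, score }))
    .sort((a, b) => b.score - a.score);
}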

Extras

Lemmatization

As another layer, we can consider lemmas. Lemmatization refers to reducing a word to its most foundational form. For example, intellectual can be thought of as being the same as intellect. This helps reduce the typographic differences that occur later: a person who remembers “that is” should still find “that’s” (come to think of it, “is” is a short word and would already have been filtered out).

every 25 percent increase problem complex there 100 percent increase complex software solution that condition change even though reduce complex always desire thing that just

This is a much simpler version of the text, which reduces the amount of informational entropy (that is, the number of things that can go wrong in my head when remembering).

This text is well sanitised and ready to undergo indexing.
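Proper lemmatization relies on a dictionary, but a deliberately crude suffix-stripping stemmer gives a feel for the idea. It will not reproduce the sample above exactly, and the suffix list here is purely illustrative:

function Stem(word){
  // Strip a few common suffixes; good enough as long as the same
  // normalisation is applied to both the documents and the query.
  return word
    .replace(/ity$/, '')      // complexity -> complex
    .replace(/(ing|ed)$/, '') // reducing   -> reduc
    .replace(/s$/, '');       // solutions  -> solution
}
let stemmed = sanitize(facts[21]).map(Stem);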

Conclusion

The next morning, after writing the solution, I felt pretty pleased with myself for having a complete solution that I could email to my primary customers (a few Data Science managers throughout the organisation). I was so pleased, I showed it to a colleague who helped me with optimisations to the index.


Years later, and it is still the way the organisation tracks available conda packages.

To be fair, the only reason I was able to pull this off overnight was that I have implemented this so many times. I have used browser-side indexes to implement dashboards for regression testing, employee work allocation, personal blog searches, and a Project Gutenberg search engine. It even ended up being the basis for a class I taught on Introductory JavaScript.

It is one of the reasons I love JavaScript: it is always available for data processing, whatever crazy idea you happen to have at that moment.

Since originally writing this, I have discovered that Nolan Lawson has actually written a full-text search plugin for PouchDB. You should use his solution, which is based on the lunr engine.

Non-Optimal

This is not an optimal solution.

In this case, every user of a search engine must download and generate their own instance of the index. This requires that the client download the entirety of the dataset the first time (increasing network traffic), for each browser they use.

In the regression test dashboard, the initial download and parse of the test results took approximately 3 hours. Caching meant that it was almost instantaneous if you kept up to date each day, but that first load was a big one.

One of the key advantages to server-side processing is the ability to take the inbound data, and generate the indexes once for all the customers. You could split the difference by generating the index on the server and having the clients download that constructed index. This would split the processing load between a central processor for the central data, and distributed processing for the individual searches.

This hints at an interesting balance between shared and distributed processing. Using databases that support the CouchDB interchange protocol (CouchBase, CouchDB, PouchDB, CloudAnt), it is possible for each user to process new data as they find it, but also to share the results back to a central pool which is used by the next person. This Lazy-Load form of data processing has some theoretical advantages to it (in a trusted environment) in that it would require very little central processing (expensive) and instead rely on workstations that are already in use. At the same time, it would share the workload to ensure no one computer took the brunt of the full calculation.
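With PouchDB, this kind of sharing is largely built in via replication. A sketch, assuming a hypothetical central CouchDB instance (the URL below is a placeholder, not part of the original solution):

// Continuously replicate the locally built records to a shared central
// database, and pull down whatever other users have already processed.
db.sync('https://couch.example.org/searcher', { live: true, retry: true })
  .on('error', err => console.error('sync failed', err));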

Lastly, if this solution intrigued you, you should look into the Lucene-based searches provided for CouchDB and included with all the commercial offerings.

Please leave a comment or ask a question, and if you found this content valuable, please remember to click on follow.

