Info
Sprawler is in its developmental, pre-alpha stage. A
release has recently been made and can be found in the
Downloads section. If you would like to help with Sprawler, please
go to the Help/Donations Wanted page
for more information about helping us out.
Documentation
As of now, here are the Sprawler docs. There's more to come, so check back.
First of all, here is the general idea of how Sprawler will work:
Sprawler will index the internet. That's the simplest way to put it.
There are really two major parts to every search engine - an indexer
and a searcher. The tool you use on any engine is of course the
searcher part. The indexer runs totally independently of the search
part (it has to, or it would take weeks, months, or years to return your
results!). The indexer builds an index (duh), and the searcher digs through
that index.
I believe the indexer should be broken into two pieces - a master
indexer and a client indexer. The master is responsible for putting
the indexed data that it has received from a client into a database
(more on that later) and for keeping track of urls to be sprawled, already
sprawled, or in need of resprawling (I use "sprawl" to mean index).
The master hands a list of urls (quantity configurable on the client side,
up to a maximum setting on the master side) to each client, and expects
to receive some sprawled data in return after some time period.
The client indexer takes the list of urls from the master and sprawls
each one, grabbing important data like the other urls on the page, bold
words, title words, italic words, linked words, alt text for images,
etc., and puts all of that into a specified format. The client saves
that data to disk in a tidy format, and upon completing the sprawling of
the urls it received from the master, it returns the sprawled data
to the master for insertion into the master index. The client then
requests more urls; rinse, repeat.
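To make that exchange concrete, here is a minimal sketch (in perl) of
what the sprawled data for a single url might look like. Every field
name and value here is hypothetical and simply mirrors the list above -
the real on-disk format hasn't been decided yet.

  # Hypothetical record for one sprawled url; field names are placeholders.
  my %page = (
      url          => 'http://www.sprawler.com/index.html',
      title_words  => [ 'sprawler', 'distributed', 'search' ],
      bold_words   => [ 'sprawler' ],
      italic_words => [],
      linked_words => [ 'downloads', 'documentation' ],
      alt_text     => [ 'sprawler logo' ],
      found_urls   => [ 'http://www.sprawler.com/downloads.html' ],
  );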
I think this is the correct structure for the engine - it allows us to
have many distributed client indexers all utilizing "cheap" bandwidth
(broadband?), and sending just the important information and compressed
data to the master - the master can focus on inserting the data in the
right spot in the index. There could possibly be more than one master -
however, I have not made any plans for this at this time. I think this
structure allows us nearly infinite growth possibilities.
Now - the real problem isn't digging through the net - that's easy.
Anyone can write a crawler. The hardest part of building a search
engine is the database. How do you store 1-5 BILLION pieces of
information and find that one tidbit in less than a quarter of a
second? How do you store the fact that one word may occur 10-50
billion times on the net - or more? Imagine having even one billion
CDs - how would you organize them so you could find any one of them in
less than a second? Well, I've spent a long, long time thinking about
those questions, and I believe I have sufficient answers to them.
My solution isn't mind-blowing; it is simple. So simple, in fact, that it
doesn't even seem like it would work well. However, some preliminary
testing I have done shows it works very well.
So here it is - basically, file systems are fast. Very fast. We also
know a few key pieces of information about the data we are looking for
- we know what it is (a word, a url), and we know that it is exactly that.
There is no doubt that the word "expansion" is the word
"expansion". We also know that a url is unique (by definition).
Therefore, we can compute a normalized unique identifier for a word or
url, use that as an index identifier, and use that id to store the
information for that unique piece of data in a very deep tree.
So - now that I've mentioned the "theory" part of that, here's more of
a practical use description:
The word "expansion" for example, has a unique signature that's easy
to compute. That unique id is "1007607386". It is only a one way id
however - you can't take that id, and compute the word "expansion" from
it. Similar to many cryptographic methods. So now, I can take that
unique id, and build a directory structure from that:
/1/0/0/7/6/7/3/8/6/
and store a data file inside that directory:
/1/0/0/7/6/7/3/8/6/1007607386.db
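For illustration, a handful of lines of perl along these lines can do
the word-to-path mapping. The checksum used below is CRC32 from the
String::CRC32 CPAN module - that is an assumption for this sketch (the
exact checksum Sprawler will use isn't named here), so the id it prints
may not match the example value above.

  use strict;
  use warnings;
  use String::CRC32;   # assumption: stand-in for whatever fast checksum we settle on

  # Turn a word into its unique id, then into the deep directory path
  # where its db file lives (one directory level per digit of the id).
  sub word_to_dbfile {
      my ($word) = @_;
      my $id  = crc32($word);
      my $dir = join('/', split(//, "$id"));
      return "/$dir/$id.db";
  }

  print word_to_dbfile('expansion'), "\n";
  # With an id of 1007607386 this would print /1/0/0/7/6/0/7/3/8/6/1007607386.db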
So now I can take the word "expansion", calculate the unique id (which,
for those of you who are curious, is a simple checksum - very, very fast
to compute, with a 1 in 4 billion chance that we'll have a duplicate (not
exactly, but I won't go into that now)), and find the data file in just
a few lines of perl code (simple and fast lines, too!). You can also go
in reverse - from the id to the word - since the word is stored inside
the db file. As a side note, one could chop the id into larger chunks,
like /10/07/60/73/86/1007607386.db, but for growth, the per-digit
method is better.
Urls are stored the same way - the url
"http://www.sprawler.com/index.html" is broken into two parts (possibly
three): a domain part and a path part (and possibly a protocol part).
So, the previously mentioned url would break into pieces like
"http://www.sprawler.com" and "/index.html" (or possibly "http://",
"www.sprawler.com", and "/index.html"). Each piece would be id'ed with
the above method and converted to something like:
http:// = 3947228689
www.sprawler.com = 3942606425
/index.html = 2560468444
So if we did the three-tier method, we'd have:
/3947228689/ (all http indexes would be in this directory)
/3947228689/3/9/4/2/6/0/6/4/2/5/ (all http indexes for the
www.sprawler.com domain would be in here)
/3947228689/3/9/4/2/6/0/6/4/2/5/2/5/6/0/4/6/8/4/4/4/2560468444.db
(all information for the page: http://www.sprawler.com/index.html would be
in this db file)
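As a sketch, the same few lines of perl extend to urls. Again, CRC32
stands in for the real checksum (so this code is not guaranteed to
reproduce the example ids above), and the splitting of a url into
protocol, domain, and path pieces is assumed to have happened already.

  use strict;
  use warnings;
  use String::CRC32;   # assumption: stand-in checksum, as before

  # One directory per digit of an id: 3942606425 becomes 3/9/4/2/6/0/6/4/2/5
  sub digits { return join('/', split(//, "$_[0]")); }

  # Build the three-tier db file path for a url from its three pieces.
  sub url_to_dbfile {
      my ($proto, $domain, $path) = @_;
      my $proto_id  = crc32($proto);    # whole id as the top directory
      my $domain_id = crc32($domain);   # split per digit
      my $path_id   = crc32($path);     # split per digit, and names the db file
      return "/$proto_id/" . digits($domain_id) . '/' . digits($path_id) . "/$path_id.db";
  }

  print url_to_dbfile('http://', 'www.sprawler.com', '/index.html'), "\n";
  # With the example ids above, this would print:
  # /3947228689/3/9/4/2/6/0/6/4/2/5/2/5/6/0/4/6/8/4/4/4/2560468444.db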
So in each word db file, we store information about which url ids that
word is found in, and in each url db file, we store which words can be
found on that page (and possibly where), along with other information
(page size, type, language, last index date, etc).
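Neither db file format is pinned down yet, but purely as a hypothetical
sketch, the two kinds of records might carry something like this (all
values below are made up for illustration):

  # Hypothetical word db record: the word itself (so the id can be mapped
  # back to it) plus the ids of the urls it appears on.
  my %word_record = (
      word    => 'expansion',
      url_ids => [ 2560468444 ],
  );

  # Hypothetical url db record: the word ids found on the page plus the
  # other information mentioned above.
  my %url_record = (
      url        => 'http://www.sprawler.com/index.html',
      word_ids   => [ 1007607386 ],
      page_size  => 10240,          # bytes
      type       => 'text/html',
      language   => 'en',
      last_index => '2004-01-01',
  );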
Yes, the directories are nasty - yes, they are hard to read. No
worries - we never have to read them; the computer can open these files very
quickly. Plus, for those of you familiar with NFS, this layout makes it very
easy to put many (also very fast) NFS servers behind a team of search
machines to supply db file data quickly. We can also scale slowly
or rapidly, depending on growth, and we can add disk space on the fly,
live, without disrupting service.
So - what do we need? We need modules and subroutines to grab the data
from pages and store it in a predefined way. I have lots of
experience with these things, so I can guide you all through the
good/bad/ugly of it if needed. We need a functioning client and master
indexer. We'll need some bare-bones console db test scripts. Of
course, we'll need some hardware, bandwidth, and disk space. Don't
worry about all that yet - let's get the code going, and I'll worry about
the rest.
Build map for client-indexer
This file attempts to outline the basic routines needed for the client indexer code.
It is meant to give a rough guide on how to create the client portion of the
indexing tool; a sketch of the resulting main loop follows the module list below.
The following actions should take place when the indexer starts:
- Load configuration data from conf file
- Look for urls that still need to be indexed (never completed from last run)
- If found, index these urls
- If not found, request a new batch of urls from the master
- Reload the conf file when one run is completed, so changes can be made on the fly
- Loop until killed
Modules needed: (- to do, * done)
- Gets urls from disk and the master indexer; returns 1 if successful, 0 if not. Takes the number of urls as an argument
* Grabs data from a url and returns it in a list. Takes the url as an argument
- Takes that data, "indexes" it, and returns the gobs of sifted indexes
- Takes indexed data and stores it in the "completed" area; returns 1 on success, 0 if not
* Loads the config file; returns a hash of config values
- Yanks urls out of the html received and returns them to the indexer module
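Pulling the action list and the module list together, the client's main
loop might look roughly like this. Every subroutine below is an empty
stub standing in for one of the modules above, so the names are
placeholders, not settled interfaces.

  use strict;
  use warnings;

  # Stubs standing in for the client modules listed above.
  sub load_config              { return (batch_size => 10) }  # loads config file
  sub leftover_urls            { return () }                  # urls never finished last run
  sub request_urls_from_master { return () }                  # gets a new batch from the master
  sub grab_url                 { return '' }                  # grabs data from a url
  sub sift_data                { return () }                  # "indexes" the grabbed data
  sub store_completed          { return 1 }                   # stores data in the "completed" area

  while (1) {
      my %conf = load_config('client.conf');   # reloaded every pass, so config
                                               # changes can be made on the fly
      my @urls = leftover_urls();              # finish anything left from last run
      @urls = request_urls_from_master($conf{batch_size}) unless @urls;

      for my $url (@urls) {
          my $html   = grab_url($url);             # fetch the page
          my @sifted = sift_data($url, $html);     # bold/title/linked words, alt text, urls...
          store_completed($url, @sifted);          # park it until check-in with the master
      }
      last unless @urls;   # only so the stubbed example exits; the real client loops until killed
  }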
Build map for master-indexer
This file attempts to outline the basic routines needed for the master indexer code.
It is meant to give a rough guide on how to create the master portion of the
indexing tool; a sketch of the resulting routine follows the module list below.
The following actions should take place when the indexer starts:
- Load configuration data from conf file
- Look for urls that have been indexed, and checked-in
- Pull data from those files, and store in the real engine index
- Remove file from checked-in directory
- Look for requests for urls; if found, check out the next batch of urls as requested
Modules needed: (- to do, * done)
* Loads the config file specified
- Checks in indexed url files
- Sifts data from checked-in files
- Saves the data sifted from checked-in files to the engine index
- Looks for and responds to url requests (checks out urls)
- Calculates a url-id from a url
- Calculates a word-id from a word
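Likewise, here is a rough sketch of one pass of the master, with every
subroutine an empty stub standing in for one of the modules above (all
names are placeholders):

  use strict;
  use warnings;

  # Stubs standing in for the master modules listed above.
  sub load_config          { return () }   # loads the config file specified
  sub checked_in_files     { return () }   # finds url files clients have checked in
  sub sift_checked_in      { return () }   # sifts data from a checked-in file
  sub save_to_index        { return 1 }    # saves sifted data to the engine index
  sub pending_url_requests { return () }   # looks for clients asking for urls
  sub check_out_urls       { return () }   # hands out the next batch of urls

  my %conf = load_config('master.conf');

  # Fold freshly checked-in data into the real engine index.
  for my $file (checked_in_files()) {
      my @data = sift_checked_in($file);
      save_to_index(@data);
      unlink $file or warn "could not remove $file: $!";   # clear the checked-in directory
  }

  # Answer any clients waiting on a new batch of urls.
  for my $request (pending_url_requests()) {
      check_out_urls($request);
  }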