Info
Sprawler is in its developmental, pre-alpha stage. A
release has recently been made and can be found in the
Downloads section. If you would like to help with Sprawler, please
go to the Help/Donations Wanted page
for more information about helping us out.
Documentation
As of now, here are the Sprawler docs. There's more to come, so check back.
First of all, here is the general idea of how Sprawler will work:
Sprawler will index the internet. That's the simplest way to put it.
There are really two major parts to every search engine - an indexer
and a searcher. The tool you use on any engine is of course the
searcher part. The indexer runs totally independently of the search
part (it has to, or it would take weeks, months, or years to return your
results!). The indexer builds an index (duh), and the searcher digs through
that index.
I believe the indexer should be broken into two pieces - a master
indexer and a client indexer. The master is responsible for putting
the indexed data that it has received from a client into a database
(more on that later) and for keeping track of urls to be sprawled, already
sprawled, or in need of resprawling (I use "sprawl" to mean index).
The master hands a list of urls (quantity configurable on the client side,
up to a maximum setting on the master side) to each client, and expects
to receive some sprawled data in return after some time period.
The client indexer takes the list of urls from the master and sprawls
each one, grabbing important data like the other urls on the page, bold
words, title words, italic words, linked words, alt text for images,
etc., and puts all of that into a specified format. The client saves
that data to disk in a tidy format, and upon completing the sprawling of
the urls it received from the master, it returns the sprawled data
to the master for insertion into the master index. The client then
requests more urls; rinse, repeat.
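To make that exchange concrete, here is a minimal sketch (in perl) of
what the sprawled data for a single url might look like. Every field
name and value here is hypothetical and simply mirrors the list above -
the real on-disk format hasn't been decided yet.

  # Hypothetical record for one sprawled url; field names are placeholders.
  my %page = (
      url          => 'http://www.sprawler.com/index.html',
      title_words  => [ 'sprawler', 'distributed', 'search' ],
      bold_words   => [ 'sprawler' ],
      italic_words => [],
      linked_words => [ 'downloads', 'documentation' ],
      alt_text     => [ 'sprawler logo' ],
      found_urls   => [ 'http://www.sprawler.com/downloads.html' ],
  );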
I think this is the correct structure for the engine - it allows us to
have many distributed client indexers all utilizing "cheap" bandwidth
(broadband?), and sending just the important information and compressed
data to the master - the master can focus on inserting the data in the
right spot in the index. There could possibly be more than one master -
however, I have not made any plans for this at this time. I think this
structure allows us nearly infinite growth possibilities.
Now - the real problem isn't digging through the net - that's easy.
Anyone can write a crawler. The hardest part of building a search
engine is the database. How do you store 1-5 BILLION pieces of
information and find that one tidbit in less than a quarter of a
second? How do you store the fact that one word may occur 10-50
billion times on the net - or more? Imagine having even one billion
CDs - how would you organize them so you could find any one of them in
less than a second? Well, I've spent a long, long time thinking about
those questions, and I believe I have sufficient answers to them.
My solution isn't mind-blowing; it is simple. So simple, in fact, that it
doesn't even seem like it would work well. However, some preliminary
testing I have done shows it works very well.
So here it is - basically, file systems are fast. Very fast. We also
know a few key pieces of information about the data we are looking for
- we know what it is (a word, a url), and we know that it is exactly that.
There is no doubt that the word "expansion" is the word
"expansion". We also know that a url is unique (by definition).
Therefore, we can compute a normalized unique identifier for a word or
url, use that as an index identifier, and use that id to store the
information for that unique piece of data in a very deep tree.
So - now that I've mentioned the "theory" part of that, here's more of
a practical use description:
The word "expansion" for example, has a unique signature that's easy
to compute. That unique id is "1007607386". It is only a one way id
however - you can't take that id, and compute the word "expansion" from
it. Similar to many cryptographic methods. So now, I can take that
unique id, and build a directory structure from that:
/1/0/0/7/6/7/3/8/6/
and store a data file inside that directory:
/1/0/0/7/6/7/3/8/6/1007607386.db
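For illustration, a handful of lines of perl along these lines can do
the word-to-path mapping. The checksum used below is CRC32 from the
String::CRC32 CPAN module - that is an assumption for this sketch (the
exact checksum Sprawler will use isn't named here), so the id it prints
may not match the example value above.

  use strict;
  use warnings;
  use String::CRC32;   # assumption: stand-in for whatever fast checksum we settle on

  # Turn a word into its unique id, then into the deep directory path
  # where its db file lives (one directory level per digit of the id).
  sub word_to_dbfile {
      my ($word) = @_;
      my $id  = crc32($word);
      my $dir = join('/', split(//, "$id"));
      return "/$dir/$id.db";
  }

  print word_to_dbfile('expansion'), "\n";
  # With an id of 1007607386 this would print /1/0/0/7/6/0/7/3/8/6/1007607386.db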
So now I can take the word "expansion", calculate the unique id (which,
for those of you who are curious, is a simple checksum - very, very fast
to compute, with a 1 in 4 billion chance that we'll have a duplicate (not
exactly, but I won't go into that now)), and find the data file in just
a few lines of perl code (simple and fast lines, too!). You can also go
in reverse - from the id to the word - since the word is stored inside
the db file. As a side note, one could chop the id into larger chunks,
like /10/07/60/73/86/1007607386.db, but for growth, the per-digit
method is better.
Urls are stored the same way - the url
"http://www.sprawler.com/index.html" is broken into two parts (possibly
three): a domain part and a path part (and possibly a protocol part).
So, the previously mentioned url would break into pieces like
"http://www.sprawler.com" and "/index.html" (or possibly "http://",
"www.sprawler.com", and "/index.html"). Each piece would be id'ed with
the above method and converted to something like:
http:// = 3947228689
www.sprawler.com = 3942606425
/index.html = 2560468444
So if we did the three-tier method, we'd have:
/3947228689/ (all http indexes would be in this directory)
/3947228689/3/9/4/2/6/0/6/4/2/5/ (all http indexes for the
www.sprawler.com domain would be in here)
/3947228689/3/9/4/2/6/0/6/4/2/5/2/5/6/0/4/6/8/4/4/4/2560468444.db
(all information for the page: http://www.sprawler.com/index.html would be
in this db file)
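As a sketch, the same few lines of perl extend to urls. Again, CRC32
stands in for the real checksum (so this code is not guaranteed to
reproduce the example ids above), and the splitting of a url into
protocol, domain, and path pieces is assumed to have happened already.

  use strict;
  use warnings;
  use String::CRC32;   # assumption: stand-in checksum, as before

  # One directory per digit of an id: 3942606425 becomes 3/9/4/2/6/0/6/4/2/5
  sub digits { return join('/', split(//, "$_[0]")); }

  # Build the three-tier db file path for a url from its three pieces.
  sub url_to_dbfile {
      my ($proto, $domain, $path) = @_;
      my $proto_id  = crc32($proto);    # whole id as the top directory
      my $domain_id = crc32($domain);   # split per digit
      my $path_id   = crc32($path);     # split per digit, and names the db file
      return "/$proto_id/" . digits($domain_id) . '/' . digits($path_id) . "/$path_id.db";
  }

  print url_to_dbfile('http://', 'www.sprawler.com', '/index.html'), "\n";
  # With the example ids above, this would print:
  # /3947228689/3/9/4/2/6/0/6/4/2/5/2/5/6/0/4/6/8/4/4/4/2560468444.db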
So in each word db file, we store information about which url ids that
word is found in, and in each url db file, we store which words can be
found on that page (and possibly where), along with other information
(page size, type, language, last index date, etc).
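Neither db file format is pinned down yet, but purely as a hypothetical
sketch, the two kinds of records might carry something like this (all
values below are made up for illustration):

  # Hypothetical word db record: the word itself (so the id can be mapped
  # back to it) plus the ids of the urls it appears on.
  my %word_record = (
      word    => 'expansion',
      url_ids => [ 2560468444 ],
  );

  # Hypothetical url db record: the word ids found on the page plus the
  # other information mentioned above.
  my %url_record = (
      url        => 'http://www.sprawler.com/index.html',
      word_ids   => [ 1007607386 ],
      page_size  => 10240,          # bytes
      type       => 'text/html',
      language   => 'en',
      last_index => '2004-01-01',
  );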
Yes, the directories are nasty - yes, they are hard to read. No
worries - we never have to read them; the computer can open these files very
quickly. Plus, for those of you familiar with NFS, this layout makes it very
easy to put many (also very fast) NFS servers behind a team of search
machines to supply db file data quickly. We can also scale slowly
or rapidly, depending on growth, and we can add disk space on the fly,
live, without disrupting service.
So - what do we need? We need modules and subroutines to grab the data
from pages and store it in a predefined way. I have lots of
experience with these things, so I can guide you all through the
good/bad/ugly of it if needed. We need a functioning client and master
indexer. We'll need some bare-bones console db test scripts. Of
course, we'll need some hardware, bandwidth, and disk space. Don't
worry about all that yet - let's get the code going, and I'll worry about
the rest.
Build map for client-indexer
This file attempts to outline the basic routines needed for the client indexer code.
It is meant to give a rough guide on how to create the client portion of the
indexing tool; a sketch of the resulting main loop follows the module list below.
The following actions should take place when the indexer starts:
- Load configuration data from conf file
- Look for urls that still need to be indexed (never completed from last run)
- If found, index these urls
- If not found, request a new batch of urls from the master
- Reload the conf file when one run is completed, so changes can be made on the fly
- Loop until killed
Modules needed: (- to do, * done)
- Gets urls from disk and the master indexer; returns 1 if successful, 0 if not. Takes the number of urls as an argument
* Grabs data from a url and returns it in a list. Takes the url as an argument
- Takes that data, "indexes" it, and returns the gobs of sifted indexes
- Takes indexed data and stores it in the "completed" area; returns 1 on success, 0 if not
* Loads the config file; returns a hash of config values
- Yanks urls out of the html received and returns them to the indexer module
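Pulling the action list and the module list together, the client's main
loop might look roughly like this. Every subroutine below is an empty
stub standing in for one of the modules above, so the names are
placeholders, not settled interfaces.

  use strict;
  use warnings;

  # Stubs standing in for the client modules listed above.
  sub load_config              { return (batch_size => 10) }  # loads config file
  sub leftover_urls            { return () }                  # urls never finished last run
  sub request_urls_from_master { return () }                  # gets a new batch from the master
  sub grab_url                 { return '' }                  # grabs data from a url
  sub sift_data                { return () }                  # "indexes" the grabbed data
  sub store_completed          { return 1 }                   # stores data in the "completed" area

  while (1) {
      my %conf = load_config('client.conf');   # reloaded every pass, so config
                                               # changes can be made on the fly
      my @urls = leftover_urls();              # finish anything left from last run
      @urls = request_urls_from_master($conf{batch_size}) unless @urls;

      for my $url (@urls) {
          my $html   = grab_url($url);             # fetch the page
          my @sifted = sift_data($url, $html);     # bold/title/linked words, alt text, urls...
          store_completed($url, @sifted);          # park it until check-in with the master
      }
      last unless @urls;   # only so the stubbed example exits; the real client loops until killed
  }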
Build map for master-indexer
This file attempts to outline the basic routines needed for the master indexer code.
It is meant to give a rough guide on how to create the master portion of the
indexing tool; a sketch of the resulting routine follows the module list below.
The following actions should take place when the indexer starts:
- Load configuration data from conf file
- Look for urls that have been indexed, and checked-in
- Pull data from those files, and store in the real engine index
- Remove file from checked-in directory
- Look for requests for urls; if found, check out the next batch of urls as requested
Modules needed: (- to do, * done)
* Loads the config file specified
- Checks in indexed url files
- Sifts data from checked-in files
- Saves the data sifted from checked-in files to the engine index
- Looks for and responds to url requests (checks out urls)
- Calculates a url-id from a url
- Calculates a word-id from a word
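Likewise, here is a rough sketch of one pass of the master, with every
subroutine an empty stub standing in for one of the modules above (all
names are placeholders):

  use strict;
  use warnings;

  # Stubs standing in for the master modules listed above.
  sub load_config          { return () }   # loads the config file specified
  sub checked_in_files     { return () }   # finds url files clients have checked in
  sub sift_checked_in      { return () }   # sifts data from a checked-in file
  sub save_to_index        { return 1 }    # saves sifted data to the engine index
  sub pending_url_requests { return () }   # looks for clients asking for urls
  sub check_out_urls       { return () }   # hands out the next batch of urls

  my %conf = load_config('master.conf');

  # Fold freshly checked-in data into the real engine index.
  for my $file (checked_in_files()) {
      my @data = sift_checked_in($file);
      save_to_index(@data);
      unlink $file or warn "could not remove $file: $!";   # clear the checked-in directory
  }

  # Answer any clients waiting on a new batch of urls.
  for my $request (pending_url_requests()) {
      check_out_urls($request);
  }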