- Transparency. Nutch is open source, so anyone can see how the ranking algorithms work. With commercial search engines, the precise details of the algorithms are secret so you can never know why a particular search result is ranked as it is. Furthermore, some search engines allow rankings to be based on payments, rather than on the relevance of the site's contents. Nutch is a good fit for academic and government organizations, where the perception of fairness of rankings may be more important.
- Understanding. We don't have the source code to Google, so Nutch is probably the best we have. It's interesting to see how a large search engine works. Nutch has been built using ideas from academia and industry: for instance, core parts of Nutch are currently being re-implemented to use the MapReduce distributed processing model, which emerged from Google Labs last year. And Nutch is attractive for researchers who want to try out new search algorithms, since it is so easy to extend.
- Extensibility. Don't like the way other search engines display their results? Write your own search engine--using Nutch! Nutch is very flexible: it can be customized and incorporated into your application. For developers, Nutch is a great platform for adding search to heterogeneous collections of information, and being able to customize the search interface, or extend the out-of-the-box functionality through the plugin mechanism. For example, you can integrate it into your site to add a search capability.
Nutch installations typically operate at one of three scales: local filesystem, intranet, or whole web. All three have different characteristics. For instance, crawling a local filesystem is reliable compared to the other two, since network errors don't occur and caching copies of the page content is unnecessary (and actually a waste of disk space). Whole-web crawling lies at the other extreme. Crawling billions of pages creates a whole host of engineering problems to be solved: which pages do we start with? How do we partition the work between a set of crawlers? How often do we re-crawl? How do we cope with broken links, unresponsive sites, and unintelligible or duplicate content? There is another set of challenges to solve to deliver scalable search--how do we cope with hundreds of concurrent queries on such a large dataset? Building a whole-web search engine is a major investment. In "Building Nutch: Open Source Search," authors Mike Cafarella and Doug Cutting (the prime movers behind Nutch) conclude that:
... a complete system might cost anywhere between $800 per month for two-search-per-second performance over 100 million pages, to $30,000 per month for 50-page-per-second performance over 1 billion pages.
This series of two articles shows you how to use Nutch at the more modest intranet scale (note that you may see this term being used to cover sites that are actually on the public internet--the point is the size of the crawl being undertaken, which ranges from a single site to tens, or possibly hundreds, of sites). This first article concentrates on crawling: the architecture of the Nutch crawler, how to run a crawl, and understanding what it generates. The second looks at searching, and shows you how to run the Nutch search application, ways to customize it, and considerations for running a real-world system.
Nutch Vs. Lucene
Nutch is built on top of Lucene, which is an API for text indexing and searching. A common question is: "Should I use Lucene or Nutch?" The simple answer is that you should use Lucene if you don't need a web crawler. A common scenario is that you have a web front end to a database that you want to make searchable. The best way to do this is to index the data directly from the database using the Lucene API, and then write code to do searches against the index, again using Lucene. Erik Hatcher and Otis Gospodnetić's Lucene in Action gives all of the details. Nutch is a better fit for sites where you don't have direct access to the underlying data, or it comes from disparate sources.
Architecture
Nutch divides naturally into two pieces: the crawler and the searcher. The crawler fetches pages and turns them into an inverted index, which the searcher uses to answer users' search queries. The interface between the two pieces is the index, so apart from an agreement about the fields in the index, the two are highly decoupled. (Actually, it is a little more complicated than this, since the page content is not stored in the index, so the searcher needs access to the segments described below in order to produce page summaries and to provide access to cached pages.)
The main practical spin-off from this design is that the crawler and searcher systems can be scaled independently on separate hardware platforms. For instance, a highly trafficked search page that provides searching for a relatively modest set of sites may only need a correspondingly modest investment in the crawler infrastructure, while requiring more substantial resources for supporting the searcher.
We will look at the Nutch crawler here, and leave discussion of the searcher to part two.
The Crawler
The crawler system is driven by the Nutch crawl tool and a family of related tools that build and maintain several types of data structures: the web database, a set of segments, and the index. We describe all of these in more detail next.
The web database, or WebDB, is a specialized persistent data structure for mirroring the structure and properties of the web graph being crawled. It persists as long as the web graph that is being crawled (and re-crawled) exists, which may be months or years. The WebDB is used only by the crawler and does not play any role during searching. The WebDB stores two types of entities: pages and links. A page represents a page on the Web, and is indexed by its URL and the MD5 hash of its contents. Other pertinent information is stored, too, including the number of links in the page (also called outlinks); fetch information (such as when the page is due to be refetched); and the page's score, which is a measure of how important the page is (for example, one measure of importance awards high scores to pages that are linked to from many other pages). A link represents a link from one web page (the source) to another (the target). In the WebDB web graph, the nodes are pages and the edges are links.
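If you want to see what the WebDB actually holds once a crawl has run, Nutch ships a read-only inspection tool. Here is a minimal sketch, assuming a WebDB at crawl/db and the 0.7-era readdb tool; the option names are typical of that era but vary between releases, so check the tool's usage message first:

```bash
# Hedged sketch: inspect the WebDB with the readdb tool.
# The options below may differ in your Nutch version; running
# "bin/nutch readdb" with no arguments prints the accepted forms.
bin/nutch readdb crawl/db -stats        # page and link counts, score summary
bin/nutch readdb crawl/db -dumppageurl  # list every page entry by URL
bin/nutch readdb crawl/db -dumplinks    # list the link (source -> target) entries
```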
A segment is a collection of pages fetched and indexed by the crawler in a single run. The fetchlist for a segment is a list of URLs for the crawler to fetch, and is generated from the WebDB. The fetcher output is the data retrieved from the pages in the fetchlist. The fetcher output for the segment is indexed and the index is stored in the segment. Any given segment has a limited lifespan, since it is obsolete as soon as all of its pages have been re-crawled. The default re-fetch interval is 30 days, so it is usually a good idea to delete segments older than this, particularly as they take up so much disk space. Segments are named by the date and time they were created, so it's easy to tell how old they are.
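Because segments are just timestamped directories, the housekeeping suggested above can be a one-line job. The following is a sketch, assuming segments live under crawl/segments (adjust to match the directory you gave the crawl tool) and that everything older than the 30-day re-fetch interval has already been re-crawled; it relies on filesystem modification times rather than parsing the directory names:

```bash
# Hedged housekeeping sketch: remove segments older than the default
# 30-day re-fetch interval. Only do this once you are sure the pages in
# the old segments have been re-crawled into newer segments.
ls crawl/segments   # segment directories are named by creation date/time
find crawl/segments -mindepth 1 -maxdepth 1 -type d -mtime +30 -exec rm -r {} +
```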
The index is the inverted index of all of the pages the system has retrieved, and is created by merging all of the individual segment indexes. Nutch uses Lucene for its indexing, so all of the Lucene tools and APIs are available to interact with the generated index. Since this has the potential to cause confusion, it is worth mentioning that the Lucene index format has a concept of segments, too, and these are different from Nutch segments. A Lucene segment is a portion of a Lucene index, whereas a Nutch segment is a fetched and indexed portion of the WebDB.
The crawl tool
Now that we have some terminology, it is worth trying to understand the crawl tool, since it does a lot behind the scenes. Crawling is a cyclical process: the crawler generates a set of fetchlists from the WebDB, a set of fetchers downloads the content from the Web, the crawler updates the WebDB with new links that were found, and then the crawler generates a new set of fetchlists (for links that haven't been fetched for a given period, including the new links found in the previous cycle) and the cycle repeats. This cycle is often referred to as the generate/fetch/update cycle, and runs periodically as long as you want to keep your search index up to date.
URLs with the same host are always assigned to the same fetchlist. This is done for reasons of politeness, so that a web site is not overloaded with requests from multiple fetchers in rapid succession. Nutch observes the Robots Exclusion Protocol, which allows site owners to control which parts of their site may be crawled.
The crawl tool is actually a front end to other, lower-level tools, so it is possible to get the same results by running the lower-level tools in a particular sequence. Here is a breakdown of what crawl does, with the lower-level tool names in parentheses:
1. Create a new WebDB (admin db -create).
2. Inject root URLs into the WebDB (inject).
3. Generate a fetchlist from the WebDB in a new segment (generate).
4. Fetch content from URLs in the fetchlist (fetch).
5. Update the WebDB with links from fetched pages (updatedb).
6. Repeat steps 3-5 until the required depth is reached.
7. Update segments with scores and links from the WebDB (updatesegs).
8. Index the fetched pages (index).
9. Eliminate duplicate content (and duplicate URLs) from the indexes (dedup).
10. Merge the indexes into a single index for searching (merge).
After creating a new WebDB (step 1), the generate/fetch/update cycle (steps 3-6) is bootstrapped by populating the WebDB with some seed URLs (step 2). When this cycle has finished, the crawler goes on to create an index from all of the segments (steps 7-10). Each segment is indexed independently (step 8), before duplicate pages (that is, pages at different URLs with the same content) are removed (step 9). Finally, the individual indexes are combined into a single index (step 10).
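To make the sequence concrete, here is the same ten steps sketched as a shell script built from the lower-level tools. It is illustrative only: the tool names come from the list above, but the argument forms (the -urlfile flag, the dedup temporary file, the directory layout) are assumptions that differ between Nutch releases, so check each tool's usage message before running anything like this.

```bash
#!/bin/bash
# Hedged sketch of what the crawl tool does, using the lower-level tools.
# Directory layout and argument forms are assumptions; verify them against
# your Nutch version's usage messages.
DB=crawl/db             # the WebDB
SEGMENTS=crawl/segments
INDEX=crawl/index
DEPTH=3                 # number of generate/fetch/update cycles

bin/nutch admin $DB -create              # step 1: create a new WebDB
bin/nutch inject $DB -urlfile urls.txt   # step 2: inject root URLs (flag name is an assumption)

for ((i = 1; i <= DEPTH; i++)); do       # steps 3-6: the generate/fetch/update cycle
  bin/nutch generate $DB $SEGMENTS       # step 3: new segment containing a fetchlist
  SEGMENT=$(ls -d $SEGMENTS/* | tail -1) # the segment just generated
  bin/nutch fetch $SEGMENT               # step 4: fetch the content
  bin/nutch updatedb $DB $SEGMENT        # step 5: add newly discovered links to the WebDB
done

bin/nutch updatesegs $DB $SEGMENTS       # step 7: push scores and links back into the segments
for SEGMENT in $SEGMENTS/*; do
  bin/nutch index $SEGMENT               # step 8: index each segment independently
done
bin/nutch dedup $SEGMENTS dedup.tmp      # step 9: remove duplicate content and URLs
bin/nutch merge $INDEX $SEGMENTS/*       # step 10: merge the segment indexes into one
```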
The dedup tool also removes duplicate URLs from the segment indexes. These duplicates do not come from the WebDB itself, which never holds more than one entry per URL. Instead, they arise when a URL is re-fetched while the segment from the previous fetch still exists (because it hasn't been deleted). This situation can't occur during a single run of the crawl tool, but it can during re-crawls, which is why dedup removes duplicate URLs as well as duplicate content.
While the crawl tool is a great way to get started with crawling websites, you will need to use the lower-level tools to perform re-crawls and other maintenance on the data structures built during the initial crawl. We shall see how to do this in the real-world example later, in part two of this series. Also, crawl is really aimed at intranet-scale crawling. To do a whole web crawl, you should start with the lower-level tools. (See the "Resources" section for more information.)
Configuration and Customization
All of Nutch's configuration files are found in the conf subdirectory of the Nutch distribution. The main configuration file is conf/nutch-default.xml. As the name suggests, it contains the default settings, and should not be modified. To change a setting you create conf/nutch-site.xml, and add your site-specific overrides.
Nutch defines various extension points, which allow developers to customize Nutch's behavior by writing plugins, found in the plugins subdirectory. Nutch's parsing and indexing functionality is implemented almost entirely by plugins--it is not in the core code. For instance, the code for parsing HTML is provided by the HTML document parsing plugin, parse-html. You can control which plugins are available to Nutch with the plugin.includes and plugin.excludes properties in the main configuration file.
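As an illustration of the override mechanism, the following creates a minimal conf/nutch-site.xml. This is a sketch rather than a recipe: http.agent.name is a real Nutch property that identifies your crawler to the sites it fetches, but the root element name and the properties you need to set differ between Nutch versions, so copy the exact structure from your own conf/nutch-default.xml.

```bash
# Hedged sketch: create conf/nutch-site.xml with one site-specific override.
# The <nutch-conf> root element matches 0.7-era releases; later versions use
# a different root element. Copy the structure from conf/nutch-default.xml.
cat > conf/nutch-site.xml <<'EOF'
<?xml version="1.0"?>
<nutch-conf>
  <property>
    <name>http.agent.name</name>
    <value>my-test-crawler</value>  <!-- identifies your crawler in HTTP requests -->
  </property>
</nutch-conf>
EOF
```

The plugin.includes and plugin.excludes properties mentioned above are overridden in exactly the same way.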
With this background, let's run a crawl on a toy site to get a feel for what the Nutch crawler does.