- Transparency. Nutch is open source, so anyone can see how the ranking algorithms work. With commercial search engines, the precise details of the algorithms are secret so you can never know why a particular search result is ranked as it is. Furthermore, some search engines allow rankings to be based on payments, rather than on the relevance of the site's contents. Nutch is a good fit for academic and government organizations, where the perception of fairness of rankings may be more important.
- Understanding. We don't have the source code to Google, so Nutch is probably the best we have. It's interesting to see how a large search engine works. Nutch has been built using ideas from academia and industry: for instance, core parts of Nutch are currently being re-implemented to use the MapReduce distributed processing model, which emerged from Google Labs last year. And Nutch is attractive for researchers who want to try out new search algorithms, since it is so easy to extend.
- Extensibility. Don't like the way other search engines display their results? Write your own search engine--using Nutch! Nutch is very flexible: it can be customized and incorporated into your application. For developers, Nutch is a great platform for adding search to heterogeneous collections of information, and being able to customize the search interface, or extend the out-of-the-box functionality through the plugin mechanism. For example, you can integrate it into your site to add a search capability.
Nutch installations typically operate at one of three scales: local filesystem, intranet, or whole web. All three have different characteristics. For instance, crawling a local filesystem is reliable compared to the other two, since network errors don't occur and caching copies of the page content is unnecessary (and actually a waste of disk space). Whole-web crawling lies at the other extreme. Crawling billions of pages creates a whole host of engineering problems to be solved: which pages do we start with? How do we partition the work between a set of crawlers? How often do we re-crawl? How do we cope with broken links, unresponsive sites, and unintelligible or duplicate content? There is another set of challenges to solve to deliver scalable search--how do we cope with hundreds of concurrent queries on such a large dataset? Building a whole-web search engine is a major investment. In "Building Nutch: Open Source Search," authors Mike Cafarella and Doug Cutting (the prime movers behind Nutch) conclude that:
... a complete system might cost anywhere between $800 per month for two-search-per-second performance over 100 million pages, to $30,000 per month for 50-page-per-second performance over 1 billion pages.
This series of two articles shows you how to use Nutch at the more modest intranet scale (note that you may see this term being used to cover sites that are actually on the public internet--the point is the size of the crawl being undertaken, which ranges from a single site to tens, or possibly hundreds, of sites). This first article concentrates on crawling: the architecture of the Nutch crawler, how to run a crawl, and understanding what it generates. The second looks at searching, and shows you how to run the Nutch search application, ways to customize it, and considerations for running a real-world system.
Nutch Vs. Lucene
Nutch is built on top of Lucene, which is an API for text indexing and searching. A common question is: "Should I use Lucene or Nutch?" The simple answer is that you should use Lucene if you don't need a web crawler. A common scenario is that you have a web front end to a database that you want to make searchable. The best way to do this is to index the data directly from the database using the Lucene API, and then write code to do searches against the index, again using Lucene. Erik Hatcher and Otis Gospodnetić's Lucene in Action gives all of the details. Nutch is a better fit for sites where you don't have direct access to the underlying data, or it comes from disparate sources.
Architecture
Nutch divides naturally into two pieces: the crawler and the searcher. The crawler fetches pages and turns them into an inverted index, which the searcher uses to answer users' search queries. The interface between the two pieces is the index, so apart from an agreement about the fields in the index, the two are highly decoupled. (Actually, it is a little more complicated than this, since the page content is not stored in the index, so the searcher needs access to the segments described below in order to produce page summaries and to provide access to cached pages.)
The main practical spin-off from this design is that the crawler and searcher systems can be scaled independently on separate hardware platforms. For instance, a highly trafficked search page that provides searching for a relatively modest set of sites may only need a correspondingly modest investment in the crawler infrastructure, while requiring more substantial resources for supporting the searcher.
We will look at the Nutch crawler here, and leave discussion of the searcher to part two.
The Crawler
The crawler system is driven by the Nutch crawl tool and a family of related tools that build and maintain several types of data structures: the web database, a set of segments, and the index. We describe all of these in more detail next.
The web database, or WebDB, is a specialized persistent data structure for mirroring the structure and properties of the web graph being crawled. It persists as long as the web graph that is being crawled (and re-crawled) exists, which may be months or years. The WebDB is used only by the crawler and does not play any role during searching. The WebDB stores two types of entities: pages and links. A page represents a page on the Web, and is indexed by its URL and the MD5 hash of its contents. Other pertinent information is stored, too, including the number of links in the page (also called outlinks); fetch information (such as when the page is due to be refetched); and the page's score, which is a measure of how important the page is (for example, one measure of importance awards high scores to pages that are linked to from many other pages). A link represents a link from one web page (the source) to another (the target). In the WebDB web graph, the nodes are pages and the edges are links.
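If you want to see what the WebDB actually holds once a crawl has run, Nutch ships a read-only inspection tool. Here is a minimal sketch, assuming a WebDB at crawl/db and the 0.7-era readdb tool; the option names are typical of that era but vary between releases, so check the tool's usage message first:

```bash
# Hedged sketch: inspect the WebDB with the readdb tool.
# The options below may differ in your Nutch version; running
# "bin/nutch readdb" with no arguments prints the accepted forms.
bin/nutch readdb crawl/db -stats        # page and link counts, score summary
bin/nutch readdb crawl/db -dumppageurl  # list every page entry by URL
bin/nutch readdb crawl/db -dumplinks    # list the link (source -> target) entries
```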
A segment is a collection of pages fetched and indexed by the crawler in a single run. The fetchlist for a segment is a list of URLs for the crawler to fetch, and is generated from the WebDB. The fetcher output is the data retrieved from the pages in the fetchlist. The fetcher output for the segment is indexed and the index is stored in the segment. Any given segment has a limited lifespan, since it is obsolete as soon as all of its pages have been re-crawled. The default re-fetch interval is 30 days, so it is usually a good idea to delete segments older than this, particularly as they take up so much disk space. Segments are named by the date and time they were created, so it's easy to tell how old they are.
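Because segments are just timestamped directories, the housekeeping suggested above can be a one-line job. The following is a sketch, assuming segments live under crawl/segments (adjust to match the directory you gave the crawl tool) and that everything older than the 30-day re-fetch interval has already been re-crawled; it relies on filesystem modification times rather than parsing the directory names:

```bash
# Hedged housekeeping sketch: remove segments older than the default
# 30-day re-fetch interval. Only do this once you are sure the pages in
# the old segments have been re-crawled into newer segments.
ls crawl/segments   # segment directories are named by creation date/time
find crawl/segments -mindepth 1 -maxdepth 1 -type d -mtime +30 -exec rm -r {} +
```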
The index is the inverted index of all of the pages the system has retrieved, and is created by merging all of the individual segment indexes. Nutch uses Lucene for its indexing, so all of the Lucene tools and APIs are available to interact with the generated index. Since this has the potential to cause confusion, it is worth mentioning that the Lucene index format has a concept of segments, too, and these are different from Nutch segments. A Lucene segment is a portion of a Lucene index, whereas a Nutch segment is a fetched and indexed portion of the WebDB.
The crawl tool
Now that we have some terminology, it is worth trying to understand the crawl tool, since it does a lot behind the scenes. Crawling is a cyclical process: the crawler generates a set of fetchlists from the WebDB, a set of fetchers downloads the content from the Web, the crawler updates the WebDB with new links that were found, and then the crawler generates a new set of fetchlists (for links that haven't been fetched for a given period, including the new links found in the previous cycle) and the cycle repeats. This cycle is often referred to as the generate/fetch/update cycle, and runs periodically as long as you want to keep your search index up to date.
URLs with the same host are always assigned to the same fetchlist. This is done for reasons of politeness, so that a web site is not overloaded with requests from multiple fetchers in rapid succession. Nutch observes the Robots Exclusion Protocol, which allows site owners to control which parts of their site may be crawled.
The crawl tool is actually a front end to other, lower-level tools, so it is possible to get the same results by running the lower-level tools in a particular sequence. Here is a breakdown of what crawl does, with the lower-level tool names in parentheses:
1. Create a new WebDB (admin db -create).
2. Inject root URLs into the WebDB (inject).
3. Generate a fetchlist from the WebDB in a new segment (generate).
4. Fetch content from URLs in the fetchlist (fetch).
5. Update the WebDB with links from fetched pages (updatedb).
6. Repeat steps 3-5 until the required depth is reached.
7. Update segments with scores and links from the WebDB (updatesegs).
8. Index the fetched pages (index).
9. Eliminate duplicate content (and duplicate URLs) from the indexes (dedup).
10. Merge the indexes into a single index for searching (merge).
After creating a new WebDB (step 1), the generate/fetch/update cycle (steps 3-6) is bootstrapped by populating the WebDB with some seed URLs (step 2). When this cycle has finished, the crawler goes on to create an index from all of the segments (steps 7-10). Each segment is indexed independently (step 8), before duplicate pages (that is, pages at different URLs with the same content) are removed (step 9). Finally, the individual indexes are combined into a single index (step 10).
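To make the sequence concrete, here is the same ten steps sketched as a shell script built from the lower-level tools. It is illustrative only: the tool names come from the list above, but the argument forms (the -urlfile flag, the dedup temporary file, the directory layout) are assumptions that differ between Nutch releases, so check each tool's usage message before running anything like this.

```bash
#!/bin/bash
# Hedged sketch of what the crawl tool does, using the lower-level tools.
# Directory layout and argument forms are assumptions; verify them against
# your Nutch version's usage messages.
DB=crawl/db             # the WebDB
SEGMENTS=crawl/segments
INDEX=crawl/index
DEPTH=3                 # number of generate/fetch/update cycles

bin/nutch admin $DB -create              # step 1: create a new WebDB
bin/nutch inject $DB -urlfile urls.txt   # step 2: inject root URLs (flag name is an assumption)

for ((i = 1; i <= DEPTH; i++)); do       # steps 3-6: the generate/fetch/update cycle
  bin/nutch generate $DB $SEGMENTS       # step 3: new segment containing a fetchlist
  SEGMENT=$(ls -d $SEGMENTS/* | tail -1) # the segment just generated
  bin/nutch fetch $SEGMENT               # step 4: fetch the content
  bin/nutch updatedb $DB $SEGMENT        # step 5: add newly discovered links to the WebDB
done

bin/nutch updatesegs $DB $SEGMENTS       # step 7: push scores and links back into the segments
for SEGMENT in $SEGMENTS/*; do
  bin/nutch index $SEGMENT               # step 8: index each segment independently
done
bin/nutch dedup $SEGMENTS dedup.tmp      # step 9: remove duplicate content and URLs
bin/nutch merge $INDEX $SEGMENTS/*       # step 10: merge the segment indexes into one
```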
The dedup tool also removes duplicate URLs from the segment indexes. These duplicates do not come from the WebDB itself, which never holds more than one entry per URL. Instead, they arise when a URL is re-fetched while the segment from the previous fetch still exists (because it hasn't been deleted). This situation can't occur during a single run of the crawl tool, but it can during re-crawls, which is why dedup removes duplicate URLs as well as duplicate content.
While the crawl tool is a great way to get started with crawling websites, you will need to use the lower-level tools to perform re-crawls and other maintenance on the data structures built during the initial crawl. We shall see how to do this in the real-world example later, in part two of this series. Also, crawl is really aimed at intranet-scale crawling. To do a whole web crawl, you should start with the lower-level tools. (See the "Resources" section for more information.)
Configuration and Customization
All of Nutch's configuration files are found in the conf subdirectory of the Nutch distribution. The main configuration file is conf/nutch-default.xml. As the name suggests, it contains the default settings, and should not be modified. To change a setting you create conf/nutch-site.xml, and add your site-specific overrides.
Nutch defines various extension points, which allow developers to customize Nutch's behavior by writing plugins, found in the plugins subdirectory. Nutch's parsing and indexing functionality is implemented almost entirely by plugins--it is not in the core code. For instance, the code for parsing HTML is provided by the HTML document parsing plugin, parse-html. You can control which plugins are available to Nutch with the plugin.includes and plugin.excludes properties in the main configuration file.
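As an illustration of the override mechanism, the following creates a minimal conf/nutch-site.xml. This is a sketch rather than a recipe: http.agent.name is a real Nutch property that identifies your crawler to the sites it fetches, but the root element name and the properties you need to set differ between Nutch versions, so copy the exact structure from your own conf/nutch-default.xml.

```bash
# Hedged sketch: create conf/nutch-site.xml with one site-specific override.
# The <nutch-conf> root element matches 0.7-era releases; later versions use
# a different root element. Copy the structure from conf/nutch-default.xml.
cat > conf/nutch-site.xml <<'EOF'
<?xml version="1.0"?>
<nutch-conf>
  <property>
    <name>http.agent.name</name>
    <value>my-test-crawler</value>  <!-- identifies your crawler in HTTP requests -->
  </property>
</nutch-conf>
EOF
```

The plugin.includes and plugin.excludes properties mentioned above are overridden in exactly the same way.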
With this background, let's run a crawl on a toy site to get a feel for what the Nutch crawler does.