Wednesday, December 24, 2008

Terrier 2.2 released, with support for Hadoop Map Reduce indexing

I am pleased to announce that Terrier 2.2 was released just before Christmas. While I have chosen only to increase the minor version number for this release, it is a substantial update, consisting of new support for Hadoop, including a Hadoop Map Reduce indexing system, along with various minor improvements and bug fixes. (I reserve major version number bumps for index format changes.)

Our Map Reduce distributed indexing strategy builds upon the single-pass indexing strategy first released in Terrier 2.0. When deployed on a Hadoop cluster, Terrier can index large collections of data in a distributed fashion, splitting the indexing process across multiple Map and Reduce tasks, which can run on different nodes in the cluster.

In particular, the input data files for the collection are split across many Map tasks. Each Map task indexes its allocated data files using a normal Collection implementation. Compressed posting lists are built in memory; each time memory is exhausted, these miniature posting lists are emitted from the Map task as intermediate output.
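The map-side behaviour described above can be illustrated with a short sketch. This is a simplified Python illustration, not Terrier's actual Java code: the function name `map_index`, the posting-list representation, and the crude memory accounting are all hypothetical stand-ins for what the real indexer does.

```python
# Hypothetical sketch (not Terrier's implementation) of the map-side
# single-pass strategy: build posting lists in memory, and flush a
# "run" of mini posting lists whenever the memory budget is exhausted.

def map_index(docs, memory_budget):
    """docs: iterable of (doc_id, [terms]) pairs.
    Returns a list of flushed runs; each run maps
    term -> list of (doc_id, term_frequency) postings."""
    runs = []
    postings = {}  # in-memory posting lists for the current run
    used = 0       # crude stand-in for real memory accounting
    for doc_id, terms in docs:
        counts = {}
        for t in terms:
            counts[t] = counts.get(t, 0) + 1
        for t, tf in counts.items():
            postings.setdefault(t, []).append((doc_id, tf))
            used += 1
        if used >= memory_budget:  # memory exhausted: emit this run
            runs.append(postings)
            postings, used = {}, 0
    if postings:                   # flush the final partial run
        runs.append(postings)
    return runs
```

Each flushed run is a self-contained mini-index over a contiguous slice of the Map task's documents, which is what allows the Reduce side to merge runs in document order.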

The Reduce task is responsible for aggregating the posting lists for the various terms. Firstly, the Reduce input keys are sorted by term, and the values are sorted by source Map task, to ensure that the posting lists for a given term are processed in the correct order. For each term, the temporary posting lists (the reduce input values) are merged into the final compressed inverted index.
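The merge step can be sketched as follows. Again, this is a hypothetical Python illustration of the idea, not Terrier's code: the intermediate tuples and the `reduce_merge` name are assumptions, and real compression of the final inverted index is omitted.

```python
# Hypothetical sketch of the reduce-side merge: intermediate postings
# arrive keyed by term and tagged with their source Map task, so sorting
# by (term, map_task_id) restores the correct document order before the
# per-term posting lists are concatenated into the final inverted index.

def reduce_merge(intermediate):
    """intermediate: list of (term, map_task_id, postings) tuples.
    Returns dict mapping term -> merged posting list."""
    inverted = {}
    # secondary sort: group by term, order each group's runs by Map task
    for term, _, postings in sorted(intermediate, key=lambda x: (x[0], x[1])):
        inverted.setdefault(term, []).extend(postings)
    return inverted
```

In a real Hadoop job this (term, source task) ordering is achieved through the framework's sorting of reduce input keys rather than an explicit in-memory sort.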

The indices created using the Map Reduce indexer are standard Terrier indices. Moreover, by controlling the number of Reduce tasks, the final index can be partitioned into separate indices, in the local inverted file layout (document partitioning). With a different partitioning scheme, global inverted file layout (term partitioning) would also be possible.
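To illustrate what document partitioning means here, the following hypothetical Python sketch (not Terrier's code) splits an inverted index into shards by document: each shard holds the complete postings for its own documents, which is exactly what running several Reduce tasks produces.

```python
# Hypothetical sketch of the local inverted file layout (document
# partitioning): with R Reduce tasks, routing each posting by its
# doc_id yields R disjoint local indices, each covering its own
# slice of the collection.

def build_shards(postings_by_term, num_reducers):
    """Split a full inverted index into num_reducers document-partitioned
    shards; every shard holds complete postings for its documents."""
    shards = [{} for _ in range(num_reducers)]
    for term, postings in postings_by_term.items():
        for doc_id, tf in postings:
            shard = shards[doc_id % num_reducers]
            shard.setdefault(term, []).append((doc_id, tf))
    return shards
```

A term-partitioned (global) layout would instead route by term, so that each shard holds the full posting list for a subset of the vocabulary.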

You can see the detailed list of changes for Terrier 2.2 in the documentation.

Wednesday, December 10, 2008

Mining query logs

It is often reported in the literature how search engines can use their query logs to improve document ranking. However, query logs can also be used for various mining activities. For example, an article in The New York Times described how a power cut in the New York area was reflected in Google's query logs within 2 seconds of its occurrence, while it took about 15 minutes for newswire services to report the same event.

Relatedly, Abdur Chowdhury, in his position talk at the SSM 2008 Workshop, mentioned that news of a major earthquake in China was reported on Twitter well before the newswire services. A BBC blog post commented on the same issue.

Finally, the BBC recently reported that Google has developed a system to detect flu outbreaks in the USA by analysing the query logs and identifying the location of people issuing flu-related queries.

Unfortunately, query logs are rarely available to researchers in academia, especially after the AOL data debacle. This limits scientific work in the field, as most current research results using query logs are not reproducible due to the lack of publicly shared data. As a consequence, I very much welcome the forthcoming Workshop on Web Search Click Data (WSCD 2009), which addresses the public release of query logs as one of its objectives.