Sunday, February 22, 2009

Correlator launched

Yahoo! Research has launched a new search engine called Correlator. It uses advanced techniques from Natural Language Processing and Computational Linguistics to locate entities within text and to group sentences about these entities from different documents. In his talk at the ECIR 2008 Industry Day, Hugo Zaragoza, who championed the Correlator project at Yahoo! Research Barcelona, described some of the system's underlying approaches and technologies. In a blog post introducing the search engine, he states:
The core of Correlator is a search engine capable of returning not only relevant documents, but also relevant sentences and entities.
Currently, Correlator uses Wikipedia as the underlying document collection. However, the Correlator team contends that this can be extended to other collections and types of documents such as blogs.

I quickly tried Correlator this morning. My first impression is that the system does extremely well on many queries - e.g. the results for queries such as "precision and recall" are pretty good and informative. However, there are several areas for improvement when it comes to identifying relationships between entities. For example, for the query "Tony Blair", when searching for names, the system suggests many entities as 'probably related to Tony Blair', but the precise nature of the relationship between the two entities is not stated, e.g. Cherie Blair should be presented definitively as Tony Blair's wife. Indeed, the task of browsing through the various suggested relationships between the named entities is left to the user. However, this might be a deliberate choice by the designers of the system, favouring high coverage over high precision.

Relatedly, it is of note that TREC 2009 will include a new Entity track. One of the currently proposed search tasks is the identification of relationships between entities.

Friday, February 20, 2009

Choosing an issue tracking system

I recently posted about our deployment of an issue tracking system for use by the Terrier platform.

When we recently went through the process of choosing an issue tracking system, I reviewed several alternative products, namely Scarab, Mantis, and JIRA. Firstly, I should say that I was already familiar with JIRA through its extensive usage by the Apache Software Foundation, e.g. by Hadoop and Pig.

When I evaluated all three products, I found Scarab to be the nearest to JIRA. However, it required a lot of work to set up, as the default installation was full of labels such as "attributeGroup1". Scarab also lacked querying features: e.g. it had an advanced SQL-inspired querying interface, but no easy way to see "All Issues" or "Most Recently Updated Issues". For me, this is a killer problem, and one that Bugzilla also suffers from. It should be easy to see what the current problems are, or what issues people are working on. At the end of the day, as a developer, you spend more time querying an issue tracker than you do filing issues.

I'm no expert at HTML & CSS, so for Mantis, I was pleased when it installed with a cleaner default theme. However, with Mantis, the fields for an issue were hard-coded in the submission and display pages. This was ultimately the downfall for Mantis, as I'm keen on minimising unnecessary fields, and Mantis had many fields that were not appropriate for Terrier. Another downside for Mantis was that issues were numbered 0000001. In JIRA and Scarab, issues have a project prefix, and no leading zeros, making them instantly recognisable outside of the issue tracker - e.g. if I write TR-5 (e.g. in an SVN commit message), then people are likely to know what I'm on about. In contrast, 0000005 is not something that they would quickly find using a search engine.
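To illustrate why prefixed keys are easy to pick out of free text, here is a minimal sketch in Python that extracts them from a commit message. The pattern and the function name are my own assumptions for illustration; they are not taken from JIRA or Scarab.

```python
import re

# Hypothetical pattern (an assumption, not JIRA's own definition):
# an upper-case project prefix, a hyphen, and a number with no leading zeros.
ISSUE_KEY = re.compile(r"\b[A-Z][A-Z0-9]*-[1-9][0-9]*\b")


def find_issue_keys(message):
    """Return the prefixed issue keys (e.g. 'TR-5') mentioned in a message."""
    return ISSUE_KEY.findall(message)


print(find_issue_keys("Fix tokeniser bug, see TR-5 and TR-12"))
```

A bare number such as 0000005 carries no project prefix, so a pattern like this has nothing distinctive to match on.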

Finally, I installed a trial of Atlassian's JIRA. We found it easy to browse and query existing issues in JIRA. The dashboard functionality is also useful, and customisable. We also liked the user/SE-friendly URLs used by JIRA for issues: e.g. for issue TR-1. I'm pleased with JIRA, which exhibits an overall very polished UI. Atlassian have also very kindly provided Terrier with an open source license for JIRA. Thanks Atlassian!

Wednesday, February 18, 2009

WSDM 09 : Your mileage may vary

So, for the second year WSDM arrives, attendance has remained constant (despite the economic downturn) and we're all packing t-shirts rather than bags. However, unwrapping the packaging, what do we find?

Initially, the conference seemed promising, beginning with an excellent speech by Jeff Dean: a 101 on efficiency at Google since 1999, well rounded with explanations and statistics in equal measure. Unfortunately, this starring performance seemed to overshadow the rest of the conference - a benchmark never surpassed.

If I were to use one word to describe WSDM'09, then it would be "inconsistent". There were some nice speeches: Eytan Adar's talk on detecting how the web changes over time springs to mind (also best student paper), and Songhua Xu gets bonus points for turning up with an interface paper to a predominantly text-based conference. However, many were poorly presented, insubstantial or both.

The dominant topic of the conference was, unsurprisingly, Wikipedia, with well over 40% of papers giving it a mention. Ignoring the proliferation of Wikipedia papers over the last year, the high point here was Eytan Adar's paper on Information Arbitrage Across Multi-lingual Wikipedia, for coming up with something which might actually be useful in practice.

The videos from the conference should be up soon on, and I would recommend Jeff Dean's opening speech - for those of you who can survive watching in the tiny eye-strain-o-vision which comes with Flash. As for the rest, remember - your mileage may vary.

Tuesday, February 17, 2009

WSDM 2009 highlights

Richard and I went to Barcelona last week to attend the WSCD 2009 workshop and the WSDM 2009 conference. Craig was also there on Monday to present an interesting poster on the usefulness of click-through data for training.

Besides being held in an exciting city (!), WSDM 2009 kept up with its previous edition in bringing together industry and academia to a common, quality forum for Web IR and data mining, with papers covering a wide range of trendy topics -- fairly well summarised by the tag cloud printed on the t-shirts given to the participants! -- from query intent detection, through search results diversification, to tagging-based clustering and classification, and social network-driven marketing analysis, to name a few.

The best paper award went to Fernando Diaz for his work on the selective integration of news content into Web results based on the classification of the newsworthiness of each query. Eytan Adar et al. received the best student paper award for their study of the dynamics of the content and structure of Web documents of varying popularity over a fine-grained timescale. In the new late-breaking results session, the award went to Irem Arikan et al.'s paper on applying a language modelling approach to improve retrieval effectiveness for queries with temporal expressions. The invited talks by Jeff Dean and Gerhard Weikum were also insightful -- we couldn't attend Ravi Kumar's though. All talks should be available soon from

Overall, WSDM is rapidly moving towards establishing itself among the major IR conferences. In 2010, it will probably be held in Los Angeles, CA, USA.

Monday, February 16, 2009

Twitter and CEOs

Thanks to Theo Huibers for pointing out an article in Forbes about Why Europe's CEOs should Twitter.

The article reports that, unlike their counterparts in the USA, the CEOs of European companies have been slow to embrace Twitter. In general, the article argues that European chief executives are not very aware of the benefits of social networking tools to their businesses, and are missing out on opportunities to engage with their customers.

If this is true, then this is rather worrying. Indeed, I can easily see many scenarios where social networking tools such as Twitter could be helpful for businesses. The Forbes article mentions several of these, for example a case where the public relations office of General Motors used Twitter to clamp down on rumours affecting the company. In his blog, Daniel Tunkelang reported a first-hand experience, when one of his technical questions posted on Twitter received a response from the president and COO of, albeit with a degree of attention that went beyond what Daniel bargained for.

It is interesting to note that the Forbes article suggests that Twitter's interface is still too complex for widespread adoption by end-users and businesses. While I have only been an occasional user of Twitter, I have never had the feeling that the tool was difficult to use. However, I'm happy to stand corrected by HCI experts!

Wednesday, February 11, 2009

Grid@CLEF track : a framework for IR experimentation

Don't be put off by the title: this isn't a post about Grid Computing. Instead, I'm going to talk about the Grid@CLEF task, which defines a framework and TREC-style track for experimentation with various components of IR systems. Disclaimer: I'm pleased to be on the advisory committee of the Grid@CLEF task.

Firstly, I'll give a bit of background. The Cross-Language Evaluation Forum (CLEF) is a spin-off from TREC which concentrates on the evaluation of mono-lingual (non-English) and cross-lingual retrieval. CLEF has been running since 2000, and attracts a wide spread of participating research groups from across the globe, reaching 130 for CLEF 2008.

The tracks have now been defined for CLEF 2009, and they include the Grid track. Nicola Ferro (Univ. of Padova) and Donna Harman (NIST) are the big-wigs for this task, with suggestions from the advisory committee. So what does Grid mean in this context? Well, the idea (in my own words) is that the components of an IR system that have an effect on its results can be roughly categorised as follows: tokeniser, stopword list, word-decompounder, stemmer, and ranking function. In the Grid track, the concept is that these components can be interchanged, and a fuller understanding of their impact derived. The Grid framework facilitates such interchanges, by defining a way to allow various mixes of components to be attempted, thus creating a "grid" of experimental results.

However, the problem with such an experiment is that often each of these components is tied to an IR system, and that the IR system itself can have an impact on the results. Instead, the idea behind the Grid track is that the output from each component (tokeniser, stopword list etc) of a given IR system is saved in an XML format, and shared among participants. In this way, every combination of components can be investigated.
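To make the "grid" idea concrete, here is a minimal sketch in Python that enumerates every combination of one choice per component. The component choices and the function name are purely illustrative assumptions of mine; they are not part of the Grid@CLEF specification, and the sketch says nothing about the XML interchange format.

```python
from itertools import product

# Hypothetical choices for each component, for illustration only.
# The real track would use the components contributed by participants.
COMPONENTS = {
    "tokeniser": ["whitespace", "unicode"],
    "stopword_list": ["none", "standard"],
    "decompounder": ["none", "dictionary"],
    "stemmer": ["none", "porter"],
    "ranking_function": ["tf_idf", "bm25"],
}


def build_grid(components):
    """Return one configuration per combination of component choices."""
    names = list(components)
    choices = [components[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*choices)]


grid = build_grid(COMPONENTS)
print(len(grid))  # 2 choices for each of 5 components gives 32 cells
print(grid[0])
```

Each dictionary in the list corresponds to one cell of the grid, i.e. one experimental run to be evaluated.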

The Grid@CLEF site describes the intuitions behind the task in more detail, including an example of how results will be presented.

Here in Glasgow, we like the concept behind the Grid track. Indeed, it has some similarities to the way we ran the opinion finding task in the TREC 2008 Blog track. In the opinion finding task (where the aim is to retrieve relevant and opinionated blog posts about the target topic), the retrieval performance of opinion identification approaches appears to be linked to the effectiveness of the underlying "topical relevance" retrieval approach. To investigate this in TREC 2008, we provided 5 standard topical relevance baselines, which participants were able to use as input to their opinion finding technique(s). You can read more in the Overview of the TREC 2008 Blog track (Iadh Ounis, Craig Macdonald and Ian Soboroff), which should be released in a few weeks' time.

I have committed to implementing Terrier support for the Grid@CLEF track. The XML specification is being agreed by the Grid@CLEF organisers and advisers. However, if you are interested in using Terrier on this task, you can follow the progress on the TR-9 issue concerning Terrier's Grid@CLEF support. The exact specification for the Grid@CLEF XML interchange format is still in flux, but once it's settled down, Terrier support should be forthcoming.

Building Terrier by Open Collaboration

An important benefit of having an open source IR platform, like Terrier, is that users of the platform can contribute code to the platform, and overall, everyone gains. IR platforms which are not open source may be popular, but can stagnate if they do not evolve to meet modern needs. Open source is a good way of building the critical mass of people needed to evolve a project.

To facilitate the task of our users who contribute to Terrier, we are in the process of making changes that will also make the development process easier:
An issue tracker allows issues (bugs or feature requests) to be named, discussed, and patches proposed. Other contributors may review and discuss these patches before they are committed. All development work on the Terrier open source platform will now be done via the issue tracker. In deciding to deploy JIRA, we did take some time to review several issue trackers. I'll describe these and how we came to our decision in a future post.
The goal of opening our source code repository is that patches submitted by contributors can be made against the latest (trunk) Terrier source, thus ensuring that no stale patches are received. As a committer, this will make my job easier.

I recently announced these changes in Rome at the New challenges in Information Retrieval and Text Mining in an open source initiative workshop. You can see my slides from the workshop below: