Thursday, June 4, 2009

CIKM 2011 in Glasgow!

We are delighted that our bid to host the ACM Conference on Information and Knowledge Management (CIKM 2011) in Glasgow has been successful.

After the highly successful ESSIR 2007 and ECIR 2008 events, we are excited at the prospect of hosting the prestigious ACM CIKM Conference in Glasgow in 2011. We look forward to having our colleagues gather in Glasgow, and to surpassing their expectations.

Further information about the conference (dates, venues, etc.) will be available in due course.

CIKM 2009 will be held on November 2-6, 2009, in Hong Kong. Hope to see you there!

Wednesday, April 29, 2009

TREC Blog track 2009

We have just released a draft of the guidelines for the TREC 2009 Blog track.

Compared to previous years, the Blog track 2009 aims to investigate more refined and complex search scenarios. In particular, we propose to run two tasks in TREC 2009:
  • Faceted blog distillation: a more refined version of the blog distillation task that addresses the quality aspect of the retrieved blogs and mimics an exploratory search task. The task can be summarised as "Find me a good blog with a principal, recurring interest in X". We propose several facets for the TREC 2009 blog distillation task, which may be of varying difficulty to identify for the participant systems.
  • Top stories identification: a new pilot task that addresses the news dimension in the blogosphere. Systems are asked to identify the top news stories of a given day, and to provide a list of relevant blog posts discussing each news story. The ranked list of blog posts should be diverse, covering different aspects, perspectives or opinions of the news story.
The new Blogs08 collection, an up-to-date and large sample of the blogosphere from January 2008 to February 2009, will be used for both tasks.

We welcome feedback. Please feel free to post comments about the proposed tasks for 2009.

Thursday, April 9, 2009

Blogs08 Collection Released

We are pleased to announce that the Blogs08 collection is now ready for distribution. As announced before, Blogs08 is one order of magnitude bigger than Blogs06, and samples the blogosphere from January 2008 to February 2009. The uncompressed permalinks take up approximately 1.3TB; including the feeds, the collection amounts to over 2TB of data. As usual, the data is shipped compressed on a SATA hard drive.

The distribution mechanism will be the same as for Blogs06. There is specific information about the size of the collection here, while the instructions for obtaining the collection are here.

If you intend to participate in the TREC 2009 Blog track, please start working on the paperwork right away, so that you can get the collection as soon as possible. Due to the larger size of the collection, we will operate a queuing system for shipping the data. Moreover, if you haven't done so already, please respond to the TREC 2009 Call for Participation.

Blog track co-ordinators are finalising the guidelines for this year's tasks and will continue to update the TREC Blog wiki, the TREC blog track mailing list and this blog.

Tuesday, March 3, 2009

Craig's Thesis Available

Following my successful defence, I'm pleased to announce that my thesis, titled The Voting Model for People Search, is now available online.

My thesis proposes the Voting Model for various people search problems, such as expert search in enterprise settings (find me someone who knows about...), or blog(ger) search (find me a blog about the general topic...). I also examine the reviewer assignment problem (suggest for me reviewers for this paper...). In general, the Voting Model is concerned with the ranking of aggregates of documents.

The experiments in the thesis are mainly carried out using the TREC Enterprise track and Blog track test collections.
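
To give a flavour of what "ranking aggregates of documents" can mean, here is a minimal sketch in Python. It assumes that each retrieved document is associated with one or more candidates (experts, bloggers, reviewers) and that every retrieved document counts as a vote for its candidates. The function name, the toy data and the two vote weightings (a plain count and a reciprocal-rank weight) are illustrative assumptions, and should not be read as the exact formulations used in the thesis.

```python
from collections import defaultdict


def rank_candidates(ranked_docs, doc_to_candidates, weight_by_rank=False):
    """Rank candidates by aggregating votes from a ranked list of documents.

    ranked_docs: document ids, best first (output of any retrieval system).
    doc_to_candidates: hypothetical mapping from document id to the
        candidates (e.g. people or blogs) associated with that document.
    weight_by_rank: if True, a document at rank r contributes 1/r;
        otherwise every retrieved document contributes one vote.
    """
    scores = defaultdict(float)
    for rank, doc_id in enumerate(ranked_docs, start=1):
        vote = 1.0 / rank if weight_by_rank else 1.0
        for candidate in doc_to_candidates.get(doc_id, ()):
            scores[candidate] += vote
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy example (made-up data): four retrieved documents, three candidates.
ranked_docs = ["d3", "d1", "d7", "d2"]
doc_to_candidates = {
    "d1": ["alice"],
    "d2": ["bob"],
    "d3": ["bob", "carol"],
    "d7": ["alice"],
}

by_count = rank_candidates(ranked_docs, doc_to_candidates)
by_rank = rank_candidates(ranked_docs, doc_to_candidates, weight_by_rank=True)
print([name for name, _ in by_count])
print([name for name, _ in by_rank])
```

The two weightings can disagree: counting votes favours candidates with many retrieved documents, while weighting by rank favours candidates whose documents appear near the top of the ranking.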

Sunday, February 22, 2009

Correlator launched

Yahoo! Research has launched a new search engine called Correlator. It uses advanced techniques from Natural Language Processing and Computational Linguistics to locate entities within text and to group sentences about these entities from different documents. In his talk at the ECIR 2008 Industry Day, Hugo Zaragoza, who championed the Correlator project at Yahoo! Research Barcelona, described some of the system's underlying approaches and technologies. In a blog post introducing the search engine, he states:
The core of Correlator is a search engine capable of returning not only relevant documents, but also relevant sentences and entities.
Currently, Correlator uses Wikipedia as the underlying document collection. However, the Correlator team contends that this can be extended to other collections and types of documents such as blogs.

I quickly tried Correlator this morning. My first impression of the system is that it does extremely well on many queries: for example, the results for "precision and recall" are good and informative. However, there is room for improvement when it comes to identifying relationships between entities. For example, for the query "Tony Blair", when searching for names, the system suggests many entities as 'probably related to Tony Blair', but it does not state the precise nature of each relationship, e.g. Cherie Blair should be presented as Tony Blair's wife. Indeed, the task of browsing through the various suggested relationships between the named entities is left to the user. This might, however, be a deliberate choice by the designers of the system, favouring high coverage over high precision.

Relatedly, it is of note that TREC 2009 will include a new Entity track. One of the currently proposed search tasks is the identification of relationships between entities.