Wednesday, December 24, 2008

Terrier 2.2 released, with support for Hadoop Map Reduce indexing

I am pleased to announce that Terrier 2.2 was released, just before Christmas. While I have chosen only to increase the minor version number for this release, it is a substantial update, consisting of new support for Hadoop, a Hadoop Map Reduce indexing system, and various minor improvements and bug fixes. (I reserve major version number bumps for index format changes.)

Our Map Reduce distributed indexing strategy builds upon the single-pass indexing strategy first released in Terrier 2.0. In deployment with a Hadoop cluster, Terrier can index large collections of data in a distributed fashion, splitting the indexing process across various Map and Reduce tasks, which can be run on various nodes in the cluster.

In particular, the input data files for the collection are split across many Map tasks. Each Map task indexes its allocated data files using a normal Collection implementation. Posting lists are built in memory, in compressed form. Each time memory is exhausted, these miniature posting lists are emitted from the Map task.

The Reduce task is responsible for aggregating the posting lists for the various terms. Firstly, the Reduce input keys are sorted by term, and the values are sorted by source Map task, to ensure that the posting lists for a given term are processed in the correct order. For each term, the temporary posting lists (the reduce input values) are merged into the final compressed inverted index.
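
The Map and Reduce steps described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Terrier's code: the names `map_task`, `reduce_task` and `max_postings_in_memory` are hypothetical, posting lists are kept uncompressed, "memory exhausted" is simulated by a simple posting count, and document ids are assumed to be assigned so that they increase with the Map task id.

```python
from collections import defaultdict

def map_task(map_id, docs, max_postings_in_memory):
    """Index (doc_id, text) pairs; emit partial posting lists at each flush."""
    emitted = []                  # (term, (map_id, flush_id, postings))
    postings = defaultdict(dict)  # term -> {doc_id: term frequency}
    in_memory = 0
    flush_id = 0

    def flush():
        nonlocal postings, in_memory, flush_id
        for term, plist in postings.items():
            emitted.append((term, (map_id, flush_id, sorted(plist.items()))))
        postings = defaultdict(dict)
        in_memory = 0
        flush_id += 1

    for doc_id, text in docs:
        for term in text.lower().split():
            if doc_id not in postings[term]:
                in_memory += 1    # one new posting held in memory
            postings[term][doc_id] = postings[term].get(doc_id, 0) + 1
        # Simulated memory check, done only between documents.
        if in_memory >= max_postings_in_memory:
            flush()
    if in_memory:
        flush()
    return emitted

def reduce_task(map_outputs):
    """Merge the partial posting lists into one posting list per term."""
    # Sort keys by term, and values by source Map task (then flush order),
    # so that concatenation keeps each list ordered by document id.
    ordered = sorted(map_outputs, key=lambda kv: (kv[0], kv[1][0], kv[1][1]))
    index = {}
    for term, (_, _, plist) in ordered:
        index.setdefault(term, []).extend(plist)
    return index

docs_a = [(0, "terrier indexes blogs"), (1, "hadoop indexes blogs")]
docs_b = [(2, "terrier on hadoop")]
# Map outputs may arrive in any order; the Reduce-side sort restores it.
index = reduce_task(map_task(1, docs_b, 2) + map_task(0, docs_a, 2))
print(index["terrier"])  # [(0, 1), (2, 1)]
```

Because the values for each term are processed in Map task order, the merged posting list needs no further sorting, which is the reason the ordering of the Reduce input matters.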

The indices created using the Map Reduce indexer are standard Terrier indices. Moreover, by controlling the number of Reduce tasks, the final index can be partitioned into separate indices, in the local inverted file layout (document partitioning). With a different partitioning scheme, global inverted file layout (term partitioning) would also be possible.
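
To make the two layouts concrete, here is a minimal sketch of how Map output could be routed to Reduce tasks under each scheme. In Hadoop this routing decision is made by a partitioner; the two functions below are hypothetical stand-ins written for illustration, and they are not the partitioners shipped with Terrier.

```python
import zlib

def document_partition(map_id, num_maps, num_reducers):
    """Local layout: a contiguous range of Map tasks goes to one reducer,
    so each reducer builds a complete index over its own documents."""
    return map_id * num_reducers // num_maps

def term_partition(term, num_reducers):
    """Global layout: every posting list of a term goes to one reducer,
    whichever Map task produced it."""
    return zlib.crc32(term.encode("utf-8")) % num_reducers

# With 4 Map tasks and 2 reducers, maps 0-1 feed reducer 0 and maps 2-3
# feed reducer 1.
print([document_partition(m, 4, 2) for m in range(4)])  # [0, 0, 1, 1]
```

With a single Reduce task, both schemes collapse to the same thing: one reducer receives everything and writes one index.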

You can see the detailed list of changes for Terrier 2.2 in the documentation.

Wednesday, December 10, 2008

Mining query logs

It is often reported in the literature how search engines can use their query logs to improve document ranking. However, the query logs could also be used for various mining activities. For example, an article in The New York Times described how a power cut in the New York area was reflected in Google's query logs within 2 seconds of its occurrence, while it took about 15 minutes for newswire services to report the same event.

Relatedly, Abdur Chowdhury in his position talk at the SSM 2008 Workshop mentioned that news about a major earthquake in China was reported on Twitter well before the newswire services. A BBC blog post commented on the same issue.

Finally, the BBC recently reported that Google has developed a system to detect flu outbreaks in the USA by analysing the query logs and identifying the location of people issuing flu-related queries.

Unfortunately, query logs are scarcely available to researchers in academia, especially after the AOL data debacle. This limits scientific work in the field, as most current research results using query logs are not reproducible due to lack of publicly shared data. As a consequence, I very much welcome the forthcoming Workshop on Web Search Click Data (WSCD 2009), where the issue of publicly releasing query logs is being addressed as one of the objectives of the workshop.

Sunday, November 30, 2008

The TREC 2008 Blog track workshop

We just came back from Gaithersburg a few days ago. It was a nice (and cold!) week at the TREC 2008 conference. Besides presenting the main results of our participation in the Blog, Enterprise, and Relevance Feedback tracks, we had fruitful discussions at the Blog track workshop regarding the directions of the track for 2009.

There was a consensus among the attendees that opinion retrieval and polarity detection are still open, relevant problems. Although a few groups managed to deploy interesting techniques that achieved consistent opinion retrieval performance across several strongly performing baselines in the track this year, polarity detection approaches looked rather naive. It was suggested that polarity detection be investigated at a finer granularity (e.g., at the sentence rather than the document level). This, however, could result in crossing the boundaries with respect to the TAC conference.

Nonetheless, believing that, after three years, the Blog track has contributed a comprehensive experimental setting for those who wish to continue investigating these search scenarios, the organisers decided to discontinue the opinion finding and polarity tasks, at least in their current format. Instead, they propose to investigate the opinionated nature of blogs as one of many interesting facets of a broader search task. This task extends the current blog distillation task by moving beyond topic relevance and introducing different requirements in order to qualify "good" blogs, i.e., blogs that have a recurrent interest in a given topic and that also fulfil a set of predefined "facets". This way, for instance, one could search for humorous blogs about the government, or opinionated blogs about whisky.

Besides this faceted blog distillation task, a second task was considered relevant and worth investigating by the workshop attendees, namely, tracking stories on the blogosphere. The aim is to investigate how stories emerge and evolve along the time frame of the blog corpus. It was also noted that this task could be linked to a news search task so as to draw a connection between stories published on the blogosphere and on the mainstream media.

As pointed out, however, the 11-week time frame of the Blogs06 collection does not adequately support the story tracking task. Furthermore, the availability of a more representative sample of the blogosphere is an important step towards addressing blog search as a social media problem. For this reason, a new corpus will be used in 2009, with a much larger size and time frame.

For those who did not attend the Blog track workshop at TREC, please feel free to post your comments about the proposed tasks for 2009.

Hope you all join us in the TREC 2009 Blog track!

Saturday, November 15, 2008

TREC 2008

Shortly, we will be travelling to attend the TREC 2008 conference in Gaithersburg, Maryland (18-21 November 2008). We have been very busy analysing the sheer volume of data that was collected in the Blog track this year. Indeed, this year, we ran a very large-scale experiment with the aim of gaining a better understanding of the most effective and stable opinion-finding techniques. Moreover, we also tightened up the blog distillation task (feed search task), so that it truly runs as a distillation task. Following the traditional TREC conference cycle, the Blog track 2008 results will be first presented to the TREC 2008 participating groups next year. They will then be made available to all interested parties around February 2009 when the TREC 2008 final Proceedings go online.

Plans for the TREC 2009 Blog track will be discussed and refined during the TREC Blog track workshop in the afternoon of Thursday 20th November.

In addition to our involvement in the organisation of the Blog track, we will be giving a presentation on the work we did this year in the newly introduced Relevance Feedback track. We have also prepared two posters summarising our results in the Enterprise and Blog tracks.

It looks like we are set for a very exciting and busy week. We hope to see many of you at TREC.

Monday, November 10, 2008

SEMAST 2009

We are continuing to organise events in Glasgow. After the ESSIR2007 summer school and the ECIR2008 conference, we will be organising the second Practical Semantic Astronomy Workshop (SEMAST 2009) from 2nd to 5th March 2009.

Practical Semantic Astronomy 2009 is the second in a series of workshops, the first of which was held at Caltech in February 2008. The workshop brings together experts from a broad range of disciplines using semantic technologies, alongside practitioners experimenting with these techniques, to address current problems in astroinformatics.

Our involvement in the organisation of this workshop is under the auspices of the Explicator project, where we have been working with astronomers and physicists on developing techniques to provide intelligent access to multiple sources. The Explicator project supports the efforts of the Virtual Observatory community.

The Virtual Observatory is a loose planet-wide collaboration of astronomy computing projects, aiming to make available the high-volume and rich data of astronomy. Although astronomical data is generally well-described, it is very dispersed, so that there is a substantial data-discovery problem, making it fertile ground for the sorts of semantic approaches applied with such success in other disciplines.

The Explicator project aims to bridge the gap between information retrieval and semantic web technologies in a domain-specific application. The SEMAST 2009 workshop is a continuation of this effort. We hope to see many of you in Glasgow.

Tuesday, October 21, 2008

Blogging is also about branding

Recently, Daniel Tunkelang wrote a blog post about why he was blogging. In a nutshell, he considers blogging to be fun and highlights how it can increase the "reputation capital" of the blogger. Daniel Lemire made a follow-up, stressing the networking benefit of blogging. I very much concur with both views. Moreover, while blogging, neither blogger shies away from expressing his thoughts and opinions on various topics ranging from enterprise and blog search to peer reviewing or the benefit of pure theoretical research, through opinions on a search engine such as Duck Duck Go!

Such perspectives and opinions are not only informative and valuable to readers like myself, but they are also extremely important for various organisations. Indeed, according to an article on the BCS news website, blogging is very important for brands. The article quotes Rachel Hawkes, co-founder and editor of the Social Media Portal (SMP):

Blogs provide an opportunity for a two-way interaction to take place between business and consumer. This allows customers to provide 'incredibly valuable' feedback on how the brand is doing in the real world, which can help guide improvements and sales strategies.

The above scenario is one of the motivations for the opinion-finding search task that we have been investigating in the TREC Blog track for three years. The task addresses a search scenario where a user aims to uncover what the bloggers/consumers are saying or thinking about X. If the "user" is a business, and X is one of its products, then “taking the pulse of the blogosphere” is very important for this business's branding. In fact, the opinion-finding task can naturally be associated with settings such as tracking consumer-generated content, brand monitoring, and, more generally, media analysis. Findings and insights gained from three years of the opinion-finding search task at the TREC Blog track will be discussed at the forthcoming TREC Conference (18-21 November 2008), held at NIST, USA.

Monday, October 20, 2008

CIKM 2008

We will shortly be travelling to attend the CIKM 2008 conference in Napa Valley. The organisers are announcing that it will be the biggest ever CIKM conference, and hope that it will be the most memorable one.

Following the ECIR 2008 conference example, I'm pleased to note that the organisers are making CIKM 2008 a green conference, through optimal usage of logistics and resources.

The conference has a very exciting scientific program, and an impressive social program, including a Halloween party.

We will be presenting two full papers in the Blog session on Wednesday 29th October 2008, 10:15-11:45am. Both papers tackle search tasks investigated within the TREC Blog track:

  • Key Blog Distillation: Ranking Aggregates. Craig Macdonald, Iadh Ounis (University of Glasgow, UK) - The paper addresses the blog distillation task, a task characterised as “Find me a blog with a principal, recurring interest in X.”
  • An Effective Statistical Approach to Blog Post Opinion Retrieval. Ben He, Craig Macdonald, Jiyin He, Iadh Ounis (University of Glasgow, UK) - The paper tackles the opinion-finding task in the blogosphere, a task characterised by “What do people think about X?”
The third paper of the Blog session is from UMass, a regular participant in the TREC Blog track. It also investigates the blog distillation search task:
  • Blog Site Search Using Resource Selection. Jangwon Seo, Bruce Croft (University of Massachusetts Amherst, USA)
We hope to see you at CIKM!

Sunday, October 19, 2008

TREC Blog track will run in 2009

Following our previous post, I'm pleased to report that we have just heard that the TREC program committee has accepted our proposal for the blog track to continue in 2009.

The intention is to use a larger Blog collection, and to have at least one search task that goes beyond topical relevance by taking into account a facet representing an attribute of required "quality".

There will be a workshop to discuss the proposed blog search tasks at the TREC 2008 conference on the afternoon of Thursday 20th November 2008.

If you cannot attend TREC, and wish to make any comments or suggestions, please feel free to post your thoughts in this post, or to email them privately, if you so wish.

Friday, September 12, 2008

Conference Deadline Traffic Jam

I noted today that Matthew Hurst has posted the ICWSM 2009 Call for Papers. Unfortunately the submission deadline is on the 21st January. This is a full 6 weeks later than for ICWSM 2008. Moreover, this falls four days before the SIGIR full paper deadline.

As IR researchers, we have to target certain conferences. While I'd like to have multiple papers ready in advance for several conferences with similar deadlines, various pressures and reasons don't make that possible (e.g. I'd like a holiday at Christmas!).

The conference deadlines in January and February now look like:
  • 11th January: WWW 2009 Posters due
  • 19th January: SIGIR 2009 Abstracts due
  • 21st January: ICWSM 2009 Papers/Posters/Demos due
  • 25th January: SIGIR 2009 Papers due
  • 9th February: NAACL-HLT 2009 Short Papers due
  • 22nd February: ACL-IJCNLP Papers due
  • 23rd February: SIGIR 2009 Posters due
Happy writing!

Wednesday, September 10, 2008

About Blog Search Tasks

We have been very busy recently with the TREC 2008 Blog track. Now that all runs have been submitted and the relevance assessments are on-going, it is the time of the year when we start planning for the future of the track at TREC 2009! Indeed, TREC operates a policy where existing tracks are renewed on an annual basis, following the submission of a proposal.

Back in 2006, when we first proposed the Blog track, our aim was to have a long-term objective for the track, recognising that the richness of the blogosphere and its peculiarities will require several years of investigation before reaching a full understanding of the different blog search tasks, and how they should be effectively addressed. In particular, we proposed to adopt an incremental approach, where we begin with basic blog search tasks and progressively move to more complex search scenarios.

In the first three years of the track (2006-2008), we addressed two main blog search tasks:
  • Opinion finding: involves locating blog posts that express an opinion about a given target.
  • Blog distillation: involves locating blogs that are principally devoted to a topic X over the timespan of the feed.
The first task tackles an important aspect of blogs, namely their opinionated/subjective nature, and the tendency of bloggers to express views, thoughts and feelings towards named entities. This task helps users to find out what the bloggers think about X. The second search task addresses a scenario where the user would like to find a blog to follow or read in their RSS reader. Our main findings and conclusions from the first two years of the Blog track at TREC are summarised in the ICWSM 2008 paper, entitled On the TREC Blog Track. The Blog track 2006 and 2007 overview papers provide further detailed analysis and results.

We are now proposing to move to a second phase of the Blog track, where more refined and complex search scenarios should be investigated. In particular, we are thinking of using a new and larger collection of blogs, which has a much longer timespan than the 11-week period covered by the Blogs06 collection. This allows investigating another important characteristic of the blogosphere, namely the temporal/chronological aspect of blogging, and various related search tasks such as story identification and tracking.

While we were thinking about such possible future tasks, we came across a position paper by Marti Hearst, Matthew Hurst and Susan Dumais, entitled "What Should Blog Search Look Like?", which will be presented in the forthcoming Search in Social Media (SSM 2008) workshop at CIKM 2008.

In particular, Hearst et al. propose that the blog distillation task should be further refined by taking into account a number of dimensions or attributes such as the authority of the blog, the trustworthiness of its authors, the genre of the blog and its style of writing. For example, a user might be interested in blogs to read about a topic X, but where the blogger expresses in-depth viewpoints, backed up by a scientific methodology or evidence. The Cranfield evaluation paradigm adopted by TREC requires deeper thoughts about how relevance assessments should be conducted in such a scenario.

Unsurprisingly for a strong advocate of the importance of user interfaces and visualisation tools for information retrieval, Hearst together with her co-authors propose a faceted blog search interface to help the user explore the attributes of the blogs before choosing those they wish to follow or read, i.e. exploratory search at its best! The conclusion of the paper provides a good summary of Hearst et al.'s views:
For the problem of selecting a blog to read, we propose a faceted interface which highlights different attributes of interest, with a focus on people and on matching the taste preferences of the reader. For the task of “taking the pulse of the blogosphere,” we suggest that blog data be integrated with other social media and that the existing work on tracking trends and aggregating views is heading in the right direction.
As we are trying to wrap up our proposal for TREC 2009, we would like to hear other suggestions and comments about what blog search should look like. Please feel free to post your thoughts and comments in this post, or to email them privately, if you so wish.

Monday, September 1, 2008

From Expert Search to Commoditising Workers

While I'm putting the finishing touches to my PhD thesis (titled The Voting Model for Expert and Blog Search), I thought I'd pick up on a recent related article.

An excerpt from The Numerati has been published on BusinessWeek.com. In the excerpt, Stephen Baker interviews the scientist Samer Takriti while he was working at IBM. Samer, who is a specialist in Operations Research, was working on commoditising workers. Similar to how supply chains and production lines have been modelled and improved, Samer believes that people can be assigned to projects using combinations of their availability, their cost, and their skills/expertise. The idea is to optimise the use of co-workers, leading to better productivity within an organisation.

What's really interesting here is that this is a real application of expert search technology, being applied not just to satisfy occasional expertise needs ("I'm stuck, who should I ask for help?"), but in daily use to determine work assignments and to increase productivity. It is a fusion of search technology with constraint optimisation. Tools like these are likely to become invaluable in assigning jobs in global consultancy companies, where managers are unlikely to know everyone at their disposal. Such tools could even be used to identify the best training path for a co-worker to become skilled and productive in a particular area.
Imagine, says Aleksandra Mojsilovic, one of Takriti's close colleagues, that the company has a superior worker named Joe Smith. Management could really benefit from two or three others just like him, or even a dozen. Once the company has built rich mathematical profiles of Smith and his fellow workers, it might be possible to identify at least a few of the experiences or routines that make Joe Smith so good. "If you had the full employment history, you could even compute the steps to become a Joe Smith," she says.
Van drivers have had their routes assigned automatically for many years. Why should consultants at IBM be any different? However, Baker points out that some people may be left out by such systems (his example is a senior consultant left out because of his high cost, which Takriti counteracts by allowing senior staff members more "time on the bench" than junior staff, because when senior consultants are utilised they get larger cheques). Even so, the concern is the reliance on an expert search system to assign jobs when "expertise relevance" is an even vaguer concept than "document relevance", and expert search systems are not yet (and might never be) as accurate as a travelling salesman solution or a program to optimise a supply chain.

(Via Slashdot)

Tuesday, August 5, 2008

SIGIR 2008

We are just back from Singapore, where we attended the extremely well organised SIGIR'08 conference. We presented one full paper and three posters.

Craig presented our full paper entitled Retrieval Sensitivity Under Training Using Different Measures. Through a large-scale empirical evaluation, the paper addresses an important practical issue when deploying a search engine, namely whether it matters which evaluation measure is used during training, especially when the available training data is very incomplete. The paper shows, among other results, that it is not necessarily appropriate to train by directly optimising the target evaluation measure (e.g. MAP). In particular, the paper shows that bPref, infAP and nDCG are all better training measures than MAP when the training dataset is incomplete and when the evaluation measure is MAP. Interestingly, the same research question has been addressed by Stephen Robertson, albeit more theoretically, in his keynote talk at the SIGIR'08 LR4IR workshop, where he justified and illustrated why directly optimising the evaluation measure on the training set is often not a good approach (as we say, "Great minds think alike"!).
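
To illustrate why incomplete judgments affect measures differently, the sketch below computes average precision (the per-query component of MAP) and bPref for one ranking. It is not code from the paper: the function names are hypothetical, and the bPref variant follows one common formulation in which the number of judged non-relevant documents seen so far is capped at R and normalised by min(R, N). The key difference is that average precision treats unjudged documents as non-relevant, whereas bPref skips them.

```python
def average_precision(ranking, relevant):
    """AP for one query: unjudged documents count as non-relevant."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def bpref(ranking, relevant, nonrelevant):
    """bPref for one query: only judged documents are considered."""
    R, N = len(relevant), len(nonrelevant)
    if R == 0:
        return 0.0
    nonrel_seen, total = 0, 0.0
    for doc in ranking:
        if doc in nonrelevant:
            nonrel_seen += 1
        elif doc in relevant:
            if N == 0:
                total += 1.0
            else:
                total += 1.0 - min(nonrel_seen, R) / min(R, N)
        # Unjudged documents fall through and are ignored.
    return total / R

ranking = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d4"}
nonrelevant = {"d2"}          # d3 and d5 were never judged
print(average_precision(ranking, relevant))   # 0.75
print(bpref(ranking, relevant, nonrelevant))  # 0.5
```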


The Terrier Team also presented three posters at the conference:

Ranking Opinionated Blog Posts Using OpinionFinder (Presented by Ben): The paper proposes an approach to use and integrate an NLP opinion-identification toolkit, OpinionFinder, into the retrieval process of an IR system, such that opinionated, relevant documents are retrieved in response to a query. This is one of the very few opinion-finding approaches that were shown to be effective in the TREC Blog Track.

Limits of Opinion-Finding Baseline Systems (Presented by Craig/Iadh): The paper investigates how the underlying baseline retrieval system performance affects the overall opinion-finding performance. Two effective opinion-finding techniques are applied to all the baseline runs submitted to the TREC 2007 Blog track, leading to interesting insights and conclusions.

Automatic Document Prior Feature Selection for Web Retrieval (Presented by PJ): The paper investigates whether the retrieval performance of a Web search engine can be further enhanced by selecting the best document prior feature (e.g. PageRank, URL-Depth, etc.) on a per-query basis. The paper proposes a novel method for selecting the best document prior feature on a per-query basis.

Ps: Photos are from the SIGIR'08 website.

Monday, August 4, 2008

Welcome to the Terrier Team Blog

It has been a while since we started thinking about having a blog for the Terrier Team; in fact, ever since we became involved in the organisation of the TREC Blog track in 2006.

Recently, we have been encouraged by the very informative and interesting information retrieval-related discussions, taking place in blogs such as

Having been mere regular readers of information retrieval blogs, we thought that now is the right time to become more actively involved in blogging. Hence the creation of this new forum, where we intend to post news about our research work and activities. We hope to share our thoughts on information retrieval research, and to engage in a dialogue with our fellow colleagues and friends.

We do hope that many of you will join us in this forum.