
Thursday, March 21, 2013

Learning to Rank Research using Terrier - The Importance of the Sample (Part 1)

This is the first of two blog posts addressing some of our recent research in learning to rank. In recent years, the information retrieval (IR) field has experienced a paradigm shift in the application of machine learning techniques to achieve effective ranking models. A few years ago, we were using hill-climbing optimisation techniques such as simulated annealing to optimise the parameters of weighting models such as BM25 or PL2, or latterly BM25F or PL2F. Instead, driven first by the commercial search engines, IR is increasingly adopting a feature-based approach, where various mini-hypotheses are represented as numerical features, and learning to rank techniques are deployed to decide their importance in the final ranking formulae.

The typical approach for ranking is described in the following figure from our recently presented WSDM 2013 paper:
Phases of a retrieval system deploying learning to rank, taken from Tonellotto et al, WSDM 2013.
In particular, there are typically three phases (illustrated by the sketch after this list):
  1. Top K Retrieval - a number of top-ranked documents, known as the sample, are identified.
  2. Feature Extraction - various features are calculated for each of the sample documents.
  3. Learned Model Application - the learned model obtained from a learning to rank technique re-ranks the sample documents to better satisfy the user.
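For readers who prefer code, here is a minimal Python sketch of these three phases. The first-pass retrieval, feature extraction and learned model components are hypothetical placeholders supplied by the caller - this is not Terrier's API.

```python
# Minimal sketch of the three-phase learning-to-rank pipeline.
# first_pass_retrieval, extract_features and learned_model are hypothetical
# components supplied by the caller; none of this is Terrier's actual API.

def rank_with_learned_model(query, index, first_pass_retrieval,
                            extract_features, learned_model, K=1000):
    # Phase 1: Top K retrieval - identify the sample with a simple
    # weighting model (e.g. BM25), possibly using dynamic pruning.
    sample = first_pass_retrieval(index, query, top_k=K)

    # Phase 2: Feature extraction - compute query-dependent and
    # query-independent features for each sample document.
    features = [extract_features(index, query, doc) for doc in sample]

    # Phase 3: Learned model application - re-rank the sample by the
    # scores produced by the learned model.
    scores = [learned_model.score(f) for f in features]
    return [doc for _, doc in sorted(zip(scores, sample),
                                     key=lambda pair: pair[0], reverse=True)]
```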

The Sample

The set of top K documents selected within the first retrieval phase is called the sample by Liu, even though the selected documents are not iid. Indeed, Liu suggested that the sample of top K documents ranked by a simple weighting model such as BM25 is not necessarily the best, but is sufficient for effective learning to rank. However, the size of the sample - i.e. the number of documents to be re-ranked by the learned model - is an important parameter: with fewer documents, the first-pass retrieval can be made more efficient by the use of dynamic pruning strategies (e.g. WAND); on the other hand, too few documents may result in insufficient relevant documents being retrieved, and hence degraded effectiveness.
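As a toy, self-contained illustration of this trade-off (the ranking and relevance judgements below are made up), the following sketch measures the recall of the sample at different sizes K: since the learned model can only re-order what the sample contains, sample recall at depth K bounds the effectiveness that re-ranking can achieve.

```python
# Toy illustration of the sample size trade-off, with made-up data:
# the learned model can only re-order documents present in the sample,
# so the recall of the sample at depth K bounds achievable effectiveness.

first_pass_ranking = ["d7", "d2", "d9", "d1", "d5", "d3", "d8", "d4", "d6"]
relevant = {"d1", "d3", "d6"}

for K in (3, 5, 9):
    sample = first_pass_ranking[:K]
    recall = len(relevant.intersection(sample)) / len(relevant)
    print(f"K={K}: sample recall = {recall:.2f}")
# K=3 -> 0.00, K=5 -> 0.33, K=9 -> 1.00 for this toy data
```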

Our article The Whens and Hows of Learning to Rank in the Information Retrieval Journal studied the sample size parameter for several topic sets and learning to rank techniques. For the mixed information needs of the TREC ClueWeb09 collection, we found that while a sample size of 20 documents was sufficient for effective performance according to ERR@20, larger sample sizes of thousands of documents were needed for effective NDCG@20; for navigational information needs, predominantly larger sample sizes (up to 5000 documents) were needed. Moreover, the particular document representation used to identify the sample was shown to have an impact on effectiveness - indeed, navigational queries were found to be considerably easier (requiring smaller samples) when anchor text was used, but for informational queries, the opposite was observed. In the article, we examine these issues in detail across a number of test collections and learning to rank techniques, and also investigate the role of the evaluation measure and its rank cutoff for listwise techniques - for in-depth details and conclusions, see the IR Journal article.

Dynamic pruning strategies such as WAND are generally configured to be safe-to-rank-K, which means that the effectiveness of the sample is not degraded. Alternatively, they can be configured to prune in an unsafe, more aggressive manner, which can degrade effectiveness by changing the retrieved documents. While the safety of WAND had previously been shown not to have a great impact on the effectiveness of the retrieved (sample) documents, in our SIGIR 2012 poster we showed that the impact on effectiveness after re-ranking by a learned model could be marked. Moreover, the poster also investigated the biases in the retrieved documents that are manifest in WAND when configured for unsafe pruning. For further details, please see our SIGIR 2012 poster.
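To make the safe/unsafe distinction concrete, here is a minimal, self-contained sketch of a WAND-like document-at-a-time loop - a simplification for illustration, not Terrier's implementation. The threshold factor F controls aggressiveness: F=1.0 approximates safe-to-rank-K behaviour, while F>1.0 over-estimates the entry threshold and may prune documents that would otherwise have made the top K.

```python
import heapq

def wand_like_topk(candidates, term_upper_bounds, score, K=10, F=1.0):
    """Simplified WAND-style top-K retrieval (illustrative, not Terrier's code).

    candidates:        iterable of (docid, matching_terms), in docid order
    term_upper_bounds: dict mapping each term to its maximum score contribution
    score:             function (docid, matching_terms) -> full document score
    F:                 pruning aggressiveness; F=1.0 approximates safe-to-rank-K,
                       F>1.0 prunes unsafely and may change the retrieved set
    """
    heap = []        # min-heap of (score, docid) holding the current top K
    threshold = 0.0  # score of the current K-th ranked document
    for docid, terms in candidates:
        upper_bound = sum(term_upper_bounds[t] for t in terms)
        if len(heap) == K and upper_bound <= threshold * F:
            continue  # prune: document cannot (or is assumed not to) enter the top K
        s = score(docid, terms)
        if len(heap) < K:
            heapq.heappush(heap, (s, docid))
        elif s > heap[0][0]:
            heapq.heapreplace(heap, (s, docid))
        if len(heap) == K:
            threshold = heap[0][0]
    return sorted(heap, key=lambda pair: pair[0], reverse=True)
```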

How many documents are necessary in the sample clearly varies from query to query. In our WSDM 2013 paper, we proposed selective pruning, whereby the size of the sample and the aggressiveness of the WAND pruning strategy used to create it are altered on a per-query basis. This permits retrieval that is both effective and efficient. Indeed, by using selective pruning, we showed that mean response time could be improved by 36%, and the response times experienced by the slowest 10% of queries could be reduced by 50%, while still maintaining high effectiveness. The full paper investigates the effect of unsafe pruning on both efficiency and effectiveness, as well as different ways to make the selective pruning decision - see the WSDM 2013 paper for more details.
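A hypothetical sketch of that per-query decision is given below; the response time predictor, the time budget and the (K, F) parameter pairs are illustrative assumptions, not the mechanism or values from the paper.

```python
# Hypothetical sketch of selective pruning: choose the sample size K and the
# WAND threshold factor F per query. The predictor, budget and parameter
# pairs are assumptions for illustration, not those of the WSDM 2013 paper.

def select_pruning_parameters(query_features, predict_response_time_ms,
                              time_budget_ms=200,
                              conservative=(5000, 1.0),   # (K, F): large sample, safe pruning
                              aggressive=(1000, 1.2)):    # (K, F): small sample, unsafe pruning
    """Return (K, F) for this query: prune aggressively only when the query is
    predicted to exceed the time budget under conservative settings."""
    if predict_response_time_ms(query_features) > time_budget_ms:
        return aggressive    # faster, at some risk to effectiveness
    return conservative      # slower, but preserves effectiveness
```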

In the next blog post (Part 2), I'll be looking at more details about the choice of features within learning to rank.

Key References

Efficient and effective retrieval using selective pruning. Nicola Tonellotto, Craig Macdonald and Iadh Ounis. In Proceedings of WSDM 2013.

The Whens and Hows of Learning to Rank. Craig Macdonald, Rodrygo Santos and Iadh Ounis. Information Retrieval Journal, 2012.

Effect of Dynamic Pruning Safety on Learning to Rank Effectiveness. Craig Macdonald, Nicola Tonellotto, and Iadh Ounis. In Proceedings of SIGIR 2012.

Monday, June 25, 2012

Efficiency, Effectiveness, Medical Search, Dataset Development and Crowdsourcing at SIGIR 2012

The TerrierTeam will be well represented at SIGIR 2012 this year with a full paper, four posters, a demonstration and a workshop, covering a wide range of disciplines within the field of information retrieval. For those of you interested in Web search efficiency, we have a number of contributions to look out for. Our full paper Learning to Predict Response Times for Online Query Scheduling defines the new area of query efficiency prediction. In particular, it postulates that not every query takes the same time to complete, particularly where efficient dynamic pruning strategies such as WAND are used to reduce retrieval latency. In our paper, we show and explain why queries with similar properties (e.g. posting list lengths) can have markedly different response times. We use these explanations to propose a learned approach for query efficiency prediction that can accurately predict the response time of a query before it is executed. Furthermore, we show that using query efficiency prediction can markedly increase the efficiency of query routing within a search engine that uses multiple replicated indices. Relatedly, our poster Scheduling Queries Across Replicas builds upon our work on query efficiency prediction to show how a replicated and distributed search engine can be improved by the application of response time predictions. In particular, the response time predictions are used to estimate the workload of each replica of each index shard, so that each newly arrived query can be routed to the replica of each index shard that will be ready to process the query earliest.
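As a hedged illustration of how such predictions can be used, the sketch below trains a regression model on per-query features (e.g. number of query terms, posting list lengths) and then routes each query to the replica predicted to finish it earliest. The choice of gradient boosted regression trees and the exact features are assumptions for illustration, not necessarily the configuration used in the paper.

```python
# Illustrative sketch: query efficiency prediction feeding replica scheduling.
# The regressor and features are assumptions, not necessarily those of the paper.

from sklearn.ensemble import GradientBoostingRegressor

def train_efficiency_predictor(query_features, observed_times_ms):
    """query_features: one feature vector per training query (e.g. number of
    terms, posting list lengths); observed_times_ms: measured response times."""
    model = GradientBoostingRegressor()
    model.fit(query_features, observed_times_ms)
    return model

def route_query(features, model, replica_backlog_ms):
    """Assign a newly arrived query to the replica predicted to finish it
    earliest, and update that replica's estimated backlog."""
    predicted_ms = model.predict([features])[0]
    best = min(range(len(replica_backlog_ms)),
               key=lambda r: replica_backlog_ms[r] + predicted_ms)
    replica_backlog_ms[best] += predicted_ms
    return best
```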

At SIGIR this year we also present recent work examining both efficiency and effectiveness. Dynamic pruning strategies, such as WAND, can increase efficiency by omitting the scoring of documents that can be guaranteed not to make the top-K retrieved set - a feature known as safeness. Broder et al. showed how WAND could be made more efficient by relaxing the safeness guarantee, with little impact on the top-ranked documents. Through experiments on the TREC ClueWeb09 corpus and 33 query dependent and query independent features, our poster Effect of Dynamic Pruning Safety on Learning to Rank Effectiveness shows that relaxing safeness to aid efficiency can have an unexpectedly large impact on retrieval effectiveness when combined with modern learning to rank models, in contrast to the earlier work by Broder et al. In particular, we show that inherent biases by unsafe WAND towards documents with lower docids can markedly impact the effectiveness of learned models.

Those interested in the Medical search domain, in particular participants in the TREC Medical track, will be interested in our paper entitled Exploiting Term Dependence while Handling Negation in Medical Search. We show that it is important to handle negation in medical records - in particular, when searching for cohorts (groups of patients) with specific symptoms, our approach ensures that patients known not to have exhibited particular symptoms are not retrieved. Our results demonstrate that appropriate negation handling can increase retrieval effectiveness, particularly when the dependence between negated terms is considered using a term dependence model from the Divergence From Randomness framework.
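As a toy illustration of the underlying idea (not the paper's actual implementation - the cue list and the fixed negation scope of three terms are simplifying assumptions), terms falling within the scope of a negation cue can be rewritten at indexing time so that they no longer match positive query terms:

```python
# Toy sketch of negation handling when indexing medical records: terms in the
# scope of a negation cue (e.g. "no", "denies", "without") are rewritten to a
# negated variant so they no longer match positive query terms. The cue list
# and fixed scope of three terms are simplifying assumptions.

NEGATION_CUES = {"no", "not", "denies", "without"}
SCOPE = 3  # number of following terms treated as negated

def tokenise_with_negation(text):
    tokens, negated_left = [], 0
    for term in text.lower().split():
        if term in NEGATION_CUES:
            negated_left = SCOPE
            continue
        if negated_left > 0:
            tokens.append("NEG_" + term)   # e.g. "no fever" -> "NEG_fever"
            negated_left -= 1
        else:
            tokens.append(term)
    return tokens

print(tokenise_with_negation("patient denies chest pain but reports fever"))
# -> ['patient', 'NEG_chest', 'NEG_pain', 'NEG_but', 'reports', 'fever']
```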

Our poster On Building a Reusable Twitter Corpus tackles an important issue raised during the creation of the Tweets11 dataset as part of the TREC Microblog track, namely how reusable Tweets11 is, given the dynamics of Twitter. Our poster shows that corpus degradation due to deleted tweets does not affect the ranking of systems that participated in the TREC 2011 Microblog track. Meanwhile, we are also demonstrating the first release of a new extension to our Terrier IR platform, namely CrowdTerrier, an infrastructure addition to Terrier that enables relevance assessments to be created in a fast, semi-automatic manner using crowdsourcing. CrowdTerrier will be made available for download soon.
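For illustration, one way to quantify such reusability is to correlate the ordering of systems evaluated on the full judgements with their ordering after deleted tweets are removed; the sketch below uses Kendall's tau with made-up system scores (scipy provides a standard implementation).

```python
# Sketch of a reusability check: compare how systems rank on the full corpus
# versus after tweet deletions, using Kendall's tau. Scores are made up.

from scipy.stats import kendalltau

systems      = ["sysA", "sysB", "sysC", "sysD"]
map_full     = [0.31, 0.28, 0.25, 0.19]   # effectiveness on the full corpus
map_degraded = [0.29, 0.27, 0.24, 0.18]   # effectiveness after tweet deletions

tau, _ = kendalltau(map_full, map_degraded)
print(f"Kendall's tau between system orderings: {tau:.2f}")  # 1.0 = identical ordering
```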

Finally, together with a group representing six open source IR systems, we are involved in the organisation of a SIGIR'12 workshop on Open Source Information Retrieval. The workshop aims to provide a forum for users and authors of open source IR tools to get together, to work towards building OpenSearchLab - an open source, live and functioning online web search engine for research purposes - and to discuss the joint future of open source IR.