Showing posts with label TREC. Show all posts
Showing posts with label TREC. Show all posts

Tuesday, March 26, 2013

Learning to Rank Research Using Terrier - The Role of Features (Part 2)

This is the second post in a two-part series addressing our recent research in learning to rank. While the previous blog post addressed the role of the sample within learning to rank, and its impact on effectiveness and efficiency, in this blog post, I'll be discussing the role of different features within the learning to rank process.

Features

With various IR approaches being proposed over the years, these have naturally formed features within learned approaches. Features can be calculated on the documents, in a query independent (e.g. URL length, PageRank) or query dependent (e.g. weighting or proximity models) manner. For instance, the LETOR learning to rank test collections deploy various (query dependent) weighting models calculated on different fields (body, title, anchor text, and URL). We have successfully been using the same four fields for representing Web documents in our participations to the TREC Web tracks.

The role of different weighting models - particularly those calculated on different fields - within learning to rank intrigued us and formed an article About Learning Models with Multiple Query Dependent Features that we recently published in Transactions in Information Systems. In particular, some learned models take the form of linear combinations of feature scores. In contrast, Robertson [CIKM 2004] warned against the linear combination of weighting model scores. Yet the LETOR test collections deploy weighting models deployed on each field (e.g. BM25 body, BM25 title, BM25 anchor text). Hence, among other research questions, we re-examined the role of field-based weighting models in the learning to rank era. Our findings on the TREC ClueWeb09 corpus showed that field-based models such as BM25F and PL2F are still important for effective learned models.

Our TOIS paper also shows how to efficiently calculate multiple query dependent features within in IR system. In particular, as the postings of an inverted index are compressed, it is expensive to calculate additional query dependent features once the top K sample has been identified, due to cost of decompressing the relevant parts of the inverted index posting lists again. Instead, we show how the postings of documents that are inserted into the top K documents within a DAAT retrieval strategy can be "cloned", and retained decompressed in memory, such that additional query dependent features can be calculated in the Feature Extraction phase. We call this the fat framework, as it "fattens" the result set with the postings of the query terms for those documents. We have implemented this framework within the Terrier IR platform, and it will be released as part of the next major release of Terrier, as described in our OSIR 2012 paper.

Finally, query features are an increasingly important aspect within learning to rank. In contrast to (query independent or query dependent) document features, query features have the same value for each document ranked for a query. In this way, query features can be said to be document independent. Our CIKM 2010 and SIGIR 2011 papers on diversification used query features to decide on the ambiguity of a query, or to decide on the likely type of information need underlying different aspects of an ambiguous query. On the other hand, the role of query features is to allow learners based on regression trees (e.g. GBRT, LambdaMART) to customise branches of the learned model for different types of query. For instance, if query length is a query feature, then the learner might recognise that a query dependent proximity (document) feature is important for two terms queries, but not for one term queries. In our CIKM 2012 short paper, we recognised the lack of a comprehensive study on the usefulness of query features for learning to rank. Our experiments combined the LambdaMART learning to rank technique with 187 different query features that were grouped into four types: pre-retrieval Query Performance Prediction, Query Concept Identification, Query Log Mining, and Query Type Classification. We found that over a quarter of the 187 query features could significantly improve the effectiveness of a learned model. The most promising query features were Query Type Classification, which identified the presence of entities in the query, suggesting that such features are useful for triggering sub-trees that promote entity homepages. Overall, we found query features could be employed to customise learned ranking models for queries with different popularity, length, difficulty, ambiguity, and related entities.

Summary

There remains a great deal of black magic involved in the effective and efficient application of learning to rank within information retrieval. Indeed, my colleagues and I strongly believe that empirically dervied best practices are an important part of information retrieval research. This series of blog posts has been aimed at addressing some of the aspects missing in the literature, and provide insights into our recent research within this area.

Acknowledgements

This body of research would not have been possible without a number of co-authors and contributors: Iadh Ounis, Rodrygo Santos, Nicola Tonellotto (CNR, Italy) and Ben He (University of the Chinese Academy of Science).

Key References

About Learning Models with Multiple Query Dependent Features. Craig Macdonald, Rodrygo L.T. Santos, Iadh Ounis and Ben He. Transactions in Information Systems, 2013, in press.

On the Usefulness of Query Features for Learning to Rank. Craig Macdonald, Rodrygo Santos and Iadh Ounis. In Proceedings of CIKM 2012.

Efficient and effective retrieval using selective pruning. Nicola Tonellotto, Craig Macdonald and Iadh Ounis. In Proceedings of WSDM 2013.

The Whens and Hows of Learning to Rank. Craig Macdonald, Rodrygo Santos and Iadh Ounis. Information Retrieval Journal, 2012.

Effect of Dynamic Pruning Safety on Learning to Rank Effectiveness. Craig Macdonald, Nicola Tonellotto, and Iadh Ounis. In Proceedings of SIGIR 2012.

Thursday, March 21, 2013

Learning to Rank Research using Terrier - The Importance of the Sample (Part 1)

This is the first of two blog posts addressing some of our recent research in learning to rank. In particular, in recent years, the information retrieval (IR) field has experienced a paradigm shift in the application of machine learning techniques to achieve effective ranking models. A few years ago, we were using hill-climbing optimisation techniques such as simulated annealing to optimise the parameters in weighting models, such as BM25 or PL2, or latterly BM25F or PL2F. Instead, driven first by commercial search engines, IR is increasingly adopting a feature-based approach, where various mini-hypothesis are represented as numerical features, and learning to rank techniques are deployed to decide their importance in the final ranking formulae.

The typical approach for ranking is described in the following figure from our recently presented WSDM 2013 paper:
Phases of a retrieval system deploying learning to rank, taken from Tonellotto et al, WSDM 2013.
In particular, there are typically three phases:
  1. Top K Retrieval, where a number of top-ranked documents are identified, which is known as the sample.
  2. Feature Extraction - various features are calculated for each of the sample documents.
  3. Learned Model Application - the learned model obtained from a learning to rank technique re-ranks the sample documents to better satisfy the user.

The Sample

The set of top K documents selected within the first retrieval phase is called the sample by Liu, even though the selected documents are not iid. Indeed, in selecting the sample, Liu suggested that the top K documents ranked by a simple weighting model such as BM25 is not the best, but is sufficient for effective learning to rank. However, the size of the sample - i.e. the number of documents to be re-ranked by the learned model - is an important parameter: with less documents, the first pass retrieval can be made more efficient by the use of dynamic pruning strategies (e.g. WAND); on the other hand, too few documents may result in insufficient relevant documents being retrieved, and hence effectiveness being degraded.

Our article The Whens and Hows of Learning to Rank in the Information Retrieval Journal studied the sample size parameter for many topic sets and learning to rank techniques - for the mixed information needs on the TREC ClueWeb09 collection, we found that while a sample size of 20 documents was sufficient for effective performance according to ERR@20, larger sample sizes of thousands of documents were needed for effective NDCG@20; for navigational information needs, predominantly larger samples sizes (upto 5000 documents) were needed; Moreover, the particular document representations that used to identify the sample was shown to have an impact on effectiveness - indeed, navigational queries were found to be considerably easier (requiring smaller samples) when anchor text was used, but for informational queries, the opposite was observed. In the article, we examined these issues in detail, across a number of test collections and learning to rank techniques, as well as investigating the role of the evaluation measure and its rank cutoff for listwise techniques - for in depth details and conclusions, see the IR Journal article. 

Dynamic pruning strategies such as WAND are generally configured to be safe-to-rank-K, which means that the effectiveness of the sample is not degraded. Alternatively, they can be configured to prune in an unsafe, more aggressive manner, which can degrade effectiveness by changing the retrieved documents. While the safety of WAND has previously been shown not to have great impact on the effectiveness of the retrieved (sample) documents, in our SIGIR 2012 poster, we showed that the impact on the effectiveness of the documents after re-ranking by application of a learned model could be marked. Moreover, this poster also investigated biases in the retrieved documents that are manifest in WAND when configured for unsafe pruning. For further details, please see our SIGIR 2012 poster.

How many documents that are necessary in the sample clearly varies from query to query. In our WSDM 2013 paper, we proposed selective pruning, whereby the size of the sample and the aggressiveness of the WAND pruning strategy used to create it is altered on a per-query basis. This permits retrieval that is both effective and efficient. Indeed, by using selective pruning, we showed that mean response time could be improved by 36%,  the response times experienced by the slowest 10% of queries could be reduced by 50%, while still maintaining significantly high effectiveness. The full paper investigates the effect of unsafe pruning on both efficiency and effectiveness, as well as different ways to make the decision for selective pruning - see the WSDM 2013 paper for more details.

In the next blog post (Part 2), I'll be looking at more details about the choice of features within learning to rank.

Key References

Efficient and effective retrieval using selective pruning. Nicola Tonellotto, Craig Macdonald and Iadh Ounis. In Proceedings of WSDM 2013.

The Whens and Hows of Learning to Rank. Craig Macdonald, Rodrygo Santos and Iadh Ounis. Information Retrieval Journal, 2012.

Effect of Dynamic Pruning Safety on Learning to Rank Effectiveness. Craig Macdonald, Nicola Tonellotto, and Iadh Ounis. In Proceedings of SIGIR 2012.

Wednesday, September 10, 2008

About Blog Search Tasks

We have been very busy recently with the TREC 2008 Blog track. Now that all runs have been submitted and that the relevance assessments are on-going, it is the time of the year where we start planning for the future of the track at TREC 2009! Indeed, TREC operates a policy where existing tracks are renewed on an annual basis, and following the submission of a proposal.

Back in 2006, when we first proposed the Blog track, our aim was to have a long-term objective for the track, recognising that the richness of the blogosphere and its peculiarities will require several years of investigation before reaching a full understanding of the different blog search tasks, and how they should be effectively addressed. In particular, we proposed to adopt an incremental approach, where we begin with basic blog search tasks and progressively move to more complex search scenarios.

In the first three years of the track (2006-2008), we addressed two main blog search tasks:
  • Opinion finding: involves locating blog posts that express an opinion about a given target.
  • Blog distillation: involves locating blogs that are principally devoted to a topic X over the timespan of the feed.
The first task tackles an important aspect of blogs, namely their opinionated/subjective nature, and the tendency of bloggers to express views, thoughts and feelings towards named-entities. This tasks helps users to find out what the bloggers think about X. The second search task addresses a scenario where the user would like to find a blog to follow or read in their RSS reader. Our main findings and conclusions from the first two years of the Blog track at TREC are summarised in the ICWSM 2008 paper, entitled On the Trec Blog Track. The Blog track 2006 and 2007 overview papers provide further detailed analysis and results.

We are now proposing to move to a second phase of the Blog track, where more refined and complex search scenarios should be investigated. In particular, we are thinking to use a new and larger collection of blogs, which has a much longer timespan than the 11-weeks period covered in the Blog06 collection. This allows investigating another important characteristic of the blogosphere, namely the temporal/chronological aspect of blogging, and various related search tasks such as story identification and tracking.

While we were thinking about such possible future tasks, we came across a position paper by Marti Hearst, Matthew Hurst and Susan Dumais, entitled "What Should Blog Search Look Like?", which will be presented in the forthcoming Search in Social Media (SSM 2008) workshop at CIKM 2008.

In particular, Hearst et al. propose that the blog distillation task should be further refined by taking into account a number of dimensions or attributes such as the authority of the blog, the trustworthiness of its authors, the genre of the blog and its style of writing. For example, a user might be interested in blogs to read about a topic X, but where the blogger expresses in-depth viewpoints, backed up by a scientific methodology or evidence. The Cranfield evaluation paradigm adopted by TREC requires deeper thoughts about how relevance assessments should be conducted in such a scenario.

Unsurprisingly for a strong advocate of the importance of user interfaces and visualisation tools for information retrieval, Hearst together with her co-authors propose a faceted blog search interface to help the user explore the attributes of the blogs before choosing those they wish to follow or read, i.e. exploratory search at its best! The conclusion of the paper provides a good summary of Hearst et al.'s views:
For the problem of selecting a blog to read, we propose a faceted interface which highlights different attributes of interest, with a focus on people and on matching the taste preferences of the reader. For the task of “taking the pulse of the blogosphere,” we suggest that blog data be integrated with other social media and that the existing work on tracking trends and aggregating views is heading in the right direction.
As we are trying to wrap up our proposal for TREC 2009, we would like to hear other suggestions and comments about what blog search should look like. Please feel free to post your thoughts and comments in this post, or to email them privately, if you wish so.