Tuesday, March 26, 2013

Learning to Rank Research Using Terrier - The Role of Features (Part 2)

This is the second post in a two-part series addressing our recent research in learning to rank. While the previous blog post addressed the role of the sample within learning to rank, and its impact on effectiveness and efficiency, in this blog post, I'll be discussing the role of different features within the learning to rank process.

Features

The various IR approaches proposed over the years have naturally become features within learned approaches. Features can be calculated on documents in a query independent (e.g. URL length, PageRank) or query dependent (e.g. weighting or proximity models) manner. For instance, the LETOR learning to rank test collections deploy various (query dependent) weighting models calculated on different fields (body, title, anchor text, and URL). We have successfully used the same four fields to represent Web documents in our participations in the TREC Web track.

The role of different weighting models - particularly those calculated on different fields - within learning to rank intrigued us, and led to our article About Learning Models with Multiple Query Dependent Features, recently published in Transactions on Information Systems. In particular, some learned models take the form of linear combinations of feature scores, yet Robertson [CIKM 2004] warned against the linear combination of weighting model scores. Nevertheless, the LETOR test collections deploy a weighting model on each field separately (e.g. BM25 body, BM25 title, BM25 anchor text). Hence, among other research questions, we re-examined the role of field-based weighting models in the learning to rank era. Our findings on the TREC ClueWeb09 corpus showed that field-based models such as BM25F and PL2F remain important for effective learned models.
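To illustrate the distinction, the toy sketch below contrasts a linear combination of per-field BM25 scores with a BM25F-style combination, which merges the field frequencies before applying a single term-frequency saturation. All weights and statistics here are invented illustrative values, and document-length normalisation is omitted for brevity; this is a sketch of the principle, not the full BM25F formula.

```python
import math

def bm25_saturation(tf, k1=1.2):
    """BM25 term-frequency saturation."""
    return tf / (k1 + tf)

def linear_combination(field_tfs, field_weights, idf, k1=1.2):
    """Score each field independently with BM25, then sum the scores:
    the saturation is applied once per field."""
    return sum(w * idf * bm25_saturation(tf, k1)
               for tf, w in zip(field_tfs, field_weights))

def bm25f(field_tfs, field_weights, idf, k1=1.2):
    """BM25F-style: combine the weighted field frequencies first, then
    apply a single saturation over the combined frequency."""
    pseudo_tf = sum(w * tf for tf, w in zip(field_tfs, field_weights))
    return idf * bm25_saturation(pseudo_tf, k1)

# Toy example: a term occurring once in the title and once in the body,
# with the title field weighted twice as highly as the body.
tfs, weights, idf = [1.0, 1.0], [2.0, 1.0], math.log(10)
print(linear_combination(tfs, weights, idf))
print(bm25f(tfs, weights, idf))
```

The two scores differ because the linear combination saturates each field's frequency separately, which is exactly the behaviour Robertson argued against.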

Our TOIS paper also shows how to efficiently calculate multiple query dependent features within an IR system. In particular, as the postings of an inverted index are compressed, it is expensive to calculate additional query dependent features once the top K sample has been identified, due to the cost of decompressing the relevant parts of the inverted index posting lists again. Instead, we show how the postings of documents that are inserted into the top K documents within a DAAT retrieval strategy can be "cloned" and retained decompressed in memory, such that additional query dependent features can be calculated in the Feature Extraction phase. We call this the fat framework, as it "fattens" the result set with the postings of the query terms for those documents. We have implemented this framework within the Terrier IR platform, and it will be released as part of the next major release of Terrier, as described in our OSIR 2012 paper.
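A minimal sketch of the idea, using toy in-memory postings and invented names (the actual Terrier implementation works over compressed posting lists and differs in detail): during the DAAT first pass, whenever a document enters the top-K heap, its decoded postings are cloned into the result set, so that additional query dependent features can later be computed without touching the compressed index again.

```python
import heapq

def daat_with_fat(postings, k):
    """postings: {term: {docid: tf}} -- a toy stand-in for decompressed
    posting lists. Returns the top-k documents together with their 'fat'
    (the cloned per-term frequencies), scored by a cheap tf-sum first pass."""
    docids = set().union(*(p.keys() for p in postings.values()))
    heap = []  # entries: (score, docid, fat)
    for doc in docids:
        fat = {t: p[doc] for t, p in postings.items() if doc in p}
        score = sum(fat.values())          # cheap first-pass score
        heapq.heappush(heap, (score, doc, fat))
        if len(heap) > k:
            heapq.heappop(heap)            # evict the current minimum
    return sorted(heap, reverse=True)

def extra_feature(fat):
    """A second feature computed purely from the retained postings,
    e.g. the number of query terms matching the document."""
    return len(fat)

sample = daat_with_fat({"terrier": {1: 3, 2: 1}, "ranking": {2: 2, 3: 4}}, k=2)
for score, doc, fat in sample:
    print(doc, score, extra_feature(fat))
```

The key point is that `extra_feature` never consults `postings` again: everything it needs travels with the fattened result set.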

Finally, query features are an increasingly important aspect of learning to rank. In contrast to (query independent or query dependent) document features, query features have the same value for every document ranked for a query; in this way, query features can be said to be document independent. Our CIKM 2010 and SIGIR 2011 papers on diversification used query features to decide on the ambiguity of a query, or to decide on the likely type of information need underlying different aspects of an ambiguous query. More generally, the role of query features is to allow learners based on regression trees (e.g. GBRT, LambdaMART) to customise branches of the learned model for different types of query. For instance, if query length is a query feature, then the learner might recognise that a query dependent proximity (document) feature is important for two-term queries, but not for one-term queries. In our CIKM 2012 short paper, we recognised the lack of a comprehensive study on the usefulness of query features for learning to rank. Our experiments combined the LambdaMART learning to rank technique with 187 different query features, grouped into four types: pre-retrieval Query Performance Prediction, Query Concept Identification, Query Log Mining, and Query Type Classification. We found that over a quarter of the 187 query features could significantly improve the effectiveness of a learned model. The most promising were the Query Type Classification features, which identify the presence of entities in the query, suggesting that such features are useful for triggering sub-trees that promote entity homepages. Overall, we found that query features could be employed to customise learned ranking models for queries with different popularity, length, difficulty, ambiguity, and related entities.
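As a toy illustration of the query-length example above (all names, weights, and thresholds are invented): query features are simply appended to every document's feature vector for that query, which lets a tree-based learner branch on them. A single hand-written "tree" stands in for what LambdaMART would learn.

```python
def feature_vector(doc_feats, query_feats):
    """Document features vary per document; query features are
    replicated across all documents ranked for the same query."""
    return doc_feats + query_feats

def toy_tree_model(fv):
    """A hand-written stand-in for one learned regression tree: fv is
    [bm25, proximity, query_length]. For queries of two or more terms
    the proximity feature contributes; for one-term queries (where
    proximity is meaningless) only bm25 is used."""
    bm25, proximity, query_length = fv
    if query_length >= 2:
        return 0.7 * bm25 + 0.3 * proximity
    return bm25

q_feats = [2]                       # query feature: a two-term query
docs = [[1.0, 0.9], [1.2, 0.0]]     # [bm25, proximity] per document
scores = [toy_tree_model(feature_vector(d, q_feats)) for d in docs]
print(scores)
```

With query length 2, the first document's strong proximity score lets it overtake the second; re-run with `q_feats = [1]` and the ranking flips, which is precisely the per-query-type customisation that query features enable.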

Summary

There remains a great deal of black magic involved in the effective and efficient application of learning to rank within information retrieval. Indeed, my colleagues and I strongly believe that empirically derived best practices are an important part of information retrieval research. This series of blog posts has aimed to address some of the aspects missing in the literature, and to provide insights into our recent research in this area.

Acknowledgements

This body of research would not have been possible without a number of co-authors and contributors: Iadh Ounis, Rodrygo Santos, Nicola Tonellotto (CNR, Italy) and Ben He (University of Chinese Academy of Sciences).

Key References

About Learning Models with Multiple Query Dependent Features. Craig Macdonald, Rodrygo L.T. Santos, Iadh Ounis and Ben He. Transactions on Information Systems, 2013, in press.

On the Usefulness of Query Features for Learning to Rank. Craig Macdonald, Rodrygo Santos and Iadh Ounis. In Proceedings of CIKM 2012.

Efficient and effective retrieval using selective pruning. Nicola Tonellotto, Craig Macdonald and Iadh Ounis. In Proceedings of WSDM 2013.

The Whens and Hows of Learning to Rank. Craig Macdonald, Rodrygo Santos and Iadh Ounis. Information Retrieval Journal, 2012.

Effect of Dynamic Pruning Safety on Learning to Rank Effectiveness. Craig Macdonald, Nicola Tonellotto, and Iadh Ounis. In Proceedings of SIGIR 2012.

Monday, March 25, 2013

Explicit web search result diversification


A couple of weeks ago I successfully defended my PhD thesis at the School of Computing Science of the University of Glasgow. The thesis, entitled “Explicit web search result diversification”, was unconditionally approved with no corrections by the examination board.

The thesis tackles the problem of ambiguity in web search queries. In particular, with the enormous size of the Web, a misunderstanding of the information need underlying an ambiguous query can misguide the search engine, ultimately leading the user to abandon the originally submitted query. To overcome this problem, a sensible approach is to diversify the documents retrieved for the user's query. As a result, the likelihood that at least one of these documents will satisfy the user's actual information need is increased.

In the thesis, we argue that an ambiguous query should be seen as representing not one, but multiple information needs. Based upon this premise, we propose xQuAD – Explicit Query Aspect Diversification, a novel probabilistic framework for search result diversification. In particular, the xQuAD framework naturally models several dimensions of the search result diversification problem in a principled yet practical manner. To this end, the framework represents the possible information needs underlying a query as a set of keyword-based sub-queries. Moreover, xQuAD accounts for the overall coverage of each retrieved document with respect to the identified sub-queries, so as to rank highly diverse documents first. In addition, it accounts for how well each sub-query is covered by the other retrieved documents, so as to promote novelty – and hence penalise redundancy – in the ranking. The framework also models the importance of each of the identified sub-queries, so as to appropriately cater for the interests of the user population when diversifying the retrieved documents. Finally, since not all queries are equally ambiguous, the xQuAD framework caters for the ambiguity level of different queries, so as to appropriately trade-off relevance for diversity on a per-query basis.
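In essence, xQuAD greedily re-ranks the retrieved documents by mixing relevance to the query with coverage of sub-queries not yet covered by the documents selected so far. The sketch below follows the published formulation, with toy probability estimates supplied by hand; in practice these estimates come from the sub-query identification, coverage, and importance components described above.

```python
def xquad(relevance, importance, coverage, lam, k):
    """Greedy xQuAD re-ranking.
    relevance:  {doc: P(d|q)}            -- relevance to the query
    importance: {sub_query: P(s|q)}      -- sub-query importance
    coverage:   {(doc, sub_query): P(d|s)}
    At each step, select the document d maximising
      (1-lam)*P(d|q) + lam*sum_s P(s|q)*P(d|s)*prod_{d' selected}(1-P(d'|s))
    """
    selected, remaining = [], set(relevance)
    # novelty[s] = product over already-selected d' of (1 - P(d'|s))
    novelty = {s: 1.0 for s in importance}
    while remaining and len(selected) < k:
        def score(d):
            div = sum(importance[s] * coverage.get((d, s), 0.0) * novelty[s]
                      for s in importance)
            return (1 - lam) * relevance[d] + lam * div
        best = max(remaining, key=score)
        selected.append(best)
        remaining.discard(best)
        for s in importance:
            novelty[s] *= 1 - coverage.get((best, s), 0.0)
    return selected

# Toy example: d1 and d2 both cover sub-query s1 well; only d3 covers s2.
rel = {"d1": 0.9, "d2": 0.8, "d3": 0.5}
imp = {"s1": 0.5, "s2": 0.5}
cov = {("d1", "s1"): 0.9, ("d2", "s1"): 0.8, ("d3", "s2"): 0.9}
print(xquad(rel, imp, cov, lam=0.7, k=3))
```

With diversification enabled, the less relevant d3 is promoted above d2 because d2's sub-query is already covered by d1; setting `lam=0` recovers the original relevance ranking, illustrating the per-query relevance/diversity trade-off.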

The xQuAD framework is general and can be used to instantiate several diversification models, including the most prominent models described in the literature. In particular, within xQuAD, each of the aforementioned dimensions of the search result diversification problem can be tackled in a variety of ways. In this thesis, as additional contributions besides the xQuAD framework, we introduce novel machine learning approaches for addressing each of these dimensions. These include a learning to rank approach for identifying effective sub-queries as query suggestions mined from a query log, an intent-aware approach for choosing the ranking models most likely to be effective for estimating the coverage and novelty of multiple documents with respect to a sub-query, and a selective approach for automatically predicting how much to diversify the documents retrieved for each individual query. In addition, we perform the first empirical analysis of the role of novelty as a diversification strategy for web search.

As demonstrated throughout the thesis, the principles underlying the xQuAD framework are general, sound, and effective. In particular, to validate the contributions of this thesis, we thoroughly assess the effectiveness of xQuAD under the standard experimentation paradigm provided by the diversity task of the TREC 2009, 2010, and 2011 Web tracks. The results of this investigation demonstrate the effectiveness of our proposed framework. Indeed, xQuAD attains consistent and significant improvements in comparison to the most effective diversification approaches in the literature, and across a range of experimental conditions, comprising multiple input rankings, multiple sub-query generation and coverage estimation mechanisms, as well as queries with multiple levels of ambiguity.

These investigations led to the publication of 12 peer-reviewed research papers and 5 evaluation forum reports directly related to the thesis. Moreover, the thesis opened up directions for other researchers, who deployed and extended the xQuAD framework for different applications, and inspired a series of workshops on Diversity in Document Retrieval as well as a research track at the internationally renowned NTCIR forum. From a practical perspective, xQuAD has been subjected to scrutiny from the research community as a regular contender in both TREC and NTCIR. Having produced the winning entry in every edition of the diversity task of the TREC Web track (best cat. B submission in TREC 2009 and TREC 2010; best overall submission in TREC 2011 and TREC 2012), we believe that the xQuAD framework has secured its place in the state of the art.

The thesis is now available online at http://theses.gla.ac.uk/4106/. In addition, a reference implementation of the xQuAD framework will feature in the next major release of the open-source Terrier Information Retrieval Platform.

Thursday, March 21, 2013

Learning to Rank Research using Terrier - The Importance of the Sample (Part 1)

This is the first of two blog posts addressing some of our recent research in learning to rank. In recent years, the information retrieval (IR) field has experienced a paradigm shift in the application of machine learning techniques to achieve effective ranking models. A few years ago, we were using hill-climbing optimisation techniques such as simulated annealing to optimise the parameters of weighting models such as BM25 or PL2, or latterly BM25F or PL2F. Instead, driven first by commercial search engines, IR is increasingly adopting a feature-based approach, where various mini-hypotheses are represented as numerical features, and learning to rank techniques are deployed to decide their importance in the final ranking formula.

The typical approach for ranking is described in the following figure from our recently presented WSDM 2013 paper:
Phases of a retrieval system deploying learning to rank, taken from Tonellotto et al, WSDM 2013.
In particular, there are typically three phases:
  1. Top K Retrieval, where a number of top-ranked documents are identified, which is known as the sample.
  2. Feature Extraction - various features are calculated for each of the sample documents.
  3. Learned Model Application - the learned model obtained from a learning to rank technique re-ranks the sample documents to better satisfy the user.
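At a high level, the three phases can be sketched as follows, with toy components standing in for a real inverted index and a real learned model (the function names and the tiny linear model are invented for illustration):

```python
def top_k_retrieval(index, query, k):
    """Phase 1: score every matching document with a single weighting
    model (here a toy tf sum) and keep the k best as the sample."""
    scores = {}
    for term in query:
        for doc, tf in index.get(term, {}).items():
            scores[doc] = scores.get(doc, 0) + tf
    return sorted(scores, key=scores.get, reverse=True)[:k]

def extract_features(index, query, sample):
    """Phase 2: compute a feature vector for each sample document."""
    return {d: [sum(index[t].get(d, 0) for t in query),   # total tf
                sum(1 for t in query if d in index[t])]   # matching terms
            for d in sample}

def apply_model(weights, features):
    """Phase 3: re-rank the sample with a (here linear) learned model."""
    return sorted(features,
                  key=lambda d: sum(w * f for w, f in zip(weights, features[d])),
                  reverse=True)

index = {"terrier": {1: 5, 2: 1}, "ir": {2: 3, 3: 2}}
sample = top_k_retrieval(index, ["terrier", "ir"], k=3)
feats = extract_features(index, ["terrier", "ir"], sample)
print(apply_model([0.2, 2.0], feats))
```

Note that the learned model produces a different ordering from the first-pass retrieval: the sample only determines *which* documents are considered, while the final ranking is decided in phase 3.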

The Sample

The set of top K documents selected within the first retrieval phase is called the sample by Liu, even though the selected documents are not iid. Indeed, in selecting the sample, Liu suggested that the top K documents ranked by a simple weighting model such as BM25 are not the best possible sample, but are sufficient for effective learning to rank. However, the size of the sample - i.e. the number of documents to be re-ranked by the learned model - is an important parameter: with fewer documents, the first-pass retrieval can be made more efficient by the use of dynamic pruning strategies (e.g. WAND); on the other hand, too few documents may result in insufficient relevant documents being retrieved, and hence effectiveness being degraded.

Our article The Whens and Hows of Learning to Rank in the Information Retrieval Journal studied the sample size parameter for many topic sets and learning to rank techniques. For the mixed information needs on the TREC ClueWeb09 collection, we found that while a sample size of 20 documents was sufficient for effective performance according to ERR@20, larger sample sizes of thousands of documents were needed for effective NDCG@20; for navigational information needs, predominantly larger sample sizes (up to 5000 documents) were needed. Moreover, the particular document representation used to identify the sample was shown to have an impact on effectiveness - indeed, navigational queries were found to be considerably easier (requiring smaller samples) when anchor text was used, but for informational queries, the opposite was observed. In the article, we examine these issues in detail across a number of test collections and learning to rank techniques, as well as investigating the role of the evaluation measure and its rank cutoff for listwise techniques - for in-depth details and conclusions, see the IR Journal article.

Dynamic pruning strategies such as WAND are generally configured to be safe-to-rank-K, which means that the effectiveness of the sample is not degraded. Alternatively, they can be configured to prune in an unsafe, more aggressive manner, which can degrade effectiveness by changing the retrieved documents. While the safety of WAND has previously been shown not to have great impact on the effectiveness of the retrieved (sample) documents, in our SIGIR 2012 poster, we showed that the impact on the effectiveness of the documents after re-ranking by application of a learned model could be marked. Moreover, this poster also investigated biases in the retrieved documents that are manifest in WAND when configured for unsafe pruning. For further details, please see our SIGIR 2012 poster.

How many documents are necessary in the sample clearly varies from query to query. In our WSDM 2013 paper, we proposed selective pruning, whereby the size of the sample and the aggressiveness of the WAND pruning strategy used to create it are altered on a per-query basis. This permits retrieval that is both effective and efficient. Indeed, by using selective pruning, we showed that mean response time could be improved by 36%, and the response times experienced by the slowest 10% of queries could be reduced by 50%, while high effectiveness was maintained. The full paper investigates the effect of unsafe pruning on both efficiency and effectiveness, as well as different ways to make the decision for selective pruning - see the WSDM 2013 paper for more details.
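A caricature of the per-query decision is sketched below. The thresholds and the idea of a predicted response time are invented for illustration; in the paper, the decision is driven by query efficiency predictors rather than a fixed rule.

```python
def select_configuration(predicted_time_ms, budget_ms=200):
    """Choose the sample size K and the WAND aggressiveness factor F
    on a per-query basis: queries predicted to be slow get a smaller
    sample and more aggressive (unsafe) pruning, so that they fit
    within the response-time budget; other queries keep a large,
    safely-ranked sample for maximum effectiveness."""
    if predicted_time_ms > budget_ms:
        return {"K": 20, "F": 1.3}      # aggressive, unsafe pruning
    return {"K": 1000, "F": 1.0}        # safe-to-rank-K

print(select_configuration(350))   # a predicted-slow query
print(select_configuration(80))    # a predicted-fast query
```

The point of the sketch is the shape of the trade-off: only the queries that would otherwise blow the latency budget pay the (potential) effectiveness cost of unsafe pruning.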

In the next blog post (Part 2), I'll be looking at more details about the choice of features within learning to rank.

Key References

Efficient and effective retrieval using selective pruning. Nicola Tonellotto, Craig Macdonald and Iadh Ounis. In Proceedings of WSDM 2013.

The Whens and Hows of Learning to Rank. Craig Macdonald, Rodrygo Santos and Iadh Ounis. Information Retrieval Journal, 2012.

Effect of Dynamic Pruning Safety on Learning to Rank Effectiveness. Craig Macdonald, Nicola Tonellotto, and Iadh Ounis. In Proceedings of SIGIR 2012.

Friday, July 27, 2012

SMART: An open source framework for searching the physical world

Some of our readers are probably aware of our new project SMART, which aims to develop a new technology for the real-time indexing and retrieval of sensor and social streams. This three-year project is funded by the European Commission under the Seventh Framework Programme (grant number 287583). The project, which started in November 2011, has already received substantial national and international press coverage in online and print news over the last month. The BBC will shortly be broadcasting a television piece about the project.

The name of the project and the resulting search engine, SMART, acknowledges the vision of the Internet of Things in general, and the concept of smart cities in particular. Indeed, SMART builds on the growing trend of smart cities, where, in addition to physical infrastructure (roads, buildings), digital knowledge infrastructure is deployed to serve the needs of citizens and local governments. The backbone of this digital knowledge infrastructure is mainly composed of sensors such as cameras, microphone arrays, and other environmental sensors, from weather to parking sensors. For example, in smart cities, drivers can be notified where best to park their car, or how to avoid traffic jams in the city centre, at any time of the day. The main idea of the SMART project is to connect these sensors to the Internet and to provide search technologies that allow citizens to benefit from the information these sensors can provide in real time.

The SMART search engine builds upon the Terrier Information Retrieval platform, and exemplifies our recent move towards building new, separate and tailored products on top of the Terrier platform. In particular, Terrier has been enhanced and expanded with real-time indexing and a scalable distributed architecture, allowing it to process and handle a large volume of continuous, parallel streams.

SMART is multi-disciplinary in nature, encompassing state-of-the-art technologies from audio & video processing, social search, and reasoning. Building upon these technologies, SMART analyses the input from sensors in real-time, for example to detect large crowds, or whether live music can be heard. These observations can be compared with recent posts on social networks from the same area, to see whether the system can learn more about what is happening around the sensors. By analysing the sensors across multiple locations within the city, when a user asks “what’s happening near me”, the system has some idea of which locations have the most interesting events.

Clearly, making real-world events searchable can have privacy and ethics implications. In fact, never before in our research have we been confronted with such a dichotomy between what is technologically feasible and what we conceive to be ethical. That's why we and our partners in the project are carefully considering privacy issues in our research. Indeed, we are working closely with various national Data Protection Authorities (DPAs) (i) to ensure that we don’t overstep the legal or ethical boundaries of privacy, and (ii) to provide guidelines on the ethical implications of the SMART technologies, helping prospective adopters to deploy SMART in a legal, ethical, and privacy-friendly manner. Interested readers can consult the first issue of the SMART Newsletter for further details about our ongoing efforts on the privacy issue.

While we will be trialling the SMART search technology in the city of Santander (Spain), the key infrastructure of SMART (including the search components based on Terrier) will be made available as open source, encapsulating a vision whereby other smart cities can easily become involved and benefit from the project's outcomes. We expect the first release of the SMART search technology to become available as open source under the Mozilla Public License (MPL) 2.0 by the end of 2012. By releasing parts of SMART as open source, we aim to allow the formation of a community of early adopters that will be key for evaluating and sustaining the project.

With this in mind, we have just published a paper in the SIGIR 2012 Open Source Information Retrieval (OSIR 2012) workshop describing our current progress in the project as well as the open source vision of the project:

SMART: An open source framework for searching the physical world. M-Dyaa Albakour, Craig Macdonald, Iadh Ounis, Aristodemos Pnevmatikakis and John Soldatos. In Proceedings of the SIGIR 2012 Workshop on Open Source Information Retrieval. Portland, Oregon, USA. August 2012.

As always, we welcome comments and contributions from smart cities, community members and developers to the SMART vision.

Wednesday, July 25, 2012

From Puppy to Maturity: Experiences in Developing Terrier

We will be taking part in the SIGIR 2012 Workshop on Open Source Information Retrieval. In particular, we have published a paper on the Terrier open source information retrieval platform, detailing the vision behind the platform, some recent developments in Terrier, as well as a roadmap for future releases.

As always, our vision for the Terrier platform is to continue empowering researchers and practitioners in information retrieval (IR) with up-to-date, easily adaptable, effective and scalable indexing and search approaches, allowing them to build and evaluate the next generation IR applications. 

In particular, Terrier will be moving towards feature-based retrieval, in line with the increasing importance of the learning-to-rank paradigm in modern information retrieval where machine-learned ranking functions combining multiple features are deployed. To do so, Terrier will be supporting the efficient and effective extraction of query-independent and query-dependent features.

To support scalability and efficiency, Terrier's data structures have undergone a major enhancement to support advanced dynamic pruning techniques, as well as the development of applications requiring distributed and real-time indexing and retrieval such as Twitter search.

Finally, the growth of the Terrier platform over the past decade into exciting new areas such as MapReduce indexing and crowdsourcing entails increased functionality, but also increased platform complexity. To avoid software bloat, we are moving from a monolithic release structure to a system of periodic core releases and timely plugin expansions. The first such release will be the CrowdTerrier plugin, providing researchers with an out-of-the-box tool to achieve fast and cheap relevance assessments.

A more comprehensive account of the forthcoming Terrier releases is detailed in our paper below:

From Puppy to Maturity: Experiences in Developing Terrier. Craig Macdonald, Richard McCreadie, Rodrygo Santos and Iadh Ounis. In Proceedings of the SIGIR 2012 Workshop on Open Source Information Retrieval. Portland, Oregon, USA. August 2012

We hope to see many colleagues joining us to work towards the objectives of the platform and enriching its functionalities. As always, we welcome suggestions and any feedback on the roadmap in the run up to the forthcoming Terrier 4.0.