TerrierTeam

Terrier IR Platform v4.0 Released

2014-06-19T13:00:00.000+01:00

Terrier 4.0, the next version of the open source IR platform from the University of Glasgow (Scotland), was released on 18th June 2014.

Terrier 4.0 represents a major update over the previous 3.6 release, adding significant new features, including:

Real-time index structures to facilitate incremental indexing of new documents over time
Pluggable state-of-the-art index compression reduces the size of Terrier's index structures
Learning-to-rank support enables out-of-the-box supervised ranking models, trained using state-of-the-art approaches such as LambdaMART
A website search application is now provided, illustrating real-time crawling, indexing and retrieval within Terrier.

Terrier can be downloaded for free from http://terrier.org.

A full change log can be found at
http://terrier.org/docs/current/whats_new.html.

Learning to Rank Research Using Terrier - The Role of Features (Part 2)

2013-03-26T10:44:00.001+00:00

This is the second post in a two-part series addressing our recent research in learning to rank. While the previous blog post addressed the role of the sample within learning to rank, and its impact on effectiveness and efficiency, in this blog post, I'll be discussing the role of different features within the learning to rank process.

Features

With various IR approaches being proposed over the years, these have naturally formed features within learned approaches. Features can be calculated on the documents, in a query independent (e.g. URL length, PageRank) or query dependent (e.g. weighting or proximity models) manner. For instance, the LETOR learning to rank test collections deploy various (query dependent) weighting models calculated on different fields (body, title, anchor text, and URL). We have successfully been using the same four fields for representing Web documents in our participation in the TREC Web tracks.

The role of different weighting models - particularly those calculated on different fields - within learning to rank intrigued us and formed an article, About Learning Models with Multiple Query Dependent Features, that we recently published in Transactions on Information Systems. In particular, some learned models take the form of linear combinations of feature scores. In contrast, Robertson [CIKM 2004] warned against the linear combination of weighting model scores. Yet the LETOR test collections deploy weighting models calculated on each field (e.g. BM25 body, BM25 title, BM25 anchor text). Hence, among other research questions, we re-examined the role of field-based weighting models in the learning to rank era. Our findings on the TREC ClueWeb09 corpus showed that field-based models such as BM25F and PL2F are still important for effective learned models.

Our TOIS paper also shows how to efficiently calculate multiple query dependent features within an IR system. In particular, as the postings of an inverted index are compressed, it is expensive to calculate additional query dependent features once the top K sample has been identified, due to the cost of decompressing the relevant parts of the inverted index posting lists again. Instead, we show how the postings of documents that are inserted into the top K documents within a DAAT retrieval strategy can be "cloned", and retained decompressed in memory, such that additional query dependent features can be calculated in the Feature Extraction phase. We call this the fat framework, as it "fattens" the result set with the postings of the query terms for those documents. We have implemented this framework within the Terrier IR platform, and it will be released as part of the next major release of Terrier, as described in our OSIR 2012 paper.
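The idea of retaining postings for the top K documents can be illustrated with a minimal Python sketch. This is not Terrier's implementation: the toy in-memory index, the sum-of-term-frequencies first-pass score, and the function names daat_fat_retrieve and extra_features are all assumptions made for illustration. In a real system the postings would be read from compressed posting lists, which is what makes keeping a decompressed copy worthwhile.

```python
import heapq
from typing import Dict, List, Tuple

# Hypothetical toy index: term -> {doc_id: term frequency}.
Index = Dict[str, Dict[int, int]]
FatEntry = Tuple[float, int, Dict[str, int]]  # (score, doc_id, cloned postings)


def daat_fat_retrieve(index: Index, query_terms: List[str], k: int) -> List[FatEntry]:
    """Score documents one at a time and keep the top k with their postings."""
    doc_ids = sorted({d for t in query_terms for d in index.get(t, {})})
    heap: List[FatEntry] = []  # min-heap ordered by score, then doc_id
    for d in doc_ids:
        # "Clone" the postings of this document for every matching query term.
        postings = {t: index[t][d] for t in query_terms if d in index.get(t, {})}
        score = float(sum(postings.values()))  # stand-in first-pass weighting model
        if len(heap) < k:
            heapq.heappush(heap, (score, d, postings))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, d, postings))
    # Highest score first; ties broken by ascending doc_id.
    return sorted(heap, key=lambda e: (-e[0], e[1]))


def extra_features(fat_results: List[FatEntry]) -> Dict[int, List[float]]:
    """Compute further query dependent features from the retained postings only."""
    return {
        d: [float(len(p)), float(max(p.values()))]  # [matching terms, max tf]
        for _, d, p in fat_results
    }


toy_index = {
    "terrier": {1: 3, 2: 1, 4: 2},
    "retrieval": {2: 4, 3: 1, 4: 1},
}
fat = daat_fat_retrieve(toy_index, ["terrier", "retrieval"], k=2)
features = extra_features(fat)
```

Note that extra_features never touches the index: everything it needs was carried along in the "fattened" result set.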

Finally, query features are an increasingly important aspect within learning to rank. In contrast to (query independent or query dependent) document features, query features have the same value for each document ranked for a query. In this way, query features can be said to be document independent. Our CIKM 2010 and SIGIR 2011 papers on diversification used query features to decide on the ambiguity of a query, or to decide on the likely type of information need underlying different aspects of an ambiguous query. On the other hand, the role of query features is to allow learners based on regression trees (e.g. GBRT, LambdaMART) to customise branches of the learned model for different types of query. For instance, if query length is a query feature, then the learner might recognise that a query dependent proximity (document) feature is important for two-term queries, but not for one-term queries. In our CIKM 2012 short paper, we recognised the lack of a comprehensive study on the usefulness of query features for learning to rank. Our experiments combined the LambdaMART learning to rank technique with 187 different query features that were grouped into four types: pre-retrieval Query Performance Prediction, Query Concept Identification, Query Log Mining, and Query Type Classification. We found that over a quarter of the 187 query features could significantly improve the effectiveness of a learned model. The most promising query features were those from Query Type Classification, which identified the presence of entities in the query, suggesting that such features are useful for triggering sub-trees that promote entity homepages. Overall, we found query features could be employed to customise learned ranking models for queries with different popularity, length, difficulty, ambiguity, and related entities.
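The query length example can be made concrete with a small Python sketch. Everything in it is hypothetical: the feature layout (a weighting model score, a proximity score, then the query length), the hand-written toy_tree_score standing in for a learned regression tree, and the numbers. It only aims to show that a query feature is identical for every document of a query, yet can still change the ranking by selecting a branch.

```python
from typing import Dict, List


def build_feature_rows(query: str, doc_features: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Append the same query feature (query length in terms) to every document row."""
    query_features = [float(len(query.split()))]
    return {doc: feats + query_features for doc, feats in doc_features.items()}


def toy_tree_score(row: List[float]) -> float:
    """Hand-written stand-in for a learned tree: branch on the query feature."""
    weighting_score, proximity_score, query_length = row
    if query_length >= 2:
        # Proximity is only consulted for queries with at least two terms.
        return weighting_score + 0.5 * proximity_score
    return weighting_score


def rank(query: str, doc_features: Dict[str, List[float]]) -> List[str]:
    rows = build_feature_rows(query, doc_features)
    return sorted(rows, key=lambda doc: toy_tree_score(rows[doc]), reverse=True)


# Hypothetical document features: [weighting model score, proximity score].
docs = {"d1": [2.0, 1.0], "d2": [2.2, 0.0]}
two_term_ranking = rank("glasgow university", docs)
one_term_ranking = rank("glasgow", docs)
```

With these toy values, d1 overtakes d2 only for the two-term query, because that is the only branch in which the proximity feature contributes.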

Summary

There remains a great deal of black magic involved in the effective and efficient application of learning to rank within information retrieval. Indeed, my colleagues and I strongly believe that empirically derived best practices are an important part of information retrieval research. This series of blog posts has been aimed at addressing some of the aspects missing in the literature, and at providing insights into our recent research within this area.

Acknowledgements

This body of research would not have been possible without a number of co-authors and contributors: Iadh Ounis, Rodrygo Santos, Nicola Tonellotto (CNR, Italy) and Ben He (University of the Chinese Academy of Science).

Key References

About Learning Models with Multiple Query Dependent Features. Craig Macdonald, Rodrygo L.T. Santos, Iadh Ounis and Ben He. Transactions on Information Systems, 2013, in press.

On the Usefulness of Query Features for Learning to Rank. Craig Macdonald, Rodrygo Santos and Iadh Ounis. In Proceedings of CIKM 2012.

Efficient and effective retrieval using selective pruning. Nicola Tonellotto, Craig Macdonald and Iadh Ounis. In Proceedings of WSDM 2013.

The Whens and Hows of Learning to Rank. Craig Macdonald, Rodrygo Santos and Iadh Ounis. Information Retrieval Journal, 2012.

Effect of Dynamic Pruning Safety on Learning to Rank Effectiveness. Craig Macdonald, Nicola Tonellotto, and Iadh Ounis. In Proceedings of SIGIR 2012.

Explicit web search result diversification

2013-03-25T15:12:00.000+00:00

A couple of weeks ago I successfully defended my PhD thesis at the School of Computing Science of the University of Glasgow. The thesis, entitled “Explicit web search result diversification”, was unconditionally approved with no corrections by the examination board.

The thesis tackles the problem of ambiguity in web search queries. In particular, with the enormous size of the Web, a misunderstanding of the information need underlying an ambiguous query can misguide the search engine, ultimately leading the user to abandon the originally submitted query. To overcome this problem, a sensible approach is to diversify the documents retrieved for the user's query. As a result, the likelihood that at least one of these documents will satisfy the user's actual information need is increased.

In the thesis, we argue that an ambiguous query should be seen as representing not one, but multiple information needs. Based upon this premise, we propose xQuAD – Explicit Query Aspect Diversification, a novel probabilistic framework for search result diversification. In particular, the xQuAD framework naturally models several dimensions of the search result diversification problem in a principled yet practical manner. To this end, the framework represents the possible information needs underlying a query as a set of keyword-based sub-queries. Moreover, xQuAD accounts for the overall coverage of each retrieved document with respect to the identified sub-queries, so as to rank highly diverse documents first. In addition, it accounts for how well each sub-query is covered by the other retrieved documents, so as to promote novelty – and hence penalise redundancy – in the ranking. The framework also models the importance of each of the identified sub-queries, so as to appropriately cater for the interests of the user population when diversifying the retrieved documents. Finally, since not all queries are equally ambiguous, the xQuAD framework caters for the ambiguity level of different queries, so as to appropriately trade off relevance for diversity on a per-query basis.
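The dimensions described above (coverage, novelty, sub-query importance, and a per-query relevance/diversity trade-off) can be sketched as a greedy re-ranking loop in Python. This is only an illustrative sketch: the function name xquad_rerank, the dictionaries of toy probabilities, and the exact form of the scoring expression are assumptions here, modelled on the commonly published form of the xQuAD objective; the thesis should be consulted for the authoritative definition.

```python
from typing import Dict, List


def xquad_rerank(
    relevance: Dict[str, float],            # estimated relevance of each document to the query
    coverage: Dict[str, Dict[str, float]],  # sub-query -> {document: coverage estimate}
    importance: Dict[str, float],           # estimated importance of each sub-query
    lam: float,                             # trade-off: 0 = relevance only, 1 = diversity only
    n: int,
) -> List[str]:
    """Greedily select n documents, rewarding coverage of still-unsatisfied sub-queries."""
    selected: List[str] = []
    remaining = set(relevance)
    # Probability that each sub-query is NOT yet satisfied by the selected documents.
    not_covered = {s: 1.0 for s in importance}

    def score(doc: str) -> float:
        diversity = sum(
            importance[s] * coverage[s].get(doc, 0.0) * not_covered[s]
            for s in importance
        )
        return (1.0 - lam) * relevance[doc] + lam * diversity

    while remaining and len(selected) < n:
        best = max(sorted(remaining), key=score)  # sorted() gives deterministic ties
        selected.append(best)
        remaining.remove(best)
        for s in importance:
            # Novelty: sub-queries already covered by `best` count for less next time.
            not_covered[s] *= 1.0 - coverage[s].get(best, 0.0)
    return selected


# Hypothetical ambiguous query with two sub-queries ("car" and "animal").
rel = {"d1": 0.9, "d2": 0.8, "d3": 0.5}
cov = {"car": {"d1": 0.9, "d2": 0.8}, "animal": {"d3": 0.9}}
imp = {"car": 0.5, "animal": 0.5}
diversified = xquad_rerank(rel, cov, imp, lam=0.5, n=3)
relevance_only = xquad_rerank(rel, cov, imp, lam=0.0, n=3)
```

In this toy example, d3 is promoted above d2 once d1 has been selected, because d1 already covers the "car" sub-query and d3 is the only document covering "animal". Setting lam to 0 recovers the plain relevance ordering.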

The xQuAD framework is general and can be used to instantiate several diversification models, including the most prominent models described in the literature. In particular, within xQuAD, each of the aforementioned dimensions of the search result diversification problem can be tackled in a variety of ways. In this thesis, as additional contributions besides the xQuAD framework, we introduce novel machine learning approaches for addressing each of these dimensions. These include a learning to rank approach for identifying effective sub-queries as query suggestions mined from a query log, an intent-aware approach for choosing the ranking models most likely to be effective for estimating the coverage and novelty of multiple documents with respect to a sub-query, and a selective approach for automatically predicting how much to diversify the documents retrieved for each individual query. In addition, we perform the first empirical analysis of the role of novelty as a diversification strategy for web search.

As demonstrated throughout the thesis, the principles underlying the xQuAD framework are general, sound, and effective. In particular, to validate the contributions of this thesis, we thoroughly assess the effectiveness of xQuAD under the standard experimentation paradigm provided by the diversity task of the TREC 2009, 2010, and 2011 Web tracks. The results of this investigation demonstrate the effectiveness of our proposed framework. Indeed, xQuAD attains consistent and significant improvements in comparison to the most effective diversification approaches in the literature, and across a range of experimental conditions, comprising multiple input rankings, multiple sub-query generation and coverage estimation mechanisms, as well as queries with multiple levels of ambiguity.

These investigations led to the publication of 12 peer-reviewed research papers and 5 evaluation forum reports directly related to the thesis. Moreover, the thesis opened up directions for other researchers, who deployed and extended the xQuAD framework for different applications, and inspired a series of workshops on Diversity in Document Retrieval as well as a research track at the internationally renowned NTCIR forum. From a practical perspective, xQuAD has been subjected to scrutiny from the research community as a regular contender in both TREC and NTCIR. As the winning entry in all editions of the diversity task of the TREC Web track (best cat. B submission in TREC 2009 and TREC 2010; best overall submission in TREC 2011 and TREC 2012), we believe that the xQuAD framework has secured its place in the state-of-the-art.

The thesis is now available online at http://theses.gla.ac.uk/4106/. In addition, a reference implementation of the xQuAD framework will feature in the next major release of the open-source Terrier Information Retrieval Platform.

Learning to Rank Research using Terrier - The Importance of the Sample (Part 1)

2013-03-21T16:34:00.001+00:00

This is the first of two blog posts addressing some of our recent research in learning to rank. In particular, in recent years, the information retrieval (IR) field has experienced a paradigm shift in the application of machine learning techniques to achieve effective ranking models. A few years ago, we were using hill-climbing optimisation techniques such as simulated annealing to optimise the parameters in weighting models, such as BM25 or PL2, or latterly BM25F or PL2F. Instead, driven first by commercial search engines, IR is increasingly adopting a feature-based approach, where various mini-hypotheses are represented as numerical features, and learning to rank techniques are deployed to decide their importance in the final ranking formulae.

The typical approach for ranking is described in the following figure from our recently presented WSDM 2013 paper:

Phases of a retrieval system deploying learning to rank, taken from Tonellotto et al, WSDM 2013.

In particular, there are typically three phases:

Top K Retrieval, where a number of top-ranked documents are identified, which is known as the sample.
Feature Extraction - various features are calculated for each of the sample documents.
Learned Model Application - the learned model obtained from a learning to rank technique re-ranks the sample documents to better satisfy the user.
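The three phases listed above can be sketched as a tiny Python pipeline. All names here (top_k_retrieval, extract_features, apply_learned_model), the toy scores, and the use of a linear model as the "learned model" are assumptions for illustration; they are not Terrier APIs, and a real deployment would use a model produced by a learning to rank technique.

```python
from typing import Callable, Dict, List

Scorer = Callable[[str, str], float]  # (query, doc_id) -> score


def top_k_retrieval(query: str, doc_ids: List[str], first_pass: Scorer, k: int) -> List[str]:
    """Phase 1: rank candidates with a cheap weighting model and keep the top K (the sample)."""
    ranked = sorted(doc_ids, key=lambda d: first_pass(query, d), reverse=True)
    return ranked[:k]


def extract_features(query: str, sample: List[str], feature_fns: List[Scorer]) -> Dict[str, List[float]]:
    """Phase 2: compute one feature vector per sample document."""
    return {d: [fn(query, d) for fn in feature_fns] for d in sample}


def apply_learned_model(features: Dict[str, List[float]], weights: List[float]) -> List[str]:
    """Phase 3: re-rank the sample; a linear model stands in for the learned model."""
    def score(vec: List[float]) -> float:
        return sum(w * x for w, x in zip(weights, vec))
    return sorted(features, key=lambda d: score(features[d]), reverse=True)


# Hypothetical per-document scores for a single query.
weighting = {"d1": 3.0, "d2": 2.0, "d3": 1.0, "d4": 0.5}
link_prior = {"d1": 0.1, "d2": 0.9, "d3": 0.5, "d4": 0.99}
by_weighting: Scorer = lambda q, d: weighting[d]
by_prior: Scorer = lambda q, d: link_prior[d]

sample = top_k_retrieval("example query", list(weighting), by_weighting, k=3)
features = extract_features("example query", sample, [by_weighting, by_prior])
final_ranking = apply_learned_model(features, weights=[1.0, 2.0])
```

Note that d4 has the highest link prior but never reaches the learned model, because it was not in the sample. This is why the choice of K matters for effectiveness.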

The Sample

The set of top K documents selected within the first retrieval phase is called the sample by Liu, even though the selected documents are not iid. Indeed, in selecting the sample, Liu suggested that the top K documents ranked by a simple weighting model such as BM25 is not the best, but is sufficient for effective learning to rank. However, the size of the sample - i.e. the number of documents to be re-ranked by the learned model - is an important parameter: with fewer documents, the first pass retrieval can be made more efficient by the use of dynamic pruning strategies (e.g. WAND); on the other hand, too few documents may result in insufficient relevant documents being retrieved, and hence effectiveness being degraded.

Our article The Whens and Hows of Learning to Rank in the Information Retrieval Journal studied the sample size parameter for many topic sets and learning to rank techniques. For the mixed information needs on the TREC ClueWeb09 collection, we found that while a sample size of 20 documents was sufficient for effective performance according to ERR@20, larger sample sizes of thousands of documents were needed for effective NDCG@20; for navigational information needs, predominantly larger sample sizes (up to 5000 documents) were needed. Moreover, the particular document representation used to identify the sample was shown to have an impact on effectiveness: navigational queries were found to be considerably easier (requiring smaller samples) when anchor text was used, but for informational queries, the opposite was observed. In the article, we examined these issues in detail, across a number of test collections and learning to rank techniques, as well as investigating the role of the evaluation measure and its rank cutoff for listwise techniques. For in-depth details and conclusions, see the IR Journal article.

Dynamic pruning strategies such as WAND are generally configured to be safe-to-rank-K, which means that the effectiveness of the sample is not degraded. Alternatively, they can be configured to prune in an unsafe, more aggressive manner, which can degrade effectiveness by changing the retrieved documents. While the safety of WAND has previously been shown not to have great impact on the effectiveness of the retrieved (sample) documents, in our SIGIR 2012 poster, we showed that the impact on the effectiveness of the documents after re-ranking by application of a learned model could be marked. Moreover, this poster also investigated biases in the retrieved documents that are manifest in WAND when configured for unsafe pruning. For further details, please see our SIGIR 2012 poster.

The number of documents necessary in the sample clearly varies from query to query. In our WSDM 2013 paper, we proposed selective pruning, whereby the size of the sample and the aggressiveness of the WAND pruning strategy used to create it are altered on a per-query basis. This permits retrieval that is both effective and efficient. Indeed, by using selective pruning, we showed that mean response time could be improved by 36%, and the response times experienced by the slowest 10% of queries could be reduced by 50%, while still maintaining significantly high effectiveness. The full paper investigates the effect of unsafe pruning on both efficiency and effectiveness, as well as different ways to make the decision for selective pruning - see the WSDM 2013 paper for more details.

In the next blog post (Part 2), I'll be looking at more details about the choice of features within learning to rank.

Key References

Efficient and effective retrieval using selective pruning. Nicola Tonellotto, Craig Macdonald and Iadh Ounis. In Proceedings of WSDM 2013.

The Whens and Hows of Learning to Rank. Craig Macdonald, Rodrygo Santos and Iadh Ounis. Information Retrieval Journal, 2012.

Effect of Dynamic Pruning Safety on Learning to Rank Effectiveness. Craig Macdonald, Nicola Tonellotto, and Iadh Ounis. In Proceedings of SIGIR 2012.

SMART: An open source framework for searching the physical world

2012-07-27T14:39:00.000+01:00

Some of our readers are probably aware of our new project SMART, which aims to develop a new technology for the real-time indexing and retrieval of sensor and social streams. This three-year project is funded by the European Commission under the Seventh Framework Programme (grant number 287583). The project, which started in November 2011, has already received extensive national and international press coverage in online and print news over the last month. The BBC will shortly be broadcasting a television piece about the project.

The name of the project and the resulting search engine, SMART, acknowledges the vision of the Internet of Things in general, and the concept of smart cities in particular. Indeed, SMART builds on the growing trend of smart cities, where in addition to physical infrastructure (roads, buildings), digital knowledge infrastructure is deployed to serve the needs of the citizens and local governments. The backbone of the digital knowledge infrastructure is mainly composed of sensors such as cameras, microphone arrays, or other environmental sensors, from weather to parking sensors. For example, in "smart cities", drivers can be notified where it is good to park their car or where to avoid traffic jams in the city centre at any time of the day. The main idea of the SMART project is to connect these sensors to the Internet and have search technologies to allow citizens to benefit from the information that these sensors can provide in real-time.

The SMART search engine builds upon the Terrier Information Retrieval platform, and exemplifies our recent move towards building new, separate and tailored products on top of the Terrier platform. In particular, Terrier has been enhanced and expanded with real-time indexing and a scalable distributed architecture, allowing it to process and handle a large volume of continuous and parallel streams.

SMART is a multi-disciplinary project in nature, encompassing state-of-the-art technologies from audio & video processing, social search and reasoning. Building upon these technologies, SMART analyses the input from sensors in real-time, for example to detect large crowds, or if live music can be heard. These can be compared with recent posts on social networks from the same area, to see whether the system can learn more about what is happening in the area around the sensors. By analysing the sensors across multiple locations within the city, when a user asks “what’s happening near me”, the system has some idea of which locations have the most interesting events.

Clearly, making real-world events searchable can have privacy/ethics implications. In fact, never before in our research have we been confronted with such a dichotomy between what is technologically feasible and what we conceive to be ethical. That's why we and our partners in the project are carefully considering privacy issues in our research. Indeed, we are closely working with various national Data Protection Authorities (DPAs) (i) to ensure that we don’t overstep the legal or ethical boundaries of privacy and (ii) to provide guidelines for the ethical implications of the SMART technologies and help prospective deployers to use/deploy SMART in a legal, ethical, and friendly manner. Interested readers can consult the first issue of the SMART Newsletter for further details about our ongoing efforts towards the privacy issue.

While we will be trialling the SMART search technology in The City of Santander (Spain), the key infrastructure of SMART (including the search components based on Terrier) will be made available as open source, encapsulating a vision whereby other smart cities can easily become involved and benefit from the project's outcomes. We expect the first release of the SMART search technology to become available as open source under the Mozilla Public License (MPL) 2.0 by the end of 2012. By releasing parts of SMART as open source, we aim to allow the formation of a community of early adopters that will be key for evaluating and sustaining the project.

With this in mind, we have just published a paper in the SIGIR 2012 Open Source Information Retrieval (OSIR 2012) workshop describing our current progress in the project as well as the open source vision of the project:

SMART: An open source framework for searching the physical world. M-Dyaa Albakour, Craig Macdonald, Iadh Ounis, Aristodemos Pnevmatikakis and John Soldatos. In Proceedings of the SIGIR 2012 Workshop on Open Source Information Retrieval. Portland, Oregon, USA. August 2012.

As always, we welcome comments and contributions from smart cities, community members and developers to the SMART vision.

From Puppy to Maturity: Experiences in Developing Terrier

2012-07-25T16:29:00.001+01:00

We will be taking part in the SIGIR 2012 Workshop on Open Source Information Retrieval. In particular, we have published a paper on the Terrier open source information retrieval platform, detailing the vision behind the platform, some recent developments in Terrier, as well as a roadmap for future releases.

As always, our vision for the Terrier platform is to continue empowering researchers and practitioners in information retrieval (IR) with up-to-date, easily adaptable, effective and scalable indexing and search approaches, allowing them to build and evaluate the next generation IR applications.

In particular, Terrier will be moving towards feature-based retrieval, in line with the increasing importance of the learning-to-rank paradigm in modern information retrieval where machine-learned ranking functions combining multiple features are deployed. To do so, Terrier will be supporting the efficient and effective extraction of query-independent and query-dependent features.

To support scalability and efficiency, Terrier's data structures have undergone a major enhancement to support advanced dynamic pruning techniques, as well as the development of applications requiring distributed and real-time indexing and retrieval such as Twitter search.

Finally, the growth of the Terrier platform over the past decade into exciting new areas such as MapReduce indexing and crowdsourcing entails increased functionality, but also platform complexity. To avoid software bloat, we are moving from a monolithic release structure, to a system of periodic core releases and timely plugin expansions. The first such release will be the CrowdTerrier plugin, providing researchers with an out-of-the-box tool to achieve fast and cheap relevance assessments.

A more comprehensive account of the forthcoming Terrier releases is detailed in our paper below:

From Puppy to Maturity: Experiences in Developing Terrier. Craig Macdonald, Richard McCreadie, Rodrygo Santos and Iadh Ounis. In Proceedings of the SIGIR 2012 Workshop on Open Source Information Retrieval. Portland, Oregon, USA. August 2012

We hope to see many colleagues joining us to work towards the objectives of the platform and enriching its functionalities. As always, we welcome suggestions and any feedback on the roadmap in the run up to the forthcoming Terrier 4.0.

Efficiency, Effectiveness, Medical Search, Dataset Development and Crowdsourcing at SIGIR 2012

2012-06-25T20:28:00.000+01:00

The TerrierTeam will be well represented at SIGIR 2012 this year with a full paper, four posters, a demonstration and a workshop, covering a wide range of disciplines within the field of information retrieval. For those of you interested in Web search efficiency, we have a number of contributions to look for. Our full paper Learning to Predict Response Times for Online Query Scheduling defines the new area of query efficiency prediction. In particular, it postulates that not every query takes the same time to complete, particularly where efficient dynamic pruning strategies such as WAND are used to reduce retrieval latency. In our paper, we show and explain why queries with similar properties (e.g. posting list lengths) can have markedly different response times. We use these explanations to propose a learned approach for query efficiency prediction that can accurately predict the response time of a query before it is executed. Furthermore, we show that using query efficiency prediction can markedly increase the efficiency of query routing within a search engine that uses multiple replicated indices. Relatedly, our poster Scheduling Queries Across Replicas builds upon our work on query efficiency prediction, to show how a replicated and distributed search engine can be improved by the application of response time predictions. In particular, the response time predictions are used to estimate the workload of each replica of each index shard. Then each newly arrived query can be routed to the replica of each index shard that will be ready to process the query earliest.
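The routing idea described for Scheduling Queries Across Replicas can be sketched in a few lines of Python. This is a simplified illustration for a single index shard, assuming the predicted response times are already available; the function name route_query, the replica names, and the timings are hypothetical and do not come from the paper.

```python
from typing import Dict, List


def route_query(predicted_time: float, ready_at: Dict[str, float], now: float) -> str:
    """Send the query to the replica expected to be free earliest, then update its workload."""
    # A replica that is already idle is "ready" as of the current time.
    best = min(ready_at, key=lambda r: max(ready_at[r], now))
    start = max(ready_at[best], now)
    ready_at[best] = start + predicted_time  # estimated time at which it becomes free again
    return best


# Two replicas of one shard; four queries arrive together at time 0.
replicas = {"r1": 0.0, "r2": 0.0}
predicted_times = [5.0, 1.0, 1.0, 1.0]  # hypothetical response time predictions
assignments: List[str] = [route_query(t, replicas, now=0.0) for t in predicted_times]
```

In this toy run, the one slow query occupies r1 while the three fast queries are all routed to r2, so both replicas finish sooner than they would under simple round-robin assignment.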

At SIGIR this year we also present recent work examining both efficiency and effectiveness. Dynamic pruning strategies, such as WAND, can increase efficiency by omitting the scoring of documents that can be guaranteed not to make the top-K retrieved set - a feature known as safeness. Broder et al. showed how WAND could be made more efficient by relaxing the safeness guarantee, with little impact on the top-ranked documents. Through experiments on the TREC ClueWeb09 corpus and 33 query dependent and query independent features, our poster Effect of Dynamic Pruning Safety on Learning to Rank Effectiveness shows that relaxing safeness to aid efficiency can have an unexpectedly large impact on retrieval effectiveness when combined with modern learning to rank models, in contrast to the earlier work by Broder et al. In particular, we show that inherent biases by unsafe WAND towards documents with lower docids can markedly impact the effectiveness of learned models.

Those interested in the Medical search domain, in particular participants in the TREC Medical track, will be interested in our paper entitled Exploiting Term Dependence while Handling Negation in Medical Search. We show that it is important to handle negation in medical records - in particular, when searching for cohorts (groups of patients) with specific symptoms, our approach ensures that patients known not to have exhibited particular symptoms are not retrieved. Our results demonstrate that appropriate negation handling can increase retrieval effectiveness, particularly when the dependence between negated terms is considered using a term dependence model from the Divergence From Randomness framework.

Our poster On Building a Reusable Twitter Corpus tackles an important issue raised during the creation of the Tweets11 dataset as part of the TREC Microblog track, namely how reusable Tweets11 is, given the dynamics of Twitter. Our poster shows that corpus degradation due to deleted tweets does not affect the ranking of systems that participated in the TREC 2011 Microblog track. Meanwhile, we are also demonstrating the first release of a new extension to our Terrier IR platform, namely CrowdTerrier, an infrastructure addition to Terrier that enables relevance assessments to be created in a fast semi-automatic manner using crowdsourcing. CrowdTerrier will be made available for download soon.

Finally, together with a group representing six open source IR systems, we are involved in the organisation of a SIGIR'12 workshop on Open Source Information Retrieval. The workshop aims to provide a forum for users and authors of open source IR tools to get together, to work together to build OpenSearchLab, an open source, live and functioning, online web search engine for research purposes, and to discuss their joint future.

A SMART way to Search your City

2012-06-19T15:43:00.000+01:00

TerrierTeam is currently expanding its outreach into social and sensor-based search systems as part of the ongoing SMART EU-funded project (FP7 287583). SMART aims to develop an open source search framework for multimedia data stemming from the physical world and social streams such as Twitter. The end-goal is to be able to answer location and time-sensitive queries such as “where can I go to listen to live music in the city centre tonight?” or “where are my friends hanging out in the city?” by augmenting social media signals with live city sensor information.

Our role in the SMART project is to develop fast and effective real-time search from the flood of information provided by social and city sensor streams on top of our open-source Terrier information retrieval platform. Indeed, a real-time Twitter search demo illustrating incremental and distributed indexing and real-time retrieval in Terrier is now available. Try it at: http://demos.terrier.org/SMART/twittersearch/

SMART has seen wide-ranging national, European and international coverage in online and print news media over the last week. Indeed, we are tracking over 100 articles and counting! Some sample articles can be found below:

A more detailed list of recent press coverage can be found at

http://www.smartfp7.eu/content/media-coverage-smart.

Terrier 3.5 released

2011-06-16T17:41:00.009+01:00

Today, we are proud to announce a brand new release of Terrier, our state-of-the-art open source information retrieval platform. Terrier 3.5 represents a significant update over its previous version (Terrier 3.0), including:

Document-at-a-time (DAAT) retrieval for large indices
Refactored tokenisation for enhanced multi-language support
Upgraded Hadoop support to version 0.20
Synonym support in query language and retrieval
Out-of-the-box indexing support for query-biased summaries and improved example web-based interface
New, 2nd generation DFR models as well as other recent effective information-theoretic models
Fully revised and improved documentation
Many more JUnit tests (now 300+)

Check out the full change log for this release and upgrade to Terrier 3.5!

Many thanks to everyone at the TerrierTeam and all Terrier contributors for their hard work making this release possible!

ECIR 2011 + DDR 2011 in Dublin

2011-04-27T12:42:00.010+01:00

Last week, a few of us attended ECIR 2011 in Dublin. The conference was a resounding success both in terms of its program and organisation. Compared to last year, the event was very well attended with about 250 delegates registered for the conference and/or its satellite events. The majority of delegates were from Ireland and the United Kingdom.

Workshops

The kick-off was on Monday, with a selection of workshops and tutorials at the fabulous Guinness Storehouse. We attended the Diversity in Document Retrieval (DDR 2011) workshop, jointly organised by Craig Macdonald, Jun Wang, and Charlie Clarke.

The DDR workshop was at times a standing-room-only event and appeared to be the largest workshop of the conference. It was structured around three broad themes: evaluation, modelling, and applications. Besides good keynotes by Tetsuya Sakai and Alessandro Moschitti, the workshop featured technical and position paper presentations, as well as a poster session and a breakout group discussion on all three workshop themes. While there was no agreement on a possible "killer application" for diversity, there was a consensus that diversity is best described or seen as the lack of context. In addition, a few key points arose across the boundaries of the tackled themes:

Representing diversity
How to best represent the possible multiple information needs underlying a query? Should this representation reflect the interests of the user population, or should it be itself diverse?
Measuring diversity
What does diversity mean and how should it be promoted in different scenarios? The workshop featured some ideas for applications, including expert search, geographical IR, and graph summarisation.
Unifying diversity
How to diversify across multiple search scenarios (e.g., multiple verticals of a search engine)? How to convey a summary relevant to multiple information needs in a single page of results?

Some of these ideas are currently being investigated as part of the NTCIR-9 Intent task. Charlie was also keen to consider these questions in future incarnations of the diversity task in the TREC Web track. During the workshop, Rodrygo presented our position paper entitled "Diversifying for multiple information needs". The full DDR workshop proceedings are available online.

While we did not attend it, it is worth noting that the Information Retrieval Over Query Sessions workshop, which was held at the same time as DDR, also received very positive feedback from its attendees.

The workshops were followed by an excellent welcome reception where the least we could say is that Guinness was not in short supply.

Conference

On Tuesday, the main conference took over with a diverse (no pun intended) program. The conference started with a thoughtful keynote by Kalervo Järvelin who urged the information retrieval community to see beyond the [search] box. The keynote led to some very interesting discussions about whether IR is a science or a technology (i.e. mostly about engineering). We would like to believe that it is science, although some delegates argued (sadly) for the opposite.

The second keynote was given by Evgeniy Gabrilovich, winner of this year's KSJ Award. Evgeniy provided a very comprehensive overview of the fascinating computational advertising field, highlighting the current state-of-the-art and possible future research directions. We were encouraged to hear about the Yahoo! Faculty Research and Engagement Program (FREP), which might allow academics to access the necessary datasets to conduct research in a field that has been thus far the sole territory of researchers based in industry.

The last keynote talk was superbly given by Thorsten Joachims about the value of user feedback. Thorsten convincingly argued for the importance of collecting user feedback as an intrinsic part of both the retrieval and learning processes. The talk highlighted how user feedback could improve the quality of retrieval and by how much. We hope that the slides will be made publicly available at some point.

As for the rest of the program, there were two types of papers/presentations: full papers were presented in 30 min, while short papers had only 15 min. As usual, the quality of papers (or at least the presentations) varied from the outstanding to the less good. One suggestion for future ECIR conferences is to limit all the talks to at most 20 min, encouraging conciseness and pushing the speakers to focus on the "message out of the bottle". Indeed, some talks appeared to be exceedingly long with respect to their informative content. While we see the value of giving a 30 min slot to a 10-page ACM-style paper, there does not seem to be a valid reason for giving that much time to a (comparatively much shorter) 12-page LNCS-style paper.

It was interesting to see several Twitter-related papers in the program, suggesting that the community will find the upcoming new TREC 2011 Microblog track and its corresponding shared dataset particularly useful/helpful. The theme of crowdsourcing was also highly featured in the conference, with several papers showing how cheap and reliable relevance assessments could be obtained through the Amazon Mechanical Turk or similar services. Finally, we were very pleased to see many presented papers using our open source Terrier software in their experiments.

Overall, a few papers caught our attention and were particularly interesting:

On the contributions of topics to system evaluation
Steve Robertson
Caching for realtime search - in our opinion by far the best paper/presentation of the conference
Edward Bortnikov, Ronny Lempel and Kolman Vornovitsky
Are semantically related links effective for retrieval?
Marijn Koolen and Jaap Kamps
A methodology for evaluating aggregated search results - Excellent paper/presentation that was also awarded the best student paper award
Jaime Arguello, Fernando Diaz, Jamie Callan and Ben Carterette
Design and implementation of relevance assessments using crowdsourcing
Omar Alonso and Ricardo Baeza-Yates
The power of peers
Nick Craswell, Dennis Fetterly and Marc Najork
Automatic people tagging for expertise profiling in the enterprise
Pavel Serdyukov, Mike Taylor, Vishwa Vinay, Matthew Richardson and Ryen W. White
What makes re-finding information difficult? A study of email re-finding
David Elsweiler, Mark Baillie and Ian Ruthven

Of course, we also recommend our own paper, which was nominated for best paper award, and for which we received excellent feedback:

Learning models for ranking aggregates
Craig Macdonald and Iadh Ounis

The program also featured a busy poster and demo session. We liked the work of Gerani, Keikha, Carman and Crestani on identifying personal blogs using the TREC Blog track, and that of Perego, Silvestri and Tonellotto, which suggests that document length can be quantized from docids without loss of retrieval effectiveness. There were also several interesting demos that caught our eye:

ARES - A retrieval engine based on sentiments: Sentiment-based search result annotation and diversification - which used our xQuAD framework for diversifying sentiments
Gianluca Demartini
Conversation Retrieval from Twitter
Matteo Magnani, Danilo Montesi, Gabriele Nunziante and Luca Rossi
Finding Useful Users on Twitter: Twittomender the Followee Recommender - addressed the Who to Follow (WTF?) task on Twitter
John Hannon, Kevin McCarthy and Barry Smyth

The ECIR organisers hosted a particularly sumptuous conference banquet at the impressive, unique and beautiful venue of The Village at Lyons Demesne in County Kildare. The journey to the village was a welcome break from the hotel setting of the conference and its technical program.

On the last day of the conference, an Industry Day event ran concurrently with the technical research sessions. However, we only had the chance to go and see the excellent talk by Flavio Junqueira on the practical aspects of caching in search engine deployments. There is a comprehensive summary of the whole Industry program in this blog post. We believe that the planning of the Industry Day event in parallel to the technical sessions was detrimental to attendance. Next year, the Industry Day will be held after the conference ends.

Finally, we would like to thank the organisers of ECIR 2011 for a very enjoyable conference, and a great stay in Dublin. ECIR 2012 will be held in Barcelona, Spain, between 1st and 5th April 2012. We hope to see you all there.

TREC 2010 Roundup

2010-11-26T12:39:00.026+00:00

Back from another successful TREC conference on the NIST campus. 2010 is a transition year, with the end of old tracks and the proposition of new ones. Indeed, TREC is moving with the times, looking at new data sources and test collections, as well as new evaluation strategies.

Outwith the old . . .

For example, TREC 2010 marks the end of the Relevance Feedback and Blog tracks. While TREC 2010 will be the last year of the Relevance Feedback track, the Blog track, which has been running for the last 5 years, is now morphing into a new Microblog track, investigating real-time and social search tasks in Twitter. A brand new test collection possibly containing 2 months of tweets is planned, with linked web-pages and a partial follower graph. Join the Microblog track googlegroup to obtain the latest updates and follow the Microblog track on Twitter.

TREC 2011 will also witness the initiation of the new Medical Records track, dedicated to investigating approaches to access free-text fields of electronic medical records.

On the test collection front, the Web track is also forward planning a new large-scale dataset to replace ClueWeb09. Indications are that this new dataset will be about the same scale as ClueWeb09 but might provide more temporal information (multiple versions of a page or site over time). Moreover, we have suggested that this might be the heart of a larger dataset comprised of multiple parallel/aligned corpora, for example blogs and news feeds covering the same timeframe.

TREC Assessors, Relevant?

In terms of evaluation, 2010 marks the first year where evaluation judgments were crowdsourced using an online worker marketplace, as opposed to relying on TREC assessors, the participants themselves, or a select group of experts. Indeed, both the Blog track and the Relevance Feedback track crowdsourced some of their evaluation (although the Relevance Feedback track suffered many setbacks and its crowdsourcing process is still incomplete). Furthermore, to investigate the challenges in this new field of crowdsourcing, a specific Crowdsourcing track has been created and will run in 2011. More details can be found here.

Themes

As usual, themes emerged within the various tracks. Learned approaches were far more prevalent this year, now that training data was available for the ClueWeb09 dataset. Indeed, the Web track was dominated by trained models mostly based on link and proximity search features. Diversification, on the other hand, remains a challenging task, with the top groups leaving their initial rankings as is. An outstanding exception is our own approach using the xQuAD framework under a selective diversification regime, which further improves our strongly performing adhoc baseline. Craig Macdonald presented our work in the Web track plenary session.

In the Blog track, voting model-based and language modeling approaches proved popular for blog distillation. For faceted blog ranking, participants employed variants of facet dictionaries to either train a classifier or as features for learning. For the top news task, participants deployed a wide variety of methods to rank news stories in a real-time setting, from probabilistic modeling to blog post voting with historical evidence. Richard McCreadie presented our work on the Blog track as a poster during TREC 2010, which attracted very interesting discussions.

During the TREC conference, Iadh Ounis, Richard McCreadie and others did a fair amount of tweeting. You can follow some bits of the TREC conference through the #trec2010 hashtag.

CIKM 2010 in Toronto, ON, Canada

2010-11-03T18:03:00.006+00:00

I'm back from Toronto, where a few of us attended the CIKM 2010 conference last week. On Friday, I presented our paper on "Selectively diversifying Web search results", joint work with Craig Macdonald and Iadh Ounis. This work extends our successful participation in the diversity task of the TREC 2009 Web track, by investigating the need for search result diversification in the first place. In particular, we proposed a novel supervised learning approach to predict not only whether promoting diversity is beneficial, but also how much diversification should be applied to attain an effective retrieval performance on a per-query basis. After thorough, large-scale experiments with over 900 query features, we found that our selective approach can substantially improve existing diversification approaches, including our state-of-the-art xQuAD framework. Nonetheless, we believe the significance of our contribution goes beyond these successful results. Indeed, it was with great pleasure that we heard from the NTCIR organisers that NTCIR-9 will run an Intent task, aimed (among other things) at selectively diversifying search results, an area where we are proud to be pioneers.

Besides our own paper, a few other papers caught my attention:

Web Search Solved? All Result Rankings the Same? by Hugo Zaragoza, B. Barla Cambazoglu and Ricardo Baeza-Yates
Reverted Indexing for Feedback and Expansion, by Jeremy Pickens, Matthew Cooper and Gene Golovchinsky
Rank Learning for Factoid Question Answering with Linguistic and Semantic Constraints, by Matthew Bilotti, Jonathan Elsas, Jaime Carbonell and Eric Nyberg
Organizing Query Completions for Web Search, by Alpa Jain and Gilad Mishne
Clickthrough-Based Translation Models for Web Search: from Word Models to Phrase Models, by Jianfeng Gao, Xiaodong He and Jian-Yun Nie

The conference also featured great keynotes, of which those by Jamie Callan and Susan Dumais deserve a particular mention. Jamie talked about his view for the future of search, in which search engines capable of fully leveraging the structure of queries and documents would enable more sophisticated applications built on top of them. Susan addressed the temporal evolution of Web content, how it impacts the way users access this content, and how test collections should account for it. For more details, have a look at the excellent posts by Gene Golovchinsky on Jamie and Susan's talks.

Last but not least, many of us were involved in promoting the next edition of CIKM, to be held here in Glasgow. There was a lot of excitement from the several people that visited our booth, and also during the hand-over talk at the end of the conference. Well done Jon, Mary, Craig, and Iadh for the hard work! The arrangements for CIKM 2011 are well advanced, and the call for papers is now online. You can also follow the latest news about CIKM 2011 on Twitter, Facebook, LinkedIn, and Lanyrd. We look forward to welcoming you all to Glasgow next year!

Terrier Team at SIGIR 2010 in Geneva

2010-07-20T08:49:00.006+01:00

SIGIR 2010 has just started in Geneva. From the TerrierTeam, Richard and I are attending.

On Monday, Richard presented his PhD topic, Leveraging User-generated Content for News Search at the doctoral consortium.

Later, at the Web Ngram workshop, I'll be presenting a paper on Global Statistics in Proximity Weighting Models.

About the same time, Richard will be presenting at the Crowdsourcing for Search Evaluation workshop. His paper on Crowdsourcing a News Query Classification Dataset examines the effectiveness of different interfaces for having Mechanical Turkers classify queries as news-related or not.

Last but not least, and continuing on our proximity theme, Nicola Tonellotto from CNR is presenting our joint work titled Efficient Dynamic Pruning with Proximity Support at the Large Scale & Distributed Systems workshop.

Meanwhile, please say hello if you see us at the conference, or stay up to date by following #sigir2010. And remember, if you are near the registration desk, please pick up flyers for Terrier and CIKM 2011.

Top Authors in Information Retrieval

2010-07-19T13:04:00.017+01:00

Thanks to Sérgio Nunes who alerted us to this ranking by Microsoft Academic Search of the Top Authors in Information Retrieval, in the past 5 years.

According to this recent ranking, two members of the TerrierTeam, namely Iadh Ounis and Craig Macdonald, are among the top 5 authors in Information Retrieval in the past 5 years (positions #1 and #4, respectively). The ranking is based on in-domain citations.

This good news comes just at the start of the SIGIR 2010 Conference, which will be held in Geneva, Switzerland this week (19-23 July 2010). Several members of the team will be in attendance.

WWW 2010 in Raleigh, NC, USA

2010-05-04T01:48:00.014+01:00

I am back from sunny Raleigh, NC, USA. Besides the nice weather, I had a great time last week attending the 19th International World Wide Web Conference (WWW 2010), where I presented our paper on Exploiting query reformulations for Web search result diversification, joint work with Craig Macdonald and Iadh Ounis. The paper introduces a probabilistic formulation of our xQuAD framework for search result diversification, and analyses the effectiveness of query reformulations provided by three commercial search engines for the diversification task. My talk was very well received, with lots of questions from the audience, and subsequent chatting with many people from both academia and industry.
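
For readers unfamiliar with xQuAD, the greedy selection at its core can be sketched as below. This is a minimal illustration rather than our actual implementation: the function name `xquad_rerank`, the dictionary-based inputs and the toy probabilities are all assumptions made for the example. At each step, the document maximising (1 - lambda) * P(d|q) + lambda * sum over sub-queries s of P(s|q) * P(d|s) * prod over already selected d' of (1 - P(d'|s)) is added to the re-ranked list, where the sub-queries could, for instance, be query reformulations.

```python
def xquad_rerank(relevance, coverage, importance, lam=0.5, k=10):
    """Greedy xQuAD-style re-ranking (illustrative sketch).

    relevance:  {doc: P(d|q)} for the initial ranking.
    coverage:   {sub_query: {doc: P(d|s)}} coverage of each sub-query.
    importance: {sub_query: P(s|q)} importance of each sub-query.
    lam:        trade-off between relevance (0.0) and diversity (1.0).
    """
    selected = []
    remaining = set(relevance)
    # not_covered[s] is the product of (1 - P(d'|s)) over selected docs d'.
    not_covered = {s: 1.0 for s in importance}

    def score(doc):
        diversity = sum(
            importance[s] * coverage.get(s, {}).get(doc, 0.0) * not_covered[s]
            for s in importance
        )
        return (1.0 - lam) * relevance[doc] + lam * diversity

    while remaining and len(selected) < k:
        # Iterate in sorted order so that ties are broken deterministically.
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
        # Sub-queries covered by the chosen document now count for less.
        for s in importance:
            not_covered[s] *= 1.0 - coverage.get(s, {}).get(best, 0.0)
    return selected


# Toy example: d1 and d2 cover the same sub-query, d3 covers another one.
relevance = {"d1": 0.9, "d2": 0.8, "d3": 0.4}
importance = {"s1": 0.5, "s2": 0.5}
coverage = {"s1": {"d1": 1.0, "d2": 1.0}, "s2": {"d3": 1.0}}
diversified = xquad_rerank(relevance, coverage, importance, lam=0.5, k=3)
relevance_only = xquad_rerank(relevance, coverage, importance, lam=0.0, k=3)
```

With lam=0.0 the sketch simply reproduces the relevance ordering, while a larger lam promotes d3, which covers a sub-query that d1 leaves uncovered.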

The blend of academia and industry was indeed a signature of WWW. I was also impressed with the multidisciplinary nature of the conference: with up to five parallel sessions, there was always something for everyone! In particular, from the sessions I attended, a few papers caught my attention:

Clustering query refinements by user intent, by Eldar Sadikov et al. (Stanford University and Google)
Optimal rare query suggestion with implicit user feedback, by Yang Song and Li-wei He (Microsoft Research)
Building taxonomy of Web search intents for name entity queries, by Xiaoxin Yin and Sarthak Shah (Microsoft Research)
Exploring Web scale language models for search query processing, by Jian Huang et al. (Microsoft Research Asia, Facebook, and Penn State University)
Classification-enhanced ranking, by Paul N. Bennett et al. (Microsoft Research)
Ranking specialization for Web search: A divide-and-conquer approach by using topical RankSVM, by Jiang Bian et al. (Georgia Tech and Yahoo! Labs)
Generalized distances between rankings, by Ravi Kumar and Sergei Vassilvitskii (Yahoo! Research)
Relational duality: Unsupervised extraction of semantic relations between entities on the Web, by Danushka T. Bollegala et al. (University of Tokyo)

The conference also featured three passionate keynotes:

Vint Cerf discussed a broad range of topics of interest on today's Web, where everything is connected: 1.8 billion users, around a billion Web-enabled mobile devices, and still a large room for growth in developing countries. Points touched on included the implications of the explosion of data production on mobility, accessibility, security and privacy, intellectual property, digital preservation, as well as new technologies (e.g., cloud computing).
danah boyd discussed privacy implications of the availability of "big data". Her keynote revolved around common misconceptions associated with the analysis of data produced by online social activities, as well as ethical concerns related to using this data in the first place, "just because it is accessible".
Carl Malamud from public.resource.org described his experiences trying to convince seven bureaucratic institutions to make public data publicly accessible. His keynote was organised around "10 rules for radicals", a guide on how to break the barriers towards negotiating with bureaucrats.

On Thursday night, the conference banquet featured an exciting performance by the North Carolina string band Carolina Chocolate Drops. Check out Snowden's Jig (Genuine Negro Jig) and Don't get trouble in your mind for a taste.

Friday held the closing ceremony, with the announcement of the award winners.

Best Paper:

Factorizing personalized Markov chains for next-basket recommendation, by Steffen Rendle, Christoph Freudenthaler, and Lars Schmidt-Thieme (Osaka University and University of Hildesheim)

Best Student Paper:

Privacy wizards for social networking sites, by Lujun Fang and Kristen LeFevre (University of Michigan)

Best Posters:

How much is your personal recommendation worth, by Paul Dütting, Monika Henzinger and Ingmar Weber (EPFL Lausanne, University of Vienna, and Yahoo! Research)
SourceRank: Relevance and trust assessment for deep Web sources based on inter-source agreement, by Raju Balakrishnan and Subbarao Kambhampati (Arizona State University)

The closing ceremony also featured a short presentation of WWW 2011, to be held in Hyderabad, India. WWW 2012 will take place in Lyon, France.

Finally, on Saturday, the IW3C2 announced the Brazilian bid as the winner to host WWW 2013, which I was very glad to hear about!

RIAO 2010 in Paris, France.

2010-04-28T13:47:00.016+01:00

The 9th International RIAO Conference has started in Paris, France (28-30 April, 2010). It is unfortunate that it is being held concurrently with WWW 2010 in Raleigh.

The first RIAO conference was held in Grenoble in 1985. RIAO is currently a triennial conference, addressing Information Retrieval research topics of interest to both Academia and Industry. This year, the conference focuses on Adaptivity, Personalization and Fusion of Heterogeneous Information.

The following papers caught my eye while browsing the RIAO 2010 program:

Boiling down information retrieval test collections. T. Sakai et al. (Microsoft Research Asia, CMU)
Improving tag recommendation using social networks. A. Rae et al. (The Open University, Yahoo! Research Barcelona)
Analysis of robustness in trust-based recommender systems. Z. Cheng and N. Hurley (UCD)
Opinion-finding in blogs: A passage-based language modelling approach. M. Saad Missen et al. (IRIT)
Predicting query performance using query, result, and user interaction features. Q. Guo et al. (Emory University/Microsoft Research)
Towards a collection-based results diversification. J.A. Akinyemi et al. (University of Waterloo)

In addition, the TerrierTeam has two full papers, which are being presented today at the conference (hopefully, the slides will follow shortly):

Voting for Related Entities by R.L.T. Santos, C. Macdonald and I. Ounis. The paper addresses the problem of entity search, where the goal is to rank not documents, but entities in response to a given query. The paper proposes to tackle this problem as a voting process, by considering the occurrence of an entity among the top ranked documents for a given query as a vote for the existence of a relationship between this entity and the entity in the query. The approach led to high precision and unparalleled recall compared to TREC 2009 systems.
News Article Ranking: Leveraging the Wisdom of Bloggers by R. McCreadie, C. Macdonald and I. Ounis. The paper investigates how news article ranking can be performed automatically, so as to assist editors in selecting the articles that should make the front page of their newspaper. In particular, the paper investigates the blogosphere as a prime source of evidence, on the intuition that bloggers, and by extension their blog posts, can indicate interest in one news article or another. The paper proposes to model the automatic news article ranking task as a voting process, where each relevant blog post acts as a vote for one or more news articles. The approach led to the best TREC 2009 retrieval performance in the Blog track.
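
The voting idea shared by both papers can be sketched in a few lines. This is a minimal illustration only, not the models evaluated in the papers: the function name `rank_by_votes`, the input structures and the simple vote-count aggregation are assumptions made for the example (the papers consider more refined ways of combining the evidence).

```python
from collections import Counter


def rank_by_votes(ranked_docs, doc_candidates, top_k=10):
    """Rank candidates (entities or news articles) by counting votes.

    ranked_docs:    document ids ordered by relevance to the query.
    doc_candidates: maps a document id to the candidates associated with it,
                    e.g. entities occurring in a page or articles a post links to.
    Each of the top_k documents casts one vote per associated candidate.
    """
    votes = Counter()
    for doc_id in ranked_docs[:top_k]:
        # A document votes at most once for any given candidate.
        for candidate in set(doc_candidates.get(doc_id, ())):
            votes[candidate] += 1
    # Sort by descending vote count, then by name for a deterministic order.
    return sorted(votes.items(), key=lambda item: (-item[1], item[0]))


# Toy example: three retrieved blog posts voting for news articles.
ranked_posts = ["post1", "post2", "post3"]
post_articles = {
    "post1": ["articleA", "articleB"],
    "post2": ["articleA"],
    "post3": ["articleC"],
}
ranking = rank_by_votes(ranked_posts, post_articles, top_k=3)
```

Here the candidate supported by the most retrieved documents ends up at the top of the ranking.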

Craig Macdonald is tweeting the conference, pending an appropriate wireless signal. You can follow some bits of the RIAO conference through the #riao2010 hashtag.

ECIR 2010 in Milton Keynes: A Report

2010-04-07T13:04:00.043+01:00

Last week, five of us attended the ECIR 2010 conference in Milton Keynes. The conference was fairly well-organised, although it markedly lacked the lustre of the previous three editions of the conference. In terms of attendance, only about 170 delegates registered, far fewer than at Glasgow 2008 (210+) and Toulouse 2009 (180+). Perhaps the exotic town of Milton Keynes was not deemed to be a very attractive venue for a conference. In fact, apart from attending the conference, there was not much else to do: for example, the nearest proper pub was about 2 miles from the conference venue.

The ECIR 2010 conference suffered from a new and previously unseen problem: several authors and presenters did not make it to the conference, preferring to give their presentation by proxy or using a pre-recorded talk. No fewer than 5 no-shows were recorded during the conference. Even the keynote speaker and winner of the first BCS IRSG Karen Sparck Jones award, Mirella Lapata, did not show up and gave her presentation through a pre-recorded video. While Lapata certainly had a valid reason (as probably did the other speakers) not to show up, it is clear that ECIR should concretely deal with such a problem, e.g., by making it compulsory that at least one author of each accepted paper be present during the conference.

In addition, the organisers decided not to have parallel sessions (because of lack of facilities?) during ECIR 2010. Therefore, several full papers were turned into poster presentations, which were held during the short lunch period. This was a very bad move: because of the setting, these papers received much less attention and credit, even compared to the actual posters, whose session was rather successful. Some delegates argued that some of the full-papers-turned-posters should have been given a full presentation slot, in lieu of those full papers with a no-show author.

Other than the problems mentioned above, the conference program was generally of a very good quality. On the first day, we enjoyed an excellent tutorial on Machine Learning for IR, given by two MSR researchers, Paul Bennett and Kevyn Collins-Thompson. We also enjoyed an equally excellent tutorial on Crowdsourcing by Omar Alonso from Bing.

Over the following days, there were also several good papers that are worth reading:

A language modeling approach for temporal information needs (from Max-Planck)
The role of query sessions in extracting instance attributes from web search queries (from Google)
Interpreting user inactivity on search results (from Univ. of Washington, Univ. of Patras)
Learning to distribute queries onto Web search nodes (from Yahoo!)
Temporal shingling for version identification in Web archives (from Max-Planck)
Evaluation and user preference study on spatial diversity (from University of Sheffield)

The best paper award was jointly awarded to:

Promoting ranking diversity for biomedical information retrieval using Wikipedia. Jimmy Huang and Xiaoshi Yin (York University)
Evaluation of an adaptive search suggestion system. Sascha Kriewel and Norbert Fuhr (University of Duisburg-Essen, Germany)

We have also had the chance to present our two full-papers on search result diversification, and learning to select:

Explicit search result diversification through sub-queries by Rodrygo L. T. Santos, Jie Peng, Craig Macdonald, and Iadh Ounis. Rodrygo presented our xQuAD search results diversification framework, and the talk was very well received by the delegates, leading to several questions, and many comments that this was arguably the best presentation of the conference.
Learning to select a ranking function by Jie Peng, Craig Macdonald and Iadh Ounis. This was one of the full-paper-turned-poster presentations. Jie presented the poster, which attracted a lot of attention and led to some very interesting discussions.

Finally, during the posters/demos session, two good contributions particularly caught our attention:

An Empirical Study of Query Specificity (Poster) - Avi Arampatzis and Jaap Kamps
NEAT: News Exploration Along Time (Demo) - Omar Alonso, Klaus Berberich, Srikanta Bedathur and Gerhard Weikum

The conference also had an Industry day, which we missed. You can see a report on the Industry day in the following blog post. During the conference, a few of us actively tweeted about the conference sessions. You can look at the archived ecir2010 hashtag for more details.

One of the most exciting moments of the conference was our visit to Bletchley Park as part of the ECIR 2010 social dinner. This was an excellent venue with a lot of history, and the food was also good! During the dinner, we were given an impossible quiz to answer. Despite the wine, and a long day, some delegates did manage to find the answers.

Usually, when ECIR is held in the UK, the last day of the conference is the venue for the Annual General Meeting of the BCS IRSG - the umbrella group for ECIR. However, in 2010, there was no AGM. We can only suppose that this was because the 2009 AGM was only held in October, co-located with Search Solutions 2009 at BCS HQ. We say suppose, because at the time of writing, the 2009 AGM minutes are not yet available!

Finally, we would like to thank the organisers for their hard work during the conference, for the idea of the ball-bouncer game during the session breaks, which was really cool/fun and for an overall reasonably organised conference. We look forward to ECIR 2011 in Dublin!

Terrier 3.0 released

2010-03-10T18:56:00.004+00:00

Firstly, we have a new website for Terrier: http://terrier.org

Also, we have just released Terrier 3.0!

This is a major update to Terrier, including:

support for indexing WARC collections (such as ClueWeb09)
improved MapReduce mode indexing
improved and more scalable index structures
added field-based and proximity term dependence models, such as BM25F, PL2F and Markov Random Fields
new Web-based retrieval interface

Fuller changelog at http://terrier.org/docs/current/whats_new.html

If you're looking for our team publications, etc., please see our new team website: http://terrierteam.dcs.gla.ac.uk/

Thanks are due to everyone in the Terrier Team for their hard work to make this release, as well as the contributions and feedback about Terrier from our users and collaborators.

TREC Blog Track 2010

2010-02-23T10:31:00.021+00:00

The TREC Blog track will be continuing in 2010. In 2009, the Blog track has been markedly revamped, addressing more refined Blog search scenarios using the new Blogs08 collection, a large sample of the blogosphere covering the period of 14th January 2008 to 10th February 2009.

A summary of the TREC Blog track 2009 edition has been presented by Iadh Ounis at the main TREC conference (Slides). The Blog track 2009 overview paper will be available on the TREC website shortly, once it is updated and reviewed.

The details of the TREC 2010 Blog track are still being finalised by the organisers. However, following the discussions at the TREC 2009 Blog track workshop, here are some salient details (see also the TREC 2009 Wrap-up Slides):

1. Faceted blog search task will run again in 2010: The task addresses the quality aspect of the retrieved blogs. It is a feed search task.

We will adopt a two-stage submission procedure: (1) a participating group submits "topically-relevant" blogs for each query; (2) a few standard baselines will be distributed to participants, so that they can re-rank them with respect to various facet inclinations (e.g. opinionated, in-depth, personal).
Groups can participate in stage 2 without stage 1, and vice-versa. Stage 1 is akin to an adhoc blog search task.
More topics for various facet inclinations.

2. Top news story identification task will run again in 2010: The task addresses the news-related dimension of the blogosphere. In particular, it investigates whether the blogosphere can be used to identify the most important news stories of the day.

Real-time news search task rather than retrospective.
A much larger and more comprehensive headlines sample, provided by a major news organisation.
A two-stage submission procedure: (1) groups submit a ranking of top stories for some days per-category (e.g. sport, politics, business, etc.); (2) we will then select some top relevant stories, for which we will ask the participating groups to identify the related blog posts, in a manner that covers the various/diverse aspects of each story.
Groups can participate in stage 2 without stage 1. In that case, it is an adhoc diversity blog post search task, where the headline is the query.

We welcome any feedback and comments on the tasks above to trecblog-organisers (at) dcs.gla.ac.uk

Finally, note that if you wish to participate in TREC 2010, you should answer the TREC 2010 call for participation. We will update the Blog track wiki as things become more refined - keep following the Blog track developments as they happen on our dedicated Wiki web site.

AcademTech: Faceted People Search

2009-08-04T14:16:00.017+01:00

AcademTech is a Computing Science-specific expert search engine based on the Terrier IR Platform. Persons working at Computing Science departments in Scottish Universities are considered as candidate experts by the system. Profiles of their expertise evidence are then mined from their homepages, publicly available digital libraries (e.g. DBLP) and related information found on the Web through Yahoo! BOSS. The ranking of experts is provided by a variant of the Voting Model expert search approach.

The system is integrated with a novel faceted search interface to allow users to browse and explore the results using a number of categories such as Location or Conference/Journal publications. Each expert in the system has a profile page containing a number of elements including query specific supporting publications, most informative associated terms displayed as a tag cloud, co-authors and web links. Although the system is currently applied in the context of Scottish Computing Science Academia, it can easily be expanded to go beyond its current Scottish scope, cover other academic fields, and people in general.

I was lucky enough to be able to demo AcademTech at SIGIR 2009 in Boston on July 20th. I spoke to a large number of attendees and received a lot of very helpful feedback.

A popular suggestion was to utilize AcademTech's core system in the scope of biology. This would meet the medical field's need for finding related organisms, diseases etc. Possible facets in the area would likely be biological classifications such as species and genus.

Daniel Tunkelang from The Noisy Channel suggested providing facets located on the profile page, allowing search results to be filtered by features present in a selected expert's page, such as co-authors. This would satisfy an example scenario such as "Show me co-authors of this expert who work for the University of Glasgow." Profile facets could also allow the expert's publications list to be filtered by a number of fields such as co-author location, conference etc.
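The co-author scenario can be illustrated with a minimal sketch. This is not AcademTech's implementation: the `Expert` record, its fields and the `coauthors_at` helper are all hypothetical names invented for illustration, and the sketch assumes expert names are unique (which, as noted below, is not yet true in the real system).

```python
from dataclasses import dataclass, field


@dataclass
class Expert:
    # Hypothetical profile record; the field names are assumptions for this sketch.
    name: str
    university: str
    coauthors: set = field(default_factory=set)


def coauthors_at(experts, selected_name, university):
    """Return names of the selected expert's co-authors at a given university."""
    by_name = {e.name: e for e in experts}  # assumes unique names
    selected = by_name[selected_name]
    return sorted(
        e.name
        for e in experts
        if e.name in selected.coauthors and e.university == university
    )


experts = [
    Expert("A. Smith", "University of Glasgow", {"B. Jones", "C. Brown"}),
    Expert("B. Jones", "University of Glasgow", {"A. Smith"}),
    Expert("C. Brown", "University of Edinburgh", {"A. Smith"}),
]

glasgow_coauthors = coauthors_at(experts, "A. Smith", "University of Glasgow")
```

Here the facet is applied to the co-author set of the selected profile rather than to the full result list, which is the distinction the suggestion draws.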

Much of the feedback mirrored our intended future work. Name disambiguation is a high-priority update, as a current problem with AcademTech is the publication mismatch that occurs when multiple experts have the same name. In fact, the system is specifically designed to allow for the expansion of facets and for name disambiguation. With a large number of publication collaborators working in industry, a useful move would be to expand the system to accommodate these experts.

AcademTech is now publicly accessible from http://www.terrier.org/academtech

A description of the system is available in the SIGIR'09 proceedings.

Thank you to all those who spoke to me and gave me some great feedback.

SIGIR 2009: Expert Search from Glasgow

2009-07-21T16:08:00.005+01:00

A short update from SIGIR09 to announce our recently published work on expert search. This should hopefully be the first of a series of a few posts about SIGIR this year.

In On Perfect Document Rankings for Expert Search (Craig Macdonald & Iadh Ounis), we examine the effect of the document ranking on an expert search engine. Intuitively, improving the topical relevance properties of the document ranking usually leads to an improvement in the performance of the generated ranking of experts. In this poster, we examine the extreme case, by making the document ranking component perfect with respect to topical relevance.

In Usefulness of Click-through data in Expert Search (Craig Macdonald & Ryen White), we examine how user clicks on an intranet search engine can be used as features by an expert search engine. The proposed techniques are based on the voting techniques from the Voting Model, but examine document clicks instead of weighting model scores. To our knowledge, this is the first work examining how clicks can be integrated into expert search.

Finally, the Voting Model was show-cased in Expertise Search in Academia using Facets (Duncan McDougall & Craig Macdonald), which demoed AcademTech, a faceted search interface for expert search in academia.

CIKM 2011 in Glasgow!

2009-06-04T10:38:00.003+01:00

We are delighted that our bid to host the ACM Conference on Information and Knowledge Management (CIKM 2011) in Glasgow has been successful.

After the highly successful ESSIR 2007 and ECIR 2008 events, we are excited at the prospect of hosting the prestigious ACM CIKM Conference in Glasgow in 2011. We look forward to having our colleagues gather in Glasgow, and to surpassing their expectations.

Further information about the conference (dates, venues, etc.) will be available in due course.

CIKM 2009 will be held on November 2-6, 2009, in Hong Kong. Hope to see you there!

TREC Blog track 2009

2009-04-29T12:08:00.005+01:00

We have just released a draft of the guidelines for the TREC 2009 Blog track.

Compared to previous years, the Blog track 2009 aims to investigate more refined and complex search scenarios. In particular, we propose to run two tasks in TREC 2009:

Faceted blog distillation: a more refined version of the blog distillation task that addresses the quality aspect of the retrieved blogs and mimics an exploratory search task. The task can be summarised as "Find me a good blog with a principal, recurring interest in X". We propose several facets for the TREC 2009 blog distillation task, which may be of varying difficulty to identify for the participant systems.

Top stories identification: a new pilot task that addresses the news dimension in the blogosphere. Systems are asked to identify the top news stories of a given day, and to provide a list of relevant blog posts discussing each news story. The ranked list of blog posts should have a diverse nature, covering different/diverse aspects, perspectives or opinions of the news story.

The new Blogs08 collection, an up-to-date and large sample of the blogosphere from January 2008 to February 2009, will be used for both tasks.

We welcome feedback. Please feel free to post feedback and comments about the proposed tasks for 2009.

Blogs08 Collection Released

2009-04-09T20:05:00.003+01:00

We are pleased to announce that the Blogs08 collection is now ready for distribution. As announced before, Blogs08 is one order of magnitude bigger than Blogs06, and samples the blogosphere from January 2008 to February 2009. The uncompressed permalink size is approx 1.3TB, while including feeds, this amounts to over 2TB of data. As usual, the data is shipped compressed on a SATA hard drive.

The distribution mechanism will be the same as for Blogs06. There is specific information about the size of the collection here, while the instructions for obtaining the collection are here.

If you intend to participate in the TREC 2009 Blog track, please start working on the paperwork right away, so that you can get the collection as soon as possible. Due to the larger size of the collection, we will operate a queuing system for shipping the data. Moreover, if you haven't done so already, please respond to the TREC 2009 Call for Participation.

Blog track co-ordinators are finalising the guidelines for this year's tasks and will continue to update the TREC Blog wiki, the TREC blog track mailing list and this blog.

Craig's Thesis Available

2009-03-03T16:52:00.007+00:00

Following up from my successful defence, I'm pleased to announce that my thesis, titled The Voting Model for People Search, is now available online.

My thesis proposes the Voting Model for various people search problems, such as expert search in enterprise settings (find me someone who knows about...), or blog(ger) search (find me a blog about the general topic...). I also examine the reviewer assignment problem (suggest for me reviewers for this paper...). In general, the Voting Model is concerned with the ranking of aggregates of documents.
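To give a flavour of what ranking aggregates of documents means, here is a minimal sketch, not the thesis's implementation. It assumes each retrieved document counts as a vote for every candidate (expert, blog, reviewer) associated with it, and shows two simple ways of combining votes: counting them, and summing the document scores (CombSUM-style data fusion). The function name, the `method` parameter and the toy data are hypothetical.

```python
from collections import defaultdict


def rank_candidates(doc_scores, doc_to_candidates, method="combsum"):
    """Turn a scored document ranking into a ranking of candidates.

    doc_scores: list of (doc_id, score) pairs from any document ranking.
    doc_to_candidates: dict mapping doc_id to the candidates it is associated with.
    method: "votes" counts voting documents; "combsum" sums their scores.
    """
    vote_count = defaultdict(int)
    score_sum = defaultdict(float)
    for doc_id, score in doc_scores:
        for candidate in doc_to_candidates.get(doc_id, ()):
            vote_count[candidate] += 1
            score_sum[candidate] += score

    if method == "votes":
        totals = dict(vote_count)
    elif method == "combsum":
        totals = dict(score_sum)
    else:
        raise ValueError(f"unknown method: {method}")

    # Highest aggregate first; ties broken alphabetically for determinism.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy data: three retrieved documents and the people associated with them.
doc_scores = [("d1", 3.0), ("d2", 2.0), ("d3", 1.0)]
doc_to_candidates = {"d1": ["alice"], "d2": ["bob"], "d3": ["bob", "carol"]}

by_score = rank_candidates(doc_scores, doc_to_candidates, method="combsum")
by_votes = rank_candidates(doc_scores, doc_to_candidates, method="votes")
```

With this toy data the two combinations disagree: counting votes puts bob first (two documents), while summing scores ties alice and bob at 3.0. The choice of how to combine votes is therefore a modelling decision in its own right.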

Experimental chapters are mainly carried out using TREC Enterprise track and Blog track test collections.