Friday, July 27, 2012

SMART: An open source framework for searching the physical world

Some of our readers are probably aware of our new project SMART, which aims to develop a new technology for the real-time indexing and retrieval of sensor and social streams. This three-year project is funded by the European Commission under the Seventh Framework Programme (grant number 287583). The project, which has started in November 2011, has already received a large national and international press coverage in online and print news over the last month. The BBC will shortly be broadcasting a piece of television about the project.

The name of the project and the resulting search engine, SMART, acknowledges the vision of the Internet of Things in general, and the concept of smart cities in particular. Indeed, SMART builds on the growing trend of smart cities, where in addition to physical infrastructure (roads, buildings), digital knowledge infrastructure is deployed to serve the needs of the citizens and local governments. The backbone of the digital knowledge infrastructure is mainly composed of sensors such as cameras, microphone arrays, or other environmental sensors, from weather to parking sensors. For example, in "smart cities", drivers can be notified where it is good to park their car or where to avoid traffic jams in the city centre at any time of the day. The main idea of the SMART project is to connect these sensors to the Internet and have search technologies to allow citizens to benefit from the information that these sensors can provide in real-time.

The SMART search engine builds upon the Terrier Information Retrieval platform, and exemplifies our recent move towards building new, separate and tailored products on top of the Terrier platform. In particular, Terrier has been enhanced and expanded with real-time indexing and a scalable distributed architecture allowing to process and handle a large volume of continuous and parallel streams.

SMART is a multi-disciplinary project in nature, encompassing state-of-the-art technologies from audio & video processing, social search and reasoning. Building upon these technologies, SMART analyses the input from sensors in real-time, for example to detect large crowds, or if live music can be heard. These can be compared with recent posts on social networks from the same area, to see whether the system can learn more about what is happening in the area around the sensors. By analysing the sensors across multiple locations within the city, when a user asks “what’s happening near me”, the system has some idea of which locations have the most interesting events.

Clearly, making real-world events searchable can have privacy/ethics implications. In fact, never before in our research have we been confronted with such a dichotomy between what is technologically feasible and what we conceive to be ethical. That's why we and our partners in the project are carefully considering privacy issues in our research. Indeed, we are closely working with various national Data Protection Authorities (DPAs) (i) to ensure that we don’t overstep the legal or ethical boundaries of privacy and (ii) to provide guidelines for the ethical implications of the SMART technologies and help prospective deployers to use/deploy SMART in a legal, ethical, and friendly manner. Interested readers can consult the first issue of the SMART Newsletter for further details about our ongoing efforts towards the privacy issue.

While we will be trialling the SMART search technology in The City of Santander (Spain), the key infrastructure of SMART (including the search components based on Terrier) will be made available as open source, encapsulating a vision whereby other smart cities can easily become involved and benefit from the project's outcomes. We expect the first release of the SMART search technology to become available as open source under the Mozilla Public License (MPL) 2.0 by the end of 2012. By releasing parts of SMART as open source, we aim to allow the formation of a community of early adopters that will be key for evaluating and sustaining the project.

With this in mind, we have just published a paper in the SIGIR 2012 Open Source Information Retrieval (OSIR 2012) workshop describing our current progress in the project as well as the open source vision of the project:

SMART: An open source framework for searching the physical world. M-Dyaa Albakour, Craig Macdonald, Iadh Ounis, Aristodemos Pnevmatikakis and John Soldatos. In Proceedings of the SIGIR 2012 Workshop on Open Source Information Retrieval. Portland, Oregon, USA. August 2012.

As always, we welcome comments and contributions from smart cities, community members and developers to the SMART vision.

Wednesday, July 25, 2012

From Puppy to Maturity: Experiences in Developing Terrier

We will be taking part in the SIGIR 2012 Workshop on Open Source Information Retrieval. In particular, we have published a paper on the Terrier open source information retrieval platform, detailing the vision behind the platform, some recent developments in Terrier, as well as a roadmap for future releases.

As always, our vision for the Terrier platform is to continue empowering researchers and practitioners in information retrieval (IR) with up-to-date, easily adaptable, effective and scalable indexing and search approaches, allowing them to build and evaluate the next generation IR applications. 

In particular, Terrier will be moving towards feature-based retrieval, in line with the increasing importance of the learning-to-rank paradigm in modern information retrieval where machine-learned ranking functions combining multiple features are deployed. To do so, Terrier will be supporting the efficient and effective extraction of query-independent and query-dependent features.

To support scalability and efficiency, Terrier's data structures have undergone a major enhancement to support advanced dynamic pruning techniques, as well as the development of applications requiring distributed and real-time indexing and retrieval such as Twitter search.

Finally, the growth of the Terrier platform over the past decade into exciting new areas such as MapReduce indexing and crowdsourcing entails increased functionality, but also platform complexity. To avoid software bloat, we are moving from a monolithic release structure, to a system of periodic core releases and timely plugin expansions. The first such release will be the CrowdTerrier plugin, providing  researchers with an out-of-the-box tool to achieve fast and cheap relevance assessments.

A more comprehensive account of the forthcoming Terrier releases is detailed in our paper below:

From Puppy to Maturity: Experiences in Developing Terrier. Craig Macdonald, Richard McCreadie, Rodrygo Santos and Iadh Ounis. In Proceedings of the SIGIR 2012 Workshop on Open Source Information Retrieval. Portland, Oregon, USA. August 2012

We hope to see many colleagues joining us to work towards the objectives of the platform and enriching its functionalities. As always, we welcome suggestions and any feedback on the roadmap in the run up to the forthcoming Terrier 4.0.

Monday, June 25, 2012

Efficiency, Effectiveness, Medical Search, Dataset Development and Crowdsourcing at SIGIR 2012

The TerrierTeam will be well represented at SIGIR 2012 this year with a full paper, four posters, a demonstration and a workshop, covering a wide range of disciplines within the field of information retrieval. For those of you interested in Web search efficiency, we have a number of contributions to look for. Our full paper Learning to Predict Response Times for Online Query Scheduling defines the new area of query efficiency prediction. In particular, it postulates that not every query takes the same time to complete, particularly where efficient dynamic pruning strategies such as WAND are used to reduce retrieval latency. In our paper, we show and explain why queries with similar properties (e.g. posting list lengths) can have markedly different response times. We use these explanations to propose a learned approach for query efficiency prediction that can accurately predict the response time of a query before it is executed. Furthermore, we show that using query efficiency prediction can markedly increase the efficiency of query routing within a search engine that uses multiple replicated indices. Relatedly, our poster Scheduling Queries Across Replicas builds upon our work on query efficiency prediction, to show how a replicated and distributed search engine can be improved by the application of response time predictions. In particular, the response time predictions are used to estimate the workload of each replica of each index shard. Then each newly arrived query can be  routed to the replica of each index shard that will be ready to process the query earliest. 

At SIGIR this year we also present recent work examining both efficiency and effectiveness. Dynamic pruning strategies, such as WAND, can increase efficiency by omitting the scoring of documents that can be guaranteed not to make the top-K retrieved set - a feature known as safeness. Broder et al. showed how WAND could be made more efficient by relaxing the safeness guarantee, with little impact on the top-ranked documents. Through experiments on the TREC ClueWeb09 corpus and 33 query dependent and query independent features, our poster Effect of Dynamic Pruning Safety on Learning to Rank Effectiveness shows that relaxing safeness to aid efficiency can have an unexpectedly large impact on retrieval effectiveness when combined with modern learning to rank models, in contrast to the earlier work by Broder et al. In particular, we show that inherent biases by unsafe WAND towards documents with lower docids can markedly impact the effectiveness of learned models.

Those interested in the Medical search domain, in particular participants in the TREC Medical track, will be interested in our paper entitled Exploiting Term Dependence while Handling Negation in Medical Search. We show that it is important to handle negation in medical records - in particular, when searching for cohorts (groups of patients) with specific symptoms, our approach ensures that patients known not to have exhibited particular symptoms are not retrieved. Our results demonstrate that appropriate negation handling can increase retrieval effectiveness, particularly when the dependence between negated terms are considered using a term dependence model from the Divergence From Randomness framework

Our poster On Building a Reusable Twitter Corpus tackles an important issue raised during the creation of the Tweets11 dataset as part of the TREC Micoblog track, namely how reusable Tweets11 is, given the dynamics of Twitter. Our poster shows that corpus degradation due to deleted tweets does not effect the ranking of systems that participated in the TREC 2011 Microblog track. Meanwhile, we are also demonstrating the first release of a new extension to our Terrier IR platform, namely CrowdTerrier, which enables relevance assessments to be created in a fast semi-automatic manner using crowdsourcing. CrowdTerrier is an infrastructure addition to Terrier that enables relevance assessments to be created in a fast semi-automatic manner using crowdsourcing. CrowdTerrier will be made available for download soon.

Finally, together with a group representing six open source IR systems, we are involved in the organisation of a SIGIR'12 workshop on Open Source Information Retrieval. The workshop aims to provide a forum for users and authors of open source IR tools to get together, and to work together to build OpenSearchLab, an open source, live and functioning, online web search engine for research purposes and discuss the joint future.

Tuesday, June 19, 2012

A SMART way to Search your City

TerrierTeam is currently expanding its outreach into social and sensor-based search systems as part of the ongoing SMART EU-funded project (FP7 287583). SMART aims to develop an open source  search framework for multimedia data stemming from the physical world and social streams such as Twitter. The end-goal is to be able to answer location and time-sensitive queries such as “where can I go to listen to live music in the city centre tonight?” or “where are my friends hanging out in the city?” by augmenting social media signals with live city sensor information.

Our role in the SMART project is to develop fast and effective real-time search from the flood of information provided by social and city sensor streams on top of our open-source Terrier information retrieval platform. Indeed, a real-time Twitter search demo illustrating incremental and distributed indexing and real-time retrieval in Terrier is now available. Try it at:

SMART has seen wide-ranging national, European and international coverage in online and print news media over the last week. Indeed, we are tracking over 100 articles and counting! Some sample articles can be found below:
A more detailed list of recent press coverage can be found at