Tuesday, August 4, 2009

AcademTech: Faceted People Search

AcademTech is a Computing Science-specific expert search engine based on the Terrier IR Platform. Persons working at Computing Science departments in Scottish Universities are considered as candidate experts by the system. Profiles of their expertise evidence are then mined from their homepages, publicly available digital libraries (e.g. DBLP) and related information found on the Web through Yahoo! BOSS. The ranking of experts is provided by a variant of the Voting Model expert search approach.

The system is integrated with a novel faceted search interface to allow users to browse and explore the results using a number of categories such as Location or Conference/Journal publications. Each expert in the system has a profile page containing a number of elements, including query-specific supporting publications, the most informative associated terms displayed as a tag cloud, co-authors and web links. Although the system is currently applied in the context of Scottish Computing Science academia, it can easily be extended beyond Scotland, to other academic fields, and to people search in general.

I was lucky enough to demo AcademTech at SIGIR 2009 in Boston on July 20th. I spoke to a large number of attendees and received a lot of very helpful feedback.

A popular suggestion was to utilize AcademTech's core system in the scope of biology. This would meet the medical field's need for finding related organisms, diseases etc. Possible facets in the area would likely be biological classifications such as species and genus.

Daniel Tunkelang from The Noisy Channel suggested providing facets on the profile page, allowing search results to be filtered by features present in a selected expert's page, such as co-authors. This would satisfy an example scenario such as "Show me co-authors of this expert who work for the University of Glasgow." Profile facets could also allow the expert's publications list to be filtered by a number of fields such as co-author location, conference etc.
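To make the co-author scenario concrete, here is a minimal sketch of such a profile facet filter. It is not AcademTech's code: the `Expert` class, the `coauthors_at` function and the sample names are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Expert:
    """Hypothetical profile record; AcademTech's real data model is not shown in this post."""
    name: str
    university: str
    coauthors: set[str] = field(default_factory=set)


def coauthors_at(experts: dict[str, Expert], selected: str, university: str) -> list[str]:
    """Return the selected expert's co-authors who work at the given university."""
    profile = experts[selected]
    return sorted(
        name
        for name in profile.coauthors
        # Co-authors without a profile in the system are skipped.
        if name in experts and experts[name].university == university
    )


# Made-up sample data.
experts = {
    "A. Smith": Expert("A. Smith", "University of Glasgow", {"B. Jones", "C. Brown"}),
    "B. Jones": Expert("B. Jones", "University of Glasgow", {"A. Smith"}),
    "C. Brown": Expert("C. Brown", "University of Edinburgh", {"A. Smith"}),
}

print(coauthors_at(experts, "A. Smith", "University of Glasgow"))  # ['B. Jones']
```

The same pattern would extend to filtering a publications list by venue or co-author location, given the corresponding fields on each record.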

Much of the feedback mirrored our intended future work. Name disambiguation is a high-priority update, as a current problem with AcademTech is the mismatching of publications when multiple experts share the same name. In fact, the system is specifically designed to allow for the expansion of facets and for name disambiguation. With a large number of publication collaborators working in industry, a useful move would be to expand the system to accommodate these experts.

AcademTech is now publicly accessible from http://www.terrier.org/academtech
A description of the system is available in the SIGIR'09 proceedings.

Thank you to all those who spoke to me and gave me some great feedback.

Tuesday, July 21, 2009

SIGIR 2009: Expert Search from Glasgow

A short update from SIGIR09 to announce our recently published work on expert search. This should hopefully be the first of a series of a few posts about SIGIR this year.

In On Perfect Document Rankings for Expert Search (Craig Macdonald & Iadh Ounis), we examine the effect of the document ranking on an expert search engine. Intuitively, improving the topical relevance properties of the document ranking usually leads to an improvement in the performance of the generated ranking of experts. In this poster, we examine the extreme case, by making the document ranking component perfect with respect to topical relevance.

In Usefulness of Click-through data in Expert Search (Craig Macdonald & Ryen White), we examine how user clicks on an intranet search engine can be used as features by an expert search engine. The proposed techniques are based on the voting techniques from the Voting Model, but examine document clicks instead of weighting model scores. To our knowledge, this is the first work examining how clicks can be integrated into expert search.

Finally, the Voting Model was showcased in Expertise Search in Academia using Facets (Duncan McDougall & Craig Macdonald), a demo of AcademTech, a faceted search interface for expert search in academia.

Thursday, June 4, 2009

CIKM 2011 in Glasgow!

We are delighted that our bid to host the ACM Conference on Information and Knowledge Management (CIKM 2011) in Glasgow has been successful.

After the highly successful ESSIR 2007 and ECIR 2008 events, we are excited at the prospect of hosting the prestigious ACM CIKM Conference in Glasgow in 2011. We look forward to having our colleagues gather in Glasgow, and to surpassing their expectations.

Further information about the conference (dates, venues, etc.) will be available in due course.

CIKM 2009 will be held on November 2-6, 2009, in Hong Kong. Hope to see you there!

Wednesday, April 29, 2009

TREC Blog track 2009

We have just released a draft of the guidelines for the TREC 2009 Blog track.

Compared to previous years, the Blog track 2009 aims to investigate more refined and complex search scenarios. In particular, we propose to run two tasks in TREC 2009:
  • Faceted blog distillation: a more refined version of the blog distillation task that addresses the quality aspect of the retrieved blogs and mimics an exploratory search task. The task can be summarised as "Find me a good blog with a principal, recurring interest in X". We propose several facets for the TREC 2009 blog distillation task, which may be of varying difficulty to identify for the participant systems.
  • Top stories identification: a new pilot task that addresses the news dimension in the blogosphere. Systems are asked to identify the top news stories of a given day, and to provide a list of relevant blog posts discussing each news story. The ranked list of blog posts should have a diverse nature, covering different/diverse aspects, perspectives or opinions of the news story.
The new Blogs08 collection, an up-to-date and large sample of the blogosphere from January 2008 to February 2009, will be used for both tasks.

We welcome feedback. Please feel free to post feedback and comments about the proposed tasks for 2009.

Thursday, April 9, 2009

Blogs08 Collection Released

We are pleased to announce that the Blogs08 collection is now ready for distribution. As announced before, Blogs08 is one order of magnitude bigger than Blogs06, and samples the blogosphere from January 2008 to February 2009. The uncompressed permalinks amount to approx. 1.3TB; including feeds, the collection comes to over 2TB of data. As usual, the data is shipped compressed on a SATA hard drive.

The distribution mechanism will be the same as for Blogs06. There is specific information about the size of the collection here, while the instructions for obtaining the collection are here.

If you intend to participate in the TREC 2009 Blog track, please start working on the paperwork right away, so that you can get the collection as soon as possible. Due to the larger size of the collection, we will operate a queuing system for shipping the data. Moreover, if you haven't done so already, please respond to the TREC 2009 Call for Participation.

Blog track co-ordinators are finalising the guidelines for this year's tasks and will continue to update the TREC Blog wiki, the TREC blog track mailing list and this blog.

Tuesday, March 3, 2009

Craig's Thesis Available

Following up from my successful defence, I'm pleased to announce that my thesis, titled The Voting Model for People Search, is now available online.

My thesis proposes the Voting Model for various people search problems, such as expert search in enterprise settings (find me someone who knows about...), or blog(ger) search (find me a blog about the general topic...). I also examine the reviewer assignment problem (suggest for me reviewers for this paper...). In general, the Voting Model is concerned with the ranking of aggregates of documents.
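As a rough illustration of what "ranking aggregates of documents" means, the sketch below treats each retrieved document as a vote for the candidates associated with it, then combines the votes with a data-fusion-style technique. It is a simplified sketch and not the thesis implementation: the function name, the three technique names used here (Votes, CombSUM, CombMNZ) and the sample data are chosen for illustration, and the thesis itself is the authority on the techniques and their exact definitions.

```python
from collections import defaultdict


def rank_candidates(doc_scores, doc_candidates, technique="CombSUM"):
    """Rank candidates by aggregating votes from the documents associated with them.

    doc_scores:     {doc_id: retrieval score} for the documents retrieved for a query
    doc_candidates: {doc_id: [candidate, ...]} document-to-candidate associations
    """
    votes = defaultdict(int)        # number of retrieved documents per candidate
    score_sum = defaultdict(float)  # sum of those documents' scores per candidate
    for doc, score in doc_scores.items():
        for cand in doc_candidates.get(doc, ()):
            votes[cand] += 1
            score_sum[cand] += score

    if technique == "Votes":        # count the votes only
        final = {c: float(votes[c]) for c in votes}
    elif technique == "CombSUM":    # sum the scores of the voting documents
        final = dict(score_sum)
    elif technique == "CombMNZ":    # number of votes multiplied by the score sum
        final = {c: votes[c] * score_sum[c] for c in votes}
    else:
        raise ValueError(f"unknown technique: {technique}")

    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(final.items(), key=lambda kv: (-kv[1], kv[0]))


# Made-up document ranking and associations.
doc_scores = {"d1": 3.0, "d2": 2.0, "d3": 1.5, "d4": 0.5}
doc_candidates = {"d1": ["alice"], "d2": ["bob"], "d3": ["bob"], "d4": ["carol"]}

print(rank_candidates(doc_scores, doc_candidates, "CombSUM"))
# [('bob', 3.5), ('alice', 3.0), ('carol', 0.5)]
```

For blog search, the "candidates" would be blogs and the documents their posts; for expert search, they would be people and the documents associated with them.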

The experiments are mainly carried out using the TREC Enterprise track and Blog track test collections.

Sunday, February 22, 2009

Correlator launched

Yahoo! Research has launched a new search engine called Correlator. It uses advanced techniques from Natural Language Processing and Computational Linguistics to locate entities within text and to group sentences about these entities from different documents. In his talk at the ECIR 2008 Industry Day, Hugo Zaragoza, who championed the Correlator project at Yahoo! Research Barcelona, described some of the system's underlying approaches and technologies. In a blog post introducing the search engine, he states:
The core of Correlator is a search engine capable of returning not only relevant documents, but also relevant sentences and entities.
Currently, Correlator uses Wikipedia as the underlying document collection. However, the Correlator team contends that this can be extended to other collections and types of documents such as blogs.

I quickly tried Correlator this morning. My first impression is that the system does extremely well on many queries, e.g. results for queries such as "precision and recall" are pretty good and informative. However, there are several areas for improvement when it comes to identifying relationships between entities. For example, for the query "Tony Blair", when searching for names, the system suggests many entities as 'probably related to Tony Blair', but the precise nature of the relationship between the two entities is not stated, e.g. Cherie Blair should be presented definitively as the wife of Tony Blair. Indeed, the task of browsing through the various suggested relationships between the named entities is left to the user. However, this might be a deliberate choice by the designers of the system, favouring high coverage over high precision.

Relatedly, it is of note that TREC 2009 will include a new Entity track. One of the currently proposed search tasks is the identification of relationships between entities.

Friday, February 20, 2009

Choosing an issue tracking system

I recently posted about our deployment of an issue tracking system for use by the Terrier platform.

When we recently went through the process of choosing an issue tracking system, I reviewed several alternative products, namely Scarab, Mantis, and JIRA. Firstly, I should say that I was already familiar with JIRA through its extensive usage by the Apache Software Foundation, e.g. by Hadoop and Pig.

When I evaluated all three products, I found Scarab to be the nearest to JIRA. However, it required much work to set up: out of the box, the issue tracker was full of placeholder labels such as "attributeGroup1". Scarab also lacked querying features: e.g. it had an advanced SQL-inspired querying interface, but no easy way to see "All Issues" or "Most Recently Updated Issues". For me, this is a killer problem, and one that Bugzilla shares. It should be easy to see what the current problems are, or what issues people are working on. At the end of the day, as a developer, you spend more time querying an issue tracker than you do filing issues.

I'm no expert at HTML & CSS, so for Mantis, I was pleased when it installed with a cleaner default theme. However, with Mantis, the fields for an issue were hard-coded in the submission and display pages. This was ultimately the downfall for Mantis, as I'm keen on minimising the unnecessary fields, and Mantis had many fields that were not appropriate for Terrier. Another downside for Mantis was that issues were numbered 0000001. In JIRA and Scarab, issues have a project prefix, and no leading zeros, making them instantly recognisable outside of the issue tracker - e.g. if I write TR-5 (e.g. in an SVN commit message), then people are likely to know what I'm on about. In contrast, 0000005 is not something that they would find quickly using a search engine.

Finally, I installed a trial of Atlassian's JIRA. We found it easy to browse and query existing issues in JIRA. The dashboard functionality is also useful, and customisable. Finally, we also liked the user/SE friendly URLs used by JIRA for issues: e.g. http://ir.dcs.gla.ac.uk/terrier/issues/browse/TR-1 for issue TR-1. I'm pleased with JIRA, which exhibits an overall very polished UI. Atlassian have also very kindly provided Terrier with an open source license for JIRA. Thanks Atlassian!

Wednesday, February 18, 2009

WSDM 09 : Your mileage may vary

So, for the second year WSDM arrives, attendance has remained constant (despite the economic downturn) and we're all packing t-shirts rather than bags. However, unwrapping the packaging, what do we find?

Initially, the conference seemed promising, beginning with an excellent speech by Jeff Dean. A 101 on efficiency at Google since 1999, well rounded with explanations and statistics in equal measure. Unfortunately, this starring performance seemed to overshadow the rest of the conference - a benchmark never surpassed.

If I was to use one word to describe WSDM'09, then it would be inconsistent. There were some nice speeches, Eytan Adar's talk on detecting how the web changes over time springs to mind (also best student paper), and Songhua Xu gets bonus points for turning up with an interface paper to a predominantly text-based conference. However, many were poorly presented, insubstantial or both.

The dominating topic of the conference was unsurprisingly Wikipedia, with well over 40% of papers giving it a mention. Ignoring the proliferation of Wikipedia papers over the last year, the high point here was Eytan Adar's paper on Information Arbitrage Across Multi-lingual Wikipedia, for coming up with something which might actually be useful in practice.

The videos from the conference should be up soon on videolectures.net, and I would recommend Jeff Dean's opening speech - for those of you who can survive watching in the tiny eye-strain-o vision which comes with Flash. As for the rest, remember - your mileage may vary.

Tuesday, February 17, 2009

WSDM 2009 highlights

Richard and I went to Barcelona last week to attend the WSCD 2009 workshop and the WSDM 2009 conference. Craig was also there on Monday to present an interesting poster on the usefulness of click-through data for training.

Besides being held in an exciting city (!), WSDM 2009 kept up with its previous edition in bringing together industry and academia to a common, quality forum for Web IR and data mining, with papers covering a wide range of trendy topics -- fairly well summarised by the tag cloud printed on the t-shirts given to the participants! -- from query intent detection, through search results diversification, to tagging-based clustering and classification, and social network-driven marketing analysis, to name a few.

The best paper award went to Fernando Diaz for his work on the selective integration of news content into Web results based on the classification of the newsworthiness of each query. Eytan Adar et al. received the best student paper award for their study of the dynamics of the content and structure of Web documents of varying popularity over a fine-grained timescale. In the new late breaking results session, the award went to Irem Arikan et al.'s paper on applying a language model approach for improving the retrieval effectiveness for queries with temporal expressions. The invited talks by Jeff Dean and Gerhard Weikum were also insightful -- we couldn't attend Ravi Kumar's though. All talks should be available soon from VideoLectures.net.

Overall, WSDM is rapidly moving towards establishing itself among the major IR conferences. In 2010, it will probably be held in Los Angeles, CA, USA.

Monday, February 16, 2009

Twitter and CEOs

Thanks to Theo Huibers for pointing me to an article in Forbes about Why Europe's CEOs should Twitter.

The article reports that, unlike their counterparts in the USA, the CEOs of European companies have been slow to embrace Twitter. In general, the article argues that European chief executives are not very aware of the benefits of social networking tools to their businesses, missing out on opportunities to engage with their customers.

If this is true, then it is rather worrying. Indeed, I can easily see many scenarios where social networking tools such as Twitter could be helpful for businesses. The Forbes article mentions several of these, for example a case where the public relations office of General Motors used Twitter to clamp down on rumours affecting the company. In his blog, Daniel Tunkelang reported a first-hand experience, when one of his technical questions posted on Twitter received care from the president and COO of GoDaddy.com, albeit with a degree of attention that went beyond what Daniel bargained for.

It is of interest to note that the Forbes article suggests that Twitter's interface is still too complex for widespread adoption by end-users and businesses. While I have only been an occasional user of Twitter, I have never had the feeling that the tool was difficult to use. However, I'm happy to stand corrected by HCI experts!

Wednesday, February 11, 2009

Grid@CLEF track : a framework for IR experimentation

Don't be put off by the title, this isn't a post about Grid Computing. Instead, I'm going to talk about the Grid@CLEF task, which defines a framework and TREC-style track for experimentation with various components of IR systems. Disclaimer: I'm pleased to be on the advisory committee of the Grid@CLEF task.

Firstly, I'll give a bit of background. Cross-Language Evaluation Forum (CLEF) is a spin-off from TREC which concentrates on the evaluation of mono-lingual (non-English) and cross-lingual retrieval. CLEF has been running since 2000, and attracts a wide spread of participating research groups from across the globe, reaching 130 for CLEF 2008.

The tracks have now been defined for CLEF 2009, and they include the Grid track. Nicola Ferro (Univ. of Padova) and Donna Harman (NIST) are the big-wigs for this task, with suggestions from the advisory committee. So what does Grid mean in this context? Well, the idea (in my own words) is that the components of an IR system that have an effect on retrieval can be roughly categorised as follows: tokeniser, stopword list, word-decompounder, stemmer, and ranking function. In the Grid track, the concept is that these components can be interchanged, and a fuller understanding of their impact derived. The Grid framework facilitates such interchanges, by defining a way to allow various mixes of components to be attempted, thus creating a "grid" of experimental results.

However, the problem with such an experiment is that each of these components is often tied to a particular IR system, and that the IR system itself can have an impact on the results. Instead, the idea behind the Grid track is that the output from each component (tokeniser, stopword list etc.) of a given IR system is saved in an XML format, and shared among participants. In this way, every combination of each component can be investigated.
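To give a feel for the size of such a "grid", the sketch below enumerates every combination of one choice per component. The component options listed are placeholders chosen for illustration, not the options defined by the track, and the sketch says nothing about the XML interchange format itself.

```python
from itertools import product

# Hypothetical choices for each component category named above.
components = {
    "tokeniser": ["whitespace", "unicode"],
    "stopwords": ["none", "standard"],
    "decompounder": ["off", "on"],
    "stemmer": ["none", "porter"],
    "ranking": ["BM25", "PL2"],
}


def grid(components):
    """Yield one configuration (a dict) for every combination of component choices."""
    names = list(components)
    for combo in product(*(components[name] for name in names)):
        yield dict(zip(names, combo))


configs = list(grid(components))
print(len(configs))  # 32, i.e. 2 choices for each of the 5 components
print(configs[0])
```

Each configuration corresponds to one cell of the grid, and so to one experimental run whose effectiveness would be measured and compared.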

The Grid@CLEF site describes the intuitions behind the task in more detail, including an example of how results will be presented.

Here in Glasgow, we like the concept behind the Grid track. Indeed, it has some similarities to the way we ran the opinion finding task in the TREC 2008 Blog track. In the opinion finding task (where the aim is to retrieve relevant and opinionated blog posts about the target topic), the retrieval performance of opinion identification approaches appears to be linked to the effectiveness of the underlying "topical relevance" retrieval approach. To investigate this in TREC 2008, we provided 5 standard topical relevance baselines, which participants were able to use as input to their opinion finding technique(s). You can read more in the Overview of the TREC 2008 Blog track (Iadh Ounis, Craig Macdonald and Ian Soboroff), which should be released in a few weeks' time.

I have committed to implementing Terrier support for the Grid@CLEF track. The XML specification is being agreed by the Grid@CLEF organisers and advisers. However, if you are interested in using Terrier on this task, you can follow the progress on the TR-9 issue concerning Terrier's Grid@CLEF support. The exact specification for the Grid@CLEF XML interchange format is still in flux, but once it has settled down, Terrier support should be forthcoming.

Building Terrier by Open Collaboration

An important benefit of having an open source IR platform, like Terrier, is that users of the platform can contribute code to the platform, and overall, everyone gains. IR platforms which are not open source may be popular, but can stagnate if they do not evolve to meet modern needs. Open source is a good way of building the critical mass of people needed to evolve a project.

To facilitate the task of our users who contribute to Terrier, we are in the process of making changes that will also make the development process easier:
  • An issue tracker allows issues (bugs or feature requests) to be named, discussed, and patches proposed. Other contributors may review and discuss these patches before they are committed. All development work on the Terrier open source platform will now be done via the issue tracker. In deciding to deploy JIRA, we did take some time to review several issue trackers. I'll describe these and how we came to our decision in a future post.
  • The goal of opening our source code repository is that patches submitted by contributors can be made against the latest (trunk) Terrier source, thus ensuring that no stale patches are received. As a committer this will make my job easier.

I recently announced these changes in Rome at the New challenges in Information Retrieval and Text Mining in an open source initiative workshop. You can see my slides from the workshop below:

Thursday, January 22, 2009

Craig successfully defends his thesis

I am pleased to announce that last Thursday (15th January), I successfully defended my thesis, titled the Voting Model for People Search. I want to give many thanks to Iadh Ounis for supervising my PhD, and also to my committee: convener David Watt, and, in particular, to examiners Ricardo Baeza-Yates and Phil Gray, for the rewarding discussion and constructive feedback.

I have 4 weeks to make very minor corrections to my thesis, after which time it will be available online.