Sunday, November 30, 2008

The TREC 2008 Blog track workshop

We just came back from Gaithersburg a few days ago. It was a nice (and cold!) week at the TREC 2008 conference. Besides presenting the main results of our participation in the Blog, Enterprise, and Relevance Feedback tracks, we had fruitful discussions at the Blog track workshop regarding the directions of the track for 2009.

There was a consensus among the attendees that opinion retrieval and polarity detection are still open, relevant problems. While a few groups managed to deploy interesting techniques that achieved consistent opinion retrieval performance across several strongly performing baselines in the track this year, polarity detection approaches looked rather naive. It was suggested that polarity detection be investigated at a finer granularity (e.g., at the sentence rather than the document level). This, however, could overlap with the scope of the TAC conference.

Nonetheless, believing that, after three years, the Blog track has contributed a comprehensive experimental setting for those who wish to continue investigating these search scenarios, the organisers decided to discontinue the opinion finding and polarity tasks, at least in their current format. Instead, they propose to investigate the opinionated nature of blogs as one of many interesting facets of a broader search task. This task extends the current blog distillation task by moving beyond topic relevance and introducing different requirements in order to qualify "good" blogs, i.e., blogs that have a recurrent interest in a given topic and that also fulfil a set of predefined "facets". This way, for instance, one could search for humorous blogs about the government, or opinionated blogs about whisky.

Besides this faceted blog distillation task, the workshop attendees considered a second task relevant and worth investigating, namely, tracking stories in the blogosphere. The aim is to investigate how stories emerge and evolve over the time frame of the blog corpus. It was also noted that this task could be linked to a news search task so as to draw a connection between stories published in the blogosphere and in the mainstream media.

As pointed out, however, the 11-week time frame of the Blogs06 collection does not adequately support the story tracking task. Furthermore, the availability of a more representative sample of the blogosphere is an important step towards addressing blog search as a social media problem. To this end, a new corpus will be used in 2009, with a much larger size and time frame.

For those who did not attend the Blog track workshop at TREC, please feel free to post your comments about the proposed tasks for 2009.

Hope you all join us in the TREC 2009 Blog track!


Sérgio Nunes said...

Please consider keeping the baseline adhoc retrieval task for both blog posts and feeds. This would allow research on web retrieval besides opinion finding.


Iadh Ounis said...


You mean, you would like to keep the baseline task, i.e. have a two-stage submission procedure like this year?

Sérgio Nunes said...

I think that the two-stage submission procedure was a good solution. It established a common ground to isolate opinion/polarity extraction techniques but also allowed research on basic web retrieval features. However, with the new proposed tracks (story tracking and faceted distillation) this two-stage submission seems inadequate.

Since I am interested in basic web document retrieval, and the WWW track has been discontinued, I see the Blog track as the only venue for testing adhoc retrieval techniques over web documents. Hence my suggestion of a simple, standard adhoc retrieval task over the blog collection.

Iadh Ounis said...

The faceted blog distillation task will require a two-stage assessment procedure: first, the blog is assessed on whether it is relevant, i.e. whether it has a recurring interest in the topic. If deemed relevant, it will then be assessed on whether it meets the required facets. Somehow, the news search task will also include an adhoc retrieval dimension (e.g. "Find me the top stories being currently discussed in the blogosphere").

On the other hand, you might wish to know that there will be a Web track running next year in TREC 2009. It will use a new collection of about 1 billion documents. Happy indexing!

Sérgio Nunes said...

Great news about a new edition of the Web track. However, I suspect that the collection will be a static (snapshot-like) crawl of web documents, i.e. with no temporal information.