Thursday, April 9, 2009

Blogs08 Collection Released

We are pleased to announce that the Blogs08 collection is now ready for distribution. As announced before, Blogs08 is one order of magnitude bigger than Blogs06, and samples the blogosphere from January 2008 to February 2009. The uncompressed permalinks amount to approximately 1.3TB; including feeds, the collection comes to over 2TB of data. As usual, the data is shipped compressed on a SATA hard drive.

The distribution mechanism will be the same as for Blogs06. There is specific information about the size of the collection here, while the instructions for obtaining the collection are here.

If you intend to participate in the TREC 2009 Blog track, please start working on the paperwork right away, so that you can get the collection as soon as possible. Due to the larger size of the collection, we will operate a queuing system for shipping the data. Moreover, if you haven't done so already, please respond to the TREC 2009 Call for Participation.

Blog track co-ordinators are finalising the guidelines for this year's tasks and will continue to update the TREC Blog wiki, the TREC blog track mailing list and this blog.

2 comments:

jeff.dalton said...

Thanks to everyone there for your hard work putting this together!

Iadh Ounis said...

Thanks Jeff! Much appreciated.
The main person to thank is Craig, who put a lot of effort and work into wrapping up the collection.

Due to the scale of the collection and some complex issues, the building process was much more time-consuming and required more work and resources than expected.

However, we are very pleased with the outcome and the characteristics of the new collection. We believe that it will allow some great research to be done.

Hope to see you and some of the UMass folks participating in the blog track this year!