Wednesday, December 10, 2008
Mining query logs
The literature often reports how search engines can use their query logs to improve document ranking. However, query logs can also be mined for other purposes. For example, an article in The New York Times described how a power cut in the New York area was reflected in Google's query logs within 2 seconds of its occurrence, while it took about 15 minutes for newswire services to report the same event.
Relatedly, Abdur Chowdhury mentioned in his position talk at the SSM 2008 Workshop that news about a major earthquake in China was reported on Twitter well before the newswire services picked it up. A BBC blog post commented on the same issue.
Finally, the BBC recently reported that Google has developed a system to detect flu outbreaks in the USA by analysing the query logs and identifying the location of people issuing flu-related queries.
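To make the idea concrete, here is a minimal sketch of how location-tagged queries could be aggregated into a per-region signal. It is not Google's method: the `(region, query)` log format, the `FLU_TERMS` keyword list and the `flu_query_share` function are all assumptions of mine, chosen only for illustration.

```python
from collections import Counter

# Hypothetical keyword list; a real system would need a far richer query model.
FLU_TERMS = {"flu", "influenza", "fever"}


def flu_query_share(log):
    """Return the fraction of flu-related queries per region.

    `log` is an iterable of (region, query) pairs, which is an assumed format.
    """
    total = Counter()
    flu = Counter()
    for region, query in log:
        total[region] += 1
        # A query counts as flu-related if any of its words is a flu term.
        if FLU_TERMS & set(query.lower().split()):
            flu[region] += 1
    return {region: flu[region] / total[region] for region in total}


# Made-up sample data, for illustration only.
sample_log = [
    ("NY", "flu symptoms"),
    ("NY", "cheap flights"),
    ("NY", "movie times"),
    ("NY", "news"),
    ("CA", "weather tomorrow"),
    ("CA", "influenza vaccine"),
    ("CA", "fever remedies"),
    ("CA", "pizza near me"),
]
shares = flu_query_share(sample_log)
```

Normalising by the total number of queries per region, rather than using raw counts, keeps heavily populated regions from dominating the signal. Tracking how this share changes over time would be the natural next step for spotting an outbreak.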
Unfortunately, query logs are rarely available to researchers in academia, especially after the AOL data debacle. This limits scientific work in the field, as most current research results based on query logs are not reproducible because the underlying data is not publicly shared. I therefore very much welcome the forthcoming Workshop on Web Search Click Data (WSCD 2009), which lists the public release of query logs among its objectives.