tag:blogger.com,1999:blog-60437057928075447092024-03-05T10:53:18.110+00:00TerrierTeamThis is the <a href="http://ir.dcs.gla.ac.uk/terrier">Terrier Team</a> blog. It is managed by several members of the Terrier Team at the <a href="http://www.gla.ac.uk">University of Glasgow</a>. It is used to publicise our research projects, to post news about various research work and activities performed by the team, as well as to share information and thoughts on information retrieval and search engines issues.Terrier Team @ Glasgowhttp://www.blogger.com/profile/11678159696002044810noreply@blogger.comBlogger46125tag:blogger.com,1999:blog-6043705792807544709.post-58475788301580881002014-06-19T13:00:00.000+01:002014-06-19T13:00:15.494+01:00Terrier IR Platform v4.0 Released<div class="p1">
<a href="http://terrier.org/">Terrier</a> 4.0, the next version of the open source IR platform from the <a href="http://www.gla.ac.uk/">University of Glasgow</a> (Scotland), was released on 18th June 2014.<br />
<br />
Terrier 4.0 represents a major update over the previous 3.6 release, adding significant new features, including:</div>
<ul class="ul1">
<li class="li2"><a href="http://terrier.org/docs/v4.0/realtime_indices.html">Real-time index structures</a> to facilitate incremental indexing of new documents over time</li>
<li class="li2"><a href="http://terrier.org/docs/v4.0/compression.html">Pluggable state-of-the-art index compression</a> reduces the size of Terrier's index structures</li>
<li class="li2"><a href="http://terrier.org/docs/v4.0/learning.html">Learning-to-rank</a> support enables out-of-the-box supervised ranking models, trained using state-of-the-art approaches such as LambdaMART</li>
<li class="li2"><a href="http://terrier.org/docs/v4.0/website_search.html">A website search application</a> is now provided, illustrating real-time crawling, indexing and retrieval within Terrier.</li>
</ul>
<br />
<div class="p1">
Terrier can be downloaded for free from <a href="http://terrier.org/"><span class="s1">http://terrier.org</span></a>. <br />
<br />
A full change log can be found at <br />
<a href="http://terrier.org/docs/current/whats_new.html"><span class="s1">http://terrier.org/docs/current/whats_new.html</span></a>.</div>
Iadh Ounishttp://www.blogger.com/profile/05740425172350940695noreply@blogger.com0tag:blogger.com,1999:blog-6043705792807544709.post-59733380087233189842013-03-26T10:44:00.001+00:002013-03-26T10:55:01.993+00:00Learning to Rank Research Using Terrier - The Role of Features (Part 2)<div style="text-align: justify;">
This is the second post in a <a href="http://terrierteam.blogspot.co.uk/2013/03/learning-to-rank-research-using-terrier.html">two-part series</a> addressing our recent research in learning to rank. While the <a href="http://terrierteam.blogspot.co.uk/2013/03/learning-to-rank-research-using-terrier.html">previous blog post</a> addressed the role of the sample within learning to rank, and its impact on effectiveness and efficiency, in this blog post, I'll be discussing the role of different features within the learning to rank process.</div>
<div>
<h3 style="text-align: justify;">
Features</h3>
<div>
<div style="text-align: justify;">
As various IR approaches have been proposed over the years, these have naturally become features within learned approaches. Features can be calculated on the documents in a query independent (e.g. URL length, PageRank) or query dependent (e.g. weighting or proximity models) manner. For instance, the <a href="http://research.microsoft.com/en-us/um/beijing/projects/letor//">LETOR learning to rank test collections</a> deploy various (query dependent) weighting models calculated on different fields (body, title, anchor text, and URL). We have successfully been using the same four fields for representing Web documents in our participation in the <a href="http://terrierteam.dcs.gla.ac.uk/publications/terrier10trec.pdf">TREC</a> <a href="http://dcs.gla.ac.uk/~craigm/publications/mccreadietrec2011.pdf">Web</a> <a href="http://trec.nist.gov/pubs/trec21/papers/uogTr.medical.microblog.web.final.pdf">tracks</a>.</div>
</div>
<div>
<div style="text-align: justify;">
<br /></div>
</div>
<div>
<div style="text-align: justify;">
The role of different weighting models - particularly those calculated on different fields - within learning to rank intrigued us and formed an article <a href="http://www.dcs.gla.ac.uk/~craigm/publications/macdonald13multquerydf.pdf">About Learning Models with Multiple Query Dependent Features</a> that we recently published in ACM Transactions on Information Systems. In particular, some learned models take the form of linear combinations of feature scores. In contrast, <a href="http://dl.acm.org/citation.cfm?doid=1031171.1031181">Robertson [CIKM 2004]</a> warned against the linear combination of weighting model scores. Yet the LETOR test collections deploy weighting models calculated separately on each field (e.g. BM25 body, BM25 title, BM25 anchor text). Hence, among other research questions, we re-examined the role of field-based weighting models in the learning to rank era. Our findings on the TREC ClueWeb09 corpus showed that field-based models such as <a href="http://trec.nist.gov/pubs/trec13/papers/microsoft-cambridge.web.hard.pdf">BM25F</a> and <a href="http://ims-sites.dei.unipd.it/documents/71612/86362/CLEF2005wn-WebCLEF-MacdonaldEt2005.pdf">PL2F</a> are still important for effective learned models.</div>
</div>
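<div>
<div style="text-align: justify;">
To make the contrast concrete, here is a minimal Python sketch of the two ways of handling fields for a single query term: summing per-field BM25 scores, versus a simple BM25F that combines the weighted field frequencies before applying saturation once. The field weights, the <i>b</i> values, <i>k1</i> and the function names are illustrative assumptions, not the settings or code used in our experiments.</div>
</div>

```python
import math

# Illustrative parameters (assumed for this sketch, not tuned values).
K1 = 1.2
FIELD_WEIGHTS = {"body": 1.0, "title": 3.0, "anchor": 2.0}
FIELD_B = {"body": 0.75, "title": 0.5, "anchor": 0.5}


def idf(num_docs, doc_freq):
    """Robertson/Sparck Jones style IDF, kept positive by the +1."""
    return math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def normalised_tf(tf, length, avg_length, b):
    """Length-normalised term frequency for a single field."""
    return tf / (1.0 - b + b * length / avg_length)


def linear_combination(tfs, lengths, avg_lengths, term_idf):
    """Saturate each field separately, then sum the weighted BM25 scores."""
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        tfn = normalised_tf(tfs[field], lengths[field], avg_lengths[field], FIELD_B[field])
        score += weight * term_idf * tfn / (K1 + tfn)
    return score


def bm25f(tfs, lengths, avg_lengths, term_idf):
    """Combine the weighted field frequencies first, then saturate once."""
    pseudo_tf = sum(
        weight * normalised_tf(tfs[field], lengths[field], avg_lengths[field], FIELD_B[field])
        for field, weight in FIELD_WEIGHTS.items()
    )
    return term_idf * pseudo_tf / (K1 + pseudo_tf)


# Toy document: the term occurs in every field; field lengths equal the averages.
tfs = {"body": 5, "title": 1, "anchor": 2}
lengths = {"body": 500, "title": 8, "anchor": 20}
term_idf = idf(num_docs=1000, doc_freq=10)
print(linear_combination(tfs, lengths, lengths, term_idf))
print(bm25f(tfs, lengths, lengths, term_idf))
```

<div>
<div style="text-align: justify;">
In the BM25F variant the contribution of a term can never exceed its IDF, however many fields it occurs in, whereas the linear combination rewards the same term again in every field. This is one way to read the concern about linearly combining weighting model scores.</div>
</div>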
<div>
<div style="text-align: justify;">
<br /></div>
</div>
<div>
<div style="text-align: justify;">
<a href="http://www.dcs.gla.ac.uk/~craigm/publications/macdonald13multquerydf.pdf">Our TOIS paper</a> also shows how to efficiently calculate multiple query dependent features within an IR system. In particular, as the postings of an inverted index are compressed, it is expensive to calculate additional query dependent features once the top K sample has been identified, due to the cost of decompressing the relevant parts of the inverted index posting lists again. Instead, we show how the postings of documents that are inserted into the top K documents within a DAAT retrieval strategy can be "cloned", and retained decompressed in memory, such that additional query dependent features can be calculated in the Feature Extraction phase. We call this the <i>fat framework</i>, as it "fattens" the result set with the postings of the query terms for those documents. We have implemented this framework within the <a href="http://terrier.org/">Terrier IR platform</a>, and it will be released as part of the next major release of Terrier, as described in our <a href="http://dcs.gla.ac.uk/~craigm/publications/macdonald12terrier.pdf">OSIR 2012 paper</a>.</div>
</div>
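<div>
<div style="text-align: justify;">
The idea can be sketched in a few lines of Python. This is not Terrier's implementation (Terrier is written in Java, and its posting iterators are reused, which is why real postings must be cloned); it is a toy illustration in which posting lists are already in memory, and the names <i>Posting</i>, <i>daat_fat_retrieval</i> and <i>extract_features</i> are hypothetical. The point it shows is that each entry in the top K heap carries the postings that produced its score, so that later features need no second pass over the index.</div>
</div>

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Posting:
    """A decompressed posting: one term's statistics in one document."""
    doc_id: int
    tf: int
    doc_len: int


def daat_fat_retrieval(posting_lists, score_fn, k):
    """Score documents one at a time, keeping postings for the current top k.

    posting_lists maps each query term to its postings sorted by doc_id.
    Returns [(score, doc_id, {term: Posting})] in descending score order.
    """
    cursors = {term: 0 for term in posting_lists}
    heap = []  # min-heap of (score, doc_id, retained postings)
    while True:
        # The next document to score is the smallest doc_id under any cursor.
        current = [
            plist[cursors[t]].doc_id
            for t, plist in posting_lists.items() if cursors[t] < len(plist)
        ]
        if not current:
            break
        doc_id = min(current)
        matched = {}
        for term, plist in posting_lists.items():
            i = cursors[term]
            if i < len(plist) and plist[i].doc_id == doc_id:
                matched[term] = plist[i]  # retained alongside the score
                cursors[term] = i + 1
        score = sum(score_fn(term, p) for term, p in matched.items())
        entry = (score, doc_id, matched)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif score > heap[0][0]:
            heapq.heapreplace(heap, entry)  # evicted postings are dropped
    return sorted(heap, key=lambda e: (-e[0], e[1]))


def extract_features(fat_results, feature_fns):
    """Compute extra query dependent features from the retained postings."""
    rows = []
    for score, doc_id, postings in fat_results:
        extra = [sum(fn(t, p) for t, p in postings.items()) for fn in feature_fns]
        rows.append((doc_id, [score] + extra))
    return rows


# Toy index for a two-term query; raw tf stands in for a weighting model.
index = {
    "terrier": [Posting(1, 2, 10), Posting(3, 1, 10)],
    "search": [Posting(1, 1, 10), Posting(2, 5, 10), Posting(3, 1, 10)],
}
top = daat_fat_retrieval(index, lambda term, p: p.tf, k=2)
print(extract_features(top, [lambda term, p: p.tf / p.doc_len]))
```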
<div>
<div style="text-align: justify;">
<br /></div>
</div>
<div>
<div style="text-align: justify;">
Finally, <i>query features</i> are an increasingly important aspect within learning to rank. In contrast to (query independent or query dependent) document features, query features have the same value for each document ranked for a query. In this way, query features can be said to be document independent. Our <a href="http://terrierteam.dcs.gla.ac.uk/publications/santos2010cikm.pdf">CIKM 2010</a> and <a href="http://terrierteam.dcs.gla.ac.uk/publications/santos2011sigir-a.pdf">SIGIR 2011</a> papers on diversification used query features to decide on the ambiguity of a query, or to decide on the likely type of information need underlying different aspects of an ambiguous query. More generally, query features allow learners based on regression trees (e.g. <a href="http://en.wikipedia.org/wiki/Gradient_boosting#Gradient_tree_boosting">GBRT</a>, LambdaMART) to customise branches of the learned model for different types of query. For instance, if query length is a query feature, then the learner might recognise that a query dependent proximity (document) feature is important for two-term queries, but not for one-term queries. In our <a href="http://dcs.gla.ac.uk/~craigm/publications/macdonald12queryf.pdf">CIKM 2012 short paper</a>, we recognised the lack of a comprehensive study on the usefulness of query features for learning to rank. Our experiments combined the LambdaMART learning to rank technique with 187 different query features that were grouped into four types: pre-retrieval Query Performance Prediction, Query Concept Identification, Query Log Mining, and Query Type Classification. We found that over a quarter of the 187 query features could significantly improve the effectiveness of a learned model. The most promising were the Query Type Classification features that identified the presence of entities in the query, suggesting that such features are useful for triggering sub-trees that promote entity homepages.
Overall, we found query features could be employed to customise learned ranking models for queries with different popularity, length, difficulty, ambiguity, and related entities.</div>
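<div style="text-align: justify;">
The query length example can be illustrated with a small Python sketch. The tree below is hand-written to stand in for one that a learner such as LambdaMART might produce; the feature names, the 0.5 weight and the functions <i>build_rows</i> and <i>toy_tree</i> are assumptions made purely for illustration.</div>

```python
def build_rows(query_terms, doc_features):
    """Append document independent query features to each document's row."""
    query_features = {"query_length": len(query_terms)}
    return {
        doc_id: {**features, **query_features}
        for doc_id, features in doc_features.items()
    }


def toy_tree(row):
    """A hand-written regression tree standing in for a learned one."""
    if row["query_length"] >= 2:
        # Multi-term branch: the proximity feature contributes to the score.
        return row["bm25"] + 0.5 * row["proximity"]
    # Single-term branch: proximity carries no signal, so it is ignored.
    return row["bm25"]


docs = {
    "d1": {"bm25": 2.0, "proximity": 1.0},
    "d2": {"bm25": 2.2, "proximity": 0.0},
}
for query in (["glasgow", "university"], ["glasgow"]):
    rows = build_rows(query, docs)
    ranking = sorted(rows, key=lambda d: toy_tree(rows[d]), reverse=True)
    print(query, ranking)
```

<div style="text-align: justify;">
Every row for a given query carries the same <i>query_length</i> value, so the feature cannot reorder documents by itself. It only selects which branch, and hence which document features, determine the ranking.</div>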
<h3>
Summary</h3>
</div>
</div>
<div style="text-align: justify;">
There remains a great deal of black magic involved in the effective and efficient application of learning to rank within information retrieval. Indeed, my colleagues and I strongly believe that empirically derived best practices are an important part of information retrieval research. This series of blog posts has been aimed at addressing some of the aspects missing in the literature, and at providing insights into our recent research within this area.</div>
<div>
<h3>
Acknowledgements</h3>
<div style="text-align: justify;">
This body of research would not have been possible without a number of co-authors and contributors: Iadh Ounis, Rodrygo Santos, Nicola Tonellotto (CNR, Italy) and Ben He (University of the Chinese Academy of Science).<br />
<div>
<div style="text-align: start;">
<h3>
Key References</h3>
</div>
<div style="text-align: start;">
<a href="http://www.dcs.gla.ac.uk/~craigm/publications/macdonald13multquerydf.pdf">About Learning Models with Multiple Query Dependent Features</a>. Craig Macdonald, Rodrygo L.T. Santos, Iadh Ounis and Ben He. <i>ACM Transactions on Information Systems</i>, 2013, in press.<br />
<br />
<a href="http://www.dcs.gla.ac.uk/~craigm/publications/macdonald12queryf.pdf">On the Usefulness of Query Features for Learning to Rank</a>. Craig Macdonald, Rodrygo Santos and Iadh Ounis. In <i>Proceedings of CIKM 2012</i>.<br />
<br />
<a href="http://www.dcs.gla.ac.uk/~craigm/publications/tonellotto2012selective.pdf">Efficient and effective retrieval using selective pruning</a>. Nicola Tonellotto, Craig Macdonald and Iadh Ounis. In <i>Proceedings of WSDM 2013</i>.</div>
<div style="text-align: start;">
<br /></div>
<div style="text-align: start;">
<a href="http://www.dcs.gla.ac.uk/~craigm/publications/macdonald12inrt_ltr.pdf">The Whens and Hows of Learning to Rank</a>. Craig Macdonald, Rodrygo Santos and Iadh Ounis. <i>Information Retrieval Journal</i>, 2012.</div>
<div style="text-align: start;">
<br /></div>
<div style="text-align: start;">
<a href="http://www.dcs.gla.ac.uk/~craigm/publications/macdonald12effect.pdf">Effect of Dynamic Pruning Safety on Learning to Rank Effectiveness</a>. Craig Macdonald, Nicola Tonellotto, and Iadh Ounis. In <i>Proceedings of SIGIR 2012</i>.</div>
</div>
</div>
</div>
Craig Macdonaldhttp://www.blogger.com/profile/13764972230026912718noreply@blogger.com0tag:blogger.com,1999:blog-6043705792807544709.post-67768713363375215672013-03-25T15:12:00.000+00:002013-03-27T15:13:19.513+00:00Explicit web search result diversification
<br />
<div class="MsoNormal" style="text-align: left;">
<div style="text-align: justify;">
A couple of weeks ago I successfully defended my PhD thesis at the <a href="http://www.gla.ac.uk/schools/computing/">School of Computing Science</a> of the <a href="http://www.gla.ac.uk/">University of Glasgow</a>. The thesis, entitled <i><a href="http://theses.gla.ac.uk/4106/">“Explicit web search result diversification”</a></i>, was unconditionally approved with no corrections by the examination board.</div>
</div>
<div class="MsoNormal" style="text-align: left;">
<div style="text-align: justify;">
<br /></div>
</div>
<div class="MsoNormal" style="text-align: left;">
<div style="text-align: justify;">
The
thesis tackles the problem of ambiguity in web search queries. In particular, with
the enormous size of the Web, a misunderstanding of the information need
underlying an ambiguous query can misguide the search engine, ultimately
leading the user to abandon the originally submitted query. To overcome this
problem, a sensible approach is to diversify the documents retrieved for the
user's query. As a result, the likelihood that at least one of these documents
will satisfy the user's actual information need is increased.</div>
</div>
<div class="MsoNormal" style="text-align: left;">
<div style="text-align: justify;">
<br /></div>
</div>
<div class="MsoNormal" style="text-align: left;">
<div style="text-align: justify;">
In
the thesis, we argue that an ambiguous query should be seen as representing not
one, but <a href="http://terrierteam.dcs.gla.ac.uk/publications/santos2011ddr.pdf">multiple information needs</a>. Based upon this premise, we propose <a href="http://terrierteam.dcs.gla.ac.uk/publications/santos10www.pdf">xQuAD <span style="font-family: Cambria; font-size: 12pt;">– </span>E<b>x</b>plicit <b>Qu</b>ery <b>A</b>spect <b>D</b>iversification</a>,
a novel probabilistic framework for search result diversification. In
particular, the xQuAD framework naturally models several dimensions of the search
result diversification problem in a principled yet practical manner. To this
end, the framework represents the possible information needs underlying a query
as a set of keyword-based <i><a href="http://terrierteam.dcs.gla.ac.uk/publications/ecir2010_rodrygo_div.pdf">sub-queries</a></i>.
Moreover, xQuAD accounts for the overall <i><a href="http://terrierteam.dcs.gla.ac.uk/publications/santos2011sigir-a.pdf">coverage</a></i>
of each retrieved document with respect to the identified sub-queries, so as to
rank highly diverse documents first. In addition, it accounts for how well each
sub-query is covered by the other retrieved documents, so as to promote <i><a href="http://terrierteam.dcs.gla.ac.uk/publications/santos2012irj.pdf">novelty</a></i> <span style="font-family: Cambria; font-size: 12pt;">– </span>and hence penalise
redundancy <span style="font-family: Cambria; font-size: 12pt;">– </span>in the ranking. The framework also models the <i><a href="http://terrierteam.dcs.gla.ac.uk/publications/santos2013irj.pdf">importance</a></i> of each of the identified sub-queries, so as to
appropriately cater for the interests of the user population when diversifying
the retrieved documents. Finally, since not all queries are equally ambiguous,
the xQuAD framework caters for the ambiguity level of different queries, so as
to appropriately <a href="http://terrierteam.dcs.gla.ac.uk/publications/santos2010cikm.pdf">trade off relevance for diversity</a> on a per-query
basis.</div>
</div>
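<div style="text-align: justify;">
The greedy selection implied by the dimensions above can be sketched as follows. This is a minimal illustration rather than the reference implementation: the function name, the dictionary-based inputs and the toy probabilities are assumptions made for this example. Each candidate is scored as (1 - lambda) * P(d|q) + lambda * sum over sub-queries s of P(s|q) * P(d|s) * product over already selected documents d' of (1 - P(d'|s)), where the product term is what rewards novelty.</div>

```python
from typing import Dict, List


def xquad_rerank(
    relevance: Dict[str, float],            # P(d|q): relevance of each document to the query
    importance: Dict[str, float],           # P(s|q): importance of each sub-query
    coverage: Dict[str, Dict[str, float]],  # coverage[s][d] = P(d|s)
    lam: float = 0.5,                       # trade-off between relevance and diversity
    k: int = 10,
) -> List[str]:
    """Greedily pick up to k documents, balancing relevance and diversity."""
    selected: List[str] = []
    candidates = set(relevance)
    # not_covered[s] = product over selected documents d' of (1 - P(d'|s)).
    not_covered = {s: 1.0 for s in importance}

    def score(doc: str) -> float:
        diversity = sum(
            importance[s] * coverage.get(s, {}).get(doc, 0.0) * not_covered[s]
            for s in importance
        )
        return (1.0 - lam) * relevance[doc] + lam * diversity

    while candidates and len(selected) < k:
        # Sorting first makes tie-breaking deterministic (lowest id wins).
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
        # Sub-queries covered by the chosen document now count for less (novelty).
        for s in importance:
            not_covered[s] *= 1.0 - coverage.get(s, {}).get(best, 0.0)
    return selected


# Toy data (hypothetical): d1 and d2 cover the same sub-query, d3 covers another.
relevance = {"d1": 0.9, "d2": 0.8, "d3": 0.5}
importance = {"s1": 0.5, "s2": 0.5}
coverage = {"s1": {"d1": 0.9, "d2": 0.9}, "s2": {"d3": 0.9}}
print(xquad_rerank(relevance, importance, coverage, lam=0.5, k=3))
```

<div style="text-align: justify;">
With lambda set to 0 the ranking falls back to the relevance order; with lambda at 0.5 the toy document d3 is promoted above the redundant d2, because d1 has already covered the sub-query that d2 addresses.</div>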
<div class="MsoNormal" style="text-align: justify; text-justify: inter-ideograph;">
<o:p></o:p></div>
<div class="MsoNormal" style="text-align: left;">
<div style="text-align: justify;">
<br /></div>
</div>
<div class="MsoNormal" style="text-align: left;">
<div style="text-align: justify;">
The
xQuAD framework is general and can be used to instantiate several
diversification models, including the most prominent models described in the
literature. In particular, within xQuAD, each of the aforementioned dimensions
of the search result diversification problem can be tackled in a variety of
ways. In this thesis, as additional contributions besides the xQuAD framework,
we introduce novel machine learning approaches for addressing each of these
dimensions. These include a learning to rank approach for <a href="http://terrierteam.dcs.gla.ac.uk/publications/santos2013irj.pdf">identifying effective sub-queries</a> as query suggestions mined from a query log, an intent-aware
approach for choosing the ranking models most likely to be effective for
<a href="http://terrierteam.dcs.gla.ac.uk/publications/santos2011sigir-a.pdf">estimating the coverage and novelty</a> of multiple documents with respect to a
sub-query, and a selective approach for automatically predicting <a href="http://terrierteam.dcs.gla.ac.uk/publications/santos2010cikm.pdf">how much to diversify</a> the documents retrieved for each individual query. In addition, we
perform the first empirical analysis of <a href="http://terrierteam.dcs.gla.ac.uk/publications/santos2012irj.pdf">the role of novelty</a> as a diversification
strategy for web search.</div>
<o:p></o:p></div>
<div class="MsoNormal" style="text-align: left;">
<div style="text-align: justify;">
<br /></div>
</div>
<div class="MsoNormal" style="text-align: left;">
<div style="text-align: justify;">
As
demonstrated throughout the thesis, the principles underlying the xQuAD
framework are general, sound, and effective. In particular, to validate the
contributions of this thesis, we thoroughly assess the effectiveness of xQuAD
under the standard experimentation paradigm provided by the diversity task of
the <a href="http://plg.uwaterloo.ca/~trecweb/">TREC 2009, 2010, and 2011 Web tracks</a>. The results of this investigation
demonstrate the effectiveness of our proposed framework. Indeed, xQuAD attains
consistent and significant improvements in comparison to the most effective
diversification approaches in the literature, and across a range of
experimental conditions, comprising multiple input rankings, multiple sub-query
generation and coverage estimation mechanisms, as well as queries with multiple
levels of ambiguity.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
These investigations led to the <a href="http://www.dcs.gla.ac.uk/~rodrygo/publications.html">publication of 12 peer-reviewed research papers and 5 evaluation forum reports</a> directly related to the thesis. Moreover, the thesis opened up directions for other researchers, who deployed and extended the xQuAD framework for different applications, and inspired a series of workshops on <a href="http://www.dcs.gla.ac.uk/workshops/ddr2012/">Diversity in Document Retrieval</a> as well as a research track at the internationally renowned <a href="http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings9/NTCIR/01-NTCIR9-OV-INTENT-SongR.pdf">NTCIR</a> forum. From a practical perspective, xQuAD has been subjected to scrutiny from the research community as a regular contender in both <a href="http://plg.uwaterloo.ca/~trecweb/">TREC</a> and <a href="http://www.thuir.org/intent/ntcir9/">NTCIR</a>. As the winning entry in all editions of the diversity task of the TREC Web track (best cat. B submission in <a href="http://plg.uwaterloo.ca/~trecweb/2009.html">TREC 2009</a> and <a href="http://plg.uwaterloo.ca/~trecweb/2010.html">TREC 2010</a>; best overall submission in <a href="http://plg.uwaterloo.ca/~trecweb/2011.html">TREC 2011</a> and <a href="http://plg.uwaterloo.ca/~trecweb/2012.html">TREC 2012</a>), the xQuAD framework has, we believe, secured its place in the state of the art.</div>
</div>
<div class="MsoNormal" style="text-align: left;">
<div style="text-align: justify;">
<br /></div>
</div>
<div class="MsoNormal" style="text-align: left;">
<div style="text-align: justify;">
The
thesis is now available online at <a href="http://theses.gla.ac.uk/4106/">http://theses.gla.ac.uk/4106/</a>.
In addition, a reference implementation of the xQuAD framework will feature in the
next major release of the open-source <a href="http://terrier.org/">Terrier Information Retrieval Platform</a>.<o:p></o:p></div>
</div>
<!--EndFragment-->Rodrygo L.T. Santoshttp://www.blogger.com/profile/09502952528669992135noreply@blogger.com0tag:blogger.com,1999:blog-6043705792807544709.post-79096078045597128402013-03-21T16:34:00.001+00:002013-03-21T16:34:26.424+00:00Learning to Rank Research using Terrier - The Importance of the Sample (Part 1)<div style="text-align: justify;">
This is the first of two blog posts addressing some of our recent research in learning to rank. In particular, in recent years, the information retrieval (IR) field has experienced a paradigm shift in the application of machine learning techniques to achieve effective ranking models. A few years ago, <a href="http://dl.acm.org/citation.cfm?id=1390334.1390348">we</a> were using hill-climbing optimisation techniques such as simulated annealing to optimise the parameters in weighting models, such as BM25 or <a href="http://en.wikipedia.org/wiki/Divergence-from-randomness_model">PL2</a>, or latterly <a href="http://trec.nist.gov/pubs/trec13/papers/microsoft-cambridge.web.hard.pdf">BM25F</a> or <a href="http://ims-sites.dei.unipd.it/documents/71612/86362/CLEF2005wn-WebCLEF-MacdonaldEt2005.pdf">PL2F</a>. Instead, driven first by commercial search engines, IR is increasingly adopting a feature-based approach, where various mini-hypotheses are represented as numerical features, and <a href="http://en.wikipedia.org/wiki/Learning_to_rank">learning to rank techniques</a> are deployed to decide their importance in the final ranking formulae.</div>
<br />
<div style="text-align: justify;">
The typical approach for ranking is described in the following figure from our recently presented <a href="http://www.dcs.gla.ac.uk/~craigm/publications/tonellotto2012selective.pdf">WSDM 2013 paper</a>:</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://www.dcs.gla.ac.uk/~craigm/publications/tonellotto2012selective.pdf" style="margin-left: auto; margin-right: auto;"><img border="0" height="166" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgJxHbLUVBjaVH_akdM6-FOMPDHj9XO4fhakH-3bEN11WDn7HGY3vpAx3YwJBcy-fJ-5PQDCfz5QcfinkcBZ2W57BHLESgpZXVRbC0yqFtKZa8Ma_wVblxhJrX2cwRK5rMDHno8pY8IaUxQ/s400/phases.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Phases of a retrieval system deploying learning to rank, taken from Tonellotto et al, WSDM 2013.</td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: justify;">
<span id="goog_368356707"></span><span id="goog_368356708"></span><a href="http://www.blogger.com/"></a></div>
<div style="text-align: justify;">
In particular, there are typically three phases:</div>
<ol>
<li style="text-align: justify;"><b>Top K Retrieval</b> - a number of top-ranked documents, known as the <i>sample</i>, are identified.</li>
<li style="text-align: justify;"><b>Feature Extraction</b> - various features are calculated for each of the sample documents.</li>
<li style="text-align: justify;"><b>Learned Model Application </b>- the learned model obtained from a learning to rank technique re-ranks the sample documents to better satisfy the user.</li>
</ol>
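<div style="text-align: justify;">
The three phases can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the first pass counts query terms in the document body instead of using a weighting model such as BM25, the two features and the linear model weights are invented for the example, and none of the function names correspond to Terrier's API.</div>

```python
from typing import Dict, List, Sequence

Document = Dict[str, str]  # hypothetical document: field name -> text


def term_count(query: str, text: str) -> int:
    """Number of occurrences of the query terms in a piece of text."""
    tokens = text.lower().split()
    return sum(tokens.count(term) for term in query.lower().split())


def top_k_retrieval(query: str, docs: Dict[str, Document], k: int) -> List[str]:
    """Phase 1: identify the sample with a cheap first-pass score (body only)."""
    scores = {doc_id: term_count(query, doc["body"]) for doc_id, doc in docs.items()}
    return sorted(scores, key=lambda d: (-scores[d], d))[:k]


def extract_features(query: str, doc: Document) -> List[float]:
    """Phase 2: compute features for a sample document (body and title matches)."""
    return [float(term_count(query, doc["body"])), float(term_count(query, doc["title"]))]


def apply_learned_model(weights: Sequence[float], features: Sequence[float]) -> float:
    """Phase 3: score with a learned model, here a plain linear combination."""
    return sum(w * f for w, f in zip(weights, features))


def rank(query: str, docs: Dict[str, Document], weights: Sequence[float], k: int) -> List[str]:
    sample = top_k_retrieval(query, docs, k)
    final = {d: apply_learned_model(weights, extract_features(query, docs[d])) for d in sample}
    return sorted(sample, key=lambda d: (-final[d], d))


docs = {
    "d1": {"title": "terrier", "body": "terrier terrier search"},
    "d2": {"title": "dogs", "body": "terrier terrier terrier dogs"},
    "d3": {"title": "terrier terrier", "body": "cats"},
}
weights = [1.0, 5.0]  # made-up weights standing in for a learned model
print(rank("terrier", docs, weights, k=2))
print(rank("terrier", docs, weights, k=3))
```

<div style="text-align: justify;">
In this toy collection, d3 only reaches the learned model when the sample holds three documents, at which point its title matches place it first. With a sample of two it can never be re-ranked, however good its features are.</div>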
<h3 style="text-align: justify;">
The Sample</h3>
<div>
<div>
<div style="text-align: justify;">
The set of top K documents selected within the first retrieval phase is called the sample by <a href="http://www.nowpublishers.com/product.aspx?product=INR&doi=1500000016">Liu</a>, even though the selected documents are not i.i.d. Indeed, in selecting the sample, <a href="http://www.nowpublishers.com/product.aspx?product=INR&doi=1500000016">Liu</a> suggested that taking the top K documents ranked by a simple weighting model such as BM25 is not the best approach, but is sufficient for effective learning to rank. However, the size of the sample - i.e. the number of documents to be re-ranked by the learned model - is an important parameter: with fewer documents, the first pass retrieval can be made more efficient by the use of dynamic pruning strategies (e.g. <a href="http://dl.acm.org/citation.cfm?id=956944">WAND</a>); on the other hand, too few documents may result in insufficient relevant documents being retrieved, and hence effectiveness being degraded.<br />
<br />
Our article <a href="http://www.dcs.gla.ac.uk/~craigm/publications/macdonald12inrt_ltr.pdf">The Whens and Hows of Learning to Rank</a> in the Information Retrieval Journal studied the sample size parameter for many topic sets and learning to rank techniques. For the mixed information needs on the TREC ClueWeb09 collection, we found that while a sample size of 20 documents was sufficient for effective performance according to ERR@20, larger sample sizes of thousands of documents were needed for effective NDCG@20; for navigational information needs, predominantly larger sample sizes (up to 5000 documents) were needed. Moreover, the particular document representation used to identify the sample was shown to have an impact on effectiveness - indeed, navigational queries were found to be considerably easier (requiring smaller samples) when anchor text was used, but for informational queries, the opposite was observed. In the article, we examined these issues in detail, across a number of test collections and learning to rank techniques, and investigated the role of the evaluation measure and its rank cutoff for listwise techniques. For in-depth details and conclusions, see the IR Journal article. </div>
</div>
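<div style="text-align: justify;">
For readers less familiar with the two measures contrasted above, here is a minimal sketch of ERR@k and NDCG@k using their commonly cited definitions, with a graded gain of 2^g - 1. The function names and the toy relevance grades are chosen for this example only and do not come from the article.</div>

```python
import math
from typing import Sequence


def dcg_at_k(grades: Sequence[int], k: int) -> float:
    """Discounted cumulative gain over the first k relevance grades."""
    return sum((2 ** g - 1) / math.log2(rank + 1)
               for rank, g in enumerate(grades[:k], start=1))


def ndcg_at_k(ranked_grades: Sequence[int], judged_grades: Sequence[int], k: int) -> float:
    """DCG@k normalised by the DCG@k of an ideal ordering of all judged documents."""
    ideal = dcg_at_k(sorted(judged_grades, reverse=True), k)
    return dcg_at_k(ranked_grades, k) / ideal if ideal > 0 else 0.0


def err_at_k(ranked_grades: Sequence[int], k: int, max_grade: int) -> float:
    """Expected reciprocal rank: the user stops at a rank with a grade-dependent probability."""
    p_reach = 1.0  # probability that the user gets as far as the current rank
    err = 0.0
    for rank, g in enumerate(ranked_grades[:k], start=1):
        p_stop = (2 ** g - 1) / (2 ** max_grade)
        err += p_reach * p_stop / rank
        p_reach *= 1.0 - p_stop
    return err


# Toy ranking (hypothetical): relevant documents at ranks 1 and 3, binary grades.
ranking = [1, 0, 1]
print(err_at_k(ranking, k=3, max_grade=1))
print(ndcg_at_k(ranking, judged_grades=[1, 1, 0], k=3))
```

<div style="text-align: justify;">
In ERR, every relevant document lowers the weight of everything ranked below it, whereas NDCG credits each relevant document in the top k according to its rank alone. This difference in definitions is one lens through which to read the sample size findings; the article itself remains the place for the actual analysis.</div>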
<div>
<div style="text-align: justify;">
<br /></div>
</div>
<div>
<div style="text-align: justify;">
Dynamic pruning strategies such as WAND are generally configured to be <i>safe-to-rank-K</i>, which means that the top K documents and their ordering are exactly those that exhaustive scoring would return, so the effectiveness of the sample is not degraded. Alternatively, they can be configured to prune in an unsafe, more aggressive manner, which can degrade effectiveness by changing the retrieved documents. While the safety of WAND has previously been shown not to have a great impact on the effectiveness of the retrieved (sample) documents, in <a href="http://dcs.gla.ac.uk/~craigm/publications/macdonald12effect.pdf">our SIGIR 2012 poster</a>, we showed that the impact on the effectiveness of the documents after re-ranking by application of a learned model could be marked. Moreover, this poster also investigated biases in the retrieved documents that are manifest in WAND when configured for unsafe pruning. For further details, please see our SIGIR 2012 poster.</div>
</div>
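<div style="text-align: justify;">
The difference between safe and unsafe pruning can be illustrated with a deliberately simplified sketch. It is not WAND itself: real WAND uses pivot selection and skips within sorted posting lists, whereas this version merely visits documents in docid order and compares the sum of per-term score upper bounds against the current top K threshold multiplied by a factor. The function name, the in-memory posting format and the toy scores are all assumptions for this example. A factor of 1.0 is safe; larger factors prune more aggressively.</div>

```python
import heapq
from typing import Dict, List, Tuple


def pruned_top_k(
    postings: Dict[str, Dict[int, float]],  # term -> {docid: score contribution}
    k: int,
    factor: float = 1.0,                    # 1.0 is safe; > 1.0 prunes unsafely
) -> Tuple[List[int], int]:
    """Return the top-k docids and the number of documents that were fully scored."""
    # Upper bound on the contribution of each term to any document score.
    upper = {term: max(plist.values()) for term, plist in postings.items()}
    heap: List[Tuple[float, int]] = []  # min-heap of (score, docid) for the current top k
    scored = 0
    for doc in sorted({d for plist in postings.values() for d in plist}):
        terms = [t for t, plist in postings.items() if doc in plist]
        if len(heap) == k:
            bound = sum(upper[t] for t in terms)
            # Skip the document if even its best possible score cannot beat the threshold.
            if bound <= factor * heap[0][0]:
                continue
        score = sum(postings[t][doc] for t in terms)
        scored += 1
        if len(heap) < k:
            heapq.heappush(heap, (score, doc))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc))
    ranked = sorted(heap, key=lambda entry: (-entry[0], entry[1]))
    return [doc for _, doc in ranked], scored


# Toy index (hypothetical scores): document 3 is the best match overall.
postings = {
    "a": {1: 1.0, 2: 3.0, 3: 2.0, 4: 2.5},
    "b": {1: 1.0, 3: 2.0, 4: 0.2},
}
print(pruned_top_k(postings, k=2, factor=1.0))  # safe
print(pruned_top_k(postings, k=2, factor=3.0))  # unsafe
```

<div style="text-align: justify;">
On this toy index, the safe setting scores all four documents and returns documents 3 and 2, while the unsafe setting scores only two documents and returns documents 2 and 1. Documents that enter the heap early raise the bar for later ones, which hints at how aggressive pruning can change the composition of the sample.</div>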
<div>
<div style="text-align: justify;">
<br /></div>
</div>
<div>
<div style="text-align: justify;">
The number of documents needed in the sample clearly varies from query to query. In our <a href="http://www.dcs.gla.ac.uk/~craigm/publications/tonellotto2012selective.pdf">WSDM 2013 paper</a>, we proposed <i>selective pruning</i>, whereby the size of the sample and the aggressiveness of the WAND pruning strategy used to create it are altered on a per-query basis. This permits retrieval that is both effective and efficient. Indeed, by using selective pruning, we showed that mean response time could be improved by 36% and the response times experienced by the slowest 10% of queries could be reduced by 50%, while still maintaining significantly high effectiveness. The full paper investigates the effect of unsafe pruning on both efficiency and effectiveness, as well as different ways to make the decision for selective pruning - see the WSDM 2013 paper for more details.<br />
<br /></div>
</div>
</div>
<div style="text-align: justify;">
In the next blog post (Part 2), I'll be looking at more details about the choice of features within learning to rank.</div>
<div>
<h3>
Key References</h3>
</div>
<div>
<a href="http://www.dcs.gla.ac.uk/~craigm/publications/tonellotto2012selective.pdf">Efficient and effective retrieval using selective pruning</a>. Nicola Tonellotto, Craig Macdonald and Iadh Ounis. In <i>Proceedings of WSDM 2013</i>.</div>
<div>
<br /></div>
<div>
<a href="http://www.dcs.gla.ac.uk/~craigm/publications/macdonald12inrt_ltr.pdf">The Whens and Hows of Learning to Rank</a>. Craig Macdonald, Rodrygo Santos and Iadh Ounis. <i>Information Retrieval Journal</i>, 2012.</div>
<div>
<br /></div>
<div>
<a href="http://www.dcs.gla.ac.uk/~craigm/publications/macdonald12effect.pdf">Effect of Dynamic Pruning Safety on Learning to Rank Effectiveness</a>. Craig Macdonald, Nicola Tonellotto, and Iadh Ounis. In <i>Proceedings of SIGIR 2012</i>.</div>
Craig Macdonaldhttp://www.blogger.com/profile/13764972230026912718noreply@blogger.com0tag:blogger.com,1999:blog-6043705792807544709.post-40889672267858280522012-07-27T14:39:00.000+01:002012-07-27T14:49:23.362+01:00SMART: An open source framework for searching the physical world<div style="text-align: justify;">
Some of our readers are probably aware of our new project <a href="http://www.smartfp7.eu/">SMART</a>, which aims to develop a new technology for the real-time indexing and retrieval of sensor and social streams. This three-year project is funded by the European Commission under the <a href="http://cordis.europa.eu/fp7/home_en.html">Seventh Framework Programme</a> (grant number 287583). The project, which started in November 2011, has already received extensive national and international <a href="http://terrierteam.blogspot.co.uk/2012/06/smart-way-to-search-your-city.html">press coverage</a> in online and print news over the last month. The <a href="http://www.bbc.co.uk/news/technology-18408123">BBC</a> will shortly be broadcasting a television piece about the project.<br />
<br />
The name of the project and the resulting search engine, SMART, acknowledges the vision of the <a href="http://en.wikipedia.org/wiki/Internet_of_Things">Internet of Things</a> in general, and the concept of <a href="http://en.wikipedia.org/wiki/Smart_city"><i>smart cities</i></a> in particular. Indeed, SMART builds on the growing trend of smart cities, where in addition to physical infrastructure (roads, buildings), digital knowledge infrastructure is deployed to serve the needs of the citizens and local governments. The backbone of the digital knowledge infrastructure is mainly composed
of sensors such as cameras, microphone arrays, or other environmental
sensors, from weather to parking sensors. For example, in "smart
cities", drivers can be notified where it is good to park their car or
where to avoid traffic jams in the city centre at any time of the day. The main idea of the SMART project is to connect these sensors to the Internet and to use search technologies that allow citizens to benefit from the information that these sensors can provide in real time.</div>
<div style="text-align: justify;">
<br />
The SMART search engine builds upon the <a href="http://terrier.org/">Terrier Information Retrieval platform</a>, and exemplifies our <a href="http://terrierteam.blogspot.co.uk/2012/07/from-puppy-to-maturity-experiences-in.html">recent move</a> towards building new, separate and tailored products on top of the Terrier platform. In particular, Terrier has been enhanced and expanded with real-time indexing and a scalable distributed architecture, allowing it to process and handle a large volume of continuous and parallel streams.<br />
<br />
SMART is multi-disciplinary in nature, encompassing state-of-the-art technologies from <a href="http://researcher.watson.ibm.com/researcher/view.php?person=il-ZVI">audio</a> & <a href="http://www.ait.gr/ait_web_site/faculty/apne/pnevmatikakis.html">video</a> processing, social search and <a href="http://www.iis.ee.ic.ac.uk/%7Ej.pitt/Home.html">reasoning</a>. Building upon these technologies, SMART analyses the input from sensors in real time, for example
to detect large crowds, or whether live music can be heard. These observations can be compared with recent posts on social networks from the same area, to see whether the system can learn more about what is happening in the area around the sensors. By analysing the sensors across
multiple locations within the city, when a user asks “<i style="mso-bidi-font-style: normal;">what’s happening near me</i>”, the system has some idea of which locations
have the most interesting events.<br />
<br />
Clearly, making real-world events searchable can have privacy and ethics implications. In fact, never before in our research have we been confronted with such a dichotomy between what is technologically feasible and what we conceive to be ethical. That's why we and our partners in the project are carefully considering privacy issues in our research. Indeed, we are working closely with various national Data Protection Authorities (DPAs) (i) to ensure that we don’t overstep the legal or ethical boundaries of privacy and (ii) to provide guidelines on the ethical implications of the SMART technologies and help prospective deployers to use and deploy SMART in a legal, ethical, and friendly manner. Interested readers can consult the first issue of the <a href="http://www.smartfp7.eu/sites/default/files/field/files/events/1stSMARTNewsletter.pdf">SMART Newsletter</a> for further details about our ongoing efforts on privacy.<br />
<br />
While we will be trialling the SMART search technology in <a href="http://www.ayto-santander.es/">The City of Santander</a> (Spain), the key infrastructure of SMART (including the search components based on Terrier) will be made available as open source, encapsulating a vision whereby other smart cities can easily become involved and benefit from the project's outcomes. We expect the first release of the SMART search technology to become available as open source under the <a href="http://www.mozilla.org/MPL/">Mozilla Public License</a> (MPL) 2.0 by the end of 2012. By releasing parts of SMART as open source, we aim to allow the formation of a community of early adopters that will be key for evaluating and sustaining the project.<br />
<br />
With this in mind, we have just published a paper in the <a href="http://opensearchlab.otago.ac.nz/">SIGIR 2012 Open Source Information Retrieval (OSIR 2012) workshop</a> describing our current progress in the project as well as the open source vision of the project:</div>
<div style="text-align: justify;">
<br />
<a href="http://www.smartfp7.eu/sites/default/files/field/files/page/smart_OSIR.pdf">SMART: An open source framework for searching the physical world</a>.
M-Dyaa Albakour, Craig Macdonald, Iadh Ounis, Aristodemos Pnevmatikakis
and John Soldatos. In Proceedings of the SIGIR 2012 Workshop on Open
Source Information Retrieval. <span class="info">Portland, Oregon, USA. </span>August 2012.<br />
<br />
As always, we welcome comments and contributions from smart cities, community members and developers to the SMART vision.</div>Iadh Ounishttp://www.blogger.com/profile/05740425172350940695noreply@blogger.com0tag:blogger.com,1999:blog-6043705792807544709.post-86355794873754691392012-07-25T16:29:00.001+01:002012-08-03T16:31:35.159+01:00From Puppy to Maturity: Experiences in Developing Terrier<div style="text-align: justify;">
We will be taking part in the <a href="http://opensearchlab.otago.ac.nz/">SIGIR 2012 Workshop on Open Source Information Retrieval</a>. In particular, we have published a paper on the <a href="http://terrier.org/">Terrier open source information retrieval platform</a>, detailing the vision behind the platform, some recent developments in Terrier, as well as a roadmap for future releases.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
As always, our vision for the Terrier platform is to continue empowering
researchers and practitioners in information retrieval (IR) with up-to-date,
easily adaptable, effective and scalable indexing and search approaches, allowing them to build
and evaluate the next generation of IR applications. </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
In particular, Terrier will be moving towards feature-based retrieval, in line with the increasing importance of the <a href="http://en.wikipedia.org/wiki/Learning_to_rank">learning-to-rank</a> paradigm in modern information retrieval where machine-learned ranking functions combining multiple features are deployed. To do so, Terrier will be supporting the efficient and effective extraction of query-independent and query-dependent features.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
To support scalability and efficiency, Terrier's data structures have undergone a major enhancement to support <a href="http://dl.acm.org/citation.cfm?id=2037662">advanced dynamic pruning techniques</a>, as well as the development of applications requiring distributed and real-time indexing and retrieval such as Twitter search.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Finally, the growth of the Terrier platform over the past decade into exciting new areas such as MapReduce indexing and crowdsourcing brings increased functionality, but also increased platform complexity. To avoid software bloat, we are moving from a monolithic release structure to a system of periodic core releases and timely plugin expansions. The first such release will be the <a href="http://www.dcs.gla.ac.uk/%7Ecraigm/publications/mccreadie12crowdterrier.pdf">CrowdTerrier</a> plugin, providing researchers with an out-of-the-box tool to achieve fast and cheap relevance assessments.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
A more comprehensive account of the forthcoming Terrier releases is detailed in our paper below: </div>
<div style="text-align: justify;">
<br />
<a href="http://www.dcs.gla.ac.uk/%7Ecraigm/publications/macdonald12terrier.pdf">From Puppy to Maturity: Experiences in Developing Terrier</a>. Craig
Macdonald, Richard McCreadie, Rodrygo Santos and Iadh Ounis. In Proceedings of the SIGIR
2012 Workshop on Open Source Information Retrieval. Portland, Oregon, USA. <span class="info"></span>August 2012 </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
We hope to see many colleagues <a href="http://ir.dcs.gla.ac.uk/wiki/Terrier/Contribute">joining us</a> to work towards the objectives of the platform and enriching its functionalities. As always, we welcome suggestions and any feedback on the roadmap in the run up to the forthcoming Terrier 4.0. </div>Iadh Ounishttp://www.blogger.com/profile/05740425172350940695noreply@blogger.com0tag:blogger.com,1999:blog-6043705792807544709.post-48098956859847942732012-06-25T20:28:00.000+01:002012-06-26T12:34:20.043+01:00Efficiency, Effectiveness, Medical Search, Dataset Development and Crowdsourcing at SIGIR 2012<div style="text-align: justify;">
The <a href="http://terrierteam.dcs.gla.ac.uk/">TerrierTeam</a> will be well represented at <a href="http://www.sigir.org/sigir2012/">SIGIR 2012</a> this year with a full paper, four posters, a demonstration and a workshop, covering a wide range of disciplines within the field of information retrieval. For those of you interested in Web search efficiency, we have a number of contributions to look for. Our full paper <a href="http://www.dcs.gla.ac.uk/%7Ecraigm/publications/macdonald12predict.pdf">Learning to Predict Response Times for Online Query Scheduling</a> defines the new area of <i>query efficiency prediction</i>. In particular, it postulates that not every query takes the same time to complete, particularly where efficient dynamic pruning strategies such as WAND are used to reduce retrieval latency. In our paper, we show and explain why queries with similar properties (e.g. posting list lengths) can have markedly different response times. We use these explanations to propose a learned approach for query efficiency prediction that can accurately predict the response time of a query <i>before</i> it is executed. Furthermore, we show that using query efficiency prediction can markedly increase the efficiency of query routing within a search engine that uses multiple replicated indices. Relatedly, our poster <a href="http://www.dcs.gla.ac.uk/%7Ecraigm/publications/freire12scheduling.pdf">Scheduling Queries Across Replicas</a> builds upon our work on query efficiency prediction, to show how a replicated and distributed search engine can be improved by the application of response time predictions. In particular, the response time predictions are used to estimate the workload of each replica of each index shard. Then each newly arrived query can be routed to the replica of each index shard that will be ready to process the query earliest. </div>
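<div style="text-align: justify;">
The routing idea from the poster can be sketched for a single index shard as follows. This is a minimal illustration, not the system described in the poster: the class name and its bookkeeping are assumptions for this example, and the predicted response times are supplied as plain numbers where a real deployment would obtain them from a learned query efficiency predictor.</div>

```python
from typing import List


class ReplicaScheduler:
    """Routes each query to the replica predicted to be ready to process it earliest."""

    def __init__(self, num_replicas: int) -> None:
        # Predicted time at which each replica finishes its queued work.
        self.busy_until: List[float] = [0.0] * num_replicas

    def route(self, arrival_time: float, predicted_time: float) -> int:
        # A replica is ready either now or once its predicted backlog has drained.
        ready = [max(arrival_time, busy) for busy in self.busy_until]
        replica = min(range(len(ready)), key=lambda i: ready[i])
        # Add the predicted response time of this query to that replica's workload.
        self.busy_until[replica] = ready[replica] + predicted_time
        return replica


# Four queries arriving together; the first one is predicted to be slow.
scheduler = ReplicaScheduler(num_replicas=2)
predicted_times = [5.0, 1.0, 1.0, 1.0]  # hypothetical predictions, in arbitrary units
print([scheduler.route(0.0, t) for t in predicted_times])
```

<div style="text-align: justify;">
In this example the three short queries all go to the second replica, so none of them waits behind the slow query, whereas a round-robin policy would have queued one of them behind it.</div>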
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
At SIGIR this year we also present recent work examining both efficiency and effectiveness. Dynamic pruning strategies, such as WAND, can increase efficiency by omitting the scoring of documents that can be guaranteed not to make the top-K retrieved set - a feature known as safeness. Broder et al. showed how WAND could be made more efficient by relaxing the safeness guarantee, with little impact on the top-ranked documents. Through experiments on the TREC ClueWeb09 corpus and 33 query-dependent and query-independent features, our poster <a href="http://www.dcs.gla.ac.uk/%7Ecraigm/publications/macdonald12effect.pdf">Effect of Dynamic Pruning Safety on Learning to Rank Effectiveness</a> shows that relaxing safeness to aid efficiency can have an unexpectedly large impact on retrieval effectiveness when combined with modern learning to rank models, in contrast to the earlier work by Broder et al. In particular, we show that the inherent bias of unsafe WAND towards documents with lower docids can markedly impact the effectiveness of learned models. </div>
<div style="text-align: justify;">
<br />
Those interested in the medical search domain, in particular participants in the TREC Medical track, will be interested in our paper entitled <a href="http://ir.dcs.gla.ac.uk/terrier/publications/limsopatham2012sigir.pdf">Exploiting Term Dependence while Handling Negation in Medical Search</a>. We show that it is important to handle negation in medical records - in particular, when searching for cohorts (groups of patients) with specific symptoms, our approach ensures that patients known <i>not </i>to have exhibited particular symptoms are not retrieved. Our results demonstrate that appropriate negation handling can increase retrieval effectiveness, particularly when the dependence between negated terms is considered using a <a href="http://terrierteam.dcs.gla.ac.uk/publications/p843-peng.pdf">term dependence model</a> from the <a href="http://ir.dcs.gla.ac.uk/wiki/DivergenceFromRandomness">Divergence From Randomness framework</a>. </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Our poster <a href="http://www.dcs.gla.ac.uk/%7Ecraigm/publications/mccreadie12tweets.pdf">On Building a Reusable Twitter Corpus</a> tackles an important issue raised during the creation of the Tweets11 dataset as part of the TREC Microblog track, namely how reusable Tweets11 is, given the dynamics of Twitter. Our poster shows that corpus degradation due to deleted tweets does <i>not</i> affect the ranking of systems that participated in the TREC 2011 Microblog track. Meanwhile, we are also demonstrating the first release of a new extension to our <a href="http://terrier.org/">Terrier IR platform</a>, namely CrowdTerrier, an infrastructure addition that enables relevance assessments to be created in a fast semi-automatic manner using crowdsourcing. <a href="http://terrier.org/crowdterrier">CrowdTerrier</a> will be made available for download soon.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Finally, together with a group representing six open source IR systems, we are involved in the organisation of a <a href="http://opensearchlab.otago.ac.nz/">SIGIR'12 workshop on Open Source Information Retrieval</a>. The workshop aims to provide a forum for users and authors of open source IR tools to get together, and to work together to build
<i>OpenSearchLab</i>, an open source, live and functioning, online web search
engine for research purposes and discuss the joint future.</div>Iadh Ounishttp://www.blogger.com/profile/05740425172350940695noreply@blogger.com0Glasgow, Glasgow City, UK55.864237 -4.25180655.792955000000006 -4.4097345 55.935519 -4.0938775000000005tag:blogger.com,1999:blog-6043705792807544709.post-41689317002298021642012-06-19T15:43:00.000+01:002012-06-19T15:43:29.529+01:00A SMART way to Search your City<div style="text-align: justify;">
TerrierTeam is currently expanding its outreach into social and sensor-based search systems as part of the ongoing <a href="http://www.smartfp7.eu/">SMART</a> EU-funded project (FP7 287583). SMART aims to develop an open source search framework for multimedia data stemming from the physical world and social streams such as Twitter. The end-goal is to be able to answer location and time-sensitive queries such as “<i>where can I go to listen to live music in the city centre tonight?</i>” or “<i>where are my friends hanging out in the city?</i>” by augmenting social media signals with live city sensor information.<br />
<br />
Our role in the SMART project is to develop fast and effective real-time search from the flood of information provided by social and city sensor streams on top of our open-source <a href="http://terrier.org/">Terrier information retrieval platform</a>. Indeed, a real-time Twitter search <a href="http://www.smartfp7.eu/content/twitter-indexing-demo">demo</a> illustrating incremental and distributed indexing and real-time retrieval in Terrier is now available. Try it at: <a href="http://demos.terrier.org/SMART/twittersearch/">http://demos.terrier.org/SMART/twittersearch/</a><br />
<br />
SMART has seen wide-ranging national, European and international coverage in online and print news media over the last week. Indeed, we are tracking over 100 articles and counting! Some sample articles can be found below:</div>
<ul>
<li><a class="ext" href="http://www.gla.ac.uk/news/headline_234495_en.html">University of Glasgow developing new type of internet search engine - University of Glasgow</a><span class="ext"></span> </li>
<li><a href="http://www.bbc.com/news/technology-18408123">Researchers work on smart city search engine - BBC</a></li>
<li><a href="http://www.newelectronics.co.uk/electronics-news/researchers-developing-smart-city-search-engine/42966">Scottish scientists build search engine for 'Internet of Things' - TechWorld </a></li>
<li><a href="http://www.techweekeurope.co.uk/news/smart-city-search-engine-uses-sensors-82093"><span class="date">Smart City Search Engine Uses</span> Sensors - TechWeekEurope UK</a></li>
<li><a href="http://www.newelectronics.co.uk/electronics-news/researchers-developing-smart-city-search-engine/42966">Researchers developing ‘smart city’ search engine - New Electronics</a></li>
<li><a href="http://blogs.wsj.com/tech-europe/2012/06/12/new-search-engine-combines-twitter-with-sensors/">Glasgow New Search Engine Combines Twitter with Sensors – Wall Street Journal</a></li>
</ul>
<div style="text-align: justify;">
A more detailed list of recent press coverage can be found at </div>
<div style="text-align: justify;">
<a href="http://www.smartfp7.eu/content/media-coverage-smart">http://www.smartfp7.eu/content/media-coverage-smart</a>.</div>Iadh Ounishttp://www.blogger.com/profile/05740425172350940695noreply@blogger.com0tag:blogger.com,1999:blog-6043705792807544709.post-64814441274435949312011-06-16T17:41:00.009+01:002011-06-16T18:31:15.706+01:00Terrier 3.5 releasedToday, we are proud to announce a brand new release of <a href="http://terrier.org/">Terrier</a>, our state-of-the-art open source information retrieval platform. Terrier 3.5 represents a significant update over its previous version (Terrier 3.0), including:<br /><ul><li><a href="http://terrier.org/docs/v3.5/javadoc/org/terrier/matching/daat/package-summary.html">Document-at-a-time (DAAT)</a> retrieval for large indices</li><li>Refactored <a href="http://terrier.org/docs/v3.5/javadoc/org/terrier/indexing/tokenisation/package-summary.html">tokenisation</a> for enhanced <a href="http://terrier.org/docs/v3.5/languages.html">multi-language support</a></li><li>Upgraded <a href="http://terrier.org/docs/v3.5/hadoop_configuration.html">Hadoop support</a> to version 0.20</li><li><a href="http://terrier.org/docs/v3.5/querylanguage.html">Synonym support</a> in query language and retrieval</li><li>Out-of-the box indexing support for <a href="http://terrier.org/docs/v3.5/terrier_http.html">query-biased summaries and improved example web-based interface</a></li><li>New, 2nd generation <a href="http://terrier.org/docs/v3.5/dfr_description.html">DFR models</a> as well as other recent effective information-theoretic models</li><li>Fully revised and improved <a href="http://terrier.org/docs/v3.5/">documentation</a><br /></li><li>Many more JUnit tests (now 300+)</li></ul>Check out the full <a href="http://terrier.org/docs/current/whats_new.html">change log</a> for this release and <a href="http://terrier.org/download">upgrade to Terrier 3.5</a>!<br /><br />Many thanks to everyone at the <a 
href="http://terrierteam.dcs.gla.ac.uk/">TerrierTeam</a> and all <a href="http://terrier.org/people.html">Terrier contributors</a> for their hard work making this release possible!Rodrygo L.T. Santoshttp://www.blogger.com/profile/09502952528669992135noreply@blogger.com0tag:blogger.com,1999:blog-6043705792807544709.post-62982717304759181362011-04-27T12:42:00.010+01:002012-06-26T09:45:37.849+01:00ECIR 2011 + DDR 2011 in Dublin<div style="text-align: justify;">
Last week, a few of us attended <a href="http://www.ecir2011.dcu.ie/">ECIR 2011</a> in Dublin. The conference was a resounding success both in terms of its program and organisation. Compared to last year, the event was very well attended, with about 250 delegates registered for the conference and/or its satellite events. The majority of delegates were from Ireland and the United Kingdom.</div>
<div style="text-align: justify;">
<br />
<span style="font-size: 130%; font-weight: bold;">Workshops</span><br />
<br /></div>
<div style="text-align: justify;">
The kick-off was on Monday, with a selection of workshops and tutorials at the fabulous <a href="http://www.guinness-storehouse.com/">Guinness Storehouse</a>. We attended the <a href="http://www.dcs.gla.ac.uk/workshops/ddr2011/">Diversity in Document Retrieval (DDR 2011) workshop</a>, jointly organised by <a href="http://www.dcs.gla.ac.uk/%7Ecraigm/">Craig Macdonald</a>, <a href="http://web4.cs.ucl.ac.uk/staff/jun.wang/blog/">Jun Wang</a>, and <a href="http://plg.uwaterloo.ca/%7Eclaclark/">Charlie Clarke</a>.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
The DDR workshop was at times a standing-room-only event and appeared to be the largest workshop of the conference. It was structured around three broad themes: <span style="font-style: italic; font-weight: bold;">evaluation</span>, <span style="font-style: italic; font-weight: bold;">modelling</span>, and <span style="font-style: italic; font-weight: bold;">applications</span>. Besides good keynotes by <a href="http://research.microsoft.com/en-us/people/tesakai/">Tetsuya Sakai</a> and <a href="http://disi.unitn.it/moschitti/">Alessandro Moschitti</a>, the workshop featured technical and position paper presentations, as well as a poster session and a breakout group discussion on all three workshop themes. While there was no agreement on a possible "killer application" for diversity, there was a consensus that diversity is best described or seen as the <span style="font-style: italic;">lack of context</span>. In addition, a few key points arose across the boundaries of the tackled themes:</div>
<ul style="text-align: justify;">
<li><span style="font-style: italic; font-weight: bold;">Representing diversity</span><br />How to best represent the possible multiple information needs underlying a query? Should this representation reflect the interests of the user population, or should it be itself diverse?</li>
<li><span style="font-style: italic; font-weight: bold;">Measuring diversity</span><br />What does diversity mean and how should it be promoted in different scenarios? The workshop featured some ideas for applications, including expert search, geographical IR, and graph summarisation.</li>
<li><span style="font-style: italic; font-weight: bold;">Unifying diversity</span><br />How to diversify across multiple search scenarios (e.g., multiple verticals of a search engine)? How to convey a summary relevant to multiple information needs in a single page of results?</li>
</ul>
<div style="text-align: justify;">
Some of these ideas are currently being investigated as part of the <a href="http://www.thuir.org/intent/ntcir9/">NTCIR-9 Intent</a> task. Charlie was also keen to consider these questions in future incarnations of the diversity task in the <a href="http://plg.uwaterloo.ca/%7Etrecweb/">TREC Web track</a>. During the workshop, <a href="http://www.dcs.gla.ac.uk/%7Erodrygo">Rodrygo</a> presented our position paper entitled "<a href="http://terrierteam.dcs.gla.ac.uk/publications/santos2011ddr.pdf">Diversifying for multiple information needs</a>". The full DDR <a href="http://www.dcs.gla.ac.uk/workshops/ddr2011/ddr2011.proceedings.pdf">workshop proceedings</a> are available online.<br />
<br />
Although we did not attend it, it is worth noting that the <a href="http://ir.cis.udel.edu/ECIR11Sessions/index.html">Information Retrieval Over Query Sessions</a> workshop, which was held at the same time as DDR, also received very positive feedback from its attendees.<br />
<br />
The workshops were followed by an excellent welcome reception where, to say the least, Guinness was not in short supply.<br />
<br />
<span style="font-size: 130%; font-weight: bold;">Conference</span><br />
<br />
On Tuesday, the main conference took over with a diverse (no pun intended) <a href="http://www.ecir2011.dcu.ie/program/">program</a>. The conference started with a thoughtful keynote by <a href="http://www.uta.fi/%7Elikaja/">Kalervo Järvelin</a> who urged the information retrieval community to see beyond the [search] box. The keynote led to some very interesting discussions about whether IR is a science or a technology (i.e. mostly about engineering). We would like to believe that it is science, although some delegates argued (sadly) for the opposite.<br />
<br />
<span class="fontbold font10">The second keynote was given by <a href="http://research.yahoo.com/Evgeniy_Gabrilovich">Evgeniy Gabrilovich</a>, winner of this year's</span> <span class="fontbold font10"><a href="http://irsg.bcs.org/ksjaward.php">KSJ Award</a>. </span><span class="fontbold font10">Evgeniy provided a very comprehensive overview of the fascinating computational advertising field, highlighting </span>the current state of the art and possible future research directions. We were encouraged to hear about the <a href="http://labs.yahoo.com/Academic_Relations/Faculty">Yahoo! Faculty Research and Engagement Program (FREP)</a>, which might allow academics to access the necessary datasets to conduct research in a field that has been thus far the sole territory of researchers based in industry.<br />
<br />
The last keynote talk was superbly given by <a href="http://www.cs.cornell.edu/people/tj/">Thorsten Joachims</a> about the value of user feedback. Thorsten convincingly argued for the importance of collecting user feedback as an intrinsic part of both the retrieval and learning processes. The talk highlighted how user feedback could improve the quality of retrieval and by how much. We hope that the slides will be made publicly available at some point.<br />
<br />
As for the rest of the program, there were two types of papers/presentations: full papers were presented in 30 min, while short papers had only 15 min. As usual, the quality of papers (or at least the presentations) varied from the outstanding to the less good. One suggestion for future ECIR conferences is to limit all the talks to at most 20 min, encouraging conciseness and pushing the speakers to focus on the "message out of the bottle". Indeed, some talks appeared to be excessively long with respect to their informative content. While we see the value of giving a 30 min slot to a 10-page ACM-style paper, there does not seem to be a valid reason for giving that much time to a (comparatively much shorter) 12-page LNCS-style paper.<br />
<br />
It was interesting to see several Twitter-related papers in the program, suggesting that the community will find the upcoming new <a href="http://sites.google.com/site/trecmicroblogtrack/">TREC 2011 Microblog track</a> and its corresponding shared dataset particularly useful. The theme of crowdsourcing also featured prominently in the conference, with several papers showing how cheap and reliable relevance assessments could be obtained through <a href="https://www.mturk.com/mturk/welcome">Amazon Mechanical Turk</a> or similar services. Finally, we were very pleased to see many presented papers using our open source <a href="http://terrier.org/">Terrier</a> software in their experiments.<br />
<br />
Overall, a few papers caught our attention and were particularly interesting:</div>
<ul style="text-align: justify;">
<li><span style="font-style: italic;">On the contributions of topics to system evaluation</span><br />Steve Robertson</li>
<li><span style="font-style: italic;">Caching for realtime search</span> - in our opinion by far the best paper/presentation of the conference<br />Edward Bortnikov, Ronny Lempel and Kolman Vornovitsky </li>
<li><span style="font-style: italic;">Are semantically related links effective for retrieval?</span><br />Marijn Koolen and Jaap Kamps</li>
<li><span style="font-style: italic;">A methodology for evaluating aggregated search results</span> - Excellent paper/presentation that was also awarded the best student paper award<br />Jaime Arguello, Fernando Diaz, Jamie Callan and Ben Carterette</li>
<li><span style="font-style: italic;">Design and implementation of relevance assessments using crowdsourcing</span><br />Omar Alonso and Ricardo Baeza-Yates</li>
<li><span style="font-style: italic;">The power of peers </span><span style="font-style: italic;"><br /></span>Nick Craswell, Dennis Fetterly and Marc Najork</li>
<li><span style="font-style: italic;">Automatic people tagging for expertise profiling in the enterprise</span><br />Pavel Serdyukov, Mike Taylor, Vishwa Vinay, Matthew Richardson and Ryen W. White</li>
<li><span style="font-style: italic;">What makes re-finding information difficult? A study of email re-finding</span><br />David Elsweiler, Mark Baillie and Ian Ruthven</li>
</ul>
<div style="text-align: justify;">
Of course, we also recommend our own paper, which was nominated for best paper award, and for which we received excellent feedback:</div>
<ul style="text-align: justify;">
<li><a href="http://www.dcs.gla.ac.uk/%7Ecraigm/publications/macdonald11learned.pdf"><span style="font-style: italic;">Learning models for ranking aggregates</span></a><br />Craig Macdonald and Iadh Ounis</li>
</ul>
<div style="text-align: justify;">
The program also featured a busy poster and demo session. We liked the work of <span style="font-style: italic;">Gerani, Keikha, Carman and Crestani</span> concerning identifying personal blogs using the <a href="http://ir.dcs.gla.ac.uk/wiki/TREC-BLOG">TREC Blog track</a>, and that of <span style="font-style: italic;">Perego, Silvestri and Tonellotto,</span> which suggests that document length can be quantized from docids without loss of retrieval effectiveness. There were also several interesting demos that caught our eye:</div>
<ul style="text-align: justify;">
<li><span style="font-style: italic;">ARES - A retrieval engine based on sentiments: Sentiment-based search result annotation and diversification</span> - which used our <a href="http://ir.dcs.gla.ac.uk/terrier/publications/santos10www.pdf">xQuAD framework</a> for diversifying sentiments<span style="font-style: italic;"> </span><br />Gianluca Demartini</li>
<li><span style="font-style: italic;">Conversation Retrieval from Twitter</span><br />Matteo Magnani, Danilo Montesi, Gabriele Nunziante and Luca Rossi</li>
<li><span style="font-style: italic;">Finding Useful Users on Twitter: Twittomender the Followee Recommender</span> - addressed the Who to Follow (WTF?) task on Twitter<br />John Hannon, Kevin McCarthy and Barry Smyth</li>
</ul>
<div style="text-align: justify;">
The ECIR organisers hosted a particularly sumptuous conference banquet at the impressive, unique and beautiful venue of <a href="http://www.villageatlyons.com/">The Village at Lyons Demesne</a> in County Kildare. The journey to the village was a welcome break from the hotel setting of the conference and its technical program.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
On the last day of the conference, and concurrently with the technical research sessions, an <a href="http://www.ecir2011.dcu.ie/program/industry-day/">Industry Day</a> event was under way. However, we only had the chance to go and see the excellent talk by <a href="http://research.yahoo.com/Flavio_Junqueira">Flavio Junqueira</a> on the practical aspects of caching in search engine deployments. There is a comprehensive summary of the whole Industry program in this <a href="http://www.flax.co.uk/blog/2011/04/27/ecir-2011-industry-day-part-1-of-2/">blog post</a>. We believe that scheduling the Industry Day event in parallel with the technical sessions was detrimental to attendance. Next year, the Industry Day will be held after the conference ends.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Finally, we would like to thank the organisers of ECIR 2011 for a very enjoyable conference, and a great stay in Dublin. <a href="http://ecir2012.upf.edu/">ECIR 2012</a> will be held in Barcelona, Spain, between 1st and 5th April 2012. We hope to see you all there.</div>Terrier Team @ Glasgowhttp://www.blogger.com/profile/11678159696002044810noreply@blogger.com0tag:blogger.com,1999:blog-6043705792807544709.post-36931468023480324622010-11-26T12:39:00.026+00:002012-06-26T09:46:24.449+01:00TREC 2010 Roundup<div style="text-align: justify;">
Back from another successful TREC conference on the NIST campus. 2010 is a transition year, with the end of old tracks and the proposition of new ones. Indeed, TREC is moving with the times, looking at new data sources and test collections, as well as new evaluation strategies.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<span style="font-size: 130%; font-style: italic; font-weight: bold;">Outwith the old . . . </span></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
For example, TREC 2010 marks the end of the Relevance Feedback and Blog tracks. While TREC 2010 will be the last year of the Relevance Feedback track, the Blog track, which has been running for the last 5 years, is now morphing into a new Microblog track, investigating real-time and social search tasks in Twitter. A brand new test collection possibly containing 2 months of tweets is planned, with linked web-pages and a partial follower graph. <a href="http://groups.google.com/group/trec-microblog">Join the Microblog track googlegroup</a> to obtain the latest updates and <a href="http://twitter.com/trecmicroblog">follow the Microblog track on Twitter</a>.</div>
<div style="text-align: justify;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVLBK7RPiMcV0F7nwMM53OBkmbTEPwZWLPffT7qEK14wTNqecEAMc1mxWvDVurp0xVaYc36iep-TFSMv9x3pQMmVWS-zUOMGMRlmYbRR4c9SrNUw52yQEelNpPlnAkwoUeKbnBBb6_8-8/s1600/microblogpostercraig.jpg" onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}"><br /></a><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEirK73C6gEgEioZeLj0cJrKQp8D-haZJLXUXJU2pBYoYOc4Gc3iCdQATZVqV9R3ZDXTFeuwewuowRQfN86QBBf1GLWm4A9rmRVE-t-6Hzu3E11uebmFu5WpwVQwbTomKbQqD1apTPSQBcU/s1600/microblogpostercraig.jpg"><img alt="" border="0" id="BLOGGER_PHOTO_ID_5543870665284628850" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEirK73C6gEgEioZeLj0cJrKQp8D-haZJLXUXJU2pBYoYOc4Gc3iCdQATZVqV9R3ZDXTFeuwewuowRQfN86QBBf1GLWm4A9rmRVE-t-6Hzu3E11uebmFu5WpwVQwbTomKbQqD1apTPSQBcU/s320/microblogpostercraig.jpg" style="cursor: pointer; display: block; height: 280px; margin: 0px auto 10px; text-align: center; width: 320px;" /></a></div>
<div style="text-align: justify;">
TREC 2011 will also witness the initiation of the new Medical Records track, dedicated to investigating approaches to access free-text fields of electronic medical records.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
On the test collection front, the Web track is also planning a new large-scale dataset to replace ClueWeb09. Indications are that this new dataset will be about the same scale as ClueWeb09 but might provide more temporal information (multiple versions of a page or site over time). Moreover, we have suggested that this might be the heart of a larger dataset comprising multiple parallel/aligned corpora, for example blogs and news feeds covering the same timeframe.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<span style="font-size: 130%; font-style: italic; font-weight: bold;">TREC Assessors, Relevant?</span></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
In terms of evaluation, 2010 marks the first year where evaluation judgments were crowdsourced using an online worker marketplace, as opposed to relying on TREC assessors, the participants themselves, or a select group of experts. Indeed, both the Blog track and the Relevance Feedback track crowdsourced some of their evaluation (although the Relevance Feedback track suffered many setbacks and its crowdsourcing process is still incomplete). Furthermore, to investigate the challenges in this new field of crowdsourcing, a specific Crowdsourcing track has been created and will run in 2011. More details can be found <a href="http://groups.google.com/group/trec-crowd">here</a>.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<span style="font-size: 130%; font-style: italic;"> <span style="font-weight: bold;">Themes</span></span></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
As usual, themes emerged within the various tracks. Learned approaches were far more prevalent this year, now that training data was available for the ClueWeb09 dataset. Indeed, the Web track was dominated by trained models mostly based on link and proximity search features. Diversification, on the other hand, remains a challenging task, with the top groups leaving their initial rankings as is. An outstanding exception is our own approach using the <a href="http://ir.dcs.gla.ac.uk/terrier/publications/santos10www.pdf">xQuAD framework</a> under a <a href="http://ir.dcs.gla.ac.uk/terrier/publications/santos2010cikm.pdf">selective diversification</a> regime, which further improves our strongly performing adhoc baseline. <a href="http://www.dcs.gla.ac.uk/%7Ecraigm">Craig Macdonald</a> presented our work in the Web track plenary session.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
In the Blog track, voting model-based and language modeling approaches proved popular for blog distillation. For faceted blog ranking, participants employed variants of facet dictionaries to either train a classifier or as features for learning. For the top news task, participants deployed a wide variety of methods to rank news stories in a <span style="font-style: italic;">real-time</span> setting, from probabilistic modeling to <a href="http://terrierteam.dcs.gla.ac.uk/publications/richard10riao_168.pdf">blog post voting with historical evidence.</a> Richard McCreadie presented our work on the Blog track as a poster during TREC 2010, which attracted very interesting discussions.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
During the TREC conference, <a href="http://www.dcs.gla.ac.uk/%7Eounis">Iadh Ounis</a>, <a href="http://www.dcs.gla.ac.uk/%7Erichardm">Richard McCreadie</a> and others did a fair amount of tweeting. You can follow some bits of the TREC conference through the <a href="http://twitter.com/#search?q=%23trec2010">#trec2010</a> hashtag.</div>Dr. Richard McCreadiehttp://www.blogger.com/profile/11063287777854855902noreply@blogger.com0tag:blogger.com,1999:blog-6043705792807544709.post-10623797298122882172010-11-03T18:03:00.006+00:002012-06-26T11:26:27.121+01:00CIKM 2010 in Toronto, ON, Canada<div style="margin-bottom: 0cm; text-align: justify;">
I'm back from Toronto, where a few of us attended the CIKM 2010 conference last week. On Friday, I presented our paper on <a href="http://ir.dcs.gla.ac.uk/terrier/publications/santos2010cikm.pdf"><span style="font-style: italic;">"Selectively diversifying Web search results"</span></a>, joint work with <a href="http://www.dcs.gla.ac.uk/%7Ecraigm/">Craig Macdonald</a> and <a href="http://www.dcs.gla.ac.uk/%7Eounis/">Iadh Ounis</a>. This work extends our successful participation in the diversity task of the <a href="http://trec.nist.gov/pubs/trec18/papers/uglasgow.BLOG.ENT.MQ.RF.WEB.pdf">TREC 2009 Web track</a>, by investigating the need for search result diversification in the first place. In particular, we proposed a novel supervised learning approach to predict not only whether promoting diversity is beneficial, but also how much diversification should be applied to attain effective retrieval performance on a per-query basis. After thorough, large-scale experiments with over 900 query features, we found that our selective approach can substantially improve existing diversification approaches, including our <a href="http://ir.dcs.gla.ac.uk/terrier/publications/santos10www.pdf">state-of-the-art xQuAD framework</a>. Nonetheless, we believe the significance of our contribution goes beyond these successful results. Indeed, it was with great pleasure that we heard from the NTCIR organisers that NTCIR-9 will run an <a href="http://www.thuir.org/intent/ntcir9/">Intent task</a>, aimed, among other things, at selectively diversifying search results, an area where we are proud to be pioneers.</div>
<div style="margin-bottom: 0cm; text-align: justify;">
Besides our own paper, a few other papers caught my attention:</div>
<ul style="text-align: justify;">
<li><span style="font-style: italic;">Web Search Solved? All Result Rankings the Same?</span> by Hugo Zaragoza, B. Barla Cambazoglu and Ricardo Baeza-Yates</li>
<li><span style="font-style: italic;">Reverted Indexing for Feedback and Expansion</span>, by Jeremy Pickens, Matthew Cooper and Gene Golovchinsky</li>
<li><span style="font-style: italic;">Rank Learning for Factoid Question Answering with Linguistic and Semantic Constraints</span>, by Matthew Bilotti, Jonathan Elsas, Jaime Carbonell and Eric Nyberg</li>
<li><span style="font-style: italic;">Organizing Query Completions for Web Search</span>, by Alpa Jain and Gilad Mishne</li>
<li><span style="font-style: italic;">Clickthrough-Based Translation Models for Web Search: from Word Models to Phrase Models</span>, by Jianfeng Gao, Xiaodong He and Jian-Yun Nie</li>
</ul>
<div style="margin-bottom: 0cm; text-align: justify;">
The conference also featured great keynotes, of which those by Jamie Callan and Susan Dumais deserve a particular mention. Jamie talked about his view for the future of search, in which search engines capable of fully leveraging the structure of queries and documents would enable more sophisticated applications built on top of them. Susan addressed the temporal evolution of Web content, how it impacts the way users access this content, and how test collections should account for it. For more details, have a look at the excellent posts by Gene Golovchinsky on <a href="http://palblog.fxpal.com/?p=4866">Jamie</a> and <a href="http://palblog.fxpal.com/?p=4873">Susan</a>'s talks.</div>
<div style="margin-bottom: 0cm; text-align: justify;">
Last but not least, many of us were involved in promoting the next edition of CIKM, to be held here in Glasgow. There was a lot of excitement from the several people that visited our booth, and also during the hand-over talk at the end of the conference. Well done Jon, Mary, Craig, and Iadh for the hard work! The arrangements for <a href="http://www.cikm2011.org/">CIKM 2011</a> are well advanced, and the <a href="http://www.cikm2011.org/callforpapers">call for papers</a> is now online. You can also follow the latest news about CIKM 2011 on <a href="http://twitter.com/CIKM2011">Twitter</a>, <a href="http://www.facebook.com/group.php?gid=171830502274">Facebook</a>, <a href="http://events.linkedin.com/CIKM-2011-20th-ACM-Conference/pub/162795">LinkedIn</a>, and <a href="http://lanyrd.com/2011/cikm/">Lanyrd</a>. We look forward to welcoming you all to Glasgow next year! </div>Rodrygo L.T. Santoshttp://www.blogger.com/profile/09502952528669992135noreply@blogger.com0tag:blogger.com,1999:blog-6043705792807544709.post-32681804826415308282010-07-20T08:49:00.006+01:002012-06-26T11:26:41.889+01:00Terrier Team at SIGIR 2010 in Geneva<div style="text-align: justify;">
<a href="http://sigir2010.org/doku.php">SIGIR 2010</a> has just started in Geneva. From the <a href="http://terrierteam.dcs.gla.ac.uk/">TerrierTeam</a>, <a href="http://twitter.com/richardm_">Richard</a> and <a href="http://twitter.com/craig_macdonald">I</a> are attending.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
On Monday, Richard presented his PhD topic, <a href="http://portal.acm.org/citation.cfm?id=1835449.1835692">Leveraging User-generated Content for News Search</a> at the doctoral consortium.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Later, at the <a href="http://research.microsoft.com/en-us/events/webngram/">Web Ngram workshop</a>, I'll be presenting a paper on <a href="http://www.dcs.gla.ac.uk/%7Ecraigm/publications/macdonald10_prox.pdf">Global Statistics in Proximity Weighting Models</a>. </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
About the same time, Richard will be presenting at the <a href="http://www.ischool.utexas.edu/%7Ecse2010/">Crowdsourcing for Search Evaluation</a> workshop. His paper on <a href="http://www.dcs.gla.ac.uk/%7Erichardm/papers/CrowdsourcingNQC.pdf">Crowdsourcing a News Query Classification Dataset</a> examines the effectiveness of different interfaces for having Mechanical Turkers classify queries as news-related or not.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Last but not least, and continuing on our proximity theme, <a href="http://hpc.isti.cnr.it/%7Ekhast">Nicola Tonellotto</a> from CNR is presenting our joint work titled <a href="http://www.dcs.gla.ac.uk/%7Ecraigm/publications/tonellotto_lsds2010.pdf">Efficient Dynamic Pruning with Proximity Support</a> at the <a href="http://www.lsdsir.org/">Large Scale & Distributed Systems</a> workshop.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Meanwhile, please say hello if you see us at the conference, or stay up to date by following <a href="http://twitter.com/#search?q=%23sigir2010">#sigir2010</a>. And remember, if you are near the registration desk, please pick up flyers for <a href="http://terrier.org/">Terrier</a> and <a href="http://cikm2011.org/">CIKM 2011</a>.</div>Craig Macdonaldhttp://www.blogger.com/profile/13764972230026912718noreply@blogger.com0tag:blogger.com,1999:blog-6043705792807544709.post-52681537405998496462010-07-19T13:04:00.017+01:002012-06-26T11:26:58.895+01:00Top Authors in Information Retrieval<div style="text-align: justify;">
Thanks to <span class="fn"><a href="http://twitter.com/SSN">Sérgio Nunes</a> who alerted us to this ranking by </span><a href="http://academic.research.microsoft.com/">Microsoft Academic Search</a> of the <a href="http://academic.research.microsoft.com/CSDirectory/author_category_8_last5.htm">Top Authors in Information Retrieval</a>, in the past 5 years.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
According to this recent <a href="http://academic.research.microsoft.com/CSDirectory/author_category_8_last5.htm">ranking</a>, two members of the <a href="http://terrierteam.dcs.gla.ac.uk/">TerrierTeam,</a> namely <a href="http://www.dcs.gla.ac.uk/%7Eounis">Iadh Ounis</a> and <a href="http://www.dcs.gla.ac.uk/%7Ecraigm">Craig Macdonald</a>, are in the top 5 authors in Information Retrieval in the past 5 years (positions #1 and #4, respectively). The ranking is based on <a href="http://academic.research.microsoft.com/About/Help.htm#Ranking">in-domain citations</a>.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
This good news comes just at the start of the <a href="http://www.sigir2010.org/doku.php">SIGIR 2010</a> Conference, which will be held in Geneva, Switzerland this week (19-23 July 2010). Several members of the team will be in attendance.</div>Iadh Ounishttp://www.blogger.com/profile/05740425172350940695noreply@blogger.com2tag:blogger.com,1999:blog-6043705792807544709.post-86504406719780669902010-05-04T01:48:00.014+01:002012-06-26T11:29:05.509+01:00WWW 2010 in Raleigh, NC, USA<div style="text-align: justify;">
<span lang="EN-GB">I am back from the sunny <a href="http://www.visitraleigh.com/">Raleigh, NC, USA</a>. Besides the nice weather, I had a great time last week attending the <a href="http://www2010.org/">19th International World Wide Web Conference (WWW 2010)</a>, where I presented our paper on <i><a href="http://ir.dcs.gla.ac.uk/terrier/publications/santos10www.pdf">Exploiting query reformulations for Web search result diversification</a></i>, a joint work with <a href="http://www.dcs.gla.ac.uk/%7Ecraigm">Craig Macdonald</a> and <a href="http://www.dcs.gla.ac.uk/%7Eounis">Iadh Ounis</a>. The paper introduces a probabilistic formulation of our xQuAD framework for search result diversification, and analyses the effectiveness of query reformulations provided by three commercial search engines for the diversification task. My talk was very well received, with lots of questions from the audience, and subsequent chatting with many people from both academia and industry.<o:p></o:p></span><span lang="EN-GB"><o:p></o:p></span><span lang="EN-GB"><br /></span><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgnlYjGYsMfJuQVnP2Rd3VFCSpLBlE7EHkQDZwD-dSBacOaZ5VTV8eIhsFAHm2zvUjuMKOWwBHhtA98X_mJZ9O5CrOOAyhKUbB6GlTkZDY9cTjPMqWuW_JcOxW7j_1RO9rP08Fc5GzGMXpP/s1600/DSC01010.JPG" onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}"><img alt="" border="0" id="BLOGGER_PHOTO_ID_5467219456072040818" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgnlYjGYsMfJuQVnP2Rd3VFCSpLBlE7EHkQDZwD-dSBacOaZ5VTV8eIhsFAHm2zvUjuMKOWwBHhtA98X_mJZ9O5CrOOAyhKUbB6GlTkZDY9cTjPMqWuW_JcOxW7j_1RO9rP08Fc5GzGMXpP/s400/DSC01010.JPG" style="cursor: pointer; display: block; height: 225px; margin: 0px auto 10px; text-align: center; width: 400px;" /></a><span lang="EN-GB"><br />The blend academia-industry was indeed a signature of WWW. 
I was also impressed with the multidisciplinary nature of the conference: with up to five parallel sessions, there was always something for everyone! In particular, from the sessions I attended, a few papers caught my attention:</span><i><span lang="EN-GB"><br /></span></i></div>
<ul style="text-align: justify;">
<li><i><span lang="EN-GB">Clustering query refinements by user intent</span></i><span lang="EN-GB">, by Eldar Sadikov et al. (Stanford University and Google)<o:p></o:p></span></li>
<li><i><span lang="EN-GB">Optimal rare query suggestion with implicit user feedback</span></i><span lang="EN-GB">, by Yang Song and Li-wei He (Microsoft Research)<o:p></o:p></span></li>
<li><i><span lang="EN-GB">Building taxonomy of Web search intents for name entity queries</span></i><span lang="EN-GB">, by Xiaoxin Yin and Sarthak Shah (Microsoft Research)<o:p></o:p></span></li>
<li><i><span lang="EN-GB">Exploring Web scale language models for search query processing</span></i><span lang="EN-GB">, by Jian Huang et al. (Microsoft Research Asia, Facebook, and Penn State University)<o:p></o:p></span></li>
<li><i><span lang="EN-GB">Classification-enhanced ranking</span></i><span lang="EN-GB">, by Paul N. Bennett et al. (Microsoft Research)<o:p></o:p></span></li>
<li><i><span lang="EN-US">Ranking specialization for Web search: A divide-and-conquer approach by using topical RankSVM</span></i><span lang="EN-US">, by Jiang Bian et al. (Georgia Tech and Yahoo! Labs)<o:p></o:p></span></li>
<li><i><span lang="EN-GB">Generalized distances between rankings</span></i><span lang="EN-GB">, by Ravi Kumar and Sergei Vassilvitskii (Yahoo! Research)<o:p></o:p></span></li>
<li><i><span lang="EN-GB">Relational duality: Unsupervised extraction of semantic relations between entities on the Web</span></i><span lang="EN-GB">, by Danushka T. Bollegala et al. (University of Tokyo)<o:p></o:p></span><span lang="EN-GB"><o:p></o:p></span></li>
</ul>
<div style="text-align: justify;">
<span lang="EN-GB"><o:p></o:p>The conference also featured three passionate keynotes:</span><span lang="EN-GB"><o:p><br /></o:p></span></div>
<ul style="text-align: justify;">
<li><span lang="EN-GB"><a href="http://www.google.com/corporate/execs.html#vint">Vint Cerf</a> discussed a broad range of topics of interest on today's Web, where <a href="http://analytics.ncsu.edu/reports/www/www2010-cerf.pdf">everything is connected</a>: 1.8 billion users, around a billion Web-enabled mobile devices, and still considerable room for growth in developing countries. The points he touched on included the implications of the explosion of data production for mobility, accessibility, security and privacy, intellectual property, and digital preservation, as well as new technologies (e.g., cloud computing).</span></li>
<li><span lang="EN-GB"><a href="http://www.danah.org/">danah boyd</a> discussed <a href="http://www.danah.org/papers/talks/2010/WWW2010.html">privacy implications of the availability of "big data"</a>. Her keynote revolved around common misconceptions associated with the analysis of data produced by online social activities, as well as ethical concerns related to using this data in the first place, "just because it is accessible".</span></li>
<li><span lang="EN-GB"><o:p></o:p>Carl Malamud</span><span lang="EN-GB"> from <a href="http://public.resource.org/">public.resource.org</a> described his experiences trying to convince seven bureaucratic institutions to make public data publicly accessible. His keynote was organised around <a href="http://www.elon.edu/e-web/predictions/futureweb2010/carl_malamud_www_keynote.xhtml">"10 rules for radicals"</a>, a guide on how to break the barriers towards negotiating with bureaucrats.<o:p></o:p></span><span lang="EN-GB"><o:p></o:p></span></li>
</ul>
<div style="text-align: justify;">
<span lang="EN-GB"><o:p></o:p>On Thursday night, the conference banquet featured an exciting performance by the North Carolina string band <a href="http://www.carolinachocolatedrops.com/">Carolina Chocolate Drops</a>. Check out <i><a href="http://www.youtube.com/watch?v=_Sk3mNm2Mfs">Snowden's Jig (Genuine Negro Jig)</a></i> and <i><a href="http://www.youtube.com/watch?v=EKzbVi9hOjU">Don't get trouble in your mind</a></i> for a taste.</span><span lang="EN-GB"><o:p><br /></o:p></span><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjb1LPjRQvGSx3ocWCZtsspgccRsYcYdnE77yZIXXH1V9pjL8k0W77jFwaEsVYbEoknj1AmAZBX8sz3yv1x7R3AlraHlC-eQSIgNIyM1eepIEvWA5A5oaFzaIlb_iObYLpY4BwDY83s8WZW/s1600/DSC01018.JPG" onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}"><img alt="" border="0" id="BLOGGER_PHOTO_ID_5467220992104659634" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjb1LPjRQvGSx3ocWCZtsspgccRsYcYdnE77yZIXXH1V9pjL8k0W77jFwaEsVYbEoknj1AmAZBX8sz3yv1x7R3AlraHlC-eQSIgNIyM1eepIEvWA5A5oaFzaIlb_iObYLpY4BwDY83s8WZW/s400/DSC01018.JPG" style="cursor: pointer; display: block; height: 225px; margin: 0px auto 10px; text-align: center; width: 400px;" /></a></div>
<div style="text-align: justify;">
<span lang="EN-GB"><o:p></o:p>Friday held the closing ceremony, with the announcement of the award winners.</span><span lang="EN-GB"><o:p><br /></o:p>Best Paper:<o:p></o:p></span></div>
<div class="MsoNoSpacing" style="text-align: justify;">
</div>
<ul style="text-align: justify;">
<li><i><span lang="EN-GB">Factorizing personalized Markov chains for next-basket recommendation</span></i><span lang="EN-GB">, by Steffen Rendle, Christoph Freudenthaler, and Lars Schmidt-Thieme (Osaka University and University of Hildesheim)</span><span lang="EN-GB"></span></li>
</ul>
<div style="text-align: justify;">
<span lang="EN-GB">Best Student Paper:</span> </div>
<ul style="text-align: justify;">
<li><i><span lang="EN-GB">Privacy wizards for social networking sites</span></i><span lang="EN-GB">, by Lujun Fang and Kristen LeFevre (University of Michigan)<o:p></o:p></span><span lang="EN-GB"></span></li>
</ul>
<div style="text-align: justify;">
<span lang="EN-GB">Best Posters:</span> </div>
<ul style="text-align: justify;">
<li><i><span lang="EN-GB">How much is your personal recommendation worth</span></i><span lang="EN-GB">, by Paul Dütting, Monika Henzinger and Ingmar Weber (EPFL Lausanne, University of Vienna, and Yahoo! Research)</span></li>
<li><i><span lang="EN-GB">SourceRank: Relevance and trust assessment for deep Web sources based on inter-source agreement</span></i><span lang="EN-GB">, by Raju Balakrishnan and Subbarao Kambhampati (Arizona State University)</span><span lang="EN-US" style="font-family: NimbusSanL-Regu; font-size: 10pt;"></span><span lang="EN-GB"><o:p></o:p></span><span lang="EN-GB"><o:p></o:p></span></li>
</ul>
<div style="text-align: justify;">
<span lang="EN-GB"><o:p></o:p>The closing ceremony also featured a short presentation of <a href="http://www2011.org/">WWW 2011</a>, to be held in Hyderabad, India. <a href="http://www2012.org/">WWW 2012</a> will take place in Lyon, France.<br /><br />Finally, on Saturday, the <a href="http://www.iw3c2.org/">IW3C2</a> announced the Brazilian bid as the winner to host <a href="http://www2013.org/">WWW 2013</a>, which I was very glad to hear about!</span></div>
<span lang="EN-GB"><o:p></o:p></span>Rodrygo L.T. Santoshttp://www.blogger.com/profile/09502952528669992135noreply@blogger.com0tag:blogger.com,1999:blog-6043705792807544709.post-80366149069081569322010-04-28T13:47:00.016+01:002012-06-26T11:27:18.721+01:00RIAO 2010 in Paris, France.<div style="text-align: justify;">
The 9th International <a href="http://www.riao2010.org/">RIAO</a> Conference has started in Paris, France (28-30 April, 2010). It is unfortunate that it is being held concurrently with <a href="http://www2010.org/www/">WWW 2010</a> in Raleigh.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
The first RIAO conference was held in Grenoble in 1985. RIAO is currently a triennial conference, addressing Information Retrieval research topics of interest to both Academia and Industry. This year, the conference focuses on Adaptivity, Personalization and Fusion of Heterogeneous Information.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
The following papers caught my eye while browsing the <a href="http://www.riao2010.org/?action=programme.nouveau&lang=en">RIAO 2010 program</a>:</div>
<ul style="text-align: justify;">
<li><span style="font-style: italic;">Boiling down information retrieval test collections</span>. T. Sakai et al. (Microsoft Research Asia, CMU)</li>
<li><span style="font-style: italic;">Improving tag recommendation using social networks</span>. A. Rae et al. (The Open University, Yahoo! Research Barcelona).</li>
<li><span style="font-style: italic;">Analysis of robustness in trust-based recommender systems</span>. Z. Cheng and N. Hurley (UCD)</li>
<li><span style="font-style: italic;">Opinion-finding in blogs: A passage-based language modelling approach</span>. M. Saad Missen et al. (IRIT)</li>
<li><span style="font-style: italic;">Predicting query performance using query, result, and user interaction features</span>. Q. Guo et al. (Emory University/Microsoft Research)</li>
<li><span style="font-style: italic;">Towards a collection-based results diversification</span>. J.A. Akinyemi et al. (University of Waterloo)</li>
</ul>
<div style="text-align: justify;">
In addition, the <a href="http://terrierteam.dcs.gla.ac.uk/">TerrierTeam</a> has two full papers, which are being presented today at the conference (hopefully, the slides will follow shortly):</div>
<ul style="text-align: justify;">
<li><a href="http://terrierteam.dcs.gla.ac.uk/publications/santos2010riao.pdf">Voting for Related Entities</a> by R.L.T. Santos, C. Macdonald and I. Ounis. The paper addresses the problem of entity search, where the goal is to rank not documents, but entities in response to a given query. The paper proposes to tackle this problem as a <a href="http://www.dcs.gla.ac.uk/%7Ecraigm/thesis.shtml">voting process</a>, by considering the occurrence of an entity among the top ranked documents for a given query as a vote for the existence of a relationship between this entity and the entity in the query. The approach led to high precision and unparalleled recall compared to TREC 2009 systems.</li>
<li><a href="http://terrierteam.dcs.gla.ac.uk/publications/richard10riao_168.pdf">News Article Ranking: Leveraging the Wisdom of Bloggers</a> by R. McCreadie, C. Macdonald and I. Ounis. The paper investigates how news article ranking can be performed automatically, so as to assist editors in selecting the articles that should make the front page of their newspaper. In particular, the paper investigates the blogosphere as a prime source of evidence, on the intuition that bloggers, and by extension their blog posts, can indicate interest in one news article or another. The paper proposes to model the automatic news article ranking task as a <a href="http://www.dcs.gla.ac.uk/%7Ecraigm/thesis.shtml">voting process</a>, where each relevant blog post acts as a vote for one or more news articles. The approach led to the best TREC 2009 retrieval performance in the <a href="http://ir.dcs.gla.ac.uk/wiki/TREC-BLOG">Blog track</a>.</li>
</ul>
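<div style="text-align: justify;">
The voting idea shared by both papers can be illustrated with a minimal sketch. This is not the implementation from either paper: the function name <code>vote_for_entities</code>, the <code>top_k</code> cut-off and the toy data are all assumptions made for illustration, and each top-ranked document simply casts one unweighted vote for every entity it mentions.</div>

```python
from collections import defaultdict


def vote_for_entities(ranked_docs, doc_entities, top_k=3):
    """Rank entities by how many of the top_k ranked documents mention them.

    ranked_docs: document ids, best first (hypothetical input format).
    doc_entities: mapping from document id to the set of entities it mentions.
    """
    votes = defaultdict(int)
    for doc_id in ranked_docs[:top_k]:
        for entity in doc_entities.get(doc_id, ()):
            votes[entity] += 1  # one vote per (document, entity) pair
    # Highest vote count first; ties are broken alphabetically.
    return sorted(votes.items(), key=lambda item: (-item[1], item[0]))


# Toy data, invented for this example.
ranked_docs = ["d1", "d2", "d3", "d4"]
doc_entities = {
    "d1": {"Acme Corp", "Jane Doe"},
    "d2": {"Jane Doe"},
    "d3": {"Jane Doe", "Initech"},
    "d4": {"Initech"},
}
ranking = vote_for_entities(ranked_docs, doc_entities, top_k=3)
```

<div style="text-align: justify;">
In this sketch, only the three highest-ranked documents vote, so the mention in <code>d4</code> is ignored. Weighting each vote by the document's retrieval score or rank would be a natural variation.</div>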
<div style="text-align: justify;">
<a href="http://www.dcs.gla.ac.uk/%7Ecraigm">Craig Macdonald</a> is tweeting the conference, pending an appropriate wireless signal. You can follow some bits of the RIAO conference through the <a href="http://twitter.com/#search?q=%23riao2010">#riao2010</a> hashtag.</div>Iadh Ounishttp://www.blogger.com/profile/05740425172350940695noreply@blogger.com0tag:blogger.com,1999:blog-6043705792807544709.post-54962074361371577292010-04-07T13:04:00.043+01:002012-06-26T11:30:14.811+01:00ECIR 2010 in Milton Keynes: A Report<div style="text-align: justify;">
Last week, five of us attended the <a href="http://kmi.open.ac.uk/events/ecir2010/">ECIR 2010</a> conference in <a href="http://en.wikipedia.org/wiki/Milton_Keynes">Milton Keynes</a>. The conference was fairly well-organised, although it markedly lacked the lustre of the <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh2_m49tp3mkuemLbSKeQcIKbjc-_rn8zi71Cbkb0qr0lCiNHA_PoXSFD0nU9UhYOptqEcaAmbxSJbozOxnbzyuF3Wt5i2axk4NGvg1fAB9bJj93WI2zDEKXiR1DP93P_BAodAbDPm2z8wN/s1600/DSC05128.JPG" onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}"><img alt="" border="0" id="BLOGGER_PHOTO_ID_5457379833723312514" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh2_m49tp3mkuemLbSKeQcIKbjc-_rn8zi71Cbkb0qr0lCiNHA_PoXSFD0nU9UhYOptqEcaAmbxSJbozOxnbzyuF3Wt5i2axk4NGvg1fAB9bJj93WI2zDEKXiR1DP93P_BAodAbDPm2z8wN/s320/DSC05128.JPG" style="cursor: pointer; float: right; height: 211px; margin: 0pt 0pt 10px 10px; width: 281px;" /></a>previous three editions of the conference. In terms of attendance, only about 170 delegates <span style="font-style: italic;">registered</span>, far fewer than at Glasgow 2008 (210+) and Toulouse 2009 (180+). Perhaps the exotic town of Milton Keynes was not deemed a very attractive venue for a conference. In fact, apart from attending the conference, there was not much else to do; for example, the nearest proper pub was about 2 miles from the conference venue.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
The ECIR 2010 conference suffered from a previously unseen problem: several authors and presenters did not make it to the conference, preferring to give their presentation by proxy or using a pre-recorded talk. No fewer than five no-shows were recorded during the conference. Even the keynote speaker and winner of the first <a href="http://irsg.bcs.org/ksjaward.php">BCS IRSG Karen Sparck Jones award</a>, <a href="http://homepages.inf.ed.ac.uk/mlap/index.html">Mirella Lapata</a>, did not show up and gave her presentation through a pre-recorded video. While Lapata certainly had a valid reason (as probably did the other speakers) not to show up, it is clear that ECIR should deal with this problem concretely, e.g., by making it compulsory that at least one author of each accepted paper be present during the conference.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
In addition, the organisers decided not to have parallel sessions (because of lack of facilities?) during ECIR 2010. Therefore, several full papers were turned into poster presentations, which were held during the short lunch period. This was a very bad move: because of the setting, these papers received much less attention and credit, even compared to the actual posters, whose session was rather successful. Some delegates argued that some of the full-papers-turned-posters should have been given a full presentation slot, in lieu of those full papers with a no-show author.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Other than the problems mentioned above, the conference program was generally of very good quality. On the first day, we enjoyed an excellent tutorial by two MSR researchers on <a href="http://research.microsoft.com/en-us/events/ecir-2010-mlir-tutorial/">Machine Learning for IR</a>. The tutorial was given by Paul Bennett and Kevyn Collins-Thompson. We also enjoyed an equally excellent tutorial on <a href="http://en.wikipedia.org/wiki/Crowdsourcing">Crowdsourcing</a> by Omar Alonso from Bing.</div>
<br />
<div style="text-align: justify;">
Over the following days, there were also several good papers that are worth reading:</div>
<ul style="text-align: justify;">
<li>A language modeling approach for temporal information needs (from Max-Planck)</li>
<li>The role of query sessions in extracting instance attributes from web search queries (from Google)</li>
<li>Interpreting user inactivity on search results (from Univ. of Washington, Univ. of Patras)</li>
<li>Learning to distribute queries onto Web search nodes (from Yahoo!)</li>
<li>Temporal shingling for version identification in Web archives (from Max-Planck)</li>
<li>Evaluation and user preference study on spatial diversity (from University of Sheffield)</li>
</ul>
<div style="text-align: justify;">
The best paper award was <span style="font-style: italic;">jointly</span> awarded to:</div>
<ul style="text-align: justify;">
<li>Promoting ranking diversity for biomedical information retrieval using Wikipedia. Jimmy Huang and Xiaoshi Yin (York University)</li>
<li>Evaluation of an adaptive search suggestion system. Sascha Kriewel and Norbert Fuhr (University of Duisburg-Essen, Germany)</li>
</ul>
<div style="text-align: justify;">
We also had the chance to present our two full papers on search result diversification and learning to select:</div>
<ul style="text-align: justify;">
<li><a href="http://terrierteam.dcs.gla.ac.uk/publications/ecir2010_rodrygo_div.pdf">Explicit search result diversification through sub-queries</a> by Rodrygo L. T. Santos, Jie Peng, Craig Macdonald, and Iadh Ounis. Rodrygo presented our xQuAD search results diversification framework, and the talk was very well received by the delegates, leading to several questions, and many comments that this was arguably the best presentation of the conference.</li>
<li><a href="http://terrierteam.dcs.gla.ac.uk/publications/ecir2010_pj_selective.pdf">Learning to select a ranking function</a> by Jie Peng, Craig Macdonald and Iadh Ounis. This was one of the full-paper-turned-poster presentations. Jie presented the poster, which attracted a lot of attention and led to some very interesting discussions.</li>
</ul>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Finally, during the posters/demos session, two good contributions particularly caught our attention:</div>
<ul style="text-align: justify;">
<li>An Empirical Study of Query Specificity (Poster) - Avi Arampatzis and Jaap Kamps</li>
<li>NEAT: News Exploration Along Time (Demo) - Omar Alonso, Klaus Berberich, Srikanta Bedathur and Gerhard Weikum</li>
</ul>
<div style="text-align: justify;">
The conference also had an Industry Day, which we missed. You can see a report on the Industry Day in the following <a href="http://blog.twigkit.com/ecir-industry-day-2010/">blog post</a>. During the conference, a few of us actively tweeted about the sessions. You can look at the <a href="http://twapperkeeper.com/hashtag/ecir2010">archived ecir2010 hashtag</a> for more details.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
One of the most exciting moments of the conference was our visit to <a href="http://en.wikipedia.org/wiki/Bletchley_Park">Bletchley Park</a> as part of the ECIR 2010 social dinner. This was an excellent venue with a lot of history, and the food was also good! During the dinner, we were given an impossible quiz to answer. Despite the wine, and a long day, some delegates did manage to find the <a href="http://kmi.open.ac.uk/events/ecir2010/ECIR-quiz-answers.pdf">answers</a>.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Usually, when ECIR is held in the UK, the last day of the conference is the venue for the Annual General Meeting of the <a href="http://irsg.bcs.org/">BCS IRSG</a>, the umbrella group for ECIR. However, in 2010, there was no AGM. We can only suppose that this was because the 2009 AGM was only held in October, co-located with Search Solutions 2009 at BCS HQ. We say <span style="font-style: italic;">suppose</span>, because at the time of writing, the 2009 AGM minutes are not yet available!</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Finally, we would like to thank the organisers for their hard work during the conference, for the idea of the <span class="status-body" id="ptLastEntry" title="processed"><span class="status-content"><span class="entry-content">ball-bouncer game during the session breaks, which was really cool/fun </span></span></span>and for an overall reasonably organised conference. We look forward to <a href="http://ecir2011.dcu.ie/">ECIR 2011</a> in Dublin!</div>Iadh Ounishttp://www.blogger.com/profile/05740425172350940695noreply@blogger.com0tag:blogger.com,1999:blog-6043705792807544709.post-44891512030248450652010-03-10T18:56:00.004+00:002010-03-11T10:54:50.867+00:00Terrier 3.0 released<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://www.dcs.gla.ac.uk/%7Erichardm/images/Terrier3logo.jpg"><img style="float: right; margin: 0pt 0pt 10px 10px; cursor: pointer; width: 258px; height: 181px;" src="http://www.dcs.gla.ac.uk/%7Erichardm/images/Terrier3logo.jpg" alt="" border="0" /></a>Firstly, we have a new website for Terrier: <a href="http://terrier.org/">http://terrier.org</a><br /><br />Also, we have just released Terrier 3.0!<br /><span style="font-family:monospace;"></span><br />This is a major update to Terrier, including:<br /><ul><li>support for indexing WARC collections (such as ClueWeb09)</li><li>improved MapReduce mode indexing</li><li>improved and more scalable index structures</li><li>added field-based and proximity term dependence models, such as BM25F, PL2F and Markov Random Fields</li><li>new Web-based retrieval interface</li></ul>Fuller changelog at <a href="http://terrier.org/docs/current/whats_new.html">http://terrier.org/docs/current/whats_new.html</a><br /><br />If you're looking for our team publications, etc., please see our new team website: <a href="http://terrierteam.dcs.gla.ac.uk/">http://terrierteam.dcs.gla.ac.uk/</a><br /><br />Thanks are due to everyone in the Terrier Team for their hard work to make this release, as well 
as the contributions and feedback about Terrier from our users and collaborators.Craig Macdonaldhttp://www.blogger.com/profile/13764972230026912718noreply@blogger.com2tag:blogger.com,1999:blog-6043705792807544709.post-80764892622311095232010-02-23T10:31:00.021+00:002012-06-26T11:30:42.969+01:00TREC Blog Track 2010<div style="text-align: justify;">
<span style="font-family: arial;">The </span><a href="http://trec.nist.gov/" style="font-family: arial;">TREC</a><span style="font-family: arial;"> Blog track will be continuing in 2010. In 2009, the Blog track was markedly revamped, addressing more refined blog search scenarios using the new </span><a href="http://ir.dcs.gla.ac.uk/test_collections/blogs08info.html" style="font-family: arial;">Blogs08</a><span style="font-family: arial;"> collection, a large sample of the blogosphere covering the period from 14th January 2008 to 10th February 2009.</span></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<span style="font-family: arial;">A summary of the TREC Blog track 2009 edition was presented by </span><a href="http://www.dcs.gla.ac.uk/%7Eounis" style="font-family: arial;">Iadh Ounis</a><span style="font-family: arial;"> at the main TREC conference (</span><a href="http://ir.dcs.gla.ac.uk/terrier/TREC2009Blog-overview.pdf" style="font-family: arial;">Slides</a><span style="font-family: arial;">). The Blog track 2009 overview paper will be available on the TREC website shortly, once it is updated and reviewed.</span></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<span style="font-family: arial;">The details of the TREC 2010 Blog track are still being finalised by the organisers. However, following the discussions at the TREC 2009 Blog track workshop, here are some salient details (see also the TREC 2009 </span><a href="http://ir.dcs.gla.ac.uk/terrier/Blog-track-2009-Wrap-up.pdf" style="font-family: arial;">Wrap-up Slides</a><span style="font-family: arial;">):</span></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<span style="font-family: arial;">1. Faceted blog search task will run again in 2010: The task addresses the quality aspect of the retrieved blogs. It is a feed search task.</span></div>
<ul style="text-align: justify;">
<li style="font-family: arial;">We will adopt a two-stage submission procedure: (1) a participating group submits "topically-relevant" blogs for each query; (2) a few standard baselines will be distributed to participants, so that they can re-rank them with respect to various facet inclinations (e.g. opinionated, in-depth, personal).</li>
<li style="font-family: arial;">Groups can participate in stage 2 without stage 1, and vice-versa. Stage 1 is akin to an adhoc blog search task.</li>
<li><span style="font-family: arial;">More topics for various facet inclinations.</span></li>
</ul>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
2. Top news story identification task will run again in 2010: The task addresses the news‐related dimension of the blogosphere. In particular, it investigates whether the blogosphere can be used to identify the most important news stories of the day. </div>
<div style="text-align: justify;">
<br /></div>
<ul style="font-family: times new roman; text-align: justify;">
<li style="font-family: arial;">Real-time news search task rather than retrospective.</li>
<li style="font-family: arial;">A much larger and more comprehensive sample of headlines, provided by a major news organisation.</li>
<li style="font-family: arial;">A two-stage submission procedure: (1) groups submit a ranking of top stories for some days per category (e.g. sport, politics, business, etc.); (2) we will then select some top relevant stories, for which we will ask the participating groups to identify the related blog posts, in a manner that covers the various/diverse aspects of each story.</li>
<li style="font-family: arial;">Groups can participate in stage 2 without stage 1. In that case, it is an adhoc diversity blog post search task, where the headline is the query.</li>
</ul>
<div style="text-align: justify;">
We welcome any feedback and comments on the tasks above to trecblog-organisers (at) dcs.gla.ac.uk</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Finally, note that if you wish to participate in TREC 2010, you should answer the <a href="http://trec.nist.gov/call2010.html">TREC 2010 call for participation</a>. We will update the Blog track wiki as things become more refined - keep following the Blog track developments as they happen on our dedicated <a href="http://ir.dcs.gla.ac.uk/wiki/TREC-BLOG">Wiki web site</a>.</div>Iadh Ounishttp://www.blogger.com/profile/05740425172350940695noreply@blogger.com9tag:blogger.com,1999:blog-6043705792807544709.post-20274136372343628482009-08-04T14:16:00.017+01:002012-06-26T11:31:06.601+01:00AcademTech: Faceted People Search<div style="text-align: justify;">
<a href="http://ir.dcs.gla.ac.uk/terrier/academtech">AcademTech</a> is a Computing Science-specific expert search engine based on the <a href="http://ir.dcs.gla.ac.uk/terrier/">Terrier IR Platform</a>. Persons working at Computing Science departments in Scottish Universities are considered as candidate experts by the system. Profiles of their expertise evidence are then mined from their homepages, publicly available digital libraries (e.g. DBLP) and related information found on the Web through Yahoo! BOSS. The ranking of experts is provided by a variant of the <a href="http://portal.acm.org/citation.cfm?id=1183671">Voting Model</a> expert search approach.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
The system is integrated with a novel faceted search interface to allow users to browse and explore the results using a number of categories such as Location or Conference/Journal publications. Each expert in the system has a profile page containing a number of elements including query specific supporting publications, most informative associated terms displayed as a tag cloud, co-authors and web links. Although the system is currently applied in the context of Scottish Computing Science Academia, it can easily be expanded to go beyond its current Scottish scope, cover other academic fields, and people in general.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
I was lucky enough to be able to demo AcademTech at <a href="http://www.sigir2009.org/">SIGIR 2009</a> in Boston on July 20th. I spoke to a large number of attendees and received a lot of very helpful feedback.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg7hzI9dkPbbpMlTJ3PtfuDhUaMTCrDQwoXc02NHkW36ONlEjDU68qEe-l01OGsdZQ3qxqWOc13uf5wxJ1cEggERj107JyJYjaASlZR97ox2OmhiiYIGbzv5uPRb_ptx3fAkeNPdRtzow4/s1600-h/duncan_mcdougall_academtech_sigir.jpg" onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}"><img alt="" border="0" id="BLOGGER_PHOTO_ID_5366129448105937730" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg7hzI9dkPbbpMlTJ3PtfuDhUaMTCrDQwoXc02NHkW36ONlEjDU68qEe-l01OGsdZQ3qxqWOc13uf5wxJ1cEggERj107JyJYjaASlZR97ox2OmhiiYIGbzv5uPRb_ptx3fAkeNPdRtzow4/s200/duncan_mcdougall_academtech_sigir.jpg" style="cursor: pointer; float: left; height: 150px; margin: 0pt 10px 10px 0pt; width: 200px;" /></a>A popular suggestion was to utilize AcademTech's core system in the scope of biology. This would meet the medical field's need for finding related organisms, diseases etc. Possible facets in the area would likely be <a href="http://en.wikipedia.org/wiki/Biological_classification">biological classifications</a> such as species and genus.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Daniel Tunkelang from <a href="http://thenoisychannel.com/">The Noisy Channel</a> suggested providing profile page-located facets, allowing filtering of search results by features present in a selected expert's page such as co-authors. This would satisfy an example scenario such as "Show me co-authors of this expert who work for the University of Glasgow." Profile facets could also allow the expert's publications list to be filtered by a number of fields such as co-author location, conference etc.</div>
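<div style="text-align: justify;">
The example scenario above could be served by a filter along the following lines. This is only a sketch under stated assumptions, not AcademTech's actual data model: the <code>Expert</code> record, its fields and the <code>coauthors_at</code> helper are hypothetical names introduced here for illustration.</div>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expert:
    """Hypothetical profile record; not AcademTech's real schema."""
    name: str
    affiliation: str
    coauthors: frozenset = frozenset()


def coauthors_at(expert_name, affiliation, experts):
    """Return the selected expert's co-authors who work at `affiliation`.

    Co-authors without a profile in `experts` are skipped, because their
    affiliation is unknown to the system.
    """
    by_name = {expert.name: expert for expert in experts}
    selected = by_name[expert_name]
    return sorted(
        name
        for name in selected.coauthors
        if name in by_name and by_name[name].affiliation == affiliation
    )


# Toy profiles, invented for this example.
experts = [
    Expert("Expert A", "University of Glasgow",
           frozenset({"Expert B", "Expert C", "Expert D"})),
    Expert("Expert B", "University of Glasgow", frozenset({"Expert A"})),
    Expert("Expert C", "University of Edinburgh", frozenset({"Expert A"})),
]
glasgow_coauthors = coauthors_at("Expert A", "University of Glasgow", experts)
```

<div style="text-align: justify;">
Here "Expert D" has no profile and is therefore left out of the result. The same pattern would extend to other profile facets, such as filtering a publications list by conference.</div>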
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Much of the feedback mirrored our intended future work. Name disambiguation is a high-priority update, as a current problem with AcademTech is the publication mismatch when multiple experts have the same name. In fact, the system is specifically designed to allow for the expansion of facets and for name disambiguation. With a large number of publication collaborators working in industry, a useful next step would be to expand the system to accommodate these experts.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<a href="http://www.dcs.gla.ac.uk/%7Ecraigm/publications/mcdougall09academtech-poster.pdf"><img alt="AcademTech Sigir 2009 Poster" border="1" id="BLOGGER_PHOTO_ID_5366141837372646322" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEicsL42VLwXd9x8phEbXSbjSNE6awh05eXCvxw1RHZHj3UaOJ1klIzV-Vn0Yl-qVDHGMtYywjJ4EjTso44WbvbF21iqfw06ZCR5-C9I8UessnwgqtPLadePacDFOenwiqsjchEKSW4_UsU/s200/academtech_poster.png" style="cursor: pointer; float: right; height: 142px; margin: 0pt 0pt 10px 10px; width: 200px;" /></a>AcademTech is now publicly accessible from <a href="http://owa1.dcs.gla.ac.uk/exchweb/bin/redir.asp?URL=http://www.terrier.org/academtech" target="_blank">http://www.terrier.org/academtech</a></div>
<div style="text-align: justify;">
A description of the system is available in the <a href="http://portal.acm.org/citation.cfm?id=1571941.1572154">SIGIR'09 proceedings</a>.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Thank you to all those who spoke to me and gave me some great feedback.</div>Duncan McDougallhttp://www.blogger.com/profile/06662063389408020091noreply@blogger.com2tag:blogger.com,1999:blog-6043705792807544709.post-19319969852247137692009-07-21T16:08:00.005+01:002012-06-26T11:49:09.454+01:00SIGIR 2009: Expert Search from Glasgow<div style="text-align: justify;">
<span class="Apple-style-span" style="font-size: 100%;">A short update from </span><span style="font-size: 100%;"><a href="http://www.sigir2009.org/"><span class="Apple-style-span">SIGIR09</span></a></span><span class="Apple-style-span" style="font-size: 100%;"> to announce our recently published work on expert search. This will hopefully be the first in a short series of posts about SIGIR this year.</span></div>
<div style="text-align: justify;">
<span class="Apple-style-span" style="font-size: 100%;"><br /></span></div>
<div style="text-align: justify;">
<span class="Apple-style-span" style="font-size: 100%;">In </span><span style="font-size: 100%;"><a href="http://www.dcs.gla.ac.uk/%7Ecraigm/publications/macdonald09perfect.pdf"><span class="Apple-style-span">On Perfect Document Rankings for Expert Search</span></a></span><span class="Apple-style-span" style="font-size: 100%;"> (Craig Macdonald & Iadh Ounis), we examine the effect of the document ranking on an expert search engine. Intuitively, improving the topical relevance of the document ranking should lead to an improvement in the quality of the generated ranking of experts. In this poster, we examine the extreme case, by making the document ranking component perfect with respect to topical relevance.</span></div>
<div style="text-align: justify;">
<span class="Apple-style-span" style="font-size: 100%;"><br /></span></div>
<div style="text-align: justify;">
<span class="Apple-style-span" style="font-size: 100%;">In </span><span style="font-size: 100%;"><a href="http://www.dcs.gla.ac.uk/%7Ecraigm/publications/macdonald09exclicks.pdf">Usefulness of Click-through data in Expert Search</a> (Craig Macdonald & Ryen White), we examine how user clicks on an intranet search engine can be used as features by an expert search engine. The proposed techniques are based on the voting techniques from the Voting Model, but use document clicks instead of weighting model scores. To our knowledge, this is the first work examining how clicks can be integrated into expert search.</span></div>
<div style="text-align: justify;">
<span class="Apple-style-span" style="font-size: 100%;"><span class="Apple-style-span"><br /></span></span></div>
<div style="text-align: justify;">
<span class="Apple-style-span" style="font-size: 100%;"><span class="Apple-style-span">Finally, the Voting Model was showcased in </span><span class="Apple-style-span"><span class="Apple-style-span"><a href="http://www.dcs.gla.ac.uk/%7Ecraigm/publications/mcdougall09academtech.pdf">Expertise Search in Academia using Facets</a></span></span><span class="Apple-style-span"> (Duncan McDougall & Craig Macdonald), which demoed </span><a href="http://terrier.org/academtech/"><span class="Apple-style-span">AcademTech</span></a><span class="Apple-style-span">, a faceted search interface for expert search in academia.</span></span></div>Craig Macdonaldhttp://www.blogger.com/profile/13764972230026912718noreply@blogger.com1tag:blogger.com,1999:blog-6043705792807544709.post-79111592696860327812009-06-04T10:38:00.003+01:002012-06-26T11:28:13.073+01:00CIKM 2011 in Glasgow!<div style="text-align: justify;">
We are delighted that our bid to host the <a href="http://www.cs.umbc.edu/cikm/">ACM Conference on Information and Knowledge Management</a> (CIKM 2011) in Glasgow has been successful.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
After the highly successful <a href="http://www.dcs.gla.ac.uk/essir2007">ESSIR 2007</a> and <a href="http://ecir2008.dcs.gla.ac.uk/">ECIR 2008</a> events, we are excited at the prospect of hosting the prestigious ACM CIKM Conference in Glasgow in 2011. We look forward to having our colleagues gather in Glasgow, and to surpassing their expectations.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Further information about the conference (dates, venues, etc.) will be available in due course.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<a href="http://www.comp.polyu.edu.hk/conference/cikm2009/about/">CIKM 2009</a> will be held on November 2-6, 2009, in Hong Kong. Hope to see you there!</div>Iadh Ounishttp://www.blogger.com/profile/05740425172350940695noreply@blogger.com5tag:blogger.com,1999:blog-6043705792807544709.post-50573017197791909232009-04-29T12:08:00.005+01:002012-06-26T11:50:11.330+01:00TREC Blog track 2009<div style="text-align: justify;">
We have just released a <a href="http://ir.dcs.gla.ac.uk/wiki/TREC-BLOG">draft of the guidelines</a> for the TREC 2009 Blog track.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Compared to previous years, the Blog track 2009 aims to investigate more refined and complex search scenarios. In particular, we propose to run two tasks in TREC 2009:</div>
<ul style="text-align: justify;">
<li>Faceted blog distillation: a more refined version of the blog distillation task that addresses the quality aspect of the retrieved blogs and mimics an exploratory search task. The task can be summarised as "<i>Find me a <b>good</b> blog with a principal, recurring interest in X</i>". We propose several facets for the TREC 2009 blog distillation task, which may vary in how difficult they are for participating systems to identify.</li>
<li>Top stories identification: a new pilot task that addresses the news dimension in the blogosphere. Systems are asked to identify the top news stories of a given day, and to provide a list of relevant blog posts discussing each news story. The ranked list of blog posts should be <b>diverse</b>, covering different aspects, perspectives or opinions of the news story.</li>
</ul>
<div style="text-align: justify;">
The new <a href="http://terrierteam.blogspot.com/2009/04/blogs08-collection-released.html">Blogs08 collection</a>, an up-to-date and large sample of the blogosphere from January 2008 to February 2009, will be used for both tasks.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
We welcome feedback. Please feel free to post feedback and comments about the proposed tasks for 2009.</div>Iadh Ounishttp://www.blogger.com/profile/05740425172350940695noreply@blogger.com4tag:blogger.com,1999:blog-6043705792807544709.post-9128559411960198742009-04-09T20:05:00.003+01:002012-06-26T11:51:00.522+01:00Blogs08 Collection Released<div style="text-align: justify;">
We are pleased to announce that the Blogs08 collection is now ready for distribution. As announced before, Blogs08 is one order of magnitude bigger than Blogs06, and samples the blogosphere from January 2008 to February 2009. The uncompressed permalinks amount to approximately 1.3TB; with feeds included, the collection totals over 2TB of data. As usual, the data is shipped compressed on a SATA hard drive.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
The distribution mechanism will be the same as for Blogs06. There is specific information about the size of the collection <a href="http://ir.dcs.gla.ac.uk/test_collections/blogs08info.html">here</a>, while the instructions for obtaining the collection are <a href="http://ir.dcs.gla.ac.uk/test_collections/access_to_data.html">here</a>.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
If you intend to participate in the <a href="http://trec.nist.gov/">TREC</a> 2009 Blog track, please start working on the paperwork right away, so that you can get the collection as soon as possible. Due to the larger size of the collection, we will operate a queuing system for shipping the data. Moreover, if you haven't done so already, please respond to the <a href="http://trec.nist.gov/call09.html">TREC 2009 Call for Participation.</a></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Blog track co-ordinators are finalising the guidelines for this year's tasks and will continue to update the <a href="http://ir.dcs.gla.ac.uk/wiki/TREC-Blog">TREC Blog wiki</a>, the TREC blog track mailing list and this blog.</div>Iadh Ounishttp://www.blogger.com/profile/05740425172350940695noreply@blogger.com2tag:blogger.com,1999:blog-6043705792807544709.post-20379827354228340352009-03-03T16:52:00.007+00:002012-06-26T11:52:30.500+01:00Craig's Thesis Available<div style="text-align: justify;">
Following my <a href="http://terrierteam.blogspot.com/2009/01/craig-successfully-defends-his-thesis.html">successful defence</a>, I'm pleased to announce that my thesis, titled <a href="http://www.dcs.gla.ac.uk/%7Ecraigm/thesis.shtml">The Voting Model for People Search</a>, is now available online.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
My thesis proposes the Voting Model for various people search problems, such as expert search in enterprise settings (<span style="font-style: italic;">find me someone who knows about...</span>), or blog(ger) search (<span style="font-style: italic;">find me a blog about the general topic...</span>). I also examine the reviewer assignment problem (<span style="font-style: italic;">suggest reviewers for this paper...</span>). In general, the Voting Model is concerned with the ranking of aggregates of documents.</div>
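<div style="text-align: justify;">
As a rough illustration of ranking aggregates of documents, the sketch below treats each retrieved document as a vote for the candidates it is associated with, and combines the votes with three simple aggregation techniques (Votes, CombSUM and CombMNZ). It is a minimal sketch and not the thesis implementation: the function name <i>rank_candidates</i>, the document identifiers, the scores and the candidate names are all hypothetical and chosen only for illustration.</div>

```python
from collections import defaultdict


def rank_candidates(doc_scores, doc_to_candidates, technique="CombSUM"):
    """Rank candidates (aggregates of documents) from a document ranking.

    doc_scores: hypothetical mapping of document id to retrieval score.
    doc_to_candidates: hypothetical mapping of document id to the
        candidates (e.g. experts or blogs) associated with that document.
    """
    votes = defaultdict(int)        # number of retrieved documents per candidate
    score_sum = defaultdict(float)  # sum of document scores per candidate

    # Every retrieved document votes for each candidate it is associated with.
    for doc_id, score in doc_scores.items():
        for candidate in doc_to_candidates.get(doc_id, ()):
            votes[candidate] += 1
            score_sum[candidate] += score

    if technique == "Votes":
        final = {c: float(votes[c]) for c in votes}
    elif technique == "CombSUM":
        final = dict(score_sum)
    elif technique == "CombMNZ":
        final = {c: votes[c] * score_sum[c] for c in votes}
    else:
        raise ValueError("unknown technique: " + technique)

    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(final.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy data, invented for this example only.
doc_scores = {"d1": 3.0, "d2": 2.0, "d3": 1.0}
doc_to_candidates = {"d1": ["alice"], "d2": ["bob"], "d3": ["bob", "carol"]}

for name in ("Votes", "CombSUM", "CombMNZ"):
    print(name, rank_candidates(doc_scores, doc_to_candidates, name))
```

<div style="text-align: justify;">
On this toy data, counting votes favours the candidate with the most associated documents, while summing scores also rewards a single highly scored document, which is why the choice of aggregation matters.</div>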
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
The experiments are mainly carried out using the TREC Enterprise track and Blog track test collections.</div>Craig Macdonaldhttp://www.blogger.com/profile/13764972230026912718noreply@blogger.com0