ECIR 2008

Search and Discovery in User-Generated Text Content

Helped by the development of easy-to-use tools for content creation, experience sharing, and communication, we increasingly live our lives online. Blogs, forums, commenting tools, mailing lists, social network sites, and video and image sharing sites allow their users to make whatever information they want available online. For the first time in history, we are now able to collect huge amounts of user-generated content (UGC) within ``a blink of an eye''. The rapidly increasing amount of UGC poses challenges to the IR community, but also offers many previously unthinkable possibilities. In this tutorial we discuss different aspects of accessing (i.e., searching, tracking, and analyzing) UGC, with a focus on textual content.

The tutorial consists of four parts. Part one is dedicated to blog search: since the launch of the TREC Blog track in 2006, more and more insight has been gained into the specifics of searching blog posts and the challenges this poses compared to searching a general web collection. We look at blog post retrieval from both a topical and an opinionated point of view; the latter is especially UGC-specific and could give rise to a whole new area of opinion mining on the web. A third issue discussed in this part is feed distillation: instead of searching for individual blog posts, the task is to identify blogs that deal with a certain topic on a regular basis.

In Part two of the tutorial we consider intelligent access to UGC other than blogs, including searching social networks (``social search''), tracking comments on news stories, and discussion search on, e.g., mailing lists. Each of these forms of UGC has its own challenges and calls for specific approaches.

In Part three of the tutorial we explore the overlap between search and discovery in UGC and other branches of IR. For instance, one can view the feed distillation task as an expert (blogger) finding problem, and we give some insight into work in this field. An important problem in the area of UGC is spam: we touch on work done in spam identification. Important work has also been done on determining authoritativeness in UGC using linguistic characteristics: we look at how work in this field (possibly) influences work on accessing UGC. Finally, when accessing blogs, we are usually not so much interested in particular individual blog posts as in general trends and developments; this has clear links with multi-document summarization.

We conclude the tutorial by looking at future directions for research on UGC and at issues that remain open. This part of the tutorial is aimed especially at early-stage researchers.