ECIR 2008

Researching and building IR applications using Terrier

1. Introduction

Terrier is a flexible open-source platform for developing Information Retrieval (IR) applications and for performing IR research. It has a growing user community and an active discussion forum, which keep the platform lively and its quality continually improving. The platform suits researchers who want to explore the field and the existing retrieval models, knowing that it can later be extended to support their own research. Terrier also provides a comprehensive teaching platform for lecturers running undergraduate and postgraduate information retrieval courses.

This tutorial will introduce the main design of a scalable IR system, and use the Terrier platform as an example of how one should be built. We detail the architecture and data structures of Terrier, as well as the weighting models included, and describe, with examples, how Terrier can be used to perform experiments and extended to facilitate new research and applications.

2. The main design of an IR system

We describe the indexing architecture of a scalable IR system, complemented by the two indexing architectures adopted by Terrier (listed below). The main retrieval data structures are discussed together with commonly applied compression techniques, in particular those implemented in Terrier.

Direct file indexing with sort-based inversion
Single pass indexing with memory-based inversion
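
As a rough illustration of memory-based inversion and posting-list compression, the following minimal sketch builds postings in memory in a single pass and then stores document-id gaps with variable-byte coding. It is not Terrier code: the function names are hypothetical, whitespace tokenisation is a simplification, and variable-byte coding is used here only as one well-known compression scheme, not necessarily the one Terrier implements.

```python
from collections import defaultdict


def build_index(docs):
    """Single pass over the documents; postings are accumulated in memory.

    Returns a dict mapping term -> list of (docid, term frequency),
    with docids in increasing order.
    """
    index = defaultdict(list)
    for docid, text in enumerate(docs):
        counts = defaultdict(int)
        for token in text.lower().split():  # simplistic tokeniser (assumption)
            counts[token] += 1
        for term, tf in counts.items():
            index[term].append((docid, tf))
    return dict(index)


def vbyte_encode(n):
    """Variable-byte code: 7 payload bits per byte, high bit marks the last byte."""
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)
        n >>= 7
    out.append(n | 0x80)
    return bytes(out)


def vbyte_decode(data):
    """Decode a byte string produced by concatenating vbyte_encode outputs."""
    nums, n, shift = [], 0, 0
    for b in data:
        if b & 0x80:  # last byte of the current number
            nums.append(n | ((b & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= b << shift
            shift += 7
    return nums


def compress_postings(postings):
    """Store each posting as (docid gap, tf); gaps are small for frequent terms."""
    out = bytearray()
    prev = 0
    for docid, tf in postings:
        out += vbyte_encode(docid - prev)
        out += vbyte_encode(tf)
        prev = docid
    return bytes(out)


def decompress_postings(data):
    """Invert compress_postings by summing the gaps back into docids."""
    nums = vbyte_decode(data)
    postings, docid = [], 0
    for gap, tf in zip(nums[0::2], nums[1::2]):
        docid += gap
        postings.append((docid, tf))
    return postings
```

Storing gaps rather than absolute document ids keeps most values small, which is what makes byte- or bit-level integer codes effective on posting lists.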

Terrier comes with Divergence From Randomness (DFR) weighting models, as well as other popular weighting models such as TF-IDF, Okapi BM25 and language modelling. We will introduce the general idea behind a weighting model and explain the DFR idea for weighting terms. We show that the same DFR idea can be naturally used for Query Expansion. In all cases, weighting models are explained and their implementation within Terrier is illustrated.
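
To make the notion of a weighting model concrete, here is a minimal sketch of the textbook form of BM25 for a single query term in a single document. The function name, parameter names and default values (k1 = 1.2, b = 0.75) are illustrative assumptions; this is not Terrier's implementation, which may differ in details such as query term frequency handling.

```python
import math


def bm25(tf, df, doc_len, avg_doc_len, num_docs, k1=1.2, b=0.75):
    """Textbook BM25 score of one term in one document.

    tf          -- frequency of the term in the document
    df          -- number of documents containing the term
    doc_len     -- length of the document in tokens
    avg_doc_len -- average document length in the collection
    num_docs    -- number of documents in the collection
    """
    if tf == 0:
        return 0.0
    # Robertson/Sparck Jones style idf; negative when df > num_docs / 2.
    idf = math.log((num_docs - df + 0.5) / (df + 0.5))
    # Term frequency saturates with k1 and is normalised by document length via b.
    norm = tf + k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / norm
```

A document's score for a query is then the sum of the per-term scores. The two components visible here, a term frequency normalised by document length and a measure of how rare the term is in the collection, are what most weighting models share in some form.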

3. Experimenting and Researching with Terrier

Terrier supports experimentation on many standard test collections, including off-the-shelf support for TREC experiments. In addition, Terrier is a flexible platform that allows easy implementation of your own research ideas, giving researchers a rapid path from idea development to experimentation.

In this part of the tutorial, we will focus on how to implement your own ideas and methods to facilitate your research. For instance, we will show with examples how to extract text from your own collection of documents, and how to determine the most informative terms from a set of documents. In particular, we provide an overview of how to implement current state-of-the-art applications, such as opinion finding retrieval (cf. TREC Blog track), document prior integration (cf. Web IR) and others.
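
One common way to determine the most informative terms of a document set is to rank terms by their Kullback-Leibler divergence contribution between the set and the whole collection. The sketch below assumes that measure, whitespace tokenisation, and hypothetical function and parameter names; it is an illustration of the idea rather than the method used in the tutorial or in Terrier.

```python
import math
from collections import Counter


def informative_terms(doc_set, collection, top_k=5):
    """Rank the terms of doc_set by p_set * log2(p_set / p_coll).

    doc_set    -- list of document texts of interest (e.g. top-ranked documents)
    collection -- list of all document texts, including those in doc_set
    Returns up to top_k (term, score) pairs, highest score first.
    """
    set_counts = Counter(t for d in doc_set for t in d.lower().split())
    coll_counts = Counter(t for d in collection for t in d.lower().split())
    set_len = sum(set_counts.values())
    coll_len = sum(coll_counts.values())

    scores = {}
    for term, tf in set_counts.items():
        p_set = tf / set_len
        p_coll = coll_counts[term] / coll_len
        if p_coll == 0:  # only possible if doc_set is not part of collection
            continue
        scores[term] = p_set * math.log2(p_set / p_coll)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```

Terms that are proportionally more frequent in the set than in the collection receive positive scores, while terms that are common everywhere score near or below zero. The highest-scoring terms are natural candidates for query expansion.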

4. Course Materials

Handouts containing slides, a Terrier "crib sheet", and detailed examples of implementations of common research problems will be provided, in addition to a bibliography of informative related papers.