Vechtomova O. Approaches to Using Word Collocation in Information Retrieval. PhD Thesis, City University, London, UK, 2001 [Full text]


The thesis explores long-span collocation and its application in information retrieval. The basic research question of the thesis is whether the use of long-span collocates can improve performance of a probabilistic model of IR. The model used in the project is the Robertson & Sparck Jones probabilistic model.

The basic research question was explored by investigating three different ways of integrating collocation information with the probabilistic model:

  1. Global collocation analysis. The method consists in expanding the original query with long-span global collocates of query terms. Global collocates of a query term are selected from large fixed-size windows around all occurrences of a term in the corpus and ranked by statistical measures of Mutual Information (MI) and Z score. A fixed number of top-ranked collocates is used in query expansion.

    Query expansion with global collocates did not prove superior to the original queries. A possible reason is that query terms often have a fairly broad meaning and, hence, a semantically heterogeneous pattern of occurrence.

  2. Local collocation analysis. This method is a form of iterative query expansion following relevance or pseudo-relevance (blind) feedback. The original query is expanded with the query terms' collocates, which are extracted from long-span windows around all occurrences of query terms in the known relevant documents and selected using the statistical measures of MI and Z. The parameters whose effect was systematically studied in this set of experiments include window size, the measure of collocation significance used for collocate ranking, the number of query expansion collocates, and the categories of terms in the expanded queries.

    Some results showed a tendency towards a performance gain over relevance feedback in the probabilistic model; however, the gain was not significant enough to conclude that this method is superior to the existing relevance feedback used in the model.

  3. Lexical cohesion analysis using local collocations. This experiment set aimed to explore whether the level of lexical cohesion between query terms in a document can be linked to the document's relevance property, and if so, whether it can be used to predict documents' relevance to the query. Lexical cohesion between different query terms is estimated from the number of collocates they have in common.

    The experiments showed a statistically significant association between the level of lexical cohesion of the query terms in documents and relevance. Another set of experiments, aimed at using lexical cohesion to improve probabilistic document ranking, showed that sets re-ranked by their lexical cohesion scores perform similarly to the original ranking.
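
The collocate ranking used in methods 1 and 2 can be illustrated with a minimal sketch. It assumes common textbook formulations of windowed MI and Z score (observed versus expected co-occurrence within a fixed span), which are not necessarily the exact formulas used in the thesis. The function names `collocate_scores` and `top_collocates`, the `half_window` parameter, and the toy token list are hypothetical.

```python
import math
from collections import Counter


def collocate_scores(tokens, node, half_window=3):
    """Score terms that co-occur with `node` within +/- half_window tokens.

    Assumed formulation (not taken from the thesis):
      expected = f(node) * f(term) * span / N
      MI       = log2(observed / expected)
      Z        = (observed - expected) / sqrt(expected)
    """
    n = len(tokens)
    freq = Counter(tokens)
    joint = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo, hi = max(0, i - half_window), min(n, i + half_window + 1)
        for j in range(lo, hi):
            # Skip the node itself and other occurrences of the node.
            if j != i and tokens[j] != node:
                joint[tokens[j]] += 1
    span = 2 * half_window  # window positions around each node occurrence
    scores = {}
    for term, observed in joint.items():
        expected = freq[node] * freq[term] * span / n
        mi = math.log2(observed / expected)
        z = (observed - expected) / math.sqrt(expected)
        scores[term] = (mi, z)
    return scores


def top_collocates(scores, k, measure="mi"):
    """Return the k highest-scoring collocates by MI or Z."""
    idx = 0 if measure == "mi" else 1
    return sorted(scores, key=lambda t: scores[t][idx], reverse=True)[:k]


# Toy corpus; a real experiment would use much larger windows and corpora.
tokens = ["x", "y", "a", "b", "c", "d", "x", "y", "e", "f"]
scores = collocate_scores(tokens, "x", half_window=1)
expansion_terms = top_collocates(scores, k=1, measure="z")
```

For global analysis the token stream would be the whole corpus; for local analysis it would be restricted to the known (or assumed) relevant documents. The top-ranked terms would then be added to the original query.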

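The lexical cohesion estimate in method 3 can be sketched in the same spirit. This is only an illustration under stated assumptions: collocates are taken as the set of distinct terms in a fixed window around each occurrence of a query term in one document, and the score is the number of collocates shared by each pair of distinct query terms. The names `window_collocates` and `cohesion_score` are hypothetical, and the thesis may count or normalise shared collocates differently.

```python
from itertools import combinations


def window_collocates(tokens, node, half_window=3, exclude=()):
    """Collect distinct terms within +/- half_window tokens of `node`."""
    found = set()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo = max(0, i - half_window)
        hi = min(len(tokens), i + half_window + 1)
        found.update(
            t for t in tokens[lo:hi] if t != node and t not in exclude
        )
    return found


def cohesion_score(tokens, query_terms, half_window=3):
    """Count collocates shared by each pair of distinct query terms."""
    terms = sorted(set(query_terms))
    # Query terms themselves are excluded from the collocate sets.
    sets = {
        q: window_collocates(tokens, q, half_window, exclude=terms)
        for q in terms
    }
    return sum(len(sets[a] & sets[b]) for a, b in combinations(terms, 2))


doc = ("solar panels convert light energy into power while "
       "wind turbines convert energy too").split()
score = cohesion_score(doc, ["solar", "wind"], half_window=4)
```

In this toy document the two query terms share the collocates "convert" and "energy". Documents could then be re-ranked by such a score, which is the kind of re-ranking the second set of experiments evaluated.
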
© Olga Vechtomova
Last updated: 15 January, 2004