CSI4107, Winter 2024

 

Assignment 1

Due Feb 11, 14:00

 

Information retrieval system [50 points]

 

Note: The work should be done in groups of students and submitted via BrightSpace. Once a group member submits, all group members can see the submission.

 

You will implement an Information Retrieval (IR) system based on the vector space model for a collection of documents, available here. You will submit the results of your system on a set of 50 test queries (available here). Read more about the format of the collection, queries, and relevance judgments here. You also have the ideal answers, i.e., the relevance judgments (available here), so that you can evaluate the performance of your system before submission (for computing evaluation measures you can use the trec_eval script, the latest version).

 

Implement an indexing scheme based on the vector space model, as discussed in class. Alternatively, you can use an existing IR system from the Internet and adapt it to work on this collection and queries (this system can use the vector space model or a more advanced model). The steps pointed out in class can be used as guidelines for the implementation. For weighting, you can use the tf-idf weighting scheme (w_ij = tf_ij · idf_i). Alternatively, you can use a similar formula that is known to work well and lead to higher retrieval performance, called BM25. For each query, your system will produce a ranked list of documents, starting with the most similar to the query and ending with the least similar.
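
For reference, one standard formulation of BM25 (given here only as a sketch; the parameters k1 and b are typical defaults, not values prescribed by this assignment) scores a document D against a query Q as:

    score(D, Q) = \sum_{t \in Q} \mathrm{idf}(t) \cdot \frac{tf_{t,D}\,(k_1 + 1)}{tf_{t,D} + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}, \qquad \mathrm{idf}(t) = \ln\frac{N - df_t + 0.5}{df_t + 0.5}

where tf_{t,D} is the frequency of term t in D, df_t is the number of documents containing t, N is the number of documents in the collection, |D| is the length of D, avgdl is the average document length, and typical values are k1 between 1.2 and 2.0 and b = 0.75.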

 

Step 1. [10 points]  Preprocessing:  Implement preprocessing functions for tokenization and stopword removal. The index terms will be all the words left after filtering out markup that is not part of the text, punctuation tokens, numbers, stopwords, etc. Optionally, you can use the Porter stemmer to stem the index words. A small sketch is given after the input/output list below.

•       Input: Documents that are read one by one from the collection

•       Output: Tokens to be added to the index (vocabulary)
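
A minimal preprocessing sketch in Python, assuming the provided stopword list has been saved locally as stopwords.txt and that NLTK's PorterStemmer is used for the optional stemming; the file name and the library choice are illustrative assumptions, not requirements.

    import re
    from nltk.stem import PorterStemmer   # optional; only needed if stemming is enabled

    stopwords = set(open("stopwords.txt").read().split())   # assumed local copy of the provided stopword list
    stemmer = PorterStemmer()

    def preprocess(text, stem=False):
        text = re.sub(r"<[^>]+>", " ", text)          # strip markup that is not part of the text
        tokens = re.findall(r"[a-z]+", text.lower())  # keep alphabetic tokens; drops punctuation and numbers
        tokens = [t for t in tokens if t not in stopwords]
        if stem:
            tokens = [stemmer.stem(t) for t in tokens]
        return tokens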

 

Step 2. [10 points] Indexing:  Build an inverted index, with an entry for each word in the vocabulary. You can use any appropriate data structure (hash table, linked lists, Access database, etc.). A sketch of one possible index structure is given after the input/output list below. Note: if you use an existing IR system, use its indexing mechanism.

•       Input: Tokens obtained from the preprocessing module

•       Output: An inverted index for fast access
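
A minimal indexing sketch, assuming documents arrive as (doc_id, tokens) pairs from the Step 1 sketch; the term -> {doc_id: term frequency} dictionary layout is only one possible structure, not the required one.

    from collections import defaultdict

    inverted_index = defaultdict(dict)   # term -> {doc_id: term frequency}
    doc_lengths = {}                     # doc_id -> number of tokens (useful for normalization or BM25)

    def add_document(doc_id, tokens):
        doc_lengths[doc_id] = len(tokens)
        for t in tokens:
            inverted_index[t][doc_id] = inverted_index[t].get(doc_id, 0) + 1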

 


Step 3. [10 points] Retrieval and Ranking:  Use the inverted index (from Step 2) to find the limited set of documents that contain at least one of the query words. Compute the cosine similarity scores between the query and each of these documents. A sketch is given after the input/output list below.

 

•       Input: One query and the inverted index (from Step 2)

•       Output: Similarity values between the query and each of the documents. Rank the documents in decreasing order of similarity scores.
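
A minimal retrieval and ranking sketch using tf-idf weights and cosine similarity, assuming the inverted_index structure from the Step 2 sketch; N is the total number of documents, and the weighting variant is your choice (BM25 would replace the tf-idf products below).

    import math
    from collections import defaultdict

    def compute_doc_norms(N):
        # Precompute the Euclidean norm of each document's tf-idf vector (needed for cosine).
        norms = defaultdict(float)
        for term, postings in inverted_index.items():
            idf = math.log(N / len(postings))
            for doc_id, tf in postings.items():
                norms[doc_id] += (tf * idf) ** 2
        return {doc_id: math.sqrt(v) for doc_id, v in norms.items()}

    def search(query_tokens, N, doc_norms):
        scores = defaultdict(float)   # doc_id -> dot product of query and document vectors
        q_norm = 0.0
        for term in set(query_tokens):
            postings = inverted_index.get(term)
            if not postings or len(postings) == N:
                continue   # skip terms that are not in the index or occur in every document (idf = 0)
            idf = math.log(N / len(postings))
            w_q = query_tokens.count(term) * idf
            q_norm += w_q ** 2
            for doc_id, tf in postings.items():
                scores[doc_id] += w_q * (tf * idf)   # only documents sharing a query term are scored
        ranked = [(doc_id, dot / (math.sqrt(q_norm) * doc_norms[doc_id]))
                  for doc_id, dot in scores.items()]
        return sorted(ranked, key=lambda pair: pair[1], reverse=True)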

 

Run your system on the set of 50 test queries.  Include the output in your submission as a file named Results.  [10 points]

The file should have the following format, for the top-1000 results for each query/topic (queries listed in ascending order of topic number):

topic_id Q0 docno rank score tag
where: topic_id is the topic/query number, Q0 is an unused field (the literal 'Q0'), docno is the document id taken from the DOCNO field of the segment, rank is the rank assigned by your system to the segment (1 is the highest rank), score is the computed degree of match between the segment and the topic, and tag is a unique identifier you chose for this run (the same for every topic and segment). Example:

1 Q0 AP880815-0061 1 0.8032 run_name

1 Q0 AP881005-0001 2 0.7586 run_name

1 Q0 AP881002-0014 3 0.6517 run_name

…
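
A minimal sketch for producing the Results file in this format, assuming search() returns a ranked list of (doc_id, score) pairs as in the Step 3 sketch and that queries is a list of (topic_id, query_tokens) pairs built from the query file; the tag run_name is a placeholder for your own run identifier.

    with open("Results", "w") as out:
        for topic_id, query_tokens in queries:
            ranked = search(query_tokens, N, doc_norms)[:1000]   # keep only the top-1000 results per topic
            for rank, (doc_id, score) in enumerate(ranked, start=1):
                out.write(f"{topic_id} Q0 {doc_id} {rank} {score:.4f} run_name\n")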

 

Resources: Here is a list of stopwords to use in Step 1. In Step 1, you can use a regular expressions library to help with the filtering. If you want to use stemming, you can use the Porter stemmer. Stemming might increase performance slightly.

One way to increase the performance is to add a pseudo-relevance feedback loop.
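
One common way to do this (a sketch, not a required design) is Rocchio-style expansion: treat the top-k documents retrieved for the original query as if they were relevant, and move the query vector toward their centroid before re-running the search:

    \vec{q}_{new} = \alpha \, \vec{q} + \frac{\beta}{k} \sum_{d \in D_k} \vec{d}

where D_k is the set of top-k retrieved documents; typical starting points are k around 10-20, alpha = 1, and beta = 0.75, all of which are tunable assumptions.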

You can use any resources from the Internet, as long as you explain in your report how you used them (compilation, installation, adaptation, etc.).

You can use any IR system available on the Internet (Lucene, Lemur, Terrier, Inquire, etc.). One resource you can explore is the package ir.vsr. Some documentation on ir.vsr is available here.

 

Note: Feel free to add any optimizations or components that could improve the evaluation scores. Please explain them in your report and submit the Results file only for your best run. But please do not use neural information retrieval methods (based on deep learning, transformers, or BERT/GPT-based models). We will use them in Assignment 2.

 

Submission instructions:

   - write a README file (plain text, Word format, or pdf) [10 points for this report] including:

         * your names and student numbers. Specify how the tasks were divided between the team members (if this info is not provided, the penalty is 5 points).

         * a detailed note about the functionality of your programs,

         * complete instructions on how to run them

         * an explanation of the algorithms, data structures, and optimizations that you used in each of the three steps. How big was the vocabulary? Include a sample of 100 tokens from your vocabulary. Include the first 10 answers to queries 1 and 25. Discuss your results.

         * Include the Mean Average Precision (MAP) score computed with trec_eval for the results on the 50 test queries.  

   - produce a file named Results with the results for all the test queries for your best run, in the required format.

   - for the queries/topics, use only the titles for one run, and the titles plus descriptions for another run. Discuss which gives better results.

   - submit your assignment, including programs, README file, and Results file, as a zip file in BrightSpace (only one team member needs to submit).

   - don’t include the initial text collection. 

 

Have fun!!!