Winter 2024

CSI4107: Information Retrieval and the Internet


Instructor:  Diana Inkpen

E-mail: diana.inkpen@uottawa.ca 

Meeting Times and Locations

Lectures: See BrightSpace.
Office Hours: Fri, 1:30pm-2:30pm in SITE 5015

Overview

Basic principles of Information Retrieval. Indexing methods. Query processing. Linguistic aspects of Information Retrieval. Agents and artificial intelligence approaches to Information Retrieval. Relation of Information Retrieval to the World Wide Web. Search engines. Servers and clients. Browser and server-side programming for Information Retrieval.

Pre-Requisites

(CSI3103 or ELG3300), (CSI3125 or CSI2115 or SEG2101) or permission from the instructor.

Announcements:

Evaluation

Students will be evaluated on:

Note: Everything will be submitted electronically through BrightSpace.

Timetable (late assignments will not be accepted)

Recommended Textbook

Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Cambridge University Press, 2008 (online version available)

Pretrained Transformers for Text Ranking: BERT and Beyond, by Jimmy Lin, Rodrigo Nogueira and Andrew Yates, Morgan & Claypool, 2021


Other books:
Information Retrieval, by D. Grossman and O. Frieder, Springer, 2004 (second edition).
Information Retrieval, by C. J. van Rijsbergen, 1979 (available online).
Modern Information Retrieval, by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, 1999 (companion website available).

Course notes (additional reading, pdf file)



Syllabus (subject to minor modifications and updates)

The lecture slides will be in PDF format; you can read them with Acrobat Reader.


Week 1: 
Preliminaries. Introduction: Goals and history of IR. The impact of the web on IR. The role of artificial intelligence (AI) in IR.
The Internet and the WWW: History of the Internet. TCP/IP. IP addresses. WWW. HTTP. HTML. Web servers and clients.
Links: Top search engines in the US in 2010; Search Engine Watch; TREC; CLEF; FIRE


Week 2:
Basic IR Models: Boolean and vector-space retrieval models; ranked retrieval; text-similarity metrics; TF-IDF (term frequency/inverse document frequency) weighting; cosine similarity.
Slides on Implementation of Vector Space Model; Example discussed in class; Solution to the example.
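
As a rough illustration of the TF-IDF weighting and cosine similarity listed above, here is a minimal Python sketch. It assumes whitespace tokenization, raw term frequency, and idf = log10(N/df), which is only one of several common variants; the function names and the three toy documents are made up for this example and are not taken from the course slides.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Assumed weighting: raw term frequency * log10(N / df).
        vectors.append({t: tf[t] * math.log10(n_docs / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero length."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy collection (hypothetical documents).
docs = [
    "information retrieval models",
    "vector space retrieval",
    "web search engines",
]
vecs = tf_idf_vectors(docs)
sim_01 = cosine(vecs[0], vecs[1])  # the two documents share "retrieval"
sim_02 = cosine(vecs[0], vecs[2])  # no shared terms
```

In this toy collection the first two documents share one term, so their similarity is positive, while the first and third share none, so their similarity is zero.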


Week 3: 
Experimental Evaluation of IR: Performance metrics: recall, precision, and F-measure; Evaluations on benchmark text collections.
Interpolated precision. Example discussed in class; Solution to the example.
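
The metrics named above can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes binary relevance judgments and the usual 11 standard recall levels (0.0, 0.1, ..., 1.0), and the function names, document ids, and judgments are hypothetical.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall and balanced F-measure (F1)."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def interpolated_precision(ranking, relevant, levels=None):
    """Interpolated precision at each recall level for a ranked result list."""
    if levels is None:
        levels = [i / 10 for i in range(11)]
    relevant = set(relevant)
    points = []  # (recall, precision) measured at each relevant document
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    # Interpolated precision at level r: the maximum precision observed
    # at any recall >= r (0.0 if that recall level is never reached).
    return [max((p for r, p in points if r >= level), default=0.0)
            for level in levels]


# Hypothetical ranked output and relevance judgments.
ranking = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d6"}
p, r, f1 = precision_recall_f1(ranking, relevant)
interp = interpolated_precision(ranking, relevant)
```

Here two of the three relevant documents are retrieved, so recall never exceeds 2/3 and the interpolated precision is zero at recall levels 0.7 and above.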


Week 4: 
Query Operations and Languages: Relevance feedback; Query expansion; Query languages.
Example discussed in class; Solution (try it yourself first)
Links: WordNet; Corpus-based Similarity Demo; Dekang Lin's Demos; WordNet::Similarity
Text Representation: Word statistics; Zipf's law; Porter stemmer; morphology; index term selection; using thesauri. Metadata and markup languages (SGML, HTML, XML).
More slides on Web markup languages: HTML, XML, XHTML, RDF, OWL

Other materials: Semantic Web and Linked Data  

Links: Semantic Web; Linked Data video; Example: term frequencies in Tom Sawyer


Week 5: 
Web Search: Search engines; spidering; metacrawlers; directed spidering; link analysis (e.g. hubs and authorities, Google PageRank); shopping agents.
Link Analysis: the hubs and authorities algorithm, and PageRank algorithm.

PageRank; Hubs and authorities example discussed in class; Solution (try it yourself first); PageRank examples
Links: Google - Parallel architecture; Slides about the Google 1998 paper
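
For orientation, PageRank can be approximated by simple power iteration. The sketch below is a simplified version under stated assumptions: a damping factor of 0.85, a fixed number of iterations instead of a convergence test, and dangling pages spreading their rank evenly over all pages. The function name and the three-page graph are invented for illustration and are not the examples used in class.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    links: dict mapping each page to the list of pages it links to.
    Returns a dict mapping each page to its rank (ranks sum to 1).
    """
    pages = set(links) | {q for targets in links.values() for q in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the "random jump" share first.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # Split this page's rank equally among its out-links.
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # Dangling page (assumption): spread its rank over all pages.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank


# Hypothetical three-page web graph.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
```

In this graph, page C is linked from both A and B, while B is linked only from A, so C ends up with a higher rank than B.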


Week 6:
Text Categorization: Categorization algorithms: decision trees; Rocchio; k-nearest neighbor; Naive Bayes. Introduction to Deep Learning.

Links: Weka data mining tool; Scikit-learn; TensorFlow; PyTorch; Keras

Other materials: Extra slides on Naive Bayes; SVM; Sentiment Analysis
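
To make one of the categorization algorithms concrete, here is a small multinomial Naive Bayes sketch in plain Python. It assumes whitespace tokenization and add-one (Laplace) smoothing, and it ignores words never seen in training. The function names and the four training sentences are hypothetical; in practice a library such as Scikit-learn would normally be used.

```python
import math
from collections import Counter, defaultdict


def train_nb(labeled_docs):
    """labeled_docs: list of (text, label) pairs. Returns the model counts."""
    class_counts = Counter()            # number of documents per class
    term_counts = defaultdict(Counter)  # term frequencies per class
    vocab = set()
    for text, label in labeled_docs:
        tokens = text.lower().split()
        class_counts[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, term_counts, vocab


def classify_nb(model, text):
    """Return the class with the highest log-probability for the text."""
    class_counts, term_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_terms = sum(term_counts[label].values())
        for token in text.lower().split():
            if token in vocab:  # skip words never seen in training
                # Add-one smoothed estimate of P(token | class).
                score += math.log((term_counts[label][token] + 1)
                                  / (total_terms + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data.
training = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project meeting notes", "ham"),
]
model = train_nb(training)
label_a = classify_nb(model, "cheap pills")
label_b = classify_nb(model, "monday meeting")
```

Log-probabilities are summed rather than multiplying raw probabilities, which avoids numerical underflow on longer documents.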


Week 7:
Feb 19-24: Study Break (Reading Week, no classes)


Week 8: 
Feb 28: Midterm revision. Mar 1: Midterm (during class).


Week 9: 
Advanced IR Models: Neural Information Retrieval; Word embeddings; Transformers and BERT

Probabilistic models and LSI. Extra slides on LSI. Language Models for Information Retrieval.


Week 10:

Text Clustering: Clustering algorithms: agglomerative clustering; k-means. Applications to information filtering and organization.
Examples of text classification and clustering discussed in class; Solution (try it yourself first)
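
As an illustration of k-means, the following sketch clusters a handful of 2-D points. For text clustering the points would typically be document vectors (for example TF-IDF vectors); plain 2-D points are used here only to keep the example short. It assumes squared Euclidean distance, a fixed number of iterations, and a naive initialization that takes the first k points as centroids (real implementations usually pick initial centroids randomly or with k-means++). All names and data are hypothetical.

```python
def kmeans(points, k, iterations=20):
    """Basic k-means. Returns (cluster index per point, list of centroids)."""
    # Naive initialization (assumption): the first k points are the centroids.
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2
                                  for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return assignment, centroids


# Two visually separated groups of hypothetical points.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0),
          (8.0, 8.0), (1.0, 0.5), (9.0, 8.5)]
labels, centers = kmeans(points, k=2)
```

With this data the points near (1, 1) end up in one cluster and the points near (9, 9) in the other.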


Week 11:
Learning to Rank.


Week 12: 

Question Answering: Retrieving precise short answers to natural language queries.
Other materials: Slides about IBM's Watson. Links to IBM's Watson and DeepQA.


Week 13: 
Cross-Language IR. Image Information Retrieval.
Links: Content-based image retrieval


Week 14: 
Exam revision