CMSC 476/676

Information Retrieval

Spring 2024

Charles Nicholas nicholas@umbc.edu

ITE 356
Office Hours: Monday and Wednesday 2:30-4pm or by appointment.
410-455-2594 I don't check voicemail often, so email is better!
TA:
Ran Liu <rliu2@umbc.edu>
office hours: Monday and Tuesday 9-11am
WebEx: https://umbc.webex.com/meet/rliu2
Grader: TBD

This class is offered in hybrid format. The assigned classroom is ENG 231.

Remote participation via WebEx is fine, too. The WebEx link is here. To access this page, you will need to be on campus, or connected with the UMBC VPN.

The class meets 5:30-6:45pm, Tuesdays and Thursdays.

This course is an introduction to the theory and implementation of software systems designed to search through large collections of text. Did you ever wonder how World-Wide Web search engines work? Ever wondered why they don't? You'll learn about it here. Information retrieval (IR) is one of the oldest branches of computer science, and has influenced nearly every aspect of computer usage: "search and replace" in a word processor, querying a card catalog, grep'ing through your source code, filtering the spam out of your email, searching the Web.

This course will have two main thrusts. The first is to cover the fundamentals of IR: retrieval models, search algorithms, and IR evaluation. The second is to give a taste of the implementation issues by having you write (a good chunk of) your own text search engine and test it out on a sample text collection. This will be a semester-long project, details to follow.

You will need to have taken the equivalent of CMSC 341 (Data Structures), and an algorithms course (441 or 641) is recommended. Linear algebra (MATH 221) and Statistics (STAT 355) are recommended but not required; they give background which will be helpful in understanding many IR concepts.

Text and Handouts

We are using Introduction to Information Retrieval as the textbook.

Details about which chapters will be covered, and when, will follow. The slides to be used in class will be based on those provided by the authors of the textbook, but I may modify them from time to time. It'd be a good idea to study the slides BEFORE each class. Other papers and resources are available. Suggestions to add to this list are welcome.

The text from earlier offerings of the course, Modern Information Retrieval, second edition, by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, may be useful as a reference. You can see the slides for that book at http://www.mir2ed.org.

Grading

There will be a multi-phase programming project, details to be announced, worth 50% of the grade. There will be a mid-term exam, worth 25% of the grade. There will also be a writing project, worth 25%.

Those students enrolled in CMSC 676 will be expected to write a paper of the depth that might lead to a Master's Writing Project or Thesis. Graduate students will also be expected to present their writing projects at the end of the semester, and undergraduates are welcome to do so. These presentations will take the place of the final exam, and no final exam as such is planned.

Generative AI: For this class, if you use ChatGPT (or similar chatbots or AI-based generation tools), you must describe exactly how you used it, including providing the prompt, original generation, and your edits. This applies to prose, code, or any form of content creation. Not disclosing is an academic integrity violation. If you do disclose, your answer may receive anywhere from 0 to full credit, depending on the extent of substantive edits, achievement of learning outcomes, and overall circumvention of those outcomes.

Use of AI/automatic tools for grammatical assistance (such as spell-checkers or Grammarly) or small-scale predictive text (e.g., next word prediction, tab completion) is okay. Provided the use of these tools does not change the substance of your work, use of these tools may be, but is not required to be, disclosed.

Academic Integrity

Students are expected to do their own assignments. We may allow collaboration on certain assignments during the semester, but we will tell you so as that happens. If you submit for credit work that is not your own, there will be consequences, perhaps including zero on that assignment, reduction in final grade, or forfeiture of current or future prospects for financial aid from CSEE. Here is a web site that explains UMBC's position on Academic Integrity.

Resources for Students

Do you know about Retriever Essentials? It's there if you need it. According to their web site, "Retriever Essentials is a faculty, staff, and student-led partnership that promotes food access in the UMBC community. However, we offer more than just free groceries, we also offer toiletries, baby items, and meal swipes. The services we provide that are listed below are 100% free. You can find more in-depth information regarding each of our services in the attached documents."

We also incorporate the Syllabus Language provided by the UMBC Office of Equity and Civil Rights for this semester, as given here:
https://ecr.umbc.edu/sample-title-ix-responsible-employee-syllabus-language/

 

What Happens in the Spring 2024 Semester UNDER CONSTRUCTION

We will follow the textbook closely. I reserve the right to make minor changes along the way, but the basic structure will be as follows. Some chapters are long enough or important enough to warrant coverage over two lectures.

We will cover the chapters in the text in order, at the rate of approximately one chapter per week. The 676 presentations will take place in early May. The following is subject to change as progress of the class warrants.

Week Dates in 2024 Topics/Activities
1 1/30, 2/1

Introduction
Chapter 1 (ppt,pdf)
discuss writing project: a topic that interests you, which you can describe in a few sentences, and 3-4 sources of information
students in 676 are expected to write a paper of the depth that could be expanded into a M.S. Writing Project or Thesis
students in 676 are also expected to present their work to the class at the end of the semester
Recording for 1/30

Chapter 2 (ppt,pdf)
Before we get into slides, we need to discuss document representations
Dr. Nicholas has a copy of the Shakespeare Corpus. Other useful corpora are available.
Discuss the Terrier IR system
Recording for 2/1

2 2/6, 2/8

Discuss Terrier, and demo of Terrier Desktop installation
More about TREC http://trec.nist.gov
Recording for 2/6

NO OFFICE HOURS February 7, 2024. Send email and we can schedule a time to meet.

An interesting link to other IR Resources
Continue Chapter 2 slides
Recording for 2/8

3 2/13, 2/15

CLASS on TUESDAY 2/13 is REMOTE - nobody needs to (or should try to) attend in person! Thanks!

Begin to discuss first phase of programming project
Chapter 3 (ppt, pdf)
Recording for 2/13/2024

As of this date, we reserve the right to grant extra credit to people who are in class when it starts at 5:30!

CLASS on THURSDAY 2/15 is REMOTE - nobody needs to (or should try to) attend in person! Thanks!

Demo of the USPTO search engine located at Patent Public Search

Hints for term project:

  1. pick a topic in which you are interested
  2. describe it in a few sentences
  3. find 3-5 references, including the course textbook(s) (MRS covers some additional material), and seminal papers

Possible Paper Topics:

  • Various forms of term weighting - BM25, log-entropy, tf.idf, and others
  • IR for non-English (or non-Western) languages
  • text compression algorithms
  • author attribution, e.g. dialect or a specific person or genre
  • a specific author e.g. Lebanese-American Kahlil Gibran or Celine Dion's lyrics
  • construction of stoplists, from one language to another...

Please send your term project idea to me in an email to nicholas@umbc.edu within ten days.
finding information: Google, Google Scholar, find seminal papers, don't forget to look at patents

Recording for 2/15/2024

4 2/20, 2/22

CLASS on TUESDAY 2/20 and THURSDAY 2/22 is REMOTE - nobody needs to (or should try to) attend in person! Thanks!

Finish slides from before
A worksheet that explains Levenshtein Distance, and the (corrected) example.
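For reference, here is a minimal sketch of the standard dynamic-programming computation of Levenshtein distance. It is not taken from the worksheet; the function name and the unit cost for every insertion, deletion, and substitution are assumptions made for illustration.

```python
def levenshtein(source: str, target: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn source into target.
    Assumption: every edit operation costs 1."""
    # previous[j] is the distance between the first i-1 characters
    # of source and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning i characters into the empty string takes i deletions
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Only two rows of the table are kept at a time, so the sketch uses O(len(target)) memory and O(len(source) * len(target)) time.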

Recording for 2/20/2024

Chapter 4 (pdf, ppt)

Project Parameters:

  • A paper can be prepared using LaTeX, or a Jupyter Notebook!
  • What's being talked about at IR conferences?
  • Searching the ACM Digital Library
  • Searching the IEEE Digital Library

Phase 1 is due this Friday.
Recording for 2/22/2024
5 2/27, 2/29

CLASS on TUESDAY 2/27 and THURSDAY 2/29 is REMOTE - nobody needs to (or should try to) attend in person! Thanks!

Release Phase 2 of the project.
We will not spend a lot of time on Spell Correction in Chapter 5 (ppt, pdf)
We will spend some time on Chapter 6 (ppt, pdf)
You may find it helpful to look at this spreadsheet, which demonstrates some tf.idf concepts (xls)
Recording for 2/27/2024

Please submit project ideas by today!
More on term weighting in a term-document matrix.
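As a rough illustration of term weighting in a term-document matrix, the sketch below builds a tiny tf.idf matrix. It assumes raw term counts for tf and idf = log10(N/df), one common variant; the function name, the whitespace tokenizer, and the toy documents are hypothetical and are not taken from the spreadsheet or the project.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return {term: [weight in doc 0, weight in doc 1, ...]}.
    Assumptions: tf is the raw count, idf is log10(N / df)."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocabulary = sorted(set().union(*counts))
    n_docs = len(docs)
    # df: number of documents in which the term occurs at least once
    doc_freq = {term: sum(1 for c in counts if term in c) for term in vocabulary}
    return {
        term: [c[term] * math.log10(n_docs / doc_freq[term]) for c in counts]
        for term in vocabulary
    }


matrix = tfidf_matrix(["to be or not to be", "to do"])
print(matrix["to"])  # [0.0, 0.0] because "to" occurs in every document
print(matrix["be"])
```

A term that appears in every document gets weight zero under this variant, which is the usual argument for why very common words contribute little to ranking.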
Recording for 2/29/2024

6 3/5, 3/7

CLASS on TUESDAY 3/5 and THURSDAY 3/7 is REMOTE

More on Phase 2 of the project

Feedback on Phase 1 of the project: do NOT send me zip files. Don't send me the HTML corpus. Most people got almost full credit. I had some minor complaints with many of you regarding O() runtime analysis. It turns out that Python's Counter class does have O(1) lookup, on average. Sizes of HTML files may cause confusion.
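To make the runtime point concrete: collections.Counter is a subclass of dict, so a lookup is a hash-table probe, while building the counts takes time linear in the number of tokens. The sketch below is only an illustration; the function name and the whitespace tokenizer are assumptions, not the Phase 1 specification.

```python
from collections import Counter


def term_frequencies(text: str) -> Counter:
    """Count tokens in one document.
    Building the Counter is O(n) in the number of tokens."""
    tokens = text.lower().split()  # assumption: simple whitespace tokenization
    return Counter(tokens)


counts = term_frequencies("to be or not to be")
print(counts["be"])      # 2, an average O(1) hash lookup
print(counts["hamlet"])  # 0, missing keys return 0 instead of raising KeyError
```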

Feedback on paper topics: most if not all have been approved. When citing references, give me title and author and venue, not just a link. Many of you chose topics that you may find are too broad, so focus as you need to.

More on Chapter 6 as needed
Some coverage of the probabilistic model of IR from Chapter 7 (ppt, pdf)
That stuff is too hair-raising! Let's use these slides (ppt, pdf) from Croft's Search Engines
and this handout I prepared a few years ago
Recording for 3/5/2024

More on Probabilistic IR
A paper on BM25 by Lu, Robertson, and Macfarlane (pdf)
and I have noticed a Jupyter Notebook that explains BM25 and (maybe) lets one work with it!
More on Phase 2 as needed

Recording for 3/7/2024

7 3/12, 3/14

CLASS on TUESDAY 3/12 and THURSDAY 3/14 is REMOTE

Phase 2 of project is due Monday, March 11
More on BM25, including a demo with Google Colab!
Recording for 3/12/2024


Release Phase 3 of project
Recording for 3/14/2024

  3/19, 3/21 Spring Break
8 3/26, 3/28

CLASS on TUESDAY 3/26 and THURSDAY 3/28 is HYBRID. We will meet in ENG 231, or WebEx, as the student prefers.

A demo of Zipf's Law using Google Colab. (Inside UMBC only)

Recording for 3/26/2024

The midterm exam will really be on Thursday of this week. There will be no recording.

The exam will be available over Blackboard at 5:30pm. Go to Blackboard, select this class, select Course Materials, and select Midterm. Open book and open notes. Web search is allowed, but no AI help. No other collaboration is allowed.

Topics include material from the slides presented in class and textbook Chapters 1-8, PLUS the Levenshtein distance.
The mid-term exam I gave in 2009 (pdf)
The mid-term exam I gave in 2014 (pdf)
The midterm exam will be open book and open notes

9 4/2, 4/4

CLASS on TUESDAY and THURSDAY this week is HYBRID. We will meet in ENG 231, or WebEx, as the student prefers.

Go over the exam.
Recording for 4/2/2024.

Recording for 4/4/2024

10 4/9, 4/11

Evaluation of IR systems Chapter 8 (ppt, pdf)
Project 4 is now available, and is due April 24
Recording for 4/9/2024

Finish Chapter 8

Schedule your presentations using the link found here.
DEADLINE is 11:59pm Friday April 26!


Recording for 4/11/2024

11 4/16, 4/18

MRS coverage of Latent Semantic Analysis is a little thin.
So add Ian's LSI slides (pdf).
The seminal paper on LSI is Deerwester et al.
My example.
Recording from 4/16/2024

On Thursday, a special topic: authorship attribution.
Who Wrote This Document? (pdf)
Recording from 4/18/2024

12 4/23, 4/25

Class will be ONLINE ONLY today, Tuesday, April 23, because Dr. Nicholas has a minor illness. (Allergies, I think, not COVID.)
Chapter 9 from Croft's "Search Engines: Information Retrieval in Practice" (ppt,pdf)
Recording from 4/23/2024

Class will be hybrid on Thursday 4/25.
Students in CMSC 676 have until the end of class TODAY to sign up for a presentation slot.

Chapter 10 (ppt) as time permits
Chapter 11 (ppt) as time permits
Chapter 18 Web Crawling (ppt, pdf) as time permits
Project 5 is now available.

Recording from 4/25/2024

Schedule your presentations using the link found here.
DEADLINE is 11:59pm Friday April 26!

Format for student presentations: You can use your own, but I can suggest:

  • Title and your name
  • Brief introduction to your topic, and how you got interested (1-2 slides)
    • short bullet points are okay
  • Discussion of basic definitions and concepts (1 slide)
  • Survey of related work, "concept by concept" better than "paper by paper" (1-2 slides, OPTIONAL)
  • Some basic example or use case (2-3 slides, combine with next bullet)
  • Discussion of what still needs to be done in this area, and what could be done
  • Don't bother to list references
  • NO MORE THAN 8 SLIDES

  • Allow time for Q&A
  • Assuming you use Google Slides, share the link with me
  • Presentations will be virtual.
  • Presentations will be limited to 8-10 minutes.
  • Short presentations are okay!

13 4/30, 5/2

On Tuesday, I'll be giving a dry run of my Research Day talk.
You've heard parts of this before, I know, but your feedback will be appreciated.
Class on Tuesday will be REMOTE, but hybrid on Thursday.

Recording from 4/30/2024

Student presentations for Thursday.
For each speaker, fill in this feedback form.

Recording from 5/2/2024

Please participate in CSEE Research Day on Friday May 3!

14 5/7, 5/9

Project 5 is due on May 8.

Student presentations for Tuesday

  • Siddhardha Gunnam
  • Vamsi Vajja
  • Anushka Dhekne
  • chaitanya
  • Sai Teja Challa

Recording from 5/7/2024

Student presentations for Thursday

  • Divya sree tamma
  • Saravan Pathapati
  • Vishal Goud Sakkari
  • Sukhbir Singh Sardar
  • Namratha Siddula (postponed until next time)

For each speaker, fill in this feedback form.
Recording from 5/9/2024

Papers are due via Blackboard by 11:59pm May 11.

Extra credit +10 if you submit your paper by 11:59pm May 6.
Extra credit +5 if you submit your paper by the original deadline of 11:59pm May 10.
PDF ONLY

Approximately seven to ten pages, double-spaced, with judicious use of figures and tables

15 5/14

Student presentations for Tuesday

  • Devon Slonaker
  • Prakhar Dixit
  • Shawn Bray
  • Rama Sai Mamidala
  • Jai Kishan Timmapatruni
  • Chris Abili

For each speaker, fill in this feedback form.
Recording from 5/14/2024

    NO FINAL EXAM, the writing project takes the place of the final exam