- We looked at building inverted indexes, mostly an introduction.
- Tokens, normalized tokens roughly equal words
- Inverted indexes look sort of like linked lists (?): each term points to a list of the documents that contain it
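
To make the inverted index idea concrete, here is a minimal sketch in Python. The names `tokenize` and `build_inverted_index` and the three toy documents are my own assumptions, not anything from the book, and the normalization is deliberately crude (lowercasing plus stripping surrounding punctuation).

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = []
    for raw in text.lower().split():
        tok = raw.strip(".,;:!?\"'()")
        if tok:
            tokens.append(tok)
    return tokens


def build_inverted_index(docs):
    """Map each term to a sorted list of the IDs of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    # Sorted postings lists are what make merge-style intersection possible.
    return {term: sorted(ids) for term, ids in postings.items()}


# Toy collection (made up for illustration).
docs = {
    1: "Friends, Romans, countrymen.",
    2: "Romans and friends",
    3: "Lend me your ears",
}
index = build_inverted_index(docs)
print(index["romans"])  # [1, 2]
```

Each dictionary entry is a term and its postings list, which is where the linked-list resemblance comes from.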
IIR Chapter 2:
- First we read about document types and what constitutes a document
- You don't want to consider a collection of books a document
- Conversely you don't want to consider a paragraph one either
- Try to find a happy medium, or tailor it to your need
- Interesting look at the challenges of languages
- A token is an instance of a sequence of characters
- A type is the class of all tokens containing the same character sequence
- A term is a (possibly normalized) type that is included in the IR system's dictionary
- You might not include stop words like "the"
- Looked at how to deal with names such as O'Neill
- Normalization is the next step
- Followed possibly by stemming and lemmatization.
- It strikes me as interesting that stemming and lemmatization are still problematic and don't always solve the problem
- A discussion of skip pointers follows
- I'm still a little hazy on this.
- Biword indexes, with their ability to weed out most false positives, seem to have a lot of upsides and not so many downsides when it comes to phrase search
- "friends romans", "romans countrymen"
- Then positional schemes are introduced, followed by the combination of the biword and positional to find "Michael Jackson"
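
Since skip pointers were the hazy part of this chapter, here is a rough sketch of how I understand them. The assumptions are mine: postings lists are plain sorted Python lists, a skip pointer sits at every sqrt(n)-th position and points sqrt(n) entries ahead (the evenly spaced heuristic), and `intersect_with_skips` is a made-up name.

```python
import math


def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists, following skip pointers when safe."""
    # Assumed layout: a skip pointer at every step-th index, jumping step ahead.
    step1 = max(1, int(math.sqrt(len(p1))))
    step2 = max(1, int(math.sqrt(len(p2))))
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Take the skip only if its target does not overshoot p2[j];
            # everything jumped over is then smaller than p2[j] anyway.
            if i % step1 == 0 and i + step1 < len(p1) and p1[i + step1] <= p2[j]:
                i += step1
            else:
                i += 1
        else:
            if j % step2 == 0 and j + step2 < len(p2) and p2[j + step2] <= p1[i]:
                j += step2
            else:
                j += 1
    return result


print(intersect_with_skips([2, 4, 8, 16, 19, 23, 28, 43],
                           [1, 2, 3, 5, 8, 41, 51, 60, 71]))  # [2, 8]
```

The result is the same as an ordinary merge; the skips only save comparisons when one list has a long run of IDs that cannot match.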
IIR Chapter 3:
- We begin with the venerable binary tree and b-tree.
- b-trees are nice because they can help when the dictionary is disk-based.
- Then a discussion of wildcard queries begins
- Permuterm indexes.
- I still don't understand what's going on in these.
- k-gram indexes
- Spelling correction
- Choose the "nearest one"
- Choose the most likely if two words are equally close
- Forms of spelling correction
- isolated-term: correct a single query term at a time
- context-sensitive: use the surrounding words to pick the most likely candidate
- "flew form Heathrow" becomes "flew FROM Heathrow"
- Edit distance uses a matrix to determine how far one word is from another
- "cats" has a Levenshtein distance of 3 from "fast"
- k-gram indexes for spelling correction
- how many k-grams does a candidate have in common with the query term
- Phonetic correction (the most interesting in my mind)
- make a soundex index
- I always wondered if something like this existed
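
Going back to the permuterm indexes I was stuck on in the wildcard notes above, this is a small sketch of my current understanding. Assumptions: `$` marks the end of a term, the query has exactly one `*`, and a linear scan over a dict stands in for the B-tree prefix lookup. All function names are made up.

```python
def rotations(term):
    """All rotations of term + '$', where '$' marks the end of the term."""
    marked = term + "$"
    return [marked[i:] + marked[:i] for i in range(len(marked))]


def build_permuterm(vocabulary):
    """Map every rotation back to the term it came from."""
    index = {}
    for term in vocabulary:
        for rot in rotations(term):
            index[rot] = term
    return index


def wildcard_lookup(index, query):
    """Rotate a single-'*' query so the '*' ends up last, then prefix-match."""
    head, tail = query.split("*")  # assumes exactly one '*'
    prefix = tail + "$" + head     # X*Y becomes Y$X*
    # A real index would do this prefix search in a B-tree, not a scan.
    return sorted({term for rot, term in index.items() if rot.startswith(prefix)})


permuterm = build_permuterm(["man", "moron", "moon", "mean", "hello"])
print(wildcard_lookup(permuterm, "m*n"))  # ['man', 'mean', 'moon', 'moron']
```

The trick is that rotating the query turns any single-wildcard lookup into a plain prefix lookup, at the cost of storing every rotation of every term.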
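
The edit distance matrix from the spelling correction notes above can be sketched like this. It is the plain Levenshtein version (insert, delete, and substitute each cost 1), and the function name `levenshtein` is just my choice.

```python
def levenshtein(s, t):
    """Levenshtein distance between s and t via the usual dynamic-programming matrix."""
    rows, cols = len(s) + 1, len(t) + 1
    m = [[0] * cols for _ in range(rows)]
    # Turning a prefix into the empty string costs one delete per character.
    for i in range(rows):
        m[i][0] = i
    for j in range(cols):
        m[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            m[i][j] = min(
                m[i - 1][j - 1] + cost,  # substitute (or copy if equal)
                m[i - 1][j] + 1,         # delete from s
                m[i][j - 1] + 1,         # insert into s
            )
    return m[-1][-1]


print(levenshtein("cats", "fast"))  # 3
print(levenshtein("form", "from"))  # 2
```

Note that "form" to "from" costs 2 here because plain Levenshtein has no transposition operation.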
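
For the soundex index, here is a sketch of the simplified steps as I remember them from the chapter: keep the first letter, map the remaining letters to digits, collapse repeated digits, drop the zeros, and pad to four characters. The official Soundex rules treat H, W, and the first letter a little differently, so treat this as an approximation; `soundex` and `SOUNDEX_CODES` are my own names.

```python
# Letter groups and their digits; vowels and H, W, Y map to "0" and are dropped later.
GROUPS = [
    ("AEIOUHWY", "0"),
    ("BFPV", "1"),
    ("CGJKQSXZ", "2"),
    ("DT", "3"),
    ("L", "4"),
    ("MN", "5"),
    ("R", "6"),
]
SOUNDEX_CODES = {ch: digit for letters, digit in GROUPS for ch in letters}


def soundex(term):
    """Four-character phonetic code: first letter plus up to three digits."""
    letters = [c for c in term.upper() if c in SOUNDEX_CODES]
    if not letters:
        return ""
    digits = [SOUNDEX_CODES[c] for c in letters[1:]]
    # Collapse runs of identical digits into one.
    collapsed = []
    for d in digits:
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    kept = [d for d in collapsed if d != "0"]
    # Pad with zeros and keep the first four positions.
    return (letters[0] + "".join(kept) + "000")[:4]


print(soundex("Hermann"))  # H655
print(soundex("Herman"))   # H655
```

Terms that share a code would go into the same postings list of the soundex index, so a query for one spelling can find the other.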