Tom's INFSCI 2140 Reading & Muddiest Notes: Reading Notes - Unit 2

Friday, January 10, 2014

Reading Notes - Unit 2

IIR Section1-2:

We looked at building inverted indexes; mostly an introduction.
Tokens, once normalized, roughly equal words
An inverted index looks sort of like a set of linked lists (?): each dictionary term points to a postings list of document IDs
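
To make the dictionary-plus-postings picture concrete, here is a minimal sketch of building an inverted index. The function name `build_inverted_index` and the two toy documents are made up for illustration, and "normalization" here is just lowercasing and splitting on whitespace.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased token to a sorted list of doc IDs (its postings list)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    # Sorted postings make later list intersection straightforward.
    return {term: sorted(doc_ids) for term, doc_ids in index.items()}

# Toy collection, invented for this sketch.
docs = {1: "Friends Romans countrymen", 2: "Romans lend me your ears"}
index = build_inverted_index(docs)
```

Looking up `index["romans"]` gives the postings list for that term, which is the part that reminded me of a linked list.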

IIR Chapter 2:

First we read about document types and what constitutes a document

You don't want to consider a collection of books a document
Conversely you don't want to consider a paragraph one either
Try to find a happy medium, or tailor it to your need

Interesting look at the challenges of languages
A token is an instance of a sequence of characters
A type is the class of all tokens containing the same character sequence
A term is a (possibly normalized) type that is included in the IR system's dictionary

You might not include stop words like "the"

Looked at how to deal with names such as O'Neill
Normalization is the next step
Followed possibly by stemming and lemmatization.
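
A rough sketch of the token-to-term steps above, assuming a regex tokenizer that keeps internal apostrophes (so O'Neill stays one token), lowercasing as the only normalization, and a tiny made-up stop list. The names `tokenize`, `to_terms`, and `STOP_WORDS` are hypothetical, and stemming or lemmatization is left out.

```python
import re

# Tiny illustrative stop list; real systems use longer lists or none at all.
STOP_WORDS = {"the", "a", "an", "of", "to"}

def tokenize(text):
    """Split text into tokens, keeping internal apostrophes ("O'Neill")."""
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)

def to_terms(text):
    """Lowercase each token and drop stop words."""
    lowered = [token.lower() for token in tokenize(text)]
    return [term for term in lowered if term not in STOP_WORDS]
```

Other reasonable tokenizers would split O'Neill differently, which is exactly the problem the chapter raises.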

It strikes me as interesting that stemming and lemmatization are still problematic and don't always solve the problem

A discussion of skip pointers follows

I'm still a little hazy on this.
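
Working through a small sketch helped with the haziness. The idea is that while intersecting two sorted postings lists, a skip pointer lets you jump ahead several entries when the jump target is still not past the value you are chasing. This version assumes skips are placed every sqrt(length) entries and simulates them with list indexes rather than real pointers; `intersect_with_skips` is a made-up name.

```python
import math

def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists of distinct doc IDs using skips."""
    skip1 = max(1, int(math.sqrt(len(p1))))
    skip2 = max(1, int(math.sqrt(len(p2))))
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Skip pointers live only at positions that are multiples of skip1.
            # Take the skip only if its target does not overshoot p2[j].
            if i % skip1 == 0 and i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1
            else:
                i += 1
        else:
            if j % skip2 == 0 and j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return answer
```

The result is the same as a plain merge intersection; the skips only reduce how many entries get compared.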

Biword indexes, with their ability to weed out most false positives, seem to have a lot of upsides and not many downsides when it comes to search...

"friends romans", "romans countrymen"

Then positional schemes are introduced, followed by the combination of the biword and positional to find "Michael Jackson"
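
Here is a small sketch comparing the two schemes on a three-word phrase. All function names and the toy documents are invented for illustration. The biword query ANDs together the postings of each adjacent pair, so it can still return a document where the pairs occur but not as one continuous phrase; the positional query checks actual word positions.

```python
from collections import defaultdict

def build_biword_index(docs):
    """Index every pair of adjacent tokens as a single dictionary entry."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        for first, second in zip(tokens, tokens[1:]):
            index[first + " " + second].add(doc_id)
    return index

def biword_phrase_query(index, phrase):
    """AND together the postings of each biword in the phrase."""
    tokens = phrase.lower().split()
    biwords = [a + " " + b for a, b in zip(tokens, tokens[1:])]
    if not biwords:
        return set()
    result = set(index.get(biwords[0], set()))
    for biword in biwords[1:]:
        result &= index.get(biword, set())
    return result

def build_positional_index(docs):
    """term -> doc ID -> list of positions where the term occurs."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, token in enumerate(text.lower().split()):
            index[token][doc_id].append(position)
    return index

def positional_phrase_query(index, phrase):
    """Return docs where the phrase tokens occur at consecutive positions."""
    tokens = phrase.lower().split()
    if not tokens or any(token not in index for token in tokens):
        return set()
    hits = set()
    for doc_id, starts in index[tokens[0]].items():
        for start in starts:
            if all(start + offset in index[token].get(doc_id, [])
                   for offset, token in enumerate(tokens)):
                hits.add(doc_id)
                break
    return hits

# Toy collection: doc 3 contains both biwords but not the full phrase.
docs = {
    1: "friends romans countrymen",
    2: "romans countrymen friends",
    3: "friends romans and romans countrymen",
}
phrase = "friends romans countrymen"
biword_hits = biword_phrase_query(build_biword_index(docs), phrase)
positional_hits = positional_phrase_query(build_positional_index(docs), phrase)
```

On this toy data the biword query returns documents 1 and 3 while the positional query returns only document 1, so document 3 is the kind of false positive that can still get through on longer phrases.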

IIR Chapter 3

We begin with the venerable binary tree and B-tree.

B-trees are nice because they help when the dictionary is disk-based.

Then a discussion of wildcard queries begins

Permuterm indexes.

I still don't understand what's going on in these.
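
A small sketch that helped me: append an end marker `$` to each term, index every rotation of the result, and then rotate a wildcard query so the `*` lands at the end, turning it into a prefix lookup. For a query X*Y the lookup key is Y$X. The function names are made up, the query is assumed to contain exactly one `*`, and a linear scan over a dict stands in for the B-tree prefix search a real system would use.

```python
def permuterm_rotations(term):
    """All rotations of term + '$', e.g. 'man' -> man$, an$m, n$ma, $man."""
    augmented = term + "$"
    return [augmented[i:] + augmented[:i] for i in range(len(augmented))]

def build_permuterm_index(vocabulary):
    """Map every rotation back to the original term."""
    return {rotation: term
            for term in vocabulary
            for rotation in permuterm_rotations(term)}

def wildcard_lookup(index, query):
    """Answer a query with exactly one '*' via a prefix match on rotations."""
    before, after = query.split("*")
    prefix = after + "$" + before  # X*Y is rotated to Y$X*
    return sorted({term for rotation, term in index.items()
                   if rotation.startswith(prefix)})

# Toy vocabulary, invented for this sketch.
vocabulary = ["man", "moron", "moon", "mean", "hello"]
permuterm = build_permuterm_index(vocabulary)
```

For example, `wildcard_lookup(permuterm, "m*n")` searches for rotations starting with `n$m`, which match every term that begins with m and ends with n.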

k-gram indexes
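
A sketch of how a k-gram index can answer wildcard queries, assuming k = 3 and `$` as a boundary marker. The query is broken into the k-grams it must contain, their postings are intersected, and the candidates are post-filtered against the original pattern because the k-gram match alone can let wrong terms through. Function names are hypothetical, and `fnmatch.fnmatchcase` from the standard library does the post-filtering.

```python
import fnmatch
from collections import defaultdict

def kgrams(term, k=3):
    """k-grams of the term padded with '$' boundary markers."""
    padded = "$" + term + "$"
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}

def build_kgram_index(vocabulary, k=3):
    """Map each k-gram to the set of vocabulary terms containing it."""
    index = defaultdict(set)
    for term in vocabulary:
        for gram in kgrams(term, k):
            index[gram].add(term)
    return index

def wildcard_candidates(index, query, k=3):
    """Terms containing every k-gram derivable from the wildcard query."""
    pieces = ("$" + query + "$").split("*")
    grams = {piece[i:i + k]
             for piece in pieces
             for i in range(len(piece) - k + 1)}
    postings = [index.get(gram, set()) for gram in grams]
    # Simplification: a query too short to yield any k-gram returns nothing.
    return set.intersection(*postings) if postings else set()

def wildcard_lookup(index, query, k=3):
    """Post-filter the candidates against the actual wildcard pattern."""
    return sorted(term for term in wildcard_candidates(index, query, k)
                  if fnmatch.fnmatchcase(term, query))

# Toy vocabulary, invented for this sketch.
kgram_index = build_kgram_index(["red", "reduce", "retired", "bread"])
```

With the query `red*`, the k-grams `$re` and `red` both occur in "retired", so it shows up as a candidate and only the post-filter removes it.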

Spelling correction

Choose the "nearest one"
Choose the most likely if two words are equally close

Forms of spelling correction

isolated-term: correct a single query term at a time
context-sensitive: use the surrounding query words to pick the most likely candidate

"flew form Heathrow" becomes "flew FROM Heathrow"

Edit distance uses a dynamic-programming matrix to determine how far one word is from another

"cats" has a Levenshtein distance of 3 from "fast"
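
A minimal sketch of the matrix computation, assuming unit cost for each insertion, deletion, and substitution. The name `levenshtein` is mine; cell `matrix[i][j]` holds the distance between the first i characters of one word and the first j characters of the other.

```python
def levenshtein(source, target):
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    rows, cols = len(source) + 1, len(target) + 1
    matrix = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        matrix[i][0] = i  # delete all i characters of source
    for j in range(cols):
        matrix[0][j] = j  # insert all j characters of target
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = 0 if source[i - 1] == target[j - 1] else 1
            matrix[i][j] = min(
                matrix[i - 1][j - 1] + substitution,  # match or substitute
                matrix[i - 1][j] + 1,                 # delete from source
                matrix[i][j - 1] + 1,                 # insert into source
            )
    return matrix[-1][-1]
```

`levenshtein("cats", "fast")` comes out to 3, and "form" versus "from" comes out to 2 since plain Levenshtein counts the swapped letters as two substitutions.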

k-gram indexes for spelling correction

how many k-grams does it have in common
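
One way to turn "how many k-grams in common" into a score is the Jaccard coefficient: shared k-grams divided by all distinct k-grams across both words. The sketch below assumes bigrams without boundary markers; `bigrams` and `jaccard` are made-up names.

```python
def bigrams(term):
    """Set of adjacent character pairs in the term (no boundary padding)."""
    return {term[i:i + 2] for i in range(len(term) - 1)}

def jaccard(first, second):
    """Shared bigrams divided by all distinct bigrams of the two terms."""
    a, b = bigrams(first), bigrams(second)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

For the misspelling "bord" against "board", the shared bigrams are "bo" and "rd" out of five distinct bigrams, giving 0.4; a corrector could keep only candidates above some threshold and then rank them by edit distance.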

Phonetic correction (the most interesting in my mind)

make a soundex index
I always wondered if something like this existed
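
A sketch of one simple formulation of Soundex: keep the first letter, map the remaining letters to digits by sound class, collapse runs of the same digit, drop the zeros (vowels and similar letters), and pad to four characters. Other Soundex variants differ in details such as how H and W are handled, so treat this as illustrative rather than a reference implementation. The name `soundex` and the table layout are my own.

```python
# Letter classes and their digit codes; "0" marks letters that get dropped.
SOUND_CLASSES = [
    ("AEIOUHWY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
    ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6"),
]
CODES = {letter: digit for letters, digit in SOUND_CLASSES for letter in letters}

def soundex(term):
    """Four-character phonetic code: first letter plus three digits."""
    letters = [ch for ch in term.upper() if ch in CODES]
    if not letters:
        return ""
    digits = [CODES[ch] for ch in letters[1:]]
    collapsed = []
    for digit in digits:
        # Collapse each run of identical consecutive digits to one digit.
        if not collapsed or collapsed[-1] != digit:
            collapsed.append(digit)
    body = "".join(digit for digit in collapsed if digit != "0")
    return (letters[0] + body + "000")[:4]
```

Names that sound alike land on the same code, for example "Hermann" and "Herman" both map to H655, so a soundex index can map each code to the terms that share it.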
