Friday, January 10, 2014

Reading Notes - Unit 2

IIR Section1-2:


  • We looked at building inverted indexes, mostly an introduction.  
  • Tokens and normalized tokens are roughly equivalent to words
  • The postings lists in an inverted index look sort of like linked lists (?)
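
To make the inverted index idea concrete for myself, here is a minimal sketch in Python. The three toy documents, the lowercase-and-split tokenizer, and the function name `build_inverted_index` are my own assumptions for illustration, not anything taken from IIR. Each term maps to a sorted list of the document IDs it appears in.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each normalized token to a sorted list of document IDs."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumed normalization: lowercase, then split on whitespace.
        for token in text.lower().split():
            postings[token].add(doc_id)
    # Sorted postings lists make later intersections a simple merge.
    return {term: sorted(ids) for term, ids in postings.items()}

# Toy collection (hypothetical documents).
docs = {
    1: "Friends Romans countrymen",
    2: "Romans lend me your ears",
    3: "friends lend a hand",
}
index = build_inverted_index(docs)
print(index["romans"])
print(index["lend"])
```

A real system would use a better tokenizer and might store each postings list as a linked list or a compact array, but the term-to-postings mapping is the core of it.
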
IIR Chapter 2:

  • First we read about document types and what constitutes a document
    • You don't want to consider a collection of books a document
    • Conversely you don't want to consider a paragraph one either
    • Try to find a happy medium, or tailor it to your need
  • Interesting look at the challenges of languages
  • Token is an instance of a sequence of characters
  • Type is the class of all tokens containing the same character sequence
  • Term is a (possibly normalized) type that is included in the IR system's dictionary
    • You might not include stop words like "the"
  • Looked at how to deal with names such as O'Neill
  • Normalization is the next step
  • Followed possibly by stemming and lemmatization.
    • It strikes me as interesting that stemming and lemmatization are still problematic and don't always solve the problem
  • A discussion of skip pointers follows
    • I'm still a little hazy on this.
  • Biword indexes, with their ability to weed out most false positives, seem to have a lot of upsides with not so many downsides when it comes to search...
    • "friends romans", "romans countrymen"
  • Then positional schemes are introduced, followed by the combination of the biword and positional to find "Michael Jackson"
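
Since skip pointers were the hazy part for me, here is a rough sketch of intersecting two sorted postings lists with them. It assumes the postings are plain Python lists of distinct, sorted document IDs and that skip pointers sit at evenly spaced positions roughly sqrt(length) apart. The function name `intersect_with_skips` and the sample lists are my own for illustration.

```python
import math

def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists using evenly spaced skip pointers."""
    # Assumed heuristic: one skip pointer about every sqrt(len) entries.
    skip1 = max(1, int(math.sqrt(len(p1))))
    skip2 = max(1, int(math.sqrt(len(p2))))
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Follow the skip only if it exists here and does not overshoot p2[j].
            if i % skip1 == 0 and i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1
            else:
                i += 1
        else:
            if j % skip2 == 0 and j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return answer

print(intersect_with_skips([2, 4, 8, 16, 19, 23, 28, 43],
                           [1, 2, 3, 5, 8, 41, 51, 60, 71]))
```

The point is that a skip is only taken when the entry it lands on is still no larger than the current entry in the other list, so no match can be jumped over. The result is the same as a plain merge, just with fewer comparisons on long lists.
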

IIR Chapter 3:

  • We begin with the venerable binary tree and B-tree.
    • B-trees are nice because they can help when the dictionary is disk-based.
  • Then a discussion of wildcard queries begins
    • Permuterm indexes.
      • I still don't understand what's going on in these.
    • k-gram indexes
  • Spelling correction
    • Choose the "nearest one"
    • Choose the most likely if two words are equally close
  • Forms of spelling correction
    • isolated-term: correct a single query term at a time
    • context-sensitive: use the surrounding terms to pick the most likely candidate
      • "flew form Heathrow" becomes "flew FROM Heathrow"
    • Edit distance uses a dynamic-programming matrix to determine how far one word is from another
      • "cats" has a Levenshtein distance of 3 from "fast"
    • k-gram indexes for spelling correction
      • how many k-grams does a candidate have in common with the query term
    • Phonetic correction (the most interesting in my mind)
      • make a soundex index
      • I always wondered if something like this existed
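
Two of the ideas above are easier for me to follow in code, so here is a small sketch of both. `levenshtein` fills in the edit-distance matrix, and `permuterm_rotations` with `wildcard_lookup` shows what a permuterm index is doing: store every rotation of term + "$", then rotate a wildcard query so the * lands at the end and do a prefix match. The function names and the tiny vocabulary are my own assumptions. A real permuterm index would keep the rotations in a B-tree rather than scanning the vocabulary like this, and the lookup here only handles queries with exactly one *.

```python
def levenshtein(a, b):
    """Edit distance via the standard dynamic-programming matrix."""
    # dist[i][j] = number of edits needed to turn a[:i] into b[:j].
    dist = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dist[i][0] = i  # delete all i characters
    for j in range(len(b) + 1):
        dist[0][j] = j  # insert all j characters
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution or match
    return dist[len(a)][len(b)]

def permuterm_rotations(term):
    """All rotations of term + '$', the keys a permuterm index would store."""
    marked = term + "$"
    return [marked[i:] + marked[:i] for i in range(len(marked))]

def wildcard_lookup(query, vocabulary):
    """Answer a query containing exactly one '*' (assumed) by prefix match."""
    prefix, suffix = query.split("*")
    key = suffix + "$" + prefix  # "m*n" is rotated to "n$m*", so key is "n$m"
    return sorted(term for term in vocabulary
                  if any(rot.startswith(key) for rot in permuterm_rotations(term)))

vocabulary = ["man", "map", "moon", "mean"]  # hypothetical dictionary
print(levenshtein("cats", "fast"))
print(permuterm_rotations("man"))
print(wildcard_lookup("m*n", vocabulary))
```
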
