Friday, January 31, 2014

Reading Notes - Unit 4

IIR Chap 1.3 - 1.4

These sections talk about Boolean search queries in an inverted index.  Section 1.3 talks about the best method for finding the answer: you take the posting list of each term and compute the intersection for AND, the union for OR.  It also goes on to talk about dealing with nested Boolean operators.  Section 1.4 compares Boolean search with ranked retrieval, for example Westlaw in the early '90s versus Google today.
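The intersection step can be sketched as a single linear pass over two sorted postings lists; this is my own toy version (the docIDs are invented):

```python
# Intersecting sorted postings lists for an AND query, as in IIR 1.3:
# walk both lists in step and keep the docIDs they share.

def intersect(p1, p2):
    """Intersect two sorted postings lists (AND query)."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

def union(p1, p2):
    """Union of two sorted postings lists (OR query)."""
    return sorted(set(p1) | set(p2))

brutus = [1, 2, 4, 11, 31, 45]   # invented postings
caesar = [2, 4, 8, 31]
print(intersect(brutus, caesar))  # [2, 4, 31]
```

The merge only works because postings are kept sorted by docID, which is why the index stores them that way.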

IIR Chapter 6


  • Chapter 6 is about scoring and weighting documents, as opposed to the Boolean approach we've been using thus far
    • Parametric & Zone indexes
      • We use a parametric index for fields whose values come from a fixed set, such as date of creation
      • Zone indexes cover arbitrary free text such as the title and abstract; unlike fields, zone contents are not predefined values
      • We use weighted zone scoring, which gives each (document, query) pair a score on the interval [0,1]
      • Scoring is "learned"
        • Machine learning
    • Weighting term importance
      • Term frequency
        • The more often a term is used in a document, the more likely the document is about it.
        • However, we can use inverse document frequency to discount terms like stop words, which occur a lot but are not important
          • Basically, it rewards a naturally rare word when it occurs unusually often in a particular document
    • Vector Space Scoring
      • Treat the document as a vector containing each of its dictionary terms
      • Documents can be assessed against queries using the dot product
        • Both the document and the query are treated as vectors
    • Term Weighting for the vector space model
      • Words that occur 20x more often shouldn't be 20x more important
      • tf Normalization
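The vector space ideas above can be put together in a few lines. This is my own toy sketch (the three "documents" are invented), combining sublinear tf scaling and idf, then scoring by normalized dot product (cosine similarity):

```python
# Toy tf-idf + vector space scoring sketch, following IIR Chapter 6 ideas.
import math
from collections import Counter

docs = {  # invented mini-collection
    "d1": "the cat sat on the mat",
    "d2": "the dog chased the cat",
    "d3": "dogs and cats make good pets",
}

N = len(docs)
tokenized = {d: text.split() for d, text in docs.items()}
# document frequency: in how many docs does each term appear?
df = Counter(t for toks in tokenized.values() for t in set(toks))

def tfidf_vector(tokens):
    tf = Counter(tokens)
    # sublinear tf scaling (1 + log tf) times idf log(N / df);
    # terms unseen in the collection contribute nothing
    return {t: (1 + math.log(tf[t])) * math.log(N / df[t])
            for t in tf if df.get(t)}

def cosine(v1, v2):
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

query = tfidf_vector("cat mat".split())
scores = {d: cosine(query, tfidf_vector(toks)) for d, toks in tokenized.items()}
print(max(scores, key=scores.get))  # d1: it contains both "cat" and "mat"
```

Note how "the" ends up with a low idf because it appears in most documents, which is exactly the stop-word discounting described above.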

Tuesday, January 28, 2014

Muddiest Point - Unit 3

With ɣ codes, I understand how they work with the unary code + the offset.  I also understand that they clearly help with compression.  It's the jump between the two that I just don't understand.

Thursday, January 23, 2014

Reading Notes - Unit 3

IIR Chapter 4


  • Chapter 4 is about Index construction
    • A brief overview of the general speed of the different operations that affect us
      • seek time
      • transfer time
      • processor's clock rate
      • low-level operations
    • Because of hardware constraints, external sorting algorithms are used
      • Blocked sort-based indexing
        • It works in blocks to cut down on disk seeks
        • Blocks are sorted individually then merged into one index
      • Single-pass in-memory indexing
        • Still unsure how it differs from blocked sort-based indexing (it seems SPIMI builds each block's dictionary on the fly rather than keeping a global term-to-termID mapping)
      • Distributed indexing
        • This is closest to real world, "Google-style" indexing
        • MapReduce
          • Controlled by master node
          • Uses key-value pairs to break a big problem into small ones
          • 2 phases
            • Map phase
              • maps input data to key-value pairs
            • These pairs then get written to segment files
            • Reduce phase
              • The inverters finish the job
          • This is a safe method, because the master node will coordinate and if necessary reassign jobs from nodes that are taking too long or have failed to others
      • Dynamic indexing
        • Used for constantly changing collections of documents
    • This chapter doesn't cover ranked-retrieval systems.
    • Briefly mentions access control lists
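A toy sketch of the MapReduce flow above, with invented documents: the map phase emits (term, docID) pairs, grouping plays the role of the segment files, and the reduce phase (the inverters) produces the postings lists.

```python
# Toy single-machine imitation of MapReduce-style index construction.
from collections import defaultdict

docs = {1: "caesar died", 2: "brutus killed caesar"}  # invented

# Map phase: parse each document into (term, docID) pairs
pairs = [(term, doc_id) for doc_id, text in docs.items()
         for term in text.split()]

# Shuffle/partition: group pairs by term (what segment files achieve at scale)
grouped = defaultdict(list)
for term, doc_id in pairs:
    grouped[term].append(doc_id)

# Reduce phase: the "inverters" sort each group into a postings list
index = {term: sorted(set(ids)) for term, ids in grouped.items()}
print(index["caesar"])  # [1, 2]
```

In a real distributed setting each phase runs on many machines, with the master node assigning (and, on failure, reassigning) map and reduce tasks.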

IIR Chapter 5

  • Compression
    • Does the obvious and saves us space, up to 75%
    • Also helps with caching
      • modern systems are so fast that a compressed file can be read and decompressed faster than a larger uncompressed file can be read
  • The author goes on to discuss statistical properties of terms
    • Heaps' law
      • Estimates the number of distinct terms (the vocabulary size) of a collection
    • Zipf's Law
      • How term frequencies are distributed: the i-th most frequent term occurs with frequency roughly proportional to 1/i
  • Dictionary Compression
    • Dictionary as a string
      • treat the dictionary as one long string
    • Blocked Storage
      • group the terms of the dictionary string into blocks, storing each term's length in a byte so that only one pointer per block is needed
    • Front Coding
      • use common beginnings of words with blocked storage to decrease size
  • Postings File Compression
    • bytewise compression
    • bitwise compression
    • ɣ encoding
      • uses variable-length codes
      • This caused quite a bit of confusion for me.

Tuesday, January 14, 2014

Muddiest Point - Unit 2

When it comes to permuterm and k-gram indexes for wildcard queries, when would you use one or the other? Permuterm seems like it may return a smaller, more targeted group of words, whereas k-grams could return a huge result depending on the word.
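Sketching what each scheme actually indexes (my own toy code, invented term): permuterm stores every rotation of term$, so a query like hel*o can be rotated until the wildcard sits at the end, while a k-gram index stores character k-grams and must post-filter its candidates, which is where the huge result sets can come from.

```python
# What the two wildcard indexes store for a single term.

def permuterm_rotations(term):
    """All rotations of term + '$' — each becomes a permuterm index key."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

def kgrams(term, k=2):
    """Character k-grams of '$' + term + '$' — the k-gram index keys."""
    t = "$" + term + "$"
    return [t[i:i + k] for i in range(len(t) - k + 1)]

print(permuterm_rotations("hello"))
# ['hello$', 'ello$h', 'llo$he', 'lo$hel', 'o$hell', '$hello']
print(kgrams("hello"))
# ['$h', 'he', 'el', 'll', 'lo', 'o$']
```

For hel*o, the permuterm lookup is a single prefix search on o$hel*, whereas the 2-gram index intersects the postings of $h, he, el, and o$ and then still has to filter out false matches like "help also".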

Friday, January 10, 2014

Reading Notes - Unit 2

IIR Sections 1.1 - 1.2:


  • We looked at building inverted indexes, mostly an introduction.  
  • Tokens and normalized tokens roughly equal words
  • Inverted indexes look sort of like linked lists (?)
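A minimal sketch of what such an index looks like (documents invented): a dictionary mapping each normalized token to a sorted postings list of docIDs, which is the "linked list" part.

```python
# Toy inverted index construction over a three-document collection.
from collections import defaultdict

docs = {1: "new home sales top forecasts",   # invented
        2: "home sales rise in july",
        3: "increase in home sales in july"}

index = defaultdict(list)
for doc_id in sorted(docs):                  # visit docs in docID order
    for token in set(docs[doc_id].lower().split()):
        index[token].append(doc_id)          # postings stay sorted

print(index["home"])   # [1, 2, 3]
print(index["july"])   # [2, 3]
```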
IIR Chapter 2:

  • First we read about document types and what constitutes a document
    • You don't want to consider a collection of books a document
    • Conversely you don't want to consider a paragraph one either
    • Try to find a happy medium, or tailor it to your need
  • Interesting look at the challenges of languages
  • Token is an instance of a sequence of characters
  • Type is the class of all tokens containing the same character sequence
  • Term is a (possibly normalized) type that is included in the IR system's dictionary
    • You might not include stop words like "the"
  • Looked at how to deal with names such as O'Neill
  • Normalization is the next step
  • Followed possibly by stemming and lemmatization.
    • It strikes me as interesting that stemming and lemmatization are still problematic and don't always solve the problem
  • A discussion of skip pointers follows
    • I'm still a little hazy on this.
  • Biword indexes, with their ability to weed out most false positives, seem to have a lot of upsides and not many downsides when it comes to search...
    • "friends romans", "romans countrymen"
  • Then positional schemes are introduced, followed by the combination of the biword and positional to find "Michael Jackson"
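A toy sketch of the positional approach for a phrase query like "michael jackson" (documents invented): intersect the two terms' postings, then check that their positions are adjacent.

```python
# Positional index plus a two-word phrase query.
from collections import defaultdict

docs = {1: "michael jackson thriller",           # invented
        2: "jackson five with michael",
        3: "michael jordan and michael jackson"}

# positional index: term -> {docID: [positions]}
pos_index = defaultdict(lambda: defaultdict(list))
for doc_id, text in docs.items():
    for pos, token in enumerate(text.split()):
        pos_index[token][doc_id].append(pos)

def phrase_query(t1, t2):
    hits = []
    p1, p2 = pos_index[t1], pos_index[t2]
    for doc_id in sorted(set(p1) & set(p2)):
        # adjacent positions mean the two words form the phrase
        if any(pos + 1 in p2[doc_id] for pos in p1[doc_id]):
            hits.append(doc_id)
    return hits

print(phrase_query("michael", "jackson"))  # [1, 3]
```

Document 2 contains both words but not adjacently, so the position check correctly filters it out; that is exactly the false positive a plain AND query would return.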

IIR Chapter 3

  • We begin with the venerable binary tree and B-tree.
    • B-trees are nice because they help when the dictionary is disk-based.
  • Then a discussion of wildcard queries begins
    • Permuterm indexes.
      • I still don't understand what's going on in these.
    • k-gram indexes
  • Spelling correction
    • Choose the "nearest one"
    • Choose the most likely if two words are equally close
  • Forms of spelling correction
    • isolated-term: correct a single query term at a time
    • context-sensitive: use the surrounding words to pick the most likely candidate
      • "flew form Heathrow" becomes "flew FROM Heathrow"
    • Edit distance uses matrices to determine how far one word is from another
      • "cats" has a Levenshtein distance of 3 from "fast"
    • k-gram indexes for spelling correction
      • how many k-grams does it have in common
    • Phonetic correction (the most interesting in my mind)
      • make a soundex index
      • I always wondered if something like this existed
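The edit-distance matrix can be sketched as the standard Levenshtein dynamic program (my own sketch, not the book's code): cell [i][j] holds the distance between the first i characters of one word and the first j characters of the other.

```python
# Levenshtein edit distance via dynamic programming.

def levenshtein(s1, s2):
    m, n = len(s1), len(s2)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i            # i deletions to reach the empty string
    for j in range(n + 1):
        dist[0][j] = j            # j insertions from the empty string
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # delete
                             dist[i][j - 1] + 1,        # insert
                             dist[i - 1][j - 1] + cost) # substitute / match
    return dist[m][n]

print(levenshtein("cats", "fast"))  # 3
```

The "cats"/"fast" example from the notes comes out to 3: substitute c→f, t→s, and s→t, with the shared "a" costing nothing.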

Tuesday, January 7, 2014

Muddiest Notes - Unit 1

Week 1 didn't have anything that was too confusing.

Reading Notes - Unit 1

Finding Out About Section 1.1 Notes:


  • This chapter was a nice start to the course.  It provided a larger overview of what we're doing here.  Having just come from Human Information Processing with Dr. Hirtle, this was a nice bridge.  
  • The FOA conversation loop was a perfect model for what goes on in a conversation between two people.
  • They go on to show the model of a search engine, and it really showed how far we still have to go.

Information Retrieval  Section 1.1 - 1.2 Notes:

  • Section 1.1 is a basic introduction to the topic, talking mostly about search engines, but it does briefly touch on local search.
  • Figure 1.1 is a nice model of what a search engine is doing.
  • They then go on to talk about system efficiency and effectiveness 


Modern Information Retrieval Sections 1.1 - 1.4 Notes:


  • Again, a basic introduction to the topic, with a great discussion of ancient libraries 
  • Sections 1.2-1.3 are a bit more CS-heavy than the other readings but provide a concrete, high-level model that I feel is the best of the three readings.
  • Section 1.3 introduces the idea of Ranking as the current best way to search
  • Section 1.4 has a brief history of the web and web search, moving on to the modern era.