Thursday, January 23, 2014

Reading Notes - Unit 3

IIR Chapter 4


  • Chapter 4 is about Index construction
    • A brief overview of the the general speed of different operations that affect us
      • seek time
      • transfer time
      • processor's clock rate
      • low-level operations
    • Because of hardware constraints external sorting algorithms are used
      • Block sort-based indexing
        • It does this in blocks to cut down on disk seek
        • Blocks are sorted individually then merged into one index
      • Single-pass in-memory indexing
        • Still unsure about how it is different from Block sort-based indexing
      • Distributed indexing
        • This is closest to real world, "Google-style" indexing
        • MapReduce
          • Controlled by master node
          • Uses key-value pairs to break a big problem into a small one
          • 2 phases
            • Map phase
              • maps input data to key-value pairs
            • This then goes gets parsed to a segment file
            • Reduce phase
              • The inverters finish the job
          • This is a safe method, because the master node will coordinate and if necessary reassign jobs from nodes that are taking too long or have failed to others
      • Dynamic indexing
        • Used for constantly changing collections of documents
    • This doesn't include ranked-retreival systems.
    • Briefly mentions access control lists

IIR Chapter 5

  • Compression
    • Does the obvious and saves us space, up to 75%
    • Also helps with caching
      • modern systems are so fast, a compressed file can be read and decompressed faster than a large un compressed file
  • The author goes onto a discussion of statistics properties of terms
    • Heaps' law
      • Estimates the number of terms of a collection
    • Zipf's Law
      • How terms are distributed across documents
  • Dictionary Compression
    • Dictionary as a string
      • treat the dictionary as one long string
    • Blocked Storage
      • treat the dictionary as a string merged with a byte encoding its length
    • Front Coding
      • use common beginnings of words with blocked storage to decrease size
  • Postings File Compression
    • bytewise compression
    • bitwise compression
    • ɣ encoding
      • uses variable length codes
      • This cause quite a bit of confusion for me.

No comments:

Post a Comment