- Chapter 4 is about index construction
- A brief overview of the general speed of different operations that affect us
- seek time
- transfer time
- processor's clock rate
- low-level operations
- Because of hardware constraints, external sorting algorithms are used
- Blocked sort-based indexing (BSBI)
- It works in blocks to cut down on disk seeks
- Blocks are sorted individually then merged into one index
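The sort-then-merge idea can be sketched in a few lines. This is a hypothetical toy helper, not IIR's pseudocode: real BSBI writes each sorted block to disk and streams the merge, while here the "blocks" are just in-memory lists.

```python
import heapq
from itertools import groupby

def bsbi_index(docs, block_size=4):
    """Toy blocked sort-based indexing: accumulate (term, doc_id) pairs,
    sort them block by block, then merge all blocks into one index."""
    blocks, pairs = [], []
    for doc_id, text in docs:
        for term in text.split():
            pairs.append((term, doc_id))
        if len(pairs) >= block_size:          # block buffer "full"
            blocks.append(sorted(set(pairs))) # in real BSBI: written to disk
            pairs = []
    if pairs:
        blocks.append(sorted(set(pairs)))
    # Merge phase: one multi-way merge over all sorted blocks.
    index = {}
    for term, group in groupby(heapq.merge(*blocks), key=lambda p: p[0]):
        index[term] = sorted({doc_id for _, doc_id in group})
    return index
```

`heapq.merge` does the same k-way merge a disk-based implementation would, just over in-memory iterables.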
- Single-pass in-memory indexing (SPIMI)
- Differs from BSBI in that it builds a separate dictionary for each block and appends doc IDs to growing postings lists directly, so no global term-to-termID mapping has to fit in memory
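A sketch of the per-block inversion step may make the contrast with BSBI clearer. This is a hypothetical helper under my own naming, not IIR's exact pseudocode:

```python
def spimi_invert(token_stream, blocks):
    """Toy SPIMI: build a fresh dictionary for this block and grow
    postings lists in place; terms are sorted only once, at write-out."""
    dictionary = {}
    for term, doc_id in token_stream:
        postings = dictionary.setdefault(term, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
    # In real SPIMI this sorted block is written to disk, then all
    # blocks are merged as in BSBI.
    blocks.append(sorted(dictionary.items()))
```

The key point: there is no sort of the token stream itself and no termID table, just hashing terms straight into a block-local dictionary.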
- Distributed indexing
- This is closest to real world, "Google-style" indexing
- MapReduce
- Controlled by master node
- Uses key-value pairs to break a big problem into smaller ones
- 2 phases
- Map phase
- maps input data to key-value pairs
- The parser then writes these pairs into segment files
- Reduce phase
- The inverters finish the job
- This is a fault-tolerant method: the master node coordinates the jobs and, if necessary, reassigns work from nodes that have failed or are taking too long
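The two phases can be sketched as plain functions. These are hypothetical helpers (the names `map_phase`/`reduce_phase` are mine), standing in for what the parsers and inverters would do on separate machines:

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Parser: turn one document split into (term, doc_id) key-value pairs."""
    return [(term, doc_id) for term in text.split()]

def reduce_phase(pairs):
    """Inverter: collect all values for each key into a postings list."""
    postings = defaultdict(list)
    for term, doc_id in sorted(pairs):
        if not postings[term] or postings[term][-1] != doc_id:
            postings[term].append(doc_id)
    return dict(postings)
```

In a real MapReduce run, the master would hand different splits to different parsers and different key ranges to different inverters; here everything runs in one process.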
- Dynamic indexing
- Used for constantly changing collections of documents
- This doesn't cover ranked-retrieval systems.
- Briefly mentions access control lists
IIR Chapter 5
- Compression
- Does the obvious and saves us space, up to 75%
- Also helps with caching
- modern systems are so fast that a compressed file can be read and decompressed faster than a larger uncompressed file can be read
- The author goes on to a discussion of the statistical properties of terms
- Heaps' law
- Estimates the number of distinct terms (vocabulary size) of a collection
- Zipf's Law
- Models how term frequencies are distributed: the i-th most frequent term occurs with frequency proportional to 1/i
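Both laws are one-line formulas, so a small sketch is easy. The constants k = 44 and b = 0.49 are the RCV1 fits quoted in IIR; for other collections they would differ:

```python
def heaps_vocab(T, k=44, b=0.49):
    """Heaps' law: M = k * T**b estimates vocabulary size M
    from the number of tokens T (k, b are IIR's RCV1 fits)."""
    return k * T ** b

def zipf_cf(i, c):
    """Zipf's law: collection frequency of the i-th most
    frequent term is proportional to 1/i (c is the constant)."""
    return c / i
```

For a million tokens with the RCV1 constants, Heaps' law predicts a vocabulary of roughly 38,000 terms.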
- Dictionary Compression
- Dictionary as a string
- treat the dictionary as one long string
- Blocked Storage
- group the terms of the dictionary string into blocks, prefixing each term with a byte encoding its length, so only one pointer per block is needed
- Front Coding
- use common beginnings of words with blocked storage to decrease size
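A sketch of front coding within one block of sorted terms (a hypothetical encoding of my own, just to show the idea; the real on-disk format also stores length bytes and block pointers):

```python
def front_code(terms):
    """Front-code a sorted block: store the first term in full, then for
    each later term the shared-prefix length plus the differing suffix."""
    coded = [terms[0]]
    for prev, cur in zip(terms, terms[1:]):
        p = 0
        while p < min(len(prev), len(cur)) and prev[p] == cur[p]:
            p += 1
        coded.append((p, cur[p:]))  # (prefix length, remaining suffix)
    return coded
```

Because the dictionary is sorted, consecutive terms share long prefixes, which is exactly what makes this pay off.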
- Postings File Compression
- bytewise compression
- bitwise compression
- γ (gamma) encoding
- uses variable length codes
- This caused quite a bit of confusion for me.
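Seeing the two schemes side by side as code helped me; this is a minimal sketch of both, assuming gaps ≥ 1 (gamma is shown as a bit string rather than packed bits):

```python
def gamma_encode(g):
    """Bitwise gamma code: offset is g's binary with the leading 1
    stripped; length is the unary code for len(offset)."""
    assert g >= 1
    offset = bin(g)[3:]                 # drop '0b1'
    length = "1" * len(offset) + "0"    # unary: n ones, then a zero
    return length + offset              # e.g. 13 -> '1110' + '101'

def vb_encode(n):
    """Bytewise variable-byte code: 7 payload bits per byte, with the
    high bit set only on the final byte of a number."""
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128                      # continuation bit marks the end
    return bytes(out)
```

Gamma trades more CPU work for better compression; variable-byte codes decode faster because they stay byte-aligned, which is the trade-off IIR discusses.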