- Chapter 4 is about index construction
- A brief overview of the general speed of different operations that affect us
- seek time
- transfer time
- processor's clock rate
- low-level operations
- Because of hardware constraints, external sorting algorithms are used
- Blocked sort-based indexing (BSBI)
- It works in blocks to cut down on disk seeks
- Blocks are sorted individually then merged into one index
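The sort-then-merge idea can be sketched in a few lines. This is a hypothetical toy helper, not IIR's pseudocode: real BSBI writes each sorted block to disk and streams the merge, while here the "blocks" are just in-memory lists.

```python
import heapq
from itertools import groupby

def bsbi_index(docs, block_size=4):
    """Toy blocked sort-based indexing: accumulate (term, doc_id) pairs,
    sort them block by block, then merge all blocks into one index."""
    blocks, pairs = [], []
    for doc_id, text in docs:
        for term in text.split():
            pairs.append((term, doc_id))
        if len(pairs) >= block_size:          # block buffer "full"
            blocks.append(sorted(set(pairs))) # in real BSBI: written to disk
            pairs = []
    if pairs:
        blocks.append(sorted(set(pairs)))
    # Merge phase: one multi-way merge over all sorted blocks.
    index = {}
    for term, group in groupby(heapq.merge(*blocks), key=lambda p: p[0]):
        index[term] = sorted({doc_id for _, doc_id in group})
    return index
```

`heapq.merge` does the same k-way merge a disk-based implementation would, just over in-memory iterables.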
- Single-pass in-memory indexing (SPIMI)
- Differs from BSBI in that it builds a separate dictionary for each block and appends doc IDs to growing postings lists directly, so no global term-to-termID mapping has to fit in memory
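A sketch of the per-block inversion step may make the contrast with BSBI clearer. This is a hypothetical helper under my own naming, not IIR's exact pseudocode:

```python
def spimi_invert(token_stream, blocks):
    """Toy SPIMI: build a fresh dictionary for this block and grow
    postings lists in place; terms are sorted only once, at write-out."""
    dictionary = {}
    for term, doc_id in token_stream:
        postings = dictionary.setdefault(term, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
    # In real SPIMI this sorted block is written to disk, then all
    # blocks are merged as in BSBI.
    blocks.append(sorted(dictionary.items()))
```

The key point: there is no sort of the token stream itself and no termID table, just hashing terms straight into a block-local dictionary.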
- Distributed indexing
- This is closest to real world, "Google-style" indexing
- MapReduce
- Controlled by master node
- Uses key-value pairs to break a big problem into smaller ones
- 2 phases
- Map phase
- maps input data to key-value pairs
- The parser then writes these pairs into segment files
- Reduce phase
- The inverters finish the job
- This is a fault-tolerant method: the master node coordinates the jobs and, if necessary, reassigns work from nodes that have failed or are taking too long
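The two phases can be sketched as plain functions. These are hypothetical helpers (the names `map_phase`/`reduce_phase` are mine), standing in for what the parsers and inverters would do on separate machines:

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Parser: turn one document split into (term, doc_id) key-value pairs."""
    return [(term, doc_id) for term in text.split()]

def reduce_phase(pairs):
    """Inverter: collect all values for each key into a postings list."""
    postings = defaultdict(list)
    for term, doc_id in sorted(pairs):
        if not postings[term] or postings[term][-1] != doc_id:
            postings[term].append(doc_id)
    return dict(postings)
```

In a real MapReduce run, the master would hand different splits to different parsers and different key ranges to different inverters; here everything runs in one process.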
- Dynamic indexing
- Used for constantly changing collections of documents
- This doesn't cover ranked-retrieval systems.
- Briefly mentions access control lists
IIR Chapter 5
- Compression
- Does the obvious and saves us space, up to 75%
- Also helps with caching
- modern systems are so fast that a compressed file can be read and decompressed faster than a larger uncompressed file can be read
- The author goes on to a discussion of the statistical properties of terms
- Heaps' law
- Estimates the number of distinct terms (vocabulary size) of a collection
- Zipf's Law
- Models how term frequencies are distributed: the i-th most frequent term occurs with frequency proportional to 1/i
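Both laws are one-line formulas, so a small sketch is easy. The constants k = 44 and b = 0.49 are the RCV1 fits quoted in IIR; for other collections they would differ:

```python
def heaps_vocab(T, k=44, b=0.49):
    """Heaps' law: M = k * T**b estimates vocabulary size M
    from the number of tokens T (k, b are IIR's RCV1 fits)."""
    return k * T ** b

def zipf_cf(i, c):
    """Zipf's law: collection frequency of the i-th most
    frequent term is proportional to 1/i (c is the constant)."""
    return c / i
```

For a million tokens with the RCV1 constants, Heaps' law predicts a vocabulary of roughly 38,000 terms.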
- Dictionary Compression
- Dictionary as a string
- treat the dictionary as one long string
- Blocked Storage
- group the terms of the dictionary string into blocks, prefixing each term with a byte encoding its length, so only one pointer per block is needed
- Front Coding
- use common beginnings of words with blocked storage to decrease size
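A sketch of front coding within one block of sorted terms (a hypothetical encoding of my own, just to show the idea; the real on-disk format also stores length bytes and block pointers):

```python
def front_code(terms):
    """Front-code a sorted block: store the first term in full, then for
    each later term the shared-prefix length plus the differing suffix."""
    coded = [terms[0]]
    for prev, cur in zip(terms, terms[1:]):
        p = 0
        while p < min(len(prev), len(cur)) and prev[p] == cur[p]:
            p += 1
        coded.append((p, cur[p:]))  # (prefix length, remaining suffix)
    return coded
```

Because the dictionary is sorted, consecutive terms share long prefixes, which is exactly what makes this pay off.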
- Postings File Compression
- bytewise compression
- bitwise compression
- γ (gamma) encoding
- uses variable length codes
- This caused quite a bit of confusion for me.
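Seeing the two schemes side by side as code helped me; this is a minimal sketch of both, assuming gaps ≥ 1 (gamma is shown as a bit string rather than packed bits):

```python
def gamma_encode(g):
    """Bitwise gamma code: offset is g's binary with the leading 1
    stripped; length is the unary code for len(offset)."""
    assert g >= 1
    offset = bin(g)[3:]                 # drop '0b1'
    length = "1" * len(offset) + "0"    # unary: n ones, then a zero
    return length + offset              # e.g. 13 -> '1110' + '101'

def vb_encode(n):
    """Bytewise variable-byte code: 7 payload bits per byte, with the
    high bit set only on the final byte of a number."""
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128                      # continuation bit marks the end
    return bytes(out)
```

Gamma trades more CPU work for better compression; variable-byte codes decode faster because they stay byte-aligned, which is the trade-off IIR discusses.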