Friday, January 31, 2014

Reading Notes - Unit 4

IIR Chap 1.3 - 1.4

These sections talk about Boolean search queries in an inverted index.  Section 1.3 talks about the best method for finding the answer: you take the posting list of each term and compute the intersection for AND, the union for OR.  It also goes on to talk about dealing with nested Boolean operators.  Section 1.4 compares Boolean search with ranked retrieval, for example Westlaw in the early '90s versus Google today.
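The intersection step can be sketched as a single linear pass over two sorted postings lists; this is my own toy version (the docIDs are invented):

```python
# Intersecting sorted postings lists for an AND query, as in IIR 1.3:
# walk both lists in step and keep the docIDs they share.

def intersect(p1, p2):
    """Intersect two sorted postings lists (AND query)."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

def union(p1, p2):
    """Union of two sorted postings lists (OR query)."""
    return sorted(set(p1) | set(p2))

brutus = [1, 2, 4, 11, 31, 45]   # invented postings
caesar = [2, 4, 8, 31]
print(intersect(brutus, caesar))  # [2, 4, 31]
```

The merge only works because postings are kept sorted by docID, which is why the index stores them that way.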

IIR Chapter 6


  • Chapter 6 is about scoring and weighting documents, as opposed to the Boolean approach we've been using thus far
    • Parametric & Zone indexes
      • We use a parametric index for fields whose values come from a fixed set, such as date of creation
      • Zone indexes cover arbitrary free text such as the title and abstract; unlike fields, zone contents are not predefined values
      • We use weighted zone scoring, which gives each (document, query) pair a score on the interval [0,1]
      • Scoring is "learned"
        • Machine learning
    • Weighting term importance
      • Term frequency
        • The more often a term is used in a document, the more likely the document is about it.
        • However, we can use inverse document frequency to discount terms like stop words, which occur a lot but are not important
          • Basically, it rewards a naturally rare word when it occurs unusually often in a particular document
    • Vector Space Scoring
      • Treat the document as a vector containing each of its dictionary terms
      • Documents can be assessed against queries using the dot product
        • Both the document and the query are treated as vectors
    • Term Weighting for the vector space model
      • Words that occur 20x more often shouldn't be 20x more important
      • tf Normalization
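The vector space ideas above can be put together in a few lines. This is my own toy sketch (the three "documents" are invented), combining sublinear tf scaling and idf, then scoring by normalized dot product (cosine similarity):

```python
# Toy tf-idf + vector space scoring sketch, following IIR Chapter 6 ideas.
import math
from collections import Counter

docs = {  # invented mini-collection
    "d1": "the cat sat on the mat",
    "d2": "the dog chased the cat",
    "d3": "dogs and cats make good pets",
}

N = len(docs)
tokenized = {d: text.split() for d, text in docs.items()}
# document frequency: in how many docs does each term appear?
df = Counter(t for toks in tokenized.values() for t in set(toks))

def tfidf_vector(tokens):
    tf = Counter(tokens)
    # sublinear tf scaling (1 + log tf) times idf log(N / df);
    # terms unseen in the collection contribute nothing
    return {t: (1 + math.log(tf[t])) * math.log(N / df[t])
            for t in tf if df.get(t)}

def cosine(v1, v2):
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

query = tfidf_vector("cat mat".split())
scores = {d: cosine(query, tfidf_vector(toks)) for d, toks in tokenized.items()}
print(max(scores, key=scores.get))  # d1: it contains both "cat" and "mat"
```

Note how "the" ends up with a low idf because it appears in most documents, which is exactly the stop-word discounting described above.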

Tuesday, January 28, 2014

Muddiest Point - Unit 3

With ɣ codes, I understand how they work with the unary code + the offset.  I also understand that they clearly help with compression.  It's the jump between the two that I just don't understand.

Thursday, January 23, 2014

Reading Notes - Unit 3

IIR Chapter 4


  • Chapter 4 is about Index construction
    • A brief overview of the general speed of the different operations that affect us
      • seek time
      • transfer time
      • processor's clock rate
      • low-level operations
    • Because of hardware constraints, external sorting algorithms are used
      • Blocked sort-based indexing
        • It works in blocks to cut down on disk seeks
        • Blocks are sorted individually then merged into one index
      • Single-pass in-memory indexing
        • Still unsure how it differs from blocked sort-based indexing (it seems SPIMI builds each block's dictionary on the fly rather than keeping a global term-to-termID mapping)
      • Distributed indexing
        • This is closest to real world, "Google-style" indexing
        • MapReduce
          • Controlled by master node
          • Uses key-value pairs to break a big problem into small ones
          • 2 phases
            • Map phase
              • maps input data to key-value pairs
            • These pairs then get written to segment files
            • Reduce phase
              • The inverters finish the job
          • This is a safe method, because the master node will coordinate and if necessary reassign jobs from nodes that are taking too long or have failed to others
      • Dynamic indexing
        • Used for constantly changing collections of documents
    • This chapter doesn't cover ranked-retrieval systems.
    • Briefly mentions access control lists
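A toy sketch of the MapReduce flow above, with invented documents: the map phase emits (term, docID) pairs, grouping plays the role of the segment files, and the reduce phase (the inverters) produces the postings lists.

```python
# Toy single-machine imitation of MapReduce-style index construction.
from collections import defaultdict

docs = {1: "caesar died", 2: "brutus killed caesar"}  # invented

# Map phase: parse each document into (term, docID) pairs
pairs = [(term, doc_id) for doc_id, text in docs.items()
         for term in text.split()]

# Shuffle/partition: group pairs by term (what segment files achieve at scale)
grouped = defaultdict(list)
for term, doc_id in pairs:
    grouped[term].append(doc_id)

# Reduce phase: the "inverters" sort each group into a postings list
index = {term: sorted(set(ids)) for term, ids in grouped.items()}
print(index["caesar"])  # [1, 2]
```

In a real distributed setting each phase runs on many machines, with the master node assigning (and, on failure, reassigning) map and reduce tasks.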

IIR Chapter 5

  • Compression
    • Does the obvious and saves us space, up to 75%
    • Also helps with caching
      • modern systems are so fast that a compressed file can be read and decompressed faster than a larger uncompressed file can be read
  • The author goes on to discuss statistical properties of terms
    • Heaps' law
      • Estimates the number of distinct terms (the vocabulary size) of a collection
    • Zipf's Law
      • How term frequencies are distributed: the i-th most frequent term occurs with frequency roughly proportional to 1/i
  • Dictionary Compression
    • Dictionary as a string
      • treat the dictionary as one long string
    • Blocked Storage
      • group the terms of the dictionary string into blocks, storing each term's length in a byte so that only one pointer per block is needed
    • Front Coding
      • use common beginnings of words with blocked storage to decrease size
  • Postings File Compression
    • bytewise compression
    • bitwise compression
    • ɣ encoding
      • uses variable-length codes
      • This caused quite a bit of confusion for me.

Tuesday, January 14, 2014

Muddiest Point - Unit 2

When it comes to permuterm and k-gram indexes for wildcard queries, when would you use one or the other? Permuterm seems like it may return a smaller, more targeted group of words, whereas k-grams could return a huge result depending on the word.
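Sketching what each scheme actually indexes (my own toy code, invented term): permuterm stores every rotation of term$, so a query like hel*o can be rotated until the wildcard sits at the end, while a k-gram index stores character k-grams and must post-filter its candidates, which is where the huge result sets can come from.

```python
# What the two wildcard indexes store for a single term.

def permuterm_rotations(term):
    """All rotations of term + '$' — each becomes a permuterm index key."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

def kgrams(term, k=2):
    """Character k-grams of '$' + term + '$' — the k-gram index keys."""
    t = "$" + term + "$"
    return [t[i:i + k] for i in range(len(t) - k + 1)]

print(permuterm_rotations("hello"))
# ['hello$', 'ello$h', 'llo$he', 'lo$hel', 'o$hell', '$hello']
print(kgrams("hello"))
# ['$h', 'he', 'el', 'll', 'lo', 'o$']
```

For hel*o, the permuterm lookup is a single prefix search on o$hel*, whereas the 2-gram index intersects the postings of $h, he, el, and o$ and then still has to filter out false matches like "help also".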

Friday, January 10, 2014

Reading Notes - Unit 2

IIR Sections 1.1 - 1.2:


  • We looked at building inverted indexes, mostly an introduction.  
  • Tokens and normalized tokens roughly equal words
  • Inverted indexes look sort of like linked lists (?)
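A minimal sketch of what such an index looks like (documents invented): a dictionary mapping each normalized token to a sorted postings list of docIDs, which is the "linked list" part.

```python
# Toy inverted index construction over a three-document collection.
from collections import defaultdict

docs = {1: "new home sales top forecasts",   # invented
        2: "home sales rise in july",
        3: "increase in home sales in july"}

index = defaultdict(list)
for doc_id in sorted(docs):                  # visit docs in docID order
    for token in set(docs[doc_id].lower().split()):
        index[token].append(doc_id)          # postings stay sorted

print(index["home"])   # [1, 2, 3]
print(index["july"])   # [2, 3]
```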
IIR Chapter 2:

  • First we read about document types and what constitutes a document
    • You don't want to consider a collection of books a document
    • Conversely you don't want to consider a paragraph one either
    • Try to find a happy medium, or tailor it to your need
  • Interesting look at the challenges of languages
  • Token is an instance of a sequence of characters
  • Type is the class of all tokens containing the same character sequence
  • Term is a (possibly normalized) type that is included in the IR system's dictionary
    • You might not include stop words like "the"
  • Looked at how to deal with names such as O'Neill
  • Normalization is the next step
  • Followed possibly by stemming and lemmatization.
    • It strikes me as interesting that stemming and lemmatization are still problematic and don't always solve the problem
  • A discussion of skip pointers follows
    • I'm still a little hazy on this.
  • Biword indexes, with their ability to weed out most false positives, seem to have a lot of upsides and not many downsides when it comes to search...
    • "friends romans", "romans countrymen"
  • Then positional schemes are introduced, followed by the combination of the biword and positional to find "Michael Jackson"
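A toy sketch of the positional approach for a phrase query like "michael jackson" (documents invented): intersect the two terms' postings, then check that their positions are adjacent.

```python
# Positional index plus a two-word phrase query.
from collections import defaultdict

docs = {1: "michael jackson thriller",           # invented
        2: "jackson five with michael",
        3: "michael jordan and michael jackson"}

# positional index: term -> {docID: [positions]}
pos_index = defaultdict(lambda: defaultdict(list))
for doc_id, text in docs.items():
    for pos, token in enumerate(text.split()):
        pos_index[token][doc_id].append(pos)

def phrase_query(t1, t2):
    hits = []
    p1, p2 = pos_index[t1], pos_index[t2]
    for doc_id in sorted(set(p1) & set(p2)):
        # adjacent positions mean the two words form the phrase
        if any(pos + 1 in p2[doc_id] for pos in p1[doc_id]):
            hits.append(doc_id)
    return hits

print(phrase_query("michael", "jackson"))  # [1, 3]
```

Document 2 contains both words but not adjacently, so the position check correctly filters it out; that is exactly the false positive a plain AND query would return.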

IIR Chapter 3

  • We begin with the venerable binary tree and B-tree.
    • B-trees are nice because they help when the dictionary is disk-based.
  • Then a discussion of wildcard queries begins
    • Permuterm indexes.
      • I still don't understand what's going on in these.
    • k-gram indexes
  • Spelling correction
    • Choose the "nearest one"
    • Choose the most likely if two words are equally close
  • Forms of spelling correction
    • isolated-term: correct a single query term at a time
    • context-sensitive: use the surrounding words to pick the most likely candidate
      • "flew form Heathrow" becomes "flew FROM Heathrow"
    • Edit distance uses matrices to determine how far one word is from another
      • "cats" has a Levenshtein distance of 3 from "fast"
    • k-gram indexes for spelling correction
      • how many k-grams does it have in common
    • Phonetic correction (the most interesting in my mind)
      • make a soundex index
      • I always wondered if something like this existed
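The edit-distance matrix can be sketched as the standard Levenshtein dynamic program (my own sketch, not the book's code): cell [i][j] holds the distance between the first i characters of one word and the first j characters of the other.

```python
# Levenshtein edit distance via dynamic programming.

def levenshtein(s1, s2):
    m, n = len(s1), len(s2)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i            # i deletions to reach the empty string
    for j in range(n + 1):
        dist[0][j] = j            # j insertions from the empty string
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # delete
                             dist[i][j - 1] + 1,        # insert
                             dist[i - 1][j - 1] + cost) # substitute / match
    return dist[m][n]

print(levenshtein("cats", "fast"))  # 3
```

The "cats"/"fast" example from the notes comes out to 3: substitute c→f, t→s, and s→t, with the shared "a" costing nothing.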

Tuesday, January 7, 2014

Muddiest Notes - Unit 1

Week 1 didn't have anything that was too confusing.

Reading Notes - Unit 1

Finding Out About Section 1.1 Notes:


  • This chapter was a nice start to the course.  It provided a larger overview of what we're doing here.  Having just come from Human Information Processing with Dr. Hirtle, this was a nice bridge.  
  • The FOA conversation loop was a perfect model for what goes on in a conversation between two people.
  • They go on to show the model of a search engine, and it really showed how far we still have to go.

Information Retrieval  Section 1.1 - 1.2 Notes:

  • Section 1.1 is a basic introduction to the topic, talking mostly about search engines, but it does briefly touch on local search.
  • Figure 1.1 is a nice model of what a search engine is doing.
  • They then go on to talk about system efficiency and effectiveness 


Modern Information Retrieval Sections 1.1 - 1.4 Notes:


  • Again, a basic introduction to the topic, with a great discussion of ancient libraries 
  • Sections 1.2-1.3 are a bit more CS-heavy than the other readings but provide a concrete, high-level model that I feel is the best of the three readings.
  • Section 1.3 introduces the idea of Ranking as the current best way to search
  • Section 1.4 has a brief history of the web and web search, moving on to the modern era.