Tuesday, April 15, 2014
Saturday, April 12, 2014
Reading Notes - Unit 13
IIR - Chapter 13
IIR - Chapter 14
IIR - Chapter 16
IIR - Chapter 17
- Text Classification and Naive Bayes
- The Text Classification Problem
- Supervised learning
- Naive Bayes Text Classification
- A probabilistic learning method
- Relation to Multinomial Unigram Language Model
- Formally identical; it's a special case
- The Bernoulli Model
- Equivalent to the binary independence model
- Properties of Naive Bayes
- An alternative formalization of the multinomial model represents each document
- Feature Selection
- The process of selecting a subset of the terms occurring in the training set
- Mutual information
- χ² feature selection
- Frequency-based feature selection
- Feature selection for multiple classifiers
- Mutual information and χ² represent rather different feature selection methods
- Evaluation of Text Classification
- The classic Reuters-21578 collection was the main benchmark for text classification evaluation
- We can measure recall, precision, and accuracy
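The multinomial Naive Bayes model is simple enough to sketch in a few lines of Python. This is my own toy version (add-one smoothing, log-space scoring), not code from the book:

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs):
    """docs: list of (tokens, label). Returns class priors and
    add-one-smoothed conditional term probabilities."""
    class_docs = defaultdict(int)
    class_tokens = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_docs[label] += 1
        class_tokens[label].update(tokens)
        vocab.update(tokens)
    n = sum(class_docs.values())
    priors = {c: class_docs[c] / n for c in class_docs}
    cond = {}
    for c in class_docs:
        total = sum(class_tokens[c].values()) + len(vocab)
        cond[c] = {t: (class_tokens[c][t] + 1) / total for t in vocab}
    return priors, cond, vocab

def classify(tokens, priors, cond, vocab):
    """Pick the class maximizing log P(c) + sum of log P(t|c)."""
    def score(c):
        s = math.log(priors[c])
        for t in tokens:
            if t in vocab:          # terms unseen in training are ignored
                s += math.log(cond[c][t])
        return s
    return max(priors, key=score)
```

Fed the chapter's little China/Japan training example, this classifies "Chinese Chinese Chinese Tokyo Japan" as the China class, matching the book's worked answer.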
IIR - Chapter 14
- Vector Space Classification
- Document representations and measures of relatedness in vector spaces
- Rocchio classification
- k nearest neighbor
- Time complexity and optimality of kNN
- Linear versus nonlinear classifiers
- Classification with more than two classes
- The bias-variance tradeoff
IIR - Chapter 16
- Flat clustering
- Clustering in information retrieval
- States the fundamental assumption we make when using clustering in information retrieval (the cluster hypothesis)
- Problem statement: Given (i) a set of documents, (ii) a desired number of clusters K, and (iii) an objective function that evaluates the quality of a clustering, we want to compute an assignment that minimizes the objective function
- Cardinality - the number of clusters
- The evaluation of clustering
- K-means
- Cluster cardinality in K-means
- Model-based clustering
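K-means is short enough to sketch directly. Here's my own toy pure-Python version of Lloyd's algorithm (a real implementation would add convergence checks and smarter seeding, e.g. k-means++):

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points (tuples)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm: alternate (1) assign each point to its nearest
    centroid and (2) recompute centroids, driving down the RSS objective."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old centroid if a cluster goes empty
                centroids[i] = tuple(sum(xs) / len(cluster) for xs in zip(*cluster))
    return centroids, clusters
```

On two well-separated blobs of points it recovers the blobs as the two clusters.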
IIR - Chapter 17
- Hierarchical clustering
- Flat clustering is efficient but has some drawbacks
- Hierarchical agglomerative clustering
- Single-link and complete-link clustering
- The complexity of the naive HAC algorithm is O(n^3)
- Group-average agglomerative clustering
- Centroid Clustering
- The similarity of two clusters is defined as the similarity of their centroids
- Optimality of HAC
- Single-link
- GAAC
- complete-link
- Divisive clustering
- Cluster hierarchy can be generated top down
- Cluster Labeling
- Human users interact with clusters, so the clusters need labeling
- Implementation notes
- Problems require the computation of a large number of dot products
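The naive O(n^3) HAC algorithm mentioned above is easy to write down. A toy single-link version of my own (recompute all pairwise cluster distances on every merge, stop at k clusters):

```python
def dist(a, b):
    """Euclidean distance between two points (tuples)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def single_link_hac(points, k):
    """Naive agglomerative clustering: start with singletons and repeatedly
    merge the two clusters with the smallest single-link (minimum pairwise)
    distance. Rescanning all cluster pairs per merge is what makes it O(n^3)."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    return clusters
```

Swapping min for max in the pairwise distance gives complete-link instead of single-link.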
Wednesday, April 9, 2014
Friday, April 4, 2014
Reading Notes - Unit 12
Ahn et al. - Personalized Web Exploration with Task Models
This paper is about personalized web search, specifically exploratory web search. Exploratory searches are those that go beyond the typical "how many inches in a foot" type searches that seek a simple answer. This paper covers the testing of a tool the authors came up with called TaskSieve. TaskSieve uses relevance feedback to offer the user personalized search.
Pazzani & Billsus - Content-Based Recommendation Systems
This paper is about content-based recommendation systems. These systems are used everyday from web search to Amazon.com as a way to help the customer find other items they may enjoy and/or to help the retailer sell more product. These systems are usually helped by algorithms that analyze a user's prior history, though sometimes they also have the user enter the information too.
Gauch et al. - User Profiles for Personalized Information Access
This paper dovetails nicely with the previous two of this week. It covers the profiling of users, including methods for user identification and other collection techniques. The paper looks at the need of companies and projects to have access to more specific information about their customers and participants. It is interesting that they only briefly touch on the privacy implications of all the interesting facts that you can glean from the user, both implicitly and explicitly.
Tuesday, April 1, 2014
Muddiest Points - Unit 11
It seems that an easier task would be to translate documents and then search them, rather than trying to dynamically search documents with queries in a different language?
Thursday, March 27, 2014
Reading Notes - Unit 11
*A note to our regular readers: Oard & Diekema's paper Cross-Language Information Retrieval cannot be retrieved anywhere.
IES Chapter 14
Dealing with large amounts of data, like Google. This chapter looks at ways of dealing with the massive amount of data
- Parallel Query Processing - using index partitioning & replication
- Document Partitioning
- Each server has a subset of the documents
- Term Partitioning
- Each server has a subset of the index in memory
- Hybrid Schemes
- Use term & document partitioning
- OR use document partitioning and replication
- Redundancy & Fault Tolerance
- We must assume that with more queries and machines there will be faults
- We can handle this with replication and partial replication
- MapReduce
- The Basic Framework
- Highly parallelizable, since many map and reduce tasks execute at the same time
- Combiners
- A reduce function applied to a map shard and not a reduce shard
- Secondary Keys
- A function of MapReduce that deals with duplicate keys
- Machine Failures
- the map side deals with failure better than the reduce side
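To make the map/shuffle/reduce phases concrete, here is a toy single-process word count I sketched in Python — a real MapReduce runs the mappers and reducers on many machines and writes shards to disk, so treat this purely as an illustration of the data flow:

```python
from collections import defaultdict

def map_wordcount(doc_id, text):
    """Map phase: emit a (word, 1) pair for every token."""
    for word in text.split():
        yield (word, 1)

def reduce_sum(key, values):
    """Reduce phase: combine all values emitted for one key."""
    return sum(values)

def run_mapreduce(inputs, mapper, reducer):
    """Run all mappers, shuffle the emitted pairs by key, then run the
    reducer once per key -- all in one process, for illustration only."""
    shuffled = defaultdict(list)
    for item in inputs:
        for key, value in mapper(*item):
            shuffled[key].append(value)   # shuffle: group values by key
    return {key: reducer(key, values) for key, values in shuffled.items()}
```

A combiner, in these terms, would just be reduce_sum applied to each mapper's own output before the shuffle, cutting down the data that has to move between phases.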
He & Wang - Cross-Language Information Retrieval
This excerpt from a book looks at the challenges of taking queries in one language and returning relevant information that is in another. Currently no search engine supports this, not even Google. First, the system must decide how it will translate the given query, using all the usual techniques such as stemming, tokenization, phrase identification, stop-word removal, n-grams, etc. Then the application must have translation knowledge, which can come from bilingual dictionaries or corpora; the system must also be able to deal with acronyms and proper nouns. The chapter then goes into how to use this knowledge to find the documents that best suit the user, using term weighting. Finally, the system must have some method for being evaluated; there are a few evaluation frameworks available, such as the CLIR TREC track.
Tuesday, March 25, 2014
Friday, March 21, 2014
Reading Note - Unit 10
Brin & Page - The Anatomy of a Large-Scale Hypertextual Web Search Engine
This is the rock star of academic papers. This is what every paper hopes that it becomes. This isn't just a paper we can learn from.
This is history.
Whether you like Google or not, very few companies of its size can so clearly point to their beginning in a paper like this. Page and Brin discuss what have become the foundations for Google and web search. They talk about PageRank and how it uses a relatively simple algorithm to determine a page's importance by tracking the pages that link to it. There is then a review of related work that led them to PageRank. They then go on to discuss Google's architecture for storing and crawling the data and the lexicon. There is then a section of results and then conclusions. The last sentence is funny to look back on: "We hope Google will be a resource for searchers and researchers all around the world and will spark the next generation of search engine technology." If it had just been that.
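That "relatively simple algorithm" really is simple: PageRank is a power iteration over the link graph. Here's my own toy sketch (damping factor 0.85 as in the paper; the dangling-page handling is one common convention, not necessarily theirs):

```python
def pagerank(links, d=0.85, iters=50):
    """links: {page: [pages it links to]}. Each page passes fraction d of its
    rank to its outlinks; (1 - d) is spread uniformly (the random surfer)."""
    pages = set(links)
    for outs in links.values():
        pages.update(outs)
    n = len(pages)
    pr = {p: 1 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}
        for p in pages:
            outs = links.get(p, [])
            if outs:
                for q in outs:
                    new[q] += d * pr[p] / len(outs)
            else:  # dangling page: spread its rank over everyone
                for q in pages:
                    new[q] += d * pr[p] / n
        pr = new
    return pr
```

For a tiny graph like {"a": ["b"], "b": ["a", "c"], "c": ["a"]}, the scores sum to 1 and the most-linked-to page ("a") ends up ranked highest.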
Kleinberg - Authoritative Sources in a Hyperlinked Environment
This paper discusses web search. It would seem to come to the same conclusions as Brin & Page, but without the billion-dollar company: we can find good information on the web by looking to see who links to whom, but in this case the paper advocates identifying authoritative pages.
IIR Chapter 19
This might be the first chapter in this book that actually has some practical stuff in it. There is a brief discussion of the history of web search. Then it goes on to talk about the model of the Web and how it helped the Web grow so explosively. The web looks suspiciously like a graph. Then it talks about spam and how it was inevitable because "spam stems from the heterogeneity of motives in content creation on the Web." The chapter goes on to talk about the users of search and finishes up by dealing with duplicates.
IIR Chapter 20
This chapter deals with the structure of the web as dealt with briefly in chapter 19. This appears to be mostly a rehash of the two papers we read for this week.
Tuesday, March 4, 2014
Muddiest Notes - Unit 8
No unclear points this week; the lecture was a really great summary of what we've done up to this point, and it really pulled it all together.
Friday, February 28, 2014
Reading Note - Unit 8
*Note: somehow the reading note for Unit 6 got republished or was saved as a draft...not sure what happened
MIR Chapter 10
This chapter is about how we can present information to the user and how the user interacts with the search system. First it considers human-computer interaction: offer informative feedback, reduce working-memory load, provide alternative interfaces for novice and expert users, and use the capabilities of modern computers to display information visually. We must also consider how the user will access the information. Basically, this chapter covered the issues facing the search professional as we develop models to help users use what we find.
Hearst Chapter 1
This excerpt talks specifically about the design of search user interfaces. Keep it simple. Realize that users are no longer necessarily highly educated professionals with in-depth domain knowledge; we have to take into account how "usable" our interface is, and there is very good research that can guide us. We must give the user timely and useful feedback, but balance this with doing some things automatically. We can't overwhelm the user's short-term memory, and we must provide shortcuts or hints. Help the user to reduce errors. Don't forget the small things: it has to look good, too, on top of everything else.
Hearst Chapter 11
This excerpt talks specifically about visualization for text analysis. We can use graphics to show relationships. We can visualize concordances, like SeeSoft or tag clouds. We can also visualize relationships between documents, like citations in the scientific literature.
Reading Note - Unit 6
*A note to our regular readers, we skipped unit 5
First up we have chapter 8 of IIR
- Measuring the effectiveness of IR systems
- We need a test collection
- We need a set of test queries
- We need a set of relevance judgments as the book calls them
- Test collections for this purpose
- Cranfield collection
- TREC
- Put together by NIST
- GOV2
- Bigger version of TREC, also done by NIST
- Still two orders of magnitude smaller than what is indexed by web search engines
- NTCIR
- Focuses on East Asian languages
- CLEF
- European languages
- Reuters
- Newswires
- 20 Newsgroups
- Evaluating unranked retrieval results
- Precision
- fraction of documents retrieved that are relevant
- Recall
- fraction of relevant documents that are retrieved
- F-measure
- A single measure that uses both
- Evaluating ranked retrieval results
- Precision-recall curve
- Why not just use F-measure?
- Many other ways to evaluate results
- Developing reliable and informative test collections
- Using pooling of the top k documents and having them judged by experts
- User utility & the use of document relevance
- Satisfaction of the users is very important
- Maybe more so than whether an expert judges something relevant
- Results snippets
- Just like Google, we should give small snippets of the returned text for each ranked document
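Precision, recall, and the F-measure are one-liners once you have the retrieved and relevant sets. A quick sketch of my own (F_beta as the usual weighted harmonic mean; beta=1 gives the balanced F1):

```python
def precision_recall_f(retrieved, relevant, beta=1.0):
    """Return (precision, recall, F_beta) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)            # relevant docs we retrieved
    p = tp / len(retrieved) if retrieved else 0.0
    r = tp / len(relevant) if relevant else 0.0
    if p == 0.0 and r == 0.0:
        return p, r, 0.0
    b2 = beta * beta
    return p, r, (b2 + 1) * p * r / (b2 * p + r)
```

Retrieving {d1, d2, d3, d4} when {d1, d2, d5} are relevant gives precision 1/2, recall 2/3, and F1 = 4/7.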
Cumulated Gain-Based Evaluation of IR Techniques
This is a paper from 2002 that looks at several techniques for evaluating IR systems. It talks about recall and precision like Ch. 8 but attempts to go further. The first measure uses the relevance scores of the documents in the results; the second discounts "late-retrieved" documents; the third is used to compare the performance of different techniques. They used the TREC-7 data set. This paper would seem to be the basis for our ability to really test different IR systems using established methods.
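The discounting idea from the paper is easy to sketch. Here is my own Python version of (n)DCG: gains are graded relevance scores in rank order, and ranks at or past b get divided by log_b(rank). The exact parameterization is my reading of the method, so treat it as a sketch rather than the paper's reference implementation:

```python
import math

def dcg(gains, b=2):
    """Discounted cumulated gain: sum the graded relevance scores,
    dividing the gain at rank i >= b by log_b(i)."""
    total = 0.0
    for rank, gain in enumerate(gains, start=1):
        total += gain if rank < b else gain / math.log(rank, b)
    return total

def ndcg(gains, b=2):
    """Normalize by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(gains, reverse=True), b)
    return dcg(gains, b) / ideal if ideal else 0.0
```

A ranking with gains [3, 2, 3, 0, 1] scores lower than its ideal reordering [3, 3, 2, 1, 0], so its nDCG comes out below 1.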
What's the value of TREC
Tuesday, February 25, 2014
Friday, February 21, 2014
Reading Notes - Unit 7
IIR Chapter 9
- Relevance feedback and query expansion
- Relevance feedback & pseudo relevance feedback
- The Rocchio algorithm for relevance feedback
- classic algorithm for implementing feedback
- incorporates feedback into the vector space model
- Probabilistic relevance feedback
- Naive Bayes probabilistic model
- When does it work?
- relies on assumptions
- user must have enough knowledge
- On the web
- Excite tried full relevance feedback but it was dropped
- Evaluation of relevance feedback strategies
- 1-2 rounds of feedback
- Pseudo relevance feedback
- method for automatic local analysis
- Indirect relevance feedback
- also called implicit
- DirectHit used this
- Global method for query reformulation
- Vocab tools
- stemming
- thesaurus
- Query Expansion
- Users give feedback
- Automatic thesaurus generation
- analyze documents automatically
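The Rocchio update is just vector arithmetic, so here is my own sketch over sparse dict vectors, with commonly cited default weights (alpha=1, beta=0.75, gamma=0.15) and negative weights clipped to zero:

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """q' = alpha*q + beta*centroid(relevant) - gamma*centroid(nonrelevant).
    Vectors are sparse dicts mapping term -> weight."""
    new = {}
    def add(vec, weight):
        for term, w in vec.items():
            new[term] = new.get(term, 0.0) + weight * w
    add(query, alpha)
    for doc in relevant:
        add(doc, beta / len(relevant))
    for doc in nonrelevant:
        add(doc, -gamma / len(nonrelevant))
    # negative weights make no sense as query terms, so drop them
    return {term: w for term, w in new.items() if w > 0}
```

Marking a "jaguar car" document relevant and a "jaguar cat" document nonrelevant boosts "car", keeps "jaguar" on top, and drops "cat" entirely.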
Relevance Feedback Revisited
This study showed that relevance feedback worked. Specifically, it showed that by using query expansion and query reweighting you could show improvements of 100%. It also showed that the more the merrier when it came to rounds of relevance feedback.
A Study of Methods for Negative Relevance Feedback
All about negative relevance feedback models. They look at a bunch of models and found that the language-model-based ones were more effective than the others.
Improving the Effectiveness of Information Retrieval with Local Context Analysis
This paper is about local context analysis, a new method that the authors proposed. It worked with both English AND non-English collections, and they found it worked better than any other method they tested. They say they are going to continue this work and look into when to expand queries.
Muddiest Notes - Unit 6
Unit 6 was hard to understand in general; I'm still taking it all in. At this point I can't really say what the muddiest point was, so much as the whole thing. I guess my biggest problem would be how we determine what is "relevant". If it truly is by expert judgment or whatever the user deems relevant, that seems...poor.
Tuesday, February 4, 2014
Muddiest Note - Unit 4
This week was pretty straightforward. The only sticking point might be the thought of cosines of angles between vectors with millions of dimensions. This is just because I am a visual person.
Friday, January 31, 2014
Reading Notes - Unit 4
IIR Chap 1.3 - 1.4
These sections talk about Boolean search queries on an inverted index. Section 1.3 describes the best method for finding the answer: you take the posting list of each term and compute the intersection for AND and the union for OR. It also goes on to talk about dealing with nested Boolean operators. Section 1.4 compares Boolean search with ranked retrieval, for example Westlaw in the early '90s versus Google today.
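The AND intersection is the classic two-pointer merge over sorted postings lists. A quick sketch of my own, with doc IDs as sorted Python lists:

```python
def intersect(p1, p2):
    """Merge-style AND of two sorted postings lists of doc IDs.
    Each step advances at least one pointer: O(len(p1) + len(p2))."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # this doc contains both terms
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the pointer at the smaller ID
        else:
            j += 1
    return answer
```

OR works the same way, except every doc ID seen on either list goes into the answer.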
IIR Chapter 6
- Chapter 6 is about scoring and weighting documents, as opposed to the Boolean approach we've been using thus far
- Parametric & Zone indexes
- We use a parametric index that includes such items as date of creation
- Zone indexes cover areas like the title and abstract, whose contents are arbitrary free text rather than predefined values
- We use weighted zone scoring, which assigns a document-query pair a score on the interval [0,1]
- Scoring is "learned"
- Machine learning
- Weighting term importance
- Term frequency
- The more a term is used, the more likely the document is about it.
- However, we can use inverse document frequency to account for stop words which would occur a lot but are not important
- Basically it detects when a naturally rare word is occurring unusually frequently
- Vector Space Scoring
- Treat the document as a vector containing each of its dictionary terms
- Documents can be assessed against queries using the dot product
- Treating the documents as vectors
- Term Weighting for the vector space model
- Words that occur 20x more often shouldn't be 20x more important
- tf Normalization
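The chapter's weighting and scoring fit in a short sketch. This is my own toy version using the (1 + log10 tf) * log10(N/df) weighting and cosine similarity over sparse dict vectors — one of several tf-idf variants the chapter discusses, not the only one:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one sparse dict vector per doc
    with weights (1 + log10 tf) * log10(N / df)."""
    n = len(docs)
    df = Counter()                        # document frequency per term
    tfs = [Counter(doc) for doc in docs]
    for tf in tfs:
        df.update(tf.keys())
    return [{t: (1 + math.log10(c)) * math.log10(n / df[t])
             for t, c in tf.items()} for tf in tfs]

def cosine(u, v):
    """Cosine similarity between two sparse dict vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Note that a term appearing in every document gets idf 0 and contributes nothing — exactly the stop-word effect from the notes above.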
Tuesday, January 28, 2014
Muddiest Point - Unit 3
With ɣ codes, I understand how they work with the unary code + the offset. I also understand that they clearly help with compression. It's the jump between that I just don't understand.
Thursday, January 23, 2014
Reading Notes - Unit 3
IIR Chapter 4
- Chapter 4 is about Index construction
- A brief overview of the general speed of different operations that affect us
- seek time
- transfer time
- processor's clock rate
- low-level operations
- Because of hardware constraints external sorting algorithms are used
- Block sort-based indexing
- It does this in blocks to cut down on disk seeks
- Blocks are sorted individually then merged into one index
- Single-pass in-memory indexing
- Still unsure about how it is different from Block sort-based indexing
- Distributed indexing
- This is closest to real world, "Google-style" indexing
- MapReduce
- Controlled by master node
- Uses key-value pairs to break a big problem into small ones
- 2 phases
- Map phase
- maps input data to key-value pairs
- These are then parsed into segment files
- Reduce phase
- The inverters finish the job
- This is a safe method, because the master node coordinates and, if necessary, reassigns jobs from nodes that are taking too long or have failed
- Dynamic indexing
- Used for constantly changing collections of documents
- This doesn't include ranked-retrieval systems.
- Briefly mentions access control lists
IIR Chapter 5
- Compression
- Does the obvious and saves us space, up to 75%
- Also helps with caching
- modern systems are so fast that a compressed file can be read and decompressed faster than a larger uncompressed file can be read
- The author goes on to discuss the statistical properties of terms
- Heaps' law
- Estimates the number of terms of a collection
- Zipf's Law
- How terms are distributed across documents
- Dictionary Compression
- Dictionary as a string
- treat the dictionary as one long string
- Blocked Storage
- treat the dictionary as a string of blocks, with each term preceded by a byte encoding its length
- Front Coding
- use common beginnings of words with blocked storage to decrease size
- Postings File Compression
- bytewise compression
- bitwise compression
- ɣ encoding
- uses variable length codes
- This caused quite a bit of confusion for me.
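Since the ɣ codes caused me confusion, here is a little encoder/decoder I sketched to convince myself how they work: the offset is the number in binary with its leading 1 chopped off, and the unary prefix tells the decoder how many offset bits to read back.

```python
def gamma_encode(n):
    """Elias gamma code for n >= 1: unary code for the offset length, then
    the offset (binary of n without its leading 1). E.g. 13 = 1101 -> 1110,101."""
    offset = bin(n)[3:]                 # bin(13) = '0b1101' -> '101'
    return "1" * len(offset) + "0" + offset

def gamma_decode(bits):
    """Decode a concatenation of gamma codes back into a list of numbers."""
    numbers, i = [], 0
    while i < len(bits):
        length = 0
        while bits[i] == "1":           # unary prefix: count the 1s
            length += 1
            i += 1
        i += 1                          # skip the terminating 0
        numbers.append(int("1" + bits[i:i + length], 2))  # restore leading 1
        i += length
    return numbers
```

gamma_encode(13) gives "1110101", and decoding a concatenated stream of codes round-trips the original gap list — which is the "jump between" I was missing: the codes are self-delimiting, so a postings file is just the codes glued together.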
Tuesday, January 14, 2014
Muddiest Point - Unit 2
When it comes to permuterm and k-gram indexes for wildcard queries, when would you use one or the other? Permuterm seems like it may return a smaller, targeted group of words, whereas k-grams could return a huge result depending on the word.
Friday, January 10, 2014
Reading Notes - Unit 2
IIR Section1-2:
- We looked at building inverted indexes, mostly an introduction.
- Tokens, normalized tokens roughly equal words
- Inverted indexes look sort of like linked lists (?)
IIR Chapter 2:
- First we read about document types and what constitutes a document
- You don't want to consider a collection of books a document
- Conversely you don't want to consider a paragraph one either
- Try to find a happy medium, or tailor it to your need
- Interesting look at the challenges of languages
- Token is an instance of a sequence of characters
- Type is the class of all tokens with the same character sequence
- Term is a (possibly normalized) type that is included in the IR system's dictionary
- You might not include stop words like "the"
- Looked at how to deal with names such as O'Neill
- Normalization is the next step
- Followed possibly by stemming and lemmatization.
- It strikes me as interesting that stemming and lemmatization are still problematic and don't always solve the problem
- A discussion of skip pointers follows
- I'm still a little hazy on this.
- Biword indexes, with their ability to weed out most false positives, seem to have a lot of upsides and not so many downsides when it comes to search...
- "friends romans", "romans countrymen"
- Then positional schemes are introduced, followed by the combination of the biword and positional to find "Michael Jackson"
IIR Chapter 3
- We begin with the venerable binary tree and b-tree.
- b-trees are nice because they can help when the dictionary is disk-based.
- Then a discussion of wildcard queries begins
- Permuterm indexes.
- I still don't understand what's going on in these.
- k-gram indexes
- Spelling correction
- Choose the "nearest one"
- Choose the most likely if two words are equally close
- Forms of spelling correction
- isolated-term: correct a single query term at a time
- context-sensitive: using the most likely candidate
- "flew form Heathrow" becomes "flew FROM Heathrow"
- Edit distance uses matrices to determine how far one word is from another
- "cats" has a Levenshtein distance of 3 from "fast"
- k-gram indexes for spelling correction
- how many k-grams does it have in common
- Phonetic correction (the most interesting in my mind)
- make a soundex index
- I always wondered if something like this existed
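The edit-distance matrix from these notes is a classic dynamic program. My own sketch of plain Levenshtein distance (unit cost for insert, delete, and substitute):

```python
def edit_distance(s, t):
    """Cell d[i][j] holds the cost of turning s[:i] into t[:j]."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                     # delete everything from s
    for j in range(n + 1):
        d[0][j] = j                     # insert everything from t
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[m][n]
```

This confirms the example above: edit_distance("cats", "fast") is 3.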
Tuesday, January 7, 2014
Reading Notes - Unit 1
Finding Out About Section 1.1 Notes:
- This chapter was a nice start to the course. It provided a larger overview of what we're doing here. Having just come from Human Information Processing with Dr. Hirtle, this was a nice bridge.
- The FOA conversation loop was a perfect model for what goes on in a conversation between two people.
- They go on to show the model of a search engine and it really showed how far we have to come.
Information Retrieval Section 1.1 - 1.2 Notes:
- Section 1.1 is a basic introduction to the topic talking mostly about search engines but it does briefly touch on local search.
- Figure 1.1 is a nice model of what a search engine is doing.
- They then go on to talk about systems and efficiency and effectiveness
Modern Information Retrieval Sections 1.1 - 1.4 Notes:
- Again, a basic introduction to the topic, with a great discussion of ancient libraries and
- Section 1.2-1.3 are a bit more CS heavy than the other readings but provide a concrete, high level model that I feel is the best of the three readings.
- Section 1.3 introduces the idea of Ranking as the current best way to search
- Section 1.4 has a brief history of the web and web search moving onto the modern era.