Saturday, April 12, 2014

Reading Notes - Unit 13

IIR - Chapter 13


  • Text Classification and Naive Bayes
    • The Text Classification Problem
      • Supervised learning
    • Naive Bayes Text Classification
      • A probabilistic learning method
    • Relation to Multinomial Unigram Language Model
      • Formally identical; it's a special case
    • The Bernoulli Model
      • Equivalent to the binary independence model
    • Properties of Naive Bayes
      • An alternative formalization of the multinomial model represents each document
    • Feature Selection
      • The process of selecting a subset of the terms occurring in the training set
      • Mutual information
      • χ² (chi-square) feature selection
      • Frequency-based feature selection
      • Feature selection for multiple classifiers
      • Mutual information and χ² represent rather different feature selection methods
    • Evaluation of Text Classification
      • The classic Reuters-21578 collection was the main benchmark for text classification evaluation
      • We can measure recall, precision, and accuracy
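The multinomial model above can be sketched in a few lines. This is a minimal illustration under add-one (Laplace) smoothing, not the book's reference implementation; the function names are my own, and the tiny training set below is in the spirit of the book's small China/not-China worked example.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs):
    """Train a multinomial Naive Bayes classifier.
    docs: list of (class_label, token_list) pairs."""
    vocab = set()
    class_docs = Counter()               # number of training docs per class
    term_counts = defaultdict(Counter)   # per-class term frequencies
    for label, tokens in docs:
        class_docs[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    n_docs = len(docs)
    priors, cond = {}, {}
    for label in class_docs:
        priors[label] = math.log(class_docs[label] / n_docs)
        total = sum(term_counts[label].values())
        # add-one smoothing over the vocabulary avoids zero probabilities
        cond[label] = {t: math.log((term_counts[label][t] + 1) / (total + len(vocab)))
                       for t in vocab}
    return priors, cond

def classify(priors, cond, tokens):
    """Pick the class maximizing log P(c) + sum of log P(t|c);
    terms unseen in training are simply skipped."""
    return max(priors, key=lambda c: priors[c] +
               sum(cond[c][t] for t in tokens if t in cond[c]))
```

Working in log space keeps the product of many small probabilities from underflowing, which is why the scores are sums of logs rather than products.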


IIR - Chapter 14


  • Vector Space Classification
    • Document representations and measures of relatedness in vector spaces
    • Rocchio classification
    • k nearest neighbor
      • Time complexity and optimality of kNN
    • Linear versus nonlinear classifiers
    • Classification with more than two classes
    • The bias-variance tradeoff
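The kNN idea from the outline can be sketched as a minimal cosine-similarity classifier over term-frequency vectors. A sketch only; the function names are my own, and a real implementation would use tf-idf weights and an inverted index rather than scoring every training document.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors (dicts)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(train, tokens, k=3):
    """Majority vote among the k training documents most similar to the query.
    train: list of (label, token_list) pairs."""
    q = Counter(tokens)
    sims = sorted(((cosine(Counter(toks), q), label) for label, toks in train),
                  reverse=True)
    votes = Counter(label for _, label in sims[:k])
    return votes.most_common(1)[0][0]
```

The time complexity point in the outline is visible here: classifying one query scores all training documents, so naive kNN costs O(n) per query.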


IIR - Chapter 16


  • Flat clustering
    • Clustering in information retrieval
      • States the fundamental assumption we make when using clustering in information retrieval
    • Problem statement: Given (i) a set of documents, (ii) a desired number of clusters K, and (iii) an objective function that evaluates the quality of a clustering, we want to compute an assignment of documents to clusters that minimizes (or maximizes) the objective function
      • Cardinality - the number of clusters
    • The evaluation of clustering
    • K-means
      • Cluster cardinality in K-means
    • Model-based clustering


IIR - Chapter 17


  • Hierarchical clustering
    • Flat clustering is efficient but has some drawbacks (unstructured output, a prespecified number of clusters) that hierarchical clustering addresses
    • Hierarchical agglomerative clustering
    • Single-link and complete-link clustering
      • The complexity of the naive HAC algorithm is O(n^3)
      • Group-average agglomerative clustering
    • Centroid Clustering
      • The similarity of two clusters is defined as the similarity of their centroids
    • Optimality of HAC
      • Single-link
      • GAAC
      • complete-link
    • Divisive clustering
      • Cluster hierarchy can be generated top down
    • Cluster Labeling
      • Human users interact with clusters, so the clusters need labels
    • Implementation notes
      • Problems require the computation of a large number of dot products
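The O(n^3) cost of the naive HAC algorithm noted above is easy to see in a sketch: each merge rescans all pairs of clusters. Below is a minimal single-link version on 1-D points (my own illustration, not the book's algorithm; real HAC works on document similarity matrices):

```python
def single_link_hac(points, num_clusters):
    """Naive single-link agglomerative clustering on 1-D points, O(n^3).
    Repeatedly merges the two clusters whose closest members are nearest."""
    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single-link: distance between the closest pair of members
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))   # merge the closest pair of clusters
    return clusters
```

Swapping the `min` for a `max` in the inner loop would turn this into complete-link clustering; that one-line change is the whole difference between the two criteria.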



Wednesday, April 9, 2014

Friday, April 4, 2014

Reading Notes - Unit 12

Ahn et al. - Personalized Web Exploration with Task Models

     This paper is about personalized web search, specifically exploratory web search.  Exploratory searches are those that go beyond the typical "how many inches in a foot" type of search that seeks a simple answer.  This paper covers the testing of a tool the authors developed, called TaskSieve.  TaskSieve uses relevance feedback to offer the user personalized search results.

Pazzani & Billsus - Content-Based Recommendation Systems

     This paper is about content-based recommendation systems.  These systems are used every day, from web search to Amazon.com, as a way to help customers find other items they may enjoy and/or to help the retailer sell more products.  These systems usually rely on algorithms that analyze a user's prior history, though sometimes they also ask the user to enter the information directly.

Gauch et al. - User Profiles for Personalized Information Access

     This paper dovetails nicely with the previous two of this week.  It covers the profiling of users, including methods for user identification and other data-collection techniques.  The paper looks at the need of companies and projects to have access to more specific information about their customers and participants.  It is interesting that they only briefly touch on the privacy implications of all the interesting facts that you can glean from the user, both implicitly and explicitly.

Tuesday, April 1, 2014

Muddiest Points - Unit 11

Wouldn't it be easier to translate documents ahead of time and then search them, rather than trying to dynamically search documents with queries in a different language?

Thursday, March 27, 2014

Reading Notes - Unit 11

*A note to our regular readers: Oard & Diekema's paper, Cross-Language Information Retrieval, could not be retrieved anywhere.

IES Chapter 14

This chapter looks at ways of dealing with massive amounts of data, at the scale of a search engine like Google.

  • Parallel Query Processing - using index partitioning & replication
    • Document Partitioning
      • Each server has a subset of the documents
    • Term Partitioning
      • Each server has a subset of the index in memory
    • Hybrid Schemes
      • Use term & document partitioning
      • OR use document partitioning and replication
    • Redundancy & Fault Tolerance
      • We must assume that with more queries and machines there will be faults
      • We can handle this with replication and partial replication
  • MapReduce
    • The Basic Framework
      • Highly parallelizable: many map tasks and many reduce tasks can run at the same time on different machines
    • Combiners
      • A reduce function applied to a map shard and not a reduce shard
    • Secondary Keys
      • A function of MapReduce that deals with duplicate keys
    • Machine Failures
      • The map side deals with failure better than the reduce side
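The framework above, combiner included, can be sketched with the classic word-count example.  This is a single-process simulation of the data flow only; a real MapReduce spreads the map and reduce tasks across many machines, and the shuffle happens over the network.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Map: emit a (term, 1) pair for every token in the document."""
    return [(term, 1) for term in text.split()]

def combine(pairs):
    """Combiner: a reduce run locally on one map shard, shrinking
    the data that must be shuffled to the reduce shards."""
    partial = defaultdict(int)
    for term, count in pairs:
        partial[term] += count
    return list(partial.items())

def reduce_phase(term, counts):
    """Reduce: sum the partial counts that arrived for one key."""
    return term, sum(counts)

def map_reduce(docs):
    """docs: dict of doc_id -> text; returns term -> total count."""
    shuffle = defaultdict(list)          # groups emitted values by key
    for doc_id, text in docs.items():
        for term, count in combine(map_phase(doc_id, text)):
            shuffle[term].append(count)
    return dict(reduce_phase(t, c) for t, c in shuffle.items())
```

The combiner is worth the extra step because word frequencies are highly skewed: collapsing repeated (term, 1) pairs on the map side can cut shuffle traffic dramatically.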

He & Wang - Cross-Language Information Retrieval

     This excerpt from a book looks at the challenges of taking queries in one language and returning relevant information that is in another.  Currently no search engine supports this, even Google.  First, the system must decide how it will translate the given query.  All the usual techniques apply, such as stemming, tokenization, phrase identification, stop-word removal, n-grams, etc.  Then the application must have translation knowledge.  This can come from bilingual dictionaries or corpora, but the system must also be able to deal with acronyms and proper nouns.  The chapter then goes into how to use this knowledge to find the documents that best suit the user, using term weighting.  Finally, the system must have some method for being evaluated.  There are a few evaluation frameworks available, such as the CLIR TREC track.
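Dictionary-based query translation, the simplest source of translation knowledge mentioned above, can be sketched as follows.  The tiny bilingual dictionary is made up for illustration; note how a term with several translations expands the query, and how out-of-vocabulary terms (acronyms, proper nouns) pass through unchanged.

```python
def translate_query(terms, dictionary):
    """Replace each source-language term with all of its target-language
    alternatives; untranslatable terms (e.g. the acronym "NASA") are
    passed through unchanged rather than dropped."""
    out = []
    for t in terms:
        out.extend(dictionary.get(t, [t]))
    return out

# hypothetical Spanish->English dictionary, purely for illustration
bilingual = {"gato": ["cat"], "casa": ["house", "home"]}
```

A real system would then weight the alternatives (e.g. by translation probability) rather than treating every expansion equally, which connects to the term-weighting discussion in the chapter.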


Tuesday, March 25, 2014