Tom's INFSCI 2140 Reading & Muddiest Notes: April 2014

Tuesday, April 15, 2014

Saturday, April 12, 2014

Reading Notes - Unit 13

IIR - Chapter 13

Text Classification and Naive Bayes

The Text Classification Problem

Supervised learning

Naive Bayes Text Classification

A probabilistic learning method
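To make the probabilistic idea concrete for myself, here is a minimal sketch of multinomial Naive Bayes with add-one smoothing. The function names (`train_multinomial_nb`, `classify_nb`) and the toy documents are my own illustrative choices, not code from the chapter.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(labeled_docs):
    """labeled_docs: list of (tokens, label) pairs. Names here are hypothetical."""
    vocab = {t for tokens, _ in labeled_docs for t in tokens}
    doc_counts = Counter(label for _, label in labeled_docs)
    term_counts = defaultdict(Counter)
    for tokens, label in labeled_docs:
        term_counts[label].update(tokens)
    # Prior: fraction of training documents in each class.
    priors = {c: n / len(labeled_docs) for c, n in doc_counts.items()}
    # Conditional probabilities with add-one (Laplace) smoothing.
    cond_prob = {}
    for c in doc_counts:
        total = sum(term_counts[c].values())
        cond_prob[c] = {t: (term_counts[c][t] + 1) / (total + len(vocab))
                        for t in vocab}
    return priors, cond_prob, vocab

def classify_nb(tokens, priors, cond_prob, vocab):
    """Return the class with the highest log posterior score."""
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for t in tokens:
            if t in vocab:  # terms unseen in training are ignored
                score += math.log(cond_prob[c][t])
        scores[c] = score
    return max(scores, key=scores.get)

# Small toy training set, for illustration only.
training = [
    ("Chinese Beijing Chinese".split(), "china"),
    ("Chinese Chinese Shanghai".split(), "china"),
    ("Chinese Macao".split(), "china"),
    ("Tokyo Japan Chinese".split(), "other"),
]
model = train_multinomial_nb(training)
predicted = classify_nb("Chinese Chinese Chinese Tokyo Japan".split(), *model)
```

Summing logs instead of multiplying probabilities avoids floating-point underflow on long documents.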

Relation to Multinomial Unigram Language Model

Formally identical; it's a special case

The Bernoulli Model

Equivalent to the binary independence model

Properties of Naive Bayes

An alternative formalization of the multinomial model represents each document as an M-dimensional vector of term counts

Feature Selection

The process of selecting a subset of the terms occurring in the training set
Mutual information
χ^2 feature selection
Frequency-based feature selection
Feature selection for multiple classifiers
Mutual information and χ^2 represent rather different feature selection methods
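To compare the two scoring methods, here is a rough sketch that computes both mutual information and the chi-square statistic from the same 2x2 table of document counts for one term and one class. The argument names (`n11` for documents that contain the term and are in the class, `n10` for contain the term but not in the class, and so on) and the function names are my own labels, assumed for illustration.

```python
import math

def mutual_information(n11, n10, n01, n00):
    """MI (in bits) between term occurrence and class membership."""
    n = n11 + n10 + n01 + n00
    # Each cell is paired with its row total (term margin) and column total (class margin).
    cells = [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]
    mi = 0.0
    for count, term_margin, class_margin in cells:
        if count > 0:  # a zero cell contributes nothing (0 * log 0 is taken as 0)
            mi += (count / n) * math.log2(n * count / (term_margin * class_margin))
    return mi

def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for the same 2x2 contingency table."""
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    if denom == 0:  # an empty margin means the statistic is undefined; return 0 here
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom

# Term and class perfectly aligned vs. completely independent (made-up counts).
aligned = (mutual_information(50, 0, 0, 50), chi_square(50, 0, 0, 50))
independent = (mutual_information(25, 25, 25, 25), chi_square(25, 25, 25, 25))
```

To select features, one would score every term against the class this way and keep the top-scoring terms.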

Evaluation of Text Classification

The classic Reuters-21578 collection was the main benchmark for text classification evaluation
We can measure recall, precision, and accuracy
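As a quick reminder of how the three measures differ, here is a small sketch that derives them from the counts of true positives, false positives, false negatives, and true negatives for one class. The function name and the example counts are made up for illustration.

```python
def evaluate(tp, fp, fn, tn):
    """Return (precision, recall, accuracy) for a single two-class decision task."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # of what we labeled positive, how much was right
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # of the true positives, how much we found
    accuracy = (tp + tn) / (tp + fp + fn + tn)        # fraction of all decisions that were right
    return precision, recall, accuracy

# Made-up counts: a rare class in 100 test documents.
precision, recall, accuracy = evaluate(tp=8, fp=2, fn=4, tn=86)
```

With a rare class, accuracy stays high even when recall is mediocre, which is why precision and recall are usually reported alongside it.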

IIR - Chapter 14

Vector Space Classification

Document representations and measures of relatedness in vector spaces
Rocchio classification
k nearest neighbor

Time complexity and optimality of kNN
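A minimal sketch of kNN classification, assuming documents are sparse term-weight dictionaries compared by cosine similarity and the class is chosen by simple majority vote. The names `cosine` and `knn_classify` and the toy vectors are hypothetical. It also shows why test time grows with the training set: every query is compared against every training document.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def knn_classify(query, training, k=3):
    """training: list of (vector, label). Majority vote among the k most similar docs."""
    ranked = sorted(training, key=lambda pair: cosine(query, pair[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy term-weight vectors, for illustration only.
training = [
    ({"stock": 2, "market": 1}, "finance"),
    ({"market": 2, "trade": 1}, "finance"),
    ({"goal": 2, "match": 1}, "sports"),
    ({"match": 2, "team": 1}, "sports"),
    ({"team": 1, "goal": 1}, "sports"),
]
label = knn_classify({"goal": 1, "team": 1, "match": 1}, training, k=3)
```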

Linear versus nonlinear classifiers
Classification with more than two classes
The bias-variance tradeoff

IIR - Chapter 16

Flat clustering

Clustering in information retrieval

States the cluster hypothesis, the fundamental assumption we make when using clustering in information retrieval: documents in the same cluster behave similarly with respect to relevance to information needs

Problem statement: Given (i) a set of documents, (ii) a desired number of clusters K, and (iii) an objective function that evaluates the quality of a clustering, we want to compute an assignment that minimizes the objective function

Cardinality - the number of clusters

The evaluation of clustering
K-means

Cluster cardinality in K-means
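A rough sketch of the K-means loop, assuming points are tuples of numbers and the initial centroids (seeds) are passed in explicitly so the run is repeatable; normally the seeds would be picked at random. The function names are my own. It alternates reassignment and centroid recomputation until the assignment stops changing, and `rss` computes the residual sum of squares that K-means tries to minimize.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, seeds, max_iters=100):
    """Return (assignment, centroids). The number of clusters K is len(seeds)."""
    centroids = [tuple(s) for s in seeds]
    assignment = None
    for _ in range(max_iters):
        # Reassignment step: each point goes to its closest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points
        ]
        if new_assignment == assignment:  # converged: nothing moved
            break
        assignment = new_assignment
        # Recomputation step: each centroid becomes the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # an empty cluster keeps its old centroid
                centroids[k] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return assignment, centroids

def rss(points, assignment, centroids):
    """Residual sum of squares: the objective K-means minimizes."""
    return sum(sq_dist(p, centroids[a]) for p, a in zip(points, assignment))

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
assignment, centroids = kmeans(points, seeds=[(0, 0), (10, 10)])
objective = rss(points, assignment, centroids)
```

Since RSS can only go down as K grows, it cannot be used on its own to pick the number of clusters.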

Model-based clustering

IIR - Chapter 17

Hierarchical clustering

Flat clustering is efficient but has some drawbacks
Hierarchical agglomerative clustering
Single-link and complete-link clustering

The complexity of the naive HAC algorithm is O(n^3)
Group-average agglomerative clustering
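A minimal sketch of the naive HAC algorithm, assuming the input is a symmetric N x N similarity matrix given as a list of lists. The function name and the example matrix are made up. Each of the N-1 merge steps scans the whole matrix for the most similar pair of active clusters, which is where the cubic running time comes from. Single-link keeps the maximum similarity when two clusters merge, and complete-link keeps the minimum.

```python
def naive_hac(sim, linkage="single"):
    """Return the merge sequence as (kept_cluster, absorbed_cluster, similarity) tuples."""
    n = len(sim)
    c = [row[:] for row in sim]       # working copy of the similarity matrix
    active = [True] * n               # clusters still available for merging
    combine = max if linkage == "single" else min
    merges = []
    for _ in range(n - 1):
        # O(N^2) scan for the most similar pair of active clusters.
        best = None
        for i in range(n):
            if not active[i]:
                continue
            for j in range(i + 1, n):
                if active[j] and (best is None or c[i][j] > best[2]):
                    best = (i, j, c[i][j])
        i, j, _ = best
        merges.append(best)
        # Merge j into i and update i's similarities to every other active cluster.
        for m in range(n):
            if active[m] and m != i and m != j:
                c[i][m] = c[m][i] = combine(c[i][m], c[j][m])
        active[j] = False
    return merges

# Made-up similarities for four documents.
sim = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.3, 0.1],
    [0.1, 0.3, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
single_merges = naive_hac(sim, "single")
complete_merges = naive_hac(sim, "complete")
```

The two criteria agree on the first two merges here and differ only in the similarity at which the final two clusters join.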

Centroid Clustering

The similarity of two clusters is defined as the similarity of their centroids

Optimality of HAC

Single-link
GAAC
complete-link

Divisive clustering

Cluster hierarchy can be generated top down

Cluster Labeling

Where human users interact with clusters, the clusters need labels

Implementation notes

Most problems that require the computation of a large number of dot products benefit from an inverted index

Wednesday, April 9, 2014

Muddiest Point - Unit 12

None this week, since we're going more in depth next week.

Friday, April 4, 2014

Ahn et al. - Personalized Web Exploration with Task Models

This paper is about personalized web search, specifically exploratory web search. Exploratory searches are those that go beyond the typical "how many inches in a foot" type searches that seek a simple answer. This paper covers the testing of a tool the authors came up with called TaskSieve. TaskSieve uses relevance feedback to offer the user personalized search.

Pazzani & Billsus - Content-Based Recommendation Systems

This paper is about content-based recommendation systems. These systems are used every day, from web search to Amazon.com, as a way to help the customer find other items they may enjoy and/or to help the retailer sell more product. These systems usually rely on algorithms that analyze a user's prior history, though sometimes they also have the user enter the information directly.

Gauch et al. - User Profiles for Personalized Information Access

This paper dovetails nicely with the previous two of this week. It covers the profiling of users, including methods for user identification and other collection techniques. The paper looks at the needs of companies and projects to have access to more specific information about their customers and participants. It is interesting that they only briefly touch on the privacy implications of all the interesting facts that you can glean from the user, both implicitly and explicitly.

Tuesday, April 1, 2014

Muddiest Points - Unit 11

Wouldn't it be an easier task to translate the documents and then search them, rather than trying to dynamically search documents with queries in a different language?
