Tom's INFSCI 2140 Reading & Muddiest Notes: Reading Notes - Unit 13

Saturday, April 12, 2014


IIR - Chapter 13

Text Classification and Naive Bayes

The Text Classification Problem

Supervised learning

Naive Bayes Text Classification

A probabilistic learning method
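A minimal sketch of multinomial Naive Bayes with add-one smoothing, to make the "probabilistic learning method" concrete. The function names (`train_multinomial_nb`, `classify`) and the toy training set are assumptions made up for illustration, not code from the chapter.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs):
    """docs: list of (tokens, label). Returns priors, conditional probs, vocabulary."""
    vocab = {t for tokens, _ in docs for t in tokens}
    class_docs = Counter(label for _, label in docs)
    term_counts = defaultdict(Counter)
    for tokens, label in docs:
        term_counts[label].update(tokens)
    prior = {c: n / len(docs) for c, n in class_docs.items()}
    cond = {}
    for c in class_docs:
        total = sum(term_counts[c].values())
        # Add-one (Laplace) smoothing avoids zero probabilities.
        cond[c] = {t: (term_counts[c][t] + 1) / (total + len(vocab)) for t in vocab}
    return prior, cond, vocab

def classify(tokens, prior, cond, vocab):
    """Return the class with the highest log posterior score."""
    scores = {}
    for c in prior:
        score = math.log(prior[c])
        for t in tokens:
            if t in vocab:  # terms unseen in training are ignored
                score += math.log(cond[c][t])
        scores[c] = score
    return max(scores, key=scores.get)

# Toy training set (hypothetical data, for illustration only).
training = [
    (["Chinese", "Beijing", "Chinese"], "china"),
    (["Chinese", "Chinese", "Shanghai"], "china"),
    (["Chinese", "Macao"], "china"),
    (["Tokyo", "Japan", "Chinese"], "other"),
]
prior, cond, vocab = train_multinomial_nb(training)
label = classify(["Chinese", "Chinese", "Chinese", "Tokyo", "Japan"], prior, cond, vocab)
```

Working in log space avoids floating-point underflow when many small probabilities are multiplied.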

Relation to Multinomial Unigram Language Model

Formally identical; it's a special case

The Bernoulli Model

Equivalent to the binary independence model
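For contrast with the multinomial model, here is a rough sketch of the Bernoulli model. It records only whether a term occurs in a document, and the absence of a vocabulary term also contributes to the score. The function names and the toy data are assumptions for illustration.

```python
import math

def train_bernoulli_nb(docs):
    """docs: list of (tokens, label). Estimates P(term occurs in a doc | class)."""
    vocab = sorted({t for tokens, _ in docs for t in tokens})
    labels = sorted({label for _, label in docs})
    prior, cond = {}, {}
    for c in labels:
        class_docs = [set(tokens) for tokens, label in docs if label == c]
        prior[c] = len(class_docs) / len(docs)
        # Smoothed fraction of class documents that contain the term.
        cond[c] = {
            t: (sum(t in d for d in class_docs) + 1) / (len(class_docs) + 2)
            for t in vocab
        }
    return prior, cond, vocab

def classify_bernoulli(tokens, prior, cond, vocab):
    """Score every vocabulary term: present terms use p, absent terms use 1 - p."""
    present = set(tokens)
    scores = {}
    for c in prior:
        score = math.log(prior[c])
        for t in vocab:
            p = cond[c][t]
            score += math.log(p if t in present else 1.0 - p)
        scores[c] = score
    return max(scores, key=scores.get)

# Toy training set (hypothetical data, for illustration only).
training_b = [
    (["Chinese", "Beijing", "Chinese"], "china"),
    (["Chinese", "Chinese", "Shanghai"], "china"),
    (["Chinese", "Macao"], "china"),
    (["Tokyo", "Japan", "Chinese"], "other"),
]
prior_b, cond_b, vocab_b = train_bernoulli_nb(training_b)
label_b = classify_bernoulli(
    ["Chinese", "Chinese", "Chinese", "Tokyo", "Japan"], prior_b, cond_b, vocab_b
)
```

On this toy data the Bernoulli model picks "other", because it ignores how often "Chinese" repeats and gives weight to "Tokyo" and "Japan" occurring at all.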

Properties of Naive Bayes

An alternative formalization of the multinomial model represents each document as an M-dimensional vector of term counts

Feature Selection

The process of selecting a subset of the terms occurring in the training set
Mutual information
χ² (chi-square) feature selection
Frequency-based feature selection
Feature selection for multiple classifiers
Mutual information and χ² represent rather different feature selection methods
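Both mutual information and χ² can be computed from a 2x2 table of document counts for one term and one class. The sketch below assumes the naming n11 (term present, in class), n10 (term present, not in class), n01 (term absent, in class), n00 (term absent, not in class). The function names are illustrative.

```python
import math

def mutual_information(n11, n10, n01, n00):
    """Mutual information (in bits) between term occurrence and class membership."""
    n = n11 + n10 + n01 + n00
    # Each cell: (count, marginal for the term value, marginal for the class value).
    cells = [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]
    mi = 0.0
    for count, term_total, class_total in cells:
        if count > 0:  # an empty cell contributes nothing
            mi += (count / n) * math.log2(n * count / (term_total * class_total))
    return mi

def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for independence of term and class."""
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    if denom == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom

# Term and class independent: both scores are zero.
independent = (mutual_information(25, 25, 25, 25), chi_square(25, 25, 25, 25))
# Term occurs in exactly the documents of the class: maximal dependence.
dependent = (mutual_information(50, 0, 0, 50), chi_square(50, 0, 0, 50))
```

To select features, one would score every term against the class and keep the top-scoring terms.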

Evaluation of Text Classification

The classic Reuters-21578 collection was the main benchmark for text classification evaluation
We can measure recall, precision, and accuracy
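A small sketch of how these three measures can be computed for one class from gold labels and predictions. The function name `evaluate` and the example labels are made up for illustration.

```python
def evaluate(gold, predicted, positive=1):
    """Return (precision, recall, accuracy) for the class marked as positive."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    tn = sum(g != positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pairs)
    return precision, recall, accuracy

# Hypothetical labels: 1 = in the class, 0 = not in the class.
gold = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall, accuracy = evaluate(gold, predicted)
```

Accuracy can look high on small classes even when precision and recall are poor, because most documents are correctly labeled as not in the class.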

IIR - Chapter 14

Vector Space Classification

Document representations and measures of relatedness in vector spaces
Rocchio classification
k nearest neighbor

Time complexity and optimality of kNN

Linear versus nonlinear classifiers
Classification with more than two classes
The bias-variance tradeoff
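Rocchio classification and k nearest neighbor, both listed above, can be sketched in a few lines. This version assumes dense vectors and Euclidean distance, with made-up 2D training points. The helper names are illustrative.

```python
import math
from collections import Counter

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rocchio_train(training):
    """Training reduces each class to one centroid."""
    by_class = {}
    for vec, label in training:
        by_class.setdefault(label, []).append(vec)
    return {label: centroid(vecs) for label, vecs in by_class.items()}

def rocchio_classify(centroids, vec):
    """Assign the class whose centroid is closest."""
    return min(centroids, key=lambda c: euclidean(centroids[c], vec))

def knn_classify(training, vec, k=3):
    """Majority vote among the k closest training points (linear scan)."""
    nearest = sorted(training, key=lambda item: euclidean(item[0], vec))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training points in two well-separated classes.
training_points = [
    ([1.0, 1.0], "a"), ([1.0, 2.0], "a"), ([2.0, 1.0], "a"),
    ([6.0, 6.0], "b"), ([7.0, 6.0], "b"), ([6.0, 7.0], "b"),
]
centroids = rocchio_train(training_points)
rocchio_label = rocchio_classify(centroids, [2.0, 2.0])
knn_label = knn_classify(training_points, [6.5, 6.5], k=3)
```

The difference in cost is visible here: Rocchio compares a test point with one centroid per class, while this kNN compares it with every training point.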

IIR - Chapter 16

Flat clustering

Clustering in information retrieval

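The cluster hypothesis states the fundamental assumption we make when using clustering in information retrieval: documents in the same cluster behave similarly with respect to relevance to information needs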

Problem statement: Given (i) a set of documents, (ii) a desired number of clusters K, and (iii) an objective function that evaluates the quality of a clustering, we want to compute an assignment of documents to clusters that minimizes (or, in other cases, maximizes) the objective function

Cardinality - the number of clusters

The evaluation of clustering
K-means

Cluster cardinality in K-means

Model-based clustering
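A minimal K-means sketch tying the problem statement to an algorithm. The objective here is the residual sum of squares (RSS), which the loop tries to reduce by alternating reassignment and centroid recomputation. It assumes the initial seeds are passed in explicitly. The function names and the points are made up for illustration.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def kmeans(points, seeds, max_iter=100):
    """Alternate reassignment and recomputation until assignments stop changing."""
    centroids = [list(s) for s in seeds]
    assignment = None
    for _ in range(max_iter):
        # Reassignment: each point goes to its closest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Recomputation: each centroid becomes the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = mean(members)
    return assignment, centroids

def rss(points, assignment, centroids):
    """Residual sum of squares: the objective K-means tries to minimize."""
    return sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignment))

# Hypothetical 2D points forming two groups; K = 2 is given by the two seeds.
points = [[1.0, 1.0], [1.0, 2.0], [2.0, 1.0], [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]]
assignment, final_centroids = kmeans(points, seeds=[[1.0, 1.0], [8.0, 8.0]])
objective = rss(points, assignment, final_centroids)
```

The result depends on the seeds, and K-means is only guaranteed to reach a local optimum of RSS, so in practice several seed choices are often tried.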

IIR - Chapter 17

Hierarchical clustering

Flat clustering is efficient but has some drawbacks
Hierarchical agglomerative clustering
Single-link and complete-link clustering

The complexity of the naive HAC algorithm is O(n^3)
Group-average agglomerative clustering
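A sketch of the naive HAC loop for single-link and complete-link. It assumes a precomputed symmetric similarity matrix and recomputes cluster similarities from scratch at every merge, which is where the cubic cost comes from. The function name and the example points are made up for illustration.

```python
def naive_hac(sim, linkage="single"):
    """Merge clusters bottom-up; returns a list of (kept_id, merged_id, similarity)."""
    # Single-link uses the most similar pair of members, complete-link the least similar.
    combine = max if linkage == "single" else min
    clusters = {i: [i] for i in range(len(sim))}
    merges = []
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for x in range(len(ids)):
            for y in range(x + 1, len(ids)):
                a, b = ids[x], ids[y]
                s = combine(sim[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or s > best[0]:
                    best = (s, a, b)
        s, a, b = best
        clusters[a] = clusters[a] + clusters.pop(b)
        merges.append((a, b, s))
    return merges

# Hypothetical points on a line; similarity is negative distance.
positions = [0.0, 1.0, 2.5, 4.5]
sim = [[-abs(p - q) for q in positions] for p in positions]
single_merges = naive_hac(sim, "single")
complete_merges = naive_hac(sim, "complete")
```

On these points single-link chains point 2 onto the first cluster, while complete-link pairs points 2 and 3 first, which shows how the two criteria can produce different hierarchies.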

Centroid Clustering

The similarity of two clusters is defined as the similarity of their centroids
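A tiny worked example of centroid similarity, assuming the dot product as the similarity measure. By linearity of the dot product, the similarity of the two centroids equals the average similarity over all pairs of documents drawn from different clusters. The vectors are made up.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def centroid(vectors):
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# Two hypothetical clusters of 2D document vectors.
cluster_a = [[1.0, 0.0], [0.0, 1.0]]
cluster_b = [[1.0, 1.0], [3.0, 1.0]]

# Similarity of the clusters = similarity of their centroids.
centroid_sim = dot(centroid(cluster_a), centroid(cluster_b))

# Same value, computed as the mean over all cross-cluster pairs.
pair_sims = [dot(a, b) for a in cluster_a for b in cluster_b]
average_pair_sim = sum(pair_sims) / len(pair_sims)
```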

Optimality of HAC

Single-link
GAAC
complete-link

Divisive clustering

A cluster hierarchy can also be generated top-down

Cluster Labeling

When human users interact with clusters, the clusters need labels

Implementation notes

Most problems that require the computation of a large number of dot products benefit from an inverted index
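One common way to avoid computing every dot product from scratch is an inverted index over sparse document vectors: only documents that share a term with the query vector are ever touched. The sketch below assumes documents are dicts mapping terms to weights. The function names and data are illustrative.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a postings list of (doc_id, weight)."""
    index = defaultdict(list)
    for doc_id, vec in enumerate(docs):
        for term, weight in vec.items():
            index[term].append((doc_id, weight))
    return index

def dot_products(index, query):
    """Accumulate dot products term by term, visiting only matching documents."""
    scores = defaultdict(float)
    for term, q_weight in query.items():
        for doc_id, weight in index.get(term, []):
            scores[doc_id] += q_weight * weight
    return dict(scores)

# Hypothetical sparse document vectors.
sparse_docs = [
    {"a": 1.0, "b": 2.0},
    {"b": 1.0, "c": 3.0},
    {"d": 4.0},
]
index = build_inverted_index(sparse_docs)
scores = dot_products(index, {"b": 2.0, "c": 1.0})
```

Documents that share no term with the query never appear in `scores`, so their dot product of zero costs nothing to compute.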
