- Text Classification and Naive Bayes
- The Text Classification Problem
- Supervised learning
- Naive Bayes Text Classification
- A probabilistic learning method
- Relation to Multinomial Unigram Language Model
- Formally identical; it's a special case
- The Bernoulli Model
- Equivalent to the binary independence model
- Properties of Naive Bayes
- An alternative formalization of the multinomial model represents each document as an M-dimensional vector of term counts
- Feature Selection
- The process of selecting a subset of the terms occurring in the training set
- Mutual information
- χ² feature selection
- Frequency-based feature selection
- Feature selection for multiple classifiers
- Mutual information and χ² represent rather different feature selection methods
- Evaluation of Text Classification
- The classic Reuters-21578 collection was the main benchmark for text classification evaluation
- We can measure recall, precision, and accuracy
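
The multinomial Naive Bayes classifier outlined above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes add-one (Laplace) smoothing and whitespace-tokenized documents, and the names `train_multinomial_nb`, `classify`, and the toy `TRAIN` data are made up for this example.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs):
    """docs: list of (tokens, label). Returns (vocab, log_prior, log_cond)."""
    vocab = {t for tokens, _ in docs for t in tokens}
    docs_per_class = Counter(label for _, label in docs)
    term_counts = defaultdict(Counter)
    for tokens, label in docs:
        term_counts[label].update(tokens)

    log_prior, log_cond = {}, {}
    for c, n_c in docs_per_class.items():
        # Prior: fraction of training documents in class c.
        log_prior[c] = math.log(n_c / len(docs))
        # Add-one smoothing (an assumption of this sketch).
        total = sum(term_counts[c].values()) + len(vocab)
        log_cond[c] = {t: math.log((term_counts[c][t] + 1) / total) for t in vocab}
    return vocab, log_prior, log_cond


def classify(tokens, vocab, log_prior, log_cond):
    """Return the class with the highest log posterior score."""
    scores = {
        c: log_prior[c] + sum(log_cond[c][t] for t in tokens if t in vocab)
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Toy training set, purely illustrative.
TRAIN = [
    ("Chinese Beijing Chinese".split(), "china"),
    ("Chinese Chinese Shanghai".split(), "china"),
    ("Chinese Macao".split(), "china"),
    ("Tokyo Japan Chinese".split(), "other"),
]
model = train_multinomial_nb(TRAIN)
print(classify("Chinese Chinese Chinese Tokyo Japan".split(), *model))
```

Scores are summed in log space to avoid floating-point underflow on long documents; tokens outside the training vocabulary are simply skipped.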
IIR - Chapter 14
- Vector Space Classification
- Document representations and measures of relatedness in vector spaces
- Rocchio classification
- k nearest neighbor
- Time complexity and optimality of kNN
- Linear versus nonlinear classifiers
- Classification with more than two classes
- The bias-variance tradeoff
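
To make the two vector space classifiers listed above concrete, here is a small sketch of Rocchio (nearest centroid) and kNN (majority vote among the k most similar training vectors). It assumes dense, length-normalized vectors so that the dot product equals cosine similarity; all function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def rocchio_train(vectors, labels):
    """Compute one centroid per class as the mean of its training vectors."""
    counts = Counter(labels)
    sums = {}
    for v, c in zip(vectors, labels):
        acc = sums.setdefault(c, [0.0] * len(v))
        for i, x in enumerate(v):
            acc[i] += x
    return {c: [x / counts[c] for x in acc] for c, acc in sums.items()}


def rocchio_classify(centroids, v):
    """Assign v to the class whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda c: math.dist(centroids[c], v))


def knn_classify(vectors, labels, v, k=3):
    """Majority vote among the k training vectors most similar to v."""
    ranked = sorted(zip(vectors, labels), key=lambda p: dot(p[0], v), reverse=True)
    votes = Counter(c for _, c in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy two-dimensional data, purely illustrative.
raw = [(1.0, 0.1), (0.9, 0.2), (1.0, 0.0), (0.1, 1.0), (0.0, 0.9), (0.2, 1.0)]
labels = ["a", "a", "a", "b", "b", "b"]
vectors = [normalize(v) for v in raw]
centroids = rocchio_train(vectors, labels)
query = normalize([0.8, 0.3])
print(rocchio_classify(centroids, query), knn_classify(vectors, labels, query, k=3))
```

Rocchio does all its work at training time and compares a query against one centroid per class, whereas this kNN sketch scans every training vector at classification time, which is the cost the time-complexity bullet refers to.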
IIR - Chapter 16
- Flat clustering
- Clustering in information retrieval
- The cluster hypothesis states the fundamental assumption we make when using clustering in information retrieval
- Problem statement: given (i) a set of documents, (ii) a desired number of clusters K, and (iii) an objective function that evaluates the quality of a clustering, we want to compute an assignment that minimizes the objective function
- Cardinality - the number of clusters
- The evaluation of clustering
- K-means
- Cluster cardinality in K-means
- Model-based clustering
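
The K-means item above can be illustrated with a short sketch. It assumes Euclidean distance, explicitly supplied seed centroids (so the run is deterministic), and the residual sum of squares (RSS) as the objective; the names `kmeans` and `rss` and the toy points are invented for this example.

```python
import math


def kmeans(points, seeds, max_iter=100):
    """Alternate reassignment and centroid recomputation until stable."""
    centroids = [list(s) for s in seeds]
    assignment = None
    for _ in range(max_iter):
        # Reassignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # No point changed cluster, so the algorithm has converged.
        assignment = new_assignment
        # Recomputation step: each centroid becomes the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # Keep the old centroid if a cluster is empty.
                centroids[k] = [sum(xs) / len(members) for xs in zip(*members)]
    return assignment, centroids


def rss(points, assignment, centroids):
    """Residual sum of squares: squared distance of each point to its centroid."""
    return sum(math.dist(p, centroids[a]) ** 2 for p, a in zip(points, assignment))


# Toy data with two obvious groups, purely illustrative.
points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 9), (8, 9)]
assignment, centroids = kmeans(points, seeds=[points[0], points[3]])
print(assignment, rss(points, assignment, centroids))
```

One rough way to explore cluster cardinality is to rerun this for several values of K and watch how RSS falls, keeping in mind that RSS alone always favors larger K.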
IIR - Chapter 17
- Hierarchical clustering
- Flat clustering is efficient but has some drawbacks
- Hierarchical agglomerative clustering
- Single-link and complete-link clustering
- The complexity of the naive HAC algorithm is O(n^3)
- Group-average agglomerative clustering
- Centroid Clustering
- The similarity of two clusters is defined as the similarity of their centroids
- Optimality of HAC
- Single-link
- GAAC
- Complete-link
- Divisive clustering
- Cluster hierarchy can be generated top down
- Cluster Labeling
- When human users interact with clusters, the clusters need labels
- Implementation notes
- Most problems that require computing a large number of dot products benefit from an inverted index
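
As a rough illustration of the naive HAC procedure and the single-link and complete-link criteria listed above, the sketch below repeatedly merges the most similar pair of clusters. It assumes a precomputed symmetric similarity matrix; the function name `hac` and the toy matrix are hypothetical, and no attempt is made at the more efficient priority-queue variants.

```python
def hac(sim, linkage="single"):
    """Naive agglomerative clustering over an N x N similarity matrix.

    Returns the merge history as (kept_cluster, absorbed_cluster, similarity).
    """
    # Single-link: similarity of the two most similar members.
    # Complete-link: similarity of the two least similar members.
    combine = max if linkage == "single" else min
    clusters = {i: [i] for i in range(len(sim))}
    merges = []
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                s = combine(sim[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or s > best[2]:
                    best = (a, b, s)
        a, b, s = best
        # Merge cluster b into cluster a and record the step.
        clusters[a] = clusters[a] + clusters.pop(b)
        merges.append((a, b, s))
    return merges


# Toy similarity matrix for four documents, purely illustrative.
sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
print(hac(sim, "single"))
print(hac(sim, "complete"))
```

Both criteria make the same first two merges on this matrix and differ only in the similarity of the final merge, since single-link looks at the closest pair across clusters and complete-link at the farthest.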