Saturday, April 12, 2014

Reading Notes - Unit 13

IIR - Chapter 13


  • Text Classification and Naive Bayes
    • The Text Classification Problem
      • Supervised learning
    • Naive Bayes Text Classification
      • A probabilistic learning method
    • Relation to Multinomial Unigram Language Model
      • Formally identical; Naive Bayes is a special case of the multinomial unigram language model
    • The Bernoulli Model
      • Equivalent to the binary independence model
    • Properties of Naive Bayes
      • An alternative formalization of the multinomial model represents each document as a vector of term counts
    • Feature Selection
      • The process of selecting a subset of the terms occurring in the training set
      • Mutual information
      • χ² feature selection
      • Frequency-based feature selection
      • Feature selection for multiple classifiers
      • Mutual information and χ² represent rather different feature selection methods
    • Evaluation of Text Classification
      • The classic Reuters-21578 collection was the main benchmark for text classification evaluation
      • We can measure recall, precision, and accuracy
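
The multinomial model outlined above fits in a few lines of Python. This is a hypothetical minimal sketch (the function names are my own, not the book's pseudocode), with add-one (Laplace) smoothing and log-space scoring to avoid underflow:

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs):
    """docs: list of (label, token_list) pairs."""
    doc_count = defaultdict(int)       # documents per class
    term_count = defaultdict(Counter)  # term frequencies per class
    vocab = set()
    for label, tokens in docs:
        doc_count[label] += 1
        term_count[label].update(tokens)
        vocab.update(tokens)
    priors = {c: doc_count[c] / len(docs) for c in doc_count}
    # add-one (Laplace) smoothing over the whole vocabulary
    cond = {c: {t: (counts[t] + 1) / (sum(counts.values()) + len(vocab))
                for t in vocab}
            for c, counts in term_count.items()}
    return priors, cond

def classify_nb(priors, cond, tokens):
    """Pick the class maximizing log P(c) + sum over tokens of log P(t|c)."""
    def score(c):
        return math.log(priors[c]) + sum(
            math.log(cond[c][t]) for t in tokens if t in cond[c])
    return max(priors, key=score)
```

On the chapter's China/Japan toy collection (three short documents labeled China, one labeled Japan), this sketch assigns the test document "Chinese Chinese Chinese Tokyo Japan" to the China class, because the three occurrences of "Chinese" outweigh "Tokyo" and "Japan".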


IIR - Chapter 14


  • Vector Space Classification
    • Document representations and measures of relatedness in vector spaces
    • Rocchio classification
    • k nearest neighbor
      • Time complexity and optimality of kNN
    • Linear versus nonlinear classifiers
    • Classification with more than two classes
    • The bias-variance tradeoff
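
The kNN idea from this chapter is easy to illustrate. A hypothetical minimal classifier over sparse term-count vectors with cosine similarity (names and representation are my own choices, not the book's):

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(train, tokens, k=3):
    """train: list of (label, token_list). Majority vote among the k
    training documents most similar to the query document."""
    q = Counter(tokens)
    sims = sorted(((cosine(Counter(toks), q), label) for label, toks in train),
                  reverse=True)
    votes = [label for _, label in sims[:k]]
    return Counter(votes).most_common(1)[0][0]
```

Note that "training" is just storing the documents; all the work happens at query time, which is why the chapter discusses the time complexity of kNN testing.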


IIR - Chapter 16


  • Flat clustering
    • Clustering in information retrieval
      • States the cluster hypothesis, the fundamental assumption we make when using clustering in information retrieval: documents in the same cluster behave similarly with respect to relevance
    • Problem statement: Given (i) a set of documents, (ii) a desired number of clusters K, and (iii) an objective function that evaluates the quality of a clustering, we want to compute an assignment that minimizes the objective function
      • Cardinality - the number of clusters
    • The evaluation of clustering
    • K-means
      • Cluster cardinality in K-means
    • Model-based clustering
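
K-means (Lloyd's algorithm) alternates between reassigning each vector to its nearest centroid and recomputing each centroid as the mean of its cluster. A minimal sketch, assuming dense vectors and a fixed iteration count rather than a convergence test:

```python
import random

def kmeans(vectors, k, iters=20, seed=0):
    """Lloyd's algorithm: seed centroids with k random vectors, then
    alternate (1) assign to nearest centroid, (2) recompute means."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            # nearest centroid by squared Euclidean distance
            j = min(range(k), key=lambda i: sum(
                (x - c) ** 2 for x, c in zip(v, centroids[i])))
            clusters[j].append(v)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster empties
                centroids[i] = [sum(xs) / len(members) for xs in zip(*members)]
    return centroids, clusters
```

The random seeding makes the result depend on initialization, which is one reason the chapter discusses objective functions and choosing cluster cardinality.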


IIR - Chapter 17


  • Hierarchical clustering
    • Flat clustering is efficient but has some drawbacks: its output is unstructured and the number of clusters must be specified in advance
    • Hierarchical agglomerative clustering
    • Single-link and complete-link clustering
      • The complexity of the naive HAC algorithm is O(n^3)
      • Group-average agglomerative clustering
    • Centroid Clustering
      • The similarity of two clusters is defined as the similarity of their centroids
    • Optimality of HAC
      • Single-link
      • GAAC
      • Complete-link
    • Divisive clustering
      • Cluster hierarchy can be generated top down
    • Cluster Labeling
      • When human users interact with clusters, the clusters need informative labels
    • Implementation notes
      • Problems require the computation of a large number of dot products
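
The naive O(n^3) HAC algorithm noted above can be sketched directly. This hypothetical version uses the single-link criterion (similarity of two clusters = similarity of their closest pair of members) with squared Euclidean distance, and stops at a requested number of clusters:

```python
def single_link_hac(points, num_clusters):
    """Naive HAC: start with singleton clusters, repeatedly merge the
    pair of clusters whose closest members are nearest (single-link).
    Each merge step scans all cluster pairs, hence O(n^3) overall."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        best = None  # (distance, i, j) of the best pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist2(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    return clusters
```

Swapping the inner `min` for `max` (farthest pair) would give complete-link clustering instead; the priority-queue formulations discussed in the chapter exist precisely to avoid this repeated full scan.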


