Saturday, April 12, 2014

Reading Notes - Unit 13

IIR - Chapter 13


  • Text Classification and Naive Bayes
    • The Text Classification Problem
      • Supervised learning
    • Naive Bayes Text Classification
      • A probabilistic learning method
    • Relation to Multinomial Unigram Language Model
      • Formally identical; it's a special case
    • The Bernoulli Model
      • Equivalent to the binary independence model
    • Properties of Naive Bayes
      • An alternative formalization of the multinomial model represents each document
    • Feature Selection
      • The process of selecting a subset of the terms occurring in the training set
      • Mutual information
      • χ² (chi-square) feature selection
      • Frequency-based feature selection
      • Feature selection for multiple classifiers
      • Mutual information and χ² represent rather different feature selection methods
    • Evaluation of Text Classification
      • The classic Reuters-21578 collection was the main benchmark for text classification evaluation
      • We can measure recall, precision, and accuracy
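The multinomial model above can be sketched in a few lines. This is a minimal illustration under add-one (Laplace) smoothing, not the book's reference implementation; the function names are my own, and the tiny training set below is in the spirit of the book's small China/not-China worked example.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs):
    """Train a multinomial Naive Bayes classifier.
    docs: list of (class_label, token_list) pairs."""
    vocab = set()
    class_docs = Counter()               # number of training docs per class
    term_counts = defaultdict(Counter)   # per-class term frequencies
    for label, tokens in docs:
        class_docs[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    n_docs = len(docs)
    priors, cond = {}, {}
    for label in class_docs:
        priors[label] = math.log(class_docs[label] / n_docs)
        total = sum(term_counts[label].values())
        # add-one smoothing over the vocabulary avoids zero probabilities
        cond[label] = {t: math.log((term_counts[label][t] + 1) / (total + len(vocab)))
                       for t in vocab}
    return priors, cond

def classify(priors, cond, tokens):
    """Pick the class maximizing log P(c) + sum of log P(t|c);
    terms unseen in training are simply skipped."""
    return max(priors, key=lambda c: priors[c] +
               sum(cond[c][t] for t in tokens if t in cond[c]))
```

Working in log space keeps the product of many small probabilities from underflowing, which is why the scores are sums of logs rather than products.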


IIR - Chapter 14


  • Vector Space Classification
    • Document representations and measures of relatedness in vector spaces
    • Rocchio classification
    • k nearest neighbor
      • Time complexity and optimality of kNN
    • Linear versus nonlinear classifiers
    • Classification with more than two classes
    • The bias-variance tradeoff
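The kNN idea from the outline can be sketched as a minimal cosine-similarity classifier over term-frequency vectors. A sketch only; the function names are my own, and a real implementation would use tf-idf weights and an inverted index rather than scoring every training document.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors (dicts)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(train, tokens, k=3):
    """Majority vote among the k training documents most similar to the query.
    train: list of (label, token_list) pairs."""
    q = Counter(tokens)
    sims = sorted(((cosine(Counter(toks), q), label) for label, toks in train),
                  reverse=True)
    votes = Counter(label for _, label in sims[:k])
    return votes.most_common(1)[0][0]
```

The time complexity point in the outline is visible here: classifying one query scores all training documents, so naive kNN costs O(n) per query.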


IIR - Chapter 16


  • Flat clustering
    • Clustering in information retrieval
      • States the fundamental assumption we make when using clustering in information retrieval
    • Problem statement: Given (i) a set of documents, (ii) a desired number of clusters K, and (iii) an objective function that evaluates the quality of a clustering, we want to compute an assignment of documents to clusters that minimizes (or maximizes) the objective function
      • Cardinality - the number of clusters
    • The evaluation of clustering
    • K-means
      • Cluster cardinality in K-means
    • Model-based clustering


IIR - Chapter 17


  • Hierarchical clustering
    • Flat clustering is efficient but has some drawbacks (unstructured output, a prespecified number of clusters) that hierarchical clustering addresses
    • Hierarchical agglomerative clustering
    • Single-link and complete-link clustering
      • The complexity of the naive HAC algorithm is O(n^3)
      • Group-average agglomerative clustering
    • Centroid Clustering
      • The similarity of two clusters is defined as the similarity of their centroids
    • Optimality of HAC
      • Single-link
      • GAAC
      • complete-link
    • Divisive clustering
      • Cluster hierarchy can be generated top down
    • Cluster Labeling
      • Human users interact with clusters, so the clusters need labels
    • Implementation notes
      • Problems require the computation of a large number of dot products
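The O(n^3) cost of the naive HAC algorithm noted above is easy to see in a sketch: each merge rescans all pairs of clusters. Below is a minimal single-link version on 1-D points (my own illustration, not the book's algorithm; real HAC works on document similarity matrices):

```python
def single_link_hac(points, num_clusters):
    """Naive single-link agglomerative clustering on 1-D points, O(n^3).
    Repeatedly merges the two clusters whose closest members are nearest."""
    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single-link: distance between the closest pair of members
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))   # merge the closest pair of clusters
    return clusters
```

Swapping the `min` for a `max` in the inner loop would turn this into complete-link clustering; that one-line change is the whole difference between the two criteria.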



Wednesday, April 9, 2014

Friday, April 4, 2014

Reading Notes - Unit 12

Ahn et al. - Personalized Web Exploration with Task Models

     This paper is about personalized web search, specifically exploratory web search.  Exploratory searches are those that go beyond the typical "how many inches in a foot" type of search that seeks a simple answer.  This paper covers the testing of a tool the authors developed, called TaskSieve.  TaskSieve uses relevance feedback to offer the user personalized search results.

Pazzani & Billsus - Content-Based Recommendation Systems

     This paper is about content-based recommendation systems.  These systems are used every day, from web search to Amazon.com, as a way to help customers find other items they may enjoy and/or to help the retailer sell more products.  These systems usually rely on algorithms that analyze a user's prior history, though sometimes they also ask the user to enter the information directly.

Gauch et al. - User Profiles for Personalized Information Access

     This paper dovetails nicely with the previous two of this week.  It covers the profiling of users, including methods for user identification and other data-collection techniques.  The paper looks at the need of companies and projects to have access to more specific information about their customers and participants.  It is interesting that they only briefly touch on the privacy implications of all the interesting facts that you can glean from the user, both implicitly and explicitly.

Tuesday, April 1, 2014

Muddiest Points - Unit 11

Wouldn't it be easier to translate documents ahead of time and then search them, rather than trying to dynamically search documents with queries in a different language?

Thursday, March 27, 2014

Reading Notes - Unit 11

*A note to our regular readers: Oard & Diekema's paper, Cross-Language Information Retrieval, could not be retrieved anywhere.

IES Chapter 14

This chapter looks at ways of dealing with massive amounts of data, at the scale of a search engine like Google.

  • Parallel Query Processing - using index partitioning & replication
    • Document Partitioning
      • Each server has a subset of the documents
    • Term Partitioning
      • Each server has a subset of the index in memory
    • Hybrid Schemes
      • Use term & document partitioning
      • OR use document partitioning and replication
    • Redundancy & Fault Tolerance
      • We must assume that with more queries and machines there will be faults
      • We can handle this with replication and partial replication
  • MapReduce
    • The Basic Framework
      • Highly parallelizable: many map tasks and many reduce tasks can run at the same time on different machines
    • Combiners
      • A reduce function applied to a map shard and not a reduce shard
    • Secondary Keys
      • A function of MapReduce that deals with duplicate keys
    • Machine Failures
      • The map side deals with failure better than the reduce side
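The framework above, combiner included, can be sketched with the classic word-count example.  This is a single-process simulation of the data flow only; a real MapReduce spreads the map and reduce tasks across many machines, and the shuffle happens over the network.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Map: emit a (term, 1) pair for every token in the document."""
    return [(term, 1) for term in text.split()]

def combine(pairs):
    """Combiner: a reduce run locally on one map shard, shrinking
    the data that must be shuffled to the reduce shards."""
    partial = defaultdict(int)
    for term, count in pairs:
        partial[term] += count
    return list(partial.items())

def reduce_phase(term, counts):
    """Reduce: sum the partial counts that arrived for one key."""
    return term, sum(counts)

def map_reduce(docs):
    """docs: dict of doc_id -> text; returns term -> total count."""
    shuffle = defaultdict(list)          # groups emitted values by key
    for doc_id, text in docs.items():
        for term, count in combine(map_phase(doc_id, text)):
            shuffle[term].append(count)
    return dict(reduce_phase(t, c) for t, c in shuffle.items())
```

The combiner is worth the extra step because word frequencies are highly skewed: collapsing repeated (term, 1) pairs on the map side can cut shuffle traffic dramatically.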

He & Wang - Cross-Language Information Retrieval

     This excerpt from a book looks at the challenges of taking queries in one language and returning relevant information that is in another.  Currently no search engine supports this, even Google.  First, the system must decide how it will translate the given query.  All the usual techniques apply, such as stemming, tokenization, phrase identification, stop-word removal, n-grams, etc.  Then the application must have translation knowledge.  This can come from bilingual dictionaries or corpora, but the system must also be able to deal with acronyms and proper nouns.  The chapter then goes into how to use this knowledge to find the documents that best suit the user, using term weighting.  Finally, the system must have some method for being evaluated.  There are a few evaluation frameworks available, such as the CLIR TREC track.
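Dictionary-based query translation, the simplest source of translation knowledge mentioned above, can be sketched as follows.  The tiny bilingual dictionary is made up for illustration; note how a term with several translations expands the query, and how out-of-vocabulary terms (acronyms, proper nouns) pass through unchanged.

```python
def translate_query(terms, dictionary):
    """Replace each source-language term with all of its target-language
    alternatives; untranslatable terms (e.g. the acronym "NASA") are
    passed through unchanged rather than dropped."""
    out = []
    for t in terms:
        out.extend(dictionary.get(t, [t]))
    return out

# hypothetical Spanish->English dictionary, purely for illustration
bilingual = {"gato": ["cat"], "casa": ["house", "home"]}
```

A real system would then weight the alternatives (e.g. by translation probability) rather than treating every expansion equally, which connects to the term-weighting discussion in the chapter.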


Tuesday, March 25, 2014