Tuesday, April 15, 2014
Saturday, April 12, 2014
Reading Notes - Unit 13
IIR - Chapter 13
IIR - Chapter 14
IIR - Chapter 16
IIR - Chapter 17
- Text Classification and Naive Bayes
- The Text Classification Problem
- Supervised learning
- Naive Bayes Text Classification
- A probabilistic learning method
- Relation to Multinomial Unigram Language Model
- Formally identical; it's a special case
- The Bernoulli Model
- Equivalent to the binary independence model
- Properties of Naive Bayes
- An alternative formalization of the multinomial model represents each document
- Feature Selection
- The process of selecting a subset of the terms occurring in the training set
- Mutual information
- χ² feature selection
- Frequency-based feature selection
- Feature selection for multiple classifiers
- Mutual information and χ² represent rather different feature selection methods
- Evaluation of Text Classification
- The classic Reuters-21578 collection was the main benchmark for text classification evaluation
- We can measure recall, precision, and accuracy
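The multinomial Naive Bayes model is simple enough to sketch in a few lines of Python. This is my own toy version (add-one smoothing, log-space scoring), not code from the book:

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs):
    """docs: list of (tokens, label). Returns class priors and
    add-one-smoothed conditional term probabilities."""
    class_docs = defaultdict(int)
    class_tokens = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_docs[label] += 1
        class_tokens[label].update(tokens)
        vocab.update(tokens)
    n = sum(class_docs.values())
    priors = {c: class_docs[c] / n for c in class_docs}
    cond = {}
    for c in class_docs:
        total = sum(class_tokens[c].values()) + len(vocab)
        cond[c] = {t: (class_tokens[c][t] + 1) / total for t in vocab}
    return priors, cond, vocab

def classify(tokens, priors, cond, vocab):
    """Pick the class maximizing log P(c) + sum of log P(t|c)."""
    def score(c):
        s = math.log(priors[c])
        for t in tokens:
            if t in vocab:          # terms unseen in training are ignored
                s += math.log(cond[c][t])
        return s
    return max(priors, key=score)
```

Fed the chapter's little China/Japan training example, this classifies "Chinese Chinese Chinese Tokyo Japan" as the China class, matching the book's worked answer.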
IIR - Chapter 14
- Vector Space Classification
- Document representations and measures of relatedness in vector spaces
- Rocchio classification
- k nearest neighbor
- Time complexity and optimality of kNN
- Linear versus nonlinear classifiers
- Classification with more than two classes
- The bias-variance tradeoff
IIR - Chapter 16
- Flat clustering
- Clustering in information retrieval
- States the fundamental assumption we make when using clustering in information retrieval (the cluster hypothesis)
- Problem statement: Given (i) a set of documents, (ii) a desired number of clusters K, and (iii) an objective function that evaluates the quality of a clustering, we want to compute an assignment that minimizes the objective function
- Cardinality - the number of clusters
- The evaluation of clustering
- K-means
- Cluster cardinality in K-means
- Model-based clustering
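K-means is short enough to sketch directly. Here's my own toy pure-Python version of Lloyd's algorithm (a real implementation would add convergence checks and smarter seeding, e.g. k-means++):

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points (tuples)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm: alternate (1) assign each point to its nearest
    centroid and (2) recompute centroids, driving down the RSS objective."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old centroid if a cluster goes empty
                centroids[i] = tuple(sum(xs) / len(cluster) for xs in zip(*cluster))
    return centroids, clusters
```

On two well-separated blobs of points it recovers the blobs as the two clusters.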
IIR - Chapter 17
- Hierarchical clustering
- Flat clustering is efficient but has some drawbacks
- Hierarchical agglomerative clustering
- Single-link and complete-link clustering
- The complexity of the naive HAC algorithm is O(n^3)
- Group-average agglomerative clustering
- Centroid Clustering
- The similarity of two clusters is defined as the similarity of their centroids
- Optimality of HAC
- Single-link
- GAAC
- complete-link
- Divisive clustering
- Cluster hierarchy can be generated top down
- Cluster Labeling
- Human users interact with clusters, so the clusters need labeling
- Implementation notes
- Problems require the computation of a large number of dot products
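The naive O(n^3) HAC algorithm mentioned above is easy to write down. A toy single-link version of my own (recompute all pairwise cluster distances on every merge, stop at k clusters):

```python
def dist(a, b):
    """Euclidean distance between two points (tuples)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def single_link_hac(points, k):
    """Naive agglomerative clustering: start with singletons and repeatedly
    merge the two clusters with the smallest single-link (minimum pairwise)
    distance. Rescanning all cluster pairs per merge is what makes it O(n^3)."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    return clusters
```

Swapping min for max in the pairwise distance gives complete-link instead of single-link.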
Wednesday, April 9, 2014
Friday, April 4, 2014
Reading Notes - Unit 12
Ahn et al. - Personalized Web Exploration with Task Models
This paper is about personalized web search, specifically exploratory web search. Exploratory searches are those that go beyond the typical "how many inches in a foot" type searches that seek a simple answer. This paper covers the testing of a tool the authors came up with called TaskSieve. TaskSieve uses relevance feedback to offer the user personalized search.
Pazzani & Billsus - Content-Based Recommendation Systems
This paper is about content-based recommendation systems. These systems are used everyday from web search to Amazon.com as a way to help the customer find other items they may enjoy and/or to help the retailer sell more product. These systems are usually helped by algorithms that analyze a user's prior history, though sometimes they also have the user enter the information too.
Gauch et al. - User Profiles for Personalized Information Access
This paper dovetails nicely with the previous two of this week. It covers the profiling of users, including methods for user identification and other collection techniques. The paper looks at the need of companies and projects to have access to more specific information about their customers and participants. It is interesting that they only briefly touch on the privacy implications of all the interesting facts that you can glean from the user, both implicitly and explicitly.
Tuesday, April 1, 2014
Muddiest Points - Unit 11
It seems that an easier task would be to translate documents and then search them, rather than trying to dynamically search documents with queries in a different language?
Thursday, March 27, 2014
Reading Notes - Unit 11
*A note to our regular readers: Oard & Diekema's paper Cross-Language Information Retrieval cannot be retrieved anywhere.
IES Chapter 14
Dealing with large amounts of data, like Google. This chapter looks at ways of dealing with the massive amount of data
- Parallel Query Processing - using index partitioning & replication
- Document Partitioning
- Each server has a subset of the documents
- Term Partitioning
- Each server has a subset of the index in memory
- Hybrid Schemes
- Use term & document partitioning
- OR use document partitioning and replication
- Redundancy & Fault Tolerance
- We must assume that with more queries and machines there will be faults
- We can handle this with replication and partial replication
- MapReduce
- The Basic Framework
- Highly parallelizable, since many map and reduce tasks execute at the same time
- Combiners
- A reduce function applied to a map shard and not a reduce shard
- Secondary Keys
- A function of MapReduce that deals with duplicate keys
- Machine Failures
- the map side deals with failure better than the reduce side
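To make the map/shuffle/reduce phases concrete, here is a toy single-process word count I sketched in Python — a real MapReduce runs the mappers and reducers on many machines and writes shards to disk, so treat this purely as an illustration of the data flow:

```python
from collections import defaultdict

def map_wordcount(doc_id, text):
    """Map phase: emit a (word, 1) pair for every token."""
    for word in text.split():
        yield (word, 1)

def reduce_sum(key, values):
    """Reduce phase: combine all values emitted for one key."""
    return sum(values)

def run_mapreduce(inputs, mapper, reducer):
    """Run all mappers, shuffle the emitted pairs by key, then run the
    reducer once per key -- all in one process, for illustration only."""
    shuffled = defaultdict(list)
    for item in inputs:
        for key, value in mapper(*item):
            shuffled[key].append(value)   # shuffle: group values by key
    return {key: reducer(key, values) for key, values in shuffled.items()}
```

A combiner, in these terms, would just be reduce_sum applied to each mapper's own output before the shuffle, cutting down the data that has to move between phases.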
He & Wang - Cross-Language Information Retrieval
This excerpt from a book looks at the challenges of taking queries in one language and returning relevant information that is in another. Currently no search engine supports this, not even Google. First, the system must decide how it will translate the given query, using all the usual techniques such as stemming, tokenization, phrase identification, stop-word removal, n-grams, etc. Then the application must have translation knowledge, which can come from bilingual dictionaries or corpora; the system must also be able to deal with acronyms and proper nouns. The chapter then goes into how to use this knowledge to find the documents that best suit the user, using term weighting. Finally, the system must have some method for being evaluated; there are a few evaluation frameworks available, such as the CLIR TREC track.
Tuesday, March 25, 2014
Friday, March 21, 2014
Reading Note - Unit 10
Brin & Page - The Anatomy of a Large-Scale Hypertextual Web Search Engine
This is the rock star of academic papers. This is what every paper hopes that it becomes. This isn't just a paper we can learn from.
This is history.
Whether you like Google or not, very few companies of its size can so clearly point to their beginning in a paper like this. Page and Brin discuss what have become the foundations for Google and web search. They talk about PageRank and how it uses a relatively simple algorithm to determine a page's importance by tracking the pages that link to it. There is then a review of related work that led them to PageRank. They then go on to discuss Google's architecture for storing and crawling the data and the lexicon. There is then a section of results and then conclusions. The last sentence is funny to look back on: "We hope Google will be a resource for searchers and researchers all around the world and will spark the next generation of search engine technology." If it had just been that.
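That "relatively simple algorithm" really is simple: PageRank is a power iteration over the link graph. Here's my own toy sketch (damping factor 0.85 as in the paper; the dangling-page handling is one common convention, not necessarily theirs):

```python
def pagerank(links, d=0.85, iters=50):
    """links: {page: [pages it links to]}. Each page passes fraction d of its
    rank to its outlinks; (1 - d) is spread uniformly (the random surfer)."""
    pages = set(links)
    for outs in links.values():
        pages.update(outs)
    n = len(pages)
    pr = {p: 1 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}
        for p in pages:
            outs = links.get(p, [])
            if outs:
                for q in outs:
                    new[q] += d * pr[p] / len(outs)
            else:  # dangling page: spread its rank over everyone
                for q in pages:
                    new[q] += d * pr[p] / n
        pr = new
    return pr
```

For a tiny graph like {"a": ["b"], "b": ["a", "c"], "c": ["a"]}, the scores sum to 1 and the most-linked-to page ("a") ends up ranked highest.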
Kleinberg - Authoritative Sources in a Hyperlinked Environment
This paper discusses web search. It would seem to come to the same conclusions as Brin & Page, but without the billion-dollar company: we can find good information on the web by looking to see who links to whom, but in this case the paper advocates identifying authoritative pages.
IIR Chapter 19
This might be the first chapter in this book that actually has some practical stuff in it. There is a brief discussion of the history of web search. Then it goes on to talk about the model of the Web and how it helped the Web grow so explosively. The web looks suspiciously like a graph. Then it talks about spam and how it was inevitable because "spam stems from the heterogeneity of motives in content creation on the Web." The chapter goes on to talk about the users of search and finishes up by dealing with duplicates.
IIR Chapter 20
This chapter deals with the structure of the web as dealt with briefly in chapter 19. This appears to be mostly a rehash of the two papers we read for this week.
Tuesday, March 4, 2014
Muddiest Notes - Unit 8
No unclear points this week; the lecture was a really great summary of what we've done up to this point, and it really pulled it all together.
Friday, February 28, 2014
Reading Note - Unit 8
*Note: somehow the reading note for Unit 6 got republished or was saved as a draft...not sure what happened
MIR Chapter 10
This chapter is about how we can present information to the user and how the user interacts with the search system. First it considers human-computer interaction: offer informative feedback, reduce working-memory load, provide alternative interfaces for novice and expert users, and use the capabilities of modern computers to display information visually. We must also consider how the user will access the information. Basically, this chapter covered the issues facing the search professional as we develop models to help users use what we find.
Hearst Chapter 1
This excerpt talks specifically about the design of search user interfaces. Keep it simple. Realize that users are no longer necessarily highly educated professionals with in-depth domain knowledge; we have to take into account how "usable" our interface is, and there is very good research that can guide us. We must give the user timely and useful feedback, but balance this with doing some things automatically. We can't overwhelm the user's short-term memory, and we must provide shortcuts or hints. Help the user to reduce errors. Don't forget the small things: it has to look good, too, on top of everything else.
Hearst Chapter 11
This excerpt talks specifically about visualization for text analysis. We can use graphics to show relationships. We can visualize concordances, like SeeSoft or tag clouds. We can also visualize relationships between documents, like citations in the scientific literature.
Reading Note - Unit 6
*A note to our regular readers, we skipped unit 5
First up we have chapter 8 of IIR
- Measuring the effectiveness of IR systems
- We need a test collection
- We need a set of test queries
- We need a set of relevance judgments as the book calls them
- Test collections for this purpose
- Cranfield collection
- TREC
- Put together by NIST
- GOV2
- Bigger version of TREC, also done by NIST
- Still two orders of magnitude smaller than what is indexed by web search engines
- NTCIR
- Focuses on East Asian languages
- CLEF
- European languages
- Reuters
- Newswires
- 20 Newsgroups
- Evaluating unranked retrieval results
- Precision
- fraction of documents retrieved that are relevant
- Recall
- fraction of relevant documents that are retrieved
- F-measure
- A single measure that uses both
- Evaluating ranked retrieval results
- Precision-recall curve
- Why not just use F-measure?
- Many other ways to evaluate results
- Developing reliable and informative test collections
- Using pooling of the top k documents and having them judged by experts
- User utility & the use of document relevance
- Satisfaction of the users is very important
- Maybe more so than whether an expert judges something relevant
- Results snippets
- Just like Google, we should give small snippets of the returned text for each ranked document
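Precision, recall, and the F-measure are one-liners once you have the retrieved and relevant sets. A quick sketch of my own (F_beta as the usual weighted harmonic mean; beta=1 gives the balanced F1):

```python
def precision_recall_f(retrieved, relevant, beta=1.0):
    """Return (precision, recall, F_beta) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)            # relevant docs we retrieved
    p = tp / len(retrieved) if retrieved else 0.0
    r = tp / len(relevant) if relevant else 0.0
    if p == 0.0 and r == 0.0:
        return p, r, 0.0
    b2 = beta * beta
    return p, r, (b2 + 1) * p * r / (b2 * p + r)
```

Retrieving {d1, d2, d3, d4} when {d1, d2, d5} are relevant gives precision 1/2, recall 2/3, and F1 = 4/7.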
Cumulated Gain-Based Evaluation of IR Techniques
This is a paper from 2002 that looks at several techniques for evaluating IR systems. It talks about recall and precision like Ch. 8 but attempts to go further. The first measure uses the relevance scores of the documents in the results; the second discounts "late-retrieved" documents; the third is used to compare the performance of different techniques. They used the TREC-7 data set. This paper would seem to be the basis for our ability to really test different IR systems using established methods.
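The discounting idea from the paper is easy to sketch. Here is my own Python version of (n)DCG: gains are graded relevance scores in rank order, and ranks at or past b get divided by log_b(rank). The exact parameterization is my reading of the method, so treat it as a sketch rather than the paper's reference implementation:

```python
import math

def dcg(gains, b=2):
    """Discounted cumulated gain: sum the graded relevance scores,
    dividing the gain at rank i >= b by log_b(i)."""
    total = 0.0
    for rank, gain in enumerate(gains, start=1):
        total += gain if rank < b else gain / math.log(rank, b)
    return total

def ndcg(gains, b=2):
    """Normalize by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(gains, reverse=True), b)
    return dcg(gains, b) / ideal if ideal else 0.0
```

A ranking with gains [3, 2, 3, 0, 1] scores lower than its ideal reordering [3, 3, 2, 1, 0], so its nDCG comes out below 1.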
What's the value of TREC
Tuesday, February 25, 2014
Friday, February 21, 2014
Reading Notes - Unit 7
IIR Chapter 9
- Relevance feedback and query expansion
- Relevance feedback & pseudo relevance feedback
- The Rocchio algorithm for relevance feedback
- classic algorithm for implementing feedback
- incorporates feedback into the vector space model
- Probabilistic relevance feedback
- Naive Bayes probabilistic model
- When does it work?
- relies on assumptions
- user must have enough knowledge
- On the web
- Excite tried full relevance feedback but it was dropped
- Evaluation of relevance feedback strategies
- 1-2 rounds of feedback
- Pseudo relevance feedback
- method for automatic local analysis
- Indirect relevance feedback
- also called implicit
- DirectHit used this
- Global method for query reformulation
- Vocab tools
- stemming
- thesaurus
- Query Expansion
- Users give feedback
- Automatic thesaurus generation
- analyze documents automatically
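The Rocchio update is just vector arithmetic, so here is my own sketch over sparse dict vectors, with commonly cited default weights (alpha=1, beta=0.75, gamma=0.15) and negative weights clipped to zero:

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """q' = alpha*q + beta*centroid(relevant) - gamma*centroid(nonrelevant).
    Vectors are sparse dicts mapping term -> weight."""
    new = {}
    def add(vec, weight):
        for term, w in vec.items():
            new[term] = new.get(term, 0.0) + weight * w
    add(query, alpha)
    for doc in relevant:
        add(doc, beta / len(relevant))
    for doc in nonrelevant:
        add(doc, -gamma / len(nonrelevant))
    # negative weights make no sense as query terms, so drop them
    return {term: w for term, w in new.items() if w > 0}
```

Marking a "jaguar car" document relevant and a "jaguar cat" document nonrelevant boosts "car", keeps "jaguar" on top, and drops "cat" entirely.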
Relevance Feedback Revisited
This study showed that relevance feedback worked. Specifically, it showed that by using query expansion and query reweighting you could show improvements of 100%. It also showed that the more the merrier when it came to rounds of relevance feedback.
A Study of Methods for Negative Relevance Feedback
All about negative relevance feedback models. They look at a bunch of models and found that the language-model-based ones were more effective than the others.
Improving the Effectiveness of Information Retrieval with Local Context Analysis
This paper is about local context analysis, a new method that the authors proposed. It worked with both English AND non-English collections, and they found it worked better than any other method they tested. They say they are going to continue this work and look into when to expand queries.
Muddiest Notes - Unit 6
Unit 6 was hard to understand in general; I'm still taking it all in. At this point I can't really say what the muddiest point was, so much as the whole thing. I guess my biggest problem would be how we determine what is "relevant". If it truly is by expert judgment or whatever the user deems relevant, that seems...poor.
Tuesday, February 4, 2014
Muddiest Note - Unit 4
This week was pretty straightforward. The only sticking point might be the thought of cosines of angles between vectors with millions of dimensions. This is just because I am a visual person.
Friday, January 31, 2014
Reading Notes - Unit 4
IIR Chap 1.3 - 1.4
These sections talk about Boolean search queries on an inverted index. Section 1.3 describes the best method for finding the answer: you take the posting list of each term and compute the intersection for AND and the union for OR. It also goes on to talk about dealing with nested Boolean operators. Section 1.4 compares Boolean search with ranked retrieval, for example Westlaw in the early '90s versus Google today.
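The AND intersection is the classic two-pointer merge over sorted postings lists. A quick sketch of my own, with doc IDs as sorted Python lists:

```python
def intersect(p1, p2):
    """Merge-style AND of two sorted postings lists of doc IDs.
    Each step advances at least one pointer: O(len(p1) + len(p2))."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # this doc contains both terms
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the pointer at the smaller ID
        else:
            j += 1
    return answer
```

OR works the same way, except every doc ID seen on either list goes into the answer.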
IIR Chapter 6
- Chapter 6 is about scoring and weighting documents, as opposed to the Boolean approach we've been using thus far
- Parametric & Zone indexes
- We use a parametric index that includes such items as date of creation
- Zone indexes cover areas like the title and abstract, whose contents are arbitrary free text rather than predefined values
- We use weighted zone scoring, which assigns a document-query pair a score on the interval [0,1]
- Scoring is "learned"
- Machine learning
- Weighting term importance
- Term frequency
- The more a term is used, the more likely the document is about it.
- However, we can use inverse document frequency to account for stop words which would occur a lot but are not important
- Basically it detects when a naturally rare word is occurring unusually frequently
- Vector Space Scoring
- Treat the document as a vector containing each of its dictionary terms
- Documents can be assessed against queries using the dot product
- Treating the documents as vectors
- Term Weighting for the vector space model
- Words that occur 20x more often shouldn't be 20x more important
- tf Normalization
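The chapter's weighting and scoring fit in a short sketch. This is my own toy version using the (1 + log10 tf) * log10(N/df) weighting and cosine similarity over sparse dict vectors — one of several tf-idf variants the chapter discusses, not the only one:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one sparse dict vector per doc
    with weights (1 + log10 tf) * log10(N / df)."""
    n = len(docs)
    df = Counter()                        # document frequency per term
    tfs = [Counter(doc) for doc in docs]
    for tf in tfs:
        df.update(tf.keys())
    return [{t: (1 + math.log10(c)) * math.log10(n / df[t])
             for t, c in tf.items()} for tf in tfs]

def cosine(u, v):
    """Cosine similarity between two sparse dict vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Note that a term appearing in every document gets idf 0 and contributes nothing — exactly the stop-word effect from the notes above.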
Tuesday, January 28, 2014
Muddiest Point - Unit 3
With ɣ codes, I understand how they work with the unary code + the offset. I also understand that they clearly help with compression. It's the jump between that I just don't understand.
Thursday, January 23, 2014
Reading Notes - Unit 3
IIR Chapter 4
- Chapter 4 is about Index construction
- A brief overview of the general speed of different operations that affect us
- seek time
- transfer time
- processor's clock rate
- low-level operations
- Because of hardware constraints external sorting algorithms are used
- Block sort-based indexing
- It does this in blocks to cut down on disk seeks
- Blocks are sorted individually then merged into one index
- Single-pass in-memory indexing
- Still unsure about how it is different from Block sort-based indexing
- Distributed indexing
- This is closest to real world, "Google-style" indexing
- MapReduce
- Controlled by master node
- Uses key-value pairs to break a big problem into small ones
- 2 phases
- Map phase
- maps input data to key-value pairs
- These are then parsed into segment files
- Reduce phase
- The inverters finish the job
- This is a safe method, because the master node coordinates and, if necessary, reassigns jobs from nodes that are taking too long or have failed
- Dynamic indexing
- Used for constantly changing collections of documents
- This doesn't include ranked-retrieval systems.
- Briefly mentions access control lists
IIR Chapter 5
- Compression
- Does the obvious and saves us space, up to 75%
- Also helps with caching
- modern systems are so fast that a compressed file can be read and decompressed faster than a larger uncompressed file can be read
- The author goes on to discuss the statistical properties of terms
- Heaps' law
- Estimates the number of terms of a collection
- Zipf's Law
- How terms are distributed across documents
- Dictionary Compression
- Dictionary as a string
- treat the dictionary as one long string
- Blocked Storage
- treat the dictionary as a string of blocks, with each term preceded by a byte encoding its length
- Front Coding
- use common beginnings of words with blocked storage to decrease size
- Postings File Compression
- bytewise compression
- bitwise compression
- ɣ encoding
- uses variable length codes
- This caused quite a bit of confusion for me.
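Since the ɣ codes caused me confusion, here is a little encoder/decoder I sketched to convince myself how they work: the offset is the number in binary with its leading 1 chopped off, and the unary prefix tells the decoder how many offset bits to read back.

```python
def gamma_encode(n):
    """Elias gamma code for n >= 1: unary code for the offset length, then
    the offset (binary of n without its leading 1). E.g. 13 = 1101 -> 1110,101."""
    offset = bin(n)[3:]                 # bin(13) = '0b1101' -> '101'
    return "1" * len(offset) + "0" + offset

def gamma_decode(bits):
    """Decode a concatenation of gamma codes back into a list of numbers."""
    numbers, i = [], 0
    while i < len(bits):
        length = 0
        while bits[i] == "1":           # unary prefix: count the 1s
            length += 1
            i += 1
        i += 1                          # skip the terminating 0
        numbers.append(int("1" + bits[i:i + length], 2))  # restore leading 1
        i += length
    return numbers
```

gamma_encode(13) gives "1110101", and decoding a concatenated stream of codes round-trips the original gap list — which is the "jump between" I was missing: the codes are self-delimiting, so a postings file is just the codes glued together.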
Tuesday, January 14, 2014
Muddiest Point - Unit 2
When it comes to permuterm and k-gram indexes for wildcard queries, when would you use one or the other? Permuterm seems like it may return a smaller, targeted group of words, whereas k-grams could return a huge result depending on the word.
Friday, January 10, 2014
Reading Notes - Unit 2
IIR Section1-2:
- We looked at building inverted indexes, mostly an introduction.
- Tokens, normalized tokens roughly equal words
- Inverted indexes look sort of like linked lists (?)
IIR Chapter 2:
- First we read about document types and what constitutes a document
- You don't want to consider a collection of books a document
- Conversely you don't want to consider a paragraph one either
- Try to find a happy medium, or tailor it to your need
- Interesting look at the challenges of languages
- Token is an instance of a sequence of characters
- Type is the class of all tokens with the same character sequence
- Term is a (possibly normalized) type that is included in the IR system's dictionary
- You might not include stop words like "the"
- Looked at how to deal with names such as O'Neill
- Normalization is the next step
- Followed possibly by stemming and lemmatization.
- It strikes me as interesting that stemming and lemmatization are still problematic and don't always solve the problem
- A discussion of skip pointers follows
- I'm still a little hazy on this.
- Biword indexes, with their ability to weed out most false positives, seem to have a lot of upsides and not so many downsides when it comes to search...
- "friends romans", "romans countrymen"
- Then positional schemes are introduced, followed by the combination of the biword and positional to find "Michael Jackson"
IIR Chapter 3
- We begin with the venerable binary tree and b-tree.
- b-trees are nice because they can help when the dictionary is disk-based.
- Then a discussion of wildcard queries begins
- Permuterm indexes.
- I still don't understand what's going on in these.
- k-gram indexes
- Spelling correction
- Choose the "nearest one"
- Choose the most likely if two words are equally close
- Forms of spelling correction
- isolated-term: correct a single query term at a time
- context-sensitive: using the most likely candidate
- "flew form Heathrow" becomes "flew FROM Heathrow"
- Edit distance uses matrices to determine how far one word is from another
- "cats" has a Levenshtein distance of 3 from "fast"
- k-gram indexes for spelling correction
- how many k-grams does it have in common
- Phonetic correction (the most interesting in my mind)
- make a soundex index
- I always wondered if something like this existed
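The edit-distance matrix from these notes is a classic dynamic program. My own sketch of plain Levenshtein distance (unit cost for insert, delete, and substitute):

```python
def edit_distance(s, t):
    """Cell d[i][j] holds the cost of turning s[:i] into t[:j]."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                     # delete everything from s
    for j in range(n + 1):
        d[0][j] = j                     # insert everything from t
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[m][n]
```

This confirms the example above: edit_distance("cats", "fast") is 3.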
Tuesday, January 7, 2014
Reading Notes - Unit 1
Finding Out About Section 1.1 Notes:
- This chapter was a nice start to the course. It provided a larger overview of what we're doing here. Having just come from Human Information Processing with Dr. Hirtle, this was a nice bridge.
- The FOA conversation loop was a perfect model for what goes on in a conversation between two people.
- They go on to show the model of a search engine and it really showed how far we have to come.
Information Retrieval Section 1.1 - 1.2 Notes:
- Section 1.1 is a basic introduction to the topic talking mostly about search engines but it does briefly touch on local search.
- Figure 1.1 is a nice model of what a search engine is doing.
- They then go on to talk about systems and efficiency and effectiveness
Modern Information Retrieval Sections 1.1 - 1.4 Notes:
- Again, a basic introduction to the topic, with a great discussion of ancient libraries and
- Section 1.2-1.3 are a bit more CS heavy than the other readings but provide a concrete, high level model that I feel is the best of the three readings.
- Section 1.3 introduces the idea of Ranking as the current best way to search
- Section 1.4 has a brief history of the web and web search moving onto the modern era.