Friday, February 28, 2014

Reading Note - Unit 8

*Note, somehow reading note for Unit 6 got republished or was saved as a draft...not sure what happened

MIR Chapter 10

     This chapter is about how we present information to the user and how the user interacts with the search system.  First it considers human-computer interaction.  We must offer informative feedback.  Reduce working-memory load.  Provide alternatives for novice and expert users.  Make sure that we use the capabilities of modern computers to display information visually.  We must consider how the user will access the information.  Basically this chapter covered the issues facing the search professional as we develop interfaces that help users use what we find.

Hearst Chapter 1

     This excerpt talks specifically about the design of search user interfaces.  Keep it simple.  Realize that users are no longer necessarily highly educated professionals with in-depth domain knowledge.  We have to take into account how "usable" our interface is, and there is very good research that can guide us.  We must give the user timely and useful feedback, but we must balance this with doing some things automatically.  We can't overwhelm the user's short-term memory.  And we must provide shortcuts, or hints, and help the user avoid errors.  Don't forget the small things: it has to look good, too, on top of everything else.

Hearst Chapter 11

     This excerpt talks specifically about visualization for text analysis.  We can use graphics to show relationships.  We can visualize concordances, like SeeSoft or tag clouds.  We can also visualize relationships between documents, like citations in the scientific literature.

Reading Note - Unit 6

*A note to our regular readers, we skipped unit 5


First up we have chapter 8 of IIR


  • Measuring the effectiveness of IR systems
    • We need a test collection
    • We need a set of test queries
    • We need a set of relevance judgments as the book calls them
  • Test collections for this purpose
    • Cranfield collection
    • TREC
      • Put together by NIST
    • GOV2
      • Bigger version of TREC, also done by NIST
      • Still two orders of magnitude smaller than what the web search engines index
    • NTCIR
      • Focuses on East Asian languages
    • CLEF
      • European languages
    • Reuters
      • Newswires
    • 20 Newsgroups
  • Evaluating unranked retrieval results
    • Precision
      • fraction of documents retrieved that are relevant
    • Recall
      • fraction of relevant documents that are retrieved
    • F-measure
      • A single measure that uses both
  • Evaluating ranked retrieval results
    • Precision-recall curve
      • Why not just use F-measure?
    • Many other ways to evaluate results
  • Developing reliable and informative test collections
    • Using pooling of the top k documents and having them judged by experts
  • User utility & the use of document relevance
    • Satisfaction of the users is very important
      • Maybe more so than whether an expert judges something relevant
  • Results snippets
    • Just like Google, we should give small snippets of the returned text for each ranked document
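The unranked measures in the list above are easy to try out in code.  Here is a minimal Python sketch of my own (the document IDs are made up for illustration):

```python
def precision_recall_f1(retrieved, relevant):
    """Return (precision, recall, F1) for two collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-measure (balanced F1): harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Retrieved d1-d4; only d1, d3, d5 were actually relevant.
p, r, f = precision_recall_f1(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(round(p, 3), round(r, 3), round(f, 3))  # 0.5 0.667 0.571
```

The harmonic mean is why F1 punishes a system that games only one of the two measures.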

Cumulated Gain-Based Evaluation of IR Techniques

     This is a paper from 2002 that looks at several techniques for evaluating IR systems.  It talks about recall and precision like Ch. 8 but attempts to go further.  The first measure, cumulated gain, sums the graded relevance scores of the documents in the result list.  The second, discounted cumulated gain, discounts "late-retrieved" documents with a log-based penalty.  The third idea, normalization against an ideal ranking, lets the performance of different techniques be compared directly.  They used the TREC-7 data set.  This paper would seem to be the basis for our ability to really test different IR systems using these established methods.
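The paper's measures can be sketched in a few lines.  This is my own rough Python version of (discounted) cumulated gain with normalization, using the common log2 discount that leaves rank 1 undiscounted; the graded relevance scores are made up:

```python
import math

def dcg(gains):
    """Discounted cumulated gain: later ranks contribute less (log2 discount)."""
    return sum(g if rank == 1 else g / math.log2(rank)
               for rank, g in enumerate(gains, start=1))

def ndcg(gains):
    """Normalize against the ideal (descending) ordering of the same gains."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal else 0.0

ranking = [3, 2, 3, 0, 1]   # graded relevance of the top 5 results, made up
print(round(ndcg(ranking), 3))
```

An ideal ordering scores exactly 1.0, so different systems end up on the same scale.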

What's the value of TREC


Tuesday, February 25, 2014

Friday, February 21, 2014

Reading Notes - Unit 7

IIR Chapter 9


  • Relevance feedback and query expansion
    • Relevance feedback and pseudo relevance feedback
      • The Rocchio algorithm for relevance feedback
        • classic algorithm for implementing feedback
        • models feedback into vector space model
      • Probabilistic relevance feedback
        • Naive Bayes probabilistic model
      • When does it work?
        • relies on assumptions
          • user must have enough knowledge to form a reasonable initial query
      • On the web
        • Excite tried full relevance feedback but it was dropped
      • Evaluation of relevance feedback strategies
        • 1-2 rounds of feedback
      • Pseudo relevance feedback
        • method for automatic local analysis
      • Indirect relevance feedback
        • also called implicit
        • DirectHit used this
    • Global method for query reformulation
      • Vocab tools
        • stemming
        • thesaurus
      • Query Expansion
        • Users give feedback
      • Automatic thesaurus generation
        • analyze documents automatically
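The Rocchio algorithm from the list above can be sketched like this.  A rough Python version using the classic alpha/beta/gamma weights; the toy document vectors are made up:

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query vector toward relevant docs and away from non-relevant ones."""
    dims = len(query)

    def centroid(docs):
        if not docs:
            return [0.0] * dims
        return [sum(d[i] for d in docs) / len(docs) for i in range(dims)]

    rel_c = centroid(relevant)
    nonrel_c = centroid(nonrelevant)
    # Negative term weights are usually clipped to zero in practice.
    return [max(0.0, alpha * q + beta * r - gamma * n)
            for q, r, n in zip(query, rel_c, nonrel_c)]

q = [1.0, 0.0, 0.5]  # query in a tiny 3-term vector space
modified = rocchio(q, relevant=[[1, 1, 0], [0, 1, 1]], nonrelevant=[[0, 0, 1]])
print([round(v, 3) for v in modified])  # [1.375, 0.75, 0.725]
```

Note how the second term, absent from the original query, gets pulled in by the relevant documents' centroid.
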

Relevance Feedback Revisited

     This study showed that relevance feedback worked.  Specifically it showed that by using query expansion and query reweighting you could show improvements of 100%.  It also showed that the more the merrier when it came to rounds of relevance feedback.

A Study of Methods for Negative Relevance Feedback

     All about negative relevance feedback models, for hard queries that return few or no relevant results.  They look at a bunch of models, and they found that the language-model-based methods were more effective than the vector-space ones.


Improving the Effectiveness of Information Retrieval with Local Context Analysis

     This paper is about local context analysis, a new query-expansion technique that they propose.  It worked with both English AND non-English collections.  They found it worked better than the other expansion approaches they compared it to.  They say they are going to continue this work and look into deciding when to expand queries.

Muddiest Notes - Unit 6

Unit 6 was hard to understand in general.  I'm still taking it all in.  At this point I can't really say what the muddiest point was, so much as the whole thing.  I guess my biggest problem would be how we determine what is "relevant".  If it truly comes down to an expert's judgment, or to whatever the user deems relevant, that seems...poor.

Tuesday, February 4, 2014

Muddiest Note - Unit 4

This week was pretty straightforward.  The only sticking point might be the thought of cosines of angles between vectors with millions of dimensions.  This is just because I am a visual person.
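For what it's worth, the cosine itself stays simple even when the dimensions are hard to picture.  A small Python sketch of my own using sparse term-to-weight dicts (the weights are made up), so only the nonzero coordinates are ever touched no matter how many dimensions there are:

```python
import math

def cosine(a, b):
    """Cosine similarity of two sparse term->weight vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0

doc1 = {"retrieval": 2.0, "vector": 1.0}
doc2 = {"retrieval": 1.0, "cosine": 3.0}
print(round(cosine(doc1, doc2), 3))  # 0.283
```

The millions of zero dimensions simply never show up in the dicts, which is how real IR systems get away with it.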