First up we have Chapter 8 of IIR (Introduction to Information Retrieval)
- Measuring the effectiveness of IR systems
  - We need a test collection
  - We need a set of test queries
  - We need a set of relevance judgments, as the book calls them
- Test collections for this purpose
  - Cranfield collection
  - TREC
    - Put together by NIST
  - GOV2
    - A bigger collection than the classic TREC ones, also done by NIST
    - Still two orders of magnitude smaller than the collections indexed by major web search engines
  - NTCIR
    - Focuses on East Asian languages
  - CLEF
    - European languages
  - Reuters
    - Newswire stories
  - 20 Newsgroups
- Evaluating unranked retrieval results
  - Precision
    - The fraction of retrieved documents that are relevant
  - Recall
    - The fraction of relevant documents that are retrieved
  - F-measure
    - A single measure that combines both (the weighted harmonic mean of precision and recall)
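These three measures are easy to compute once you have the retrieved and relevant sets. A minimal sketch, using made-up document IDs (the sets here are hypothetical, not from the book):

```python
# Toy illustration of precision, recall, and the balanced F-measure (F1).
# The retrieved and relevant document-ID sets below are invented for the example.
retrieved = {1, 2, 3, 4, 5, 6, 7, 8}
relevant = {2, 4, 6, 9, 10}

tp = len(retrieved & relevant)                      # relevant docs we actually retrieved
precision = tp / len(retrieved)                     # 3 / 8
recall = tp / len(relevant)                         # 3 / 5
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of P and R

print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")  # prints P=0.375 R=0.600 F1=0.462
```

The harmonic mean punishes imbalance: a system with perfect recall but near-zero precision still gets a near-zero F1.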
- Evaluating ranked retrieval results
  - Precision-recall curve
    - Why not just use the F-measure?
  - Many other ways to evaluate results (e.g. precision at k, mean average precision)
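For ranked results, you trace precision and recall as you walk down the ranking. A small sketch, assuming a hypothetical binary relevance vector in rank order and four relevant documents in the collection:

```python
# Build (recall, precision@k) points from a ranked result list, then compute
# interpolated precision. The relevance vector below is invented for the example.
ranking = [1, 0, 1, 1, 0, 0, 1, 0]  # 1 = relevant, in rank order
total_relevant = 4                   # relevant docs in the whole collection

points = []
hits = 0
for k, rel in enumerate(ranking, start=1):
    hits += rel
    points.append((hits / total_relevant, hits / k))  # (recall, precision@k)

def interpolated_precision(r):
    """Interpolated precision at recall r: the max precision at any recall >= r."""
    return max(p for rec, p in points if rec >= r)

print(points)
print(interpolated_precision(0.5))  # prints 0.75
```

Interpolation is what smooths the sawtooth shape of the raw curve into the familiar monotonically decreasing precision-recall plot.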
- Developing reliable and informative test collections
  - Pool the top k documents returned by many systems and have the pooled documents judged by experts
- User utility and the use of document relevance
  - User satisfaction is very important
    - Maybe more so than whether an expert judges something relevant
- Result snippets
  - Just like Google, we should show a short snippet of the text for each ranked document
Cumulated Gain-Based Evaluation of IR Techniques
This is a 2002 paper by Järvelin and Kekäläinen that looks at several techniques for evaluating IR systems. It covers recall and precision like Ch. 8 but goes further by using graded relevance judgments. The first measure, cumulated gain (CG), sums the relevance scores of the documents down the ranked result list. The second, discounted cumulated gain (DCG), additionally discounts "late-retrieved" documents by dividing each gain by the log of its rank. The third normalizes these curves against the ideal ranking, so the performance of different techniques can be compared directly. Their case study used the TREC-7 data set. This paper would seem to be the basis for our ability to really test different IR systems using established methods.
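The three measures from the paper are short enough to sketch directly. A minimal version, assuming a hypothetical graded relevance vector (0–3) in rank order and using log base 2 for the discount, as in the paper's examples:

```python
from math import log2

# Invented graded relevance scores (0-3) for a ranked result list.
gains = [3, 2, 3, 0, 0, 1, 2, 2, 3, 0]

def cg(g):
    """Cumulated gain: running sum of the relevance scores down the ranking."""
    out, total = [], 0
    for x in g:
        total += x
        out.append(total)
    return out

def dcg(g, b=2):
    """Discounted cumulated gain: gains at ranks i >= b are divided by log_b(i),
    so late-retrieved documents contribute less."""
    out = []
    for i, x in enumerate(g, start=1):
        d = x if i < b else x / (log2(i) / log2(b))
        out.append((out[-1] if out else 0) + d)
    return out

def ndcg(g, b=2):
    """Normalized DCG: divide by the DCG of the ideal (descending) ordering.
    (The paper builds the ideal vector from all judged documents, not just
    the retrieved ones; this sketch simplifies.)"""
    ideal = sorted(g, reverse=True)
    return [a / i for a, i in zip(dcg(g, b), dcg(ideal, b))]

print(cg(gains)[:3])   # prints [3, 5, 8]
print(ndcg(gains)[-1])
```

A perfect ranking gives nDCG of 1.0 at every rank, which is what makes the normalized curves comparable across queries with different numbers of relevant documents.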
What's the value of TREC?