This section covers Boolean search queries over an inverted index. Section 1.3 describes how best to find the answer: you take the postings list of each term and compute the intersection for AND, or the union for OR. It also goes on to discuss nested Boolean operators. Section 1.4 compares Boolean search with ranked retrieval, for example Westlaw in the early 1990s versus Google today.
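The merge described above can be sketched as follows. This is a minimal illustration, assuming each postings list is a sorted Python list of doc IDs; the names `intersect`, `union`, and the toy `index` are made up for this example.

```python
def intersect(p1, p2):
    """AND: walk two sorted postings lists, keeping doc IDs found in both."""
    i, j, out = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out


def union(p1, p2):
    """OR: walk two sorted postings lists, keeping doc IDs found in either."""
    i, j, out = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            out.append(p1[i])
            i += 1
        else:
            out.append(p2[j])
            j += 1
    # One list is exhausted; the rest of the other belongs in the union.
    out.extend(p1[i:])
    out.extend(p2[j:])
    return out


# Toy index (illustrative data only): term -> sorted doc IDs.
index = {
    "brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "caesar": [1, 2, 4, 5, 6, 16, 57, 132],
    "calpurnia": [2, 31, 54, 101],
}

# brutus AND calpurnia
simple = intersect(index["brutus"], index["calpurnia"])

# Nested query: (brutus OR caesar) AND calpurnia
nested = intersect(union(index["brutus"], index["caesar"]), index["calpurnia"])

print(simple, nested)
```

Both functions run in time linear in the combined length of the two lists, which is why the postings need to be kept sorted by doc ID. A nested query is answered by evaluating the inner operator first and feeding its result into the outer one.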
IIR Chapter 6
- Chapter 6 is about scoring and weighting documents, as opposed to the Boolean approach we've been using thus far
- Parametric & Zone indexes
- We use a parametric index that includes such items as date of creation
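- Zone indexes cover zones such as the title and abstract; unlike fields, zones hold arbitrary free text rather than values from a predefined set
- Weighted zone scoring gives each (query, document) pair a score on the interval [0,1]
- The zone weights are "learned" from training examples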
- Machine learning
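Weighted zone scoring can be sketched as a weighted sum of per-zone Boolean matches. The zone names, the weights, and the choice of "all query terms appear in the zone" as the match test are assumptions made for this example; in practice the weights would be learned rather than hand-picked.

```python
def weighted_zone_score(query_terms, doc_zones, weights):
    """Sum the weights of every zone that matches the query.

    weights maps zone name -> weight g_i, and the g_i must sum to 1
    so that the final score lands in [0, 1].
    """
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("zone weights must sum to 1")
    score = 0.0
    for zone, g in weights.items():
        tokens = set(doc_zones.get(zone, "").lower().split())
        # Boolean match s_i: 1 if every query term occurs in this zone.
        s = 1 if all(t in tokens for t in query_terms) else 0
        score += g * s
    return score


# Hypothetical hand-picked weights (a learned model would supply these).
weights = {"title": 0.5, "abstract": 0.3, "body": 0.2}

doc = {
    "title": "inverted index construction",
    "abstract": "we describe index compression",
    "body": "the inverted index maps terms to postings",
}

# Title and body match, abstract does not, so the score is about 0.7.
score = weighted_zone_score(["inverted", "index"], doc, weights)
print(score)
```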
- Weighting term importance
- Term frequency
- The more a term is used, the more likely the document is about it.
- However, we can use inverse document frequency to discount words such as stop words, which occur a lot but are not important
- Basically, it rewards a term that is rare across the collection but occurs frequently within a particular document
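A small worked example of term frequency combined with inverse document frequency, using idf = log10(N / df) and tf-idf = tf × idf. The three-document collection and the function names are made up for illustration, and tokenization is a plain whitespace split.

```python
import math
from collections import Counter

# Toy collection (illustrative data only).
docs = {
    "d1": "the car is driven on the road",
    "d2": "the truck is driven on the highway",
    "d3": "the car insurance covers the car",
}

# Term counts per document.
tf = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}


def idf(term):
    """Inverse document frequency: log10(N / df), or 0 for unseen terms."""
    n_docs = len(tf)
    df = sum(1 for counts in tf.values() if term in counts)
    return math.log10(n_docs / df) if df else 0.0


def tf_idf(term, doc_id):
    """Raw term frequency in the document times the term's idf."""
    return tf[doc_id][term] * idf(term)


for term in ("the", "car", "insurance"):
    print(term, round(idf(term), 3), round(tf_idf(term, "d3"), 3))
```

Here "the" occurs in every document, so its idf is log10(3/3) = 0 and it contributes nothing no matter how often it appears. "insurance" occurs in only one document, so it gets the highest idf of the three terms.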
- Vector Space Scoring
- Treat the document as a vector containing each of its dictionary terms
- Documents can be assessed against queries using the dot product
- Treating the document as a vector
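The dot-product comparison can be sketched like this. As a simplifying assumption, the vectors here hold raw term counts rather than tf-idf weights, and the dot product is divided by the vector lengths (cosine similarity) so that long documents are not favored. The documents, the query, and the function names are made up for the example.

```python
import math
from collections import Counter


def to_vector(text):
    """Sparse term vector: term -> raw count (assumed weighting)."""
    return Counter(text.lower().split())


def dot(u, v):
    """Dot product over the terms the two sparse vectors share."""
    return sum(weight * v.get(term, 0) for term, weight in u.items())


def cosine(u, v):
    """Dot product of the two vectors after length normalization."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0


# Toy collection and query (illustrative data only).
docs = {
    "d1": "new york times",
    "d2": "new york post",
    "d3": "los angeles times",
}
query = to_vector("new new times")

# Rank documents by similarity to the query, best first.
ranking = sorted(
    docs,
    key=lambda doc_id: cosine(query, to_vector(docs[doc_id])),
    reverse=True,
)
print(ranking)
```

Unlike Boolean retrieval, every document gets a graded score, so "d3" still appears in the ranking even though it shares only one term with the query.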
- Term Weighting for the vector space model
- Words that occur 20x more often shouldn't be 20x more important
- tf Normalization
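Two common ways of damping raw term frequency can be sketched as below: sublinear scaling, wf = 1 + log(tf) for tf > 0, and maximum tf normalization, ntf = a + (1 − a) · tf / tf_max. The function names are made up, the base-10 logarithm is an assumption, and a = 0.4 is used as a typical smoothing value.

```python
import math


def sublinear_tf(tf):
    """Sublinear scaling: 1 + log10(tf) for tf > 0, else 0."""
    return 1.0 + math.log10(tf) if tf > 0 else 0.0


def max_tf_normalized(tf, tf_max, a=0.4):
    """Scale tf by the largest tf in the same document, smoothed by a.

    Intended for terms that occur in the document (tf >= 1).
    """
    if tf_max <= 0:
        raise ValueError("tf_max must be positive")
    return a + (1.0 - a) * tf / tf_max


# A term occurring 20 times versus once in a document.
once = sublinear_tf(1)
twenty = sublinear_tf(20)
print(once, round(twenty, 3))

# Same counts relative to the most frequent term (tf_max = 20).
print(max_tf_normalized(1, 20), max_tf_normalized(20, 20))
```

With sublinear scaling, the term that occurs 20 times gets a weight of roughly 2.3 instead of 20, so repeated occurrences still count but with diminishing returns.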