IES Chapter 14
This chapter looks at ways of dealing with massive amounts of data, the kind of scale a search engine like Google has to handle.
- Parallel Query Processing: using index partitioning and replication
  - Document Partitioning
    - Each server holds the index for a subset of the documents
  - Term Partitioning
    - Each server holds the postings lists for a subset of the terms
  - Hybrid Schemes
    - Use term and document partitioning together
    - OR use document partitioning and replication
  - Redundancy & Fault Tolerance
    - With more machines and more queries, we must assume that some machines will fail
    - We can handle this with replication and partial replication
- MapReduce
  - The Basic Framework
    - Highly parallelizable: both map and reduce can run in parallel across many machines
  - Combiners
    - A reduce function applied to a map shard instead of a reduce shard, so data is aggregated before it is sent to the reducers
  - Secondary Keys
    - A MapReduce feature that defines the order in which values for the same key arrive at the reduce function
  - Machine Failures
    - Failures are easier to handle on the map side than on the reduce side
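To make the MapReduce notes above a bit more concrete, here is a minimal single-process sketch in Python that builds a tiny inverted index. It is not the chapter's code and not a real MapReduce API: the names `map_fn`, `combine`, `reduce_fn` and `map_reduce` are made up for illustration, the "shards" are just lists, and the shuffle is simulated with a sort. The document id plays the role of a secondary key, so each term's values reach the reducer in document order.

```python
from collections import defaultdict
from itertools import groupby


def map_fn(doc_id, text):
    """Emit ((term, doc_id), 1) per token; doc_id is the secondary key."""
    for term in text.lower().split():
        yield (term, doc_id), 1


def combine(pairs):
    """Combiner: a reduce-style sum applied to one map shard's output."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return list(counts.items())


def reduce_fn(term, values):
    """Build a postings list from (doc_id, tf) values sorted by doc_id."""
    postings = []
    for doc_id, tf in values:
        if postings and postings[-1][0] == doc_id:
            # Same document seen again: merge the partial counts.
            postings[-1] = (doc_id, postings[-1][1] + tf)
        else:
            postings.append((doc_id, tf))
    return term, postings


def map_reduce(shards):
    shuffled = []
    for shard in shards:  # in a real system, each shard runs on its own machine
        emitted = [pair for doc_id, text in shard for pair in map_fn(doc_id, text)]
        shuffled.extend(combine(emitted))
    # Simulated shuffle: sorting by (term, doc_id) groups terms together and
    # orders each term's values by the secondary key.
    shuffled.sort(key=lambda kv: kv[0])
    index = {}
    for term, group in groupby(shuffled, key=lambda kv: kv[0][0]):
        term, postings = reduce_fn(term, ((key[1], tf) for key, tf in group))
        index[term] = postings
    return index


# Two hypothetical map shards holding (doc_id, text) pairs.
shards = [
    [(1, "to be or not to be"), (2, "to do is to be")],
    [(3, "do be do be do")],
]
index = map_reduce(shards)
print(index["be"])
```

The reducer would still produce the same index if `combine` were skipped; the combiner only cuts down how many pairs have to cross the network between the map and reduce phases.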
He & Wang - Cross-Language Information Retrieval
This book excerpt looks at the challenges of taking a query in one language and returning relevant information written in another. Currently no search engine supports this, not even Google. First, the system must decide how it will translate the given query. All the usual techniques apply here, such as tokenization, stemming, phrase identification, stop-word removal, and n-grams. Next, the application needs translation knowledge, which can come from bilingual dictionaries or corpora. The system must also be able to deal with acronyms and proper nouns, which dictionaries often miss. The chapter then goes into how to use this knowledge, together with term weighting, to find the documents that best suit the user. Finally, the system needs some way of being evaluated. A few evaluation frameworks are available, such as the TREC CLIR track.
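As a rough illustration of the dictionary-based route described above, here is a small Python sketch. Everything in it is an assumption made for the example and not taken from He & Wang: the toy English-to-Spanish dictionary, the stop-word list, the function name `translate_query`, and the choice to split each source term's weight evenly across its candidate translations. Tokens that are not in the dictionary (typically acronyms and proper nouns) are passed through untranslated.

```python
# Hypothetical stop-word list and toy bilingual dictionary (illustrative only).
STOP_WORDS = {"the", "of", "in", "a"}

BILINGUAL_DICT = {
    "bank": ["banco", "orilla"],
    "history": ["historia"],
    "river": ["río"],
}


def translate_query(query):
    """Return {target_term: weight} for a source-language query."""
    weights = {}
    for token in query.lower().split():  # naive whitespace tokenization
        if token in STOP_WORDS:
            continue
        # Tokens missing from the dictionary are kept as they are (lowercased).
        translations = BILINGUAL_DICT.get(token, [token])
        for target in translations:
            # Assumption: an ambiguous term shares its weight evenly
            # between all of its candidate translations.
            weights[target] = weights.get(target, 0.0) + 1.0 / len(translations)
    return weights


translated = translate_query("history of the NATO bank")
print(translated)
```

The weighted terms could then be handed to an ordinary monolingual ranking function over the target-language documents; a corpus-based approach would replace the even split with translation probabilities estimated from parallel text.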