Thursday, March 27, 2014

Reading Notes - Unit 11

*A note to our regular readers: Oard & Diekema's paper Cross-Language Information Retrieval could not be retrieved anywhere.

IES Chapter 14

This chapter looks at ways of dealing with massive amounts of data, on the scale that Google handles.

  • Parallel Query Processing - using index partitioning & replication
    • Document Partitioning
      • Each server has a subset of the documents
    • Term Partitioning
      • Each server holds the postings lists for a subset of the terms
    • Hybrid Schemes
      • Use term & document partitioning
      • OR use document partitioning and replication
    • Redundancy & Fault Tolerance
      • We must assume that with more queries and machines there will be faults
      • We can handle this with replication and partial replication
  • MapReduce
    • The Basic Framework
      • Highly parallelizable: map tasks run in parallel over the input shards, then reduce tasks run in parallel over the grouped keys.
    • Combiners
      • A reduce function applied to the output of a map shard, before that output is sent on to a reduce shard
    • Secondary Keys
      • A feature of MapReduce for handling many values under the same key: a secondary key defines the order in which those values reach the reduce function
    • Machine Failures
      • The map side deals with failure better than the reduce side
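
To make the MapReduce bullets above a little more concrete, here is a minimal single-process sketch of a term count with an optional combiner. This is only an illustration under my own assumptions, not the chapter's code: the names `map_fn`, `reduce_fn`, `run_map_shard`, `map_reduce` and the tiny `SHARDS` data are all made up, and a real framework would run each shard on a separate machine.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Emit (term, 1) for every token in one document."""
    for token in text.lower().split():
        yield token, 1


def reduce_fn(term, counts):
    """Sum all counts seen for one term."""
    return term, sum(counts)


def run_map_shard(docs, use_combiner=True):
    """Run map over one shard, optionally combining locally before the shuffle."""
    pairs = [kv for doc_id, text in docs for kv in map_fn(doc_id, text)]
    if not use_combiner:
        return pairs
    grouped = defaultdict(list)
    for term, count in pairs:
        grouped[term].append(count)
    # Combiner: the reduce function applied to this map shard's own output.
    return [reduce_fn(term, counts) for term, counts in grouped.items()]


def map_reduce(shards, use_combiner=True):
    """Return (term counts, number of key/value pairs sent to the reduce side)."""
    shuffled = defaultdict(list)
    pairs_sent = 0
    for docs in shards:  # sequential here; independent shards in a real system
        for term, count in run_map_shard(docs, use_combiner):
            shuffled[term].append(count)
            pairs_sent += 1
    result = dict(reduce_fn(term, counts) for term, counts in shuffled.items())
    return result, pairs_sent


# Hypothetical input: two shards with one document each.
SHARDS = [
    [(1, "to be or not to be")],
    [(2, "to search is to find")],
]
```

Calling `map_reduce(SHARDS, True)` and `map_reduce(SHARDS, False)` gives the same counts, but the combiner version sends fewer pairs across the shuffle, which is the whole point of a combiner.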

He & Wang - Cross-Language Information Retrieval

     This excerpt from a book looks at the challenges of taking queries in one language and returning relevant information that is in another.  Currently no search engine supports this, not even Google.  First, the system must decide how it will translate the given query.  You can use all the usual techniques such as stemming, tokenization, phrase identification, stop-word removal, n-grams, etc.  Then the application must have translation knowledge.  This can come from bilingual dictionaries or corpora.  The system must also be able to deal with acronyms and proper nouns.  The chapter then goes into how to use this knowledge to find the documents that best suit the user, using term weighting.  Finally, the system needs some way of being evaluated.  There are a few evaluation frameworks available, such as the TREC CLIR track.
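
     As a rough illustration of the dictionary-based translation step described above, here is a small sketch. Everything in it is an assumption of mine rather than something from the chapter: the stop-word list, the toy English-to-Spanish `BILINGUAL_DICT`, and the `translate_query` helper. Terms missing from the dictionary (often proper nouns and acronyms) are simply passed through untranslated, which is one simple way of handling them.

```python
# Hypothetical stop-word list and English -> Spanish dictionary (toy data).
STOP_WORDS = {"the", "of", "in", "a"}

BILINGUAL_DICT = {
    "water": ["agua"],
    "bank": ["banco", "orilla"],  # ambiguous term: keep every candidate
    "clean": ["limpio"],
}


def translate_query(query):
    """Return a list of (source term, candidate translations) pairs."""
    translated = []
    for token in query.split():  # whitespace tokenization only
        key = token.lower()
        if key in STOP_WORDS:
            continue  # stop-word removal
        # Out-of-vocabulary terms are kept as-is instead of being dropped.
        candidates = BILINGUAL_DICT.get(key, [token])
        translated.append((token, candidates))
    return translated
```

     For example, `translate_query("the bank of the Danube")` keeps both candidate translations of "bank" and passes "Danube" through unchanged. Picking among the candidates is where term weighting would come in.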


Tuesday, March 25, 2014

Friday, March 21, 2014

Reading Note - Unit 10

Brin & Page - The Anatomy of a Large-Scale Hypertextual Web Search Engine

     This is the rock star of academic papers.  This is what every paper hopes that it becomes.  This isn't just a paper we can learn from.

     This is history.

     Whether you like Google or not, very few companies of its size can so clearly point to their beginning in a paper like this.  Page and Brin discuss what have become the foundations for Google and web search.  They talk about PageRank and how it uses a relatively simple algorithm to determine a page's importance by tracking the pages that link to it.  There is then a review of related work that led them to PageRank.  They then go on to discuss Google's architecture for storing and crawling the data and the lexicon.  There is then a section of results and then conclusions.  The last sentence is funny to look back on: "We hope Google will be a resource for searchers and researchers all around the world and will spark the next generation of search engine technology."  If it had just been that.
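
     To show how simple the idea is, here is a minimal PageRank sketch using power iteration. It is a toy under stated assumptions, not Google's implementation: the `pagerank` function and the `WEB_GRAPH` data are my own, the scores are normalized to sum to 1 (a common variant of the paper's formula), the damping factor of 0.85 is the value the paper suggests, and every link target is assumed to also appear as a key in the graph.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts with the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page (no outlinks): spread its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" is linked to by everyone else.
WEB_GRAPH = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
```

     On this toy graph, `pagerank(WEB_GRAPH)` ranks "c" highest because three pages link to it, and "d" lowest because nothing links to it.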


Kleinberg - Authoritative Sources in a Hyperlinked Environment

     This paper discusses web search.  It would seem to come to the same conclusions as Brin & Page but without the billion dollar company: we can find good information on the web by looking to see who links to whom.  In this case, though, the paper advocates identifying authoritative pages, along with the hub pages that point to them.
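
     Here is a minimal sketch of the hubs-and-authorities iteration from Kleinberg's paper, as I understand it. The `hits` function and `LINK_GRAPH` are my own illustrative names, and the sketch skips the step where the paper first builds a query-focused subgraph: it just runs the iteration on whatever graph it is given.

```python
import math


def hits(links, iterations=20):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking to the page.
        auth = {page: 0.0 for page in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub score: sum of the authority scores of the pages it links to.
        hub = {page: sum(auth[t] for t in links.get(page, [])) for page in pages}
        # Normalize both score vectors so they do not grow without bound.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for page in scores:
                scores[page] /= norm
    return hub, auth


# Hypothetical graph: three hub-like pages pointing at two content pages.
LINK_GRAPH = {"h1": ["x", "y"], "h2": ["x", "y"], "h3": ["x"]}
```

     On this graph "x" ends up as the strongest authority, and "h1" and "h2" are better hubs than "h3" because they point to more of the good pages.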

IIR Chapter 19

     This might be the first chapter in this book that actually has some practical stuff in it.  There is a brief discussion of the history of web search.  Then it goes on to talk about the model of the Web and how that model helped the Web grow so explosively.  The web looks suspiciously like a graph.  Then the chapter goes on to talk about spam and how it was inevitable because "spam stems from the heterogeneity of motives in content creation on the Web."  The chapter goes on to talk about the users of search.  It finishes up by dealing with duplicate pages.
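
     For the duplicates part, one common approach is shingling: compare the sets of overlapping word sequences in two pages. The sketch below is my own illustration of that idea, not code from the book. The helper names, the shingle size of 4, and the 0.9 threshold are all adjustable assumptions.

```python
def shingles(text, k=4):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    tokens = text.lower().split()
    if len(tokens) < k:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard coefficient: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def near_duplicates(doc_a, doc_b, k=4, threshold=0.9):
    """Treat two documents as near duplicates if their shingle sets overlap enough."""
    return jaccard(shingles(doc_a, k), shingles(doc_b, k)) >= threshold
```

     Comparing every pair of pages this way would be far too slow at web scale, so this only shows the similarity measure itself, not how a search engine would apply it efficiently.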

IIR Chapter 20

     This chapter deals with the structure of the web, which was covered briefly in chapter 19.  It appears to be mostly a rehash of the two papers we read for this week.

Tuesday, March 4, 2014

Muddiest Notes - Unit 8

No unclear points this week.  The lecture was a really great summary of what we've done up to this point, and it really pulled it all together.