prev

next

of 124

Published on

02-Jan-2016View

36Download

1

Embed Size (px)

DESCRIPTION

Special Topic on Image Retrieval. 2014-02. http://staff.ustc.edu.cn/~zhwg/SpecialTopic_IR.html. Text-based Image Retrieval. Retrieval Model Overview. Occurrence based models Boolean Retrieval Model Vector Space Model Probability models BM25 Language Model Hyperlink-based models - PowerPoint PPT Presentation

Transcript

Special Topic on Image Retrieval

Special Topic on Image Retrieval2014-021 http://staff.ustc.edu.cn/~zhwg/SpecialTopic_IR.htmlText-based Image Retrieval

2

3Retrieval Model OverviewOccurrence based modelsBoolean Retrieval ModelVector Space ModelProbability modelsBM25 Language ModelHyperlink-based modelsPageRankMachine learning based modelsLearning to Rank4Boolean RetrievalThe Boolean retrieval model is being able to ask a query that is a Boolean expression:Boolean Queries use AND, OR and NOT to join query termsViews each document as a set of wordsIs precise: document matches condition or not.Perhaps the simplest model to build an IR system onPrimary commercial retrieval tool for 3 decades. Many search systems you still use are Boolean:Email, library catalog, Mac OS X Spotlight5Boolean: ture or false5

6Linear scanning the texts for each queryBoolean RetrievalIndex the documentsTerm-document incidence matrix7Boolean RetrievalTerm-document incidence

1 if doc contains word, 0 otherwiseBrutus AND Caesar BUT NOT Calpurniahttp://www.rhymezone.com/shakespeare/ 8Boolean RetrievalIncidence vectorsSo we have a 0/1 vector for each term.To answer query: take the vectors for Brutus, Caesar and Calpurnia (complemented) bitwise AND.110100 AND 110111 AND 101111 = 100100. 9Answers to queryAntony and Cleopatra, Act III, Scene iiAgrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus, When Antony found Julius Caesar dead, He cried almost to roaring; and he wept When at Philippi he found Brutus slain.

Hamlet, Act III, Scene iiLord Polonius: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.

10Bigger collectionsConsider N = 1 million documents, each with about 1000 words.Avg 6 bytes/word including spaces/punctuation 6GB of data in the documents.Say there are M = 500K distinct terms among these.11Cant build the matrix500K x 1M matrix has half-a-trillion 0s and 1s.Why?But it has no more than one billion 1s.matrix is extremely sparse.Whats a better representation?We only record the 1 positions1212Inverted indexFor each term t, we must store a list of all documents that contain t.Identify each by a docID, a document serial number13DictionaryPostingsSorted by docIDPostingBrutusCalpurniaCaesar12456165713212411314517323117454101TokenizerToken streamFriendsRomansCountrymenInverted index constructionLinguistic modulesModified tokensfriendromancountrymanIndexerInverted indexfriendromancountryman24213161

Documents tobe indexedFriends, Romans, countrymen.14Indexer steps: Token sequenceSequence of (Modified token, Document ID) pairs.I did enact JuliusCaesar I was killed i' the Capitol; Brutus killed me.Doc 1So let it be withCaesar. The nobleBrutus hath told youCaesar was ambitiousDoc 2

15Indexer steps: SortSort by termsAnd then docID

Core indexing step16Indexer steps: Dictionary & PostingsMultiple term entries in a single document are merged.Split into Dictionary and PostingsDoc. frequency information is added.

Why frequency?Will discuss later.

17

Where do we pay in storage?PointersTerms and countsLists of docIDs18Query processing: ANDConsider processing the query:Brutus AND CaesarLocate Brutus in the Dictionary;Retrieve its postings.Locate Caesar in the Dictionary;Retrieve its postings.Merge the two postings:12834248163264123581321BrutusCaesar19Intersecting two postings lists(a merge algorithm)

20The mergeWalk through the two postings simultaneously, in time linear in the total number of postings entries3412824816326412358132112834248163264123581321BrutusCaesar28If list lengths are x and y, merge takes O(x+y) operations.Crucial: postings sorted by docID.21Postings: 21Ranked retrievalThus far, our queries have all been Boolean.Documents either match or dont.Good for expert users with precise understanding of their needs and the collection.Also good for applications: Applications can easily consume 1000s of results.Not good for the majority of users.Most users incapable of writing Boolean queries (or they are, but they think its too much work).Most users dont want to wade through 1000s of results.This is particularly true of web search.

22Problem with Boolean search:feast or famineBoolean queries often result in either too few (=0) or too many (1000s) results.Query 1: standard user dlink 650 200,000 hitsQuery 2: standard user dlink 650 no card found: 0 hitsIt takes a lot of skill to come up with a query that produces a manageable number of hits.AND gives too few; OR gives too many2323Ranked retrieval modelsRather than a set of documents satisfying a query expression, in ranked retrieval, the system returns an ordering over the (top) documents in the collection for a queryFree text queries: Rather than a query language of operators and expressions, the users query is just one or more words in a human languageIn principle, there are two separate choices here, but in practice, ranked retrieval has normally been associated with free text queries and vice versa

24Feast or famine: not a problem in ranked retrievalWhen a system produces a ranked result set, large result sets are not an issueIndeed, the size of the result set is not an issueWe just show the top k resultsWe dont overwhelm the user

Premise: the ranking algorithm works25Scoring as the basis of ranked retrievalWe wish to return in order the documents most likely to be useful to the searcherHow can we rank-order the documents in the collection with respect to a query?Assign a score say in [0, 1] to each documentThis score measures how well document and query match.26Query-document matching scoresWe need a way of assigning a score to a query/document pairLets start with a one-term queryIf the query term does not occur in the document: score should be 0The more frequent the query term in the document, the higher the score (should be)We will look at a number of alternatives for this.27Binary term-document incidence matrix

Each document is represented by a binary vector {0,1}|V|28Term-document count matricesConsider the number of occurrences of a term in a document: Each document is a count vector in v: a column below

29Bag of words modelVector representation doesnt consider the ordering of words in a documentJohn is quicker than Mary and Mary is quicker than John have the same vectorsThis is called the bag of words model.30Term frequency tfThe term frequency tft,d of term t in document d is defined as the number of times that t occurs in d.We want to use tf when computing query-document match scores. But how?Raw term frequency is not what we want:A document with 10 occurrences of the term is more relevant than a document with 1 occurrence of the term.But not 10 times more relevant.Relevance does not increase proportionally with term frequency.NB: frequency = count in IR31Log-frequency weightingThe log frequency weight of term t in d is

0 0, 1 1, 2 1.3, 10 2, 1000 4, etc.Score for a document-query pair: sum over terms t in both q and d:score

The score is 0 if none of the query terms is present in the document.

32Document frequencyRare terms are more informative than frequent termsRecall stop wordsConsider a term in the query that is rare in the collection (e.g., arachnocentric)A document containing this term is very likely to be relevant to the query arachnocentricWe want a high weight for rare terms like arachnocentric.We will use document frequency (df) to capture this.33Document frequency, continuedFrequent terms are less informative than rare termsConsider a query term that is frequent in the collection (e.g., high, increase, line)A document containing such a term is more likely to be relevant than a document that doesntBut its not a sure indicator of relevance. For frequent terms, we want high positive weights for words like high, increase, and lineBut lower weights than for rare terms.We will use document frequency (df) to capture this.

34idf weightdft is the document frequency of t: the number of documents that contain tdft is an inverse measure of the informativeness of tdft NWe define the idf (inverse document frequency) of t by

We use log (N/dft) instead of N/dft to dampen the effect of idf.

35idf example, suppose N = 1 milliontermdftidftcalpurnia16animal1004sunday1,0003fly10,0002under100,0001the1,000,0000There is one idf value for each term t in a collection.

366 4 3 2 1 036tf-idf weightingThe tf-idf weight of a term is the product of its tf weight and its idf weight.

Best known weighting scheme in information retrievalNote: the - in tf-idf is a hyphen, not a minus sign!Alternative names: tf.idf, tf x idfIncreases with the number of occurrences within a documentIncreases with the rarity of the term in the collection

37Score for a document given a queryThere are many variantsHow tf is computed (with/without logs)Whether the terms in the query are also weighted

38Effect of idf on rankingDoes idf have an effect on ranking for one-term queries, likeiPhoneidf has no effect on ranking one term queriesidf affects the ranking of documents for queries with at least two termsFor the query capricious person, idf weighting makes occurrences of capricious count for much more in the final document ranking than occurrences of person.39Collection vs. Document frequencyThe collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences.Example:

Which word is a better search term (and should get a higher weight)?WordCollection frequencyDocument frequencyinsurance104403997try1042287604040Binary count weight matrix

Each document is now represented by a real-valued vector of tf-idf weights R|V|41Vector Space ModelDocument as vectorsSo we have a |V|-dimensional vector spaceTerms are axes of the spaceDocuments are points or vectors in this spaceQueries as vectorsDo the same for queries: represent them as vectors in the spaceRank documents according to their proximity to the query in this spaceproximity = similarity of vectorsproximity inverse of distance42cosine(query,document)

Dot productUnit vectorsqi is the tf-idf weight of term i in the querydi is the tf-idf weight of term i in the document

cos(q,d) is the cosine similarity of q and d or, equivalently, the cosine of the angle between q and d.4343Cosine similarity illustrated

44Cosine similarity amongst 3 documentstermSaSPaPWHaffection11558 20jealous10711gossip206wuthering0038How similar arethe novelsSaS: Sense andSensibilityPaP: Pride andPrejudice, andWH: WutheringHeights?Term frequencies (counts)Note: To simplify this example, we dont do idf weighting.45453 documents example contd.Log frequency weightingtermSaSPaPWHaffection3.062.762.30jealous2.001.852.04gossip1.3001.78wuthering002.58After length normalizationtermSaSPaPWHaffection0.7890.8320.524jealous0.5150.5550.465gossip0.33500.405wuthering000.588cos(SaS,PaP) 0.789 0.832 + 0.515 0.555 + 0.335 0.0 + 0.0 0.0 0.94cos(SaS,WH) 0.79cos(PaP,WH) 0.6946tf-idf weighting has many variants

Columns headed n are acronyms for weight schemes.Why is the base of the log in idf immaterial?474747Summary vector space rankingRepresent the query as a weighted tf-idf vectorRepresent each document as a weighted tf-idf vectorCompute the cosine similarity score for the query vector and each document vectorRank documents with respect to the query by scoreReturn the top K to the user48Probability ModelsRetrieval as classification

49Bayes ClassifierBayesDecisionRuleAdocumentDisrelevantifP(R|D)>P(NR|D)Estimating probabilitiesuseBayesRule

classifyadocumentasrelevantif

50Estimating P(D|R)Assume independence

Binaryindependencemodeldocument represented by a vector of binary features indicating term occurrence (or nonoccurrence)piisprobabilitythattermioccurs (i.e., has value 1)inrelevantdocument,siisprobabilityofoccurrenceinnonrelevantdocument

51BinaryIndependenceModel

52BinaryIndependenceModelScoring function is

Query provides information about relevant documentsIfweassumepi constant,si approximatedbyentirecollection,getidflikeweight

5353Contingency TableGives scoring function:

54BM25Popularandeffectiverankingalgorithmbasedonbinaryindependencemodeladdsdocumentandquerytermweights

k1,k2andKareparameterswhosevaluesaresetempirically dlisdoclengthTypicalvaluefork1is1.2,k2variesfrom0to1000,b=0.75

55fiqfi55BM25 ExampleQuerywithtwoterms,presidentlincoln,(qf=1)Norelevanceinformation(randRarezero)N=500,000documentspresidentoccursin40,000documents(n1=40,000)lincolnoccursin300documents(n2=300)presidentoccurs15timesindoc(f1=15)lincolnoccurs25times(f2=25)documentlengthis90%oftheaveragelength:dl/avdl=0.9k1=1.2,b=0.75,andk2=100K=1.2(0.25+0.750.9)=1.11

56BM25 Example

57BM25 Example Effectoftermfrequencies

58Retrieval Model OverviewOccurrence based modelsBoolean Retrieval Vector Space ModelProbability modelsBM25 Language ModelHyperlink-based modelsPageRankMachine learning based modelsLearning to Rank59PageRankLinkscanbeviewedasinformationaboutthepopularityofawebpage

60

Larry Page and Sergey BrinPageRankPageRank(PR)ofpageC = PR(A)/2 + PR(B)/1More generally,

whereBuisthesetofpagesthatpointtou, andLvisthenumberofoutgoinglinksfrompagev.

61CS 331 - Data Mining62Example (from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)YahooMsoftAmazony 1/2 1/2 0a 1/2 0 1m 0 1/2 0y a my = y /2 + a /2a = y /2 + mm = a /2r = Mr y 1/2 1/2 0 y a = 1/2 0 1 a m 0 1/2 0 mCS 331 - Data Mining63Power IterationSimple iterative scheme (aka relaxation)Suppose there are N web pagesInitialize: r0 = [1,.,1]TIterate: rk+1 = MrkStop when |rk+1 - rk|1 < |x|1 = 1iN|xi| is the L1 norm Can use any other vector norm e.g., Euclidean

63CS 331 - Data Mining64Power Iteration Example (from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)YahooMsoftAmazonya =m11113/21/25/4 1 3/49/822/241/26/56/53/5. . .y 1/2 1/2 0a 1/2 0 1m 0 1/2 0y a mr = Mr64PageRankDont know PageRank values at startAssume equal values (1/3 in this case), then iterate:first iteration: PR(C) = 0.33/2 + 0.33 = 0.5, PR(A) = 0.33, and PR(B) = 0.17second: PR(C) = 0.33/2 + 0.17 = 0.33, PR(A) = 0.5, PR(B) = 0.17third: PR(C) = 0.42, PR(A) = 0.33, PR(B) = 0.25Converges to PR(C) = 0.4, PR(A) = 0.4, and PR(B) = 0.265Matrix NotationMatrix Notationr = B r

PageRank is embedded in the eigenvector of B (normalized adjacency matrix) associated with the eigenvalue 1.Conditional probability

6666Matrix BPrinciple eigenvector

CS 331 - Data Mining67Spider trapsA group of pages is a spider trap if there are no links from within the group to outside the groupRandom surfer gets trappedSpider traps violate the conditions needed for the random walk theorem67CS 331 - Data Mining68Microsoft becomes a spider trap (from http://www.stanford.edu/c...

Recommended

View more >