IDF – Inverse Document Frequency

Inverse document frequency (IDF) is used to remove the false weight that popular terms would otherwise carry across the index (all documents).

Example: given a document like “quick brown fox”, the word “brown” probably occurs in many more documents in the index than the word “fox”, so it is much more common. In a search query like “brown fox”, the word “fox” should therefore be weighted more heavily than the word “brown”. To achieve this, we use the IDF.

IDF is also used to eliminate the false weighting of overly popular stop words like “the”, “a”, “and”, “for”… These are mostly words that are not important for the search results.

We use a formula like

[Image: formula]

IDF(W) = log[ total number of documents / (number of documents containing word W + 1) ]

c – count, w – word, q – query, M – total number of documents in the index, df – document frequency (the number of documents that contain the word)

  • The +1 in the denominator prevents division by 0 when no document contains the word.
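The formula above can be tried out on a toy index. This is only a minimal sketch: the `idf` function name, the whitespace tokenisation, and the eight sample documents are assumptions made for illustration, not part of any particular search engine.

```python
import math


def idf(word, documents):
    """IDF(W) = log(M / (df(W) + 1)), following the formula above."""
    total_docs = len(documents)  # M: total number of documents in the index
    # df(W): number of documents that contain the word at least once
    docs_with_word = sum(1 for doc in documents if word in doc.split())
    return math.log(total_docs / (docs_with_word + 1))


# Hypothetical index: "brown" appears in four documents, "fox" in one.
documents = [
    "quick brown fox",
    "brown bear",
    "brown door",
    "brown ground",
    "green tree",
    "red door",
    "blue sky",
    "white cat",
]

idf_brown = idf("brown", documents)  # log(8 / 5)
idf_fox = idf("fox", documents)      # log(8 / 2)
print(f"brown: {idf_brown:.3f}, fox: {idf_fox:.3f}")
```

In this toy index the rarer word “fox” gets the larger weight. Note that with this variant of the formula, a word that appears in every document gets a slightly negative value, because df + 1 is then larger than the total number of documents.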

On a graph, it is easy to see that there is a turning point in the curve beyond which the occurrence of a word no longer matters, because the word is too common.

[Image: graph of IDF against how common a word is]

It is very common to strip most of these overly common words (stop words) when indexing.
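Stripping such words at indexing time could look roughly like this. The `STOP_WORDS` set and the `index_terms` helper are hypothetical and only cover a handful of words; real analyzers ship with longer, language-specific lists.

```python
# Illustrative stop word list, not a standard one.
STOP_WORDS = {"the", "a", "and", "for", "of", "on"}


def index_terms(text):
    """Lowercase the text, split on whitespace and drop stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]


terms = index_terms("picture of quick brown cat on a brown ground")
print(terms)
```

Only the remaining terms would be stored in the index, so the stop words never contribute to a score at all.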

The disadvantage of this approach is how it handles words that are repeated within the same document.

Example: the document “picture of quick brown cat on a brown ground after a brown tree on a brown door saw a fox” contains the word “brown” 4 times. For the search query “brown fox”, this document would get a much higher rank than a document containing only the text “brown fox”, even though the shorter document matches the query exactly. This problem is usually addressed with the BM25 approach, which limits the contribution of repeated terms and takes document length into account.
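The effect can be illustrated with a small comparison between a score based on raw word counts and a BM25-style score. This is a rough sketch under several assumptions: the IDF values are made up, `k1 = 1.2` and `b = 0.75` are merely commonly used defaults, and the average document length is computed over just these two documents.

```python
def raw_score(query, doc, idf):
    """Sum of (raw word count in the document * IDF) over the query words."""
    words = doc.split()
    return sum(words.count(w) * idf[w] for w in query.split())


def bm25_score(query, doc, idf, avg_len, k1=1.2, b=0.75):
    """BM25-style score: saturating term frequency plus length normalisation."""
    words = doc.split()
    # Longer-than-average documents get a larger normaliser, hence lower scores.
    norm = k1 * (1 - b + b * len(words) / avg_len)
    score = 0.0
    for w in query.split():
        tf = words.count(w)
        score += idf[w] * tf * (k1 + 1) / (tf + norm)
    return score


long_doc = ("picture of quick brown cat on a brown ground after a brown tree "
            "on a brown door saw a fox")
short_doc = "brown fox"
idf = {"brown": 0.5, "fox": 1.4}  # assumed IDF values, for illustration only
avg_len = (len(long_doc.split()) + len(short_doc.split())) / 2

for name, doc in (("long", long_doc), ("short", short_doc)):
    print(name,
          round(raw_score("brown fox", doc, idf), 3),
          round(bm25_score("brown fox", doc, idf, avg_len), 3))
```

With raw counts the long document wins because “brown” is counted four times. With the BM25-style score the repeated occurrences add less and less, and the short document that matches the query exactly comes out on top.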

 
