Inverse document frequency (IDF) removes the misleading weight that very common terms would otherwise carry across the index (all documents).
Example: in a document such as “quick brown fox”, the word “brown” is probably far more frequent across the index than the word “fox”, so it is likely the more common term. For a search query like “brown fox”, the word “fox” should therefore be weighted more heavily than the word “brown”. IDF achieves exactly this.
IDF also cancels the false weight of overly common stop words such as “the”, “a”, “and”, “for”, and so on. These are mostly words that do not matter for the hit results.
We use a formula like

IDF(w) = log[M / (df(w) + 1)]

where w is the query word, M is the total number of documents in the index, and df(w) is the number of documents containing w.
- The +1 in the denominator prevents division by zero when a word appears in no document.
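A minimal sketch of this formula in Python, using a toy in-memory corpus (the document texts and the simple whitespace tokenization are illustrative assumptions, not a real indexer):

```python
import math

def idf(word, documents):
    """Smoothed inverse document frequency: log(M / (df + 1))."""
    M = len(documents)  # total number of documents in the index
    # df: number of documents that contain the word at least once
    df = sum(word in doc.split() for doc in documents)
    return math.log(M / (df + 1))

docs = [
    "quick brown fox",
    "brown bear",
    "brown door",
    "lazy dog",
]

# "brown" appears in 3 of 4 documents, "fox" in only 1,
# so "fox" receives the larger weight.
print(idf("brown", docs) < idf("fox", docs))  # True
```

Note that with the +1 smoothing, a word occurring in (nearly) every document gets a weight at or below zero, which is exactly the “too common to matter” behaviour described above.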
On a graph, it is easy to see a turning point in the curve beyond which the occurrence of a word no longer matters because it is too common.
It is very common to strip most of this class of words when indexing.
The disadvantage of this approach is that it ignores words repeated within the same document. Example: the document “picture of quick brown cat on a brown ground after a brown tree on a brown door saw a fox” contains the word “brown” four times. For the search query “brown fox”, this document should rank much higher than a document containing only the text “brown fox”, yet IDF alone scores them the same. This problem is usually fixed with the BM25 approach, which combines IDF with a saturated term-frequency component.
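A minimal sketch of the BM25 term-frequency component, which rewards repetition but with diminishing returns (so the fourth “brown” adds far less than the first). The parameter values k1, b, and the document lengths here are illustrative defaults, not values from the source:

```python
def bm25_tf(tf, k1=1.2, b=0.75, dl=20, avgdl=10):
    """BM25 term-frequency component.

    tf    - raw count of the term in the document
    k1    - saturation parameter (caps the contribution at k1 + 1)
    b     - length-normalization strength
    dl    - length of this document, avgdl - average document length
    """
    return (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * dl / avgdl))

# Each additional occurrence of "brown" contributes less than the last:
for tf in (1, 2, 3, 4):
    print(round(bm25_tf(tf), 3))
```

The contribution grows with tf but never exceeds k1 + 1, so a document that merely repeats “brown” many times cannot dominate the ranking; in full BM25 this per-term value is multiplied by the term's IDF and summed over the query terms.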