The similarity between two documents is usually calculated using a vectorized document representation in the space of terms. Terms are single words, phrases, etc. The vector components are per-term measures for the document.
One of the best-known techniques for calculating these measures, or term weights, is TF-IDF. There are a number of options for exactly how to calculate TF and IDF.
The similarity measure between two documents can then be based on those documents' vectors.
There are a number of formulas for calculating it: Euclidean distance, cosine similarity, the Jaccard coefficient, the Pearson correlation coefficient, and a few others.
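To make the term weighting concrete, here is a minimal Python sketch that builds sparse TF-IDF vectors for a toy corpus. It uses raw counts for TF and log(N/df) for IDF, which is just one of the possible variants mentioned above; the function name tf_idf_vectors and the toy corpus are purely illustrative.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Build sparse TF-IDF vectors (term -> weight dicts) for tokenized documents.

    TF = raw term count, IDF = log(N / df): one of many possible weighting variants.
    """
    n_docs = len(docs)
    df = Counter()                       # term -> number of docs containing it
    for doc in docs:
        df.update(set(doc))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    return [{term: tf * idf[term] for term, tf in Counter(doc).items()}
            for doc in docs]

docs = [["big", "data", "similarity"],
        ["distributed", "data", "processing"],
        ["cooking", "recipes"]]
vectors = tf_idf_vectors(docs)   # e.g. vectors[0]["data"] == 1 * log(3/2)
```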
Here I'd like to describe an algorithm for calculating pairwise similarity for a big corpus of documents. Given a corpus of millions of articles, it is very time-consuming to calculate each-to-each similarity for the entire set.
Initially we have only (document -> its terms) data. What additional data do we need to calculate the similarity between any two documents? The IDF factor measures the relevance of each particular term across the entire document set. Computing it requires the number of documents in which a given term occurs and the total number of documents, so it is a context-sensitive parameter. We need to walk through the entire data set and gather per-term statistics before we can apply the similarity formula to any document pair. It is sometimes desirable to have term statistics for different document subsets (contexts) inside the global set, so that similarities can later be calculated for different contexts.
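A sketch of that statistics-gathering pass, assuming the input is a stream of (doc_id, context_id, terms) records; the record layout and the function name collect_term_stats are assumptions made for illustration:

```python
from collections import defaultdict

def collect_term_stats(records):
    """One pass over (doc_id, context_id, terms) records to gather the statistics
    needed for IDF: the number of docs per context and, for each term, the number
    of docs per context in which the term occurs."""
    docs_per_context = defaultdict(int)                      # context_id -> #docs
    term_doc_counts = defaultdict(lambda: defaultdict(int))  # term -> {context_id -> #docs}
    for doc_id, context_id, terms in records:
        docs_per_context[context_id] += 1
        for term in set(terms):          # count each term at most once per document
            term_doc_counts[term][context_id] += 1
    return docs_per_context, term_doc_counts
```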
In a nutshell, for each document pair we need to calculate the sum of TF-IDF weight products over the terms that co-occur in both documents:
Similarity(doc1, doc2) = sum_over_common_terms ( weight(term, doc1) * weight(term, doc2) ) / normalization(doc1, doc2)
where the normalization is often just the product of the vectors' lengths.
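A direct sparse-vector implementation of this formula, assuming each document vector is a dict of term -> TF-IDF weight and the normalization is the product of the two vector lengths (which makes this cosine similarity), might look like this:

```python
import math

def similarity(vec1, vec2):
    """Similarity(doc1, doc2): sum of weight products over common terms,
    divided by the product of the two vector lengths."""
    common_terms = set(vec1) & set(vec2)
    numerator = sum(vec1[term] * vec2[term] for term in common_terms)
    normalization = (math.sqrt(sum(w * w for w in vec1.values())) *
                     math.sqrt(sum(w * w for w in vec2.values())))
    return numerator / normalization if normalization else 0.0
```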
As one can see, this formula is mostly additive. This gives us a natural way to parallelize the process: we can calculate the contributions of different term sets to the similarity and merge them together as the final step.
The entire process can be parallelized using the map-reduce paradigm.
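A toy illustration of that additivity, under the assumption that the term vocabulary is split into partitions (one per processing node): each partition contributes a partial dot product and partial squared norms, and the final similarity is obtained by merging them.

```python
import math

def partial_contributions(vec1, vec2, term_partition):
    """Contribution of one term partition: a partial dot product and
    partial squared norms for the two documents."""
    dot = sum(vec1[t] * vec2[t] for t in term_partition if t in vec1 and t in vec2)
    sq_norm1 = sum(vec1[t] ** 2 for t in term_partition if t in vec1)
    sq_norm2 = sum(vec2[t] ** 2 for t in term_partition if t in vec2)
    return dot, sq_norm1, sq_norm2

def merge_contributions(parts):
    """Merge the per-partition contributions into the final similarity value."""
    dot = sum(p[0] for p in parts)
    norm = math.sqrt(sum(p[1] for p in parts)) * math.sqrt(sum(p[2] for p in parts))
    return dot / norm if norm else 0.0

# Each partition can be processed on a different node; merging is the final step:
# parts = [partial_contributions(v1, v2, partition) for partition in term_partitions]
# sim = merge_contributions(parts)
```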
The overall flow and the data calculation steps for two processing nodes are shown in the following scheme:
Here [...] denotes a collection.
The main steps are:
1. Partition the incoming data among processing nodes according to some partitioning scheme (ideally each processing node gets an equal amount of data).
2. On each processing node process the document terms sequentially, aggregating data into maps:
[Context -> #docs]
[Term -> [context id, total # of docs with this term]]
[Term -> [doc id, frequency]]
3. Redistribute this data among processing nodes (for the 'reduce' step) based on some partitioning scheme for term ids; in other words, the first map step emits terms as keys for the reduce step. Context statistics (how many docs are in each context) can simply be copied to every processing node, as the amount of this data is usually quite small.
4. On each processing node merge the incoming data maps (there will be 3 different maps). All the data for each term is now located locally on the processing nodes.
5. For each term calculate its contribution to the similarity value for the document set where it occurs (see the sketch after this list). Upon completion of this step we have a map
[doc_i -> [doc_j, similarity contribution, normalization contribution]]
where similarity takes into account only the terms 'local' to the current node.
6. Drain all the results to a single node sequentially and merge the partial results.
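A rough sketch of step 5, assuming each term arrives at its reduce node together with its postings list [doc id, frequency], the number of documents containing it, and the total number of documents in the context (all function and variable names here are illustrative, not tied to any particular engine):

```python
import math
from collections import defaultdict
from itertools import combinations

def term_contributions(postings, docs_with_term, total_docs):
    """Step 5 for a single term: its contribution to the similarity of every
    document pair in which it occurs, plus its contribution to each document's
    normalization. postings is a list of (doc_id, term_frequency) pairs."""
    idf = math.log(total_docs / docs_with_term)
    weights = {doc_id: tf * idf for doc_id, tf in postings}
    norm_contrib = {doc_id: w * w for doc_id, w in weights.items()}
    pair_contrib = {(d1, d2): weights[d1] * weights[d2]
                    for d1, d2 in combinations(sorted(weights), 2)}
    return pair_contrib, norm_contrib

def merge_partial_results(partials):
    """Final step: merge the partial maps drained from all processing nodes."""
    pair_totals, norm_totals = defaultdict(float), defaultdict(float)
    for pair_contrib, norm_contrib in partials:
        for pair, value in pair_contrib.items():
            pair_totals[pair] += value
        for doc_id, value in norm_contrib.items():
            norm_totals[doc_id] += value
    return pair_totals, norm_totals
```

The similarity of a pair (doc_i, doc_j) is then pair_totals[(doc_i, doc_j)] / sqrt(norm_totals[doc_i] * norm_totals[doc_j]), computed once everything has been merged on a single node.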
Note that steps 1-4 don't depend on processing all the data from the previous steps, unlike step 5 (which needs term statistics that are only available once step 4 has completed for the entire data corpus).
This fact makes it possible to use a pipelining approach for steps 1-4 above.
One of the best-known map-reduce engines is Hadoop.
Unfortunately, Hadoop's map-reduce implementation doesn't have this useful pipelining ability.
There is an independent branch project, Hadoop Online Prototype, which is capable of such pipelining.
Either way, step 5 still has to wait until all of the data preprocessing is complete.
It is a common situation that new content arrives from time to time and our similarity matrix needs to be updated and recalculated.
It is a big challenge to keep it up-to-date. But that is another story.