A Hamming distance is only defined between two strings of equal length, and it takes the order of the characters into account.
Since your documents are almost certainly of different lengths, and if the positions of the words do not matter, cosine similarity is a better fit (note that, depending on your needs, better solutions may exist). :)
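To make the equal-length constraint concrete, here is a minimal sketch of a Hamming distance in PHP. The name hammingDistance() is a hypothetical helper written for this illustration, not a PHP built-in, and it compares byte by byte, so it assumes single-byte (e.g. ASCII) input:

```php
<?php
// Hypothetical helper (not a PHP built-in): counts the positions
// at which two strings of equal length differ, byte by byte.
function hammingDistance(string $a, string $b): int
{
    if (strlen($a) !== strlen($b)) {
        // The metric is undefined for strings of different lengths.
        throw new InvalidArgumentException('Strings must have equal length.');
    }
    $distance = 0;
    for ($i = 0, $n = strlen($a); $i < $n; $i++) {
        if ($a[$i] !== $b[$i]) {
            $distance++;
        }
    }
    return $distance;
}

echo hammingDistance('karolin', 'kathrin'), "\n"; // prints 3
```

Two documents of different lengths would simply hit the exception, which is why this metric is a poor match for comparing free text.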
Here is a cosine similarity function for two arrays of words:
function cosineSimilarity($tokensA, $tokensB)
{
    // $a = tokens shared by both, $b = distinct tokens in A, $c = distinct tokens in B
    $a = $b = $c = 0;
    $uniqueTokensA = $uniqueTokensB = array();
    $uniqueMergedTokens = array_unique(array_merge($tokensA, $tokensB));
    // Store tokens as array keys so that membership tests can use isset()
    foreach ($tokensA as $token) $uniqueTokensA[$token] = 0;
    foreach ($tokensB as $token) $uniqueTokensB[$token] = 0;
    foreach ($uniqueMergedTokens as $token) {
        $x = isset($uniqueTokensA[$token]) ? 1 : 0;
        $y = isset($uniqueTokensB[$token]) ? 1 : 0;
        $a += $x * $y;
        $b += $x;
        $c += $y;
    }
    // Return 0 when either array is empty, to avoid dividing by zero
    return $b * $c != 0 ? $a / sqrt($b * $c) : 0;
}
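It is fast: checking array keys with isset() instead of searching values with in_array() makes a huge difference on large arrays, because in_array() scans the whole array on every call.
As you can see, the result does not take into account the "magnitude" of each word: a word counts once, no matter how many times it appears.
I use it to detect multi-posted messages made of "almost" copy-pasted text. It works well. :)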
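If the number of occurrences of each word matters to you, one possible variant weights every word by its count. The sketch below is an illustration only, not the function I use: weightedCosineSimilarity() and tokenize() are hypothetical names, and the tokenizer assumes plain ASCII text split on non-word characters.

```php
<?php
// Hypothetical tokenizer: lowercases the text and splits it on
// runs of non-word characters (assumes ASCII input).
function tokenize(string $text): array
{
    return preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

// Hypothetical variant: cosine similarity over word counts,
// so that repeated words weigh more than words used once.
function weightedCosineSimilarity(array $tokensA, array $tokensB): float
{
    $countsA = array_count_values($tokensA);
    $countsB = array_count_values($tokensB);

    // Dot product over the words present in both arrays.
    $dot = 0;
    foreach ($countsA as $token => $count) {
        if (isset($countsB[$token])) {
            $dot += $count * $countsB[$token];
        }
    }

    // Squared norms of the two count vectors.
    $normA = 0;
    foreach ($countsA as $count) $normA += $count * $count;
    $normB = 0;
    foreach ($countsB as $count) $normB += $count * $count;

    return ($normA > 0 && $normB > 0) ? $dot / sqrt($normA * $normB) : 0.0;
}

$a = tokenize('the cat sat on the mat');
$b = tokenize('the cat sat on the hat');
echo weightedCosineSimilarity($a, $b), "\n"; // prints 0.875
```

For the same two sentences the binary version above gives 0.8 (4 shared words out of 5 distinct words on each side), so the repeated "the" is what pulls the weighted score up.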
The best link about string similarity metrics:
http://www.dcs.shef.ac.uk/~sam/stringmetrics.html
Further interesting reading:
http://www.miislita.com/information-retrieval-tutorial/cosine-similarity-tutorial.html
http://bioinformatics.oxfordjournals.org/cgi/content/full/22/18/2298