word2vec cosine similarity greater than 1 (Arabic text)


Question


I have trained a word2vec model with gensim and I am getting the nearest neighbors for some words in the corpus. Here are the similarity scores:

 top neighbors for الاحتلال:
الاحتلال: 1.0000001192092896
الاختلال: 0.9541053175926208
الاهتلال: 0.872565507888794
الاحثلال: 0.8386293649673462
الاكتلال: 0.8209128379821777

It is odd to get a similarity greater than 1. I cannot apply any stemming to my text because it contains many OCR spelling mistakes (I got the text from OCR-ed documents). How can I fix this issue?

Note: I am using model.similarity(t1, t2)
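
As a hypothetical sketch (assuming the gensim 3.x API used in the training code below), a neighbor list like the one above could be produced by scoring the query word against the whole vocabulary with that same call; unlike model.wv.most_similar(), this includes the query word itself, which is why it shows up at ~1.0:

# Hypothetical sketch: score the query against the whole vocabulary with
# model.similarity(), then keep the top five scores.
query = 'الاحتلال'
scores = {w: model.similarity(query, w) for w in model.wv.index2word}
for w, s in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:5]:
    print('%s: %s' % (w, s))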

This is how I trained my Word2Vec Model:

    import os
    import time

    import gensim
    t1 = time.time()
    docs = read_files(TEXT_DIRS, nb_docs=5000)
    t2 = time.time()
    print('Reading docs took: {:.3f} mins'.format((t2 - t1) / 60))
    print('Number of documents: %i' % len(docs))

    # Training the model
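    # (Note: this is the gensim 3.x API; in gensim 4.x, 'size' became
    # 'vector_size' and 'index2word' became 'index_to_key'.)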
    model = gensim.models.Word2Vec(docs, size=EMBEDDING_SIZE, min_count=5)
    if not os.path.exists(MODEL_DIR):
        os.makedirs(MODEL_DIR)
    model.save(os.path.join(MODEL_DIR, 'word2vec'))

    weights = model.wv.vectors
    index_words = model.wv.index2word

    vocab_size = weights.shape[0]
    embedding_dim = weights.shape[1]

    print('Shape of weights:', weights.shape)
    print('Vocabulary size: %i' % vocab_size)
    print('Embedding size: %i' % embedding_dim)

Below is the read_files function I defined:

def read_files(text_directories, nb_docs):
    """
    Read in text files
    """
    documents = list()
    tokenize = lambda x: gensim.utils.simple_preprocess(x)
    print('started reading ...')
    for path in text_directories:
        count = 0
        # Read in all files in directory
        if os.path.isdir(path):
            all_files = os.listdir(path)
            for filename in all_files:
                if filename.endswith('.txt') and filename[0].isdigit():
                    count += 1
                    with open('%s/%s' % (path, filename), encoding='utf-8') as f:
                        doc = f.read()
                        doc = clean_text_arabic_style(doc)
                        doc = clean_doc(doc)
                        documents.append(tokenize(doc))
                        if count % 100 == 0:
                            print('processed {} files so far from {}'.format(count, path))
                # stop after reading nb_docs matching files from this directory
                if count >= nb_docs:
                    print('REACHED END')
                    break
        if count >= nb_docs:
            print('REACHED END')
            break

    return documents

I tried this thread, but it doesn't help me because my text is Arabic and full of misspellings.

Update: I tried the following (computing the similarity between a word and itself):

print(model.similarity('الاحتلال','الاحتلال'))

and it gave me the following result:

1.0000001

Answer 1:


Definitionally, the cosine-similarity measure should max at 1.0.

But in practice, floating-point numbers in computers have tiny imprecisions in their low-order digits. And, especially when a number of calculations happen in a row (as with the calculation of this cosine-similarity), those imprecisions will sometimes lead to slight deviations from what the expected maximum, or exactly-right, answer "should" be.

(Similarly: calculations that should, mathematically, give the exact same answer no matter how they are reordered or regrouped will sometimes deviate slightly when done in different orders.)
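
A tiny, self-contained demonstration of both effects (plain numpy/float arithmetic, nothing gensim-specific):

import numpy as np

# A unit-length float32 vector's dot product with itself can land a hair
# above (or below) exactly 1.0:
v = np.random.rand(300).astype(np.float32)
v /= np.linalg.norm(v)
print(v.dot(v))  # e.g. 1.0000001 on some runs

# Regrouping a mathematically identical sum changes the result slightly:
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False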

But, as these representational errors are typically very small, they're usually of no practical concern. (The absolute errors are especially small for numbers in the range from -1.0 to 1.0, but can become quite large when dealing with very large numbers.)

In your original case, the deviation is just 0.0000001192092896; in the word-to-itself case, just 0.0000001. That is, about one ten-millionth off. (Your other, sub-1.0 values have similarly tiny deviations from the perfect calculation; they just aren't noticeable there.)

In most cases, you should just ignore it.

If you find it distracting in numerical displays or logging, simply displaying such values with a limited number of after-the-decimal-point digits – say 4, 5, or 6 – will hide the noisy digits. For example, using a Python 3 format string:

sim = model.similarity('الاحتلال','الاحتلال')
print(f"{sim:.6f}")  # -> 1.000000

(Libraries like numpy that work with large arrays of such floats can even set a global default for display precision – see numpy.set_printoptions – though that doesn't affect the raw Python floats you're examining.)
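
For example:

import numpy as np

np.set_printoptions(precision=4)       # display-only; stored values unchanged
print(np.array([1.0000001192092896]))  # -> [1.]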

If for some reason you absolutely need the values capped at 1.0, you could add extra code to do that. But it's usually a better idea to make your tests and printouts robust to, and oblivious of, such tiny deviations from perfect math.
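
If you do go that route, a minimal sketch of capping (Python's min() for a single score, numpy.clip for an array of scores):

import numpy as np

# Clamp one score to the valid cosine-similarity range:
sim = min(model.similarity('الاحتلال', 'الاحتلال'), 1.0)

# Or clamp a whole array of scores at once:
scores = np.array([1.0000001, 0.9541053, 0.8725655])
print(np.clip(scores, -1.0, 1.0))  # the 1.0000001 becomes exactly 1.0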



Source: https://stackoverflow.com/questions/65311534/word2vec-cosine-similarity-greater-than-1-arabic-text
