Convert word2vec bin file to text

From the word2vec site I can download GoogleNews-vectors-negative300.bin.gz. The .bin file (about 3.4GB) is a binary format not useful to me. Tomas Mikolov assures us that "It should be fairly straightforward to convert the binary format to text format (though that will take more disk space). Check the code in the distance tool, it's rather trivial to read the binary file." Unfortunately, I don't know enough C to understand http://word2vec.googlecode.com/svn/trunk/distance.c.

Supposedly gensim can do this also, but all the tutorials I've found seem to be about converting from text, not the other way.

Can someone suggest modifications to the C code or instructions for gensim to emit text?

I use this code to load binary model, then save the model to text file,

from gensim.models.keyedvectors import KeyedVectors

model = KeyedVectors.load_word2vec_format('path/to/GoogleNews-vectors-negative300.bin', binary=True)
model.save_word2vec_format('path/to/GoogleNews-vectors-negative300.txt', binary=False)

References: API and nullege.

Note:

Above code is for new version of gensim. For previous version, I used this code:

from gensim.models import word2vec

model = word2vec.Word2Vec.load_word2vec_format('path/to/GoogleNews-vectors-negative300.bin', binary=True)
model.save_word2vec_format('path/to/GoogleNews-vectors-negative300.txt', binary=False)

On the word2vec-toolkit mailing list Thomas Mensink has provided an answer in the form of a small C program that will convert a .bin file to text. This is a modification of the distance.c file. I replaced the original distance.c with Thomas's code below and rebuilt word2vec (make clean; make), and renamed the compiled distance to readbin. Then ./readbin vector.bin will create a text version of vector.bin.

//  Copyright 2013 Google Inc. All Rights Reserved.
//
//  Licensed under the Apache License, Version 2.0 (the "License");
//  you may not use this file except in compliance with the License.
//  You may obtain a copy of the License at
//
//      http://www.apache.org/licenses/LICENSE-2.0
//
//  Unless required by applicable law or agreed to in writing, software
//  distributed under the License is distributed on an "AS IS" BASIS,
//  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
//  See the License for the specific language governing permissions and
//  limitations under the License.

#include <stdio.h>
#include <string.h>
#include <math.h>
#include <malloc.h>

const long long max_size = 2000;         // max length of strings
const long long N = 40;                  // number of closest words that will be shown
const long long max_w = 50;              // max length of vocabulary entries

int main(int argc, char **argv) {
  FILE *f;
  char file_name[max_size];
  float len;
  long long words, size, a, b;
  char ch;
  float *M;
  char *vocab;
  if (argc < 2) {
    printf("Usage: ./distance <FILE>\nwhere FILE contains word projections in the BINARY FORMAT\n");
    return 0;
  }
  strcpy(file_name, argv[1]);
  f = fopen(file_name, "rb");
  if (f == NULL) {
    printf("Input file not found\n");
    return -1;
  }
  fscanf(f, "%lld", &words);
  fscanf(f, "%lld", &size);
  vocab = (char *)malloc((long long)words * max_w * sizeof(char));
  M = (float *)malloc((long long)words * (long long)size * sizeof(float));
  if (M == NULL) {
    printf("Cannot allocate memory: %lld MB    %lld  %lld\n", (long long)words * size * sizeof(float) / 1048576, words, size);
    return -1;
  }
  for (b = 0; b < words; b++) {
    fscanf(f, "%s%c", &vocab[b * max_w], &ch);
    for (a = 0; a < size; a++) fread(&M[a + b * size], sizeof(float), 1, f);
    len = 0;
    for (a = 0; a < size; a++) len += M[a + b * size] * M[a + b * size];
    len = sqrt(len);
    for (a = 0; a < size; a++) M[a + b * size] /= len;
  }
  fclose(f);
  //Code added by Thomas Mensink
  //output the vectors of the binary format in text
  printf("%lld %lld #File: %s\n",words,size,file_name);
  for (a = 0; a < words; a++){
    printf("%s ",&vocab[a * max_w]);
    for (b = 0; b< size; b++){ printf("%f ",M[a*size + b]); }
    printf("\b\b\n");
  }  

  return 0;
}

I removed the "\b\b" from the printf.

By the way, the resulting text file still contained the text word and some unnecessary whitespace which I did not want for some numerical calculations. I removed the initial text column and the trailing blank from each line with bash commands.

cut --complement -d ' ' -f 1 GoogleNews-vectors-negative300.txt > GoogleNews-vectors-negative300_tuples-only.txt
sed 's/ $//' GoogleNews-vectors-negative300_tuples-only.txt

the format is IEEE 754 single-precision binary floating-point format: binary32 http://en.wikipedia.org/wiki/Single-precision_floating-point_format They use little-endian.

Let do an example:

First line is string format: "3000000 300\n" (vocabSize & vecSize, getByte till byte=='\n')
Next line include the vocab word first, and then (300*4 byte of float value, 4 byte for each dimension):
```
getByte till byte==32 (space). (60 47 115 62 32 => <\s>[space])
```
then each next 4 byte will represent one float number

next 4 byte: 0 0 -108 58 => 0.001129150390625.

You can check the wikipedia link to see how, let me do this one as example:

(little-endian -> reverse order) 00111010 10010100 00000000 00000000

first is sign bit => sign = 1 (else = -1)
next 8 bits => 117 => exp = 2^(117-127)
next 23 bits => pre = 0*2^(-1) + 0*2^(-2) + 1*2^(-3) + 1*2^(-5)

value = sign * exp * pre

You can load the binary file in word2vec, and then save the text version like this:

from gensim.models import word2vec
 model = word2vec.Word2Vec.load_word2vec_format('Path/to/GoogleNews-vectors-negative300.bin', binary=True)
 model.save("file.txt")

I am using gensim to work with the GoogleNews-vectors-negative300.bin and I am including a binary = True flag while loading the model.

from gensim import word2vec

model = word2vec.Word2Vec.load_word2vec_format('Path/to/GoogleNews-vectors-negative300.bin', binary=True)

Seems to be working fine.

I had a similar issue, I wanted to get bin/non-bin(gensim) models output as CSV.

here is the code which does that on python, it assumes you have gensim installed:

https://gist.github.com/dav009/10a742de43246210f3ba

Here is the code I use:

import codecs
from gensim.models import Word2Vec

def main():
    path_to_model = 'GoogleNews-vectors-negative300.bin'
    output_file = 'GoogleNews-vectors-negative300_test.txt'
    export_to_file(path_to_model, output_file)


def export_to_file(path_to_model, output_file):
    output = codecs.open(output_file, 'w' , 'utf-8')
    model = Word2Vec.load_word2vec_format(path_to_model, binary=True)
    print('done loading Word2Vec')
    vocab = model.vocab
    for mid in vocab:
        #print(model[mid])
        #print(mid)
        vector = list()
        for dimension in model[mid]:
            vector.append(str(dimension))
        #line = { "mid": mid, "vector": vector  }
        vector_str = ",".join(vector)
        line = mid + "\t"  + vector_str
        #line = json.dumps(line)
        output.write(line + "\n")
    output.close()

if __name__ == "__main__":
    main()
    #cProfile.run('main()') # if you want to do some profiling

convertvec is a small tool to convert vectors between different formats for the word2vec library.

Convert vectors from binary to plain text:

./convertvec bin2txt input.bin output.txt

Convert vectors from plain text to binary:

./convertvec txt2bin input.txt output.bin

If you get the Error:

ImportError: No module named models.word2vec

then it is because there was an API update. This will work:

from gensim.models.keyedvectors import KeyedVectors

model = KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)
model.save_word2vec_format('./GoogleNews-vectors-negative300.txt', binary=False)

Just a quick update as now there is easier way.

If you are using word2vec from https://github.com/dav/word2vec there is additional option called -binary which accept 1 to generate binary file or 0 to generate text file. This example comes from demo-word.sh in the repo:

time ./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 0 -iter 15

来源：https://stackoverflow.com/questions/27324292/convert-word2vec-bin-file-to-text

标签

python

gensim

word2vec