I would like to load a big text file (around 1 GB, with 3*10^6 rows and 10 to 100 columns) as a 2D NumPy array containing strings. However, it seems like numpy.loadtxt() only takes floats by default.
There is also read_csv in Pandas, which is fast and supports non-comma column separators and automatic typing by column:
import pandas as pd
df = pd.read_csv('your_file', sep='\t')
It can be converted to a NumPy array if you prefer that type with:
import numpy as np
arr = np.array(df)
This is by far the easiest and most mature text import approach I've come across.
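For the string-array case in the question, a minimal sketch (assuming a whitespace-delimited file with no header row; 'your_file' is a placeholder name):

import pandas as pd

# Read every column as a string; sep=r'\s+' splits on runs of whitespace,
# header=None keeps the first data row from being treated as column names.
df = pd.read_csv('your_file', sep=r'\s+', header=None, dtype=str)
arr = df.to_numpy()   # 2D NumPy array of strings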
Do you really need a NumPy array? Otherwise, you could speed things up by loading the data as a nested list of strings.
def load(fname):
    '''Load the file using the standard open().'''
    data = []
    with open(fname, 'r') as f:
        for line in f:                            # stream the file line by line
            data.append(line.rstrip('\n').split(' '))
    return data
For a text file with 4000x4000 words this is about 10 times faster than loadtxt.
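If you do end up needing a NumPy array after all, the nested list converts directly; a minimal usage sketch, assuming every row has the same number of fields:

import numpy as np

data = load('your_file')   # nested list of strings from load() above
arr = np.array(data)       # 2D string array when all rows have equal length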
Use genfromtxt instead. It's a much more general method than loadtxt:
import numpy as np
print(np.genfromtxt('col.txt', dtype='str'))
Using the file col.txt:
foo bar
cat dog
man wine
This gives:
[['foo' 'bar']
['cat' 'dog']
['man' 'wine']]
If you expect every row to have the same number of columns, you can read the first row to determine that count and pass the filling_values argument to genfromtxt to fill in any missing entries.
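A minimal sketch of that idea (the empty-string placeholder and the reuse of col.txt are assumptions, not part of the answer above):

import numpy as np

# Determine the expected number of columns from the first row
with open('col.txt') as f:
    n_cols = len(f.readline().split())

# filling_values supplies a placeholder for entries genfromtxt detects as missing
arr = np.genfromtxt('col.txt', dtype='str', usecols=range(n_cols),
                    filling_values='')
print(arr)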