Faster huge data-import than Get["raggedmatrix.mx"]?

Szabolcs

Here's an idea:

You said you have a ragged matrix, i.e. a list of lists of different lengths. I'm assuming floating point numbers.

You could flatten the matrix to get a single long packed 1D array (use Developer`ToPackedArray to pack it if necessary), and store the starting indexes of the sublists separately. Then reconstruct the ragged matrix after the data has been imported.
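For instance, a minimal sketch of the storage step (the file name flat.mx and the toy data are only illustrative): flatten a ragged matrix, record the sublist lengths, pack the flat list, and save both symbols to an MX file.

ragged = RandomReal[1, #] & /@ RandomInteger[{1, 10}, 1000];  (* toy ragged data *)
lens = Length /@ ragged;                                      (* sublist lengths *)
flat = Developer`ToPackedArray@Flatten[ragged];               (* packed 1D array *)
DumpSave["flat.mx", {flat, lens}];                            (* save both symbols *)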


Here's a demonstration that within Mathematica (i.e. after import), extracting the sublists from a big flattened list is fast.

data = RandomReal[1, 10000000];

indexes = Union@RandomInteger[{1, 10000000}, 10000];    
ranges = #1 ;; (#2 - 1) & @@@ Partition[indexes, 2, 1];

data[[#]] & /@ ranges; // Timing

{0.093, Null}

Alternatively, store the sequence of sublist lengths and use Mr.Wizard's dynamicPartition function, which does exactly this. My point is that storing the data in a flat format and partitioning it in-kernel adds negligible overhead.
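If you don't have dynamicPartition at hand, here is a minimal sketch (not Mr.Wizard's original) of partitioning a flat list by a list of sublist lengths:

dynP[flat_, lens_] :=
  MapThread[Take[flat, {##}] &, {Accumulate[lens] - lens + 1, Accumulate[lens]}]

dynP[Range[10], {2, 3, 5}]
(* {{1, 2}, {3, 4, 5}, {6, 7, 8, 9, 10}} *)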


Importing packed arrays from MX files is very fast. I only have 2 GB of memory, so I cannot test on very large files, but the import times are always a fraction of a second for packed arrays on my machine. This also avoids the problem that importing unpacked data can be slower (although, as I said in the comments on the main question, I cannot reproduce the kind of extreme slowness you mention).
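Continuing the flat.mx sketch above, the import step would then look like this (exact timings will vary; the point is that Get plus in-kernel partitioning is cheap):

Get["flat.mx"]; // AbsoluteTiming   (* restores the symbols flat and lens *)
ragged = dynP[flat, lens];          (* rebuild the ragged matrix in-kernel *)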


If BinaryReadList were fast (it isn't as fast as reading MX files now, but it looks like it will be significantly sped up in Mathematica 9), you could store the whole dataset as one big binary file, without the need to break it into separate MX files. Then you could import the relevant parts of the file like this:

First make a test file:

In[3]:= f = OpenWrite["test.bin", BinaryFormat -> True]

In[4]:= BinaryWrite[f, RandomReal[1, 80000000], "Real64"]; // Timing
Out[4]= {9.547, Null}

In[5]:= Close[f]

Open it:

In[6]:= f = OpenRead["test.bin", BinaryFormat -> True]    

In[7]:= StreamPosition[f]

Out[7]= 0

Skip the first 5 million entries:

In[8]:= SetStreamPosition[f, 5000000*8]

Out[8]= 40000000

Read 5 million entries:

In[9]:= BinaryReadList[f, "Real64", 5000000] // Length // Timing    
Out[9]= {0.609, 5000000}

Read all the remaining entries:

In[10]:= BinaryReadList[f, "Real64"] // Length // Timing    
Out[10]= {7.782, 70000000}

In[11]:= Close[f]

(For comparison, Get usually reads the same data from an MX file in less than 1.5 seconds here. I am on WinXP btw.)
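Combining this with the per-sublist offsets from the first approach, a helper along these lines (the name readSublist and its calling convention are mine, not from the original post) would read a single sublist straight from the binary file:

readSublist[file_, offset_, len_] :=
  Module[{f = OpenRead[file, BinaryFormat -> True], out},
    SetStreamPosition[f, 8 offset];           (* 8 bytes per "Real64" element *)
    out = BinaryReadList[f, "Real64", len];
    Close[f];
    out]

readSublist["test.bin", 5000000, 100]   (* 100 reals starting at element 5000000 *)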


EDIT If you are willing to spend time on this and write some C code, another idea is to create a library function (using LibraryLink) that memory-maps the file (memory-mapped files are also available on Windows) and copies it directly into an MTensor object (an MTensor is just a packed Mathematica array, as seen from the C side of LibraryLink).
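On the Mathematica side such a function would be loaded with LibraryFunctionLoad. A minimal sketch, assuming you have compiled a library fileMapper exporting a function readMappedFile (both names are hypothetical) that maps the file and returns its contents as a rank-1 real tensor:

readMappedFile =
  LibraryFunctionLoad["fileMapper", "readMappedFile",
    {"UTF8String"},   (* path of the binary file *)
    {Real, 1}];       (* returned as a packed 1D array (MTensor) *)

data = readMappedFile["test.bin"];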

I think the two best approaches are either:

1) use Get on the *.mx file,

2) or read in the data and save it in some binary format for which you write LibraryLink code, and then read it via that. That, of course, has the disadvantage that you'd need to convert your MX files. But perhaps this is an option.

Generally speaking, Get with MX files is pretty fast.

Are you sure this is not a swapping problem?

Edit 1: You could also write an import converter; see tutorial/DevelopingAnImportConverter.
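A rough sketch of such a converter (the format name RaggedReal64 and the reader function are hypothetical; the reader returns a list of element rules, and "FunctionChannels" -> {"FileNames"} makes it receive a file name rather than a stream):

readRagged[filename_String, opts___] :=
  {"Data" -> BinaryReadList[filename, "Real64"]}

ImportExport`RegisterImport["RaggedReal64", readRagged,
  "FunctionChannels" -> {"FileNames"}]

Import["test.bin", {"RaggedReal64", "Data"}]   (* uses the converter *)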
