Vector push_back only if enough memory is available

Submitted by 落花浮王杯 on 2019-12-07 10:16:59

Question


I am currently writing code that deals with massive amounts of memory, using the vector class dynamically.

The code builds up the vector with push_back. It is important to note that the vector is two-dimensional, representing a data matrix. Depending on circumstances, this matrix can be small or become exceptionally large.

For instance, the data matrix can have a few rows with 1000 columns each, or it can get 1000 rows with the same number of columns, full of doubles. Obviously, this can very easily become a problem, because 1000 x 1000 x 8 = 8 000 000 bytes, i.e. 8 MB in memory. But what about 10 times more columns and 10 times more rows (which can easily happen in my code)?

I am solving this by writing the data matrix to the HDD, but this approach is rather slow because I am not using RAM to the fullest.

My question: how can I build this matrix, represented by vector< vector<double> >, using push_back, but only if there is enough memory left to allocate?

If the amount of memory is not sufficient, I will continue by exporting the data to a file on the HDD, freeing the allocated memory and starting the cycle over. What I don't know is how to check whether memory is available with every push_back executed.
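For illustration only, here is a minimal sketch of one way such a check could look on Linux: it reads MemAvailable from /proc/meminfo before each insertion and falls back to catching std::bad_alloc. The threshold value and the idea of clearing the vector in place of a real flush-to-disk step are assumptions, not part of the original question, and note that with Linux overcommit a bad_alloc may never actually be thrown.

    #include <cstdio>      // std::sscanf
    #include <fstream>
    #include <iostream>
    #include <new>         // std::bad_alloc
    #include <string>
    #include <vector>

    // Hypothetical helper: read "MemAvailable" (in kB) from /proc/meminfo (Linux only).
    long long mem_available_kb() {
        std::ifstream meminfo("/proc/meminfo");
        std::string line;
        while (std::getline(meminfo, line)) {
            long long kb = 0;
            if (std::sscanf(line.c_str(), "MemAvailable: %lld", &kb) == 1) return kb;
        }
        return 0;  // field not found (very old kernel)
    }

    int main() {
        std::vector<std::vector<double>> matrix;
        const long long kMinFreeKb = 512 * 1024;  // assumed safety margin: 512 MB

        for (int step = 0; step < 100000; ++step) {
            if (mem_available_kb() < kMinFreeKb) {
                // Not enough RAM left: write `matrix` to a file here (format-specific,
                // omitted), then release the memory and start filling it again.
                matrix.clear();
                matrix.shrink_to_fit();
            }
            try {
                matrix.push_back(std::vector<double>(1000, 0.0));  // one new row
            } catch (const std::bad_alloc&) {
                // Allocation itself failed; with overcommit enabled this branch is
                // rarely reached, so the MemAvailable check above is the practical guard.
                std::cerr << "Out of memory at step " << step << "\n";
                return 1;
            }
        }
    }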

Edit: I should have mentioned that I am using a 64-bit machine running Ubuntu. I am not quite sure how and whether OS paging is running, but what I am actually doing is numerical computation of particles in the presence of electric and magnetic fields. There can be 100 million particles moving over 1000 time steps, which is many GB of data. However, sometimes I run just a few hundred thousand particles for tests; these fit into RAM without a problem and speed up the computation. I am trying to create a somewhat generalized solution that checks whether there is enough RAM for another computation and, if not, moves the data into a file. Particles can be added to the system or flow out of it, so basically I have no idea how large the matrix will be at any given time. That is why I need the "okay, that's enough, move those data out of here so we can start over" method.


Answer 1:


Almost ALL alternatives to "I will push data to disk in my code" are better than that.

That's because the OS itself (if we're talking about reasonably modern OSes such as the Windows NT family and most variants of Unix, including Linux and MacOS X) has the ability to deal with virtual memory and swapping to disk, and it will do so in a more clever way than you are likely to come up with.

Further (as per Tony D's comment), using a memory-mapped file is a better method than manually reading/writing to a file. This won't work immediately with a std::vector or other standard collections, but it is probably a better choice than manually dealing with reading and writing files in your application: you simply say "Here's a file, please give me a pointer to a piece of memory that represents that file", and you use that pointer as if the file were loaded into memory. The OS takes care of which parts of the file are ACTUALLY physically present in memory at any given time, similar to swapping in and out when you allocate more memory than is present in the system.
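As a rough illustration (not from the original answer), a memory-mapped file of doubles on Linux could look roughly like the following, using the POSIX open/ftruncate/mmap calls; the file name, matrix dimensions and minimal error handling are placeholders:

    #include <fcntl.h>      // open
    #include <sys/mman.h>   // mmap, munmap
    #include <unistd.h>     // ftruncate, close
    #include <cstddef>
    #include <cstdio>

    int main() {
        const std::size_t rows = 1000, cols = 1000;            // assumed matrix size
        const std::size_t bytes = rows * cols * sizeof(double);

        int fd = open("matrix.bin", O_RDWR | O_CREAT, 0644);   // backing file (placeholder name)
        if (fd < 0) { perror("open"); return 1; }
        if (ftruncate(fd, bytes) != 0) { perror("ftruncate"); return 1; }

        // Ask the OS to map the file into our address space; it pages the data
        // in and out of RAM on demand.
        void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        double* m = static_cast<double*>(p);
        m[5 * cols + 7] = 3.14;   // use it like an in-memory 2-D array (row-major)

        munmap(p, bytes);
        close(fd);
    }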

However, there are of course limits to this (this applies both to "allocate more than there is RAM available for your app" and to the memory-mapped-file solution). If you are using a 32-bit machine (or a 32-bit OS or a 32-bit application), the maximum amount of memory available to your process will be somewhere between 2GB and 4GB - exactly what the limit is depends on the OS (a 64-bit OS with a 32-bit app may give you nearly 4GB, a regular setup of 32-bit Windows gives about 2GB total). So if your array gets big enough, there simply won't be "enough bits" in the address to keep track of it, at which point you need to split the work in some way. Or go to a 64-bit OS and application (and naturally a 64-bit processor is needed here), in which case the limit on memory size goes to 128 or 256TB (if my mental arithmetic works - 65536 * 4GB) in total - which is probably more than nearly everyone has as disk space, never mind RAM.

Edit:

Doing some math based on the data you've given: with each particle having an X, Y, Z position, a velocity, and "two other properties", it would take up 6 * 8 = 48 bytes per particle as double, or 6 * 4 = 24 bytes as float.

Multiply by 100M and we get 4.8GB for one set of data. Times 1000 time steps makes 4.8TB of data. That's a huge amount, even if you have a really large amount of memory. Using memory-mapped files is not really going to work to hold all this data in memory at once. If you have a machine with a decent amount of memory (16GB or so), keeping TWO sets in memory at a time would likely work. But you're still producing a lot of data that needs to be stored at some point, and that storing will most likely take most of the time. For a reasonably modern (single) hard disk, somewhere around 50-100MB/s would be a reasonable expectation. That can be improved with certain RAID configurations, but even then it's hundreds of megabytes per second, not many gigabytes per second. So storing 1TB (1000GB) at 100MB/s would take 10000s, or roughly three hours - about 15 hours for 4.8TB. That's JUST to store the data, with no calculation [although that is probably a minimal part]. Even if we divide the data set by 10, we are still at more than an hour; divide by 50 and we're down in the minutes range.
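For reference only, a tiny snippet reproducing the back-of-the-envelope arithmetic above, under the stated assumptions of 100 million particles, 48 bytes each, 1000 time steps and roughly 100MB/s of disk throughput:

    #include <cstdio>

    int main() {
        const double particles  = 100e6;   // 100 million particles (assumed)
        const double bytes_each = 48;      // 6 doubles per particle
        const double steps      = 1000;    // time steps
        const double disk_bps   = 100e6;   // ~100 MB/s single-disk write speed

        double one_set = particles * bytes_each;        // ~4.8e9 bytes  = 4.8 GB
        double total   = one_set * steps;               // ~4.8e12 bytes = 4.8 TB
        double hours   = total / disk_bps / 3600.0;     // ~13-15 hours at 100 MB/s

        std::printf("one set: %.1f GB, total: %.1f TB, write time: %.1f h\n",
                    one_set / 1e9, total / 1e12, hours);
    }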

No matter WHAT method you use, storing and retrieving such large data sets is time-consuming, to say the least. Memory-mapped files are the "least bad" option in many ways, because they copy the data a little less along the way. But it's still disk speed that will be the dominant factor in your calculation speed.



Source: https://stackoverflow.com/questions/24547294/vector-push-back-only-if-enough-memory-is-available
