Question
I have a C++ program that transposes a very large matrix. The matrix is too large to hold in memory, so I was writing each column to a separate temporary file, and then concatenating the temporary files once the whole matrix had been processed. However, I am now running up against the problem of having too many open temporary files (i.e. the OS doesn't allow me to open enough files at once). Is there a system-portable method for checking (and hopefully changing) the maximum number of allowed open files?
I realise I could close each temp file and reopen only when needed, but am worried about the performance impact of doing this.
My code works as follows (pseudocode - not guaranteed to work):
#include <cstdio>   // tmpnam, remove
#include <cstdlib>  // exit
#include <fstream>
#include <iostream>
#include <string>
#include <vector>
using namespace std;

// Minimal error helper so the example is self-contained.
void error(const string &msg) { cerr << msg; exit(1); }

int main()
{
    const unsigned int Ncol = 5000;  // For example - could be much bigger.
    const unsigned int Nrow = 50000; // For example - in reality much bigger.

    // Stage 1 - create temp files
    vector<ofstream *> tmp_files(Ncol); // Vector of temp file pointers.
    vector<string> tmp_filenames(Ncol); // Vector of temp file names.
    for (unsigned int ui = 0; ui < Ncol; ui++)
    {
        string filename(tmpnam(NULL)); // Get temp filename.
        ofstream *tmp_file = new ofstream(filename.c_str());
        if (!tmp_file->good())
            error("Could not open temp file.\n"); // Call error function
        (*tmp_file) << "Column" << ui;
        tmp_files[ui] = tmp_file;
        tmp_filenames[ui] = filename;
    }

    // Stage 2 - read input file and write each column to its temp file
    string input_filename = "input.txt"; // Placeholder; the real path comes from elsewhere.
    ifstream input_file(input_filename.c_str());
    if (!input_file.good())
        error("Could not open input file.\n");
    for (unsigned int s = 0; s < Nrow; s++)
    {
        int input_num;
        for (unsigned int ui = 0; ui < Ncol; ui++)
        {
            input_file >> input_num;
            ofstream *tmp_file = tmp_files[ui]; // Get temp file pointer
            (*tmp_file) << "\t" << input_num;   // Write entry to temp file.
        }
    }
    input_file.close();

    // Stage 3 - concatenate temp files into output file and clean up.
    ofstream output_file("out.txt");
    for (unsigned int ui = 0; ui < Ncol; ui++)
    {
        // Finish and close this temp file.
        ofstream *tmp_file = tmp_files[ui];
        (*tmp_file) << endl;
        tmp_file->close();
        delete tmp_file; // Free the stream allocated in Stage 1.
        // Read the single line back and append it as a row of the output.
        ifstream read_file(tmp_filenames[ui].c_str());
        if (!read_file.good())
            error("Could not open temp file for reading.\n"); // Call error function
        string tmp_line;
        getline(read_file, tmp_line);
        output_file << tmp_line << endl;
        read_file.close();
        // Delete temp file.
        remove(tmp_filenames[ui].c_str());
    }
    output_file.close();
    return 0;
}
Many thanks in advance!
Adam
Answer 1:
There are at least two limits:
- the operating system may impose a limit; in Unix (sh, bash, and similar shells), use ulimit to change the limit, within the bounds allowed by the sysadmin
- the C library implementation may have a limit as well; you'll probably need to recompile the library to change that
A better solution is to avoid having so many open files. In one of my own programs, I wrote a wrapper around the file abstraction (this was in Python, but the principle is the same in C), which keeps track of the current file position in each file, and opens/closes files as needed, keeping a pool of currently-open files.
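A minimal C++ sketch of such a pool (the class name PooledWriter, the FIFO eviction policy, and max_open are my own choices, not code from the original Python wrapper):

#include <fstream>
#include <list>
#include <string>
#include <vector>

// Hypothetical wrapper: keeps at most max_open streams actually open and
// transparently reopens a file in append mode when it is written to again.
class PooledWriter
{
public:
    PooledWriter(const std::vector<std::string> &names, std::size_t max_open)
        : names_(names), streams_(names.size()), max_open_(max_open) {}

    void write(std::size_t idx, const std::string &text)
    {
        if (streams_[idx] == 0)      // Not currently open:
        {
            evict_if_full();         // close the oldest stream if needed,
            streams_[idx] = new std::ofstream(names_[idx].c_str(),
                                              std::ios::app);
            open_.push_back(idx);    // then reopen this one for append.
        }
        (*streams_[idx]) << text;
    }

    ~PooledWriter()
    {
        for (std::size_t i = 0; i < streams_.size(); ++i)
            delete streams_[i];      // Deleting an ofstream closes it.
    }

private:
    void evict_if_full()
    {
        if (open_.size() < max_open_)
            return;
        std::size_t victim = open_.front(); // Simple FIFO eviction.
        open_.pop_front();
        delete streams_[victim];
        streams_[victim] = 0;
    }

    std::vector<std::string> names_;       // One name per logical file.
    std::vector<std::ofstream *> streams_; // Null when the file is closed.
    std::list<std::size_t> open_;          // Indices of open files, oldest first.
    std::size_t max_open_;
};

Append mode is enough here because each temp file is only ever written sequentially; a wrapper that also reads would additionally have to remember each file's offset and seek back to it on reopen, as described above.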
Answer 2:
There isn't a portable way to change the max number of open files. Limits like this tend to be imposed by the operating system and are therefore OS-specific.
Your best bet is to reduce the number of files you have open at any one time.
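For what it's worth, this is what checking and raising the limit looks like on POSIX systems specifically, using getrlimit/setrlimit with RLIMIT_NOFILE; it is exactly the kind of OS-specific code this answer warns about, and Windows would need a different mechanism entirely:

#include <cstdio>
#include <sys/resource.h> // POSIX only - not portable.

int main()
{
    struct rlimit lim;
    if (getrlimit(RLIMIT_NOFILE, &lim) != 0)
    {
        perror("getrlimit");
        return 1;
    }
    std::printf("soft limit: %llu, hard limit: %llu\n",
                (unsigned long long)lim.rlim_cur,
                (unsigned long long)lim.rlim_max);

    lim.rlim_cur = lim.rlim_max;             // Raise the soft limit up to the
    if (setrlimit(RLIMIT_NOFILE, &lim) != 0) // hard limit; only root can go higher.
    {
        perror("setrlimit");
        return 1;
    }
    return 0;
}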
Answer 3:
You could normalize the input file into a temporary file, such that each entry occupies the same number of characters. You might even consider saving that temporary file in binary (using 4/8 bytes per number instead of 1 byte per decimal digit). That way you can calculate the position of each entry in the file from its coordinates in the matrix. Then you can access specific entries by doing a std::istream::seekg, and you don't have to concern yourself with a limit on the number of open files.
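A minimal sketch of this scheme, assuming the normalized scratch file stores ints in row-major order and was opened in binary mode (read_column and the layout are illustrative assumptions, not code from the answer):

#include <fstream>
#include <vector>

// Fetch column `col` of an Nrow x Ncol matrix of ints stored row-major in
// one binary scratch file: the entry at (row, col) sits at a computable
// byte offset, so a single open stream plus seekg is all that is needed.
std::vector<int> read_column(std::ifstream &scratch,
                             unsigned int Nrow, unsigned int Ncol,
                             unsigned int col)
{
    std::vector<int> column(Nrow);
    for (unsigned int row = 0; row < Nrow; ++row)
    {
        std::streamoff entry = static_cast<std::streamoff>(row) * Ncol + col;
        scratch.seekg(entry * static_cast<std::streamoff>(sizeof(int)));
        scratch.read(reinterpret_cast<char *>(&column[row]), sizeof(int));
    }
    return column;
}

Writing the transpose is then one read_column call per column, each emitted as a row of the output file; only two files are ever open at once.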
Answer 4:
How about just making 1 big file instead of many small temp files? Seek is a cheap operation. And your columns should all be the same size anyway. You should be able to position your file pointer right where you need it to access the column.
// Something like... (assuming the big file is binary and column-major,
// so each column of Nrows doubles is stored contiguously)
std::streamoff column_position =
    static_cast<std::streamoff>(sizeof(double)) * Nrows * column;
is.seekg(column_position);
std::vector<double> col(Nrows); // No variable-length arrays in C++.
is.read(reinterpret_cast<char *>(&col[0]), Nrows * sizeof(double));
Answer 5:
"The matrix is too large to hold in memory". It's very likely that the matrix will fit in your address space, though. (If the matrix doesn't fit in 2^64 bytes, you'll need a very impressive file system to hold all those temporary files.) So, don't worry about temporary files. Let the OS handle how swap to disk works. You just need to make sure that you access memory in a way that's swap-friendly. In practice, that means you need to have some locality of reference. But with 16 GB of RAM, you can have ~4 million pages of RAM mapped in. If your number of columsn is significantly smaller than that, there should be no problem.
(Don't use 32 bit systems for this; it's just not worth the pain)
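A minimal sketch of that in-memory approach, assuming a 64-bit build with enough address space and swap (the sizes and file names are placeholders):

#include <cstddef>
#include <fstream>
#include <vector>

int main()
{
    const std::size_t Nrow = 50000, Ncol = 5000; // Placeholder sizes.

    // One big allocation; the OS pages it to swap as needed.
    std::vector<int> m(Nrow * Ncol);

    std::ifstream in("input.txt");              // Placeholder file name.
    for (std::size_t r = 0; r < Nrow; ++r)      // Row-major fill:
        for (std::size_t c = 0; c < Ncol; ++c)  // sequential, swap-friendly.
            in >> m[r * Ncol + c];

    std::ofstream out("out.txt");
    for (std::size_t c = 0; c < Ncol; ++c)      // Transposed output:
    {
        for (std::size_t r = 0; r < Nrow; ++r)  // strided reads - the part
            out << m[r * Ncol + c] << '\t';     // where locality suffers.
        out << '\n';
    }
    return 0;
}

If the strided output pass thrashes, processing the matrix in tiles that fit in RAM restores the locality of reference this answer calls for.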
Source: https://stackoverflow.com/questions/6059919/c-c-system-portable-way-to-change-maximum-number-of-open-files