Question
I have a program which needs to analyze 100,000 files spread over multiple filesystems.
After processing around 3,000 files it starts to slow down. I ran it through gprof, but since the slowdown doesn't kick in until 30-60 seconds into the analysis, I don't think it tells me much.
How would I track down the cause? top doesn't show high CPU usage, and the process memory does not increase over time, so is it I/O?
At the top level, we have:
scanner.init(); // build a std::vector<std::string> of pathnames.
scanner.scan(); // analyze those files
Now, init() completes in 1 second. It populates the vector with 70,000 actual filenames and 30,000 symbolic links.
scan() traverses the entries in the vector, looks at the file names, reads the contents (say 1KB of text), and builds a "segment list" [1]
I've read conflicting views on the evils of using std::string, especially about passing strings as arguments. All the functions pass references for std::strings, structures, etc.
But the program does do a lot of string processing to parse filenames, extract substrings and search for substrings. (And if std::strings were evil, the program should be slow all the time, not only slow down after a while.)
Could that be the reason for the slowing down over time?
The algorithm is very straightforward and doesn't have any explicit new/delete operators...
Abbreviated, scan():
while (tsFile != mFileMap.end())
{
    curFileInfo.filePath = tsFile->second;
    mpUtils->parseDateTimeString(tsFile->first, curFileInfo.start);

    // Ignore files too small
    size_t fs = mpFileActions->fileSize(curFileInfo.filePath);
    mDvStorInfo.tsSizeBytes += fs;

    if (fileNum++ % 200 == 0)
    {
        usleep(LONGNAPUSEC); // long nap to give others a turn
    }

    // collect file information
    curFileInfo.locked = isLocked(curFileInfo.filePath);
    curFileInfo.sizeBytes = mpFileActions->fileSize(curFileInfo.filePath);
    getTsRateAndPktSize(curFileInfo.filePath, curFileInfo.rateBps, curFileInfo.pktSize);
    getServiceIdList(curFileInfo.filePath, curFileInfo.svcIdList);

    std::string fileBasePath;
    fileBasePath = mpUtils->strReplace(".ts", "", curFileInfo.filePath.c_str());
    fileBasePath = mpUtils->strReplace(".lockts", "", fileBasePath.c_str()); // chained replace

    // Extract the last part of the filename, ie. /mnt/das.b/20160327.104200.to.20160327.104400
    getFileEndTimeAndDuration(fileBasePath, curFileInfo);

    // Update machine info for both actual ts duration and span including gaps
    mDvStorInfo.tsDurationSec += curFileInfo.durSec;

    if (!firstTime)
    {
        // beef is here.
        if (hasGap(curFileInfo, prevFileInfo) ||
            lockChanged(curFileInfo, prevFileInfo) ||
            svcIdListChanged(curFileInfo, prevFileInfo) ||
            lastTsFile(tsFile))
        {
            // This current file differs from those before it so
            // close off previous segment and push to list
            curSegInfo.prevFileStart = curFileInfo.start;
            mSegmentList.push_back(curSegInfo);

            prevFileInfo = curFileInfo; // do this before resetting everything!

            // initialize the new segment
            resetSegmentInfo(curSegInfo);
            copyValues(curSegInfo, curFileInfo);
            resetFileInfo(curFileInfo);
        }
        else
        {
            // still running. Update current segment info
            curSegInfo.durSec += curFileInfo.durSec;
            curSegInfo.sizeBytes += curFileInfo.sizeBytes;
            curSegInfo.end = curFileInfo.end;
            curSegInfo.prevFileStart = prevFileInfo.start;
            prevFileInfo = curFileInfo;
        }
    }
    else // first time
    {
        firstTime = false;
        prevFileInfo = curFileInfo;
        copyValues(curSegInfo, curFileInfo);
        resetFileInfo(curFileInfo);
    }

    ++tsFile;
}
where:
- curFileInfo/prevFileInfo is a plain struct. The other functions do string processing, returning &references to std::strings.
- fileSize is calculated by calling stat().
- getServiceIdList opens the file with fopen(), reads each line and closes the file.
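For context, here's a minimal sketch of what those two helpers might look like; the signatures and the svcIdList element type are assumptions based on the description above, not the actual code:

#include <cstdio>
#include <string>
#include <vector>
#include <sys/stat.h>

// Hypothetical fileSize(): stat() the path and return its size in bytes.
size_t fileSize(const std::string &path)
{
    struct stat st;
    if (stat(path.c_str(), &st) != 0)
        return 0; // treat missing/unreadable files as empty
    return static_cast<size_t>(st.st_size);
}

// Hypothetical getServiceIdList(): fopen() the file, read it line by line,
// parse one id per line (format assumed here), then close it.
void getServiceIdList(const std::string &path, std::vector<int> &svcIdList)
{
    FILE *fp = std::fopen(path.c_str(), "r");
    if (!fp)
        return;
    char line[256];
    int id;
    while (std::fgets(line, sizeof(line), fp))
    {
        if (std::sscanf(line, "%d", &id) == 1)
            svcIdList.push_back(id);
    }
    std::fclose(fp);
}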
UPDATE
Removing the push_back to the container did not change the performance at all. However, rewriting to use C functions (e.g. strstr(), strcpy(), etc.) now shows constant performance.
The culprit was the std::strings: despite passing them by reference, I suspect there were simply too many construct/destroy/copy operations.
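For illustration, each mpUtils->strReplace() call in the loop builds and returns a fresh std::string, so every file processed costs several allocate/copy/free cycles. Stripping the suffix in place avoids those temporaries entirely. This is a sketch of the idea, not the actual strReplace():

#include <cstring>
#include <string>

// Strip a trailing suffix such as ".ts" or ".lockts" without creating
// any temporary std::string (illustrative helper, name made up here).
void stripSuffixInPlace(std::string &path, const char *suffix)
{
    const size_t sufLen = std::strlen(suffix);
    if (path.size() >= sufLen &&
        path.compare(path.size() - sufLen, sufLen, suffix) == 0)
    {
        path.erase(path.size() - sufLen); // edits the existing buffer
    }
}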
[1] The file names encode a YYYYMMDD.HHMMSS date/time, e.g. 20160612.093200. The purpose of the program is to look for time gaps within the names of the 70,000 files and build a list of contiguous time segments.
Answer 1:
This could be a heap fragmentation issue. Over time, the heap can turn into Swiss cheese, making it much harder for the memory manager to allocate blocks, and potentially forcing swapping even when there is free RAM, because there aren't any large-enough contiguous free blocks. Here's an MSDN article about heap fragmentation.
You mentioned using std::vector, which guarantees contiguous memory and can therefore be a major culprit in heap fragmentation, as it must free and reallocate each time the collection grows beyond a boundary. If you don't require the contiguity guarantee, you might try a different container.
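If the vector itself turns out to matter, one low-cost mitigation (a sketch; whether it helps depends on the real allocation pattern) is to reserve the expected capacity once, so the vector never has to free and reallocate while init() fills it:

#include <string>
#include <vector>

int main()
{
    std::vector<std::string> pathnames;
    // ~100,000 entries are expected (70,000 files + 30,000 symlinks),
    // so reserve once up front; subsequent push_back calls then never
    // trigger the grow/copy/free cycles that can fragment the heap.
    pathnames.reserve(100000);
    // ... init() would push_back the pathnames here ...
    return 0;
}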
Answer 2:
the file names are named by YYYYMMDD.HHMMSS date/time, eg 20160612.093200. The purpose of the program is to look for time gaps within the names of the 70,000 files and build a list of contiguous time segments
Comparing strings is slow: O(N) in the string length. Comparing integers is fast: O(1). Rather than storing the filenames as strings, consider storing them as integers (or pairs of integers).
And I strongly suggest that you use hash maps, if possible. See std::unordered_set and std::unordered_map. These will greatly cut down on the number of comparisons.
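A sketch of that idea using the YYYYMMDD.HHMMSS names from the question; the key encoding and container choice here are just one possibility:

#include <cstdint>
#include <cstdio>
#include <string>
#include <unordered_map>

// Encode "YYYYMMDD.HHMMSS" as one 64-bit integer,
// e.g. "20160612.093200" -> 20160612093200; returns 0 on a bad name.
int64_t timestampKey(const std::string &name)
{
    long date = 0, time = 0;
    if (std::sscanf(name.c_str(), "%8ld.%6ld", &date, &time) != 2)
        return 0;
    return static_cast<int64_t>(date) * 1000000 + time;
}

int main()
{
    // Integer keys compare in O(1), and the hash map does no ordering
    // at all, unlike a std::map keyed by std::string.
    std::unordered_map<int64_t, std::string> files;
    files.emplace(timestampKey("20160612.093200"),
                  "/mnt/das.b/20160612.093200.ts");
    return 0;
}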
Removing the push_back to the container did not change the performance at all. However, rewriting to use C functions (eg. strstr(), strcpy() etc) now shows constant performance.
Also be aware that std::set<char*> sorts pointer addresses, not the strings that they point to.
And don't forget to std::move your strings to cut down on allocations.
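For example, when a string that was just built is only needed inside a container, moving it hands over the buffer instead of copying it (illustrative snippet):

#include <string>
#include <utility>
#include <vector>

int main()
{
    std::vector<std::string> paths;
    std::string p = "/mnt/das.b/20160327.104200.ts";
    // std::move transfers p's buffer into the vector instead of copying it;
    // p is left in a valid but unspecified state afterwards.
    paths.push_back(std::move(p));
    return 0;
}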
Source: https://stackoverflow.com/questions/37928455/c-slows-over-time-reading-70-000-files