Simplest way to read a CSV file mapped to memory?

没有蜡笔的小新 2020-12-11 09:49

When I read from files in C++(11), I map them into memory using:

boost::interprocess::file_mapping* fm = new file_mapping(path, boost::interprocess::read_only);
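
The first answer below refers to bytes and region, which presumably come from a mapped_region created over that file_mapping; a minimal sketch of the usual pattern (assuming the whole file is mapped read-only, keeping the raw pointers of the snippet above) is:

#include <boost/interprocess/file_mapping.hpp>
#include <boost/interprocess/mapped_region.hpp>

using namespace boost::interprocess;

file_mapping*  fm     = new file_mapping(path, read_only);
mapped_region* region = new mapped_region(*fm, read_only);   // maps the whole file
const char*    bytes  = static_cast<const char*>(region->get_address());
std::size_t    size   = region->get_size();                  // number of mapped bytes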


        
2 Answers
  • 2020-12-11 10:34

    Simply create an istringstream from your memory-mapped bytes and parse that using boost::tokenizer:

    #include <boost/tokenizer.hpp>
    #include <sstream>
    #include <string>
    #include <vector>

    // Wrap the mapped bytes in a string/istringstream, then tokenize line by line.
    const std::string stringBuffer(bytes, region->get_size());
    std::istringstream is(stringBuffer);
    typedef boost::tokenizer< boost::escaped_list_separator<char> > Tokenizer;
    std::string line;
    std::vector<std::string> parsed;
    while (std::getline(is, line))
    {
        Tokenizer tokenizer(line);
        parsed.assign(tokenizer.begin(), tokenizer.end());
        for (auto& column : parsed)
        {
            // use each parsed field here
        }
    }
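
    boost::escaped_list_separator defaults to CSV-style rules (comma separator, double-quote quoting, backslash escapes), so a quoted field containing a comma comes back as a single token. A quick illustration, reusing the Tokenizer typedef from above on a made-up input line:

    std::string sample = "one,\"two, with comma\",three";      // hypothetical input
    Tokenizer tok(sample);
    std::vector<std::string> fields(tok.begin(), tok.end());
    // fields == { "one", "two, with comma", "three" }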
    

    Note that on many systems memory mapping doesn't provide any speed benefit compared to a sequential read. In both cases you end up reading the data from disk page by page, probably with the same amount of read-ahead, so both the IO latency and the bandwidth are the same. Whether you have lots of memory or not won't make any difference either. Also, depending on the system, memory mapping, even read-only, can lead to surprising behaviours (e.g. reserving swap space) that sometimes keep people busy troubleshooting.
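
    For reference, a plain sequential version of the same loop (a sketch that skips the mapping and the extra string copy, reading path with a std::ifstream and reusing the Tokenizer typedef from above) looks like:

    #include <fstream>

    std::ifstream in(path);                    // ordinary buffered sequential read
    std::string line;
    std::vector<std::string> parsed;
    while (std::getline(in, line))
    {
        Tokenizer tokenizer(line);
        parsed.assign(tokenizer.begin(), tokenizer.end());
    }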

  • 2020-12-11 10:36

    Here's my take on "fast enough". It zips through 116 MiB of CSV (about 2.5 million lines [1]) in ~1 second.

    The result is then randomly accessible with zero copy, so there is no overhead (unless pages are swapped out).

    For comparison:

    • that's ~3x faster than a naive wc csv.txt on the same file
    • it's about as fast as the following perl one-liner (which lists the distinct field counts across all lines):

      perl -ne '$fields{scalar split /,/}++; END { map { print "$_\n" } keys %fields  }' csv.txt
      
    • it's only about 1.5x slower than LANG=C wc csv.txt, which avoids locale functionality

    Here's the parser in all its glory:

    #include <boost/spirit/include/qi.hpp>
    #include <boost/spirit/include/phoenix.hpp>
    #include <boost/utility/string_ref.hpp>
    #include <vector>

    namespace qi  = boost::spirit::qi;
    namespace phx = boost::phoenix;

    using CsvField = boost::string_ref;
    using CsvLine  = std::vector<CsvField>;
    using CsvFile  = std::vector<CsvLine>;  // keep it simple :)

    struct CsvParser : qi::grammar<char const*, CsvFile()> {
        CsvParser() : CsvParser::base_type(lines)
        {
            using namespace qi;
            using phx::construct; using phx::begin; using phx::size;

            field = raw [*~char_(",\r\n")]
                [ _val = construct<CsvField>(begin(_1), size(_1)) ]; // semantic action
            line  = field % ',';
            lines = line  % eol;
        }
      private:
        qi::rule<char const*, CsvField()> field;
        qi::rule<char const*, CsvLine()>  line;
        qi::rule<char const*, CsvFile()>  lines;
    };
    

    The only tricky thing (and the only optimization there) is the semantic action, which constructs a CsvField directly from the source iterator and the matched number of characters.
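
    Spelled out by hand (a toy sketch using the CsvField alias from above, with a made-up buffer standing in for the mapped file), that action amounts to:

    char const* buffer = "apple,banana\n";  // stand-in for the mapped file contents
    char const* first  = buffer;            // begin(_1): where the field match starts
    std::size_t n      = 5;                 // size(_1):  number of matched characters
    CsvField field(first, n);               // a view into buffer -- nothing is copied
    // field == "apple", valid only as long as the underlying buffer stays alive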

    Here's the main:

    #include <boost/iostreams/device/mapped_file.hpp>
    #include <iostream>

    int main()
    {
        // Map the file; csv.data()/csv.size() expose a read-only view of its bytes.
        boost::iostreams::mapped_file_source csv("csv.txt");

        CsvFile parsed;
        if (qi::parse(csv.data(), csv.data() + csv.size(), CsvParser(), parsed))
        {
            std::cout << (csv.size() >> 20) << " MiB parsed into " << parsed.size() << " lines of CSV field values\n";
        }
    }
    

    Printing

    116 MiB parsed into 2578421 lines of CSV values
    

    You can use the values just as you would a std::string:

    for (int i = 0; i < 10; ++i)
    {
        auto l     = rand() % parsed.size();
        auto& line = parsed[l];
        auto c     = rand() % line.size();
    
        std::cout << "Random field at L:" << l << "\t C:" << c << "\t" << line[c] << "\n";
    }
    

    This prints, e.g.:

    Random field at L:1979500    C:2    sateen's
    Random field at L:928192     C:1    sackcloth's
    Random field at L:1570275    C:4    accompanist's
    Random field at L:479916     C:2    apparel's
    Random field at L:767709     C:0    pinks
    Random field at L:1174430    C:4    axioms
    Random field at L:1209371    C:4    wants
    Random field at L:2183367    C:1    Klondikes
    Random field at L:2142220    C:1    Anthony
    Random field at L:1680066    C:2    pines
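
    Note that the fields are boost::string_ref views into the mapped file, so they remain valid only while csv stays mapped. When an owning copy of a field is needed, converting one is cheap (boost::string_ref provides to_string()):

    std::string owned = parsed[0][0].to_string();   // copy one field out of the mapping
    // equivalently: std::string(parsed[0][0].data(), parsed[0][0].size())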
    

    The fully working sample is here: Live On Coliru


    [1] I created the file by repeatedly appending the output of

    while read a && read b && read c && read d && read e
    do echo "$a,$b,$c,$d,$e"
    done < /etc/dictionaries-common/words
    

    to csv.txt, until it counted 2.5 million lines.
