C++: Fast way to read mapped file into a matrix

無奈伤痛 2020-12-14 13:51

I'm trying to read a mapped file into a matrix. The file is something like this:

name;phone;city\n
Luigi Rossi;02341567;Milan\n
Mario Bianchi;06567890;Rom


        
1 answer
  • 2020-12-14 14:21

    I have written up numerous examples of this, or of something similar, on SO.

    Let me list the most relevant:

    • I've done quite a few of these benchmarks. Yes, for sequential reading, read/scanf have a tiny edge (see e.g. scanf/iostreams and files vs. mappings, parsing floats, or read being slightly faster for a one-pass sequential read).

    • An interesting approach is to parse lazily (why copy the whole input into memory? What would be the point of memory mapping then?). The answer here shows this approach (emulating a multimap there):

      • Using boost::iostreams::mapped_file_source with std::multimap (approach #2)
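A minimal sketch of the lazy idea, not the linked answer's code: `Record` and `LazyRecordCursor` are hypothetical names, and `std::string_view` (C++17) stands in for `boost::string_ref`. The cursor walks a buffer that is already in memory, such as the mapped region, and splits a line into fields only when the caller asks for the next record, so nothing is copied or stored up front.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical record type: three views into the underlying buffer.
struct Record {
    std::string_view name, phone, city;
};

// Hypothetical forward cursor over ';'-separated records, one per line.
// It never copies field data; it only hands out views into `buffer`.
class LazyRecordCursor {
public:
    explicit LazyRecordCursor(std::string_view buffer) : rest_(buffer) {}

    // Fills `out` with the next non-blank line and returns true,
    // or returns false once the input is exhausted.
    bool next(Record& out) {
        while (!rest_.empty()) {
            std::string_view line = take_until(rest_, '\n');
            if (!line.empty() && line.back() == '\r') line.remove_suffix(1);
            if (line.empty()) continue;  // skip blank lines

            out.name  = take_until(line, ';');
            out.phone = take_until(line, ';');
            out.city  = line;  // whatever is left after the second ';'
            return true;
        }
        return false;
    }

private:
    // Returns the text before the first `delim` and advances `s` past it.
    // If `delim` is absent, returns all of `s` and leaves `s` empty.
    static std::string_view take_until(std::string_view& s, char delim) {
        std::size_t pos = s.find(delim);
        std::string_view head = s.substr(0, pos);
        s.remove_prefix(pos == std::string_view::npos ? s.size() : pos + 1);
        return head;
    }

    std::string_view rest_;  // unread tail of the buffer
};
```

In the mapped-file case the buffer would be built from the mapping's `data()` and `size()`. The header line comes back as the first record, so a caller would typically discard one `next()` result.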

    In all other cases, consider slamming a Spirit Qi job on it, potentially using boost::string_ref instead of vector<char> (unless the mapped file is not "const", of course).

    The string_ref is also shown in the last answer linked above. Another interesting example of this (with lazy conversion to unescaped string values) is here: How to parse mustache with Boost.Xpressive correctly?

    DEMO

    Here's that Qi job slammed on it:

    • it parses a 994 MiB file of ~32 million lines in 2.9s into a vector of

      struct Line {
          boost::string_ref name, city;
          long id;
      };
      
    • note that we parse the number, and store the strings by referring to their location in the memory map + length (string_ref)

    • it pretty-prints the data from 10 random lines
    • it can run as fast as 2.5s if you reserve 32 million elements in the vector up front; the program then does only a single memory allocation.
    • NOTE: on a 64-bit system, the in-memory representation grows larger than the input if the average line length is less than 40 bytes. This is because a string_ref is 16 bytes, so each Line takes 2 × 16 + 8 = 40 bytes.
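The 40-byte break-even can be checked with a few lines. This sketch assumes a typical 64-bit platform where a string view is a pointer plus a length (16 bytes); `LineView`, `parsed_bytes` and `representation_exceeds_input` are illustrative names, and `std::string_view` stands in for `boost::string_ref`.

```cpp
#include <cstddef>
#include <string_view>

// Stand-in for the Line struct: two views plus one integer.
struct LineView {
    std::string_view name, city;
    long id;
};

// Size of one parsed record on the current platform
// (40 bytes under the assumptions stated above).
constexpr std::size_t bytes_per_record = sizeof(LineView);

// Storage needed by a vector holding `line_count` parsed records.
constexpr std::size_t parsed_bytes(std::size_t line_count) {
    return line_count * bytes_per_record;
}

// True when the parsed representation is larger than the raw input,
// i.e. when the average line is shorter than one record.
constexpr bool representation_exceeds_input(std::size_t input_bytes,
                                            std::size_t line_count) {
    return parsed_bytes(line_count) > input_bytes;
}
```

With the figures quoted above (994 MiB, about 32 million lines) the average line is roughly 33 bytes, so under these assumptions the vector's storage comes out larger than the file itself.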

    Live On Coliru

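    #include <boost/fusion/adapted/struct.hpp>
    #include <boost/spirit/include/qi.hpp>
    #include <boost/iostreams/device/mapped_file.hpp>
    #include <boost/utility/string_ref.hpp>
    #include <cstdlib>   // rand
    #include <iostream>  // std::cout
    #include <iterator>  // std::distance
    #include <vector>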
    
    namespace qi = boost::spirit::qi;
    using sref   = boost::string_ref;
    
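    // Teach Qi to build a string_ref from the iterator pair that raw[] exposes,
    // so matched fields refer to the mapped memory instead of being copied.
    namespace boost { namespace spirit { namespace traits {
        template <typename It>
        struct assign_to_attribute_from_iterators<sref, It, void> {
            static void call(It f, It l, sref& attr) { attr = { f, size_t(std::distance(f,l)) }; }
        };
    } } }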
    
    struct Line {
        sref name, city;
        long id;
    };
    
    BOOST_FUSION_ADAPT_STRUCT(Line, (sref,name)(long,id)(sref,city))
    
    int main() {
        boost::iostreams::mapped_file_source mmap("input.txt");
    
        using namespace qi;
    
        std::vector<Line> parsed;
        parsed.reserve(32000000);
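        // Grammar: skip the header line, then parse "name;id;city" records separated by line ends.
        // qi::blank skips spaces and tabs but not newlines, so eol stays significant.
        if (phrase_parse(mmap.begin(), mmap.end(),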
                    omit[+graph] >> eol >>
                    (raw[*~char_(";\r\n")] >> ';' >> long_ >> ';' >> raw[*~char_(";\r\n")]) % eol,
                    qi::blank, parsed))
        {
            std::cout << "Parsed " << parsed.size() << " lines\n";
        } else {
            std::cout << "Failed after " << parsed.size() << " lines\n";
        }
    
        if (parsed.empty())
            return 1; // nothing to sample; avoids rand() % 0 below

        std::cout << "Printing 10 random items:\n";
        for(int i=0; i<10; ++i) {
            auto& line = parsed[rand() % parsed.size()];
            std::cout << "city: '" << line.city << "', id: " << line.id << ", name: '" << line.name << "'\n";
        }
    }
    

    With input generated by repeatedly running something like

    grep -v "'" /etc/dictionaries-common/words | sort -R | xargs -d\\n -n 3 | while read a b c; do echo "$a $b;$RANDOM;$c"; done
    

    The output is e.g.

    Parsed 31609499 lines
    Printing 10 random items:
    city: 'opted', id: 14614, name: 'baronets theosophy'
    city: 'denominated', id: 24260, name: 'insignia ophthalmic'
    city: 'mademoiselles', id: 10791, name: 'smelter orienting'
    city: 'ducked', id: 32155, name: 'encircled flippantly'
    city: 'garotte', id: 3080, name: 'keeling South'
    city: 'emirs', id: 14511, name: 'Aztecs vindicators'
    city: 'characteristically', id: 5473, name: 'constancy Troy'
    city: 'savvy', id: 3921, name: 'deafer terrifically'
    city: 'misfitted', id: 14617, name: 'Eliot chambray'
    city: 'faceless', id: 24481, name: 'shade forwent'
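
For comparison, here is a rough hand-written counterpart to the Qi grammar, assuming C++17. It is a sketch rather than a drop-in replacement: `ParsedLine` and `parse_lines` are hypothetical names, `std::string_view` stands in for `boost::string_ref`, `std::from_chars` replaces `long_`, and whitespace handling is stricter than with the `qi::blank` skipper (blanks around the number are rejected).

```cpp
#include <charconv>
#include <cstddef>
#include <string_view>
#include <system_error>
#include <vector>

struct ParsedLine {
    std::string_view name, city;  // views into the input buffer, no copies
    long id = 0;
};

// Parses "name;id;city" records, one per line, skipping the first (header) line.
// Returns false at the first malformed line; `out` keeps the lines parsed so far.
inline bool parse_lines(std::string_view input, std::vector<ParsedLine>& out) {
    // Returns the text before the first `delim` and advances `s` past it.
    auto cut = [](std::string_view& s, char delim) {
        std::size_t pos = s.find(delim);
        std::string_view head = s.substr(0, pos);
        s.remove_prefix(pos == std::string_view::npos ? s.size() : pos + 1);
        return head;
    };

    cut(input, '\n');  // discard the header line
    while (!input.empty()) {
        std::string_view line = cut(input, '\n');
        if (!line.empty() && line.back() == '\r') line.remove_suffix(1);
        if (line.empty()) continue;  // tolerate blank lines

        ParsedLine rec;
        rec.name = cut(line, ';');
        std::string_view id_text = cut(line, ';');
        rec.city = line;

        // The whole id field must be a valid base-10 integer.
        const char* id_end = id_text.data() + id_text.size();
        auto [ptr, ec] = std::from_chars(id_text.data(), id_end, rec.id);
        if (ec != std::errc() || ptr != id_end) return false;

        out.push_back(rec);
    }
    return true;
}
```

Calling `parse_lines({mmap.data(), mmap.size()}, lines)` with a `std::vector<ParsedLine> lines` would fill the vector with views into the mapping, so the mapping has to outlive the vector.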
    