I'm trying to read a mapped file into a matrix. The file is something like this:
name;phone;city\n
Luigi Rossi;02341567;Milan\n
Mario Bianchi;06567890;Rom
I have numerous examples doing this/similar written up on SO.
Let me list the most relevant:
I've done quite a few of these benchmarks. Yes, for sequential reading, read/scanf have a tiny edge (see e.g. scanf/iostreams and files vs. mappings, and parsing floats, or read being slightly faster for 1-pass sequential read).
An interesting approach is to do the parsing lazily (why copy the whole input into memory? What's the point of memory mapping then?). The answer here shows this approach (emulating a multimap there).
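To make the lazy idea concrete, here is a minimal standard-library sketch. It is not the code from the linked answer. The class name `LazyLines` and its `field` member are hypothetical, and `std::string_view` (C++17) stands in for a view over the mapped region. Construction only records where each line starts and how long it is; a line is split on `;` only when one of its fields is requested.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper: indexes line boundaries up front, splits fields on demand.
class LazyLines {
public:
    explicit LazyLines(std::string_view buf) : buf_(buf) {
        std::size_t pos = 0;
        while (pos < buf_.size()) {
            std::size_t nl = buf_.find('\n', pos);
            if (nl == std::string_view::npos) nl = buf_.size();
            lines_.push_back({pos, nl - pos});  // offset + length only, no copy
            pos = nl + 1;
        }
    }

    std::size_t size() const { return lines_.size(); }

    // Returns the n-th ';'-separated field of the given line as a view into
    // the buffer, or an empty view if the line has fewer fields.
    std::string_view field(std::size_t line, std::size_t n) const {
        assert(line < lines_.size());
        std::string_view s = buf_.substr(lines_[line].offset, lines_[line].length);
        for (;;) {
            std::size_t sep = s.find(';');
            if (n == 0) return s.substr(0, sep);        // npos means "to the end"
            if (sep == std::string_view::npos) return {};
            s.remove_prefix(sep + 1);
            --n;
        }
    }

private:
    struct Span { std::size_t offset, length; };
    std::string_view buf_;
    std::vector<Span> lines_;
};
```

With a mapped file, the view could be built from the mapping's data pointer and size, so for example `LazyLines(view).field(1, 0)` would yield the name on the first data line without touching any other line. The returned views are only valid while the mapping stays open.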
In all other cases, consider slamming a Spirit Qi job on it, potentially using boost::string_ref instead of vector<char> (unless the mapped file is not "const", of course).
The string_ref is also shown in the last answer linked before. Another interesting example of this (with lazy conversions to un-escaped string values) is here: How to parse mustache with Boost.Xpressive correctly?
Here's that Qi job slammed on it:
It parses a 994 MiB file of ~32 million lines in 2.9s into a vector of
struct Line {
    boost::string_ref name, city;
    long id;
};
Note that we parse the number, and store the strings by referring to their location in the memory map + length (string_ref).
string_ref is 16 bytes. Live On Coliru
#include <boost/fusion/adapted/struct.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/utility/string_ref.hpp>
#include <cstdlib>
#include <iostream>
#include <iterator>
#include <vector>

namespace qi = boost::spirit::qi;
using sref   = boost::string_ref;

// Let Qi assign the iterator pair produced by raw[] to a string_ref without copying
namespace boost { namespace spirit { namespace traits {
    template <typename It>
    struct assign_to_attribute_from_iterators<sref, It, void> {
        static void call(It f, It l, sref& attr) { attr = { f, size_t(std::distance(f,l)) }; }
    };
} } }

struct Line {
    sref name, city;
    long id;
};

// Adapted in the order the fields appear in the file: name;id;city
BOOST_FUSION_ADAPT_STRUCT(Line, (sref,name)(long,id)(sref,city))

int main() {
    boost::iostreams::mapped_file_source mmap("input.txt");

    using namespace qi;

    std::vector<Line> parsed;
    parsed.reserve(32000000);
    if (phrase_parse(mmap.begin(), mmap.end(),
                omit[+graph] >> eol >> // skip the header line
                (raw[*~char_(";\r\n")] >> ';' >> long_ >> ';' >> raw[*~char_(";\r\n")]) % eol,
                qi::blank, parsed))
    {
        std::cout << "Parsed " << parsed.size() << " lines\n";
    } else {
        std::cout << "Failed after " << parsed.size() << " lines\n";
    }

    if (parsed.empty())
        return 1; // nothing to sample from; also avoids rand() % 0 below

    std::cout << "Printing 10 random items:\n";
    for(int i=0; i<10; ++i) {
        auto& line = parsed[rand() % parsed.size()];
        std::cout << "city: '" << line.city << "', id: " << line.id << ", name: '" << line.name << "'\n";
    }
}
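For readers without Boost at hand, the zero-copy idea in the listing above (fields stored as pointer + length into the buffer, only the number converted) can be sketched with the standard library alone. This is an illustration under stated assumptions, not a replacement for the Qi grammar: `LineView`, `parse_record` and `parse_all` are hypothetical names, `std::string_view` stands in for `boost::string_ref`, `std::from_chars` replaces `long_`, and, unlike the grammar, it does not skip blanks and silently drops malformed lines.

```cpp
#include <cassert>
#include <charconv>
#include <cstddef>
#include <optional>
#include <string_view>
#include <system_error>
#include <vector>

// Mirrors the Line struct, with std::string_view instead of boost::string_ref.
struct LineView {
    std::string_view name, city;  // views into the input buffer, no copy
    long id;
};

// Parses one "name;id;city" record; returns nullopt if it is malformed.
std::optional<LineView> parse_record(std::string_view rec) {
    std::size_t a = rec.find(';');
    if (a == std::string_view::npos) return std::nullopt;
    std::size_t b = rec.find(';', a + 1);
    if (b == std::string_view::npos) return std::nullopt;
    assert(a < b);  // the second separator always follows the first

    LineView out{rec.substr(0, a), rec.substr(b + 1), 0};
    std::string_view num = rec.substr(a + 1, b - a - 1);
    auto [ptr, ec] = std::from_chars(num.data(), num.data() + num.size(), out.id);
    if (ec != std::errc{} || ptr != num.data() + num.size()) return std::nullopt;
    return out;
}

// Skips the header line, then parses every remaining line of the buffer.
std::vector<LineView> parse_all(std::string_view buf) {
    std::vector<LineView> result;
    bool header = true;
    while (!buf.empty()) {
        std::size_t nl = buf.find('\n');
        std::string_view line = buf.substr(0, nl);
        buf.remove_prefix(nl == std::string_view::npos ? buf.size() : nl + 1);
        if (!line.empty() && line.back() == '\r') line.remove_suffix(1);
        if (header) { header = false; continue; }
        if (auto rec = parse_record(line)) result.push_back(*rec);
    }
    return result;
}
```

The buffer passed to `parse_all` could be a view over the mapped file's data and size. As with string_ref, the resulting `name` and `city` views are only valid for as long as the mapping is kept open.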
With input generated like
do grep -v "'" /etc/dictionaries-common/words | sort -R | xargs -d\\n -n 3 | while read a b c; do echo "$a $b;$RANDOM;$c"; done
The output is e.g.
Parsed 31609499 lines
Printing 10 random items:
city: 'opted', id: 14614, name: 'baronets theosophy'
city: 'denominated', id: 24260, name: 'insignia ophthalmic'
city: 'mademoiselles', id: 10791, name: 'smelter orienting'
city: 'ducked', id: 32155, name: 'encircled flippantly'
city: 'garotte', id: 3080, name: 'keeling South'
city: 'emirs', id: 14511, name: 'Aztecs vindicators'
city: 'characteristically', id: 5473, name: 'constancy Troy'
city: 'savvy', id: 3921, name: 'deafer terrifically'
city: 'misfitted', id: 14617, name: 'Eliot chambray'
city: 'faceless', id: 24481, name: 'shade forwent'