Question
Say I have a file something like this:
*SP "<something>"
*VER "<something>"
*NAME_MAP
*1 abc
*2 def
...
...
*D_NET *1 <some_value>
*CONN
<whatever>
<whatever>
*CAP
*1:1 *2:2 <whatever_value>
*1:3 *2:4 <whatever_value>
*RES
<whatever>
<whatever>
Let me describe the file before I describe my problem. The file starts with some header notes. The NAME_MAP section holds the mapping between each name and the id given to it; that id is used everywhere later whenever the corresponding name needs to be referenced.
The D_NET section has 3 sub-sections: CONN, CAP and RES.
My need is to gather some data from this file, and the data I need is related to D_NET. From the line
*D_NET *1 <some_value>
I need the mapping of *1, which in this case would be abc.
The second thing I need is the CAP section of that D_NET: whatever is in the CAP section, I need it.
Finally, my data would look like a hash:
*1 -> *1, *2 (in terms of ids, just to make it clear)
abc -> abc, def (this is what I want)
I hope I am clear so far.
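For concreteness, I picture holding the result in something like the following (just a sketch; the type and its name are my own choice, nothing dictated by the file format):
#include <string>
#include <unordered_map>
#include <vector>

// One entry per *D_NET: the net's mapped name -> the mapped names of the
// node ids that show up in its *CAP sub-section.
using NetCaps = std::unordered_map<std::string, std::vector<std::string>>;

// For the sample above this would end up holding:  "abc" -> { "abc", "def" }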
Since the file is huge, multiple GBs in size, I have figured that the best way to read it is by mapping it into memory. I did that using mmap, just like this:
char* data = (char*)mmap(0, file.st_size, PROT_READ, MAP_PRIVATE, fileno(file), 0);
So the data pointed to by mmap is just a stream of characters, and I now need to extract the data described above from it.
To solve this, I think I could first use some tokenizer (boost/tokenizer?) to split on the newline character and then parse those lines to get the desired data. Do you agree with that approach? If not, what would you suggest instead?
How would you suggest doing it? I am open to any fast algorithm.
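For example, something along these lines is roughly what I have in mind (just a sketch, not code I have validated; for_each_line and handle_line are names I made up):
#include <cstring>      // std::memchr
#include <string_view>  // C++17

// 'data' and 'size' would come from the mmap call above; 'handle_line' is
// whatever per-line parsing callback I end up writing (made-up name).
template <class Handler>
void for_each_line(const char* data, std::size_t size, Handler handle_line) {
    const char* pos = data;
    const char* end = data + size;
    while (pos < end) {
        const char* nl = static_cast<const char*>(std::memchr(pos, '\n', end - pos));
        if (nl == nullptr) nl = end;                   // last line may lack '\n'
        handle_line(std::string_view(pos, nl - pos));  // a view into the mapping, no copy
        pos = nl + 1;
    }
}
Whether to do the splitting with boost::tokenizer or with a plain scan like this is exactly the kind of thing I would like advice on.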
Answer 1:
I got curious about the performance gain you hope for by using mmap, so I put together two tests, reading files from my media library (treating them as text files): one using the getline approach and one using mmap. The input was:
files: 2012
lines: 135371784
bytes: 33501265769 (31 GiB)
First a helper class used in both tests to read the list of files:
filelist.hpp
#pragma once
#include <fstream>
#include <iterator>
#include <string>
#include <vector>
class Filelist {
    std::vector<std::string> m_strings;
public:
    Filelist(const std::string& file) :
        m_strings()
    {
        std::ifstream is(file);
        for(std::string line; std::getline(is, line);) {
            m_strings.emplace_back(std::move(line));
        }
        /*
        std::copy(
            std::istream_iterator<std::string>(is),
            std::istream_iterator<std::string>(),
            std::back_inserter(m_strings)
        );
        */
    }
    operator std::vector<std::string> () { return m_strings; }
};
getline.cpp
#include <iostream>
#include <fstream>
#include <vector>
#include <string>
#include <iomanip>
#include "filelist.hpp"
int main(int argc, char* argv[]) {
    std::vector<std::string> args(argv+1, argv+argc);
    if(args.size()==0) {
        Filelist tmp("all_files");
        args = tmp;
    }
    unsigned long long total_lines=0;
    unsigned long long total_bytes=0;
    for(const auto& file : args) {
        std::ifstream is(file);
        if(is) {
            unsigned long long lco=0;
            unsigned long long bco=0;
            bool is_good=false;
            for(std::string line; std::getline(is, line); lco+=is_good) {
                is_good = is.good();
                bco += line.size() + is_good;
                // parse here
            }
            std::cout << std::setw(15) << lco << " " << file << "\n";
            total_lines += lco;
            total_bytes += bco;
        }
    }
    std::cout << "files processed: " << args.size() << "\n";
    std::cout << "lines processed: " << total_lines << "\n";
    std::cout << "bytes processed: " << total_bytes << "\n";
}
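The // parse here spot above is where the extraction described in the question would go. Purely as a sketch (the token handling is guessed from the sample in the question, not from a full SPEF grammar, and I have not run it against real data), a small per-line state machine called from that spot could look like this:
#include <string>
#include <unordered_map>
#include <vector>

// net id (e.g. "*1") -> node ids seen in that net's *CAP sub-section
std::unordered_map<std::string, std::vector<std::string>> net_caps;
std::string current_net;   // id from the most recent *D_NET line
bool in_cap = false;       // true while inside that net's *CAP block

void parse_line(const std::string& line) {
    if (line.rfind("*D_NET", 0) == 0) {            // "*D_NET *1 <value>"
        auto sp  = line.find(' ');
        auto sp2 = line.find(' ', sp + 1);
        current_net = line.substr(sp + 1, sp2 - sp - 1);
        in_cap = false;
    } else if (line.rfind("*CAP", 0) == 0) {
        in_cap = true;
    } else if (line.rfind("*RES", 0) == 0 || line.rfind("*CONN", 0) == 0) {
        in_cap = false;
    } else if (in_cap && !line.empty() && line[0] == '*') {
        // "*1:1 *2:2 <value>" -> record "*1" and "*2"
        auto colon  = line.find(':');
        net_caps[current_net].push_back(line.substr(0, colon));
        auto sp     = line.find(' ');
        auto colon2 = line.find(':', sp);
        net_caps[current_net].push_back(line.substr(sp + 1, colon2 - sp - 1));
    }
}
Looking the ids up in the *NAME_MAP table afterwards, and resetting the state between files, is left out to keep the sketch short; the point is just that the per-line work stays a cheap prefix check.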
getline result:
files processed: 2012
lines processed: 135371784
bytes processed: 33501265769
real 2m6.096s
user 0m23.586s
sys 0m20.560s
mmap.cpp
#include <iostream>
#include <fstream>
#include <vector>
#include <iomanip>
#include <stdexcept>
#include "filelist.hpp"
#include <sys/mman.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
class File {
    int m_fileno;
public:
    File(const std::string& filename) :
        m_fileno(open(filename.c_str(), O_RDONLY | O_CLOEXEC))
    {
        if(m_fileno==-1)
            throw std::runtime_error("could not open file");
    }
    File(const File&) = delete;
    File(File&& other) :
        m_fileno(other.m_fileno)
    {
        other.m_fileno = -1;
    }
    File& operator=(const File&) = delete;
    File& operator=(File&& other) {
        if(m_fileno!=-1) close(m_fileno);
        m_fileno = other.m_fileno;
        other.m_fileno = -1;
        return *this;
    }
    ~File() {
        if(m_fileno!=-1) close(m_fileno);
    }
    operator int () { return m_fileno; }
};
class Mmap {
    File m_file;
    struct stat m_statbuf;
    char* m_data;
    const char* m_end;
    struct stat pstat(int fd) {
        struct stat rv;
        if(fstat(fd, &rv)==-1)
            throw std::runtime_error("stat failed");
        return rv;
    }
public:
    Mmap(const Mmap&) = delete;
    Mmap(Mmap&& other) :
        m_file(std::move(other.m_file)),
        m_statbuf(std::move(other.m_statbuf)),
        m_data(other.m_data),
        m_end(other.m_end)
    {
        other.m_data = nullptr;
    }
    Mmap& operator=(const Mmap&) = delete;
    Mmap& operator=(Mmap&& other) {
        // release any mapping we already own before taking over other's
        if(m_data!=nullptr) munmap(m_data, m_statbuf.st_size);
        m_file = std::move(other.m_file);
        m_statbuf = std::move(other.m_statbuf);
        m_data = other.m_data;
        m_end = other.m_end;
        other.m_data = nullptr;
        return *this;
    }
    Mmap(const std::string& filename) :
        m_file(filename),
        m_statbuf(pstat(m_file)),
        m_data(reinterpret_cast<char*>(mmap(0, m_statbuf.st_size, PROT_READ, MAP_PRIVATE, m_file, 0))),
        m_end(nullptr)
    {
        if(m_data==MAP_FAILED)
            throw std::runtime_error("mmap failed");
        m_end = m_data+size();
    }
    ~Mmap() {
        if(m_data!=nullptr)
            munmap(m_data, m_statbuf.st_size);
    }
    inline size_t size() const { return m_statbuf.st_size; }
    operator const char* () { return m_data; }
    inline const char* cbegin() const { return m_data; }
    inline const char* cend() const { return m_end; }
    inline const char* begin() const { return cbegin(); }
    inline const char* end() const { return cend(); }
};
int main(int argc, char* argv[]) {
    std::vector<std::string> args(argv+1, argv+argc);
    if(args.size()==0) {
        Filelist tmp("all_files");
        args = tmp;
    }
    unsigned long long total_lines=0;
    unsigned long long total_bytes=0;
    for(const auto& file : args) {
        try {
            unsigned long long lco=0;
            unsigned long long bco=0;
            Mmap im(file);
            for(auto ch : im) {
                if(ch=='\n') ++lco;
                ++bco;
            }
            std::cout << std::setw(15) << lco << " " << file << "\n";
            total_lines += lco;
            total_bytes += bco;
        } catch(const std::exception& ex) {
            std::clog << "Exception: " << file << " " << ex.what() << "\n";
        }
    }
    std::cout << "files processed: " << args.size() << "\n";
    std::cout << "lines processed: " << total_lines << "\n";
    std::cout << "bytes processed: " << total_bytes << "\n";
}
mmap result:
files processed: 2012
lines processed: 135371784
bytes processed: 33501265769
real 2m8.289s
user 0m51.862s
sys 0m12.335s
I ran the tests right after each other like this:
% ./mmap
% time ./getline
% time ./mmap
... and they got very similar results. If I were in your shoes, I'd go for the simple getline solution first and try to get the logic in place with that mapping you've got going. If that later feels too slow, go for mmap if you can find some way to make it more effective than I did.
Disclaimer: I don't have much experience with mmap, so perhaps I've used it wrong to get the performance it can deliver when parsing through text files.
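One knob that might be worth trying if you go down the mmap route (I have not benchmarked it on this data set) is telling the kernel that the mapping will be read sequentially, so it can read ahead more aggressively. Something like the helper below, called right after the mapping is created in the Mmap constructor; advise_sequential is a hypothetical name, not part of the code above:
#include <sys/mman.h>

// Hint that 'data' (a mapping of 'length' bytes) will be read front to back.
// MADV_SEQUENTIAL is only a hint, so a failure here is safe to ignore.
static void advise_sequential(char* data, size_t length) {
    if (madvise(data, length, MADV_SEQUENTIAL) == -1) {
        // fall back silently to the kernel's default readahead
    }
}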
Update: I concatenated all the files into one 31 GiB file and ran the tests again. The result was a bit surprising and I feel that I'm missing something.
getline result:
files processed: 1
lines processed: 135371784
bytes processed: 33501265769
real 2m1.104s
user 0m22.274s
sys 0m19.860s
mmap result:
files processed: 1
lines processed: 135371784
bytes processed: 33501265769
real 2m22.500s
user 0m50.183s
sys 0m13.124s
Source: https://stackoverflow.com/questions/53608876/parse-stream-of-characters