Simple string parsing with C++

Posted by 这一生的挚爱 on 2019-12-17 15:18:45

Question


I've been using C++ for quite a long time now, but I still tend to fall back on scanf when I have to parse simple text files. For example, given a config like this (and assuming that the order of the fields can vary):

foo: [3 4 5]
baz: 3.0

I would write something like:

char line[SOME_SIZE];
while (fgets(line, SOME_SIZE, file)) {
    int x, y, z;
    if (3 == sscanf(line, "foo: [%d %d %d]", &x, &y, &z)) {
        continue;
    }
    float w;
    if (1 == sscanf(line, "baz: %f", &w)) {
        continue;
    }
}

What's the most concise way to achieve this in C++? Whenever I try I end up with a lot of scaffolding code.


Answer 1:


This is an attempt using only standard C++.

Most of the time I use a combination of std::istringstream and std::getline (which can also split on a delimiter) to get what I want. And if I can, I make my config files look like:

foo=1,2,3,4

which makes it easy.

The text file looks like this:

foo=1,2,3,4
bar=0

And you parse it like this:

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main()
{
    std::ifstream file( "sample.txt" );

    std::string line;
    while( std::getline( file, line ) )
    {
        std::istringstream iss( line );

        // Everything before '=' is the key.
        std::string result;
        if( std::getline( iss, result, '=' ) )
        {
            if( result == "foo" )
            {
                // The rest of the line is a comma-separated list.
                std::string token;
                while( std::getline( iss, token, ',' ) )
                {
                    std::cout << token << std::endl;
                }
            }
            if( result == "bar" )
            {
               //...
            }
        }
    }
}
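The same std::getline / std::istringstream combination can also cope with the question's original format (foo: [3 4 5] and baz: 3.0) without changing the file layout. The following is only a sketch under that assumption; the Config struct and the parse_config function are hypothetical names introduced for illustration, not part of the answer above.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical result type for this sketch.
struct Config {
    std::vector<int> foo;
    float baz = 0.0f;
};

// Parses lines such as "foo: [3 4 5]" and "baz: 3.0" in any order.
// Unknown keys are ignored; returns false if a known key has a malformed value.
bool parse_config(std::istream& in, Config& cfg)
{
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream iss(line);
        std::string key;
        if (!std::getline(iss, key, ':'))
            continue;                      // empty line

        if (key == "foo") {
            char bracket = 0;
            if (!(iss >> bracket) || bracket != '[')
                return false;
            cfg.foo.clear();
            int value;
            while (iss >> value)           // fails (without consuming) at ']'
                cfg.foo.push_back(value);
            iss.clear();                   // reset failbit so ']' can be read
            if (!(iss >> bracket) || bracket != ']')
                return false;
        } else if (key == "baz") {
            if (!(iss >> cfg.baz))
                return false;
        }
    }
    return true;
}
```

Passing a std::ifstream or a std::istringstream works equally well, since both are std::istream objects. Unlike the sscanf version, the number of integers inside the brackets is not fixed here.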



Answer 2:


The C++ String Toolkit Library (StrTk) has the following solution to your problem:

#include <string>
#include <deque>
#include "strtk.hpp"

int main()
{
   std::string file_name = "simple.txt";
   strtk::for_each_line(file_name,
                       [](const std::string& line)
                       {
                          std::deque<std::string> token_list;
                          strtk::parse(line,"[]: ",token_list);
                          if (token_list.empty()) return;

                          const std::string& key = token_list[0];

                          if (key == "foo")
                          {
                            //do 'foo' related thing with token_list[1] 
                            //and token_list[2]
                            return;
                          }

                          if (key == "bar")
                          {
                            //do 'bar' related thing with token_list[1]
                            return;
                          }

                       });

   return 0;
}

More examples can be found Here




Answer 3:


Boost.Spirit is not reserved for parsing complicated structures. It is quite good at micro-parsing too, and almost matches the compactness of the C + scanf snippet:

#include <boost/spirit/include/qi.hpp>
#include <string>
#include <sstream>

using namespace boost::spirit::qi;


int main()
{
   std::string text = "foo: [3 4 5]\nbaz: 3.0";
   std::istringstream iss(text);

   std::string line;
   while (std::getline(iss, line))
   {
      int x, y, z;
      if(phrase_parse(line.begin(), line.end(), "foo: [">> int_ >> int_ >> int_ >> "]", space, x, y, z))
         continue;
      float w;
      if(phrase_parse(line.begin(), line.end(), "baz: ">> float_, space , w))
         continue;
   }
}

Why they didn't add a "container" version is beyond me; it would be much more convenient if we could just write:

if(phrase_parse(line, "foo: [">> int_ >> int_ >> int_ >> "]", space, x, y, z))
   continue;

But it's true that:

  • It adds a lot of compile time overhead.
  • Error messages are brutal. If you make a small mistake with scanf, you just run your program and immediately get a segfault or an absurd parsed value. Make a small mistake with Spirit and you will get hopelessly gigantic error messages from the compiler, and it takes a LOT of practice with Boost.Spirit to understand them.

So ultimately, for simple parsing I use scanf like everyone else...




Answer 4:


Regular expressions can often be used for parsing strings. Use capture groups (parentheses) to get the various parts of the line being parsed.

For instance, to parse an expression like foo: [3 4 56], use the regular expression (.*): \[(\d+) (\d+) (\d+)\]. The first capture group will contain "foo", the second, third and fourth will contain the numbers 3, 4 and 56.
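As a rough illustration, here is how that expression might be used with std::regex from the C++11 standard library (the answer itself does not prescribe a library, so this choice is an assumption). The function name parse_triple is hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper for this sketch: on a full match, stores the first
// capture group in `key` and the three numbers in `values`.
bool parse_triple(const std::string& line, std::string& key, std::vector<int>& values)
{
    // A raw string literal avoids doubling every backslash.
    static const std::regex pattern(R"((.*): \[(\d+) (\d+) (\d+)\])");

    std::smatch m;
    if (!std::regex_match(line, m, pattern))
        return false;

    key = m[1].str();                      // group 1: text before ": ["
    values = { std::stoi(m[2].str()),      // groups 2..4: the digits
               std::stoi(m[3].str()),
               std::stoi(m[4].str()) };
    return true;
}
```

For the line foo: [3 4 56] this fills key with "foo" and values with 3, 4 and 56. One caveat: \d+ accepts arbitrarily long digit runs, and std::stoi throws std::out_of_range if a run does not fit in an int.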

If there are several possible string formats that need to be parsed, like in the example given by the OP, either apply separate regular expressions one by one and see which one matches, or write a single regular expression that matches all the possible variations, typically using the | (alternation) operator.
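The "one expression per format" variant could look roughly like the sketch below, again assuming std::regex. The Settings struct, the parse_line function and the exact pattern used for the floating-point value are illustrative assumptions, not something the answer specifies.

```cpp
#include <regex>
#include <string>

// Hypothetical result type for this sketch.
struct Settings {
    int x = 0, y = 0, z = 0;
    float w = 0.0f;
};

// Tries each pattern in turn; returns true if the line matched one of them.
bool parse_line(const std::string& line, Settings& s)
{
    static const std::regex foo_re(R"(foo: \[(\d+) (\d+) (\d+)\])");
    // Assumed format for baz: digits with an optional fractional part.
    static const std::regex baz_re(R"(baz: (\d+(?:\.\d+)?))");

    std::smatch m;
    if (std::regex_match(line, m, foo_re)) {
        s.x = std::stoi(m[1].str());
        s.y = std::stoi(m[2].str());
        s.z = std::stoi(m[3].str());
        return true;
    }
    if (std::regex_match(line, m, baz_re)) {
        s.w = std::stof(m[1].str());
        return true;
    }
    return false;                          // unknown or malformed line
}
```

Calling parse_line once per line handles the fields in whatever order they appear, which mirrors the structure of the sscanf loop in the question.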

Regular expressions are very flexible, so the expression can be extended to allow more variations, for instance, an arbitrary number of spaces and other whitespace after the : in the example. Or to only allow the numbers to contain a certain number of digits.
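Both extensions mentioned here can be expressed directly in the pattern. The sketch below assumes std::regex with its default ECMAScript grammar; the three-digit limit and the name is_valid_triple are arbitrary choices made for illustration.

```cpp
#include <regex>
#include <string>

// Accepts any amount of whitespace after the colon and around the numbers,
// and limits each number to at most three digits (an arbitrary example limit).
bool is_valid_triple(const std::string& line)
{
    static const std::regex pattern(
        R"(\w+:\s*\[\s*\d{1,3}\s+\d{1,3}\s+\d{1,3}\s*\])");
    return std::regex_match(line, pattern);
}
```

Here \s* allows zero or more whitespace characters, \s+ requires at least one, and \d{1,3} allows one to three digits.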

As an added bonus, regular expressions provide an implicit validation since they require a perfect match. For instance, if the number 56 in the example above was replaced with 56x, the match would fail. This can also simplify code: in the example above, the groups containing the numbers can be converted to integers after a successful match without checking again that they consist only of digits.
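Whether the match has to be "perfect" depends on which matching function is called. With std::regex (an assumption, since the answer names no specific API), std::regex_match requires the whole line to match, while std::regex_search accepts a match anywhere in the line. A small sketch with hypothetical function names:

```cpp
#include <regex>
#include <string>

// True only if the entire line has the form "<key>: [n n n]".
bool matches_whole_line(const std::string& line)
{
    static const std::regex pattern(R"((.*): \[(\d+) (\d+) (\d+)\])");
    return std::regex_match(line, pattern);
}

// True if the pattern occurs anywhere inside the line.
bool matches_somewhere(const std::string& line)
{
    static const std::regex pattern(R"((.*): \[(\d+) (\d+) (\d+)\])");
    return std::regex_search(line, pattern);
}
```

The line foo: [3 4 56x] is rejected by both functions. A line with trailing text after the closing bracket is rejected by matches_whole_line but accepted by matches_somewhere, so std::regex_match is the one to use when the expression doubles as validation.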

Regular expressions usually perform well, and there are many good libraries to choose from, for instance Boost.Regex.




Answer 5:


I think Boost.Spirit is a good way to describe a grammar right in your C++ code. It takes some time to get used to, but after that it is quite easy to use. It might not be as concise as you want, but I think it is a handy way of handling simple grammars. Its performance might be a problem, so it may not be a good choice in situations where you need speed.




Answer 6:


I feel your pain. I regularly deal with files that have fixed width fields (output via Fortran77 code), so it is always entertaining to attempt to load them with the minimum of fuss. Personally, I'd like to see boost::format supply a scanf implementation. But, barring implementing it myself, I do something similar to @Nikko using boost::tokenizer with offset separators and lexical cast for conversion. For example,

// Requires <boost/tokenizer.hpp>, <boost/lexical_cast.hpp>, <string> and <vector>.
typedef boost::token_iterator_generator< 
                                boost::char_separator<char> >::type tokenizer;

boost::char_separator<char> sep("=,");

std::string line;
std::getline( file_istream, line );
tokenizer tok = boost::make_token_iterator< std::string > (
                                line.begin(), line.end(), sep );

std::string var = *tok;  // check tok.at_end() before dereferencing
++tok;

std::vector< int > vals;
for(;!tok.at_end();++tok){
 // trimws is a user-supplied helper that strips surrounding whitespace
 vals.push_back( boost::lexical_cast< int >( trimws( *tok ) ) );
}

Note: boost::lexical_cast does not deal well with leading whitespace (it throws), so I recommend trimming the whitespace of anything you pass it.
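The snippet above calls trimws without defining it. One possible stand-in, using only the standard library, is sketched below; the set of characters treated as whitespace is an assumption, and the real helper may well differ.

```cpp
#include <string>

// Hypothetical stand-in for the trimws helper: returns a copy of `s`
// without leading and trailing whitespace.
std::string trimws(const std::string& s)
{
    const char* const ws = " \t\r\n\f\v";  // assumed whitespace set
    const std::string::size_type first = s.find_first_not_of(ws);
    if (first == std::string::npos)
        return std::string();              // empty or all whitespace
    const std::string::size_type last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}
```

With this in place, a token such as " 42 " becomes "42" before it reaches boost::lexical_cast. Whitespace inside the token is left untouched.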



Source: https://stackoverflow.com/questions/2880903/simple-string-parsing-with-c
