How to include C++ input stream delimiters in the resulting tokens

橙三吉。 submitted on 2021-02-09 07:10:27


The C++ standard library supports a few ways to introduce custom delimiters for input streams. As I understand it, the recommended way is to imbue the stream with a new locale that carries a custom ctype facet:

first way (inherited from a ctype specialization):

struct csv_whitespace : std::ctype<wchar_t>
{
    // std::ctype<char> is table-driven and has no virtual do_is,
    // so this override only works for wchar_t streams.
    bool do_is(mask m, char_type c) const override
    {
        if ((m & space) && c == L' ') {
            return false; // space will NOT be classified as whitespace
        }
        if ((m & space) && c == L',') {
            return true; // comma will be classified as whitespace
        }
        return ctype::do_is(m, c); // leave the rest to the parent class
    }
};
//  for the wcin stream:
std::wcin.imbue(std::locale(std::wcin.getloc(), new csv_whitespace));

second way (parameterized ctype specialization):

//  getting existing table for ctype<char> specialization
const auto temp = std::ctype<char>::classic_table();
//  create a copy of the table in vector container
std::vector<std::ctype<char>::mask> new_table_vector(temp, temp + std::ctype<char>::table_size);

//  add/remove stream separators using bitwise arithmetic.
//  use char-based indices because ascii codes here are equal to indices
new_table_vector[' '] ^= ctype_base::space;
new_table_vector['\t'] &= ~(ctype_base::space | ctype_base::cntrl);
new_table_vector[':'] |= ctype_base::space;
//  A ctype initialized with new_table_vector would delimit on '\n' and ':' but not ' ' or '\t'.

//  ....
//  usage of the table above (new_table_vector must outlive the facet).
cin.imbue(locale(cin.getloc(), new std::ctype<char>(new_table_vector.data())));
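
To see the table-based approach end to end, here is a minimal sketch. The names colon_delimited_ctype and split_on_colon_and_newline are hypothetical and not from the code above; the table is kept in a function-local static so that it outlives the facet. It also shows the limitation this question is about: the delimiters are consumed and never reach the tokens.

```cpp
#include <locale>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical facet: ':' counts as whitespace, ' ' and '\t' do not.
class colon_delimited_ctype : public std::ctype<char> {
public:
    colon_delimited_ctype() : std::ctype<char>(make_table()) {}

private:
    static const mask* make_table() {
        // Copy the classic table once and adjust the space bit per character.
        static const std::vector<mask> table = [] {
            std::vector<mask> t(classic_table(), classic_table() + table_size);
            t[' ']  &= ~space;  // blank is no longer a separator
            t['\t'] &= ~space;  // tab is no longer a separator
            t[':']  |= space;   // colon becomes a separator
            return t;
        }();
        return table.data();
    }
};

// Hypothetical helper: tokenizes text with operator>> under the custom facet.
std::vector<std::string> split_on_colon_and_newline(const std::string& text) {
    std::istringstream in(text);
    in.imbue(std::locale(in.getloc(), new colon_delimited_ctype));
    std::vector<std::string> tokens;
    for (std::string tok; in >> tok;) {
        tokens.push_back(tok);
    }
    return tokens;
}
```

Calling split_on_colon_and_newline("a b:c\td\ne") should yield the three tokens "a b", "c\td" and "e"; the ':' and '\n' characters themselves are dropped.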

But is there a way to include the delimiters in the resulting tokens? E.g.



& * %

are delimiters defined using one of the methods above, and the resulting strings would be:






So, as you can see, the delimiters are included in the resulting strings. The question is: how can an input stream be configured to do that (and is it possible at all)?

Thank you


The short answer is no: istreams do not provide an innate method for extracting and retaining separators. istreams provide the following extraction methods:

  • operator>> - skips the delimiters, so they never appear in the token
  • get - stops before the delimiter and does not extract it at all
  • getline - extracts the delimiter and discards it
  • read - doesn't respect delimiters
  • readsome - doesn't respect delimiters
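
The difference between getline and get can be checked with a small sketch. The two helper names below are hypothetical; each reads up to '&' and then reports the token together with the next character still waiting in the stream.

```cpp
#include <sstream>
#include <string>
#include <utility>

// Hypothetical helper: std::getline extracts '&' and throws it away,
// so the next character in the stream is the one after the delimiter.
std::pair<std::string, char> token_and_next_via_getline(const std::string& text) {
    std::istringstream in(text);
    std::string token;
    std::getline(in, token, '&');
    return {token, static_cast<char>(in.peek())};
}

// Hypothetical helper: the streambuf overload of get() stops before '&',
// so the delimiter is still the next character in the stream.
std::pair<std::string, char> token_and_next_via_get(const std::string& text) {
    std::istringstream in(text);
    std::ostringstream token;
    in.get(*token.rdbuf(), '&');
    return {token.str(), static_cast<char>(in.peek())};
}
```

For the input "aa&bb", both helpers return the token "aa", but the next character is 'b' after getline and '&' after get. In neither case does the delimiter end up inside the token.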

However, let's assume that you slurped your istream into a string foo; then you could use a regex like this to tokenize:

((?:^|[&*%])[^&*%]*)


This could be used with a regex_token_iterator like this:

const regex re{ "((?:^|[&*%])[^&*%]*)" };
const vector<string> bar{ sregex_token_iterator(cbegin(foo), cend(foo), re, 1), sregex_token_iterator() };
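
Putting the pieces together, the following is a minimal sketch of the whole pipeline, including the slurping step. The function name tokenize_keeping_delimiters is hypothetical; the regex and the token iterator usage are the ones shown above.

```cpp
#include <istream>
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: reads the whole stream into a string, then splits it
// so that every token after the first starts with the delimiter before it.
std::vector<std::string> tokenize_keeping_delimiters(std::istream& in) {
    // Slurp the stream contents into foo.
    std::ostringstream buffer;
    buffer << in.rdbuf();
    const std::string foo = buffer.str();

    // Each match is either the leading run of non-delimiters, or one of
    // '&', '*', '%' followed by the non-delimiters after it.
    static const std::regex re{"((?:^|[&*%])[^&*%]*)"};
    return std::vector<std::string>(
        std::sregex_token_iterator(foo.cbegin(), foo.cend(), re, 1),
        std::sregex_token_iterator());
}
```

Given a std::istringstream holding "aa&bb*cc%dd", this should produce "aa", "&bb", "*cc" and "%dd". Note that the approach needs the whole input in memory, so it is a workaround rather than a way to configure the stream itself.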


