Method for expanding a-z into abc…xyz form

孤街浪徒 2020-12-11 10:19

Hi :) What I'm trying to do is write a simple program that expands shorthand entries into their full form, for example:

a-z or 0-9 or a-b-c or a-z0-9

4 answers
  • 2020-12-11 10:38

    The existing function handles "a-z" and "0-9" sequences fine on their own, so the thing to explore is what happens where they meet. Trace your code (print each variable's value at each step; the output will be cluttered, so use line breaks). I believe you will find a logical short-circuit when the iteration moves, for example, from "current token is 'y' and next token is 'z'" to "current token is 'z' and next token is '0'". Look at the if() condition and you will find that it does not cover all possibilities: you are covered if you are within a<-->z, within 0<-->9, or exactly equal to '-', but you have not considered being at the end of one range (a-z or 0-9) with the next character at the start of another.
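
    A minimal sketch of that kind of trace, assuming nothing about the original function (which is not shown in the question). The names classify and trace_pairs are made up for illustration. It prints the current character and the class of the next one at each step, and flags the spot where a letter is directly followed by a digit or vice versa, which is the case the if() condition may be missing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

enum char_class { LETTER, DIGIT, DASH, OTHER };

/* Hypothetical helper: sorts a character into one of the classes the
   expansion cares about. The terminating '\0' lands in OTHER. */
static enum char_class classify(char c)
{
    if (c >= 'a' && c <= 'z') return LETTER;
    if (c >= '0' && c <= '9') return DIGIT;
    if (c == '-')             return DASH;
    return OTHER;
}

/* Prints one trace line per position and returns how many times a letter is
   directly followed by a digit, or a digit by a letter. */
static int trace_pairs(const char *s)
{
    static const char *names[] = { "letter", "digit", "dash", "other" };
    int crossings = 0;

    assert(s != NULL);

    for (size_t i = 0; s[i] != '\0'; i++) {
        enum char_class cur = classify(s[i]);
        enum char_class nxt = classify(s[i + 1]); /* s[i + 1] is at worst the terminator */
        int crossing = (cur == LETTER && nxt == DIGIT) ||
                       (cur == DIGIT && nxt == LETTER);

        printf("i=%zu cur='%c' (%s) next is %s%s\n",
               i, s[i], names[cur], names[nxt],
               crossing ? "  <-- one range ends, another starts" : "");
        crossings += crossing;
    }
    return crossings;
}
```

    Calling trace_pairs("a-z0-9") flags the step where the current character is 'z' and the next one is '0'.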

  • 2020-12-11 10:41

    Here is a C version (in about 38 effective lines) that satisfies the same test as my earlier C++ version.

    The full test program, including your test cases, mine and some torture tests, can be seen live at http://ideone.com/sXM7b#info_3915048

    Rationale

    I'm pretty sure I'm overstating the requirements, but

    • this should be an excellent example of how to do parsing in a robust fashion
      • use states in an explicit fashion
      • validate input (!)
        • this version doesn't assume a-c-b can't happen
        • It also doesn't choke or even fail on simple input like 'Hello World' (or (char*) 0)
    • it shows how you can avoid calling printf("%c", c) for each char without using extraneous functions.
    • I put in some comments to explain what happens and why, but overall you'll find that the code is much more legible anyway, by

      • staying away from too many short-named variables
      • avoiding complicated conditionals with opaque indexers
      • avoiding the whole string length business: we only need a maximum lookahead of 2 characters, and *it=='-' or predicate(*it) will just return false if it is the null character. Short-circuit evaluation prevents us from accessing past-the-end input characters
    • ONE caveat: I haven't implemented a proper check for output buffer overrun (the capacity is hardcoded at 2048 chars). I'll leave it as the proverbial exercise for the reader
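
    The caveat in the last bullet could be addressed along these lines. This is only a sketch: append_range is a hypothetical helper that is not part of the code below, and it assumes the caller tracks the current output length next to the buffer capacity instead of using a bare out pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: appends the characters lower..upper to dst, starting at
   offset *len, and keeps dst null-terminated. It never writes past dst[cap - 1].
   Returns 1 if the whole range fit, 0 if the output had to be truncated. */
static int append_range(char *dst, size_t cap, size_t *len, char lower, char upper)
{
    assert(dst != NULL && len != NULL && cap > 0 && *len < cap);

    /* int counter so that c++ cannot overflow when upper is the largest char value */
    for (int c = lower; c <= upper; c++) {
        if (*len + 1 >= cap) {      /* only room for the terminator is left */
            dst[*len] = '\0';
            return 0;
        }
        dst[(*len)++] = (char)c;
    }
    dst[*len] = '\0';
    return 1;
}
```

    In the range-dump branch of expand, a call like this would take the place of the unchecked for loop, and a return value of 0 would be the signal to stop or to grow the buffer.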

    Last but not least, the reason I did this:

    • It will allow me to compare raw performance of the C++ version and this C version, now that they perform equivalent functions. Right now, I fully expect the C version to outperform the C++ by some factor (let's guess: 4x?) but, again, let's just see what surprises the GNU compilers have in store for us. Update: it turns out I wasn't far off: github (code + results)

    Pure C Implementation

    Without further ado, the implementation, including the testcase:

    #include <stdlib.h>
    #include <stdio.h>
    #include <string.h>
    
    int alpha_range(char c) { return (c>='a') && (c<='z'); }
    int digit_range(char c) { return (c>='0') && (c<='9'); }
    
    char* expand(const char* s)
    {
        char buf[2048];
    
        const char* in  = s;
              char* out = buf;
    
        // parser state
        int (*predicate)(char) = 0; // either: NULL (free state), alpha_range (in alphabetic range), digit_range (in digit range)
        char lower=0,upper=0;       // tracks lower and upper bound of character ranges in the range parsing states
    
        // init
        *out = 0;
    
        while (in && *in) // the null check keeps expand((char*) 0) from dereferencing a null pointer
        {
            if (!predicate)
            {
                // free parsing state
                if (alpha_range(*in) && (in[1] == '-') && alpha_range(in[2]))
                {
                    lower = upper = *in++;
                    predicate = &alpha_range;
                }
                else if (digit_range(*in) && (in[1] == '-') && digit_range(in[2]))
                {
                    lower = upper = *in++;
                    predicate = &digit_range;
                } 
                else *out++ = *in;
            } else
            { 
                // in a range
                if (*in < lower) lower = *in;
                if (*in > upper) upper = *in;
    
                if (in[1] == '-' && predicate(in[2])) 
                    in++; // more coming
                else
                {
                    // end of range mode, dump expansion
                    char c;
                    for (c=lower; c<=upper; *out++ = c++);
                    predicate = 0;
                }
            }
            in++;
        }
    
        *out = 0; // null-terminate buf
        return strdup(buf);
    }
    
    void dotest(const char* const input)
    {
        char* ex = expand(input);
        printf("input : '%s'\noutput: '%s'\n\n", input, ex);
    
        if (ex)
            free(ex);
    }
    
    int main (int argc, char *argv[])
    {
        dotest("a-z or 0-9 or a-b-c or a-z0-9"); // from the original post
        dotest("This is some e-z test in 5-7 steps; this works: a-b-c. This works too: b-k-c-e. Likewise 8-4-6"); // from my C++ answer
        dotest("-x-s a-9 9- a-k-9 9-a-c-7-3"); // assorted torture tests
    
        return 0;
    }
    

    Test output:

    input : 'a-z or 0-9 or a-b-c or a-z0-9'
    output: 'abcdefghijklmnopqrstuvwxyz or 0123456789 or abc or abcdefghijklmnopqrstuvwxyz0123456789'
    
    input : 'This is some e-z test in 5-7 steps; this works: a-b-c. This works too: b-k-c-e. Likewise 8-4-6'
    output: 'This is some efghijklmnopqrstuvwxyz test in 567 steps; this works: abc. This works too: bcdefghijk. Likewise 45678'
    
    input : '-x-s a-9 9- a-k-9 9-a-c-7-3'
    output: '-stuvwx a-9 9- abcdefghijk-9 9-abc-34567'
    

  • 2020-12-11 10:49

    OK, I tested your program and it works for nearly every case. It correctly expands a-z and other expansions with only two letters/numbers, but it fails when more letters and numbers are chained together. The fix is easy: add a new char variable that keeps the last printed character, and skip the current character if it matches the last one. The a-z0-9 scenario didn't work because you wrote s[i] > '0' where s[i] >= '0' was needed. The code is:

    #include <stdio.h>
    #include <string.h>
    
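    void expand(char s[])
    {
        int i, g, n, c;
        int l = -1; // last printed character; starts at -1 so the first comparison below is well defined
        int len = strlen(s);

        n = c = 0;
        // i < len also guards the empty string, where s[1] would be out of bounds
        for (i = 1; i < len && ((s[i] >= '0' && s[i] <= '9') || (s[i] >= 'a' && s[i] <= 'z') || s[i] == '-'); i++)
        {
            c = s[i-1];
            g = s[i];
            n = s[i+1];
            //printf("\nc = %c g = %c n = %c\n", c,g,n);
            if (s[0] == '-')
                printf("%c", s[0]);
            else if (g == '-')
            {
                if (c < n)
                {
                    if (c != l)
                    {
                        // fresh range: print both endpoints and everything between
                        while (c <= n)
                        {
                            printf("%c", c);
                            c++;
                        }
                        l = c - 1;
                        //printf("\nl is %c\n", l);
                    }
                    else
                    {
                        // chained range (a-b-c): the start was already printed, so skip it
                        c++;
                        while (c <= n)
                        {
                            printf("%c", c);
                            c++;
                        }
                        l = c - 1;
                        //printf("\nl is %c\n", l);
                    }
                }
                else if (c == n)
                    printf("%c", g);
                else if (n != '-')
                    printf("%c", g);
                else if (c != '-')
                    printf("%c", g);
            }
            else if (g == n)
            {
                while (g == n)
                {
                    printf("%c", s[i]);
                    g++;
                }
            }
            else if (s[len] == '-') // s[len] is the terminating '\0', so this branch never runs
                printf("%c", s[len]);
        }
        printf("\n");
    }

    int main (int argc, char *argv[])
    {
        if (argc < 2) // without an argument, argv[1] is a null pointer
        {
            fprintf(stderr, "usage: %s string\n", argv[0]);
            return 1;
        }
        expand(argv[1]);
        return 0;
    }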
    

    Isn't this problem from K&R? I think I saw it there. Anyway I hope I helped.
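
    The last-printed-character idea from this answer can also be looked at in isolation. The following is a minimal sketch, not the code above: expand_chain is a hypothetical helper that only handles an ascending chain such as "a-b-c" and writes into a caller-supplied buffer instead of printing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: expands an ascending chain such as "a-b-c" into out and
   returns the number of characters written (excluding the terminator).
   Output that does not fit into cap bytes is silently truncated. */
static size_t expand_chain(const char *s, char *out, size_t cap)
{
    size_t len = 0;
    int last = -1; /* last character written; -1 means none yet */

    assert(s != NULL && out != NULL && cap > 0);

    /* Each step looks at one "x-y" segment; the && chain stops before any
       read past the terminator. */
    for (size_t i = 0; s[i] != '\0' && s[i + 1] == '-' && s[i + 2] != '\0'; i += 2) {
        for (int c = s[i]; c <= s[i + 2]; c++) {
            if (c == last)
                continue; /* shared endpoint, already written by the previous segment */
            if (len + 1 >= cap) {
                out[len] = '\0';
                return len;
            }
            out[len++] = (char)c;
            last = c;
        }
    }
    out[len] = '\0';
    return len;
}
```

    For "a-b-c" the second segment starts at 'b', which equals the last character written, so only 'c' is added and the result is "abc" instead of "abbc".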

  • 2020-12-11 10:55

    Just for fun, I decided to demonstrate to myself that C++ is really just as suited to this kind of thing.

    Test-first, please

    First, let me define the requirements a little more strictly: I assumed it needs to handle these cases:

    int main()
    {
        const std::string in("This is some e-z test in 5-7 steps; this works: a-b-c. This works too: b-k-c-e. Likewise 8-4-6");
        std::cout << "input : " << in         << std::endl;
        std::cout << "output: " << expand(in) << std::endl;
    }
    

    input : This is some e-z test in 5-7 steps; this works: a-b-c. This works too: b-k-c-e. Likewise 8-4-6
    output: This is some efghijklmnopqrstuvwxyz test in 567 steps; this works: abc. This works too: bcdefghijk. Likewise 45678

    C++0x Implementation

    Here is an implementation (actually a few variants) in 14 lines (23 including whitespace, comments) of C++0x code [1]

    static std::string expand(const std::string& in)
    {
        static const regex re(R"([a-z](?:-[a-z])+|[0-9](?:-[0-9])+)");
    
        std::string out;
    
        auto tail = in.begin();
        for (auto match : make_iterator_range(sregex_iterator(in.begin(), in.end(), re), sregex_iterator()))
        {
            out.append(tail, match[0].first);
    
            // char range bounds: the cost of accepting unordered ranges...
            char a=127, b=0;
            for (auto x=match[0].first; x<match[0].second; x+=2)
                { a = std::min(*x,a); b = std::max(*x,b); }
    
            for (char c=a; c<=b; out.push_back(c++));
            tail = match.suffix().first;
        }
        out.append(tail, in.end());
    
        return out;
    }
    

    Of course I'm cheating a little because I'm using regex iterators from Boost. I will do some timings to compare performance against the C version. I rather expect the C++ version to compete within a 50% margin. But let's see what kind of surprises the GNU compiler has in store for us :)
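
    The bounds scan in the implementation above (visit every other character of the match, track the minimum and maximum) does not depend on the regex machinery. Here is a minimal C sketch of just that step, assuming the match has already been located; chain_bounds is a name made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: scans a matched chain such as "b-k-c-e" of n characters
   (endpoints at even offsets, '-' at odd offsets) and stores the smallest and
   largest endpoint in *lower and *upper. */
static void chain_bounds(const char *match, size_t n, char *lower, char *upper)
{
    assert(match != NULL && n >= 1 && lower != NULL && upper != NULL);

    *lower = *upper = match[0];
    for (size_t i = 2; i < n; i += 2) { /* step over the dashes */
        if (match[i] < *lower) *lower = match[i];
        if (match[i] > *upper) *upper = match[i];
    }
}
```

    For "b-k-c-e" this yields 'b' and 'k', and for "8-4-6" it yields '4' and '8', which is why unordered chains still expand to a single ascending run.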

    Here is a complete program that demonstrates the sample input. It also contains some benchmark timings and a few variations that trade off:

    • functional flexibility
    • legibility / performance


    #include <algorithm> // std::min, std::max
    #include <iostream>  // std::cout
    #include <string>
    #include <set> // only needed for the 'slow variant'
    #include <boost/regex.hpp>
    #include <boost/range.hpp>
    
    using namespace boost;
    using namespace boost::range;
    
    static std::string expand(const std::string& in)
    {
    //  static const regex re(R"([a-z]-[a-z]|[0-9]-[0-9])"); // "a-c-d" --> "abc-d", "a-c-e-g" --> "abc-efg"
        static const regex re(R"([a-z](?:-[a-z])+|[0-9](?:-[0-9])+)");
    
        std::string out;
        out.reserve(in.size() + 12); // heuristic
    
        auto tail = in.begin();
        for (auto match : make_iterator_range(sregex_iterator(in.begin(), in.end(), re), sregex_iterator()))
        {
            out.append(tail, match[0].first);
    
            // char range bounds: the cost of accepting unordered ranges...
    #if !SIMPLE_BUT_SLOWER
            // debug 15.149s / release 8.258s (at 1024k iterations)
            char a=127, b=0;
            for (auto x=match[0].first; x<match[0].second; x+=2)
            { a = std::min(*x,a); b = std::max(*x,b); }
    
            for (char c=a; c<=b; out.push_back(c++));
    #else   // simpler but slower
            // debug 24.962s / release 10.270s (at 1024k iterations)
            std::set<char> bounds(match[0].first, match[0].second);
            bounds.erase('-');
            for (char c=*bounds.begin(); c<=*bounds.rbegin(); out.push_back(c++));
    #endif
            tail = match.suffix().first;
        }
        out.append(tail, in.end());
    
        return out;
    }
    
    int main()
    {
        const std::string in("This is some e-z test in 5-7 steps; this works: a-b-c. This works too: b-k-c-e. Likewise 8-4-6");
        std::cout << "input : " << in         << std::endl;
        std::cout << "output: " << expand(in) << std::endl;
    }
    

    [1] Compiled with g++-4.6 -std=c++0x
