Awk doesn't match all my entries

Question

I'm trying to write a "script" (essentially a single awk command) that extracts the function prototypes from a C source file, so that I can generate the .h header automatically. I'm new to awk, so I don't understand all the details.

This is a sample of the source .c :

dict_t dictup(dict_t d, const char * key, const char * newval)
{

  int i = dictlook(d, key);

  if (i == DICT_NOT_FOUND) {

    fprintf(stderr, "key \"%s\" doesn't exist.\n", key);
    dictdump(d);
  }
  else {

    strncpy(d.entry[i].val, newval, DICTENT_VALLENGTH);
  }

  return d;
}


dict_t* dictrm(dict_t* d, const char * key) {

  int i = dictlook(d, key);

  if (i == DICT_NOT_FOUND) {

    fprintf(stderr, "key \"%s\" doesn't exist.\n", key);
    dictdump(d);
  }
  else {
    d->entry[i] = d->entry[--d->size];
  }
  if ( ((float)d->size)/d->maxsize < 0.25 ) {
    d->maxsize /= 2; 
    d->entry = realloc(d->entry, d->maxsize*sizeof(dictent_t*));
  }

  return d;
}

And what I want to generate :

dict_t dictup(dict_t d, const char * key, const char *newval); 
dict_t* dictrm(dict_t* d, const char * key);

My command with the full regex looks like this :

 awk '/^[a-zA-Z*_]+[:space:]+[a-zA-Z*_]+[:space:]*\(.*?\)/{ print $0 }' dict3.c

But it prints nothing. So I shortened it, just to see whether I could get anything to match. I tried this:

awk '/^[a-zA-Z*_]+[:space:]+[a-zA-Z*_]+/{ print $0 }' dict3.c

And I get that :

dictent_t* dictentcreate(const char * key, const char * val) 
dict_t* dictcreate() 
dict_t* dictadd(dict_t* d, const char * key, const char * val) 
dict_t dictup(dict_t d, const char * key, const char * newval) 
dict_t* dictrm(dict_t* d, const char * key) {

This raises several questions:

Why doesn't the first regex work?
Why did the second one catch some of the declarations but not all? I assure you that there is no whitespace before any declaration. I guess it skipped other parts of the code, such as variable declarations, because of their indentation.
Third, why did it print the whole line when I only need the matched expression?
Finally, how can I add the ; at the end of each match?

Answer 1:

Note: the question has changed substantially since I wrote this answer.

Replace [:space:] with [[:space:]]:

$ awk '/^[a-zA-Z*_]+[[:space:]]+[a-zA-Z*_]+[[:space:]]*[(].*?[)]/{ print $0 }' dict3.c
dictent_t* dictentcreate(const char * key, const char * val)  
dict_t* dictcreate() 
void dictdestroy(*dict_t d) 
void dictdump(dict_t *d) 
int dictlook(dict_t *d, const char * key) 
int dictget(char* s, dict_t *d, const char *key)
dict_t* dictadd(dict_t* d, const char * key, const char * val)
dict_t dictup(dict_t d, const char * key, const char *newval) 
dict_t* dictrm(dict_t* d, const char * key)

The reason is that [:space:] will match any of the characters :, s, p, a, c, or e. This is not what you want.

You want [[:space:]] which will match any whitespace.
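
This also explains why the shortened regex in the question caught only some declarations. As a rough check (the two sample lines come from the output above; the variable names are made up for illustration), the unbracketed form can match dict_t* dictcreate() as "di" + "c" + "t_t*", because c is one of the listed characters. A line starting with void or int contains none of them before the first blank, so it is skipped.

```shell
# Two declarations from the question, one per line.
lines='dict_t* dictcreate()
void dictdump(dict_t *d)'

# [:space:] alone is just the set of characters ":space" (gawk may print a warning).
loose=$(printf '%s\n' "$lines" | awk '/^[a-zA-Z*_]+[:space:]+[a-zA-Z*_]+/' 2>/dev/null)

# [[:space:]] is the POSIX whitespace class used inside a bracket expression.
strict=$(printf '%s\n' "$lines" | awk '/^[a-zA-Z*_]+[[:space:]]+[a-zA-Z*_]+/')

printf 'unbracketed:\n%s\n' "$loose"
printf 'bracketed:\n%s\n' "$strict"
```

With an awk that supports POSIX character classes, the unbracketed pattern should select only the dictcreate line, while the bracketed one selects both.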

Sun/Solaris

The native Sun/Solaris awk is notoriously bug-filled. If you are on that platform, try nawk or /usr/xpg4/bin/awk or /usr/xpg6/bin/awk.

Using sed

A very similar approach can be used with sed. This uses a regex based on yours:

$ sed -n '/^[a-zA-Z_*]\+[ \t]\+[a-zA-Z*]\+ *[(]/p' dict3.c
dictent_t* dictentcreate(const char * key, const char * val)  
dict_t* dictcreate() 
void dictdestroy(*dict_t d) 
void dictdump(dict_t *d) 
int dictlook(dict_t *d, const char * key) 
int dictget(char* s, dict_t *d, const char *key)
dict_t* dictadd(dict_t* d, const char * key, const char * val)
dict_t dictup(dict_t d, const char * key, const char *newval) 
dict_t* dictrm(dict_t* d, const char * key)

The -n option tells sed not to print unless we explicitly ask it to. The construct /.../p tells sed to print the line if the regex inside the slashes is matched.
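
The same -n plus p combination also works with a substitution, which gives one way to trim the line and append the semicolon in sed. This is only a sketch: the three input lines are hypothetical, the pattern is a simplified POSIX BRE rather than the exact regex above, and it has the same limitations as any line-based approach.

```shell
# Hypothetical input: two definitions and one indented statement.
src='dict_t* dictcreate() {
  return d;
int dictlook(dict_t *d, const char * key)'

# -n suppresses the default output; the trailing p prints only lines where
# the substitution succeeded. \1 is everything up to the first ")".
protos=$(printf '%s\n' "$src" |
  sed -n 's/^\([a-zA-Z_][a-zA-Z0-9_]*[*[:space:]]\{1,\}[a-zA-Z_][a-zA-Z0-9_]*[[:space:]]*([^)]*)\).*/\1;/p')

printf '%s\n' "$protos"
```

The indented line is dropped because the pattern is anchored at column 0, and anything after the closing parenthesis (such as a trailing brace) is discarded.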

All the improvements to the regex suggested by Ed Morton apply here also.

Using perl

The above can also be adapted to perl:

perl -ne  'print if /^[a-zA-Z_*]+[ \t]+[a-zA-Z*]+ *[(]/' dict3.c

Answer 2:

The regexp you're trying to write would be:

$ awk '/^[[:alpha:]_][[:alnum:]_]*\**[[:space:]]+[[:alpha:]_][[:alnum:]_]*[[:space:]]*\([^)]*\)/' file
dictent_t* dictentcreate(const char * key, const char * val)
dict_t* dictcreate()
void dictdestroy(*dict_t d)
void dictdump(dict_t *d)
int dictlook(dict_t *d, const char * key)
int dictget(char* s, dict_t *d, const char *key)
dict_t* dictadd(dict_t* d, const char * key, const char * val)
dict_t dictup(dict_t d, const char * key, const char *newval)
dict_t* dictrm(dict_t* d, const char * key)

Written without character classes, and making assumptions about your locale, that would be:

$ awk '/^[a-zA-Z_][a-zA-Z0-9_]*\**[ \t]+[a-zA-Z_][a-zA-Z0-9_]*[ \t]*\([^)]*\)/' file
dictent_t* dictentcreate(const char * key, const char * val)
dict_t* dictcreate()
void dictdestroy(*dict_t d)
void dictdump(dict_t *d)
int dictlook(dict_t *d, const char * key)
int dictget(char* s, dict_t *d, const char *key)
dict_t* dictadd(dict_t* d, const char * key, const char * val)
dict_t dictup(dict_t d, const char * key, const char *newval)
dict_t* dictrm(dict_t* d, const char * key)

but:

Get/use an awk that has character classes because if it doesn't have that then who knows what else it's missing?
It's always trivial to write a script to find the strings you want but MUCH harder to NOT find the strings you DON'T want. For example, the above will match text inside comments and would fail given a declaration like int foo(int x /* always > 0 (I hope) */). When providing sample input/output you should always include some text that you think will be hard for a script to NOT select given it "looks" a lot like the text you do want to select but in the wrong context for your needs.
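
To make the comment problem concrete, here is a small sketch that prints only the text the regex actually matched for that declaration. The line is still selected, but [^)]* stops at the first ")", which is the one inside the comment, so the matched text is cut short.

```shell
# The declaration from the text, with a ")" inside its comment.
decl='int foo(int x /* always > 0 (I hope) */)'

# Print only the matched portion so the truncation is visible.
matched=$(printf '%s\n' "$decl" | awk '
  match($0, /^[[:alpha:]_][[:alnum:]_]*\**[[:space:]]+[[:alpha:]_][[:alnum:]_]*[[:space:]]*\([^)]*\)/) {
    print substr($0, RSTART, RLENGTH)
  }')

printf '%s\n' "$matched"
```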

Note that C symbols cannot start with a number and so the regexp to match one is not [[:alnum:]_]+ but is instead [[:alpha:]_][[:alnum:]_]*. Also functions can and often do return pointers to pointers to pointers and the * can be next to the function name instead of the function return type so you REALLY should be using a regexp like this (untested since you didn't provide input of the format that this would match) if your function declarations can be any of the normal formats:

awk '/^[[:alpha:]_][[:alnum:]_]*([[:space:]]+|[[:space:]]*(\*[[:space:]]*)+)[[:alpha:]_][[:alnum:]_]*[[:space:]]*\([^)]*\)/' file

That won't of course match declarations that span lines - that is a whole other can of worms.
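
A quick way to check a pattern like this is to feed it a few hand-written declarations covering the usual placements of the *. The sample lines below are hypothetical (they are not from dict3.c). The separator group in this sketch accepts either plain whitespace, or optional whitespace followed by one or more stars, so a bare call such as foo(); at column 0 is not selected.

```shell
# Hand-written test lines: four declaration layouts and one bare call.
samples='int foo(void)
int *bar(int n)
char** baz(char **v)
char ** qux(void)
foo();'

# Separator: whitespace only, or optional whitespace then one or more "*".
hits=$(printf '%s\n' "$samples" |
  awk '/^[[:alpha:]_][[:alnum:]_]*([[:space:]]+|[[:space:]]*(\*[[:space:]]*)+)[[:alpha:]_][[:alnum:]_]*[[:space:]]*\([^)]*\)/')

printf '%s\n' "$hits"
```

Only the first four lines should come back. Multi-word return types such as unsigned int or const char * are still not handled.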

In general you can't parse C without a C parser, but if you want something cheap and cheerful then at least run a C beautifier on the code first to try to get all the various possible layouts into one consistent format (search for "C beautifier"). You also need to strip out the comments (see for example https://stackoverflow.com/a/13062682/1745001).
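
As a very rough illustration of the comment-stripping step, the sketch below removes /* ... */ comments that open and close on the same line before applying the prototype regex. This is an assumption-laden shortcut, not the method from the linked answer: it ignores multi-line comments, // comments, and comment-like text inside string literals. The input line is hypothetical.

```shell
# Hypothetical definition with a same-line comment that contains ")".
decl='int foo(int x /* always > 0 (I hope) */) {'

proto=$(printf '%s\n' "$decl" | awk '
  {
    # Remove /* ... */ comments that start and end on this line.
    gsub("/[*][^*]*[*]+([^/*][^*]*[*]+)*/", "")
  }
  match($0, /^[[:alpha:]_][[:alnum:]_]*\**[[:space:]]+[[:alpha:]_][[:alnum:]_]*[[:space:]]*\([^)]*\)/) {
    print substr($0, RSTART, RLENGTH) ";"
  }')

printf '%s\n' "$proto"
```

With the comment gone, the first ")" is the real end of the parameter list, so the extracted prototype is complete (apart from the blank left where the comment was).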

Given your new requirements and your new sample input/output, this is what you are asking for:

$ awk 'match($0,/^[[:alpha:]_][[:alnum:]_]*\**[[:space:]]+[[:alpha:]_][[:alnum:]_]*[[:space:]]*\([^)]*\)/) { print substr($0,RSTART,RLENGTH) ";" }' file
dict_t dictup(dict_t d, const char * key, const char * newval);
dict_t* dictrm(dict_t* d, const char * key);
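
This works because match() records where the regex matched: RSTART is the 1-based start position and RLENGTH is the length of the match, so substr() can return just that part of the line instead of all of $0. A reduced sketch with a simpler, hypothetical pattern (only the parenthesised parameter list) shows the two variables in action:

```shell
# match() sets RSTART and RLENGTH for the leftmost-longest match of the regex.
out=$(printf '%s\n' 'dict_t* dictrm(dict_t* d, const char * key) {' | awk '
  match($0, /\([^)]*\)/) {
    print RSTART, RLENGTH
    # Keep everything up to the end of the match, then append ";".
    print substr($0, 1, RSTART + RLENGTH - 1) ";"
  }')

printf '%s\n' "$out"
```

The first output line is the position and length of the parameter list; the second is the line cut off after the closing parenthesis with the semicolon added.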

But again, this is by no means robust given the possible layouts of C code in general. To do this job robustly you need a C parser, a C beautifier, and/or a specialized tool (e.g. search for cscope).

Source: https://stackoverflow.com/questions/33134957/awk-doesnt-match-all-match-all-my-entries

Tags: regex, awk, header-files, text-extraction