Histogram generating function

六月ゝ 毕业季﹏ 提交于 2020-02-08 04:35:17

问题


i was assigned to make some changes to a C program written by someone else...i want to understand it first to work on it properly...i came upon a function that generates the histogram of ASCII values from a given long string of data. it is something like this.

//load the symbols the old data
  for(int k = 0;k < 256;++k)
  {
    sym[k].Symbol = k;
    sym[k].Count  = 0;
  }

  //Creating the probability distribution for each of the source symbols.
  for(int k = size;k;--k)
  {
    sym[*in ++].Count ++;
  }

here 'in' is the char array (string) containing the characters to be counted. sym is a struct variable. i can't quite understand how this works. can anyone tell me how how the second loop generated the count of the symbols 1 to 255 (ASCII) in the string?


回答1:


for(int k = 0; k < size; k++)
  {
    sym[in[k]].Count++;
  }

This is basically what that second loop is doing.

They just dereference and then move to the next ascii value in one step, and increment the counter for that ascii value.




回答2:


In a word, poorly. The basic idea is pretty simple, but the code is needlessly complex. In particular, his Symbol member is completely useless.

What you'd normally want to do is something like this:

int counts[UCHAR_MAX] = {0};

size_t len = strlen(input_string);
for (int i=0; i<len; i++)
    ++counts[unsigned char(input_string[i])];

So, the basic idea here is pretty simple: walk through the string, and for each item in the string, increment the count for that character.

He's doing pretty much the same thing, but keeping the Count as a member of a structure, along with the Symbol. Since the Symbol is always equal to the subscript of that item, storing it is pointless and wasteful.

Other than that, he's counting down in his loop -- probably a micro-optimization, because (at least on some machines) the zero flag will be set based on the value of the counter when it's decremented, so counting down to zero avoids a comparison in the loop. Given the amount he's wasting with his structure and unnecessarily storing the Symbol values, this makes no sense at all.

If you honestly cared about the code being close to optimum, you could write something more like this:

int counts[UCHAR_MAX] = {0}:

while (*in)
    ++counts[(unsigned char)*in++];

For anybody wondering about the cast, it's unnecessary if you're sure your input will always be true ASCII, that never has the high-bit set. Since you can rarely guarantee much about the input, however, it's generally safer to cast to unsigned char. Otherwise, a character with its top-bit set will typically be interpreted as a negative number, and index outside the array bound. Of course, it's possible for char to be unsigned by default, but it's pretty rare. On a typical (two's complement) machine, the cast doesn't require any extra operations; it just governs how the existing bit pattern will be interpreted.




回答3:


If 'in' is the input string then *in++ is taking each character in the string and looking up the entry in the ascii list sym[] corrsponding to that character value.

So if the string starts with 'A' then (*in) is 65 and it is referencing sym[65]

edit: sym[k].symbol is a bit redundant you can jsut have an array of 256 integers to represent the ascii chart, since sym[n] must be for symbol numbered 'n'




回答4:


in++ increments in, a pointer to the character being read.

*in++, parsed as *(in++), is the char currently read. It's also a number, and the algorithm takes advantage of this to use it as an index in an array. The appropriate count (the count of the character just read) sym[*in ++].Count is incremented.




回答5:


The second loop uses the value of the character pointed to by the pointer in to index the count array.

A really good way to investigate this code is to put a few printf statements around it. Print the value of *in, print the count after it is incremented. You will soon get the picture this way.

Another option is to run code that you don't understand through the debugger.




回答6:


something++ means "add 1 to something, and return its value before the addition".

in is a pointer to the first character of the input.

So, *in++ means "move the input pointer one item onwards, and return the item it was pointing to".

So you can see that

sym[*in ++].Count ++;

means "move the input pointer one item onwards, and increment the Count field of element in the array sym corresponding to the character that was at the current input pointer position the item it was pointing to";

and the enclosing loop does this size times, thus processing the input.



来源:https://stackoverflow.com/questions/1869062/histogram-generating-function

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!