char[] to hex string exercise

Submitted by 北慕城南 on 2020-01-11 05:17:47

Question


Below is my current char* to hex string function. I wrote it as an exercise in bit manipulation. It takes ~7ms on an AMD Athlon MP 2800+ to hexify a 10-million-byte array. Is there any trick or other approach that I am missing?

How can I make this faster?

Compiled with -O3 in g++

static const char _hex2asciiU_value[256][2] =
     { {'0','0'}, {'0','1'}, /* snip..., */ {'F','E'},{'F','F'} };

std::string char_to_hex( const unsigned char* _pArray, unsigned int _len )
{
    std::string str;
    str.resize(_len*2);
    char* pszHex = &str[0];
    const unsigned char* pEnd = _pArray + _len;

    clock_t stick, etick;
    stick = clock();
    for( const unsigned char* pChar = _pArray; pChar != pEnd; pChar++, pszHex += 2 ) {
        pszHex[0] = _hex2asciiU_value[*pChar][0];
        pszHex[1] = _hex2asciiU_value[*pChar][1];
    }
    etick = clock();

    std::cout << "ticks to hexify " << etick - stick << std::endl;

    return str;
}

Updates

Added timing code

Brian R. Bondy: replace the std::string with a heap-allocated buffer and change ofs*16 to ofs << 4. However, the heap-allocated buffer seems to slow it down; result ~11ms

Antti Sykäri: replace the inner loop with

 int upper = *pChar >> 4;
 int lower = *pChar & 0x0f;
 pszHex[0] = pHex[upper];
 pszHex[1] = pHex[lower];

result ~8ms

Robert: replace _hex2asciiU_value with a full 256-entry table, sacrificing memory space but result ~7ms!

HoyHoy: Noted it was producing incorrect results


Answer 1:


At the cost of more memory you can create a full 256-entry table of the hex codes:

static const char _hex2asciiU_value[256][2] =
    { {'0','0'}, {'0','1'}, /* ..., */ {'F','E'},{'F','F'} };

Then direct index into the table, no bit fiddling required.

const char *pHexVal = _hex2asciiU_value[*pChar];
pszHex[0] = pHexVal[0];
pszHex[1] = pHexVal[1];
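The table literal above is abbreviated, so here is a minimal sketch of one way to fill all 256 entries at first use instead of typing them out, followed by the direct-index conversion. The names HexTable, make_hex_table and to_hex_table are hypothetical and not from the original post.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Hypothetical type: one two-character entry per possible byte value.
using HexTable = std::array<std::array<char, 2>, 256>;

// Build the full table once from the 16 hex digits.
inline HexTable make_hex_table() {
    static const char digits[] = "0123456789ABCDEF";
    HexTable table{};
    for (std::size_t i = 0; i < 256; ++i) {
        table[i][0] = digits[i >> 4];    // high nibble
        table[i][1] = digits[i & 0x0F];  // low nibble
    }
    return table;
}

// Convert len bytes to uppercase hex by indexing the table directly.
inline std::string to_hex_table(const unsigned char* data, std::size_t len) {
    static const HexTable table = make_hex_table();
    std::string out(len * 2, '\0');
    for (std::size_t i = 0; i < len; ++i) {
        out[2 * i]     = table[data[i]][0];
        out[2 * i + 1] = table[data[i]][1];
    }
    return out;
}
```

The table costs 512 bytes, which is the memory trade-off mentioned above.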



Answer 2:


This assembly function (based off my previous post here, but I had to modify the concept a bit to get it to actually work) processes 3.3 billion input characters per second (6.6 billion output characters) on one core of a 3 GHz Core 2 Conroe. Penryn is probably faster.

%include "x86inc.asm"

SECTION_RODATA
pb_f0: times 16 db 0xf0
pb_0f: times 16 db 0x0f
pb_hex: db 48,49,50,51,52,53,54,55,56,57,65,66,67,68,69,70

SECTION .text

; int convert_string_to_hex( char *input, char *output, int len )

cglobal _convert_string_to_hex,3,3
    movdqa xmm6, [pb_f0 GLOBAL]
    movdqa xmm7, [pb_0f GLOBAL]
.loop:
    movdqa xmm5, [pb_hex GLOBAL]
    movdqa xmm4, [pb_hex GLOBAL]
    movq   xmm0, [r0+r2-8]
    movq   xmm2, [r0+r2-16]
    movq   xmm1, xmm0
    movq   xmm3, xmm2
    pand   xmm0, xmm6 ;high bits
    pand   xmm2, xmm6
    psrlq  xmm0, 4
    psrlq  xmm2, 4
    pand   xmm1, xmm7 ;low bits
    pand   xmm3, xmm7
    punpcklbw xmm0, xmm1
    punpcklbw xmm2, xmm3
    pshufb xmm4, xmm0
    pshufb xmm5, xmm2
    movdqa [r1+r2*2-16], xmm4
    movdqa [r1+r2*2-32], xmm5
    sub r2, 16
    jg .loop
    REP_RET

Note that it uses x264 assembly syntax, which makes it more portable (32-bit vs 64-bit, etc.). Converting it into the syntax of your choice is trivial: r0, r1, r2 are the three arguments to the function, in registers. It's a bit like pseudocode. Or you can just get common/x86/x86inc.asm from the x264 tree and include that to run it natively.

P.S. Stack Overflow, am I wrong for wasting time on such a trivial thing? Or is this awesome?




Answer 3:


Faster C Implementation

This runs nearly 3x faster than the C++ implementation. I'm not sure why, as it's pretty similar. The last C++ implementation that I posted took 6.8 seconds to run through a 200,000,000-character array; this C implementation took only 2.2 seconds.

#include <stdio.h>
#include <stdlib.h>

char* char_to_hex(const unsigned char* p_array, 
                  unsigned int p_array_len,
                  char** hex2ascii)
{
    unsigned char* str = malloc(p_array_len*2+1);
    const unsigned char* p_end = p_array + p_array_len;
    size_t pos=0;
    const unsigned char* p;
    for( p = p_array; p != p_end; p++, pos+=2 ) {
       str[pos] = hex2ascii[*p][0];
       str[pos+1] = hex2ascii[*p][1];
    }
    str[pos] = '\0';  /* terminate so the result can be printed with %s */
    return (char*)str;
}

int main()
{
  size_t hex2ascii_len = 256;
  char** hex2ascii;
  int i;
  hex2ascii = malloc(hex2ascii_len*sizeof(char*));
  for(i=0; i<hex2ascii_len; i++) {
    hex2ascii[i] = malloc(3*sizeof(char));    
    snprintf(hex2ascii[i], 3,"%02X", i);
  }
  size_t len = 8;
  const unsigned char a[] = "DO NOT WANT";
  printf("%s\n", char_to_hex((const unsigned char*)a, len, (char**)hex2ascii));
}




Answer 4:


Operate on 32 bits at a time (4 chars), then deal with the tail if needed. When I did this exercise with URL encoding, a full table lookup for each char was slightly faster than logic constructs, so you may want to test this in context as well to take caching issues into account.
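A minimal sketch of the "four chars per step, then the tail" idea, under some assumptions: the function name to_hex_words is hypothetical, and the 32-bit word is assembled from individual bytes so the result does not depend on endianness (many compilers can merge that into a single load on little-endian targets, but that is not guaranteed).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical name: handles four input bytes per iteration, then the tail.
std::string to_hex_words(const unsigned char* data, std::size_t len) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out(len * 2, '\0');
    char* dst = &out[0];
    std::size_t i = 0;

    for (; i + 4 <= len; i += 4) {
        // Assemble a 32-bit word; input byte k sits in bits 8k..8k+7.
        const std::uint32_t word =
            static_cast<std::uint32_t>(data[i]) |
            (static_cast<std::uint32_t>(data[i + 1]) << 8) |
            (static_cast<std::uint32_t>(data[i + 2]) << 16) |
            (static_cast<std::uint32_t>(data[i + 3]) << 24);
        for (int k = 0; k < 4; ++k) {
            const std::uint32_t byte = (word >> (8 * k)) & 0xFFu;
            *dst++ = digits[byte >> 4];
            *dst++ = digits[byte & 0x0Fu];
        }
    }
    // Tail: the 0 to 3 bytes that do not fill a whole word.
    for (; i < len; ++i) {
        *dst++ = digits[data[i] >> 4];
        *dst++ = digits[data[i] & 0x0F];
    }
    return out;
}
```

Whether this beats the byte-at-a-time table lookup depends on the compiler and CPU, so it is worth timing both.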




Answer 5:


It works for me with unsigned char:

unsigned char  c1 =  byteVal >> 4;
unsigned char  c2 =  byteVal & 0x0f;

c1 +=  c1 <= 9 ? '0' : ('a' - 10);
c2 +=  c2 <= 9 ? '0' : ('a' - 10);

std::string sHex("  ");
sHex[0] = c1 ;
sHex[1] = c2 ;


// sHex now contains the result, for example "0f"



Answer 6:


For one, instead of multiplying by 16, do a bit shift: << 4.

Also, don't use std::string; instead just create a buffer on the heap and then delete it. It will be more efficient than the object destruction that the string requires.




Answer 7:


This is not going to make a lot of difference, but *pChar-(ofs*16) can be done with [*pChar & 0x0F].
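As a quick sanity check of that rewrite, this small sketch walks all 256 byte values and confirms that subtracting the shifted-back high nibble gives the same result as masking with 0x0F. The function name low_nibble_forms_agree is hypothetical.

```cpp
// Returns true when c - (c >> 4) * 16 equals c & 0x0F for every byte value.
bool low_nibble_forms_agree() {
    for (unsigned c = 0; c < 256; ++c) {
        const unsigned ofs = c >> 4;  // high nibble, as in the original code
        if (c - ofs * 16 != (c & 0x0Fu)) {
            return false;
        }
    }
    return true;
}
```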




Answer 8:


This is my version, which, unlike the OP's version, doesn't assume that std::basic_string stores its data in a contiguous region:

#include <string>

using std::string;

static char const* digits("0123456789ABCDEF");

string
tohex(string const& data)
{
    string result(data.size() * 2, 0);
    string::iterator ptr(result.begin());
    for (string::const_iterator cur(data.begin()), end(data.end()); cur != end; ++cur) {
        unsigned char c(*cur);
        *ptr++ = digits[c >> 4];
        *ptr++ = digits[c & 15];
    }
    return result;
}



Answer 9:


I assume this is Windows+IA32.
Try storing each pair of hexadecimal letters as one short int, so that each input byte needs a single 16-bit write.

short int hex_table[256] = {'0'*256+'0', '1'*256+'0', '2'*256+'0', ..., 'E'*256+'F', 'F'*256+'F'};
unsigned short int* pszHex = (unsigned short int*) &str[0];

stick = clock();

for (const unsigned char* pChar = _pArray; pChar != pEnd; pChar++) 
    *pszHex++ = hex_table[*pChar];

etick = clock();



Answer 10:


Changing

    ofs = *pChar >> 4;
    pszHex[0] = pHex[ofs];
    pszHex[1] = pHex[*pChar-(ofs*16)];

to

    int upper = *pChar >> 4;
    int lower = *pChar & 0x0f;
    pszHex[0] = pHex[upper];
    pszHex[1] = pHex[lower];

results in roughly 5% speedup.

Writing the result two bytes at time as suggested by Robert results in about 18% speedup. The code changes to:

_result.resize(_len*2);
short* pszHex = (short*) &_result[0];
const unsigned char* pEnd = _pArray + _len;

const char* pHex = _hex2asciiU_value;
for(const unsigned char* pChar = _pArray;
    pChar != pEnd;
    pChar++, ++pszHex )
{
    *pszHex = short_table[*pChar];
}

Required initialization:

short short_table[256];

for (int i = 0; i < 256; ++i)
{
    char* pc = (char*) &short_table[i];
    pc[0] = _hex2asciiU_value[i >> 4];
    pc[1] = _hex2asciiU_value[i & 0x0f];
}
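The casts to short* above rely on the string buffer being suitably aligned and on the compiler tolerating the aliasing. A hedged alternative sketch of the same two-bytes-per-write idea uses std::memcpy for both building the table and storing each pair; compilers typically lower a fixed two-byte memcpy to a single 16-bit store, though that is an optimization and not a guarantee. The name to_hex_pairs is hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical name: writes both hex digits of a byte with one 2-byte copy.
std::string to_hex_pairs(const unsigned char* data, std::size_t len) {
    // Each entry holds the two output characters in memory order.
    static const std::array<std::uint16_t, 256> table = [] {
        static const char digits[] = "0123456789ABCDEF";
        std::array<std::uint16_t, 256> t{};
        for (std::size_t i = 0; i < 256; ++i) {
            const char pair[2] = { digits[i >> 4], digits[i & 0x0F] };
            std::memcpy(&t[i], pair, 2);  // works for either endianness
        }
        return t;
    }();

    std::string out(len * 2, '\0');
    char* dst = &out[0];
    for (std::size_t i = 0; i < len; ++i, dst += 2) {
        std::memcpy(dst, &table[data[i]], 2);  // no alignment requirement
    }
    return out;
}
```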

Doing it 2 bytes at a time or 4 bytes at a time will probably result in even greater speedups, as pointed out by Allan Wind, but then it gets trickier when you have to deal with the odd characters.

If you're feeling adventurous, you might try to adapt Duff's device to do this.
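For the adventurous, here is a rough sketch of what a Duff's device adaptation could look like, unrolled four times. The function name to_hex_duff and the unroll factor are assumptions for illustration; whether it is any faster than a plain loop needs to be measured.

```cpp
#include <cstddef>
#include <string>

// Hypothetical name: 4x unrolled loop entered mid-body via a switch.
std::string to_hex_duff(const unsigned char* data, std::size_t len) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out(len * 2, '\0');
    if (len == 0) {
        return out;  // the device below assumes at least one byte
    }

    char* dst = &out[0];
    const unsigned char* src = data;
    auto emit_byte = [&] {
        *dst++ = digits[*src >> 4];
        *dst++ = digits[*src & 0x0F];
        ++src;
    };

    std::size_t passes = (len + 3) / 4;  // trips through the do-while
    switch (len % 4) {                   // jump into the loop to absorb the remainder
    case 0: do { emit_byte();
    case 3:      emit_byte();
    case 2:      emit_byte();
    case 1:      emit_byte();
            } while (--passes > 0);
    }
    return out;
}
```

Compilers may warn about the intentional fall-through between the case labels.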

Results are on an Intel Core 2 Duo processor with gcc -O3.

Always measure that you actually get faster results — a pessimization pretending to be an optimization is less than worthless.

Always test that you get the correct results — a bug pretending to be an optimization is downright dangerous.

And always keep in mind the tradeoff between speed and readability — life is too short for anyone to maintain unreadable code.

(Obligatory reference to coding for the violent psychopath who knows where you live.)
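To make the "always test" advice concrete, here is a minimal sketch of a correctness check: a slow reference built on snprintf, plus a helper that compares any candidate against it over all 256 byte values. The names hex_reference and matches_reference are hypothetical, and the candidate is assumed to take (const unsigned char*, std::size_t) and return std::string.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Slow but easy-to-trust reference: format each byte with "%02X".
std::string hex_reference(const unsigned char* data, std::size_t len) {
    std::string out;
    out.reserve(len * 2);
    char buf[3];
    for (std::size_t i = 0; i < len; ++i) {
        std::snprintf(buf, sizeof buf, "%02X", static_cast<unsigned>(data[i]));
        out.append(buf, 2);
    }
    return out;
}

// Returns true if the candidate matches the reference on a buffer
// containing every possible byte value once.
template <typename Fn>
bool matches_reference(Fn candidate) {
    unsigned char all[256];
    for (int i = 0; i < 256; ++i) {
        all[i] = static_cast<unsigned char>(i);
    }
    return candidate(all, sizeof all) == hex_reference(all, sizeof all);
}
```

Running such a check after every optimization attempt would catch the kind of incorrect output reported for the question's earlier revision.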




Answer 11:


Make sure your compiler optimization is turned on to the highest working level.

You know, flags like '-O1' to '-O3' in gcc.




Answer 12:


I have found that using an index into an array, rather than a pointer, can speed things up a tick. It all depends on how your compiler chooses to optimize. The key is that the processor has instructions to do complex things like [i*2+1] in a single instruction.
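A minimal sketch of the index-based form, so it can be timed against the pointer-walking loop. The name to_hex_indexed is hypothetical; on x86 the out[i * 2 + 1] access can map onto a scaled-index addressing mode, but whether the compiler actually emits that is up to the optimizer.

```cpp
#include <cstddef>
#include <string>

// Hypothetical name: same conversion, written with array indices only.
std::string to_hex_indexed(const unsigned char* data, std::size_t len) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out(len * 2, '\0');
    for (std::size_t i = 0; i < len; ++i) {
        out[i * 2]     = digits[data[i] >> 4];    // base + i*2
        out[i * 2 + 1] = digits[data[i] & 0x0F];  // base + i*2 + 1
    }
    return out;
}
```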




Answer 13:


If you're rather obsessive about speed here, you can do the following:

Each character is one byte, representing two hex values. Thus, each character is really two four-bit values.

So, you can do the following:

  1. Unpack the four-bit values to 8-bit values using a multiplication or similar instruction.
  2. Use pshufb, the SSSE3 instruction (Core2-only though). It takes an array of 16 8-bit input values and shuffles them based on the 16 8-bit indices in a second vector. Since you have only 16 possible characters, this fits perfectly; the input array is a vector of 0 through F characters, and the index array is your unpacked array of 4-bit values.

Thus, in a single instruction, you will have performed 16 table lookups in fewer clocks than it normally takes to do just one (pshufb is 1 clock latency on Penryn).

So, in computational steps:

  1. A B C D E F G H I J K L M N O P (64-bit vector of input values, "Vector A") -> 0A 0B 0C 0D 0E 0F 0G 0H 0I 0J 0K 0L 0M 0N 0O 0P (128-bit vector of indices, "Vector B"). The easiest way is probably two 64-bit multiplies.
  2. pshufb [0123456789ABCDEF], Vector B
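The two steps above can be sketched with compiler intrinsics, under several assumptions: an x86 CPU with SSSE3, GCC or Clang (for the target attribute), and a hypothetical function name to_hex_ssse3. Step 1 is done here with a shift and mask instead of the multiplication suggested above, and the sketch handles 8 input bytes per pshufb with a scalar tail.

```cpp
#include <cstddef>
#include <string>
#include <tmmintrin.h>  // SSSE3: _mm_shuffle_epi8 (pshufb)

// Hypothetical name. Assumes x86 with SSSE3 and a GCC/Clang-style attribute.
__attribute__((target("ssse3")))
std::string to_hex_ssse3(const unsigned char* data, std::size_t len) {
    static const char digits[] = "0123456789ABCDEF";
    const __m128i lut =
        _mm_loadu_si128(reinterpret_cast<const __m128i*>(digits));
    const __m128i nibble_mask = _mm_set1_epi8(0x0F);

    std::string out(len * 2, '\0');
    char* dst = &out[0];
    std::size_t i = 0;

    for (; i + 8 <= len; i += 8, dst += 16) {
        // Load 8 input bytes into the low half of a vector.
        const __m128i in =
            _mm_loadl_epi64(reinterpret_cast<const __m128i*>(data + i));
        // Step 1: split every byte into its high and low 4-bit values.
        const __m128i hi = _mm_and_si128(_mm_srli_epi16(in, 4), nibble_mask);
        const __m128i lo = _mm_and_si128(in, nibble_mask);
        // Interleave to hi0 lo0 hi1 lo1 ..., the order of the output digits.
        const __m128i idx = _mm_unpacklo_epi8(hi, lo);
        // Step 2: sixteen table lookups with a single pshufb.
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst),
                         _mm_shuffle_epi8(lut, idx));
    }
    // Scalar tail for the remaining 0 to 7 bytes.
    for (; i < len; ++i) {
        *dst++ = digits[data[i] >> 4];
        *dst++ = digits[data[i] & 0x0F];
    }
    return out;
}
```

On a CPU without SSSE3 this would fault, so real code would need a runtime feature check and a scalar fallback.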



Answer 14:


I'm not sure doing it more bytes at a time will be better... you'll probably just get tons of cache misses and slow it down significantly.

What you might try is to unroll the loop though, take larger steps and do more characters each time through the loop, to remove some of the loop overhead.




Answer 15:


Consistently getting ~4ms on my Athlon 64 4200+ (~7ms with original code)

for( const unsigned char* pChar = _pArray; pChar != pEnd; pChar++) {
    const char* pchars = _hex2asciiU_value[*pChar];
    *pszHex++ = *pchars++;
    *pszHex++ = *pchars;
}



Answer 16:


The function as it is shown at the time of writing produces incorrect output even when _hex2asciiU_value is fully specified. The following code works, and on my 2.33GHz MacBook Pro it runs in about 1.9 seconds for 200,000,000 characters.

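#include <cstdio>    // snprintf
#include <cstdlib>   // rand, srand
#include <ctime>     // clock, time
#include <iostream>
#include <string>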

using namespace std;

static const size_t _h2alen = 256;
static char _hex2asciiU_value[_h2alen][3];

string char_to_hex( const unsigned char* _pArray, unsigned int _len )
{
    string str;
    str.resize(_len*2);
    char* pszHex = &str[0];
    const unsigned char* pEnd = _pArray + _len;
    const char* pHex = _hex2asciiU_value[0];
    for( const unsigned char* pChar = _pArray; pChar != pEnd; pChar++, pszHex += 2 ) {
       pszHex[0] = _hex2asciiU_value[*pChar][0];
       pszHex[1] = _hex2asciiU_value[*pChar][1];
    }
    return str;
}


int main() {
  for(int i=0; i<_h2alen; i++) {
    snprintf(_hex2asciiU_value[i], 3,"%02X", i);
  }
  size_t len = 200000000;
  char* a = new char[len];
  string t1;
  string t2;
  clock_t start;
  srand(time(NULL));
  for(int i=0; i<len; i++) a[i] = rand()&0xFF;
  start = clock();
  t1=char_to_hex((const unsigned char*)a, len);
  cout << "char_to_hex conversion took ---> " << (clock() - start)/(double)CLOCKS_PER_SEC << " seconds\n";
}


Source: https://stackoverflow.com/questions/69115/char-to-hex-string-exercise
