Decoding COMP-3 packed fields in an ASCII file in Python?

Question

I have a file that was formerly an EBCDIC-encoded file, which was converted to ASCII using dd. However, some lines contain COMP-3 packed fields which I would like to read.

For example, the string representation of one of the lines I would like to decode is:

'15\x00\x00\x00\x04@\x00\x00\x00\x00\x0c\x00\x00\x00\x00\x0c777093020141204NNNNNNNNYNNNN\n'

The field I would like to read is specified by PIC S9(09) COMP-3 POS. 3, that is, the field that starts at the third byte and is nine digits long when decoded (and therefore five bytes long when packed, according to the COMP-3 spec).
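To make the layout concrete, here is a minimal sketch of how the field boundaries work out. The helper name `comp3_length` is hypothetical; it only encodes the rule that a packed field stores one nibble per digit plus one sign nibble, padded to whole bytes.

```python
def comp3_length(digits: int) -> int:
    """Bytes occupied by a packed field: one nibble per digit plus a sign nibble."""
    return digits // 2 + 1

# The sample record from the question, as raw bytes.
line = (b'15\x00\x00\x00\x04@\x00\x00\x00\x00\x0c\x00\x00\x00\x00\x0c'
        b'777093020141204NNNNNNNNYNNNN\n')

start = 3 - 1                                  # POS. 3 is 1-based, Python slices are 0-based
field = line[start:start + comp3_length(9)]    # PIC S9(09) -> 5 bytes
print(field)
```

This only isolates the five bytes; it does not yet say anything about whether they still hold the original packed value.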

I understand the COMP-3 spec and I also know that for this particular line the integer value of this field should be 315, but I can't figure out what to do in order to actually decode the field. I'm also not sure if the fact that the file was converted with dd to ASCII is a problem here or not.

Has anyone worked on a similar issue before, or is there something obvious I'm missing? Thank you!

Answer 1:

Yes, it is a problem: the file contains non-character data and was converted from EBCDIC to ASCII at the file or record level. Which tool did the conversion does not matter.

By far the easiest thing for you is to request that the data be given to you in character-only form. Where the data contains signed fields, the sign should be separate, and where there are implied decimal places they should be made explicit or indicated by a scaling value (whichever is more convenient for you).

Then you need to convert nothing. I can never understand how people think they can just give you EBCDIC data containing "whatever" and expect you to sort it out.

If you click on the EBCDIC tag you will find some other solutions you may be able to apply if, for some idiotic reason, the character data cannot be made available from the EBCDIC source. Since they've given you crap already, they may be able to come up with some moronic reason. If so, document it (politely) to your boss.

If you get character data, then you can dd or whatever to convert it (if you still get funny-looking stuff, check the code-pages).

The reason things get pickled if you convert non-character data is exemplified by this:

05  a-packed-decimal-positive-five COMP-3 PIC S9 VALUE +5.
05  a-character-asterisk PIC X VALUE "*".

Both of those, in EBCDIC, have the hexadecimal value 5C. Both will be converted to an ASCII asterisk, so the COMP-3 value of five has been lost. Note that each byte of a COMP-3 field, apart from the low-order byte that carries the sign, can hold any pair of decimal digits. You are in a pickle whenever one of those bytes happens to hit a control character. The same goes for "binary" fields, and it is worse there because there are more possibilities for an accidental hit.
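The collision can be reproduced in a few lines. This is a sketch that assumes Python's built-in `cp037` codec as the EBCDIC code page; the actual table used by a given conversion tool may differ.

```python
# A one-byte packed field holding +5: digit nibble 5, sign nibble C.
packed_plus_five = bytes([0x5C])

# The EBCDIC (code page 037) encoding of an asterisk.
ebcdic_asterisk = "*".encode("cp037")

# The two are indistinguishable at the byte level.
same_byte = packed_plus_five == ebcdic_asterisk

# A character-level conversion therefore turns the packed value into ASCII '*'.
converted = packed_plus_five.decode("cp037").encode("ascii")
print(same_byte, converted.hex())
```

After the conversion the byte is X'2A', which no longer has a valid sign nibble, so a packed-decimal reader cannot interpret it.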

Answer 2:

If the character encoding conversion is reversed, the value may be recoverable. Because there is good reason to doubt that this works in general, the best thing to do is, as Bill Woodger suggested, to get a new copy of the data in a text format, or a new copy of the original data that has not been corrupted by a character translation of its inherently binary portions. In this specific case I am confident the value can be determined, but as 0d377 (+377) rather than 0d315 (+315).
Hopefully sense can be made of the following:

ASCII string (as given, \x-escaped):

'15\x00\x00\x00\x04@\x00\x00\x00\x00\x0c\x00\x00\x00\x00\x0c777093020141204NNNNNNNNYNNNN\n'

ASCII (hex):

  ....+....1....+....2....+....3....+....4....+....5....+....6....+....7....+....8....+....9....+
X'31350000000440000000000C000000000C3737373039333032303134313230344E4E4E4E4E4E4E4E594E4E4E4E0A'
           -04-    ASCII x04->x37 in EBCDIC [control character End of Transmission (EOT)]
             -40-  ASCII x40->x7C in EBCDIC [or xB5 or x80 or xEC or ?? per @ is a variant character in EBCDIC]

EBCDIC:

  ....+....1....+....2....+....3....+....4....+....5....+....6....+....7....+....8....+....9....+
x'F1F5000000377C000000000C000000000CF7F7F7F0F9F3F0F2F0F1F4F1F2F0F4D5D5D5D5D5D5D5D5E8D5D5D5D525'
           -37-    EBCDIC x37->x04 in ASCII [control character End of Transmission (EOT)]
             -7C-  EBCDIC x7C->x40 in ASCII [or A7 or 25 or ?? per x7C does not represent an invariant character in EBCDIC]

The PIC S9(09) COMP-3 POS. 3 field is packed binary-coded decimal (BCD) and occupies five bytes, at hex-digit positions five to fourteen in the scale lines shown. Its ten hex digits, 000000377C, represent the positive decimal integer value 377. I have little doubt that this was the original value.

By chance, the conversion from EBCDIC to ASCII did not corrupt that particular string, because every byte in it survives a round trip through the character conversion. The next two fields in the record are presumably defined the same way, and they too are unaffected by data loss in a conversion both to and from EBCDIC: the control character with code point x0C is the same in both EBCDIC and ASCII, and both fields hold the decimal value positive zero.

Other code pages could have been tried for the round trip, but CP00037 is a strong contender: it yields x7C, which has a valid sign nibble, and a valid conversion. The value 315 seems quite improbable. The reserved EBCDIC control character x31 would have had to translate into ASCII x04 instead of either x91 or xBA, and the most likely final byte, EBCDIC x5C, would inexplicably have had to convert to ASCII x40 instead of x2A (or, for a negative value, x5D to ASCII x40 instead of x29; non-preferred sign codes were not contemplated). Neither makes any sense.
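The round trip described in this answer can be sketched in Python. This assumes the original conversion matched Python's built-in `cp037` codec for the bytes involved, which is the same code-page assumption made above; `unpack_comp3` is a hypothetical helper, not a library function.

```python
def unpack_comp3(raw: bytes) -> int:
    """Decode packed decimal: two digits per byte, sign in the last nibble."""
    nibbles = []
    for byte in raw:
        nibbles.append(byte >> 4)
        nibbles.append(byte & 0x0F)
    *digits, sign = nibbles
    if any(d > 9 for d in digits) or sign < 0x0A:
        raise ValueError(f"not a valid packed field: {raw.hex()}")
    value = int("".join(str(d) for d in digits))
    # B and D are the negative sign codes; A, C, E and F are positive.
    return -value if sign in (0x0B, 0x0D) else value

ascii_line = (b'15\x00\x00\x00\x04@\x00\x00\x00\x00\x0c\x00\x00\x00\x00\x0c'
              b'777093020141204NNNNNNNNYNNNN\n')

# Undo the character conversion: ASCII bytes -> text -> EBCDIC bytes.
ebcdic_line = ascii_line.decode("ascii").encode("cp037")

first = ebcdic_line[2:7]     # PIC S9(09) COMP-3 at POS. 3, five bytes
second = ebcdic_line[7:12]   # presumably the same definition
print(first.hex(), unpack_comp3(first), unpack_comp3(second))
```

For this record the recovered field bytes are 000000377c, which decode to +377. The approach is only safe when every byte in the field round-trips; a strict codec such as `ascii` will raise on bytes above X'7F', which is a useful warning that the data cannot be recovered this way.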

Answer 3:

After a lot of trial and error, I noticed that a direct conversion to ASCII gives the correct number except for the last character, which carries both the last digit and the sign. There is a conversion table for translating that last character. Here is some quick and dirty code that works for my use case. My file is loaded into a pandas data frame, and I call this function to do the translation by passing in the column and the number of decimal places.

# The last character of each value encodes both the final digit and the sign.
POSITIVE = '{ABCDEFGHI'   # final digit 0-9, positive
NEGATIVE = '}JKLMNOPQR'   # final digit 0-9, negative

sign = {**{c: 1 for c in POSITIVE}, **{c: -1 for c in NEGATIVE}}
last_digit = {c: i for chars in (POSITIVE, NEGATIVE) for i, c in enumerate(chars)}

def unpack(value, decimal):
    # value: pandas Series of strings; decimal: number of implied decimal places
    l = value.str[-1:]                  # last character of each string
    s = l.map(sign)                     # +1 or -1
    d = l.map(last_digit)               # final digit, 0-9
    num = value.str[:-1].astype(int)    # leading digits
    return (num * 10 + d) * s / 10**decimal

Now your new field in the dataframe can be:

df['unpacked'] = unpack(df['Packed'],2)
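For a single value outside pandas, the same table lookup can be sketched as a plain function. Note that this mapping ('{', A to I for positive, '}', J to R for negative) is the convention for signed zoned-decimal (display) fields after character conversion, so it may not apply to a true COMP-3 field such as the one in the question. The name `decode_overpunch` is hypothetical, and returning a `Decimal` is a choice made here to avoid floating-point rounding.

```python
from decimal import Decimal

POSITIVE = "{ABCDEFGHI"   # index in the string is the final digit, sign is +
NEGATIVE = "}JKLMNOPQR"   # index in the string is the final digit, sign is -

def decode_overpunch(text: str, decimals: int = 0) -> Decimal:
    """Decode a numeric string whose last character carries digit and sign."""
    body, last = text[:-1], text[-1]
    if last in POSITIVE:
        sign, digit = 1, POSITIVE.index(last)
    elif last in NEGATIVE:
        sign, digit = -1, NEGATIVE.index(last)
    elif last.isdigit():              # unsigned field: plain trailing digit
        sign, digit = 1, int(last)
    else:
        raise ValueError(f"unexpected final character {last!r}")
    magnitude = int(body + str(digit))
    # Shift the decimal point left by the number of implied decimal places.
    return Decimal(sign * magnitude).scaleb(-decimals)

print(decode_overpunch("1234E", 2), decode_overpunch("123}", 2))
```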

Source: https://stackoverflow.com/questions/29232656/decoding-comp-3-packed-fields-in-an-ascii-file-in-python

Tags

python

ebcdic

comp-3