how to write a valid decoding file based on a given .proto, reading from a .pb

Submitted by 有些话、适合烂在心里 on 2019-11-29 16:32:38

Update: the confusion here comes down to two points:

  • the root object is Relation, not Document (in fact, only Relation and RelationMentionRef are even used)
  • the pb file is actually multiple objects, each varint-delimited, i.e. prefixed by their length expressed as a varint
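To make the length-prefix point concrete, here is a minimal sketch (not part of the original answer) of how a protobuf "base 128" varint decodes: each byte contributes 7 bits, least-significant group first, and the high bit of each byte flags continuation. The first two bytes of testNegative.pb (dc 16, visible in the hex dumps below) are exactly such a length prefix:

```python
def decode_varint(data, pos=0):
    """Decode a protobuf varint from data starting at pos.

    Returns (value, next_pos). Each byte holds 7 payload bits,
    least-significant group first; the MSB marks continuation.
    """
    result = shift = 0
    while True:
        b = data[pos]
        result |= (b & 0x7F) << shift
        pos += 1
        if not (b & 0x80):
            return result, pos
        shift += 7

# dc 16: 0xDC has its high bit set (continue), 0x16 does not (stop).
# Payload: 0x5C | (0x16 << 7) = 92 + 2816 = 2908 -> the first Relation
# message is 2908 bytes long.
length, pos = decode_varint(bytes([0xDC, 0x16]))
```

Read as a message tag instead of a length prefix, those same bytes look like garbage, which is what tripped up the exploratory analysis below.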

As such, Relation.parseDelimitedFrom should work. Processing it manually, I get:

test-multiple.pb, 96678 Relation objects parsed
testNegative.pb, 94917 Relation objects parsed
testPositive.pb, 1950 Relation objects parsed
trainNegative.pb, 63596 Relation objects parsed
trainPositive.pb, 4700 Relation objects parsed

Old; outdated; exploratory:

I extracted your 4 documents and ran them through a little test rig:

        ProcessFile("testNegative.pb");
        ProcessFile("testPositive.pb");
        ProcessFile("trainNegative.pb");
        ProcessFile("trainPositive.pb");

where ProcessFile first dumps the first 10 bytes as hex, and then tries to process it via a ProtoReader. Here's the results:

Processing: testNegative.pb
dc 16 0a 26 2f 67 75 69 64 2f
> Document
Unexpected end-group in source data; this usually means the source data is corrupt

Yep; agreed; DC is wire-type 4 (end-group), field 27; your document does not define field 27, and even if it did: it is meaningless to start with an end-group.
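The tag arithmetic can be checked directly (a sanity-check sketch, not from the original answer): a single-byte protobuf tag packs the field number and wire type as (field_number << 3) | wire_type.

```python
tag = 0xDC                # first byte of testNegative.pb, read as a tag
wire_type = tag & 0x07    # low 3 bits -> 4, i.e. end-group
field = tag >> 3          # remaining bits -> field number 27
```

Field 27 is undefined in the Document schema, and an end-group with no matching start-group is invalid regardless.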

Processing: testPositive.pb
d5 0f 0a 26 2f 67 75 69 64 2f
> Document
250: Fixed32, Unexpected field
14: Fixed32, Unexpected field
6: String, Unexpected field
6: Variant, Unexpected field
Unexpected end-group in source data; this usually means the source data is corrupt

Here we can't see the offending data in the hex dump, but again: the initial fields look nothing like your data, and the reader readily confirms that the data is corrupt.

Processing: trainNegative.pb
d1 09 0a 26 2f 67 75 69 64 2f
> Document
154: Fixed64, Unexpected field
7: Fixed64, Unexpected field
6: Variant, Unexpected field
6: Variant, Unexpected field
Unexpected end-group in source data; this usually means the source data is corrupt

Same as above.

Processing: trainPositive.pb
cf 75 0a 26 2f 67 75 69 64 2f
> Document
1881: 7, Unexpected field
Invalid wire-type; this usually means you have over-written a file without truncating or setting the length; see http://stackoverflow.com/q/2152978/23354

CF 75 is a two-byte varint with wire-type 7 (which is not defined in the specification).
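The same decoding, worked by hand (a sketch for illustration, not from the original answer): decode the two-byte varint, then split it into field number and wire type.

```python
# CF 75: 0xCF has the continuation bit set, 0x75 does not.
raw = (0xCF & 0x7F) | (0x75 << 7)   # 79 + 14976 = 15055

field = raw >> 3        # 1881, matching the reader's output above
wire_type = raw & 0x07  # 7 -- not defined in the protobuf wire format
```

Wire types 0-5 are the only ones the specification defines (and 3/4, the group types, are deprecated); a tag claiming wire type 7 cannot be valid.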

Your data is well and truly garbage. Sorry.


And with the bonus round of test-multiple.pb from comments (after gz decompression):

Processing: test-multiple.pb
dc 16 0a 26 2f 67 75 69 64 2f
> Document
Unexpected end-group in source data; this usually means the source data is corru
pt

This starts identically to testNegative.pb, and hence fails for exactly the same reason.

I know it's been over two years, but here I provide a general way to read these delimited protocol buffers in Python. The function you mention, parseDelimitedFrom, is not available in the Python implementation of protocol buffers, but here is a small workaround for whoever might need it. This code is an adaptation of that found in: https://www.datadoghq.com/blog/engineering/protobuf-parsing-in-python/

    def read_serveral_pbfs(filename, class_of_pb):
        result = []
        with open(filename, 'rb') as f:
            buf = f.read()
            n = 0
            while n < len(buf):
                # Each message is prefixed by its length as a varint.
                msg_len, new_pos = _DecodeVarint32(buf, n)
                n = new_pos
                # Slice out exactly msg_len bytes and parse them.
                msg_buf = buf[n:n + msg_len]
                n += msg_len
                read_data = class_of_pb()
                read_data.ParseFromString(msg_buf)
                result.append(read_data)
        return result

and a usage example using one of the OP's files:

    import Document_pb2
    from google.protobuf.internal.decoder import _DecodeVarint32

    filename = "trainPositive.pb"
    relations = read_serveral_pbfs(filename, Document_pb2.Relation)