What are the advantages of using tf.train.SequenceExample over tf.train.Example for variable length features?

我在风中等你 2020-12-31 17:09

Recently I read this guide on undocumented features in TensorFlow, as I needed to pass variable-length sequences as input. However, I found the protocol for tf.train.S

2 answers
  •  一向 (original poster)
    2020-12-31 17:57

    Here are the definitions of the Example and SequenceExample protocol buffers, and all the protos they may contain:

    message BytesList { repeated bytes value = 1; }
    message FloatList { repeated float value = 1 [packed = true]; }
    message Int64List { repeated int64 value = 1 [packed = true]; }
    message Feature {
        oneof kind {
            BytesList bytes_list = 1;
            FloatList float_list = 2;
            Int64List int64_list = 3;
        }
    };
    message Features { map<string, Feature> feature = 1; };
    message Example { Features features = 1; };
    
    message FeatureList { repeated Feature feature = 1; };
    message FeatureLists { map<string, FeatureList> feature_list = 1; };
    message SequenceExample {
      Features context = 1;
      FeatureLists feature_lists = 2;
    };
    

    An Example contains a Features, which maps each feature name to a Feature, which in turn contains a bytes list, a float list, or an int64 list.
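
    As a minimal sketch of that nesting, the snippet below builds an Example holding one feature of each kind. The feature names ("name", "scores", "ids") and their values are made up for illustration; only the tf.train message classes come from TensorFlow.

```python
import tensorflow as tf

# Hypothetical feature names and values, chosen only to show one of each list type.
example = tf.train.Example(features=tf.train.Features(feature={
    "name": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"Alice"])),
    "scores": tf.train.Feature(float_list=tf.train.FloatList(value=[0.5, 1.5])),
    "ids": tf.train.Feature(int64_list=tf.train.Int64List(value=[1, 2, 3])),
}))

# Each Feature is a oneof: exactly one of the three list fields is set.
ids_kind = example.features.feature["ids"].WhichOneof("kind")
ids_values = list(example.features.feature["ids"].int64_list.value)
```

    Note that every feature here is a single flat list; there is no way to keep, say, two separate lists under the same name.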

    A SequenceExample also contains a Features, but it also contains a FeatureLists, which contains a mapping from list name to FeatureList, which contains a list of Feature. So it can do everything an Example can do, and more. But do you really need that extra functionality? What does it do?

    Since each Feature contains a list of values, a FeatureList is a list of lists. And that's the key: if you need lists of lists of values, then you need SequenceExample.

    For example, if you handle text, you can represent it as one big string:

    from tensorflow.train import BytesList
    
    BytesList(value=[b"This is the first sentence. And here's another."])
    

    Or you could represent it as a list of words and tokens:

    BytesList(value=[b"This", b"is", b"the", b"first", b"sentence", b".", b"And", b"here",
                     b"'s", b"another", b"."])
    

    Or you could represent each sentence separately. That's where you would need a list of lists:

    from tensorflow.train import BytesList, Feature, FeatureList
    
    s1 = BytesList(value=[b"This", b"is", b"the", b"first", b"sentence", b"."])
    s2 = BytesList(value=[b"And", b"here", b"'s", b"another", b"."])
    fl = FeatureList(feature=[Feature(bytes_list=s1), Feature(bytes_list=s2)])
    

    Then create the SequenceExample:

    from tensorflow.train import SequenceExample, FeatureLists
    
    seq = SequenceExample(feature_lists=FeatureLists(feature_list={
        "sentences": fl
    }))
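
    Because a SequenceExample also has a context field, per-sequence metadata can be stored next to the list of lists. The sketch below assumes a hypothetical "author_id" context feature and a small helper named words(); neither appears in the original answer.

```python
import tensorflow as tf

def words(*tokens):
    """Hypothetical helper: wrap a sequence of byte strings in a Feature."""
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=list(tokens)))

# Context: flat features that describe the whole sequence ("author_id" is made up).
context = tf.train.Features(feature={
    "author_id": tf.train.Feature(int64_list=tf.train.Int64List(value=[123])),
})

# Feature list: one Feature per sentence, each with its own length.
sentences = tf.train.FeatureList(feature=[
    words(b"This", b"is", b"the", b"first", b"sentence", b"."),
    words(b"And", b"here", b"'s", b"another", b"."),
])

seq_with_context = tf.train.SequenceExample(
    context=context,
    feature_lists=tf.train.FeatureLists(feature_list={"sentences": sentences}),
)
```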
    

    And you can serialize it and perhaps save it to a TFRecord file.

    data = seq.SerializeToString()
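
    One possible way to do the saving step is sketched below with tf.io.TFRecordWriter, reading the record back with tf.data.TFRecordDataset to check the round trip. The file name "sentences.tfrecord" and the temporary directory are assumptions for the example.

```python
import os
import tempfile

import tensorflow as tf

# Rebuild a small SequenceExample so this snippet runs on its own.
s1 = tf.train.BytesList(value=[b"This", b"is", b"the", b"first", b"sentence", b"."])
s2 = tf.train.BytesList(value=[b"And", b"here", b"'s", b"another", b"."])
fl = tf.train.FeatureList(feature=[tf.train.Feature(bytes_list=s1),
                                   tf.train.Feature(bytes_list=s2)])
seq = tf.train.SequenceExample(
    feature_lists=tf.train.FeatureLists(feature_list={"sentences": fl}))

# Hypothetical output path inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "sentences.tfrecord")

# Write the serialized proto as a single record.
with tf.io.TFRecordWriter(path) as writer:
    writer.write(seq.SerializeToString())

# Read the raw records back and decode the first one into a proto again.
records = [record.numpy() for record in tf.data.TFRecordDataset([path])]
restored = tf.train.SequenceExample.FromString(records[0])
```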
    

    Later, when you read the data, you can parse it using tf.io.parse_single_sequence_example().
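
    A rough sketch of that parsing step follows. It assumes each sentence has a different length, so the feature list is described with tf.io.VarLenFeature and the resulting sparse tensor is converted to a ragged tensor; the "author_id" context feature is a made-up addition to show both return values.

```python
import tensorflow as tf

def words(*tokens):
    """Hypothetical helper: wrap a sequence of byte strings in a Feature."""
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=list(tokens)))

# Build and serialize a SequenceExample ("author_id" is an illustrative assumption).
seq = tf.train.SequenceExample(
    context=tf.train.Features(feature={
        "author_id": tf.train.Feature(int64_list=tf.train.Int64List(value=[123])),
    }),
    feature_lists=tf.train.FeatureLists(feature_list={
        "sentences": tf.train.FeatureList(feature=[
            words(b"This", b"is", b"the", b"first", b"sentence", b"."),
            words(b"And", b"here", b"'s", b"another", b"."),
        ]),
    }),
)
serialized = seq.SerializeToString()

# Describe what to expect: a scalar context feature and variable-length sentences.
context_features = {"author_id": tf.io.FixedLenFeature([], tf.int64)}
sequence_features = {"sentences": tf.io.VarLenFeature(tf.string)}

parsed_context, parsed_lists = tf.io.parse_single_sequence_example(
    serialized,
    context_features=context_features,
    sequence_features=sequence_features,
)

# VarLenFeature yields a SparseTensor; a RaggedTensor keeps each sentence's own length.
ragged_sentences = tf.RaggedTensor.from_sparse(parsed_lists["sentences"])
sentence_lengths = ragged_sentences.row_lengths().numpy().tolist()
```

    The ragged tensor has one row per sentence, which is the list-of-lists structure that a plain Example cannot express under a single feature name.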
