I’m talking about the audio features dataset available at https://research.google.com/audioset/download.html as a tar.gz archive consisting of frame-level audio tfrecords.
Extracting everything else from the tfrecord files works fine: I can read the keys video_id, start_time_seconds, end_time_seconds, and labels. But the actual embeddings needed for training do not seem to be there at all; when I iterate over the contents of any tfrecord file from the dataset, only those four keys are printed.
This is the code I'm using:
import tensorflow as tf
import numpy as np

def readTfRecordSamples(tfrecords_filename):
    record_iterator = tf.python_io.tf_record_iterator(path=tfrecords_filename)
    for string_record in record_iterator:
        example = tf.train.Example()
        example.ParseFromString(string_record)
        print(example)  # prints the above-mentioned 4 keys but NOT audio_embedding

        # the first label can then be parsed like this:
        label = example.features.feature['labels'].int64_list.value[0]
        print('label 1: ' + str(label))

        # this, however, does not work:
        # audio_embedding = example.features.feature['audio_embedding'].bytes_list.value[0]

readTfRecordSamples('embeddings/01.tfrecord')
Is there any trick to extracting the 128-dimensional embeddings? Or are they really not in this dataset?
Solved it: the tfrecord files need to be read as sequence examples, not as examples. The above code works if the line
example = tf.train.Example()
is replaced by
example = tf.train.SequenceExample()
The embeddings and all other content can then be viewed by simply running
print(example)
Source: https://stackoverflow.com/questions/46204992/how-can-i-extract-the-audio-embeddings-features-from-google-s-audioset