Question
I want to perform basic preprocessing and tokenization within my input function. My data is contained in CSVs in a Google Cloud Storage bucket location (gs://) that I cannot modify. Further, I need to perform any modifications on input text within my ml-engine package so that the behavior can be replicated at serving time.
My input function follows the basic structure below:
filename_queue = tf.train.string_input_producer(filenames)
reader = tf.TextLineReader()
_, rows = reader.read_up_to(filename_queue, num_records=batch_size)
text, label = tf.decode_csv(rows, record_defaults=[[""], [""]])
# add logic to filter special characters
# add logic to make all words lowercase
words = tf.string_split(text)  # splits on whitespace
Are there any options that avoid performing this preprocessing on the entire data set in advance? This post suggests that tf.py_func() can be used to make these transformations; however, it also warns that "The drawback is that as it is not saved in the graph, I cannot restore my saved model", so I am not convinced that this will be useful at serving time. If I define my own tf.py_func() to do preprocessing, and it is defined in the trainer package that I am uploading to the cloud, will I run into any issues? Are there any alternative options that I am not considering?
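For reference, a minimal sketch of that tf.py_func() approach (assuming TensorFlow 1.x; the _clean helper is a hypothetical name). Note the caveat quoted above: the Python body is not serialized into the graph, so an exported SavedModel cannot run it at serving time:

import re
import numpy as np
import tensorflow as tf

def _clean(text_batch):
    # Plain Python: lowercase each string and drop special characters.
    return np.array([re.sub(rb"[^a-z0-9 ]", b"", t.lower()) for t in text_batch],
                    dtype=object)

# 'text' is the string tensor produced by tf.decode_csv above.
cleaned = tf.py_func(_clean, [text], tf.string, stateful=False)
cleaned.set_shape(text.get_shape())  # py_func discards static shape info
words = tf.string_split(cleaned)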
Answer 1:
Best practice is to write a function that you call from both the training/eval input_fn and from your serving input_fn.
For example:
def add_engineered(features):
    text = features['text']
    features['words'] = tf.string_split(text)
    return features
Then, in your input_fn, wrap the features you return with a call to add_engineered:
def input_fn():
    features = ...
    label = ...
    return add_engineered(features), label
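Filled in with the reading code from your question, that might look like the sketch below (filenames and batch_size come from your own configuration):

def input_fn():
    filename_queue = tf.train.string_input_producer(filenames)
    reader = tf.TextLineReader()
    _, rows = reader.read_up_to(filename_queue, num_records=batch_size)
    text, label = tf.decode_csv(rows, record_defaults=[[""], [""]])
    # Pass the raw features through the shared transformation.
    return add_engineered({'text': text}), label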
and in your serving_input_fn, make sure to similarly wrap the returned features (NOT the feature_placeholders) with a call to add_engineered:
def serving_input_fn():
    feature_placeholders = ...
    features = ...
    return tflearn.utils.input_fn_utils.InputFnOps(
        add_engineered(features),
        None,
        feature_placeholders
    )
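For the text pipeline above, the placeholders could be filled in roughly as follows (a sketch; it assumes import tensorflow.contrib.learn as tflearn and leaves the batch size unspecified):

def serving_input_fn():
    feature_placeholders = {
        # Raw 'text' strings arrive in the prediction request.
        'text': tf.placeholder(tf.string, [None])
    }
    features = feature_placeholders.copy()
    return tflearn.utils.input_fn_utils.InputFnOps(
        add_engineered(features),
        None,
        feature_placeholders
    )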
Your model would use 'words'. However, your JSON input at prediction time would only need to contain 'text', i.e. the raw values.
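For example, an online prediction request body might look like this (hypothetical value); 'words' is computed in-graph by add_engineered, so it is never sent:

{"instances": [{"text": "some raw example text"}]}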
Here's a complete working example:
https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/courses/machine_learning/feateng/taxifare/trainer/model.py#L107
Source: https://stackoverflow.com/questions/45661300/google-cloud-ml-engine-tensorflow-perform-preprocessing-tokenization-in-input