Question
I want to perform basic preprocessing and tokenization within my input function. My data is contained in CSVs in a Google Cloud Storage bucket location (gs://) that I cannot modify. Further, I need to perform any modifications on input text within my ml-engine package so that the behavior can be replicated at serving time.
My input function follows the basic structure below:
filename_queue = tf.train.string_input_producer(filenames)
reader = tf.TextLineReader()
_, rows = reader.read_up_to(filename_queue, num_records=batch_size)
text, label = tf.decode_csv(rows, record_defaults=[[""], [""]])
# add logic to filter special characters
# add logic to make all words lowercase
words = tf.string_split(text)  # splits on whitespace
Are there any options that avoid performing this preprocessing on the entire data set in advance? This post suggests that tf.py_func() can be used to make these transformations; however, it also warns that "The drawback is that as it is not saved in the graph, I cannot restore my saved model", so I am not convinced that this will be useful at serving time. If I define my own tf.py_func() to do preprocessing, and it is defined in the trainer package that I am uploading to the cloud, will I run into any issues? Are there any alternative options that I am not considering?
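For reference, a minimal sketch of that tf.py_func() approach (assuming TensorFlow 1.x; the _clean helper is a hypothetical name). Note the caveat quoted above: the Python body is not serialized into the graph, so an exported SavedModel cannot run it at serving time:

import re
import numpy as np
import tensorflow as tf

def _clean(text_batch):
    # Plain Python: lowercase each string and drop special characters.
    return np.array([re.sub(rb"[^a-z0-9 ]", b"", t.lower()) for t in text_batch],
                    dtype=object)

# 'text' is the string tensor produced by tf.decode_csv above.
cleaned = tf.py_func(_clean, [text], tf.string, stateful=False)
cleaned.set_shape(text.get_shape())  # py_func discards static shape info
words = tf.string_split(cleaned)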
Answer 1:
Best practice is to write a function that you call from both the training/eval input_fn and from your serving input_fn.
For example:
def add_engineered(features):
    text = features['text']
    features['words'] = tf.string_split(text)
    return features
Then, in your input_fn, wrap the features you return with a call to add_engineered:
def input_fn():
    features = ...
    label = ...
    return add_engineered(features), label
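Filled in with the reading code from your question, that might look like the sketch below (filenames and batch_size come from your own configuration):

def input_fn():
    filename_queue = tf.train.string_input_producer(filenames)
    reader = tf.TextLineReader()
    _, rows = reader.read_up_to(filename_queue, num_records=batch_size)
    text, label = tf.decode_csv(rows, record_defaults=[[""], [""]])
    # Pass the raw features through the shared transformation.
    return add_engineered({'text': text}), label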
and in your serving_input_fn, make sure to similarly wrap the returned features (NOT the feature_placeholders) with a call to add_engineered:
def serving_input_fn():
    feature_placeholders = ...
    features = ...
    return tflearn.utils.input_fn_utils.InputFnOps(
        add_engineered(features),
        None,
        feature_placeholders
    )
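For the text pipeline above, the placeholders could be filled in roughly as follows (a sketch; it assumes import tensorflow.contrib.learn as tflearn and leaves the batch size unspecified):

def serving_input_fn():
    feature_placeholders = {
        # Raw 'text' strings arrive in the prediction request.
        'text': tf.placeholder(tf.string, [None])
    }
    features = feature_placeholders.copy()
    return tflearn.utils.input_fn_utils.InputFnOps(
        add_engineered(features),
        None,
        feature_placeholders
    )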
Your model would use 'words'. However, your JSON input at prediction time would only need to contain 'text', i.e. the raw values.
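For example, an online prediction request body might look like this (hypothetical value); 'words' is computed in-graph by add_engineered, so it is never sent:

{"instances": [{"text": "some raw example text"}]}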
Here's a complete working example:
https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/courses/machine_learning/feateng/taxifare/trainer/model.py#L107
Source: https://stackoverflow.com/questions/45661300/google-cloud-ml-engine-tensorflow-perform-preprocessing-tokenization-in-input