Feature Columns Embedding lookup

和自甴很熟 submitted on 2019-11-30 07:55:44

I think there is a misunderstanding here. For a text classification task where the input is a piece of text (a sentence), you should treat the entire sentence as a single feature column. Every data point then has a single textual column, not many columns. The value in this column is usually a combined embedding of all the tokens. That is how a variable-length sparse feature (an unknown number of text tokens) is converted into one dense feature (e.g., a fixed 256-dimensional float vector).
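To make the "combined embedding" idea concrete, here is a minimal pure-Python sketch of mean combination, which is the default combiner of tf.feature_column.embedding_column. The toy table EMBEDDINGS, the 4-dimensional vectors, and the helper combine_mean are made up for illustration and are not part of TensorFlow; returning zeros for an input with no known tokens is a choice made for this sketch.

```python
# Toy embedding table: token -> 4-dimensional vector (values are made up).
EMBEDDINGS = {
    "this":    [1.0, 0.0, 0.0, 2.0],
    "product": [0.0, 1.0, 0.0, 2.0],
    "is":      [0.0, 0.0, 1.0, 2.0],
}
DIMENSION = 4


def combine_mean(tokens):
    """Average the vectors of all known tokens into one fixed-size vector."""
    vectors = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vectors:
        # Assumption of this sketch: no known tokens yields a zero vector.
        return [0.0] * DIMENSION
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Inputs of different lengths produce vectors of the same length.
short = combine_mean(["this", "is"])
longer = combine_mean(["this", "product", "is", "unknown-token"])
print(len(short), len(longer))
```

However many tokens the sentence has, the result always has DIMENSION entries, which is what lets it sit next to ordinary dense features.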

Let's start with a _CategoricalColumn.

cat_column_with_vocab = tf.feature_column.categorical_column_with_vocabulary_list(
    key='my-text',
    vocabulary_list=vocab_list)

Note that if your vocabulary is huge, you should use categorical_column_with_vocabulary_file instead.
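A vocabulary file for that function is a plain text file with one token per line. The following is a small sketch of producing such a file from an in-memory list. The helper write_vocab_file, the temporary directory, and the file name vocab.txt are assumptions for illustration, and the TensorFlow call is left as a comment so the snippet runs without TensorFlow installed.

```python
import os
import tempfile

vocab_list = ["this", "product", "is", "for sale", "within", "us"]


def write_vocab_file(tokens, path):
    """Write one token per line, dropping duplicates but keeping first-seen order."""
    seen = set()
    with open(path, "w", encoding="utf-8") as f:
        for token in tokens:
            if token not in seen:
                seen.add(token)
                f.write(token + "\n")
    return len(seen)


# Hypothetical location; any writable path works.
vocab_path = os.path.join(tempfile.mkdtemp(), "vocab.txt")
vocab_size = write_vocab_file(vocab_list, vocab_path)

# With TensorFlow 1.x available, the file could then be used like this:
# cat_column_with_vocab = tf.feature_column.categorical_column_with_vocabulary_file(
#     key='my-text', vocabulary_file=vocab_path, vocabulary_size=vocab_size)
```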

We create an embedding column by using an initializer that either reads from a checkpoint (if we have a pre-trained embedding) or initializes the weights randomly.

embedding_initializer = None
if has_pretrained_embedding:
  # xxxx is a placeholder; this call also needs the embedding tensor name
  # and the vocabulary arguments.
  embedding_initializer = tf.contrib.framework.load_embedding_initializer(
      ckpt_path=xxxx)
else:
  embedding_initializer = tf.random_uniform_initializer(-1.0, 1.0)
embed_column = tf.feature_column.embedding_column(
    categorical_column=cat_column_with_vocab,
    dimension=256,   # this is your pre-trained embedding dimension
    initializer=embedding_initializer,
    trainable=False)  # freezes the embedding; use True to train a random one

Suppose you have another dense feature price:

price_column = tf.feature_column.numeric_column('price')

Create your feature columns:

columns = [embed_column, price_column]

Build the model:

# Parse serialized tf.Example protos using a spec derived from the columns.
features = tf.parse_example(...,
    features=tf.feature_column.make_parse_example_spec(columns))
dense_tensor = tf.feature_column.input_layer(features, columns)
for units in [128, 64, 32]:
  dense_tensor = tf.layers.dense(dense_tensor, units, tf.nn.relu)
prediction = tf.layers.dense(dense_tensor, 1)
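As a rough sanity check on the network above, the sketch below traces the layer widths in plain Python. It assumes that input_layer concatenates the 256-wide embedding with the single price value into a 257-wide input, and that each dense layer has a kernel plus a bias. The helper layer_shapes is hypothetical, and the count excludes the embedding table itself.

```python
embedding_dim = 256   # width produced by embed_column
numeric_dim = 1       # numeric_column('price') has shape (1,) by default
hidden_units = [128, 64, 32]


def layer_shapes(input_dim, hidden, output_dim=1):
    """Return the (inputs, outputs) kernel shape of each dense layer, in order."""
    shapes = []
    width = input_dim
    for units in hidden + [output_dim]:
        shapes.append((width, units))
        width = units
    return shapes


input_dim = embedding_dim + numeric_dim
shapes = layer_shapes(input_dim, hidden_units)
# Kernel weights plus one bias per output unit, summed over all dense layers.
param_count = sum(i * o + o for i, o in shapes)
print(shapes, param_count)
```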

By the way, tf.parse_example assumes that your input data is a tf.Example like this (text protobuf):

features {
  feature {
    key: "price"
    value { float_list {
      value: 29.0
    }}
  }
  feature {
    key: "my-text"
    value { bytes_list {
      value: "this"
      value: "product"
      value: "is"
      value: "for sale"
      value: "within"
      value: "us"
    }}
  }
}

That is, I assume you have two features: one is the product price, and the other is the text description of the product. Your vocabulary list would be a superset of

["this", "product", "is", "for sale", "within", "us"].
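One way to obtain such a superset is to collect every token seen across the dataset. Here is a minimal sketch, assuming the descriptions are already tokenized. The second description, the helper build_vocab, and the min_count threshold are hypothetical and only illustrate the idea.

```python
from collections import Counter

# Hypothetical tokenized descriptions; the first mirrors the tf.Example above.
descriptions = [
    ["this", "product", "is", "for sale", "within", "us"],
    ["this", "product", "is", "sold out"],
]


def build_vocab(token_lists, min_count=1):
    """Collect tokens that occur at least min_count times, sorted for stability."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    return sorted(token for token, n in counts.items() if n >= min_count)


vocab_list = build_vocab(descriptions)
print(vocab_list)
```

Raising min_count drops rare tokens. With the default vocabulary-list settings, such out-of-vocabulary tokens do not contribute to the embedding.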