Perform feature selection in Tensorflow


Question


I recently started with machine learning in Python.

I am working on a Python project involving regression to predict some values. The input is a dataset of 70 features, a mix of categorical and ordinal variables. The dependent variable is continuous.

The inputs to the selection routine would be the data and the desired number of significant variables.

I have a couple of questions, listed below.

1] Is there a way to perform feature selection using the forward selection technique in TensorFlow?

2] Are there alternatives to feature selection?

Any help would be appreciated!


Answer 1:


Problem

I have N features (N = 70, for example) and I want to select the top K features. (1) How can I do this in TensorFlow, and (2) what alternatives are there to feature selection?

Discussion

I will show one way to limit the N features to at most K using a variant of L1 loss. As for alternatives to feature selection, there are many, depending on what you want to achieve. If you can go outside of TensorFlow, you could use decision trees or a random forest and simply constrain the number of leaves so that at most K features are used (see the sketch just below). If you must use TensorFlow and you want an alternative to top-K selection that merely regularizes your weights, you could use random dropout or L2 loss. Again, it really depends on what you want to achieve when looking for an alternative to top K features.
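As a concrete illustration of the tree-based route, here is a minimal sketch using scikit-learn. Note it ranks features by the forest's impurity-based importances and keeps the top K rather than constraining leaves directly; RandomForestRegressor, feature_importances_, and the helper name top_k_features are my choices, not part of the original answer.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def top_k_features(X, y, k):
    # Fit a forest, then rank features by impurity-based importance.
    forest = RandomForestRegressor(n_estimators=100, random_state=0)
    forest.fit(X, y)
    # Indices of the k most important features, most important first.
    return np.argsort(forest.feature_importances_)[::-1][:k]

# Usage: keep, say, the 10 strongest of the 70 original columns.
# selected = top_k_features(X, y, k=10)
# X_reduced = X[:, selected]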

Solution: top K of N features

Suppose our TensorFlow graph is defined as follows (this is TensorFlow 1.x-style code; under TensorFlow 2 it would need the tf.compat.v1 API):

import tensorflow as tf

size = 4

# Placeholders for the inputs, the targets, and the L1 penalty weight.
x_in = tf.placeholder(shape=[None, size], dtype=tf.float32)
y_in = tf.placeholder(shape=[None], dtype=tf.float32)
l1_weight = tf.placeholder(shape=[], dtype=tf.float32)

# One weight per feature. The ReLU clamps negative raw weights to exactly 0,
# and since ReLU passes no gradient for negative inputs, a feature driven
# negative by the L1 penalty stays zeroed out.
m = tf.Variable(tf.random_uniform(shape=[size, 1], minval=0.1, maxval=0.9), dtype=tf.float32)
m = tf.nn.relu(m)
b = tf.Variable([-10], dtype=tf.float32)

predict = tf.squeeze(tf.nn.xw_plus_b(x_in, m, b))

# Mean squared error plus a weighted L1 penalty on the feature weights.
l1_loss = tf.reduce_sum(tf.abs(m))
loss = tf.reduce_mean(tf.square(y_in - predict)) + l1_loss * l1_weight

optimizer = tf.train.GradientDescentOptimizer(1e-4)
train = optimizer.minimize(loss)

# k counts how many feature weights are still non-zero.
m_all_0 = tf.zeros([size, 1], dtype=tf.float32)
zerod_feature_count = tf.reduce_sum(tf.cast(tf.equal(m, m_all_0), dtype=tf.float32))
k = size - zerod_feature_count

Let's define some data (the first 100 rows of the well-known iris data set, with a 0/1 class label in the last column) and use it:

import numpy as np
data = np.array([[5.1,3.5,1.4,0.2,0],[4.9,3.0,1.4,0.2,0],[4.7,3.2,1.3,0.2,0],[4.6,3.1,1.5,0.2,0],[5.0,3.6,1.4,0.2,0],[5.4,3.9,1.7,0.4,0],[4.6,3.4,1.4,0.3,0],[5.0,3.4,1.5,0.2,0],[4.4,2.9,1.4,0.2,0],[4.9,3.1,1.5,0.1,0],[5.4,3.7,1.5,0.2,0],[4.8,3.4,1.6,0.2,0],[4.8,3.0,1.4,0.1,0],[4.3,3.0,1.1,0.1,0],[5.8,4.0,1.2,0.2,0],[5.7,4.4,1.5,0.4,0],[5.4,3.9,1.3,0.4,0],[5.1,3.5,1.4,0.3,0],[5.7,3.8,1.7,0.3,0],[5.1,3.8,1.5,0.3,0],[5.4,3.4,1.7,0.2,0],[5.1,3.7,1.5,0.4,0],[4.6,3.6,1.0,0.2,0],[5.1,3.3,1.7,0.5,0],[4.8,3.4,1.9,0.2,0],[5.0,3.0,1.6,0.2,0],[5.0,3.4,1.6,0.4,0],[5.2,3.5,1.5,0.2,0],[5.2,3.4,1.4,0.2,0],[4.7,3.2,1.6,0.2,0],[4.8,3.1,1.6,0.2,0],[5.4,3.4,1.5,0.4,0],[5.2,4.1,1.5,0.1,0],[5.5,4.2,1.4,0.2,0],[4.9,3.1,1.5,0.1,0],[5.0,3.2,1.2,0.2,0],[5.5,3.5,1.3,0.2,0],[4.9,3.1,1.5,0.1,0],[4.4,3.0,1.3,0.2,0],[5.1,3.4,1.5,0.2,0],[5.0,3.5,1.3,0.3,0],[4.5,2.3,1.3,0.3,0],[4.4,3.2,1.3,0.2,0],[5.0,3.5,1.6,0.6,0],[5.1,3.8,1.9,0.4,0],[4.8,3.0,1.4,0.3,0],[5.1,3.8,1.6,0.2,0],[4.6,3.2,1.4,0.2,0],[5.3,3.7,1.5,0.2,0],[5.0,3.3,1.4,0.2,0],[7.0,3.2,4.7,1.4,1],[6.4,3.2,4.5,1.5,1],[6.9,3.1,4.9,1.5,1],[5.5,2.3,4.0,1.3,1],[6.5,2.8,4.6,1.5,1],[5.7,2.8,4.5,1.3,1],[6.3,3.3,4.7,1.6,1],[4.9,2.4,3.3,1.0,1],[6.6,2.9,4.6,1.3,1],[5.2,2.7,3.9,1.4,1],[5.0,2.0,3.5,1.0,1],[5.9,3.0,4.2,1.5,1],[6.0,2.2,4.0,1.0,1],[6.1,2.9,4.7,1.4,1],[5.6,2.9,3.6,1.3,1],[6.7,3.1,4.4,1.4,1],[5.6,3.0,4.5,1.5,1],[5.8,2.7,4.1,1.0,1],[6.2,2.2,4.5,1.5,1],[5.6,2.5,3.9,1.1,1],[5.9,3.2,4.8,1.8,1],[6.1,2.8,4.0,1.3,1],[6.3,2.5,4.9,1.5,1],[6.1,2.8,4.7,1.2,1],[6.4,2.9,4.3,1.3,1],[6.6,3.0,4.4,1.4,1],[6.8,2.8,4.8,1.4,1],[6.7,3.0,5.0,1.7,1],[6.0,2.9,4.5,1.5,1],[5.7,2.6,3.5,1.0,1],[5.5,2.4,3.8,1.1,1],[5.5,2.4,3.7,1.0,1],[5.8,2.7,3.9,1.2,1],[6.0,2.7,5.1,1.6,1],[5.4,3.0,4.5,1.5,1],[6.0,3.4,4.5,1.6,1],[6.7,3.1,4.7,1.5,1],[6.3,2.3,4.4,1.3,1],[5.6,3.0,4.1,1.3,1],[5.5,2.5,4.0,1.3,1],[5.5,2.6,4.4,1.2,1],[6.1,3.0,4.6,1.4,1],[5.8,2.6,4.0,1.2,1],[5.0,2.3,3.3,1.0,1],[5.6,2.7,4.2,1.3,1],[5.7,3.0,4.2,1.2,1],[5.7,2.9,4.2,1.3,1],[6.2,2.9,4.3,1.3,1],[5.1,2.5,3.0,1.1,1],[5.7,2.8,4.1,1.3,1]])

# The first four columns are the features; the last is the 0/1 target.
x = data[:, 0:4]
y = data[:, -1]

sess = tf.Session()
sess.run(tf.global_variables_initializer())

Let's define a reusable trial function that can test different weights for the L1 loss. It first trains without the penalty (l1_weight=0) to fit the data, then turns the penalty on until at most 3 features remain non-zero:

def trial(weight=1):
    # Phase 1: fit the data with no L1 penalty.
    print("initial m,loss", sess.run([m, loss], feed_dict={x_in: x, y_in: y, l1_weight: 0}))
    for _ in range(10000):
        sess.run(train, feed_dict={x_in: x, y_in: y, l1_weight: 0})
    print("after training m,loss", sess.run([m, loss], feed_dict={x_in: x, y_in: y, l1_weight: 0}))
    # Phase 2: apply the L1 penalty until at most 3 features survive.
    for _ in range(10000):
        sess.run(train, feed_dict={x_in: x, y_in: y, l1_weight: weight})
        if sess.run(k) <= 3:
            break
    print("after l1 loss m", sess.run([m, loss], feed_dict={x_in: x, y_in: y, l1_weight: weight}))

Then we'll try it out

print "The number of non-zero parameters is",sess.run(k)
print "Doing a training session"
trial()
print "The number of non-zero parameters is",sess.run(k)

And the results look good

The number of non-zero parameters is 4.0
Doing a training session
initial m,loss [array([[0.75030506],
       [0.4959089 ],
       [0.646675  ],
       [0.44027993]], dtype=float32), 8.228347]
after training m,loss [array([[1.1898338 ],
       [1.0033115 ],
       [0.15164669],
       [0.16414128]], dtype=float32), 0.68957466]
after l1 loss m [array([[1.250356  ],
       [0.92532456],
       [0.10235767],
       [0.        ]], dtype=float32), 2.9621665]
The number of non-zero parameters is 3.0
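If you then want to know which features survived, the non-zero rows of m are the selected columns. A minimal sketch, assuming the session above is still open (np.flatnonzero and the variable names weights/selected are mine, not from the original answer):

import numpy as np

weights = sess.run(m)               # shape [size, 1]
selected = np.flatnonzero(weights)  # e.g. array([0, 1, 2]) for the run above
x_selected = x[:, selected]         # keep only the surviving columns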


Source: https://stackoverflow.com/questions/49843886/perform-feature-selection-in-tensorflow
