feature-engineering

KMeans clustering unbalanced data

只谈情不闲聊 submitted on 2021-01-28 18:57:55
Question: I have a dataset with 50 features (c1, c2, c3, ...) and over 80k rows. Each row contains normalised numerical values (in the range 0-1). The features are actually normalised dummy variables, and some rows have only a few features set, 3-4 (0 is assigned where there is no value). Most rows have about 10-20 features. I used KMeans to cluster the data, and it always produces one cluster with a very large number of members. Upon analysis, I noticed that rows with fewer than 4 features tend to get clustered together
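
One possible reading of this observation: KMeans minimises squared Euclidean distance, and rows with only a few non-zero values in the range 0-1 all lie close to the origin, so they end up close to each other even when they share no features. The sketch below illustrates this with made-up rows; the helper names `make_row` and `euclidean` and the feature values are assumptions for illustration, not part of the question.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def make_row(active, n_features=50, value=1.0):
    """Hypothetical helper: a row that is 0 everywhere except at `active`."""
    row = [0.0] * n_features
    for i in active:
        row[i] = value
    return row

# Two sparse rows (3 features each) that share no feature at all.
sparse_a = make_row(range(0, 3))
sparse_b = make_row(range(3, 6))

# Two dense rows (15 features each) that also share no feature.
dense_a = make_row(range(0, 15))
dense_b = make_row(range(15, 30))

sparse_distance = euclidean(sparse_a, sparse_b)  # sqrt(3 + 3)
dense_distance = euclidean(dense_a, dense_b)     # sqrt(15 + 15)

print(f"sparse pair: {sparse_distance:.3f}, dense pair: {dense_distance:.3f}")
```

Although both pairs are completely dissimilar in terms of shared features, the sparse pair is much closer in Euclidean terms, which is consistent with sparse rows collecting in one cluster.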

Categorical features correlation

霸气de小男生 submitted on 2020-05-24 15:53:33
Question: I have some categorical features in my data along with continuous ones. Is it a good idea or an absolutely bad one to one-hot encode the categorical features in order to find their correlation with the labels, along with the other continuous features? Answer 1: There is a way to calculate the correlation coefficient without one-hot encoding the categorical variable. Cramér's V statistic is one method for measuring the association between categorical variables. It can be calculated as follows. The following link is helpful. Using pandas,
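
The answer's own pandas code is cut off in this excerpt. As a rough illustration of the statistic it names, here is a minimal pure-Python sketch of Cramér's V without bias correction, V = sqrt(chi2 / (n * (min(r, c) - 1))), where r and c are the numbers of distinct categories. The function name `cramers_v` and the sample data are assumptions for illustration, not the answer's code.

```python
import math
from collections import Counter

def cramers_v(x, y):
    """Cramér's V (no bias correction) for two equal-length categorical sequences."""
    n = len(x)
    joint = Counter(zip(x, y))   # observed counts per (x, y) pair
    row_totals = Counter(x)
    col_totals = Counter(y)

    # Pearson chi-squared statistic over the full contingency table.
    chi2 = 0.0
    for a, row_n in row_totals.items():
        for b, col_n in col_totals.items():
            expected = row_n * col_n / n
            observed = joint.get((a, b), 0)
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(row_totals), len(col_totals)) - 1
    if k == 0:
        # A variable with a single category carries no association.
        return 0.0
    return math.sqrt(chi2 / (n * k))

# Made-up data: `colour` fully determines `label`, `shape` is unrelated to it.
colour = ["red", "red", "blue", "blue"]
shape = ["round", "square", "round", "square"]
label = ["yes", "yes", "no", "no"]

v_colour = cramers_v(colour, label)
v_shape = cramers_v(shape, label)
print(v_colour, v_shape)
```

V ranges from 0 (no association) to 1 (one variable fully determines the other); unlike a correlation coefficient it has no sign.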

LabelEncoder for categorical features?

无人久伴 submitted on 2020-05-05 15:36:13
Question: This might be a beginner question, but I have seen a lot of people using LabelEncoder() to replace categorical variables with ordinal codes. Many people apply it to multiple columns at a time; however, I suspect it gives the wrong ordinality for some of my features, and I wonder how that will affect my model. Here is an example: Input import pandas as pd import numpy as np from sklearn.preprocessing import LabelEncoder a = pd.DataFrame(['High','Low','Low','Medium']) le =
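
The concern can be made concrete. LabelEncoder assigns integer codes in the sorted order of the class labels, so for strings the codes follow alphabetical order rather than the intended Low < Medium < High. The sketch below mimics that ordering in plain Python and contrasts it with an explicit mapping; the `order` dictionary is an assumption about the intended ranking, not something stated in the question.

```python
values = ["High", "Low", "Low", "Medium"]

# Codes by sorted class label, which is how LabelEncoder numbers classes:
# 'High' -> 0, 'Low' -> 1, 'Medium' -> 2.
classes = sorted(set(values))
alphabetical_codes = [classes.index(v) for v in values]

# Explicit ordinal mapping (assumed intended order: Low < Medium < High).
order = {"Low": 0, "Medium": 1, "High": 2}
ordinal_codes = [order[v] for v in values]

print(alphabetical_codes)
print(ordinal_codes)
```

With the alphabetical codes, 'High' receives the smallest value, so a model that treats the column as numeric sees the wrong ordering. With a pandas column, the same explicit mapping can be applied through `Series.map(order)`.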

Combining two financial datasets, with interactive account balance variable over time

放肆的年华 submitted on 2020-01-05 07:12:27
Question: I have a question related to a financial transactions dataset. I have two datasets. The first one contains financial transactions with a timestamp:

   Account_from  Account_to  Value  Timestamp
1             1           2     25          1
2             1           3     25          1
3             2           1     50          2
4             2           3     20          2
5             2           4     25          2
6             1           2     40          3
7             3           1     20          3
8             2           4     25          3

The other dataset contains account information:

   Account_id  initial deposit
1           1              200
2           2              100
3           3              150
4           4              200

Now I would like to create a dataset with the financial transactions and the balance of the original account.
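
The excerpt is cut off before the desired output, so the following is only a sketch under stated assumptions: transactions are applied in the listed order, and "balance of the original account" is taken to mean the sender's balance right after each transfer. The column name `balance_from` is hypothetical.

```python
# Transactions as (Account_from, Account_to, Value, Timestamp), in listed order.
transactions = [
    (1, 2, 25, 1), (1, 3, 25, 1), (2, 1, 50, 2), (2, 3, 20, 2),
    (2, 4, 25, 2), (1, 2, 40, 3), (3, 1, 20, 3), (2, 4, 25, 3),
]

# Initial deposit per Account_id.
balances = {1: 200, 2: 100, 3: 150, 4: 200}

rows = []
for account_from, account_to, value, timestamp in transactions:
    # Move the money, then record the sender's updated balance.
    balances[account_from] -= value
    balances[account_to] += value
    rows.append({
        "Account_from": account_from,
        "Account_to": account_to,
        "Value": value,
        "Timestamp": timestamp,
        "balance_from": balances[account_from],  # assumed meaning of "balance"
    })

for row in rows:
    print(row)
```

If the balance before the transfer is wanted instead, recording `balances[account_from]` before the two updates gives that variant.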

converting dictionary to binary in python

耗尽温柔 submitted on 2019-12-14 03:55:39
Question: I have a dictionary whose keys are customer IDs and whose values are movie IDs. Even if a customer has watched the same movie many times, I want it counted once. I need to convert the dictionary to binary data: customer IDs as rows and movie IDs as columns, with 1 if the customer has watched the movie and 0 otherwise. d = {'121212121': [111, 222, 333, 333, 444, 444], '212121212': [222, 555, 555, 666], '212123322': [555, 666, 666, 666, 777]} Desired output : customer ID
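
A minimal sketch of the conversion described, assuming the dictionary values are lists of movie IDs (the literal in the question is not valid Python as written). The names `movie_ids` and `binary` are illustrative.

```python
d = {
    "121212121": [111, 222, 333, 333, 444, 444],
    "212121212": [222, 555, 555, 666],
    "212123322": [555, 666, 666, 666, 777],
}

# Column order: every distinct movie ID, sorted.
movie_ids = sorted({movie for movies in d.values() for movie in movies})

# One row per customer: 1 if the movie was watched at least once, else 0.
binary = {}
for customer, movies in d.items():
    watched = set(movies)  # repeated viewings collapse to one
    binary[customer] = [1 if movie in watched else 0 for movie in movie_ids]

print("customer ID", movie_ids)
for customer, row in binary.items():
    print(customer, row)
```

Converting each list to a set first is what makes repeated viewings count once.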

creating new column based on whether the letter 'l' or 'L' is in the string of another column

痴心易碎 submitted on 2019-12-11 16:12:26
Question: I am working with the Open Food Facts dataset, which is very messy. There is a column called quantity that holds information about the quantity of the respective food. The entries look like: 365 g (314 ml), 992 g, 2.46 kg, 0,33 litre, 15.87oz, 250 ml, 1 L, 33 cl ... and so on (very messy!!!). I want to create a new column called is_liquid. My idea is that if the quantity string contains an l or L, the is_liquid field in that row should get a 1, and otherwise 0. Here is what I've tried: I wrote this function:
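
The asker's own function is cut off in this excerpt. The sketch below simply implements the rule as stated (1 if the string contains l or L, else 0) on the sample entries; treating missing or non-string values as 0 is an assumption added here.

```python
def is_liquid(quantity):
    """Return 1 if the quantity string contains 'l' or 'L', else 0."""
    if not isinstance(quantity, str):
        # Assumption: missing or non-string entries count as not liquid.
        return 0
    return 1 if "l" in quantity.lower() else 0

quantities = [
    "365 g (314 ml)", "992 g", "2.46 kg", "0,33 litre",
    "15.87oz", "250 ml", "1 L", "33 cl",
]

flags = [is_liquid(q) for q in quantities]
for q, flag in zip(quantities, flags):
    print(f"{q!r}: {flag}")
```

On a pandas column the same rule can be written as `df["quantity"].str.contains("l", case=False, na=False).astype(int)`, assuming a DataFrame named `df`. Note that the rule as stated also flags units such as "lb", so a stricter pattern may be needed for real data.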

create multi-hot SparseTensor by categorical feature array column from CSV in TensorFlow

Deadly submitted on 2019-12-11 01:46:58
Question: This is a typical way of handling sparse features (such as some ID features) in a recommendation system. I'm looking for a convenient way to prepare the data for a TensorFlow pipeline. I have searched a lot but have not found a good solution yet. Below is the one that seems closest to what I need, but it is not working yet. See the ####### part below. The data file is like: csv = [ '1221,cc,1', '213,aa|cc|ff,1', ] For the second row, I need a SparseTensor like a multi-hot encoding: aa bb cc dd ee ff | 0 0 1 0 0 0 |
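
The asker's TensorFlow code is not included in this excerpt. As a library-free sketch of the target structure, the code below parses the sample rows into the three pieces a sparse tensor is built from: indices, values and dense shape. The function name `multi_hot_coo` and the assumption that the columns are (id, categories, label) are illustrative; the vocabulary is taken from the header shown in the question.

```python
csv_rows = ["1221,cc,1", "213,aa|cc|ff,1"]
vocab = ["aa", "bb", "cc", "dd", "ee", "ff"]
index_of = {token: i for i, token in enumerate(vocab)}

def multi_hot_coo(rows, index_of):
    """Return (indices, values, dense_shape) for the category column."""
    indices, values = [], []
    for r, line in enumerate(rows):
        # Assumed column layout: id, '|'-separated categories, label.
        _row_id, categories, _label = line.split(",")
        # Deduplicate and sort so indices come out in row-major order.
        for col in sorted({index_of[t] for t in categories.split("|")}):
            indices.append([r, col])
            values.append(1)
    return indices, values, [len(rows), len(index_of)]

indices, values, dense_shape = multi_hot_coo(csv_rows, index_of)

# Rebuild the dense multi-hot matrix to check the result by eye.
dense = [[0] * dense_shape[1] for _ in range(dense_shape[0])]
for (r, c), v in zip(indices, values):
    dense[r][c] = v

print(indices, values, dense_shape)
print(dense)
```

These three lists correspond to the `indices`, `values` and `dense_shape` arguments of `tf.SparseTensor`; doing the same parsing inside a TensorFlow input pipeline would need TensorFlow string ops instead of Python string methods.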