feature-engineering | 易学教程

KMeans clustering unbalanced data

Question: I have a data set with 50 features (c1, c2, c3, ...) and over 80k rows. Each row contains normalised numerical values (ranging from 0 to 1). The values are actually normalised dummy variables, so some rows have only a few features, 3-4 (0 is assigned where there is no value), while most rows have about 10-20 features. I used KMeans to cluster the data, and it always produces one cluster with a large number of members. On analysis, I noticed that rows with fewer than 4 features tend to get clustered together
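
One possible reading of this observation, offered as a sketch rather than a diagnosis: under the Euclidean distance that KMeans uses, rows with only a few non-zero features all sit close to the origin and therefore close to each other. The synthetic data, the helper names `make_rows` and `mean_pairwise_distance`, and the choice of 3 versus 15 active features are assumptions made for illustration; they are not taken from the question.

```python
import numpy as np

n_features = 50
rng = np.random.default_rng(0)


def make_rows(n_rows, n_active):
    """Build rows where only `n_active` randomly chosen features are non-zero."""
    rows = np.zeros((n_rows, n_features))
    for row in rows:  # each `row` is a view, so the assignment writes into `rows`
        active = rng.choice(n_features, size=n_active, replace=False)
        row[active] = rng.uniform(0.1, 1.0, size=n_active)
    return rows


def mean_pairwise_distance(x):
    """Average Euclidean distance over all distinct pairs of rows."""
    diffs = x[:, None, :] - x[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    return dist[np.triu_indices(len(x), k=1)].mean()


sparse_rows = make_rows(200, n_active=3)   # like the rows with 3-4 features
dense_rows = make_rows(200, n_active=15)   # like the rows with 10-20 features

sparse_spread = mean_pairwise_distance(sparse_rows)
dense_spread = mean_pairwise_distance(dense_rows)

print(f"mean distance between sparse rows: {sparse_spread:.2f}")
print(f"mean distance between dense rows:  {dense_spread:.2f}")
```

If the sparse rows turn out to be much closer together than the dense ones, as they should be for data shaped like this, a distance that ignores row magnitude (for example, clustering L2-normalised rows) may be worth trying. Whether that suits the real data is something only the data can answer.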

Categorical features correlation

Question: I have some categorical features in my data along with continuous ones. Is it a good idea, or a definitely bad one, to one-hot encode the categorical features in order to find their correlation with the labels, alongside the other continuous features?

Answer 1: There is a way to calculate the correlation coefficient without one-hot encoding the categorical variable. Cramér's V statistic is one method for calculating the correlation of categorical variables. It can be calculated as follows. The following link is helpful. Using pandas,
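
The answer's own code is cut off in this excerpt, so what follows is not the answerer's implementation. It is a minimal sketch of the standard, uncorrected Cramér's V, V = sqrt(chi2 / (n * min(r - 1, c - 1))), computed from a pandas contingency table. The function name `cramers_v` and the example series are hypothetical.

```python
import numpy as np
import pandas as pd


def cramers_v(x, y):
    """Uncorrected Cramér's V between two categorical series."""
    observed = pd.crosstab(x, y).to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts under independence: row total * column total / n.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    min_dim = min(observed.shape) - 1
    if min_dim == 0:
        return float("nan")  # undefined when either variable has one level
    return float(np.sqrt(chi2 / (n * min_dim)))


# Hypothetical data: `label` is fully determined by `colour`.
colour = pd.Series(["red", "red", "blue", "blue", "green", "green"])
label = pd.Series(["a", "a", "b", "b", "c", "c"])
v_dependent = cramers_v(colour, label)

# Hypothetical data: the two variables are independent.
left = pd.Series(["a", "a", "b", "b"])
right = pd.Series(["u", "v", "u", "v"])
v_independent = cramers_v(left, right)

print(v_dependent, v_independent)
```

The statistic ranges from 0 (no association) to 1 (one variable determines the other). On small samples the uncorrected form tends to overestimate the association, so a bias-corrected variant may be preferable in practice.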

LabelEncoder for categorical features?

Question: This might be a beginner question, but I have seen a lot of people using LabelEncoder() to replace categorical variables with ordinal values. Many of them apply it to multiple columns at once; however, I suspect that this gives some of my features the wrong ordinality, and I wonder how that will affect my model. Here is an example:

Input

    import pandas as pd
    import numpy as np
    from sklearn.preprocessing import LabelEncoder

    a = pd.DataFrame(['High','Low','Low','Medium'])
    le =
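
The concern about ordinality can be made concrete with the question's own four values. The sketch below shows that `LabelEncoder` numbers classes in sorted order, so 'High' comes before 'Low', and contrasts that with an explicit mapping. The `order` dictionary, which assumes Low < Medium < High, and the variable names are illustrative and do not come from the original post.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

levels = pd.Series(["High", "Low", "Low", "Medium"])

# LabelEncoder assigns integer codes by sorted class name, not by meaning:
# High -> 0, Low -> 1, Medium -> 2.
encoder = LabelEncoder()
alphabetical_codes = encoder.fit_transform(levels)

# Assumed intended order (hypothetical): Low < Medium < High.
order = {"Low": 0, "Medium": 1, "High": 2}
ordinal_codes = levels.map(order)

print(list(encoder.classes_))       # class order chosen by LabelEncoder
print(alphabetical_codes.tolist())  # codes following that sorted order
print(ordinal_codes.tolist())       # codes following the explicit mapping
```

An explicit mapping keeps the encoding under the author's control. For features with no natural order, one-hot encoding avoids implying one at all.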

Combining two financial datasets, with interactive account balance variable over time

Question: I have a question related to a financial transactions dataset. I have two datasets. The first one contains financial transactions with a timestamp:

       Account_from  Account_to  Value  Timestamp
    1             1           2     25          1
    2             1           3     25          1
    3             2           1     50          2
    4             2           3     20          2
    5             2           4     25          2
    6             1           2     40          3
    7             3           1     20          3
    8             2           4     25          3

The other dataset contains account information:

       Account_id  initial deposit
    1           1              200
    2           2              100
    3           3              150
    4           4              200

Now I would like to create a dataset with the financial transactions and the balance of the original account.
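
One way to approach this, sketched under assumptions the question does not settle: transactions are applied in the order listed, both the sending and the receiving balance are updated, and the new column records the sender's balance after each transfer. The column name `Balance_from` is hypothetical.

```python
import pandas as pd

transactions = pd.DataFrame({
    "Account_from": [1, 1, 2, 2, 2, 1, 3, 2],
    "Account_to":   [2, 3, 1, 3, 4, 2, 1, 4],
    "Value":        [25, 25, 50, 20, 25, 40, 20, 25],
    "Timestamp":    [1, 1, 2, 2, 2, 3, 3, 3],
})
accounts = pd.DataFrame({
    "Account_id": [1, 2, 3, 4],
    "initial deposit": [200, 100, 150, 200],
})

# Running balance per account, starting from the initial deposits.
balances = accounts.set_index("Account_id")["initial deposit"].to_dict()

balance_after = []
for row in transactions.itertuples(index=False):
    balances[row.Account_from] -= row.Value  # money leaves the sender
    balances[row.Account_to] += row.Value    # and arrives at the receiver
    balance_after.append(balances[row.Account_from])

# Assumed meaning of "balance of the original account": sender, after transfer.
transactions["Balance_from"] = balance_after
print(transactions)
```

A plain loop is used because each balance depends on all earlier rows touching that account, as sender or as receiver. If the balance before the transfer is wanted instead, append to `balance_after` before updating `balances`.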

converting dictionary to binary in python

Question: I have a dictionary with customer IDs as keys and movie IDs as values. Even if a customer has watched the same movie many times, I want it counted once. I need to convert the dictionary to binary data, with the customer IDs as rows and the movie IDs as columns, where a cell is 1 if the customer has watched the movie and 0 otherwise.

    d = {'121212121': [111, 222, 333, 333, 444, 444],
         '212121212': [222, 555, 555, 666],
         '212123322': [555, 666, 666, 666, 777]}

Desired output : customer ID
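
The desired output is cut off in this excerpt, so the layout below is a guess at it: one row per customer, one column per movie ID, 1 where the customer watched the movie at least once. The sketch assumes each dictionary value is a list of movie IDs, since the dictionary as posted is not valid Python. The names `movie_ids` and `binary` are illustrative.

```python
import pandas as pd

# Assumed structure: each customer ID maps to a list of watched movie IDs.
d = {
    "121212121": [111, 222, 333, 333, 444, 444],
    "212121212": [222, 555, 555, 666],
    "212123322": [555, 666, 666, 666, 777],
}

# Every distinct movie ID becomes one column.
movie_ids = sorted({movie for movies in d.values() for movie in movies})

# Converting each list to a set collapses repeated viewings into one.
rows = []
for movies in d.values():
    watched = set(movies)
    rows.append([int(movie in watched) for movie in movie_ids])

binary = pd.DataFrame(rows, index=list(d.keys()), columns=movie_ids)
binary.index.name = "customer ID"
print(binary)
```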

creating new column based on whether the letter 'l' or 'L' is in the string of another column

Question: I am working with the Open Food Facts dataset, which is very messy. There is a column called quantity that holds information about the quantity of the respective food. The entries look like:

    365 g (314 ml)
    992 g
    2.46 kg
    0,33 litre
    15.87oz
    250 ml
    1 L
    33 cl
    ...

and so on (very messy!!!). I want to create a new column called is_liquid. My idea is that if the quantity string contains an l or L, the is_liquid field in that row should get a 1, and otherwise a 0. Here is what I've tried: I wrote this function:
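
The asker's own function is not included in this excerpt. As a sketch of the rule described (1 if the string contains `l` or `L`, else 0), a vectorised pandas version could look like the following. The sample rows come from the question, except the trailing `None`, which is an added assumption to show how missing values are handled.

```python
import pandas as pd

df = pd.DataFrame({
    "quantity": [
        "365 g (314 ml)", "992 g", "2.46 kg", "0,33 litre",
        "15.87oz", "250 ml", "1 L", "33 cl",
        None,  # assumed: the real column may contain missing values
    ]
})

# True where the string contains 'l' or 'L'; missing entries count as False.
contains_l = df["quantity"].str.contains("[lL]", na=False)
df["is_liquid"] = contains_l.astype(int)
print(df)
```

This implements the heuristic exactly as stated, so any unit or word containing the letter (for instance "lb") would also be flagged. Matching specific volume units would be stricter, at the cost of enumerating them.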

create multi-hot SparseTensor by categorical feature array column from CSV in TensorFlow

Question: This is a typical way of handling sparse features (such as some ID features) in recommendation systems. I'm looking for a convenient way to prepare the data for a TensorFlow pipeline. I have searched a lot but have not found a good solution yet. Below is the one that seems closest to what I need, but it is not working yet. See the ####### part below. The data file looks like:

    csv = [
        '1221,cc,1',
        '213,aa|cc|ff,1',
    ]

For the second row, I need a SparseTensor like this multi-hot encoding:

    aa bb cc dd ee ff
    | 0 0 1 0 0 0 |
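
The asker's TensorFlow code is not part of this excerpt. The sketch below uses plain Python and NumPy, not the TensorFlow pipeline the question asks for. It builds the three pieces that make up a sparse multi-hot encoding: `indices`, `values` and `dense_shape`, the same three arguments that `tf.sparse.SparseTensor` takes. The vocabulary order is taken from the question's column header. Reading the three CSV fields as id, `|`-separated tags and label is an assumption.

```python
import numpy as np

csv = [
    "1221,cc,1",
    "213,aa|cc|ff,1",
]
vocab = ["aa", "bb", "cc", "dd", "ee", "ff"]  # order from the question's header
index_of = {tag: i for i, tag in enumerate(vocab)}

indices = []
for row, line in enumerate(csv):
    # Assumed field layout: id, '|'-separated tags, label.
    _id, tags, _label = line.split(",")
    for tag in tags.split("|"):
        indices.append([row, index_of[tag]])

values = [1] * len(indices)
dense_shape = [len(csv), len(vocab)]

# Dense view of the same data, for checking the encoding by eye.
dense = np.zeros(dense_shape, dtype=np.int64)
for r, c in indices:
    dense[r, c] = 1

print(indices)
print(dense)
```

Inside a TensorFlow input pipeline the splitting and vocabulary lookup would need to be done with TensorFlow ops instead; this version only pins down what the target tensor should contain.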