dataframe

In DataFrame.withColumn, how can I check if the column's value is null as a condition for the second parameter?

六眼飞鱼酱① submitted on 2021-02-08 04:59:26
Question: If I have a DataFrame called df that looks like:

+----+----+
|  a1|  a2|
+----+----+
| foo| bar|
| N/A| baz|
|null| etc|
+----+----+

I can selectively replace values like so:

val df2 = df.withColumn("a1", when($"a1" === "N/A", $"a2"))

so that df2 looks like:

+----+----+
|  a1|  a2|
+----+----+
| foo| bar|
| baz| baz|
|null| etc|
+----+----+

but why can't I check whether it is null, like:

val df3 = df2.withColumn("a1", when($"a1" === null, $"a2"))

so that I get:

+----+----+
|  a1|  a2|
+----+----+
| foo|

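For reference, a minimal PySpark sketch of the null check the question is after (in Scala the equivalent test is $"a1".isNull); the toy data and the .otherwise clause are assumptions added so that non-null rows keep their original value:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.getOrCreate()
df2 = spark.createDataFrame([("foo", "bar"), ("baz", "baz"), (None, "etc")], ["a1", "a2"])

# isNull() is the supported way to test for null; comparing with === null never matches.
df3 = df2.withColumn("a1", when(col("a1").isNull(), col("a2")).otherwise(col("a1")))
df3.show()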

Imputing missing values using sklearn IterativeImputer class for MICE

…衆ロ難τιáo~ submitted on 2021-02-08 04:57:29
Question: I'm trying to learn how to implement MICE for imputing missing values in my datasets. I've heard about fancyimpute's MICE, but I also read that sklearn's IterativeImputer class can accomplish similar results. From sklearn's docs: Our implementation of IterativeImputer was inspired by the R MICE package (Multivariate Imputation by Chained Equations) [1], but differs from it by returning a single imputation instead of multiple imputations. However, IterativeImputer can also be used for multiple

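A minimal sketch of IterativeImputer on toy data; the array and parameter values are placeholders, and sample_posterior=True with a varying random_state is the documented way to approximate multiple imputations:

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (required to expose the class)
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [3.0, np.nan], [np.nan, 6.0], [8.0, 4.0]])
imputer = IterativeImputer(max_iter=10, sample_posterior=True, random_state=0)
X_filled = imputer.fit_transform(X)  # a single imputed copy; rerun with other seeds for MICE-style draws
print(X_filled)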

Creating a custom cumulative sum that calculates the downstream quantities given a list of locations and their order

╄→尐↘猪︶ㄣ submitted on 2021-02-08 04:41:42
Question: I am trying to come up with some code that will essentially calculate the cumulative value at the locations below each point. Taking the cumulative sum almost accomplishes this, but some locations contribute to the same downstream point. Additionally, the most upstream points (the starting points) will not have any values contributing to them and can keep their starting value in the final cumulative DataFrame. Let's say I have the following DataFrame for each site.

df = pd.DataFrame({ "Site 1": np

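The excerpt is cut off before the site layout appears, but as a rough sketch of the accumulation idea, assuming a hypothetical topology in which two sites drain to the same downstream point (all names and values below are made up):

# Hypothetical topology: each site drains to exactly one downstream site (None = outlet).
downstream = {"Site 1": "Site 3", "Site 2": "Site 3", "Site 3": "Site 4", "Site 4": None}
values = {"Site 1": 5.0, "Site 2": 2.0, "Site 3": 1.0, "Site 4": 0.5}

# Start every site from its own value, then push each site's value to everything below it.
cumulative = dict(values)
for site, value in values.items():
    node = downstream[site]
    while node is not None:
        cumulative[node] += value
        node = downstream[node]

print(cumulative)  # upstream sites keep their starting value; shared downstream points sum both branches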

How to delete words from a dataframe column that are present in dictionary in Pandas

僤鯓⒐⒋嵵緔 submitted on 2021-02-08 03:45:28
Question: An extension to: Removing list of words from a string. I have the following dataframe and I want to delete frequently occurring words from the df.name column:

df:

name
Bill Hayden
Rock Clinton
Bill Gates
Vishal James
James Cameroon
Micky James
Michael Clark
Tony Waugh
Tom Clark
Tom Bill
Avinash Clinton
Shreyas Clinton
Ramesh Clinton
Adam Clark

I'm creating a new dataframe with the words and their frequencies with the following code:

df = pd.DataFrame(data.name.str.split(expand=True).stack().value_counts())
df

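As a rough sketch of the removal step (not the asker's own code): once the word frequencies are known, one option is to collect the "frequent" words above a chosen cut-off and strip them from every name; the threshold of 2 below is an arbitrary assumption:

import pandas as pd

data = pd.DataFrame({"name": ["Bill Hayden", "Rock Clinton", "Bill Gates", "Vishal James",
                              "James Cameroon", "Micky James", "Michael Clark", "Tony Waugh",
                              "Tom Clark", "Tom Bill", "Avinash Clinton", "Shreyas Clinton",
                              "Ramesh Clinton", "Adam Clark"]})

freq = data["name"].str.split(expand=True).stack().value_counts()
frequent = set(freq[freq > 2].index)  # words appearing more than twice

data["name"] = data["name"].apply(
    lambda s: " ".join(w for w in s.split() if w not in frequent))
print(data)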

how to count categorical values including zero occurrence?

无人久伴 submitted on 2021-02-08 03:38:05
Question: I want to count the number of codes by month. This is my example dataframe.

      id  month code
0  sally      0  s_A
1  sally      0  s_B
2  sally      0  s_C
3  sally      0  s_D
4  sally      0  s_E
5  sally      0  s_A
6  sally      0  s_A
7  sally      0  s_B
8  sally      0  s_C
9  sally      0  s_A

I transformed it to this Series using count().

df.groupby(['id', 'code', 'month']).month.count()

id     code  month  count
sally  s_A   0      12
             1      10
             2      3
             7      15

But I want to include zero occurrences, like this.

id     code  month  count
sally  s_A   0      12
             1      10
             2      3
             3      0
             4      0
             5      0
             6      0
             7      15
             8      0
             9      0
             10     0

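One common way to get the zero rows is to count first and then reindex onto the full (id, code, month) grid; a sketch using the example data, where the 0-11 month range is an assumption:

import pandas as pd

df = pd.DataFrame({
    'id': ['sally'] * 10,
    'month': [0] * 10,
    'code': ['s_A', 's_B', 's_C', 's_D', 's_E', 's_A', 's_A', 's_B', 's_C', 's_A'],
})

counts = df.groupby(['id', 'code', 'month']).size()
full_index = pd.MultiIndex.from_product(
    [df['id'].unique(), df['code'].unique(), range(12)],  # months 0-11 assumed
    names=['id', 'code', 'month'])
counts = counts.reindex(full_index, fill_value=0).rename('count')
print(counts)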

How to filter string in multiple conditions python pandas

断了今生、忘了曾经 submitted on 2021-02-08 03:32:50
Question: I have the following dataframe:

import pandas as pd
data = ['5Star', 'FiveStar', 'five star', 'fiv estar']
data = pd.DataFrame(data, columns=["columnName"])

When I try to filter with one condition it works fine.

data[data['columnName'].str.contains("5")]

Output:

  columnName
0      5Star

But it gives an error when doing it with multiple conditions. How can I filter it for the conditions "five" and "5"?

Expected Output:

  columnName
0      5Star
2  five star

Answer 1: Use str.contains with a string with the values separated by '|':

print

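Following the answer's hint, a minimal sketch of the '|'-separated pattern (the exact regex below is an assumption chosen to match the expected output):

import pandas as pd

data = pd.DataFrame(['5Star', 'FiveStar', 'five star', 'fiv estar'], columns=["columnName"])

# One pattern with alternatives separated by '|' instead of chaining two str.contains calls.
mask = data['columnName'].str.contains('five|5')
print(data[mask])
#   columnName
# 0      5Star
# 2  five star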

How to replace value in specific index in each row with corresponding value in numpy array

心已入冬 submitted on 2021-02-08 03:29:07
Question: My dataframe looks like this:

    datetime1  datetime2  datetime3  datetime4
id
1           5          6          5          5
2           7          2          3          5
3           4          2          3          2
4           6          4          4          7
5           7          3          8          9

and I have a numpy array like this:

index_arr = [3, 2, 0, 1, 2]

This numpy array refers to the column position in each row, respectively, that I want to replace. The values I want to use in the replacement are in another numpy array:

replace_arr = [14, 12, 23, 17, 15]

so that the updated dataframe looks like this:

    datetime1  datetime2  datetime3  datetime4
id
1           5          6          5         14
2           7          2         12          5
3

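A sketch of one way to do this with numpy fancy indexing; the DataFrame construction mirrors the example, while to_numpy() and the final reassignment are assumptions about how the result should be put back:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[5, 6, 5, 5], [7, 2, 3, 5], [4, 2, 3, 2], [6, 4, 4, 7], [7, 3, 8, 9]],
    columns=['datetime1', 'datetime2', 'datetime3', 'datetime4'],
    index=pd.Index([1, 2, 3, 4, 5], name='id'))

index_arr = np.array([3, 2, 0, 1, 2])
replace_arr = np.array([14, 12, 23, 17, 15])

values = df.to_numpy()                                  # copy of the underlying array
values[np.arange(len(df)), index_arr] = replace_arr     # one (row, column) target per row
df = pd.DataFrame(values, index=df.index, columns=df.columns)
print(df)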

Converting Pandas DataFrame to sparse matrix

吃可爱长大的小学妹 submitted on 2021-02-08 02:15:43
Question: Here is my code:

data = pd.get_dummies(data['movie_id']).groupby(data['user_id']).apply(max)
df = pd.DataFrame(data)
replace = df.replace(0, np.NaN)
t = replace.fillna(-1)
sparse = sp.csr_matrix(t.values)

My data consists of two columns, movie_id and user_id.

user_id  movie_id
5        1000
6        1007

I want to convert the data to a sparse matrix. I first created an interaction matrix where rows indicate user_id and columns indicate movie_id, with a positive interaction as +1 and a negative interaction as -1.

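As an alternative sketch that avoids the dense get_dummies step, the (user, movie) pairs can be fed straight into a scipy COO matrix; the example pairs are placeholders, and unobserved cells are simply left empty rather than filled with -1:

import numpy as np
import pandas as pd
from scipy import sparse as sp

# Placeholder interaction log: one row per observed (user, movie) pair.
data = pd.DataFrame({"user_id": [5, 6, 5], "movie_id": [1000, 1007, 1007]})

user_codes, user_index = pd.factorize(data["user_id"])
movie_codes, movie_index = pd.factorize(data["movie_id"])
values = np.ones(len(data))  # +1 for each observed (positive) interaction

interactions = sp.coo_matrix(
    (values, (user_codes, movie_codes)),
    shape=(len(user_index), len(movie_index))).tocsr()
print(interactions.toarray())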
