dataframe

In DataFrame.withColumn, how can I check if the column's value is null as a condition for the second parameter?

六眼飞鱼酱① submitted on 2021-02-08 04:59:26
Question: If I have a DataFrame called df that looks like:

+----+----+
|  a1|  a2|
+----+----+
| foo| bar|
| N/A| baz|
|null| etc|
+----+----+

I can selectively replace values like so:

val df2 = df.withColumn("a1", when($"a1" === "N/A", $"a2"))

so that df2 looks like:

+----+----+
|  a1|  a2|
+----+----+
| foo| bar|
| baz| baz|
|null| etc|
+----+----+

but why can't I check whether it is null, like:

val df3 = df2.withColumn("a1", when($"a1" === null, $"a2"))

so that I get:

+----+----+
|  a1|  a2|
+----+----+
| foo|

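For reference, a minimal PySpark sketch of the null check the question is after (in Scala the equivalent test is $"a1".isNull); the toy data and the .otherwise clause are assumptions added so that non-null rows keep their original value:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.getOrCreate()
df2 = spark.createDataFrame([("foo", "bar"), ("baz", "baz"), (None, "etc")], ["a1", "a2"])

# isNull() is the supported way to test for null; comparing with === null never matches.
df3 = df2.withColumn("a1", when(col("a1").isNull(), col("a2")).otherwise(col("a1")))
df3.show()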

Imputing missing values using sklearn IterativeImputer class for MICE

…衆ロ難τιáo~ submitted on 2021-02-08 04:57:29
Question: I'm trying to learn how to implement MICE for imputing missing values in my datasets. I've heard about fancyimpute's MICE, but I also read that sklearn's IterativeImputer class can accomplish similar results. From sklearn's docs: Our implementation of IterativeImputer was inspired by the R MICE package (Multivariate Imputation by Chained Equations) [1], but differs from it by returning a single imputation instead of multiple imputations. However, IterativeImputer can also be used for multiple

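A minimal sketch of IterativeImputer on toy data; the array and parameter values are placeholders, and sample_posterior=True with a varying random_state is the documented way to approximate multiple imputations:

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (required to expose the class)
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [3.0, np.nan], [np.nan, 6.0], [8.0, 4.0]])
imputer = IterativeImputer(max_iter=10, sample_posterior=True, random_state=0)
X_filled = imputer.fit_transform(X)  # a single imputed copy; rerun with other seeds for MICE-style draws
print(X_filled)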

Creating a custom cumulative sum that calculates the downstream quantities given a list of locations and their order

╄→尐↘猪︶ㄣ submitted on 2021-02-08 04:41:42
Question: I am trying to come up with some code that will essentially calculate the cumulative value at the locations below each point. Taking the cumulative sum almost accomplishes this, but some locations contribute to the same downstream point. Additionally, the most upstream points (the starting points) will not have any values contributing to them and can keep their starting value in the final cumulative DataFrame. Let's say I have the following DataFrame for each site.

df = pd.DataFrame({ "Site 1": np

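The excerpt is cut off before the site layout appears, but as a rough sketch of the accumulation idea, assuming a hypothetical topology in which two sites drain to the same downstream point (all names and values below are made up):

# Hypothetical topology: each site drains to exactly one downstream site (None = outlet).
downstream = {"Site 1": "Site 3", "Site 2": "Site 3", "Site 3": "Site 4", "Site 4": None}
values = {"Site 1": 5.0, "Site 2": 2.0, "Site 3": 1.0, "Site 4": 0.5}

# Start every site from its own value, then push each site's value to everything below it.
cumulative = dict(values)
for site, value in values.items():
    node = downstream[site]
    while node is not None:
        cumulative[node] += value
        node = downstream[node]

print(cumulative)  # upstream sites keep their starting value; shared downstream points sum both branches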

How to delete words from a dataframe column that are present in dictionary in Pandas

僤鯓⒐⒋嵵緔 submitted on 2021-02-08 03:45:28
Question: An extension to: Removing list of words from a string. I have the following dataframe and I want to delete frequently occurring words from the df.name column:

df:

name
Bill Hayden
Rock Clinton
Bill Gates
Vishal James
James Cameroon
Micky James
Michael Clark
Tony Waugh
Tom Clark
Tom Bill
Avinash Clinton
Shreyas Clinton
Ramesh Clinton
Adam Clark

I'm creating a new dataframe with the words and their frequencies with the following code:

df = pd.DataFrame(data.name.str.split(expand=True).stack().value_counts())
df

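As a rough sketch of the removal step (not the asker's own code): once the word frequencies are known, one option is to collect the "frequent" words above a chosen cut-off and strip them from every name; the threshold of 2 below is an arbitrary assumption:

import pandas as pd

data = pd.DataFrame({"name": ["Bill Hayden", "Rock Clinton", "Bill Gates", "Vishal James",
                              "James Cameroon", "Micky James", "Michael Clark", "Tony Waugh",
                              "Tom Clark", "Tom Bill", "Avinash Clinton", "Shreyas Clinton",
                              "Ramesh Clinton", "Adam Clark"]})

freq = data["name"].str.split(expand=True).stack().value_counts()
frequent = set(freq[freq > 2].index)  # words appearing more than twice

data["name"] = data["name"].apply(
    lambda s: " ".join(w for w in s.split() if w not in frequent))
print(data)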

how to count categorical values including zero occurrence?

无人久伴 submitted on 2021-02-08 03:38:05
Question: I want to count the number of codes by month. This is my example dataframe.

      id  month code
0  sally      0  s_A
1  sally      0  s_B
2  sally      0  s_C
3  sally      0  s_D
4  sally      0  s_E
5  sally      0  s_A
6  sally      0  s_A
7  sally      0  s_B
8  sally      0  s_C
9  sally      0  s_A

I transformed it to this Series using count().

df.groupby(['id', 'code', 'month']).month.count()

id     code  month  count
sally  s_A   0      12
             1      10
             2      3
             7      15

But I want to include zero occurrences, like this.

id     code  month  count
sally  s_A   0      12
             1      10
             2      3
             3      0
             4      0
             5      0
             6      0
             7      15
             8      0
             9      0
             10     0

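One common way to get the zero rows is to count first and then reindex onto the full (id, code, month) grid; a sketch using the example data, where the 0-11 month range is an assumption:

import pandas as pd

df = pd.DataFrame({
    'id': ['sally'] * 10,
    'month': [0] * 10,
    'code': ['s_A', 's_B', 's_C', 's_D', 's_E', 's_A', 's_A', 's_B', 's_C', 's_A'],
})

counts = df.groupby(['id', 'code', 'month']).size()
full_index = pd.MultiIndex.from_product(
    [df['id'].unique(), df['code'].unique(), range(12)],  # months 0-11 assumed
    names=['id', 'code', 'month'])
counts = counts.reindex(full_index, fill_value=0).rename('count')
print(counts)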

How to filter string in multiple conditions python pandas

断了今生、忘了曾经 submitted on 2021-02-08 03:32:50
Question: I have the following dataframe:

import pandas as pd
data = ['5Star', 'FiveStar', 'five star', 'fiv estar']
data = pd.DataFrame(data, columns=["columnName"])

When I try to filter with one condition it works fine.

data[data['columnName'].str.contains("5")]

Output:

  columnName
0      5Star

But it gives an error when doing it with multiple conditions. How can I filter it for the conditions "five" and "5"?

Expected Output:

  columnName
0      5Star
2  five star

Answer 1: Use str.contains with a string with the values separated by '|':

print

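Following the answer's hint, a minimal sketch of the '|'-separated pattern (the exact regex below is an assumption chosen to match the expected output):

import pandas as pd

data = pd.DataFrame(['5Star', 'FiveStar', 'five star', 'fiv estar'], columns=["columnName"])

# One pattern with alternatives separated by '|' instead of chaining two str.contains calls.
mask = data['columnName'].str.contains('five|5')
print(data[mask])
#   columnName
# 0      5Star
# 2  five star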

How to replace value in specific index in each row with corresponding value in numpy array

心已入冬 submitted on 2021-02-08 03:29:07
Question: My dataframe looks like this:

    datetime1  datetime2  datetime3  datetime4
id
1           5          6          5          5
2           7          2          3          5
3           4          2          3          2
4           6          4          4          7
5           7          3          8          9

and I have a numpy array like this:

index_arr = [3, 2, 0, 1, 2]

This numpy array refers to the column position in each row, respectively, that I want to replace. The values I want to use in the replacement are in another numpy array:

replace_arr = [14, 12, 23, 17, 15]

so that the updated dataframe looks like this:

    datetime1  datetime2  datetime3  datetime4
id
1           5          6          5         14
2           7          2         12          5
3

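A sketch of one way to do this with numpy fancy indexing; the DataFrame construction mirrors the example, while to_numpy() and the final reassignment are assumptions about how the result should be put back:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[5, 6, 5, 5], [7, 2, 3, 5], [4, 2, 3, 2], [6, 4, 4, 7], [7, 3, 8, 9]],
    columns=['datetime1', 'datetime2', 'datetime3', 'datetime4'],
    index=pd.Index([1, 2, 3, 4, 5], name='id'))

index_arr = np.array([3, 2, 0, 1, 2])
replace_arr = np.array([14, 12, 23, 17, 15])

values = df.to_numpy()                                  # copy of the underlying array
values[np.arange(len(df)), index_arr] = replace_arr     # one (row, column) target per row
df = pd.DataFrame(values, index=df.index, columns=df.columns)
print(df)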

Converting Pandas DataFrame to sparse matrix

吃可爱长大的小学妹 submitted on 2021-02-08 02:15:43
Question: Here is my code:

data = pd.get_dummies(data['movie_id']).groupby(data['user_id']).apply(max)
df = pd.DataFrame(data)
replace = df.replace(0, np.NaN)
t = replace.fillna(-1)
sparse = sp.csr_matrix(t.values)

My data consists of two columns, movie_id and user_id.

user_id  movie_id
5        1000
6        1007

I want to convert the data to a sparse matrix. I first created an interaction matrix where rows indicate user_id and columns indicate movie_id, with a positive interaction as +1 and a negative interaction as -1.

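As an alternative sketch that avoids the dense get_dummies step, the (user, movie) pairs can be fed straight into a scipy COO matrix; the example pairs are placeholders, and unobserved cells are simply left empty rather than filled with -1:

import numpy as np
import pandas as pd
from scipy import sparse as sp

# Placeholder interaction log: one row per observed (user, movie) pair.
data = pd.DataFrame({"user_id": [5, 6, 5], "movie_id": [1000, 1007, 1007]})

user_codes, user_index = pd.factorize(data["user_id"])
movie_codes, movie_index = pd.factorize(data["movie_id"])
values = np.ones(len(data))  # +1 for each observed (positive) interaction

interactions = sp.coo_matrix(
    (values, (user_codes, movie_codes)),
    shape=(len(user_index), len(movie_index))).tocsr()
print(interactions.toarray())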
