dataframe

Count including null in PySpark Dataframe Aggregation

╄→尐↘猪︶ㄣ submitted on 2021-02-07 19:44:29
Question: I am trying to get some counts on a DataFrame using agg and count.

```python
from pyspark.sql import Row, functions as F

row = Row("Cat", "Date")
df = (sc.parallelize([
    row("A", '2017-03-03'),
    row('A', None),
    row('B', '2017-03-04'),
    row('B', 'Garbage'),
    row('A', '2016-03-04')
]).toDF())
df = df.withColumn("Casted", df['Date'].cast('date'))
df.show()
(
    df.groupby(df['Cat'])
    .agg(
        # F.count(col('Date').isNull() | col('Date').isNotNull()).alias('Date_Count'),
        F.count('Date').alias('Date_Count'),
        F.count(
```

How to read multiple partitioned .gzip files into a Spark Dataframe?

扶醉桌前 submitted on 2021-02-07 19:41:50
Question: I have the following folder of partitioned data:

```
my_folder
|--part-0000.gzip
|--part-0001.gzip
|--part-0002.gzip
|--part-0003.gzip
```

I try to read this data into a dataframe using:

```
>>> my_df = spark.read.csv("/path/to/my_folder/*")
>>> my_df.show(5)
+--------------------+
|                 _c0|
+--------------------+
|��[I���...|
|��RUu�[*Ք��g��T...|
|�t��� �qd��8~��...|
|�(���b4�:������I�...|
|���!y�)�PC��ќ\�...|
+--------------------+
only showing top 5 rows
```

I also tried using this to check the data:

```
>>> rdd =
```
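The question is truncated before any answer, but one common cause of this symptom is worth noting: Spark selects a decompression codec from the file extension, and `.gzip` is not a suffix it recognizes, while `.gz` is, so the files are read as raw bytes. A hedged sketch of the rename fix, assuming the files really are gzip streams (the temp directory below is a stand-in created only for the demo, not the question's actual path):

```python
import os
import tempfile

# Stand-in folder mimicking the layout in the question.
d = tempfile.mkdtemp()
for name in ("part-0000.gzip", "part-0001.gzip"):
    open(os.path.join(d, name), "wb").close()

# Rename ".gzip" -> ".gz" so Spark's extension-based codec detection
# applies gzip decompression when the files are read.
for name in os.listdir(d):
    if name.endswith(".gzip"):
        src = os.path.join(d, name)
        os.rename(src, src[: -len(".gzip")] + ".gz")

renamed = sorted(os.listdir(d))
# renamed == ['part-0000.gz', 'part-0001.gz']
```

After the rename, `spark.read.csv("/path/to/my_folder/*")` should show readable text rather than raw compressed bytes.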

Convert Julian dates to normal dates in a dataframe?

痞子三分冷 submitted on 2021-02-07 19:18:07
Question: I have a date column in a pandas DF with Julian dates. How can I convert these Julian dates to mm-dd-yyyy format? Sample data:

```
   ORG  CHAIN_NBR  SEQ_NBR INT_STATUS BLOCK_CODE_1  DATA_BLOCK_CODE_1
0  523          1        0          A            C            2012183
1  523          2        1          I            A            2013025
2  521          3        1          A            H            2007067
3  513          4        1          D            H            2001046
4  513          5        1          8            I            2006075
```

I was using the jd2gcal function, but it's not working. I was also trying to write code like this, but to no avail:

```python
for i, row in amna.iterrows():
    amna['DATE_BLOCK_CODE_1'] = datetime.datetime.strptime
```
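The snippet breaks off mid-call, but values such as 2012183 look like year-plus-day-of-year ordinals, which pandas can parse directly with the `%Y%j` format string, no `iterrows` loop needed. A minimal sketch assuming that interpretation (the column name mirrors the question's code; the full `amna` frame is not reproduced here):

```python
import pandas as pd

df = pd.DataFrame({"DATE_BLOCK_CODE_1": [2012183, 2013025, 2007067]})

# %Y = 4-digit year, %j = day of year; vectorized over the whole column
parsed = pd.to_datetime(df["DATE_BLOCK_CODE_1"].astype(str), format="%Y%j")
df["DATE_BLOCK_CODE_1"] = parsed.dt.strftime("%m-%d-%Y")
# df["DATE_BLOCK_CODE_1"].tolist() == ['07-01-2012', '01-25-2013', '03-08-2007']
```

Note this only holds if the column really is YYYYDDD ordinals; astronomical Julian day numbers (what `jd2gcal` expects) are a different encoding entirely.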

How to populate columns of a dataframe using a subset of another dataframe?

邮差的信 submitted on 2021-02-07 15:35:45
Question: I have two dataframes like this:

```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame({
    'key': list('AAABBCCAAC'),
    'prop1': list('xyzuuyxzzz'),
    'prop2': list('mnbnbbnnnn')
})

df2 = pd.DataFrame({
    'key': list('ABBCAA'),
    'prop1': [np.nan] * 6,
    'prop2': [np.nan] * 6,
    'keep_me': ['stuff'] * 6
})
```

```
  key prop1 prop2
0   A     x     m
1   A     y     n
2   A     z     b
3   B     u     n
4   B     u     b
5   C     y     b
6   C     x     n
7   A     z     n
8   A     z     n
9   C     z     n

  key prop1 prop2 keep_me
0   A   NaN   NaN   stuff
1   B   NaN   NaN   stuff
2   B   NaN   NaN   stuff
3   C   NaN   NaN   stuff
4   A   NaN
```
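The question is cut short, but its shape suggests a key-based lookup: fill `df2`'s prop columns from matching rows of `df1` while preserving `keep_me`. A hedged sketch assuming the first `df1` row per key is the one wanted (the intended subset in the full question may differ):

```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'key': list('AAABBCCAAC'),
                    'prop1': list('xyzuuyxzzz'),
                    'prop2': list('mnbnbbnnnn')})
df2 = pd.DataFrame({'key': list('ABBCAA'),
                    'prop1': [np.nan] * 6,
                    'prop2': [np.nan] * 6,
                    'keep_me': ['stuff'] * 6})

# Assumption: keep the first df1 row per key as the lookup value.
lookup = df1.drop_duplicates('key')

# Drop the all-NaN placeholder columns, then left-join on key.
out = df2.drop(columns=['prop1', 'prop2']).merge(lookup, on='key', how='left')
# out['prop1'].tolist() == ['x', 'u', 'u', 'y', 'x', 'x']
# out['prop2'].tolist() == ['m', 'n', 'n', 'b', 'm', 'm']
```

If instead the last match (or some filtered subset) is wanted, swap `drop_duplicates('key')` for `drop_duplicates('key', keep='last')` or any other selection over `df1`.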

Apply function to pandas dataframe row using values in other rows

送分小仙女□ submitted on 2021-02-07 14:53:55
Question: I have a situation where I have a dataframe row to perform calculations with, and I need to use values in following (potentially preceding) rows to do these calculations (essentially a perfect forecast based on the real data set). I get each row from an earlier df.apply call, so I could pass the whole df along to the downstream objects, but that seems less than ideal given the complexity of the objects in my analysis. I found one closely related question and answer [1], but the problem is
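The excerpt ends before the linked answer, but for this "perfect forecast" pattern a common alternative to passing the whole frame around is to shift the needed columns so each row already carries its neighbours' values before `apply` runs. A minimal sketch assuming a one-row look-ahead (the column names here are invented for illustration, not from the original post):

```python
import pandas as pd

df = pd.DataFrame({"load": [10.0, 12.0, 15.0, 11.0]})

# shift(-1) pulls the next row's value up into the current row, so a
# plain row-wise function can see the "future" without indexing the frame
df["next_load"] = df["load"].shift(-1)

df["ramp"] = df.apply(lambda r: r["next_load"] - r["load"], axis=1)
# ramp is [2.0, 3.0, -4.0, NaN] -- the last row has no successor
```

`shift(1)` gives preceding rows the same way; for longer horizons, add one shifted column per offset needed.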