spark-dataframe

Setting column equal to value depending on another column pandas

三世轮回 submitted on 2019-12-08 06:03:45
Question: I am stuck on how to set the value of the num column in each row based on the solvent column in the data frame, e.g. num should equal 9 when solvent is Nonane, 8 when solvent is Octane, and so on. Any help would be great.

Answer 1: Use .loc with a boolean mask:

    df.loc[df['solvent'] == 'NONANE', 'num'] = 9
    df.loc[df['solvent'] == 'OCTANE', 'num'] = 8

Another method is to define a dict and call map:

    d = {'NONANE': 9, 'OCTANE': 8, 'HEPTANE': 7, 'HEXANE': 6}
    df['num'] = df['solvent'].map(d)

Source: https://stackoverflow.com/questions/34158320/setting-column-equal-to-value-depending-on-another
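For reference, a minimal runnable pandas sketch of both answers, using a small made-up frame (the solvent names and num values are invented for illustration):

    import pandas as pd

    # Small illustrative frame; the real data comes from the asker's file.
    df = pd.DataFrame({'solvent': ['NONANE', 'OCTANE', 'HEPTANE', 'HEXANE']})

    # Approach 1: boolean mask with .loc, one condition per solvent
    df.loc[df['solvent'] == 'NONANE', 'num'] = 9
    df.loc[df['solvent'] == 'OCTANE', 'num'] = 8

    # Approach 2: dict lookup with map, which fills every row in one pass
    d = {'NONANE': 9, 'OCTANE': 8, 'HEPTANE': 7, 'HEXANE': 6}
    df['num'] = df['solvent'].map(d)

    print(df)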

Spark XML parsing

只谈情不闲聊 submitted on 2019-12-08 05:50:49
Question: I'm trying to parse a large XML file using com.databricks.spark.xml:

    Dataset<Row> df = spark.read().format("com.databricks.spark.xml")
        .option("rowTag", "row").load("../1000.xml");
    df.show(10);

The output I get is an empty table:

    ++
    ||
    ++
    ++

Am I missing something? This is my sample XML row:

    <row Id="7" PostTypeId="2" ParentId="4" CreationDate="2008-07-31T22:17:57.883" Score="316" Body="<p>An explicit cast to double isn't necessary.</p> <pre><code>double trans = (double)trackBar1.Value / 5000.0; <
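The question is cut off above, but for rows like this one, where all the data lives in XML attributes, spark-xml exposes each attribute as a column named with the attributePrefix (an underscore by default), so the parsed frame should contain _Id, _PostTypeId, and so on. A PySpark sketch of the same read, assuming the spark-xml package is available (the version shown is only an example):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.14.0")
             .getOrCreate())

    df = (spark.read.format("com.databricks.spark.xml")
          .option("rowTag", "row")            # same row tag as in the question
          .option("attributePrefix", "_")     # the default; attributes become _Id, _Score, ...
          .load("../1000.xml"))               # path taken from the question

    df.printSchema()
    df.show(10, truncate=False)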

Scala spark - Dealing with Hierarchy data tables

旧时模样 submitted on 2019-12-08 05:02:58
Question: I have a data table with a hierarchical data model (tree structure). Here are some sample rows:

    -------------------------------------------
    Id  | name     | parentId | path       | depth
    -------------------------------------------
    55  | Canada   | null     | null       | 0
    77  | Ontario  | 55       | /55        | 1
    100 | Toronto  | 77       | /55/77     | 2
    104 | Brampton | 100      | /55/77/100 | 3

I am looking to convert those rows into a flattened version; a sample output would be:

    -----------------------------------
    Id | name | parentId |
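The expected output is cut off above, so the exact target layout is unknown; a common way to flatten such a table, though, is to split the path column into one ancestor-id column per level. A PySpark sketch under that assumption (the level_N column names are invented for illustration):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    rows = [(55, "Canada", None, None, 0),
            (77, "Ontario", 55, "/55", 1),
            (100, "Toronto", 77, "/55/77", 2),
            (104, "Brampton", 100, "/55/77/100", 3)]
    df = spark.createDataFrame(rows, "Id INT, name STRING, parentId INT, path STRING, depth INT")

    # Append the row's own Id to its path ("/55/77" -> "/55/77/100") and split on "/".
    parts = F.split(
        F.concat_ws("/", F.coalesce(F.col("path"), F.lit("")), F.col("Id").cast("string")), "/")

    # One column per level of the tree; levels the row does not reach stay null.
    max_depth = df.agg(F.max("depth")).first()[0]
    flattened = df
    for level in range(max_depth + 1):
        flattened = flattened.withColumn(
            f"level_{level}",
            F.when(F.size(parts) > level + 1, parts[level + 1]))

    flattened.show(truncate=False)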

Spark Dataframe except method Issue

放肆的年华 submitted on 2019-12-08 02:25:39
Question: I have a use case to subtract two dataframes, so I have used the dataframe except() method. This works fine locally on a smaller set of data, but when I run it over an AWS S3 bucket the except() method does not perform the subtraction as expected. Is there anything that needs to be taken care of in a distributed environment? Has anyone faced a similar issue? Here is my sample code:

    val values = List(List("One", "2017-07-01T23:59:59.000", "2017-11-04T23:59:58.000", "A", "Yes")
      , List("Two", "2017-07-01T23:59:59
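The code sample is cut off above, but it is worth noting that except() is a distinct set difference computed over every column, so two rows that differ in any field, for example by a few milliseconds in a timestamp string, are treated as different and are not removed. A small PySpark sketch of the two variants, with made-up column names and data:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    cols = ["id", "start", "end", "code", "flag"]
    df1 = spark.createDataFrame(
        [("One", "2017-07-01T23:59:59.000", "2017-11-04T23:59:58.000", "A", "Yes"),
         ("Two", "2017-07-01T23:59:59.000", "2017-11-04T23:59:58.000", "B", "No")], cols)
    df2 = spark.createDataFrame(
        [("Two", "2017-07-01T23:59:59.000", "2017-11-04T23:59:58.000", "B", "No")], cols)

    # subtract() is PySpark's counterpart of Scala's except(): a distinct set difference.
    df1.subtract(df2).show(truncate=False)

    # exceptAll() (Spark 2.4+) keeps duplicate rows instead of de-duplicating first.
    df1.exceptAll(df2).show(truncate=False)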

How to check in Python if cell value of pyspark dataframe column in UDF function is none or NaN for implementing forward fill?

喜欢而已 submitted on 2019-12-08 02:12:10
Question: I am basically trying to do a forward fill imputation. Below is the code for that:

    df = spark.createDataFrame([(1,1, None), (1,2, 5), (1,3, None), (1,4, None), (1,5, 10), (1,6, None)],
                               ('session', "timestamp", "id"))

    PRV_RANK = 0.0

    def fun(rank):
        ######## How to check if None or NaN? ###############
        if rank is None or rank is NaN:
            return PRV_RANK
        else:
            PRV_RANK = rank
            return rank

    fuN = F.udf(fun, IntegerType())

    df.withColumn("ffill_new", fuN(df["id"])).show()

I am getting a weird error in the log.
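Two things are worth noting, whatever the truncated log says: a plain None/NaN test in Python is `rank is None or math.isnan(rank)`, and a UDF cannot carry state like PRV_RANK between rows, because rows are processed in parallel on different executors. The usual forward-fill pattern swaps the UDF for a window with last(..., ignorenulls=True); a sketch on the question's own data:

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [(1, 1, None), (1, 2, 5), (1, 3, None), (1, 4, None), (1, 5, 10), (1, 6, None)],
        ("session", "timestamp", "id"))

    # Per session, ordered by timestamp, look back over all earlier rows.
    w = (Window.partitionBy("session")
               .orderBy("timestamp")
               .rowsBetween(Window.unboundedPreceding, Window.currentRow))

    # last(..., ignorenulls=True) carries the most recent non-null id forward;
    # leading nulls (before the first observed value) stay null.
    df.withColumn("ffill_new", F.last("id", ignorenulls=True).over(w)).show()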

AttributeError: module 'pandas' has no attribute 'to_csv'

耗尽温柔 submitted on 2019-12-08 01:49:06
Question: I took some rows from a csv file like this:

    pd.DataFrame(CV_data.take(5), columns=CV_data.columns)

and performed some functions on it. Now I want to save it to csv again, but it gives the error "module 'pandas' has no attribute 'to_csv'". I am trying to save it like this:

    pd.to_csv(CV_data, sep='\t', encoding='utf-8')

Here is my full code. How can I save my resulting data in csv or excel?

    # Disable warnings, set Matplotlib inline plotting and load Pandas package
    import warnings
    warnings
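The full code is cut off above, but the error itself comes from calling to_csv on the pandas module: to_csv is a method on a DataFrame instance, so it should be called on the frame built from CV_data.take(5). A minimal sketch, with an illustrative frame and hypothetical output paths standing in for the asker's data:

    import pandas as pd

    # Stand-in for pd.DataFrame(CV_data.take(5), columns=CV_data.columns)
    pdf = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})

    # to_csv is a method on the DataFrame instance, not a function on the pandas module.
    pdf.to_csv('output.tsv', sep='\t', encoding='utf-8', index=False)   # hypothetical path

    # For Excel output (needs openpyxl or xlsxwriter installed):
    pdf.to_excel('output.xlsx', index=False)                            # hypothetical path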

Spark: Applying UDF to Dataframe Generating new Columns based on Values in DF

帅比萌擦擦* submitted on 2019-12-07 23:52:49
Question: I am having problems transposing values in a DataFrame in Scala. My initial DataFrame looks like this:

    +----+----+----+----+
    |col1|col2|col3|col4|
    +----+----+----+----+
    |   A|   X|   6|null|
    |   B|   Z|null|   5|
    |   C|   Y|   4|null|
    +----+----+----+----+

col1 and col2 are type String and col3 and col4 are Int. And the result should look like this:

    +----+----+----+----+------+------+------+
    |col1|col2|col3|col4|AXcol3|BZcol4|CYcol4|
    +----+----+----+----+------+------+------+
    |   A|   X|   6|null|     6|  null|  null|
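The rest of the expected output is cut off above. The question is in Scala, but here is a PySpark sketch of one way to build such value-derived columns: collect the distinct (col1, col2) combinations on the driver, then add one conditional column per combination. The same pattern translates directly to Scala; it assumes the set of combinations is small enough to collect:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [("A", "X", 6, None), ("B", "Z", None, 5), ("C", "Y", 4, None)],
        "col1 string, col2 string, col3 int, col4 int")

    # Collect the distinct (col1, col2, value-column) combinations on the driver ...
    combos = []
    for row in df.collect():
        value_col = "col3" if row["col3"] is not None else "col4"
        combos.append((row["col1"], row["col2"], value_col))

    # ... then add one conditional column per combination (null where it does not apply).
    result = df
    for c1, c2, value_col in combos:
        result = result.withColumn(
            f"{c1}{c2}{value_col}",
            F.when((F.col("col1") == c1) & (F.col("col2") == c2), F.col(value_col)))

    result.show()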

PySpark - Convert to JSON row by row

泄露秘密 submitted on 2019-12-07 17:17:05
Question: I have a very large pyspark data frame. I need to convert the dataframe into a JSON formatted string for each row, then publish the string to a Kafka topic. I originally used the following code:

    for message in df.toJSON().collect():
        kafkaClient.send(message)

However, the dataframe is very large, so it fails when trying to collect(). I was thinking of using a UDF since it processes it row by row:

    from pyspark.sql.functions import udf, struct

    def get_row(row):
        json = row.toJSON()
        kafkaClient.send
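The UDF attempt is cut off above. A common way to avoid collect() entirely is to build the JSON string on the executors with to_json(struct(...)) and then either write through Spark's Kafka sink or send per partition; a sketch, with a tiny stand-in frame and placeholder broker/topic names:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])   # stand-in for the large frame

    # One JSON string per row, built on the executors rather than the driver.
    json_df = df.select(F.to_json(F.struct(*df.columns)).alias("value"))

    # Option 1: Spark's Kafka sink (needs the spark-sql-kafka-0-10 package on the classpath).
    (json_df.write
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder
            .option("topic", "my_topic")                         # placeholder
            .save())

    # Option 2: create a producer per partition and send row by row.
    def send_partition(rows):
        from kafka import KafkaProducer   # assumes the kafka-python package is installed
        producer = KafkaProducer(bootstrap_servers="broker1:9092")
        for row in rows:
            producer.send("my_topic", row.value.encode("utf-8"))
        producer.flush()

    json_df.foreachPartition(send_partition)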

SPARK read.json throwing java.io.IOException: Too many bytes before newline

Deadly submitted on 2019-12-07 14:42:43
Question: I am getting the following error when reading a large 6 GB single-line json file:

    Job aborted due to stage failure: Task 5 in stage 0.0 failed 1 times, most recent failure: Lost task 5.0 in stage 0.0 (TID 5, localhost): java.io.IOException: Too many bytes before newline: 2147483648

Spark does not read json records that span multiple lines, hence the entire 6 GB json file is on a single line:

    jf = sqlContext.read.json("jlrn2.json")

Configuration:

    spark.driver.memory 20g

Answer 1: Yep, you have more than Integer.MAX
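The answer is cut off above; its point is that 2147483648 bytes is one more than Integer.MAX_VALUE, the largest single text line the line reader can handle. Since Spark 2.2 the JSON reader has a multiLine option that parses a record spanning many lines (or the whole file), so the data no longer has to be squeezed onto one line; note that a multiLine file is read by a single task, so a 6 GB file still needs enough memory on that task. A sketch:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # multiLine=True (Spark 2.2+) lets one JSON record span the entire file.
    df = spark.read.option("multiLine", True).json("jlrn2.json")
    df.printSchema()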