spark-dataframe

Setting column equal to value depending on another column pandas

三世轮回 submitted on 2019-12-08 06:03:45
Question: I am stuck on how to set the value of the num column in each row based on the solvent column in the data frame, e.g. num should equal 9 when solvent is Nonane, 8 when solvent is Octane, and so on. Any help would be great.

Answer 1: Use .loc with a boolean mask:

    df.loc[df['solvent'] == 'NONANE', 'num'] = 9
    df.loc[df['solvent'] == 'OCTANE', 'num'] = 8

Another method is to define a dict and call map:

    d = {'NONANE': 9, 'OCTANE': 8, 'HEPTANE': 7, 'HEXANE': 6}
    df['num'] = df['solvent'].map(d)

Source: https://stackoverflow.com/questions/34158320/setting-column-equal-to-value-depending-on-another
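For reference, a minimal runnable pandas sketch of both answers, using a small made-up frame (the solvent names and num values are invented for illustration):

    import pandas as pd

    # Small illustrative frame; the real data comes from the asker's file.
    df = pd.DataFrame({'solvent': ['NONANE', 'OCTANE', 'HEPTANE', 'HEXANE']})

    # Approach 1: boolean mask with .loc, one condition per solvent
    df.loc[df['solvent'] == 'NONANE', 'num'] = 9
    df.loc[df['solvent'] == 'OCTANE', 'num'] = 8

    # Approach 2: dict lookup with map, which fills every row in one pass
    d = {'NONANE': 9, 'OCTANE': 8, 'HEPTANE': 7, 'HEXANE': 6}
    df['num'] = df['solvent'].map(d)

    print(df)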

Spark XML parsing

只谈情不闲聊 submitted on 2019-12-08 05:50:49
Question: I'm trying to parse a large XML file using com.databricks.spark.xml:

    Dataset<Row> df = spark.read().format("com.databricks.spark.xml")
        .option("rowTag", "row").load("../1000.xml");
    df.show(10);

The output I get is an empty table:

    ++
    ||
    ++
    ++

Am I missing something? This is my sample XML row:

    <row Id="7" PostTypeId="2" ParentId="4" CreationDate="2008-07-31T22:17:57.883" Score="316" Body="<p>An explicit cast to double isn't necessary.</p> <pre><code>double trans = (double)trackBar1.Value / 5000.0; <
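The question is cut off above, but for rows like this one, where all the data lives in XML attributes, spark-xml exposes each attribute as a column named with the attributePrefix (an underscore by default), so the parsed frame should contain _Id, _PostTypeId, and so on. A PySpark sketch of the same read, assuming the spark-xml package is available (the version shown is only an example):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.14.0")
             .getOrCreate())

    df = (spark.read.format("com.databricks.spark.xml")
          .option("rowTag", "row")            # same row tag as in the question
          .option("attributePrefix", "_")     # the default; attributes become _Id, _Score, ...
          .load("../1000.xml"))               # path taken from the question

    df.printSchema()
    df.show(10, truncate=False)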

Scala spark - Dealing with Hierarchy data tables

旧时模样 submitted on 2019-12-08 05:02:58
Question: I have a data table with a hierarchical data model (tree structure). Here are some sample rows:

    -------------------------------------------
    Id  | name     | parentId | path       | depth
    -------------------------------------------
    55  | Canada   | null     | null       | 0
    77  | Ontario  | 55       | /55        | 1
    100 | Toronto  | 77       | /55/77     | 2
    104 | Brampton | 100      | /55/77/100 | 3

I am looking to convert those rows into a flattened version; a sample output would be:

    -----------------------------------
    Id | name | parentId |
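The expected output is cut off above, so the exact target layout is unknown; a common way to flatten such a table, though, is to split the path column into one ancestor-id column per level. A PySpark sketch under that assumption (the level_N column names are invented for illustration):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    rows = [(55, "Canada", None, None, 0),
            (77, "Ontario", 55, "/55", 1),
            (100, "Toronto", 77, "/55/77", 2),
            (104, "Brampton", 100, "/55/77/100", 3)]
    df = spark.createDataFrame(rows, "Id INT, name STRING, parentId INT, path STRING, depth INT")

    # Append the row's own Id to its path ("/55/77" -> "/55/77/100") and split on "/".
    parts = F.split(
        F.concat_ws("/", F.coalesce(F.col("path"), F.lit("")), F.col("Id").cast("string")), "/")

    # One column per level of the tree; levels the row does not reach stay null.
    max_depth = df.agg(F.max("depth")).first()[0]
    flattened = df
    for level in range(max_depth + 1):
        flattened = flattened.withColumn(
            f"level_{level}",
            F.when(F.size(parts) > level + 1, parts[level + 1]))

    flattened.show(truncate=False)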

Spark Dataframe except method Issue

放肆的年华 submitted on 2019-12-08 02:25:39
Question: I have a use case to subtract two dataframes, so I have used the dataframe except() method. This works fine locally on a smaller set of data, but when I run it over an AWS S3 bucket the except() method does not perform the subtraction as expected. Is there anything that needs to be taken care of in a distributed environment? Has anyone faced a similar issue? Here is my sample code:

    val values = List(List("One", "2017-07-01T23:59:59.000", "2017-11-04T23:59:58.000", "A", "Yes")
      , List("Two", "2017-07-01T23:59:59
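The code sample is cut off above, but it is worth noting that except() is a distinct set difference computed over every column, so two rows that differ in any field, for example by a few milliseconds in a timestamp string, are treated as different and are not removed. A small PySpark sketch of the two variants, with made-up column names and data:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    cols = ["id", "start", "end", "code", "flag"]
    df1 = spark.createDataFrame(
        [("One", "2017-07-01T23:59:59.000", "2017-11-04T23:59:58.000", "A", "Yes"),
         ("Two", "2017-07-01T23:59:59.000", "2017-11-04T23:59:58.000", "B", "No")], cols)
    df2 = spark.createDataFrame(
        [("Two", "2017-07-01T23:59:59.000", "2017-11-04T23:59:58.000", "B", "No")], cols)

    # subtract() is PySpark's counterpart of Scala's except(): a distinct set difference.
    df1.subtract(df2).show(truncate=False)

    # exceptAll() (Spark 2.4+) keeps duplicate rows instead of de-duplicating first.
    df1.exceptAll(df2).show(truncate=False)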

How to check in Python if cell value of pyspark dataframe column in UDF function is none or NaN for implementing forward fill?

喜欢而已 submitted on 2019-12-08 02:12:10
Question: I am basically trying to do a forward fill imputation. Below is the code for that:

    df = spark.createDataFrame([(1,1, None), (1,2, 5), (1,3, None), (1,4, None), (1,5, 10), (1,6, None)],
                               ('session', "timestamp", "id"))

    PRV_RANK = 0.0

    def fun(rank):
        ######## How to check if None or NaN? ###############
        if rank is None or rank is NaN:
            return PRV_RANK
        else:
            PRV_RANK = rank
            return rank

    fuN = F.udf(fun, IntegerType())

    df.withColumn("ffill_new", fuN(df["id"])).show()

I am getting a weird error in the log.
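Two things are worth noting, whatever the truncated log says: a plain None/NaN test in Python is `rank is None or math.isnan(rank)`, and a UDF cannot carry state like PRV_RANK between rows, because rows are processed in parallel on different executors. The usual forward-fill pattern swaps the UDF for a window with last(..., ignorenulls=True); a sketch on the question's own data:

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [(1, 1, None), (1, 2, 5), (1, 3, None), (1, 4, None), (1, 5, 10), (1, 6, None)],
        ("session", "timestamp", "id"))

    # Per session, ordered by timestamp, look back over all earlier rows.
    w = (Window.partitionBy("session")
               .orderBy("timestamp")
               .rowsBetween(Window.unboundedPreceding, Window.currentRow))

    # last(..., ignorenulls=True) carries the most recent non-null id forward;
    # leading nulls (before the first observed value) stay null.
    df.withColumn("ffill_new", F.last("id", ignorenulls=True).over(w)).show()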

AttributeError: module 'pandas' has no attribute 'to_csv'

耗尽温柔 submitted on 2019-12-08 01:49:06
Question: I took some rows from a csv file like this:

    pd.DataFrame(CV_data.take(5), columns=CV_data.columns)

and performed some functions on it. Now I want to save it to csv again, but it gives the error "module 'pandas' has no attribute 'to_csv'". I am trying to save it like this:

    pd.to_csv(CV_data, sep='\t', encoding='utf-8')

Here is my full code. How can I save my resulting data in csv or excel?

    # Disable warnings, set Matplotlib inline plotting and load Pandas package
    import warnings
    warnings
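The full code is cut off above, but the error itself comes from calling to_csv on the pandas module: to_csv is a method on a DataFrame instance, so it should be called on the frame built from CV_data.take(5). A minimal sketch, with an illustrative frame and hypothetical output paths standing in for the asker's data:

    import pandas as pd

    # Stand-in for pd.DataFrame(CV_data.take(5), columns=CV_data.columns)
    pdf = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})

    # to_csv is a method on the DataFrame instance, not a function on the pandas module.
    pdf.to_csv('output.tsv', sep='\t', encoding='utf-8', index=False)   # hypothetical path

    # For Excel output (needs openpyxl or xlsxwriter installed):
    pdf.to_excel('output.xlsx', index=False)                            # hypothetical path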

Spark: Applying UDF to Dataframe Generating new Columns based on Values in DF

帅比萌擦擦* submitted on 2019-12-07 23:52:49
Question: I am having problems transposing values in a DataFrame in Scala. My initial DataFrame looks like this:

    +----+----+----+----+
    |col1|col2|col3|col4|
    +----+----+----+----+
    |   A|   X|   6|null|
    |   B|   Z|null|   5|
    |   C|   Y|   4|null|
    +----+----+----+----+

col1 and col2 are type String and col3 and col4 are Int. And the result should look like this:

    +----+----+----+----+------+------+------+
    |col1|col2|col3|col4|AXcol3|BZcol4|CYcol4|
    +----+----+----+----+------+------+------+
    |   A|   X|   6|null|     6|  null|  null|
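The rest of the expected output is cut off above. The question is in Scala, but here is a PySpark sketch of one way to build such value-derived columns: collect the distinct (col1, col2) combinations on the driver, then add one conditional column per combination. The same pattern translates directly to Scala; it assumes the set of combinations is small enough to collect:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [("A", "X", 6, None), ("B", "Z", None, 5), ("C", "Y", 4, None)],
        "col1 string, col2 string, col3 int, col4 int")

    # Collect the distinct (col1, col2, value-column) combinations on the driver ...
    combos = []
    for row in df.collect():
        value_col = "col3" if row["col3"] is not None else "col4"
        combos.append((row["col1"], row["col2"], value_col))

    # ... then add one conditional column per combination (null where it does not apply).
    result = df
    for c1, c2, value_col in combos:
        result = result.withColumn(
            f"{c1}{c2}{value_col}",
            F.when((F.col("col1") == c1) & (F.col("col2") == c2), F.col(value_col)))

    result.show()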

PySpark - Convert to JSON row by row

泄露秘密 submitted on 2019-12-07 17:17:05
Question: I have a very large pyspark data frame. I need to convert the dataframe into a JSON formatted string for each row, then publish the string to a Kafka topic. I originally used the following code:

    for message in df.toJSON().collect():
        kafkaClient.send(message)

However, the dataframe is very large, so it fails when trying to collect(). I was thinking of using a UDF since it processes it row by row:

    from pyspark.sql.functions import udf, struct

    def get_row(row):
        json = row.toJSON()
        kafkaClient.send
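The UDF attempt is cut off above. A common way to avoid collect() entirely is to build the JSON string on the executors with to_json(struct(...)) and then either write through Spark's Kafka sink or send per partition; a sketch, with a tiny stand-in frame and placeholder broker/topic names:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])   # stand-in for the large frame

    # One JSON string per row, built on the executors rather than the driver.
    json_df = df.select(F.to_json(F.struct(*df.columns)).alias("value"))

    # Option 1: Spark's Kafka sink (needs the spark-sql-kafka-0-10 package on the classpath).
    (json_df.write
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder
            .option("topic", "my_topic")                         # placeholder
            .save())

    # Option 2: create a producer per partition and send row by row.
    def send_partition(rows):
        from kafka import KafkaProducer   # assumes the kafka-python package is installed
        producer = KafkaProducer(bootstrap_servers="broker1:9092")
        for row in rows:
            producer.send("my_topic", row.value.encode("utf-8"))
        producer.flush()

    json_df.foreachPartition(send_partition)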

SPARK read.json throwing java.io.IOException: Too many bytes before newline

Deadly submitted on 2019-12-07 14:42:43
Question: I am getting the following error when reading a large 6 GB single-line json file:

    Job aborted due to stage failure: Task 5 in stage 0.0 failed 1 times, most recent failure: Lost task 5.0 in stage 0.0 (TID 5, localhost): java.io.IOException: Too many bytes before newline: 2147483648

Spark does not read json records that span multiple lines, hence the entire 6 GB json file is on a single line:

    jf = sqlContext.read.json("jlrn2.json")

Configuration:

    spark.driver.memory 20g

Answer 1: Yep, you have more than Integer.MAX
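The answer is cut off above; its point is that 2147483648 bytes is one more than Integer.MAX_VALUE, the largest single text line the line reader can handle. Since Spark 2.2 the JSON reader has a multiLine option that parses a record spanning many lines (or the whole file), so the data no longer has to be squeezed onto one line; note that a multiLine file is read by a single task, so a 6 GB file still needs enough memory on that task. A sketch:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # multiLine=True (Spark 2.2+) lets one JSON record span the entire file.
    df = spark.read.option("multiLine", True).json("jlrn2.json")
    df.printSchema()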