pyspark

Convert pyspark dataframe into list of python dictionaries

Posted by 北战南征 on 2021-02-10 04:50:24
Question: Hi, I'm new to PySpark and I'm trying to convert a pyspark.sql.dataframe into a list of Python dictionaries. Below is my dataframe; its type is <class 'pyspark.sql.dataframe.DataFrame'>:
+------------------+----------+------------------------+
|             title|imdb_score|Worldwide_Gross(dollars)|
+------------------+----------+------------------------+
| The Eight Hundred|       7.2|               460699653|
| Bad Boys for Life|       6.6|               426505244|
|             Tenet|       7.8|               334000000|
|Sonic the Hedgehog|       6.5|               308439401|
|          Dolittle|       5.6|               245229088|
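A minimal sketch of one common approach, assuming the dataframe is small enough to collect to the driver: each Row returned by collect() exposes asDict(), so a list comprehension yields a list of Python dictionaries. The sample data below is abbreviated from the question.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [("The Eight Hundred", 7.2, 460699653), ("Tenet", 7.8, 334000000)],
        ["title", "imdb_score", "Worldwide_Gross(dollars)"],
    )

    # Collect the rows to the driver and turn each Row into a plain Python dict.
    list_of_dicts = [row.asDict() for row in df.collect()]
    print(list_of_dicts)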

spark - set null when column not exist in dataframe

Posted by 假如想象 on 2021-02-09 02:51:04
Question: I'm loading many versions of JSON files into a Spark DataFrame. Some of the files hold columns A, B and some hold A, B, C or A, C. If I run this command:
    from pyspark.sql import SQLContext
    sqlContext = SQLContext(sc)
    df = sqlContext.sql("SELECT A,B,C FROM table")
after loading several files I can get the error "column not exist" when I have loaded only files that do not hold column C. How can I set this value to null instead of getting the error? Answer 1: The DataFrameReader.json method provides an optional schema argument you can use
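A minimal sketch of that approach, assuming string-typed columns and a hypothetical input path data/*.json; the view name json_table is also illustrative. Because the schema is supplied up front with nullable fields, files that lack column C simply yield null for it.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType

    spark = SparkSession.builder.getOrCreate()

    # Nullable fields: any column missing from a given file comes back as null.
    schema = StructType([
        StructField("A", StringType(), True),
        StructField("B", StringType(), True),
        StructField("C", StringType(), True),
    ])

    df = spark.read.json("data/*.json", schema=schema)
    df.createOrReplaceTempView("json_table")
    spark.sql("SELECT A, B, C FROM json_table").show()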

Creating datetime from string column in Pyspark [duplicate]

Posted by 守給你的承諾、 on 2021-02-08 12:11:23
Question: This question already has answers here: Convert pyspark string to date format (6 answers). Closed 3 years ago. Suppose I have the datetime column shown below. I want to convert the string column to a datetime type so I can extract months, days, years and so on.
+---+------------+
|agg|    datetime|
+---+------------+
|  A|1/2/17 12:00|
|  B|        null|
|  C|1/4/17 15:00|
+---+------------+
I have tried the code below, but the returned values in the datetime column are nulls,
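A minimal sketch, assuming the strings follow the month/day/two-digit-year pattern shown above. Passing an explicit format to to_timestamp parses them into a timestamp column, after which month, day, and year can be extracted; null inputs simply stay null.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import to_timestamp, year, month, dayofmonth

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [("A", "1/2/17 12:00"), ("B", None), ("C", "1/4/17 15:00")],
        ["agg", "datetime"],
    )

    # Parse the string column with an explicit pattern matching "1/2/17 12:00".
    parsed = df.withColumn("datetime", to_timestamp("datetime", "M/d/yy HH:mm"))
    parsed.select("agg", "datetime",
                  year("datetime").alias("year"),
                  month("datetime").alias("month"),
                  dayofmonth("datetime").alias("day")).show()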

How to add a validation in azure data factory pipeline to check file size?

Posted by 感情迁移 on 2021-02-08 11:49:17
Question: I have multiple data sources, and I want to add a validation step in Azure Data Factory before loading into tables: it should check the file size so that empty files are not loaded. If a file is larger than 10 KB (i.e., not empty), loading should start; if it is empty, loading should not start. I checked the Validation activity in Azure Data Factory, but it does not show the size for multiple files in a folder. Any suggestions are appreciated, in particular whether I can add a Python notebook for this validation
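A minimal sketch of the kind of notebook check the question asks about, assuming a Databricks notebook (where dbutils is available) and a hypothetical folder path /mnt/source_folder; the 10 KB threshold comes from the question. The calling Data Factory pipeline can then branch on whether the notebook activity succeeds or on its exit value.

    # Assumed threshold and path; adjust for the real sources.
    MIN_SIZE_BYTES = 10 * 1024  # 10 KB

    # List the files in the folder; each FileInfo exposes path, name, and size.
    files = dbutils.fs.ls("/mnt/source_folder")
    too_small = [f.path for f in files if f.size < MIN_SIZE_BYTES]

    if too_small:
        # Failing the notebook lets the calling ADF pipeline skip the load.
        raise ValueError("Files below the size threshold: {}".format(too_small))
    else:
        dbutils.notebook.exit("ok")  # downstream activity can check this output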

Flatten Spark Dataframe column of map/dictionary into multiple columns

Posted by 情到浓时终转凉″ on 2021-02-08 11:33:31
Question: We have a DataFrame that looks like this: DataFrame[event: string, properties: map<string,string>]. Notice that there are two columns: event and properties. How do we split or flatten the properties column into multiple columns based on the key values in the map? I notice I can do something like this:
    newDf = df.withColumn("foo", col("properties")["foo"])
which produces a DataFrame of DataFrame[event: string, properties: map<string,string>, foo: String]. But then I would have to do this for
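A minimal sketch of one way to avoid writing a withColumn call per key, assuming the set of map keys is small enough to collect to the driver: gather the distinct keys first, then build one column per key in a single select. The sample data is illustrative.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, explode, map_keys

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [("click", {"foo": "1", "bar": "a"}), ("view", {"foo": "2"})],
        ["event", "properties"],
    )

    # Pull the distinct key names to the driver (assumes a manageable number of keys).
    keys = [r["key"] for r in
            df.select(explode(map_keys("properties")).alias("key")).distinct().collect()]

    # One output column per key; rows missing a key get null for that column.
    flat = df.select("event", *[col("properties")[k].alias(k) for k in keys])
    flat.show()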

Pyspark TextParsingException while loading a file

Posted by 限于喜欢 on 2021-02-08 11:26:30
Question: I am loading a CSV file with 1 million records using PySpark, but I am getting this error: TextParsingException: Length of parsed input (1000001) exceeds the maximum number of characters defined in your parser settings (1000000). I checked whether any record in the file has more than 1,000,000 characters, but none does; the maximum record length in my file is 850. Please help. CODE SNIPPET:
    input_df = spark.read.format('com.databricks.spark.csv').option("delimiter","
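A minimal sketch of one way around this limit, assuming a hypothetical input path and a comma delimiter. The cap comes from the underlying univocity CSV parser; raising maxCharsPerColumn (or setting it to -1 for no limit) lifts it, and explicit quote/escape options help when a stray quote makes the parser read many rows into a single field.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    input_df = (spark.read.format("csv")
                .option("header", "true")
                .option("delimiter", ",")           # assumed delimiter
                .option("quote", '"')
                .option("escape", '"')
                .option("maxCharsPerColumn", "-1")  # lift the 1,000,000-character cap
                .load("data/input.csv"))            # assumed path

    input_df.show(5)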

pyspark split dataframe by two columns without creating a folder structure for the 2nd

Posted by [亡魂溺海] on 2021-02-08 11:05:56
Question: Two-part question. I have a PySpark dataframe that I'm reading from a list of JSON files in my Azure blob storage. After some simple ETL I need to move this from blob storage to a data lake as a parquet file; simple so far. I'm unsuccessfully trying to efficiently write it into a folder defined by two columns, one a date column and the other an ID. Using partitionBy gets me close.
id | date                | nested_json_data | path
1  | 2019-01-01 12:01:01 | {data : [data]}  | dbfs:\mnt\..
1  | 2019-01
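A minimal sketch of one possible workaround, assuming a hypothetical output path and that the goal is a single folder per id/date pair rather than nested id=<...>/date=<...> folders: derive one combined partition key and pass only that column to partitionBy. The column and path names are illustrative.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import concat_ws, to_date, col

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [(1, "2019-01-01 12:01:01", '{"data": ["data"]}')],
        ["id", "date", "nested_json_data"],
    )

    # Combine id and the date part into a single key, e.g. "1_2019-01-01".
    df = df.withColumn(
        "partition_key",
        concat_ws("_", col("id").cast("string"), to_date(col("date")).cast("string")),
    )

    (df.write
       .mode("overwrite")
       .partitionBy("partition_key")        # one folder level: partition_key=1_2019-01-01
       .parquet("/mnt/datalake/output"))    # assumed target path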

Problem in building a docker image with pyspark lib

Posted by 笑着哭i on 2021-02-08 11:00:57
Question: I'm trying to build a Docker image using s2i and Jenkins. I have the following dependencies in the requirement.txt file:
    scikit-learn==0.21.2
    scipy==0.18.1
    pandas==0.24.2
    seldon-core==0.3.0
    pypandoc
    pyspark==2.4.1
But my build process fails when it tries to install pyspark, with the following error message:
    Downloading https://repo.company.com/repository/pypi-all/packages/f2/64/a1df4440483df47381bbbf6a03119ef66515cf2e1a766d9369811575454b/pyspark-2.4.1.tar.gz (215.7MB)
    Complete output from