pyspark

Dynamically creating folders in an S3 bucket from a PySpark job

与世无争的帅哥 submitted on 2021-01-29 09:01:37
Question: I am writing data into an S3 bucket and creating Parquet files using PySpark. My bucket structure looks like this: s3a://rootfolder/subfolder/table/. The two folders subfolder and table should be created at run time if they do not exist, and if they do exist the Parquet files should go inside the folder table. When I run the PySpark program from my local machine it creates an extra folder with the suffix _$folder$ (like table_$folder$), but if the same program is run from EMR it creates _SUCCESS. Writing into…
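For context, a minimal sketch of how such a write is usually expressed, using the bucket path from the question; the mapreduce.fileoutputcommitter.marksuccessfuljobs setting is not from the question but is a commonly used Hadoop knob for suppressing the _SUCCESS marker:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3-parquet-write").getOrCreate()

    # Suppress the _SUCCESS marker that the Hadoop output committer writes on
    # successful jobs (assumption: this is the marker mentioned for EMR runs).
    spark.sparkContext._jsc.hadoopConfiguration().set(
        "mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

    # S3 has no real directories: the s3a connector simply writes objects whose
    # keys start with subfolder/table/, so no explicit folder creation is needed.
    df.write.mode("append").parquet("s3a://rootfolder/subfolder/table/")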

Trouble reading avro files in Jupyter notebook using pyspark

邮差的信 submitted on 2021-01-29 08:51:02
Question: I am trying to read an Avro file in a Jupyter notebook using PySpark. When I read the file I get an error. I have downloaded spark-avro_2.11:4.0.0.jar, but I am not sure where in my code I should insert the Avro package. Any suggestions would be great. This is an example of the code I am using to read the Avro file: df_avro_example = sqlContext.read.format("com.databricks.spark.avro").load("example_file.avro") and this is the error I get: AnalysisException: 'Failed to find data source: com…
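One way the package is commonly wired into a notebook kernel is through PYSPARK_SUBMIT_ARGS, set before the JVM starts; a sketch assuming the spark-avro coordinates from the question:

    import os

    # Must be set before any SparkContext (and hence the JVM) is created.
    os.environ["PYSPARK_SUBMIT_ARGS"] = (
        "--packages com.databricks:spark-avro_2.11:4.0.0 pyspark-shell"
    )

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("avro-read").getOrCreate()
    df_avro_example = (spark.read
                       .format("com.databricks.spark.avro")
                       .load("example_file.avro"))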

Save a dataframe view after groupBy using pyspark

谁都会走 submitted on 2021-01-29 08:11:27
Question: My homework is giving me a hard time with PySpark. I have this view of my "df2" after a groupBy:

    df2.groupBy('years').count().show()
    +-----+-----+
    |years|count|
    +-----+-----+
    | 2003|11904|
    | 2006| 3476|
    | 1997| 3979|
    | 2004|13362|
    | 1996| 3180|
    | 1998| 4969|
    | 1995| 1995|
    | 2001|11532|
    | 2005|11389|
    | 2000| 7462|
    | 1999| 6593|
    | 2002|11799|
    +-----+-----+

Every attempt to save this (and then load with pandas) to a file gives back the original source data text file form I read with PySpark…
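Two ways such an aggregate is typically persisted so pandas can read it back, sketched below; df2 is the dataframe from the question and the output paths are hypothetical:

    import pandas as pd

    counts = df2.groupBy('years').count()

    # Option 1: the result is tiny, so collect it to the driver and let pandas write it.
    counts.toPandas().to_csv('year_counts.csv', index=False)

    # Option 2: let Spark write a single CSV part file, then read that back with pandas.
    counts.coalesce(1).write.mode('overwrite').csv('year_counts_dir', header=True)

    pdf = pd.read_csv('year_counts.csv')
    print(pdf.head())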

Pyspark explode json string

拜拜、爱过 submitted on 2021-01-29 08:04:04
Question: Input dataframe:

    id   name   collection
    111  aaaaa  {"1":{"city":"city_1","state":"state_1","country":"country_1"},
                 "2":{"city":"city_2","state":"state_2","country":"country_2"},
                 "3":{"city":"city_3","state":"state_3","country":"country_3"}}
    222  bbbbb  {"1":{"city":"city_1","state":"state_1","country":"country_1"},
                 "2":{"city":"city_2","state":"state_2","country":"country_2"},
                 "3":{"city":"city_3","state":"state_3","country":"country_3"}}

Here id ==> string, name ==> string, collection ==> string…
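A sketch of one way to flatten such a column: parse the string with from_json as a map of structs, then explode the map. The column and field names follow the question; the schema itself is an assumption:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, explode, col
    from pyspark.sql.types import MapType, StringType, StructType, StructField

    spark = SparkSession.builder.appName("explode-json").getOrCreate()

    data = [("111", "aaaaa",
             '{"1":{"city":"city_1","state":"state_1","country":"country_1"},'
             '"2":{"city":"city_2","state":"state_2","country":"country_2"}}')]
    df = spark.createDataFrame(data, ["id", "name", "collection"])

    value_schema = StructType([
        StructField("city", StringType()),
        StructField("state", StringType()),
        StructField("country", StringType()),
    ])
    map_schema = MapType(StringType(), value_schema)

    flat = (df
            .withColumn("coll", from_json(col("collection"), map_schema))
            .select("id", "name", explode(col("coll")))   # yields key, value columns
            .select("id", "name", "key",
                    col("value.city"), col("value.state"), col("value.country")))
    flat.show(truncate=False)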

Spark writing Parquet array<string> converts to a different datatype when loading into BigQuery

拥有回忆 submitted on 2021-01-29 07:37:03
Question: Spark DataFrame schema:

    StructType([
        StructField("a", StringType(), False),
        StructField("b", StringType(), True),
        StructField("c", BinaryType(), False),
        StructField("d", ArrayType(StringType(), False), True),
        StructField("e", TimestampType(), True)
    ])

When I write the data frame to Parquet and load it into BigQuery, it interprets the schema differently. It is a simple load from JSON and write to Parquet using a Spark dataframe. BigQuery schema:

    [ { "type": "STRING", "name": "a", "mode":…
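A sketch of the write side for reference; spark.sql.parquet.writeLegacyFormat is not mentioned in the question, but it is the Spark setting that switches between the two Parquet list encodings, which is often what a downstream loader reacts to when an array<string> column changes type. Whether that applies here is an assumption, and the input path is hypothetical:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import (StructType, StructField, StringType,
                                   BinaryType, ArrayType, TimestampType)

    spark = SparkSession.builder.appName("parquet-array-write").getOrCreate()

    schema = StructType([
        StructField("a", StringType(), False),
        StructField("b", StringType(), True),
        StructField("c", BinaryType(), False),
        StructField("d", ArrayType(StringType(), False), True),
        StructField("e", TimestampType(), True),
    ])

    # Chooses between the legacy (2-level) and standard (3-level) Parquet list
    # layout for column "d"; assumption: this is what BigQuery is interpreting.
    spark.conf.set("spark.sql.parquet.writeLegacyFormat", "false")

    df = spark.read.schema(schema).json("input.json")   # hypothetical input path
    df.write.mode("overwrite").parquet("out_parquet")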

How to recover from checkpoint when using python spark direct approach?

一笑奈何 submitted on 2021-01-29 07:19:36
Question: After reading the official docs, I tried using checkpoint with getOrCreate in Spark Streaming. Some snippets:

    def get_ssc():
        sc = SparkContext("yarn-client")
        ssc = StreamingContext(sc, 10)  # calc every 10s
        ks = KafkaUtils.createDirectStream(
            ssc, ['lucky-track'], {"metadata.broker.list": KAFKA_BROKER})
        process_data(ks)
        ssc.checkpoint(CHECKPOINT_DIR)
        return ssc

    if __name__ == '__main__':
        ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, get_ssc)
        ssc.start()
        ssc.awaitTermination()

The code works fine…

How do I get pyspark working in Jupyter Notebook in a virtual environment on Windows?

落爺英雄遲暮 submitted on 2021-01-29 07:09:40
Question: I'm receiving the dreaded 'Exception: Java gateway process exited before sending its port number' error, but I've followed everything I can find already and it's still not working. The worst thing is I swear this setup worked last week and somehow doesn't anymore. I can run PySpark perfectly fine in the virtual env from the command line and outside of the virtual environment (I'm using Pipenv), so it must be something to do with Jupyter Notebook. Has anyone solved this problem on Windows who…
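A sketch of the environment wiring that usually has to exist in the notebook before pyspark is imported; every path below is a hypothetical example, and findspark is an extra helper package, not something from the question:

    import os

    # Hypothetical example locations - adjust to the actual installs.
    os.environ["JAVA_HOME"] = r"C:\Program Files\Java\jdk1.8.0_281"
    os.environ["SPARK_HOME"] = r"C:\spark\spark-2.4.7-bin-hadoop2.7"
    os.environ["HADOOP_HOME"] = r"C:\hadoop"            # folder containing bin\winutils.exe
    os.environ["PYSPARK_PYTHON"] = r"C:\path\to\venv\Scripts\python.exe"

    import findspark
    findspark.init()          # puts the SPARK_HOME pyspark on sys.path

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("gateway-check").getOrCreate()
    print(spark.version)      # if this prints, the Java gateway came up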

Jupyter Cassandra Save Problem - java.lang.NoClassDefFoundError: com/twitter/jsr166e/LongAdder

风格不统一 submitted on 2021-01-29 06:40:23
Question: I am using a Jupyter notebook and want to save CSV data to a Cassandra DB. There is no problem getting the data and showing it, but when I try to save this CSV data to Cassandra it throws the exception below:

    org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0
    failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost,
    executor driver): java.lang.NoClassDefFoundError: com/twitter/jsr166e/LongAdder

I downloaded the Maven package manually both…
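com/twitter/jsr166e/LongAdder comes from a transitive dependency of the Cassandra connector, so a common approach is to let --packages resolve the whole dependency tree rather than adding single jars by hand. A sketch; the connector version, host, input file, and keyspace/table names are hypothetical:

    import os

    # Resolve the connector and its transitive dependencies (incl. jsr166e).
    os.environ["PYSPARK_SUBMIT_ARGS"] = (
        "--packages com.datastax.spark:spark-cassandra-connector_2.11:2.4.3 "
        "pyspark-shell"
    )

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("csv-to-cassandra")
             .config("spark.cassandra.connection.host", "127.0.0.1")
             .getOrCreate())

    df = spark.read.csv("data.csv", header=True)        # hypothetical input file

    (df.write
       .format("org.apache.spark.sql.cassandra")
       .options(table="my_table", keyspace="my_keyspace")
       .mode("append")
       .save())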

Pyspark: dynamically generate condition for when() clause during runtime

孤人 submitted on 2021-01-29 06:37:29
Question: I have read a CSV file into a PySpark dataframe. Applying conditions in a when() clause works fine when the conditions are given before runtime.

    import pandas as pd
    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.sql import functions
    from pyspark.sql.functions import col

    sc = SparkContext('local', 'example')
    sql_sc = SQLContext(sc)
    pandas_df = pd.read_csv('file.csv')  # assuming the file contains a header
    # Sample content of csv file
    # col1,value
    # 1,aa
…
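A sketch of one way to assemble the when() chain at runtime from a list of rules that is only known then; the rules list and sample values below are hypothetical:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("dynamic-when").getOrCreate()
    df = spark.createDataFrame([(1, "aa"), (2, "bb"), (3, "cc")], ["col1", "value"])

    # Rules arriving at runtime: (value to match, label to assign).
    rules = [("aa", "first"), ("bb", "second")]

    expr = F.when(F.col("value") == rules[0][0], rules[0][1])
    for match_value, label in rules[1:]:
        expr = expr.when(F.col("value") == match_value, label)
    expr = expr.otherwise("other")

    df.withColumn("label", expr).show()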