pyspark

Dynamically creating folders in an S3 bucket from a PySpark job

与世无争的帅哥 submitted on 2021-01-29 09:01:37
Question: I am writing data into an S3 bucket and creating Parquet files using PySpark. My bucket structure looks like this: s3a://rootfolder/subfolder/table/. The two folders subfolder and table should be created at run time if they do not exist, and if they do exist the Parquet files should go inside the folder table. When I run the PySpark program from my local machine it creates an extra folder with the suffix _$folder$ (like table_$folder$), but if the same program is run from EMR it creates _SUCCESS. Writing into…
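For context, a minimal sketch of how such a write is usually expressed, using the bucket path from the question; the mapreduce.fileoutputcommitter.marksuccessfuljobs setting is not from the question but is a commonly used Hadoop knob for suppressing the _SUCCESS marker:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3-parquet-write").getOrCreate()

    # Suppress the _SUCCESS marker that the Hadoop output committer writes on
    # successful jobs (assumption: this is the marker mentioned for EMR runs).
    spark.sparkContext._jsc.hadoopConfiguration().set(
        "mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

    # S3 has no real directories: the s3a connector simply writes objects whose
    # keys start with subfolder/table/, so no explicit folder creation is needed.
    df.write.mode("append").parquet("s3a://rootfolder/subfolder/table/")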

Trouble reading avro files in Jupyter notebook using pyspark

邮差的信 submitted on 2021-01-29 08:51:02
Question: I am trying to read an Avro file in a Jupyter notebook using PySpark. When I read the file I get an error. I have downloaded spark-avro_2.11:4.0.0.jar, but I am not sure where in my code I should insert the Avro package. Any suggestions would be great. This is an example of the code I am using to read the Avro file: df_avro_example = sqlContext.read.format("com.databricks.spark.avro").load("example_file.avro") and this is the error I get: AnalysisException: 'Failed to find data source: com…
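One way the package is commonly wired into a notebook kernel is through PYSPARK_SUBMIT_ARGS, set before the JVM starts; a sketch assuming the spark-avro coordinates from the question:

    import os

    # Must be set before any SparkContext (and hence the JVM) is created.
    os.environ["PYSPARK_SUBMIT_ARGS"] = (
        "--packages com.databricks:spark-avro_2.11:4.0.0 pyspark-shell"
    )

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("avro-read").getOrCreate()
    df_avro_example = (spark.read
                       .format("com.databricks.spark.avro")
                       .load("example_file.avro"))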

Save a dataframe view after groupBy using pyspark

谁都会走 submitted on 2021-01-29 08:11:27
Question: My homework is giving me a hard time with PySpark. I have this view of my "df2" after a groupBy:

    df2.groupBy('years').count().show()
    +-----+-----+
    |years|count|
    +-----+-----+
    | 2003|11904|
    | 2006| 3476|
    | 1997| 3979|
    | 2004|13362|
    | 1996| 3180|
    | 1998| 4969|
    | 1995| 1995|
    | 2001|11532|
    | 2005|11389|
    | 2000| 7462|
    | 1999| 6593|
    | 2002|11799|
    +-----+-----+

Every attempt to save this (and then load with pandas) to a file gives back the original source data text file form I read with PySpark…
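Two ways such an aggregate is typically persisted so pandas can read it back, sketched below; df2 is the dataframe from the question and the output paths are hypothetical:

    import pandas as pd

    counts = df2.groupBy('years').count()

    # Option 1: the result is tiny, so collect it to the driver and let pandas write it.
    counts.toPandas().to_csv('year_counts.csv', index=False)

    # Option 2: let Spark write a single CSV part file, then read that back with pandas.
    counts.coalesce(1).write.mode('overwrite').csv('year_counts_dir', header=True)

    pdf = pd.read_csv('year_counts.csv')
    print(pdf.head())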

Pyspark explode json string

拜拜、爱过 submitted on 2021-01-29 08:04:04
Question: Input dataframe:

    id   name   collection
    111  aaaaa  {"1":{"city":"city_1","state":"state_1","country":"country_1"},
                 "2":{"city":"city_2","state":"state_2","country":"country_2"},
                 "3":{"city":"city_3","state":"state_3","country":"country_3"}}
    222  bbbbb  {"1":{"city":"city_1","state":"state_1","country":"country_1"},
                 "2":{"city":"city_2","state":"state_2","country":"country_2"},
                 "3":{"city":"city_3","state":"state_3","country":"country_3"}}

Here id ==> string, name ==> string, collection ==> string…
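A sketch of one way to flatten such a column: parse the string with from_json as a map of structs, then explode the map. The column and field names follow the question; the schema itself is an assumption:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, explode, col
    from pyspark.sql.types import MapType, StringType, StructType, StructField

    spark = SparkSession.builder.appName("explode-json").getOrCreate()

    data = [("111", "aaaaa",
             '{"1":{"city":"city_1","state":"state_1","country":"country_1"},'
             '"2":{"city":"city_2","state":"state_2","country":"country_2"}}')]
    df = spark.createDataFrame(data, ["id", "name", "collection"])

    value_schema = StructType([
        StructField("city", StringType()),
        StructField("state", StringType()),
        StructField("country", StringType()),
    ])
    map_schema = MapType(StringType(), value_schema)

    flat = (df
            .withColumn("coll", from_json(col("collection"), map_schema))
            .select("id", "name", explode(col("coll")))   # yields key, value columns
            .select("id", "name", "key",
                    col("value.city"), col("value.state"), col("value.country")))
    flat.show(truncate=False)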

Spark writing Parquet array<string> converts to a different datatype when loading into BigQuery

拥有回忆 submitted on 2021-01-29 07:37:03
Question: Spark DataFrame schema:

    StructType([
        StructField("a", StringType(), False),
        StructField("b", StringType(), True),
        StructField("c", BinaryType(), False),
        StructField("d", ArrayType(StringType(), False), True),
        StructField("e", TimestampType(), True)
    ])

When I write the data frame to Parquet and load it into BigQuery, it interprets the schema differently. It is a simple load from JSON and write to Parquet using a Spark dataframe. BigQuery schema:

    [ { "type": "STRING", "name": "a", "mode":…
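A sketch of the write side for reference; spark.sql.parquet.writeLegacyFormat is not mentioned in the question, but it is the Spark setting that switches between the two Parquet list encodings, which is often what a downstream loader reacts to when an array<string> column changes type. Whether that applies here is an assumption, and the input path is hypothetical:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import (StructType, StructField, StringType,
                                   BinaryType, ArrayType, TimestampType)

    spark = SparkSession.builder.appName("parquet-array-write").getOrCreate()

    schema = StructType([
        StructField("a", StringType(), False),
        StructField("b", StringType(), True),
        StructField("c", BinaryType(), False),
        StructField("d", ArrayType(StringType(), False), True),
        StructField("e", TimestampType(), True),
    ])

    # Chooses between the legacy (2-level) and standard (3-level) Parquet list
    # layout for column "d"; assumption: this is what BigQuery is interpreting.
    spark.conf.set("spark.sql.parquet.writeLegacyFormat", "false")

    df = spark.read.schema(schema).json("input.json")   # hypothetical input path
    df.write.mode("overwrite").parquet("out_parquet")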

How to recover from checkpoint when using python spark direct approach?

一笑奈何 submitted on 2021-01-29 07:19:36
Question: After reading the official docs, I tried using checkpoint with getOrCreate in Spark Streaming. Some snippets:

    def get_ssc():
        sc = SparkContext("yarn-client")
        ssc = StreamingContext(sc, 10)  # calc every 10s
        ks = KafkaUtils.createDirectStream(
            ssc, ['lucky-track'], {"metadata.broker.list": KAFKA_BROKER})
        process_data(ks)
        ssc.checkpoint(CHECKPOINT_DIR)
        return ssc

    if __name__ == '__main__':
        ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, get_ssc)
        ssc.start()
        ssc.awaitTermination()

The code works fine…

How do I get pyspark working in Jupyter Notebook in a virtual environment on Windows?

落爺英雄遲暮 submitted on 2021-01-29 07:09:40
Question: I'm receiving the dreaded 'Exception: Java gateway process exited before sending its port number' error, but I've followed everything I can find already and it's still not working. The worst thing is I swear this setup worked last week and somehow doesn't anymore. I can run PySpark perfectly fine in the virtual env from the command line and outside of the virtual environment (I'm using Pipenv), so it must be something to do with Jupyter Notebook. Has anyone solved this problem on Windows who…
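A sketch of the environment wiring that usually has to exist in the notebook before pyspark is imported; every path below is a hypothetical example, and findspark is an extra helper package, not something from the question:

    import os

    # Hypothetical example locations - adjust to the actual installs.
    os.environ["JAVA_HOME"] = r"C:\Program Files\Java\jdk1.8.0_281"
    os.environ["SPARK_HOME"] = r"C:\spark\spark-2.4.7-bin-hadoop2.7"
    os.environ["HADOOP_HOME"] = r"C:\hadoop"            # folder containing bin\winutils.exe
    os.environ["PYSPARK_PYTHON"] = r"C:\path\to\venv\Scripts\python.exe"

    import findspark
    findspark.init()          # puts the SPARK_HOME pyspark on sys.path

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("gateway-check").getOrCreate()
    print(spark.version)      # if this prints, the Java gateway came up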

Jupyter Cassandra Save Problem - java.lang.NoClassDefFoundError: com/twitter/jsr166e/LongAdder

风格不统一 submitted on 2021-01-29 06:40:23
Question: I am using a Jupyter notebook and want to save CSV data to a Cassandra DB. There is no problem getting the data and showing it, but when I try to save this CSV data to Cassandra it throws the exception below:

    org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0
    failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost,
    executor driver): java.lang.NoClassDefFoundError: com/twitter/jsr166e/LongAdder

I downloaded the Maven package manually both…
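com/twitter/jsr166e/LongAdder comes from a transitive dependency of the Cassandra connector, so a common approach is to let --packages resolve the whole dependency tree rather than adding single jars by hand. A sketch; the connector version, host, input file, and keyspace/table names are hypothetical:

    import os

    # Resolve the connector and its transitive dependencies (incl. jsr166e).
    os.environ["PYSPARK_SUBMIT_ARGS"] = (
        "--packages com.datastax.spark:spark-cassandra-connector_2.11:2.4.3 "
        "pyspark-shell"
    )

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("csv-to-cassandra")
             .config("spark.cassandra.connection.host", "127.0.0.1")
             .getOrCreate())

    df = spark.read.csv("data.csv", header=True)        # hypothetical input file

    (df.write
       .format("org.apache.spark.sql.cassandra")
       .options(table="my_table", keyspace="my_keyspace")
       .mode("append")
       .save())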

Pyspark: dynamically generate condition for when() clause during runtime

孤人 submitted on 2021-01-29 06:37:29
Question: I have read a CSV file into a PySpark dataframe. Applying conditions in a when() clause works fine when the conditions are given before runtime.

    import pandas as pd
    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.sql import functions
    from pyspark.sql.functions import col

    sc = SparkContext('local', 'example')
    sql_sc = SQLContext(sc)
    pandas_df = pd.read_csv('file.csv')  # assuming the file contains a header
    # Sample content of csv file
    # col1,value
    # 1,aa
…
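A sketch of one way to assemble the when() chain at runtime from a list of rules that is only known then; the rules list and sample values below are hypothetical:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("dynamic-when").getOrCreate()
    df = spark.createDataFrame([(1, "aa"), (2, "bb"), (3, "cc")], ["col1", "value"])

    # Rules arriving at runtime: (value to match, label to assign).
    rules = [("aa", "first"), ("bb", "second")]

    expr = F.when(F.col("value") == rules[0][0], rules[0][1])
    for match_value, label in rules[1:]:
        expr = expr.when(F.col("value") == match_value, label)
    expr = expr.otherwise("other")

    df.withColumn("label", expr).show()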