pyspark

How to combine small parquet files to one large parquet file? [duplicate]

半城伤御伤魂 submitted on 2019-12-23 05:31:22
Question: This question already has answers here: Spark dataframe write method writing many small files (6 answers). Closed last year. I have some partitioned Hive tables which point to Parquet files. Each partition now contains a lot of small Parquet files, each around 5 KB, and I want to merge those small files into one large file per partition. How can I achieve this to improve my Hive performance? I have tried reading all the Parquet files in the partition into a PySpark dataframe and
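A minimal sketch of the usual compaction approach, assuming the partition directory can be read directly as Parquet; the paths below are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Hypothetical paths; point these at the real partition directories.
src = "/warehouse/my_table/dt=2019-12-01"
dst = "/warehouse/my_table_compacted/dt=2019-12-01"

# Read every small file in the partition and rewrite it as a single file.
# coalesce(1) avoids a shuffle; use repartition(n) if one output file is too big.
df = spark.read.parquet(src)
df.coalesce(1).write.mode("overwrite").parquet(dst)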

Spark HiveContext : Insert Overwrite the same table it is read from

左心房为你撑大大i submitted on 2019-12-23 05:27:10
Question: I want to apply SCD1 and SCD2 using PySpark in HiveContext. In my approach, I read the incremental data and the target table, then join them for the upsert. I call registerTempTable on all the source dataframes. When I try to write the final dataset into the target table, I hit the issue that insert overwrite is not possible into the table it is read from. Please suggest a solution. I do not want to write intermediate data into a physical table and read it
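One frequently suggested workaround is to break the lineage before overwriting, for example by checkpointing the merged result; a rough sketch under that assumption (the table names and the join are placeholders for the real SCD1/SCD2 merge, and both tables are assumed to share a schema):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/spark_checkpoints")  # hypothetical checkpoint path

# Hypothetical table names standing in for the real target and increment.
target = spark.table("db.target_table")
increment = spark.table("db.staging_increment")

# Stand-in for the real SCD1/SCD2 merge: keep incoming rows plus any target
# rows whose key is not being updated.
merged = (target.join(increment, on="business_key", how="left_anti")
                .unionByName(increment))

# Materialize the result so its plan no longer reads db.target_table,
# which is what the "insert overwrite the table being read" check objects to.
merged = merged.checkpoint(eager=True)

merged.write.insertInto("db.target_table", overwrite=True)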

How to save data in cassandra table using spark's saveToCassandra?

自闭症网瘾萝莉.ら submitted on 2019-12-23 05:05:45
Question: I'm using Cassandra with Spark and I want to save data into a Cassandra table. I want to insert data into the table below:

cqlsh:users> select * from subscription ;

 pk | a | b
----+---+---
(0 rows)

cqlsh:users> desc subscription ;

CREATE TABLE users.subscription (
    pk uuid PRIMARY KEY,
    a text,
    b text
)

Program code (consumer_demo.py):

from pyspark import SparkConf
import pyspark_cassandra
from pyspark_cassandra import CassandraSparkContext
conf = SparkConf().set("spark.cassandra.connection.host",
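For reference, a rough sketch of how saveToCassandra is typically invoked with pyspark_cassandra; exact behaviour depends on the library version, and the connection host is a placeholder:

import uuid

from pyspark import SparkConf
from pyspark_cassandra import CassandraSparkContext

conf = SparkConf().set("spark.cassandra.connection.host", "127.0.0.1")  # placeholder host
sc = CassandraSparkContext(conf=conf)

# saveToCassandra takes (keyspace, table); each dict maps column names to values.
# uuid.UUID values are assumed here to map onto the uuid column type.
rows = [{"pk": uuid.uuid4(), "a": "hello", "b": "world"}]
sc.parallelize(rows).saveToCassandra("users", "subscription")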

Delete azure sql database rows from azure databricks

梦想与她 submitted on 2019-12-23 04:54:09
Question: I have a table in an Azure SQL database from which I want to delete selected rows based on some criteria, or the entire table, from Azure Databricks. Currently I am using the truncate property of JDBC to truncate the entire table without dropping it, and then re-write it with a new dataframe:

df.write \
    .option('user', jdbcUsername) \
    .option('password', jdbcPassword) \
    .jdbc('<connection_string>', '<table_name>', mode = 'overwrite', properties = {'truncate' : 'true'})

But going forward I don't
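Since the Spark DataFrame writer cannot express row-level deletes, one common pattern in a Databricks notebook (where the spark session is ambient) is to issue the DELETE directly over a JDBC connection obtained from the driver's JVM; a sketch, with the connection details, table name, and WHERE clause as placeholders:

# Placeholders for the connection details used elsewhere in the question.
jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<database>"
jdbc_user = "<user>"
jdbc_password = "<password>"

# Borrow java.sql.DriverManager from the driver JVM via py4j and run the DELETE there.
driver_manager = spark.sparkContext._jvm.java.sql.DriverManager
conn = driver_manager.getConnection(jdbc_url, jdbc_user, jdbc_password)
try:
    stmt = conn.createStatement()
    # Hypothetical criterion; delete only the matching rows instead of truncating.
    stmt.executeUpdate("DELETE FROM <table_name> WHERE load_date < '2019-01-01'")
    stmt.close()
finally:
    conn.close()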

My Datastax Spark doesn't work with my current python version and I have no idea why?

会有一股神秘感。 submitted on 2019-12-23 04:53:28
Question: Below is my error message. When I use Python 2.7 in Datastax Spark with the configuration below, it doesn't work and I don't know why. I would be very grateful for some suggestions. Thanks.

vi /etc/dse/spark/spark-env.sh
export PYTHONHOME=/usr/local
export PYTHONPATH=/usr/local/lib/python2.7
export PYSPARK_PYTHON=/usr/local/bin/python2.7

Error message:

Error from python worker:
/usr/local/bin/python2.7: /usr/local/lib/python2.7/lib-dynload/_io.so: undefined symbol: _PyCodec_LookupTextEncoding
PYTHONPATH was:
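The undefined-symbol error usually means the interpreter named by PYSPARK_PYTHON is picking up a standard library (via PYTHONHOME/PYTHONPATH) built for a different Python build. A small diagnostic sketch, assuming a working pyspark shell with sc available, to compare what the driver and the workers actually run:

from __future__ import print_function
import sys

# Driver side: which binary and which installation prefix are in use.
print("driver:", sys.executable, tuple(sys.version_info), sys.prefix)

# Worker side: run the same check inside a task and compare the two.
worker = sc.parallelize([0], 1).map(
    lambda _: (sys.executable, tuple(sys.version_info), sys.prefix)
).first()
print("worker:", worker)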

Parsing a text file to split at specific positions using pyspark

微笑、不失礼 submitted on 2019-12-23 04:39:07
Question: I have a text file which is not delimited by any character, and I want to split it at specific positions so that I can convert it to a dataframe. Example data in file1.txt:

1JITENDER33
2VIRENDER28
3BIJENDER37

I want to split the file so that positions 0 to 1 go into the first column, positions 2 to 9 go into the second column, and positions 10 to 11 go into the third column, so that I can finally convert it into a Spark dataframe.

Answer 1: You can use the Python code below to read your input file and make it
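A minimal PySpark sketch of the position-based split using substring (1-based start, length); the column names are placeholders:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Read the fixed-width file as a single 'value' column, then slice by position.
raw = spark.read.text("file1.txt")
df = raw.select(
    F.substring("value", 1, 2).alias("col1"),   # characters at positions 0-1
    F.substring("value", 3, 8).alias("col2"),   # positions 2-9
    F.substring("value", 11, 2).alias("col3"),  # positions 10-11
)
df.show()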

PySpark: how to groupby, resample and forward-fill null values?

旧街凉风 submitted on 2019-12-23 04:33:18
Question: Considering the dataset below in Spark, I would like to resample the dates with a specific frequency (e.g. 5 minutes).

START_DATE = dt.datetime(2019,8,15,20,33,0)
test_df = pd.DataFrame({
    'school_id': ['remote','remote','remote','remote','onsite','onsite','onsite','onsite','remote','remote'],
    'class_id': ['green', 'green', 'red', 'red', 'green', 'green', 'green', 'green', 'red', 'green'],
    'user_id': [15,15,16,16,15,17,17,17,16,17],
    'status': [0,1,1,1,0,1,0,1,1,0],
    'start': pd.date_range(start
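For the forward-fill part, a minimal sketch using a window ordered by time within each group; the tiny dataset below stands in for the output of an upstream 5-minute resample (null status where a gap was inserted):

import datetime as dt
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

rows = [
    ("remote", "green", 15, dt.datetime(2019, 8, 15, 20, 35), 0),
    ("remote", "green", 15, dt.datetime(2019, 8, 15, 20, 40), None),
    ("remote", "green", 15, dt.datetime(2019, 8, 15, 20, 45), 1),
    ("remote", "green", 15, dt.datetime(2019, 8, 15, 20, 50), None),
]
sdf = spark.createDataFrame(rows, ["school_id", "class_id", "user_id", "start", "status"])

# Forward-fill: within each group, ordered by time, take the last non-null
# status seen so far.
w = (Window.partitionBy("school_id", "class_id", "user_id")
           .orderBy("start")
           .rowsBetween(Window.unboundedPreceding, Window.currentRow))

sdf.withColumn("status_filled", F.last("status", ignorenulls=True).over(w)).show()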

Spark Will Not Load Large MySql Table: Java Communications link failure - Timing Out

跟風遠走 submitted on 2019-12-23 04:24:38
Question: I'm trying to pull a fairly large table from MySQL so I can manipulate it using Spark/Databricks. I can't get it to load into Spark: I have tried taking smaller subsets, but even at the smallest reasonable unit it still fails to load. I have tried playing with wait_timeout and interactive_timeout in MySQL, but it doesn't seem to make any difference. I am also loading a smaller (different) table, and that loads just fine.

df_dataset = get_jdbc('raw_data_load', predicates=predicates).select(
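One common mitigation is to read the table through Spark's stock JDBC partitioning options, so no single connection has to stream the whole table before a timeout hits; a sketch with placeholder connection details (this uses the built-in reader rather than the get_jdbc helper from the question, and assumes a roughly uniform numeric column to partition on):

df = (spark.read.format("jdbc")
      .option("url", "jdbc:mysql://<host>:3306/<db>")
      .option("dbtable", "raw_data_load")
      .option("user", "<user>")
      .option("password", "<password>")
      .option("partitionColumn", "id")     # hypothetical numeric, roughly uniform column
      .option("lowerBound", "1")
      .option("upperBound", "100000000")
      .option("numPartitions", "32")
      .option("fetchsize", "10000")        # fetch-size hint; behaviour is driver-dependent
      .load())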

Calculating Cumulative sum in PySpark using Window Functions

…衆ロ難τιáo~ submitted on 2019-12-23 04:24:12
Question: I have the following sample DataFrame:

rdd = sc.parallelize([(1,20), (2,30), (3,30)])
df2 = spark.createDataFrame(rdd, ["id", "duration"])
df2.show()

+---+--------+
| id|duration|
+---+--------+
|  1|      20|
|  2|      30|
|  3|      30|
+---+--------+

I want to sort this DataFrame in descending order of duration and add a new column which holds the cumulative sum of the duration. So I did the following:

windowSpec = Window.orderBy(df2['duration'].desc())
df_cum_sum = df2.withColumn("duration_cum_sum", sum('duration
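A complete sketch of the cumulative sum, using functions.sum rather than the Python built-in, plus an explicit row frame and a tie-breaker so rows with equal duration do not share the same running total:

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df2 = spark.createDataFrame([(1, 20), (2, 30), (3, 30)], ["id", "duration"])

# Order by duration descending with id as a tie-breaker, and accumulate row by row.
w = (Window.orderBy(F.col("duration").desc(), F.col("id"))
           .rowsBetween(Window.unboundedPreceding, Window.currentRow))

df2.withColumn("duration_cum_sum", F.sum("duration").over(w)).show()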

Pyspark Structured Streaming Kafka configuration error

↘锁芯ラ submitted on 2019-12-23 04:21:29
Question: I've been using pyspark for Spark Streaming (Spark 2.0.2) with Kafka (0.10.1.0) successfully before, but my purposes are better suited to Structured Streaming. I've attempted to use the example online: https://spark.apache.org/docs/2.1.0/structured-streaming-kafka-integration.html with the following analogous code:

ds1 = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
    .option("subscribe", "topic1") \
    .load()

query = ds1 \
    .writeStream \
    .outputMode(
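A sketch of the full pipeline; the usual cause of configuration errors here is launching without the Kafka source package (e.g. spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0, with the artifact matching your Spark/Scala versions). Hosts and topic are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-structured-streaming").getOrCreate()

# Read from Kafka as a streaming source.
ds1 = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
       .option("subscribe", "topic1")
       .load())

# Kafka key/value arrive as binary; cast them for inspection and write to the console sink.
query = (ds1.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
         .writeStream
         .outputMode("append")
         .format("console")
         .start())

query.awaitTermination()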