pyspark

How to combine small parquet files to one large parquet file? [duplicate]

半城伤御伤魂 submitted on 2019-12-23 05:31:22
Question: This question already has answers here: Spark dataframe write method writing many small files (6 answers). Closed last year. I have some partitioned Hive tables which point to Parquet files. Each partition now contains a lot of small Parquet files, each around 5 KB, and I want to merge those small files into one large file per partition. How can I achieve this to improve my Hive performance? I have tried reading all the Parquet files in the partition into a PySpark dataframe and
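A minimal sketch of the usual compaction approach, assuming the partition directory can be read directly as Parquet; the paths below are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Hypothetical paths; point these at the real partition directories.
src = "/warehouse/my_table/dt=2019-12-01"
dst = "/warehouse/my_table_compacted/dt=2019-12-01"

# Read every small file in the partition and rewrite it as a single file.
# coalesce(1) avoids a shuffle; use repartition(n) if one output file is too big.
df = spark.read.parquet(src)
df.coalesce(1).write.mode("overwrite").parquet(dst)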

Spark HiveContext : Insert Overwrite the same table it is read from

左心房为你撑大大i submitted on 2019-12-23 05:27:10
Question: I want to apply SCD1 and SCD2 using PySpark in HiveContext. In my approach, I read the incremental data and the target table, then join them for the upsert. I call registerTempTable on all the source dataframes. When I try to write the final dataset into the target table, I hit the issue that insert overwrite is not possible into the table it is read from. Please suggest a solution. I do not want to write intermediate data into a physical table and read it
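One frequently suggested workaround is to break the lineage before overwriting, for example by checkpointing the merged result; a rough sketch under that assumption (the table names and the join are placeholders for the real SCD1/SCD2 merge, and both tables are assumed to share a schema):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/spark_checkpoints")  # hypothetical checkpoint path

# Hypothetical table names standing in for the real target and increment.
target = spark.table("db.target_table")
increment = spark.table("db.staging_increment")

# Stand-in for the real SCD1/SCD2 merge: keep incoming rows plus any target
# rows whose key is not being updated.
merged = (target.join(increment, on="business_key", how="left_anti")
                .unionByName(increment))

# Materialize the result so its plan no longer reads db.target_table,
# which is what the "insert overwrite the table being read" check objects to.
merged = merged.checkpoint(eager=True)

merged.write.insertInto("db.target_table", overwrite=True)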

How to save data in cassandra table using spark's saveToCassandra?

自闭症网瘾萝莉.ら submitted on 2019-12-23 05:05:45
Question: I'm using Cassandra with Spark and I want to save data into a Cassandra table. I want to insert data into the table below:

cqlsh:users> select * from subscription ;

 pk | a | b
----+---+---
(0 rows)

cqlsh:users> desc subscription ;

CREATE TABLE users.subscription (
    pk uuid PRIMARY KEY,
    a text,
    b text
)

Program code (consumer_demo.py):

from pyspark import SparkConf
import pyspark_cassandra
from pyspark_cassandra import CassandraSparkContext
conf = SparkConf().set("spark.cassandra.connection.host",
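For reference, a rough sketch of how saveToCassandra is typically invoked with pyspark_cassandra; exact behaviour depends on the library version, and the connection host is a placeholder:

import uuid

from pyspark import SparkConf
from pyspark_cassandra import CassandraSparkContext

conf = SparkConf().set("spark.cassandra.connection.host", "127.0.0.1")  # placeholder host
sc = CassandraSparkContext(conf=conf)

# saveToCassandra takes (keyspace, table); each dict maps column names to values.
# uuid.UUID values are assumed here to map onto the uuid column type.
rows = [{"pk": uuid.uuid4(), "a": "hello", "b": "world"}]
sc.parallelize(rows).saveToCassandra("users", "subscription")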

Delete azure sql database rows from azure databricks

梦想与她 submitted on 2019-12-23 04:54:09
Question: I have a table in an Azure SQL database from which I want to delete selected rows based on some criteria, or the entire table, from Azure Databricks. Currently I am using the truncate property of JDBC to truncate the entire table without dropping it, and then re-write it with a new dataframe:

df.write \
    .option('user', jdbcUsername) \
    .option('password', jdbcPassword) \
    .jdbc('<connection_string>', '<table_name>', mode = 'overwrite', properties = {'truncate' : 'true'})

But going forward I don't
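Since the Spark DataFrame writer cannot express row-level deletes, one common pattern in a Databricks notebook (where the spark session is ambient) is to issue the DELETE directly over a JDBC connection obtained from the driver's JVM; a sketch, with the connection details, table name, and WHERE clause as placeholders:

# Placeholders for the connection details used elsewhere in the question.
jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<database>"
jdbc_user = "<user>"
jdbc_password = "<password>"

# Borrow java.sql.DriverManager from the driver JVM via py4j and run the DELETE there.
driver_manager = spark.sparkContext._jvm.java.sql.DriverManager
conn = driver_manager.getConnection(jdbc_url, jdbc_user, jdbc_password)
try:
    stmt = conn.createStatement()
    # Hypothetical criterion; delete only the matching rows instead of truncating.
    stmt.executeUpdate("DELETE FROM <table_name> WHERE load_date < '2019-01-01'")
    stmt.close()
finally:
    conn.close()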

My Datastax Spark doesn't work with my current python version and I have no idea why?

会有一股神秘感。 submitted on 2019-12-23 04:53:28
Question: Below is my error message. When I use Python 2.7 in Datastax Spark with the configuration below, it doesn't work and I don't know why. I would be very grateful for some suggestions. Thanks.

vi /etc/dse/spark/spark-env.sh
export PYTHONHOME=/usr/local
export PYTHONPATH=/usr/local/lib/python2.7
export PYSPARK_PYTHON=/usr/local/bin/python2.7

Error message:

Error from python worker:
/usr/local/bin/python2.7: /usr/local/lib/python2.7/lib-dynload/_io.so: undefined symbol: _PyCodec_LookupTextEncoding
PYTHONPATH was:
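The undefined-symbol error usually means the interpreter named by PYSPARK_PYTHON is picking up a standard library (via PYTHONHOME/PYTHONPATH) built for a different Python build. A small diagnostic sketch, assuming a working pyspark shell with sc available, to compare what the driver and the workers actually run:

from __future__ import print_function
import sys

# Driver side: which binary and which installation prefix are in use.
print("driver:", sys.executable, tuple(sys.version_info), sys.prefix)

# Worker side: run the same check inside a task and compare the two.
worker = sc.parallelize([0], 1).map(
    lambda _: (sys.executable, tuple(sys.version_info), sys.prefix)
).first()
print("worker:", worker)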

Parsing a text file to split at specific positions using pyspark

微笑、不失礼 submitted on 2019-12-23 04:39:07
Question: I have a text file which is not delimited by any character, and I want to split it at specific positions so that I can convert it to a dataframe. Example data in file1.txt:

1JITENDER33
2VIRENDER28
3BIJENDER37

I want to split the file so that positions 0 to 1 go into the first column, positions 2 to 9 go into the second column, and positions 10 to 11 go into the third column, so that I can finally convert it into a Spark dataframe.

Answer 1: You can use the Python code below to read your input file and make it
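A minimal PySpark sketch of the position-based split using substring (1-based start, length); the column names are placeholders:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Read the fixed-width file as a single 'value' column, then slice by position.
raw = spark.read.text("file1.txt")
df = raw.select(
    F.substring("value", 1, 2).alias("col1"),   # characters at positions 0-1
    F.substring("value", 3, 8).alias("col2"),   # positions 2-9
    F.substring("value", 11, 2).alias("col3"),  # positions 10-11
)
df.show()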

PySpark: how to groupby, resample and forward-fill null values?

旧街凉风 submitted on 2019-12-23 04:33:18
Question: Considering the dataset below in Spark, I would like to resample the dates with a specific frequency (e.g. 5 minutes).

START_DATE = dt.datetime(2019,8,15,20,33,0)
test_df = pd.DataFrame({
    'school_id': ['remote','remote','remote','remote','onsite','onsite','onsite','onsite','remote','remote'],
    'class_id': ['green', 'green', 'red', 'red', 'green', 'green', 'green', 'green', 'red', 'green'],
    'user_id': [15,15,16,16,15,17,17,17,16,17],
    'status': [0,1,1,1,0,1,0,1,1,0],
    'start': pd.date_range(start
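For the forward-fill part, a minimal sketch using a window ordered by time within each group; the tiny dataset below stands in for the output of an upstream 5-minute resample (null status where a gap was inserted):

import datetime as dt
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

rows = [
    ("remote", "green", 15, dt.datetime(2019, 8, 15, 20, 35), 0),
    ("remote", "green", 15, dt.datetime(2019, 8, 15, 20, 40), None),
    ("remote", "green", 15, dt.datetime(2019, 8, 15, 20, 45), 1),
    ("remote", "green", 15, dt.datetime(2019, 8, 15, 20, 50), None),
]
sdf = spark.createDataFrame(rows, ["school_id", "class_id", "user_id", "start", "status"])

# Forward-fill: within each group, ordered by time, take the last non-null
# status seen so far.
w = (Window.partitionBy("school_id", "class_id", "user_id")
           .orderBy("start")
           .rowsBetween(Window.unboundedPreceding, Window.currentRow))

sdf.withColumn("status_filled", F.last("status", ignorenulls=True).over(w)).show()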

Spark Will Not Load Large MySql Table: Java Communications link failure - Timing Out

跟風遠走 submitted on 2019-12-23 04:24:38
Question: I'm trying to pull a fairly large table from MySQL so I can manipulate it using Spark/Databricks. I can't get it to load into Spark: I have tried taking smaller subsets, but even at the smallest reasonable unit it still fails to load. I have tried playing with wait_timeout and interactive_timeout in MySQL, but it doesn't seem to make any difference. I am also loading a smaller (different) table, and that loads just fine.

df_dataset = get_jdbc('raw_data_load', predicates=predicates).select(
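One common mitigation is to read the table through Spark's stock JDBC partitioning options, so no single connection has to stream the whole table before a timeout hits; a sketch with placeholder connection details (this uses the built-in reader rather than the get_jdbc helper from the question, and assumes a roughly uniform numeric column to partition on):

df = (spark.read.format("jdbc")
      .option("url", "jdbc:mysql://<host>:3306/<db>")
      .option("dbtable", "raw_data_load")
      .option("user", "<user>")
      .option("password", "<password>")
      .option("partitionColumn", "id")     # hypothetical numeric, roughly uniform column
      .option("lowerBound", "1")
      .option("upperBound", "100000000")
      .option("numPartitions", "32")
      .option("fetchsize", "10000")        # fetch-size hint; behaviour is driver-dependent
      .load())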

Calculating Cumulative sum in PySpark using Window Functions

…衆ロ難τιáo~ submitted on 2019-12-23 04:24:12
Question: I have the following sample DataFrame:

rdd = sc.parallelize([(1,20), (2,30), (3,30)])
df2 = spark.createDataFrame(rdd, ["id", "duration"])
df2.show()

+---+--------+
| id|duration|
+---+--------+
|  1|      20|
|  2|      30|
|  3|      30|
+---+--------+

I want to sort this DataFrame in descending order of duration and add a new column which holds the cumulative sum of the duration. So I did the following:

windowSpec = Window.orderBy(df2['duration'].desc())
df_cum_sum = df2.withColumn("duration_cum_sum", sum('duration
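A complete sketch of the cumulative sum, using functions.sum rather than the Python built-in, plus an explicit row frame and a tie-breaker so rows with equal duration do not share the same running total:

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df2 = spark.createDataFrame([(1, 20), (2, 30), (3, 30)], ["id", "duration"])

# Order by duration descending with id as a tie-breaker, and accumulate row by row.
w = (Window.orderBy(F.col("duration").desc(), F.col("id"))
           .rowsBetween(Window.unboundedPreceding, Window.currentRow))

df2.withColumn("duration_cum_sum", F.sum("duration").over(w)).show()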

Pyspark Structured Streaming Kafka configuration error

↘锁芯ラ submitted on 2019-12-23 04:21:29
Question: I've been using pyspark for Spark Streaming (Spark 2.0.2) with Kafka (0.10.1.0) successfully before, but my purposes are better suited to Structured Streaming. I've attempted to use the example online: https://spark.apache.org/docs/2.1.0/structured-streaming-kafka-integration.html with the following analogous code:

ds1 = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
    .option("subscribe", "topic1") \
    .load()

query = ds1 \
    .writeStream \
    .outputMode(
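A sketch of the full pipeline; the usual cause of configuration errors here is launching without the Kafka source package (e.g. spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0, with the artifact matching your Spark/Scala versions). Hosts and topic are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-structured-streaming").getOrCreate()

# Read from Kafka as a streaming source.
ds1 = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
       .option("subscribe", "topic1")
       .load())

# Kafka key/value arrive as binary; cast them for inspection and write to the console sink.
query = (ds1.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
         .writeStream
         .outputMode("append")
         .format("console")
         .start())

query.awaitTermination()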