apache-spark

Spark History Server on S3A FileSystem: ClassNotFoundException

半城伤御伤魂 submitted on 2021-02-06 09:18:37
Question: Spark can use the Hadoop S3A file system org.apache.hadoop.fs.s3a.S3AFileSystem. By adding the following to conf/spark-defaults.conf, I can get spark-shell to log to the S3 bucket:
spark.jars.packages net.java.dev.jets3t:jets3t:0.9.0,com.google.guava:guava:16.0.1,com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.eventLog.enabled true
spark.eventLog.dir s3a://spark-logs-test/
spark.history.fs
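A minimal sketch of the same event-log configuration applied from PySpark instead of spark-defaults.conf; the bucket s3a://spark-logs-test/ and package versions are the ones quoted in the question:

from pyspark.sql import SparkSession

# Sketch only: mirrors the spark-defaults.conf entries from the question.
spark = (
    SparkSession.builder
    .appName("s3a-event-logging")
    .config("spark.jars.packages",
            "com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "s3a://spark-logs-test/")
    .getOrCreate()
)

Note that spark.jars.packages is resolved for applications such as spark-shell; the History Server is a separate daemon, so it generally needs the hadoop-aws and aws-java-sdk jars on its own classpath, which is one common cause of the ClassNotFoundException in the title.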

Compare two dataframes Pyspark

馋奶兔 submitted on 2021-02-06 06:31:48
Question: I'm trying to compare two data frames that have the same number of columns, i.e. 4 columns, with id as the key column in both data frames.
df1 = spark.read.csv("/path/to/data1.csv")
df2 = spark.read.csv("/path/to/data2.csv")
Now I want to append a new column to df2, column_names, which is the list of the columns whose values differ from df1:
df2.withColumn("column_names", udf())
DF1
+------+------+------+---------+
| id   | name | sal  | Address |
+------+------+------+---------+
|    1 | ABC  | 5000 | US
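One possible approach, sketched below under the assumption that both CSVs have a header row and share the columns id, name, sal, Address: join on id and collect the names of the columns whose values differ.

from pyspark.sql import SparkSession, functions as F

# Sketch only: file paths and column names are taken from the question;
# header=True and the inner join on id are assumptions.
spark = SparkSession.builder.appName("compare-dataframes").getOrCreate()

df1 = spark.read.csv("/path/to/data1.csv", header=True)
df2 = spark.read.csv("/path/to/data2.csv", header=True)

compare_cols = [c for c in df1.columns if c != "id"]

joined = df1.alias("a").join(df2.alias("b"), on="id")

# For each compared column, emit its name when the two sides differ, else null.
# (A null-safe comparison would need eqNullSafe instead of !=.)
diff_raw = F.array(*[
    F.when(F.col("a." + c) != F.col("b." + c), F.lit(c)) for c in compare_cols
])

result = (
    joined
    .select("id",
            *[F.col("b." + c).alias(c) for c in compare_cols],
            diff_raw.alias("diff_raw"))
    .withColumn("column_names", F.expr("filter(diff_raw, c -> c IS NOT NULL)"))
    .drop("diff_raw")
)
result.show(truncate=False)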

How to handle small file problem in spark structured streaming?

半世苍凉 submitted on 2021-02-06 02:59:53
Question: I have a scenario in my project where I am reading Kafka topic messages using spark-sql-2.4.1. I am able to process the data using structured streaming. Once the data is received and processed, I need to save it into the respective Parquet files in the HDFS store. I am able to store and read the Parquet files, and I kept a trigger time of 15 seconds to 1 minute. These files are very small in size, hence resulting in many files. These Parquet files need to be read later by hive
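One common mitigation, sketched below with placeholder broker, topic, and paths (none of these are from the question): widen the trigger interval and coalesce each micro-batch so fewer, larger Parquet files are written per trigger.

from pyspark.sql import SparkSession

# Sketch only: broker, topic, and HDFS paths are illustrative placeholders.
spark = SparkSession.builder.appName("kafka-to-parquet").getOrCreate()

stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "input-topic")
    .load()
    .selectExpr("CAST(value AS STRING) AS value", "timestamp")
)

# Fewer output partitions + a longer trigger interval means fewer, larger files.
query = (
    stream.coalesce(1)
    .writeStream
    .format("parquet")
    .option("path", "hdfs:///data/output")
    .option("checkpointLocation", "hdfs:///data/checkpoints")
    .trigger(processingTime="5 minutes")
    .start()
)
query.awaitTermination()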

Converting a dataframe into JSON (in pyspark) and then selecting desired fields

醉酒当歌 submitted on 2021-02-06 02:41:06
Question: I'm new to Spark. I have a dataframe that contains the results of some analysis. I converted that dataframe into JSON so I could display it in a Flask app:
results = result.toJSON().collect()
An example entry in my JSON file is below:
{"userId":"1","systemId":"30","title":"interest"}
I then tried to run a for loop in order to get specific results:
for i in results:
    print i["userId"]
This doesn't work at all, and I get errors such as: Python (json): TypeError: expected string or buffer I used
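For reference, toJSON() yields each row as a JSON string, so every element of results must be parsed before it can be indexed. A minimal sketch, assuming the result dataframe from the question:

import json

# Sketch only: 'result' is the dataframe described in the question.
results = result.toJSON().collect()

for entry in results:
    row = json.loads(entry)   # parse the JSON string into a dict
    print(row["userId"])

# Alternatively, skip JSON entirely and iterate over Row objects:
# for row in result.collect():
#     print(row["userId"])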

Spark Driver memory and Application Master memory

倾然丶 夕夏残阳落幕 submitted on 2021-02-05 20:26:50
Question: Am I understanding the documentation for client mode correctly? Client mode is opposed to cluster mode, where the driver runs within the application master? In client mode the driver and application master are separate processes, and therefore spark.driver.memory + spark.yarn.am.memory must be less than the machine's memory? In client mode, is the driver memory not included in the application master memory setting? Answer 1: Client mode is opposed to cluster mode where the driver runs within the
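A sketch of the two settings in the spark-defaults.conf form used in the first question above (the values are placeholders); in YARN client mode they size two different processes, the driver on the submitting machine and the lightweight application master on the cluster:

# Driver JVM, which in client mode runs in the spark-submit / client process
spark.driver.memory    4g
# Separate, lightweight YARN application master (client mode only)
spark.yarn.am.memory   1g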

SparkContext Error - File not found /tmp/spark-events does not exist

狂风中的少年 submitted on 2021-02-05 20:19:06
Question: Running a Python Spark application via an API call. On submitting the application, the response is Failed. I SSH into the worker; my Python application exists in /root/spark/work/driver-id/wordcount.py and the error can be found in /root/spark/work/driver-id/stderr, which shows the following error:
Traceback (most recent call last):
  File "/root/wordcount.py", line 34, in <module>
    main()
  File "/root/wordcount.py", line 18, in main
    sc = SparkContext(conf=conf)
  File "/root/spark/python/lib/pyspark.zip/pyspark/context.py
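The title points at the default event-log location: when spark.eventLog.enabled is on and spark.eventLog.dir is left at its default of file:///tmp/spark-events, SparkContext creation fails if that directory does not exist on the host running the driver. A minimal sketch of one way to handle it (app name and paths are illustrative):

import os
from pyspark import SparkConf, SparkContext

# Sketch only: either create the directory on the driver host, or point
# spark.eventLog.dir at a location that already exists (a shared path such
# as HDFS is common when the driver can land on different worker nodes).
os.makedirs("/tmp/spark-events", exist_ok=True)

conf = (
    SparkConf()
    .setAppName("wordcount")
    .set("spark.eventLog.enabled", "true")
    .set("spark.eventLog.dir", "file:///tmp/spark-events")
)
sc = SparkContext(conf=conf)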