apache-spark

Spark test on local machine

时间秒杀一切 posted on 2021-01-27 17:12:59
Question: I am running unit tests on Spark 1.3.1 with sbt test and, besides the unit tests being incredibly slow, I keep running into java.lang.ClassNotFoundException: org.apache.spark.storage.RDDBlockId issues. Usually this means a dependency issue, but I wouldn't know from where. I tried installing everything on a new machine, including a fresh Hadoop and a fresh ivy2, but I still run into the same issue. Any help is greatly appreciated. Exception: Exception in thread "Driver Heartbeater" java.lang
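A common culprit for a ClassNotFoundException on Spark's own classes under sbt test is running the suites inside sbt's JVM without forking. Below is a minimal, hedged sketch of build.sbt settings (sbt 0.13-era syntax; the memory values are assumptions, not taken from the question) that are typically tried first:

    // Hypothetical build.sbt settings for running Spark 1.3.1 tests locally.
    // Forking gives the tests their own JVM and classpath instead of sbt's.
    fork in Test := true

    // Spark tests are memory-hungry; give the forked JVM some headroom.
    javaOptions in Test ++= Seq("-Xmx2g", "-XX:MaxPermSize=512m")

    // Run suites one at a time so multiple SparkContexts do not collide.
    parallelExecution in Test := false

    // Keep the Spark version under test aligned with the driver/cluster version.
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.1"

If the exception persists with forking enabled, the next thing to check is that every Spark artifact on the test classpath resolves to the same 1.3.1 version.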

Is it possible to force schema definition when loading tables from AWS RDS (MySQL)

最后都变了- posted on 2021-01-27 16:45:37
Question: I'm using Apache Spark to read data from a MySQL database on AWS RDS. It is actually inferring the schema from the database as well. Unfortunately, one of the table's columns is of type TINYINT(1) (column name: active). The active column has the following values: non active, active, pending, etc. Spark recognizes TINYINT(1) as BooleanType, so it changes every value in active to true or false. As a result, I can't identify the original values. Is it possible to force a schema definition when loading tables
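One hedged workaround, assuming the MySQL Connector/J driver (whose tinyInt1isBit property controls this mapping), is to stop the driver from presenting TINYINT(1) as a bit/boolean so Spark infers an integer column and keeps the raw values; the host, table, and credentials below are placeholders:

    // Hypothetical sketch: tinyInt1isBit=false asks Connector/J to expose
    // TINYINT(1) as an integer instead of a boolean, so Spark no longer maps
    // the 'active' column to BooleanType.
    val jdbcUrl = "jdbc:mysql://my-rds-host:3306/mydb?tinyInt1isBit=false"

    val df = spark.read
      .format("jdbc")
      .option("url", jdbcUrl)
      .option("dbtable", "my_table")        // placeholder table name
      .option("user", "my_user")            // placeholder credentials
      .option("password", "my_password")
      .load()

    // 'active' now arrives as a small integer (0, 1, 2, ...) that can be mapped
    // back to its business meaning (non active / active / pending) explicitly.
    df.printSchema()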

PySpark UDF optimization challenge

◇◆丶佛笑我妖孽 posted on 2021-01-27 15:01:14
Question: I am trying to optimize the code below. When run with 1,000 rows of data it takes about 12 minutes to complete. Our use case would require data sizes of around 25K-50K rows, which would make this implementation completely infeasible. import pyspark.sql.types as Types import numpy import spacy from pyspark.sql.functions import udf inputPath = "s3://myData/part-*.parquet" df = spark.read.parquet(inputPath) test_df = df.select('uid', 'content').limit(1000).repartition(10) # print(df.rdd

org.apache.spark.sql.AnalysisException: 'write' can not be called on streaming Dataset/DataFrame

三世轮回 posted on 2021-01-27 14:23:40
Question: I'm trying to write a Spark Structured Streaming (2.3) dataset to ScyllaDB (Cassandra). My code to write the dataset: def saveStreamSinkProvider(ds: Dataset[InvoiceItemKafka]) = { ds .writeStream .format("cassandra.ScyllaSinkProvider") .outputMode(OutputMode.Append) .queryName("KafkaToCassandraStreamSinkProvider") .options( Map( "keyspace" -> namespace, "table" -> StreamProviderTableSink, "checkpointLocation" -> "/tmp/checkpoints" ) ) .start() } My ScyllaDB streaming sink: class
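The exception itself comes from calling the batch writer on a streaming Dataset. If upgrading past 2.3 is an option, a hedged alternative to the custom sink is foreachBatch, sketched below assuming Spark 2.4+ and the spark-cassandra-connector data source; the keyspace, table, and query names are placeholders:

    // Hypothetical sketch, assuming Spark 2.4+ (foreachBatch) and the
    // spark-cassandra-connector on the classpath.
    import org.apache.spark.sql.Dataset

    def saveToScylla(ds: Dataset[InvoiceItemKafka]) =
      ds.writeStream
        .foreachBatch { (batch: Dataset[InvoiceItemKafka], batchId: Long) =>
          // Inside foreachBatch the data is a plain (non-streaming) Dataset,
          // so the ordinary batch writer is allowed here.
          batch.write
            .format("org.apache.spark.sql.cassandra")
            .option("keyspace", "my_keyspace")   // placeholder
            .option("table", "invoice_items")    // placeholder
            .mode("append")
            .save()
        }
        .option("checkpointLocation", "/tmp/checkpoints")
        .queryName("KafkaToScyllaForeachBatch")
        .start()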

PYSPARK: CX_ORACLE.InterfaceError: not a query

混江龙づ霸主 posted on 2021-01-27 13:57:24
Question: I need to perform an update query in a Spark job. I am trying the code below but facing issues. import cx_Oracle def query(sql): connection = cx_Oracle.connect("username/password@s<url>/db") cursor = connection.cursor() cursor.execute(sql) result = cursor.fetchall() return result v = [10] rdd = sc.parallelize(v).coalesce(1) rdd.foreachPartition(lambda x : [query("UPDATE db.tableSET MAPPERS ="+str(i)+" WHERE TABLE_NAME = 'table_name'") for i in x]) When I execute the above process I am getting the below

Spark throws error when reading Hive table

谁说胖子不能爱 posted on 2021-01-27 13:56:17
Question: I am trying to do select * from db.abc in Hive; this Hive table was loaded using Spark. It does not work and shows an error: Error: java.io.IOException: java.lang.IllegalArgumentException: bucketId out of range: -1 (state=,code=0). When I use the following properties I was able to query Hive: set hive.mapred.mode=nonstrict; set hive.optimize.ppd=true; set hive.optimize.index.filter=true; set hive.tez.bucket.pruning=true; set hive.explain.user=false; set hive.fetch.task.conversion=none; Now when
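The bucketId error typically appears when Hive applies transactional/bucketing expectations to files that Spark wrote. As a hedged workaround (a sketch, not a definitive fix), the data can be queried through Spark's own Hive support, or re-written into a plain external table that Hive reads without those semantics; table and path names below are placeholders:

    // Hypothetical sketch: read the Spark-written table through Spark itself,
    // which does not enforce Hive's bucket metadata the way Hive-on-Tez does.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("read-spark-written-table")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("SELECT * FROM db.abc").show()

    // Alternatively, land a copy in a non-transactional external table so Hive
    // does not treat the Spark-produced files as ACID/bucketed data.
    spark.table("db.abc")
      .write
      .mode("overwrite")
      .option("path", "/warehouse/external/abc_ext")   // placeholder location
      .saveAsTable("db.abc_ext")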

How to use CROSS JOIN and CROSS APPLY in Spark SQL

…衆ロ難τιáo~ posted on 2021-01-27 13:51:38
Question: I am very new to Spark and Scala, and I am writing Spark SQL code. I am in a situation where I need to apply CROSS JOIN and CROSS APPLY in my logic. Here is the SQL query which I have to convert to Spark SQL: select Table1.Column1,Table2.Column2,Table3.Column3 from Table1 CROSS JOIN Table2 CROSS APPLY Table3 I need the above query converted to SQLContext in Spark SQL. Kindly help me. Thanks in advance. Answer 1: First set the below property in the Spark conf: spark.sql.crossJoin.enabled=true then dataFrame1
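For reference, a hedged DataFrame-API sketch of the same shape is below. CROSS JOIN maps directly onto crossJoin; CROSS APPLY has no dedicated Spark keyword, and over a plain table it reduces to another cross join (applying a table-valued expression per row would instead use explode()/LATERAL VIEW). Table and column names are taken from the question:

    // Hypothetical sketch; spark.sql.crossJoin.enabled=true is only required on
    // Spark versions that block cross joins by default.
    spark.conf.set("spark.sql.crossJoin.enabled", "true")

    val table1 = spark.table("Table1")
    val table2 = spark.table("Table2")
    val table3 = spark.table("Table3")

    // CROSS JOIN: every row of Table1 paired with every row of Table2;
    // CROSS APPLY over the plain table Table3 is chained as a second cross join.
    val result = table1
      .crossJoin(table2)
      .crossJoin(table3)
      .select(table1("Column1"), table2("Column2"), table3("Column3"))

    result.show()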

Spark stuck at removing broadcast variable (probably)

孤人 posted on 2021-01-27 13:14:18
Question: Spark 2.0.0-preview. We've got an app that uses a fairly big broadcast variable. We run this on a big EC2 instance, so deployment is in client mode. The broadcast variable is a massive Map[String, Array[String]]. At the end of saveAsTextFile, the output in the folder seems to be complete and correct (apart from the .crc files still being there), BUT the spark-submit process is stuck on, seemingly, removing the broadcast variable. The stuck logs look like this: http://pastebin.com/wpTqvArY My last
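One hedged mitigation, sketched below with stand-in data, is to release the broadcast explicitly once the last action that uses it has finished and then stop the context deliberately, rather than leaving the cleanup to the ContextCleaner at JVM exit; the app name, data, and output path are placeholders:

    // Hypothetical sketch: unpersist/destroy a large broadcast after use,
    // then shut the SparkContext down explicitly.
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(
      new SparkConf().setAppName("broadcast-cleanup-sketch").setMaster("local[*]"))

    // Stand-in for the app's big Map[String, Array[String]] broadcast.
    val lookup: Map[String, Array[String]] = Map("a" -> Array("x", "y"))
    val lookupBc = sc.broadcast(lookup)

    val result = sc.parallelize(Seq("a", "b"))
      .map(k => k -> lookupBc.value.getOrElse(k, Array.empty[String]).length)

    result.saveAsTextFile("/tmp/broadcast-cleanup-output")   // placeholder path

    // Drop the executor copies without blocking, then remove the broadcast's
    // metadata entirely before stopping the context.
    lookupBc.unpersist(blocking = false)
    lookupBc.destroy()
    sc.stop()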

Spark worker throws FileNotFoundException on temporary shuffle files

耗尽温柔 posted on 2021-01-27 08:00:52
Question: I am running a Spark application that processes multiple sets of data points; some of these sets need to be processed sequentially. When running the application with small sets of data points (about 100), everything works fine. But in some cases the sets have a size of about 10,000 data points, and those cause the worker to crash with the following stack trace: Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 26.0 failed 4 times,
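When temporary shuffle files disappear it is usually because the executor that wrote them died mid-stage, often from memory pressure or long GC pauses. A hedged sketch of the configuration knobs commonly tried, assuming Spark 2.x and with placeholder values to tune rather than recommendations, is:

    // Hypothetical sketch of settings often adjusted when shuffle files vanish
    // because an executor was lost partway through a stage.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("shuffle-tuning-sketch")
      .config("spark.executor.memory", "8g")        // headroom so executors are not OOM-killed
      .config("spark.network.timeout", "600s")      // tolerate long GC pauses before peers give up
      .config("spark.shuffle.io.maxRetries", "10")  // keep retrying shuffle block fetches
      .config("spark.shuffle.io.retryWait", "30s")  // wait longer between fetch retries
      .getOrCreate()

    // More, smaller partitions keep each task's shuffle spill files manageable
    // for the ~10,000-point sets described in the question.
    val points = spark.read.parquet("/data/points")            // placeholder input
    points.repartition(200).write.parquet("/data/points_out")  // placeholder output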

Spark Error: Unable to find encoder for type stored in a Dataset

China☆狼群 posted on 2021-01-27 07:50:22
Question: I am using Spark in a Zeppelin notebook, and groupByKey() does not seem to be working. This code: df.groupByKey(row => row.getLong(0)) .mapGroups((key, iterable) => println(key)) gives me this error (presumably a compilation error, since it shows up in no time while the dataset I am working on is pretty big): error: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for
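The error here usually has two causes at once: the implicit encoders are not in scope, and mapGroups returns Unit (the result of println), for which no encoder exists. A hedged sketch, assuming the Zeppelin-provided spark session and the question's df, is:

    // Hypothetical sketch: bring the encoders into scope and return a type
    // Spark can encode (a (Long, Int) pair here) instead of Unit.
    import spark.implicits._

    val groupSizes = df
      .groupByKey(row => row.getLong(0))
      .mapGroups((key, rows) => (key, rows.size))

    groupSizes.show()

    // If only the side effect is wanted, collect the keys to the driver first:
    // df.groupByKey(row => row.getLong(0)).keys.collect().foreach(println)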