spark-dataframe

Spark: 'Requested array size exceeds VM limit' when writing dataframe

半世苍凉 submitted on 2019-12-13 03:16:51
Question: I am running into an "OutOfMemoryError: Requested array size exceeds VM limit" error when running my Scala Spark job. I'm running this job on an AWS EMR cluster with the following makeup: Master: 1 m4.4xlarge (32 vCore, 64 GiB memory); Core: 1 r3.4xlarge (32 vCore, 122 GiB memory). The version of Spark I'm using is 2.2.1 on EMR release label 5.11.0. I'm running my job in a spark-shell with the following configuration: spark-shell --conf spark.driver.memory=40G --conf spark.driver.maxResultSize=25G
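This error usually means some single buffer, often one partition's worth of output or a result collected to the driver, is approaching the JVM's roughly 2 GB array limit. A minimal sketch of one common mitigation, spreading the data over more partitions before the write so no single task materializes an oversized buffer (the paths and the partition count are assumptions, not from the post):

// Hypothetical paths and partition count; the idea is just to keep each
// output partition well under the 2 GB array ceiling.
val df = spark.read.parquet("s3://my-bucket/input/")
df.repartition(400)
  .write
  .mode("overwrite")
  .csv("s3://my-bucket/output/")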

Write dataframe to csv with datatype map<string,bigint> in Spark

余生颓废 submitted on 2019-12-13 02:48:05
Question: I have a file, file1snappy.parquet, which has a complex data structure (a map with an array nested inside it). After processing it I got a final result, but while writing that result to CSV I get an error saying "Exception in thread "main" java.lang.UnsupportedOperationException: CSV data source does not support map<string,bigint> data type." The code I used: val conf = new SparkConf().setAppName("student-example").setMaster("local") val sc = new SparkContext(conf) val sqlcontext =
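The CSV data source only handles atomic column types, so a common workaround is to serialize the complex column to a string before writing. A minimal sketch using to_json, assuming a Spark version (2.3+) where to_json accepts map columns; the dataframe name result and the column name mapCol are placeholders, not from the post:

import org.apache.spark.sql.functions.{col, to_json}
// Turn the map<string,bigint> column into a JSON string so CSV can write it
val writable = result.withColumn("mapCol", to_json(col("mapCol")))
writable.write.option("header", "true").csv("/tmp/output")   // assumed output path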

Custom schema in spark-csv throwing error in spark 1.4.1

只愿长相守 submitted on 2019-12-13 01:27:30
Question: I am trying to process a CSV file using the spark-csv package in spark-shell on Spark 1.4.1. scala> import org.apache.spark.sql.hive.HiveContext import org.apache.spark.sql.hive.HiveContext scala> import org.apache.spark.sql.hive.orc._ import org.apache.spark.sql.hive.orc._ scala> import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}; import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType} scala> val hiveContext = new org.apache.spark.sql
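For reference, a minimal sketch of how a custom schema is normally passed to spark-csv on a 1.x context; the file path and column names here are assumptions:

import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

val customSchema = StructType(Array(
  StructField("id", IntegerType, true),      // hypothetical columns
  StructField("name", StringType, true)))

val df = hiveContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .schema(customSchema)
  .load("/tmp/data.csv")                     // assumed input path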

java.lang.IllegalArgumentException: Can't get JDBC type for array<string>

谁说胖子不能爱 submitted on 2019-12-13 00:48:21
Question: I want to import the output data into a MySQL database, but the following error occurs. I can't work out how to convert the array to the string type the database expects; can anyone help? val Array(trainingData, testData) = msgDF.randomSplit(Array(0.9, 0.1)) val pipeline = new Pipeline().setStages(Array(labelIndexer, word2Vec, mlpc, labelConverter)) val model = pipeline.fit(trainingData) val predictionResultDF = model.transform(testData) val rows = predictionResultDF.select("song", "label", "predictedLabel") val df =
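JDBC has no mapping for array<string>, so the usual fix is to flatten the array column into a plain string before the write. A minimal sketch using concat_ws; the dataframe name rows, the column name words, and the connection details are assumptions:

import org.apache.spark.sql.functions.{col, concat_ws}
// Join the array<string> elements into a single comma-separated string
val jdbcReady = rows.withColumn("words", concat_ws(",", col("words")))
jdbcReady.write
  .mode("append")
  .jdbc("jdbc:mysql://localhost:3306/test", "predictions", new java.util.Properties())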

Find and remove matching column values in pyspark

痴心易碎 submitted on 2019-12-12 19:15:31
Question: I have a PySpark dataframe where occasionally a column will have a wrong value that matches another column. It looks something like this:
| Date       | Latitude   |
| 2017-01-01 | 43.4553    |
| 2017-01-02 | 42.9399    |
| 2017-01-03 | 43.0091    |
| 2017-01-04 | 2017-01-04 |
Obviously, the last Latitude value is incorrect. I need to remove any and all rows like this. I thought about using .isin() but I can't seem to get it to work. If I try df['Date'].isin(['Latitude']) I get: Column<(Date
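The simpler approach is a row-level comparison instead of isin: keep only rows where the two columns differ. The question is in PySpark, but a Scala sketch of the same filter (toy data, Spark 2.x syntax) looks like this:

import org.apache.spark.sql.functions.col
import spark.implicits._          // already in scope inside spark-shell
// Toy data mirroring the example table; both columns read as strings
val df = Seq(
  ("2017-01-01", "43.4553"),
  ("2017-01-04", "2017-01-04")
).toDF("Date", "Latitude")
// Drop rows where the Latitude cell accidentally repeats the Date value
val cleaned = df.filter(col("Latitude") =!= col("Date"))
cleaned.show()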

spark out of memory multiple iterations

早过忘川 submitted on 2019-12-12 19:05:38
Question: I have a Spark job (running on Spark 1.3.1) that has to iterate over several keys (about 42) and process each one. Here is the structure of the program: get a key from a map; fetch the data matching that key from Hive (Hadoop/YARN underneath) as a data frame; process the data; write the results to Hive. When I run this for one key, everything works fine. When I run it with all 42 keys, I get an out-of-memory exception around the 12th iteration. Is there a way I can clean up memory between iterations?
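One common approach is to cache each per-key DataFrame only for the duration of its iteration and explicitly unpersist it once the results are written, so its blocks can be freed before the next key. A sketch of that loop shape with Spark 1.3-era APIs; the table names, the key list, and the processing step are placeholders, not from the post:

import org.apache.spark.sql.SaveMode
val keys = Seq("k1", "k2")   // stand-in for the ~42 real keys
keys.foreach { key =>
  // Hypothetical per-key fetch from Hive
  val df = hiveContext.sql(s"SELECT * FROM source_table WHERE key = '$key'")
  df.cache()
  val result = df.groupBy("key").count()               // stand-in for the real processing
  result.saveAsTable("results_table", SaveMode.Append)
  df.unpersist()                                        // release cached blocks before the next key
}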

Java Spark DataFrameReader java.lang.NegativeArraySizeException

时光总嘲笑我的痴心妄想 submitted on 2019-12-12 18:16:11
Question: I am learning Spark for Java and trying to read a .csv file as a DataFrame using DataFrameReader, but I can't even get a super simple .csv file to work: I keep getting java.lang.NegativeArraySizeException. Here is what I am doing: public void test() { DataFrameReader dataFrameReader = new DataFrameReader(getSparkSession()); StructType parentSchema = new StructType(new StructField[] { DataTypes.createStructField("NAME", DataTypes.StringType, false), }); Dataset<Row> parents =

Why does Spark append 'WHERE 1=0' at the end of the SQL query

雨燕双飞 submitted on 2019-12-12 17:41:21
Question: I am trying to execute a simple MySQL query using Apache Spark and create a data frame. But for some reason Spark appends 'WHERE 1=0' to the end of the query I want to execute and throws an exception stating 'You have an error in your SQL syntax'. val spark = SparkSession.builder.master("local[*]").appName("rddjoin").getOrCreate() val mhost = "jdbc:mysql://localhost:3306/registry" val mprop = new java.util.Properties mprop.setProperty("driver", "com.mysql.jdbc.Driver") mprop
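For context: the JDBC source runs a query of the form SELECT * FROM <dbtable> WHERE 1=0 once, just to discover the schema without fetching rows, so the value passed as the table must be something that can legally sit in a FROM clause: a table name or a parenthesized subquery with an alias, not a bare SELECT statement. A minimal sketch of the subquery form; the table, columns, and credentials are assumptions:

val url = "jdbc:mysql://localhost:3306/registry"
val props = new java.util.Properties
props.setProperty("driver", "com.mysql.jdbc.Driver")
props.setProperty("user", "root")          // assumed credentials
props.setProperty("password", "secret")
// An aliased subquery keeps the generated "... WHERE 1=0" valid SQL
val query = "(SELECT id, name FROM users WHERE active = 1) AS t"
val df = spark.read.jdbc(url, query, props)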

Spark Streaming application gives OOM after running for 24 hours

瘦欲@ submitted on 2019-12-12 16:28:29
Question: I am using Spark 1.5.0 and working on a Spark Streaming application. The application reads files from HDFS, converts the RDDs into dataframes, and executes multiple queries on each dataframe. The application runs perfectly for around 24 hours and then it crashes. The application master / driver logs show: Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: GC overhead limit exceeded at java.lang.Class.getDeclaredMethods0(Native Method) at java.lang.Class
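One frequent source of slow driver-memory growth in long-running streaming jobs is the job, stage, and batch metadata retained for the web UI; capping it is a common first mitigation, though not necessarily the root cause here. A sketch of the relevant settings applied through SparkConf (the limits chosen are arbitrary):

import org.apache.spark.SparkConf
val conf = new SparkConf()
  .setAppName("streaming-app")                       // hypothetical app name
  .set("spark.ui.retainedJobs", "200")               // defaults are much higher
  .set("spark.ui.retainedStages", "200")
  .set("spark.streaming.ui.retainedBatches", "200")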

Meaning of Exchange in Spark Stage

喜夏-厌秋 submitted on 2019-12-12 10:50:22
Question: Can anyone explain the meaning of Exchange in my Spark stages in the Spark DAG? Most of my stages either start or end with an Exchange: 1) WholeStageCodeGen -> Exchange 2) Exchange -> WholeStageCodeGen -> SortAggregate -> Exchange Answer 1: Whole-stage code generation is a technique inspired by modern compilers that collapses the entire query into a single function. Prior to whole-stage code generation, each physical plan was a class with the code defining the execution. With whole-stage code generation,
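An Exchange node marks a shuffle boundary: rows have to be redistributed across partitions (for example before an aggregation or a join), and that is also where Spark splits the plan into stages. A small sketch that makes one appear in the physical plan (the data is made up; the exact plan text varies by version):

import spark.implicits._   // already in scope inside spark-shell
val df = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("key", "value")
df.groupBy("key").sum("value").explain()
// Prints something like:
//   *HashAggregate(keys=[key], functions=[sum(value)])        <- final aggregate, codegen'd (leading *)
//   +- Exchange hashpartitioning(key, 200)                    <- shuffle boundary between stages
//      +- *HashAggregate(keys=[key], functions=[partial_sum(value)])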