spark-dataframe

Spark: 'Requested array size exceeds VM limit' when writing dataframe

半世苍凉 submitted on 2019-12-13 03:16:51
Question: I am running into an "OutOfMemoryError: Requested array size exceeds VM limit" error when running my Scala Spark job. I'm running this job on an AWS EMR cluster with the following makeup: Master: 1 m4.4xlarge (32 vCore, 64 GiB memory); Core: 1 r3.4xlarge (32 vCore, 122 GiB memory). The version of Spark I'm using is 2.2.1 on EMR release label 5.11.0. I'm running my job in a spark-shell with the following configuration: spark-shell --conf spark.driver.memory=40G --conf spark.driver.maxResultSize=25G
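This error usually means some single buffer, often one partition's worth of output or a result collected to the driver, is approaching the JVM's roughly 2 GB array limit. A minimal sketch of one common mitigation, spreading the data over more partitions before the write so no single task materializes an oversized buffer (the paths and the partition count are assumptions, not from the post):

// Hypothetical paths and partition count; the idea is just to keep each
// output partition well under the 2 GB array ceiling.
val df = spark.read.parquet("s3://my-bucket/input/")
df.repartition(400)
  .write
  .mode("overwrite")
  .csv("s3://my-bucket/output/")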

Write dataframe to csv with datatype map<string,bigint> in Spark

余生颓废 submitted on 2019-12-13 02:48:05
Question: I have a file, file1snappy.parquet, which has a complex data structure (a map with an array nested inside it). After processing it I got a final result, but while writing that result to CSV I get an error saying "Exception in thread "main" java.lang.UnsupportedOperationException: CSV data source does not support map<string,bigint> data type." The code I used: val conf = new SparkConf().setAppName("student-example").setMaster("local") val sc = new SparkContext(conf) val sqlcontext =
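The CSV data source only handles atomic column types, so a common workaround is to serialize the complex column to a string before writing. A minimal sketch using to_json, assuming a Spark version (2.3+) where to_json accepts map columns; the dataframe name result and the column name mapCol are placeholders, not from the post:

import org.apache.spark.sql.functions.{col, to_json}
// Turn the map<string,bigint> column into a JSON string so CSV can write it
val writable = result.withColumn("mapCol", to_json(col("mapCol")))
writable.write.option("header", "true").csv("/tmp/output")   // assumed output path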

Custom schema in spark-csv throwing error in spark 1.4.1

只愿长相守 submitted on 2019-12-13 01:27:30
Question: I am trying to process a CSV file using the spark-csv package in spark-shell on Spark 1.4.1. scala> import org.apache.spark.sql.hive.HiveContext import org.apache.spark.sql.hive.HiveContext scala> import org.apache.spark.sql.hive.orc._ import org.apache.spark.sql.hive.orc._ scala> import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}; import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType} scala> val hiveContext = new org.apache.spark.sql
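For reference, a minimal sketch of how a custom schema is normally passed to spark-csv on a 1.x context; the file path and column names here are assumptions:

import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

val customSchema = StructType(Array(
  StructField("id", IntegerType, true),      // hypothetical columns
  StructField("name", StringType, true)))

val df = hiveContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .schema(customSchema)
  .load("/tmp/data.csv")                     // assumed input path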

java.lang.IllegalArgumentException: Can't get JDBC type for array<string>

谁说胖子不能爱 submitted on 2019-12-13 00:48:21
Question: I want to import the output data into a MySQL database, but the following error occurs. I can't work out how to convert the array to the string type the database expects; can anyone help? val Array(trainingData, testData) = msgDF.randomSplit(Array(0.9, 0.1)) val pipeline = new Pipeline().setStages(Array(labelIndexer, word2Vec, mlpc, labelConverter)) val model = pipeline.fit(trainingData) val predictionResultDF = model.transform(testData) val rows = predictionResultDF.select("song", "label", "predictedLabel") val df =
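JDBC has no mapping for array<string>, so the usual fix is to flatten the array column into a plain string before the write. A minimal sketch using concat_ws; the dataframe name rows, the column name words, and the connection details are assumptions:

import org.apache.spark.sql.functions.{col, concat_ws}
// Join the array<string> elements into a single comma-separated string
val jdbcReady = rows.withColumn("words", concat_ws(",", col("words")))
jdbcReady.write
  .mode("append")
  .jdbc("jdbc:mysql://localhost:3306/test", "predictions", new java.util.Properties())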

Find and remove matching column values in pyspark

痴心易碎 submitted on 2019-12-12 19:15:31
Question: I have a PySpark dataframe where occasionally a column will have a wrong value that matches another column. It looks something like this:
| Date       | Latitude   |
| 2017-01-01 | 43.4553    |
| 2017-01-02 | 42.9399    |
| 2017-01-03 | 43.0091    |
| 2017-01-04 | 2017-01-04 |
Obviously, the last Latitude value is incorrect. I need to remove any and all rows like this. I thought about using .isin() but I can't seem to get it to work. If I try df['Date'].isin(['Latitude']) I get: Column<(Date
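The simpler approach is a row-level comparison instead of isin: keep only rows where the two columns differ. The question is in PySpark, but a Scala sketch of the same filter (toy data, Spark 2.x syntax) looks like this:

import org.apache.spark.sql.functions.col
import spark.implicits._          // already in scope inside spark-shell
// Toy data mirroring the example table; both columns read as strings
val df = Seq(
  ("2017-01-01", "43.4553"),
  ("2017-01-04", "2017-01-04")
).toDF("Date", "Latitude")
// Drop rows where the Latitude cell accidentally repeats the Date value
val cleaned = df.filter(col("Latitude") =!= col("Date"))
cleaned.show()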

spark out of memory multiple iterations

早过忘川 submitted on 2019-12-12 19:05:38
Question: I have a Spark job (running on Spark 1.3.1) that has to iterate over several keys (about 42) and process each one. Here is the structure of the program: get a key from a map; fetch the data matching that key from Hive (Hadoop/YARN underneath) as a data frame; process the data; write the results to Hive. When I run this for one key, everything works fine. When I run it with all 42 keys, I get an out-of-memory exception around the 12th iteration. Is there a way I can clean up memory between iterations?
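One common approach is to cache each per-key DataFrame only for the duration of its iteration and explicitly unpersist it once the results are written, so its blocks can be freed before the next key. A sketch of that loop shape with Spark 1.3-era APIs; the table names, the key list, and the processing step are placeholders, not from the post:

import org.apache.spark.sql.SaveMode
val keys = Seq("k1", "k2")   // stand-in for the ~42 real keys
keys.foreach { key =>
  // Hypothetical per-key fetch from Hive
  val df = hiveContext.sql(s"SELECT * FROM source_table WHERE key = '$key'")
  df.cache()
  val result = df.groupBy("key").count()               // stand-in for the real processing
  result.saveAsTable("results_table", SaveMode.Append)
  df.unpersist()                                        // release cached blocks before the next key
}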

Java Spark DataFrameReader java.lang.NegativeArraySizeException

时光总嘲笑我的痴心妄想 submitted on 2019-12-12 18:16:11
Question: I am learning Spark for Java and trying to read a .csv file as a DataFrame using DataFrameReader, but I can't even get a super simple .csv file to work: I keep getting java.lang.NegativeArraySizeException. Here is what I am doing: public void test() { DataFrameReader dataFrameReader = new DataFrameReader(getSparkSession()); StructType parentSchema = new StructType(new StructField[] { DataTypes.createStructField("NAME", DataTypes.StringType, false), }); Dataset<Row> parents =

Why does Spark append 'WHERE 1=0' at the end of the SQL query

雨燕双飞 submitted on 2019-12-12 17:41:21
Question: I am trying to execute a simple MySQL query using Apache Spark and create a data frame. But for some reason Spark appends 'WHERE 1=0' to the end of the query I want to execute and throws an exception stating 'You have an error in your SQL syntax'. val spark = SparkSession.builder.master("local[*]").appName("rddjoin").getOrCreate() val mhost = "jdbc:mysql://localhost:3306/registry" val mprop = new java.util.Properties mprop.setProperty("driver", "com.mysql.jdbc.Driver") mprop
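For context: the JDBC source runs a query of the form SELECT * FROM <dbtable> WHERE 1=0 once, just to discover the schema without fetching rows, so the value passed as the table must be something that can legally sit in a FROM clause: a table name or a parenthesized subquery with an alias, not a bare SELECT statement. A minimal sketch of the subquery form; the table, columns, and credentials are assumptions:

val url = "jdbc:mysql://localhost:3306/registry"
val props = new java.util.Properties
props.setProperty("driver", "com.mysql.jdbc.Driver")
props.setProperty("user", "root")          // assumed credentials
props.setProperty("password", "secret")
// An aliased subquery keeps the generated "... WHERE 1=0" valid SQL
val query = "(SELECT id, name FROM users WHERE active = 1) AS t"
val df = spark.read.jdbc(url, query, props)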

Spark Streaming application gives OOM after running for 24 hours

瘦欲@ submitted on 2019-12-12 16:28:29
Question: I am using Spark 1.5.0 and working on a Spark Streaming application. The application reads files from HDFS, converts the RDDs into dataframes, and executes multiple queries on each dataframe. The application runs perfectly for around 24 hours and then it crashes. The application master / driver logs show: Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: GC overhead limit exceeded at java.lang.Class.getDeclaredMethods0(Native Method) at java.lang.Class
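One frequent source of slow driver-memory growth in long-running streaming jobs is the job, stage, and batch metadata retained for the web UI; capping it is a common first mitigation, though not necessarily the root cause here. A sketch of the relevant settings applied through SparkConf (the limits chosen are arbitrary):

import org.apache.spark.SparkConf
val conf = new SparkConf()
  .setAppName("streaming-app")                       // hypothetical app name
  .set("spark.ui.retainedJobs", "200")               // defaults are much higher
  .set("spark.ui.retainedStages", "200")
  .set("spark.streaming.ui.retainedBatches", "200")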

Meaning of Exchange in Spark Stage

喜夏-厌秋 submitted on 2019-12-12 10:50:22
Question: Can anyone explain the meaning of Exchange in my Spark stages in the Spark DAG? Most of my stages either start or end with an Exchange: 1) WholeStageCodeGen -> Exchange 2) Exchange -> WholeStageCodeGen -> SortAggregate -> Exchange Answer 1: Whole-stage code generation is a technique inspired by modern compilers that collapses the entire query into a single function. Prior to whole-stage code generation, each physical plan was a class with the code defining the execution. With whole-stage code generation,
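An Exchange node marks a shuffle boundary: rows have to be redistributed across partitions (for example before an aggregation or a join), and that is also where Spark splits the plan into stages. A small sketch that makes one appear in the physical plan (the data is made up; the exact plan text varies by version):

import spark.implicits._   // already in scope inside spark-shell
val df = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("key", "value")
df.groupBy("key").sum("value").explain()
// Prints something like:
//   *HashAggregate(keys=[key], functions=[sum(value)])        <- final aggregate, codegen'd (leading *)
//   +- Exchange hashpartitioning(key, 200)                    <- shuffle boundary between stages
//      +- *HashAggregate(keys=[key], functions=[partial_sum(value)])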