apache-spark-sql

Ordering of rows in JavaRDDs after union

不羁的心 submitted on 2021-01-28 08:08:45
Question: I am trying to find out any information on the ordering of the rows in an RDD. Here is what I am trying to do: Rdd1, Rdd2 Rdd3 = Rdd1.union(rdd2); in Rdd3, is there any guarantee that rdd1 records will appear first and rdd2 afterwards? In my tests I saw this behavior happening but wasn't able to find it in any docs. Just FYI, I really do not care about the ordering within the RDDs themselves (i.e. rdd2's or rdd1's data order is really not a concern, but after union Rdd1's record data must come first is
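A minimal sketch of the behaviour described above (an observation, not a documented guarantee; sc is assumed to be an existing SparkContext):

    // union concatenates the parent RDDs' partitions, so rdd1's partitions
    // come before rdd2's in the result; collect() returns data in partition order.
    val rdd1 = sc.parallelize(Seq(1, 2, 3))
    val rdd2 = sc.parallelize(Seq(10, 20, 30))
    val rdd3 = rdd1.union(rdd2)
    println(rdd3.collect().mkString(", "))  // rdd1's elements typically appear first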

Spark: create a nested schema

隐身守侯 submitted on 2021-01-28 06:50:41
Question: With spark, import spark.implicits._ val data = Seq( (1, ("value11", "value12")), (2, ("value21", "value22")), (3, ("value31", "value32")) ) val df = data.toDF("id", "v1") df.printSchema() The result is the following: root |-- id: integer (nullable = false) |-- v1: struct (nullable = true) | |-- _1: string (nullable = true) | |-- _2: string (nullable = true) Now if I want to create the schema myself, how should I proceed? val schema = StructType(Array( StructField("id", IntegerType),
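A minimal sketch of how the nested schema above could be declared explicitly (an illustration only; it simply mirrors the printSchema output shown in the question):

    import org.apache.spark.sql.types._

    // id is a plain integer column; v1 is a struct with two string fields _1 and _2
    val schema = StructType(Array(
      StructField("id", IntegerType, nullable = false),
      StructField("v1", StructType(Array(
        StructField("_1", StringType),
        StructField("_2", StringType)
      )), nullable = true)
    ))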

How to make VectorAssembler not compress data?

僤鯓⒐⒋嵵緔 submitted on 2021-01-28 05:32:51
Question: I want to transform multiple columns into one column using VectorAssembler, but the data is compressed by default without other options. val arr2= Array((1,2,0,0,0),(1,2,3,0,0),(1,2,4,5,0),(1,2,2,5,6)) val df=sc.parallelize(arr2).toDF("a","b","c","e","f") val colNames=Array("a","b","c","e","f") val assembler = new VectorAssembler() .setInputCols(colNames) .setOutputCol("newCol") val transDF= assembler.transform(df).select(col("newCol")) transDF.show(false) The input is: +---+---+---+---+---+ |
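VectorAssembler chooses a sparse or dense representation per row based on which is smaller. A hedged workaround sketch (converting to dense via a UDF is an assumption about the desired output, not part of the question; transDF is the dataframe from the excerpt):

    import org.apache.spark.ml.linalg.Vector
    import org.apache.spark.sql.functions.{col, udf}

    // Convert the assembled vector column to a dense vector after the fact
    val toDense = udf((v: Vector) => v.toDense)
    val denseDF = transDF.withColumn("newCol", toDense(col("newCol")))
    denseDF.show(false)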

Sum one column's values if other columns are matched

被刻印的时光 ゝ submitted on 2021-01-28 05:13:49
Question: I have a Spark dataframe like this: word1 word2 co-occur ---- ----- ------- w1 w2 10 w2 w1 15 w2 w3 11 And my expected result is: word1 word2 co-occur ---- ----- ------- w1 w2 25 w2 w3 11 I tried the dataframe's groupBy and aggregate functions but I couldn't come up with a solution. Answer 1: You need a single column containing both words in sorted order; this column can then be used for the groupBy. You can create a new column with an array containing word1 and word2 as follows: df.withColumn(
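A sketch of the approach outlined in the answer (df is the dataframe from the question; sort_array over the two word columns is one way to build the order-insensitive key):

    import org.apache.spark.sql.functions._

    // Build a key that is identical for (w1, w2) and (w2, w1), then sum per key
    val result = df
      .withColumn("key", sort_array(array(col("word1"), col("word2"))))
      .groupBy("key")
      .agg(sum(col("co-occur")).alias("co-occur"))
      // Unpack the key back into the two word columns for the expected layout
      .select(col("key")(0).as("word1"), col("key")(1).as("word2"), col("co-occur"))
    result.show()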

Premature end of Content-Length delimited message body SparkException while reading from S3 using PySpark

我的梦境 submitted on 2021-01-28 01:42:06
Question: I am using the below code to read an S3 CSV file from my local machine. from pyspark import SparkConf, SparkContext from pyspark.sql import SparkSession import configparser import os conf = SparkConf() conf.set('spark.jars', '/usr/local/spark/jars/aws-java-sdk-1.7.4.jar,/usr/local/spark/jars/hadoop-aws-2.7.4.jar') #Tried by setting this, but failed conf.set('spark.executor.memory', '8g') conf.set('spark.driver.memory', '8g') spark_session = SparkSession.builder \ .config(conf=conf) \ .appName(
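The excerpt is PySpark; as a rough Scala equivalent, a sketch of wiring up the s3a connector with explicit credentials and endpoint (bucket, keys and paths are placeholders, and this does not by itself address the Content-Length error):

    import org.apache.spark.sql.SparkSession

    // s3a configuration passed through spark.hadoop.* properties
    val spark = SparkSession.builder()
      .appName("s3a-read")
      .config("spark.jars", "/usr/local/spark/jars/aws-java-sdk-1.7.4.jar,/usr/local/spark/jars/hadoop-aws-2.7.4.jar")
      .config("spark.hadoop.fs.s3a.access.key", "<access-key>")
      .config("spark.hadoop.fs.s3a.secret.key", "<secret-key>")
      .config("spark.hadoop.fs.s3a.endpoint", "s3.amazonaws.com")
      .getOrCreate()

    val df = spark.read.option("header", "true").csv("s3a://<bucket>/<path>/file.csv")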

How do I pass parameters to selectExpr? SparkSQL-Scala

和自甴很熟 submitted on 2021-01-27 22:23:32
Question: :) When you have a data frame, you can add columns and fill their rows with the method selectExpr. Something like this: scala> table.show +------+--------+---------+--------+--------+ |idempr|tipperrd| codperrd|tipperrt|codperrt| +------+--------+---------+--------+--------+ | OlcM| h|999999999| J| 0| | zOcQ| r|777777777| J| 1| | kyGp| t|333333333| J| 2| | BEuX| A|999999999| F| 3| scala> var table2 = table.selectExpr("idempr", "tipperrd", "codperrd", "tipperrt", "codperrt", "'hola' as Saludo"
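A short sketch of one way to pass a parameter into selectExpr: build the expression string with Scala string interpolation (table is the dataframe from the question; saludo is a hypothetical variable standing in for the value to inject):

    // The interpolated string becomes a literal column named Saludo
    val saludo = "hola"
    val table2 = table.selectExpr("idempr", "tipperrd", "codperrd", "tipperrt", "codperrt", s"'$saludo' as Saludo")
    table2.show()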

Spark read CSV - Not showing corrupt records

佐手、 submitted on 2021-01-27 20:54:30
Question: Spark has a Permissive mode for reading CSV files which stores the corrupt records in a separate column named _corrupt_record . permissive - Sets all fields to null when it encounters a corrupted record and places all corrupted records in a string column called _corrupt_record However, when I try the following example, I don't see any column named _corrupt_record . The records which don't match the schema appear as null data.csv data 10.00 11.00 $12.00 $13 gaurang code import
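A sketch of the usual reason the column does not appear: _corrupt_record only shows up when it is part of the schema, so it can be declared explicitly (the DoubleType for the data column is an assumption about the intended parsing; spark is an existing SparkSession):

    import org.apache.spark.sql.types._

    // Values like "$12.00" fail to parse as Double and land in _corrupt_record
    val schema = StructType(Seq(
      StructField("data", DoubleType, nullable = true),
      StructField("_corrupt_record", StringType, nullable = true)
    ))

    val df = spark.read
      .option("header", "true")
      .option("mode", "PERMISSIVE")
      .schema(schema)
      .csv("data.csv")
    df.show(false)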

Is it possible to force schema definition when loading tables from AWS RDS (MySQL)

最后都变了- submitted on 2021-01-27 16:45:37
Question: I'm using Apache Spark to read data from a MySQL database on AWS RDS. It is actually inferring the schema from the database as well. Unfortunately, one of the table's columns is of type TINYINT(1) (column name: active). The active column has the following values: non active, active, pending, etc. Spark recognizes TINYINT(1) as BooleanType, so it changes all values in active to true or false. As a result, I can't identify the value. Is it possible to force the schema definition when loading tables
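Two possible directions, sketched here as assumptions rather than the accepted answer: tell the MySQL JDBC driver not to map TINYINT(1) to a boolean (tinyInt1isBit=false), and/or override the inferred type through Spark's customSchema JDBC option (endpoint, database, table and credentials are placeholders; spark is an existing SparkSession):

    val df = spark.read
      .format("jdbc")
      // tinyInt1isBit=false keeps TINYINT(1) as a numeric type on the driver side
      .option("url", "jdbc:mysql://<rds-endpoint>:3306/<db>?tinyInt1isBit=false")
      .option("dbtable", "<table>")
      .option("user", "<user>")
      .option("password", "<password>")
      // customSchema overrides the type Spark infers for the listed columns
      .option("customSchema", "active INT")
      .load()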

Spark throws an error when reading a Hive table

谁说胖子不能爱 submitted on 2021-01-27 13:56:17
Question: I am trying to do select * from db.abc in Hive; this Hive table was loaded using Spark. It does not work and shows an error: Error: java.io.IOException: java.lang.IllegalArgumentException: bucketId out of range: -1 (state=,code=0) When I use the following properties I was able to query Hive: set hive.mapred.mode=nonstrict; set hive.optimize.ppd=true; set hive.optimize.index.filter=true; set hive.tez.bucket.pruning=true; set hive.explain.user=false; set hive.fetch.task.conversion=none; now when
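The excerpt stops before the rest of the question. As a hypothetical sanity check only (not a fix for the Hive-side bucketId error), the same table can be read back through the Spark session that wrote it:

    // Reading the table via Spark SQL, using the table name from the question
    val df = spark.sql("select * from db.abc")
    df.show(10)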