apache-spark-sql

Ordering of rows in JavaRDDs after union

不羁的心 submitted on 2021-01-28 08:08:45
Question: I am trying to find out any information on the ordering of the rows in an RDD. Here is what I am trying to do: Rdd1, Rdd2 Rdd3 = Rdd1.union(rdd2); in Rdd3, is there any guarantee that rdd1 records will appear first and rdd2 afterwards? In my tests I saw this behavior happening but wasn't able to find it in any docs. Just FYI, I really do not care about the ordering within the RDDs themselves (i.e. rdd2's or rdd1's data order is really not a concern, but after union Rdd1's record data must come first is
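A minimal sketch of the behaviour described above (an observation, not a documented guarantee; sc is assumed to be an existing SparkContext):

    // union concatenates the parent RDDs' partitions, so rdd1's partitions
    // come before rdd2's in the result; collect() returns data in partition order.
    val rdd1 = sc.parallelize(Seq(1, 2, 3))
    val rdd2 = sc.parallelize(Seq(10, 20, 30))
    val rdd3 = rdd1.union(rdd2)
    println(rdd3.collect().mkString(", "))  // rdd1's elements typically appear first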

Spark: create a nested schema

隐身守侯 submitted on 2021-01-28 06:50:41
Question: With spark, import spark.implicits._ val data = Seq( (1, ("value11", "value12")), (2, ("value21", "value22")), (3, ("value31", "value32")) ) val df = data.toDF("id", "v1") df.printSchema() The result is the following: root |-- id: integer (nullable = false) |-- v1: struct (nullable = true) | |-- _1: string (nullable = true) | |-- _2: string (nullable = true) Now if I want to create the schema myself, how should I proceed? val schema = StructType(Array( StructField("id", IntegerType),
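A minimal sketch of how the nested schema above could be declared explicitly (an illustration only; it simply mirrors the printSchema output shown in the question):

    import org.apache.spark.sql.types._

    // id is a plain integer column; v1 is a struct with two string fields _1 and _2
    val schema = StructType(Array(
      StructField("id", IntegerType, nullable = false),
      StructField("v1", StructType(Array(
        StructField("_1", StringType),
        StructField("_2", StringType)
      )), nullable = true)
    ))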

How to make VectorAssembler not compress data?

僤鯓⒐⒋嵵緔 submitted on 2021-01-28 05:32:51
Question: I want to transform multiple columns into one column using VectorAssembler, but the data is compressed by default without other options. val arr2= Array((1,2,0,0,0),(1,2,3,0,0),(1,2,4,5,0),(1,2,2,5,6)) val df=sc.parallelize(arr2).toDF("a","b","c","e","f") val colNames=Array("a","b","c","e","f") val assembler = new VectorAssembler() .setInputCols(colNames) .setOutputCol("newCol") val transDF= assembler.transform(df).select(col("newCol")) transDF.show(false) The input is: +---+---+---+---+---+ |
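VectorAssembler chooses a sparse or dense representation per row based on which is smaller. A hedged workaround sketch (converting to dense via a UDF is an assumption about the desired output, not part of the question; transDF is the dataframe from the excerpt):

    import org.apache.spark.ml.linalg.Vector
    import org.apache.spark.sql.functions.{col, udf}

    // Convert the assembled vector column to a dense vector after the fact
    val toDense = udf((v: Vector) => v.toDense)
    val denseDF = transDF.withColumn("newCol", toDense(col("newCol")))
    denseDF.show(false)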

Sum one column's values if other columns are matched

被刻印的时光 ゝ submitted on 2021-01-28 05:13:49
Question: I have a Spark dataframe like this: word1 word2 co-occur ---- ----- ------- w1 w2 10 w2 w1 15 w2 w3 11 And my expected result is: word1 word2 co-occur ---- ----- ------- w1 w2 25 w2 w3 11 I tried the dataframe's groupBy and aggregate functions but I couldn't come up with a solution. Answer 1: You need a single column containing both words in sorted order; this column can then be used for the groupBy. You can create a new column with an array containing word1 and word2 as follows: df.withColumn(
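A sketch of the approach outlined in the answer (df is the dataframe from the question; sort_array over the two word columns is one way to build the order-insensitive key):

    import org.apache.spark.sql.functions._

    // Build a key that is identical for (w1, w2) and (w2, w1), then sum per key
    val result = df
      .withColumn("key", sort_array(array(col("word1"), col("word2"))))
      .groupBy("key")
      .agg(sum(col("co-occur")).alias("co-occur"))
      // Unpack the key back into the two word columns for the expected layout
      .select(col("key")(0).as("word1"), col("key")(1).as("word2"), col("co-occur"))
    result.show()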

Premature end of Content-Length delimited message body SparkException while reading from S3 using PySpark

我的梦境 submitted on 2021-01-28 01:42:06
Question: I am using the below code to read an S3 CSV file from my local machine. from pyspark import SparkConf, SparkContext from pyspark.sql import SparkSession import configparser import os conf = SparkConf() conf.set('spark.jars', '/usr/local/spark/jars/aws-java-sdk-1.7.4.jar,/usr/local/spark/jars/hadoop-aws-2.7.4.jar') #Tried by setting this, but failed conf.set('spark.executor.memory', '8g') conf.set('spark.driver.memory', '8g') spark_session = SparkSession.builder \ .config(conf=conf) \ .appName(
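The excerpt is PySpark; as a rough Scala equivalent, a sketch of wiring up the s3a connector with explicit credentials and endpoint (bucket, keys and paths are placeholders, and this does not by itself address the Content-Length error):

    import org.apache.spark.sql.SparkSession

    // s3a configuration passed through spark.hadoop.* properties
    val spark = SparkSession.builder()
      .appName("s3a-read")
      .config("spark.jars", "/usr/local/spark/jars/aws-java-sdk-1.7.4.jar,/usr/local/spark/jars/hadoop-aws-2.7.4.jar")
      .config("spark.hadoop.fs.s3a.access.key", "<access-key>")
      .config("spark.hadoop.fs.s3a.secret.key", "<secret-key>")
      .config("spark.hadoop.fs.s3a.endpoint", "s3.amazonaws.com")
      .getOrCreate()

    val df = spark.read.option("header", "true").csv("s3a://<bucket>/<path>/file.csv")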

How do I pass parameters to selectExpr? SparkSQL-Scala

和自甴很熟 submitted on 2021-01-27 22:23:32
Question: :) When you have a data frame, you can add columns and fill their rows with the method selectExpr. Something like this: scala> table.show +------+--------+---------+--------+--------+ |idempr|tipperrd| codperrd|tipperrt|codperrt| +------+--------+---------+--------+--------+ | OlcM| h|999999999| J| 0| | zOcQ| r|777777777| J| 1| | kyGp| t|333333333| J| 2| | BEuX| A|999999999| F| 3| scala> var table2 = table.selectExpr("idempr", "tipperrd", "codperrd", "tipperrt", "codperrt", "'hola' as Saludo"
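A short sketch of one way to pass a parameter into selectExpr: build the expression string with Scala string interpolation (table is the dataframe from the question; saludo is a hypothetical variable standing in for the value to inject):

    // The interpolated string becomes a literal column named Saludo
    val saludo = "hola"
    val table2 = table.selectExpr("idempr", "tipperrd", "codperrd", "tipperrt", "codperrt", s"'$saludo' as Saludo")
    table2.show()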

Spark read CSV - Not showing corrupt records

佐手、 submitted on 2021-01-27 20:54:30
Question: Spark has a Permissive mode for reading CSV files which stores the corrupt records in a separate column named _corrupt_record . permissive - Sets all fields to null when it encounters a corrupted record and places all corrupted records in a string column called _corrupt_record However, when I try the following example, I don't see any column named _corrupt_record . The records which don't match the schema appear as null data.csv data 10.00 11.00 $12.00 $13 gaurang code import
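A sketch of the usual reason the column does not appear: _corrupt_record only shows up when it is part of the schema, so it can be declared explicitly (the DoubleType for the data column is an assumption about the intended parsing; spark is an existing SparkSession):

    import org.apache.spark.sql.types._

    // Values like "$12.00" fail to parse as Double and land in _corrupt_record
    val schema = StructType(Seq(
      StructField("data", DoubleType, nullable = true),
      StructField("_corrupt_record", StringType, nullable = true)
    ))

    val df = spark.read
      .option("header", "true")
      .option("mode", "PERMISSIVE")
      .schema(schema)
      .csv("data.csv")
    df.show(false)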

Is it possible to force schema definition when loading tables from AWS RDS (MySQL)

最后都变了- submitted on 2021-01-27 16:45:37
Question: I'm using Apache Spark to read data from a MySQL database on AWS RDS. It is actually inferring the schema from the database as well. Unfortunately, one of the table's columns is of type TINYINT(1) (column name: active). The active column has the following values: non active, active, pending, etc. Spark recognizes TINYINT(1) as BooleanType, so it changes all values in active to true or false. As a result, I can't identify the value. Is it possible to force the schema definition when loading tables
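Two possible directions, sketched here as assumptions rather than the accepted answer: tell the MySQL JDBC driver not to map TINYINT(1) to a boolean (tinyInt1isBit=false), and/or override the inferred type through Spark's customSchema JDBC option (endpoint, database, table and credentials are placeholders; spark is an existing SparkSession):

    val df = spark.read
      .format("jdbc")
      // tinyInt1isBit=false keeps TINYINT(1) as a numeric type on the driver side
      .option("url", "jdbc:mysql://<rds-endpoint>:3306/<db>?tinyInt1isBit=false")
      .option("dbtable", "<table>")
      .option("user", "<user>")
      .option("password", "<password>")
      // customSchema overrides the type Spark infers for the listed columns
      .option("customSchema", "active INT")
      .load()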

Spark throws an error when reading a Hive table

谁说胖子不能爱 submitted on 2021-01-27 13:56:17
Question: I am trying to do select * from db.abc in Hive; this Hive table was loaded using Spark. It does not work and shows an error: Error: java.io.IOException: java.lang.IllegalArgumentException: bucketId out of range: -1 (state=,code=0) When I use the following properties I was able to query Hive: set hive.mapred.mode=nonstrict; set hive.optimize.ppd=true; set hive.optimize.index.filter=true; set hive.tez.bucket.pruning=true; set hive.explain.user=false; set hive.fetch.task.conversion=none; now when
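The excerpt stops before the rest of the question. As a hypothetical sanity check only (not a fix for the Hive-side bucketId error), the same table can be read back through the Spark session that wrote it:

    // Reading the table via Spark SQL, using the table name from the question
    val df = spark.sql("select * from db.abc")
    df.show(10)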