spark-dataframe

How to transform DataFrame before joining operation?

坚强是说给别人听的谎言 submitted on 2019-12-29 09:41:47
Question: The following code is used to extract ranks from the column products. The ranks are the second numbers in each pair [...]. For example, given [[222,66],[333,55]], the ranks are 66 and 55 for the products with PK 222 and 333, respectively. But in Spark 2.2 the code runs very slowly when df_products is around 800 MB:

    df_products.createOrReplaceTempView("df_products")
    val result = df.as("df2")
      .join(spark.sql("SELECT * FROM df_products")
      .select($"product_PK", explode($"products").as(
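
A minimal sketch of one way to restructure this (assuming product_PK is the join key on both sides and each element of products is a [PK, rank] pair, as described above): explode df_products once into (product_PK, rank) rows, keep only those two columns, and then join:

    import org.apache.spark.sql.functions.explode
    import spark.implicits._

    val ranks = df_products
      .select($"product_PK", explode($"products").as("pair"))
      .select($"product_PK", $"pair".getItem(1).as("rank"))  // second element of each [PK, rank] pair
    val result = df.join(ranks, Seq("product_PK"))            // assumes df also carries a product_PK column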

How to replace empty values in a column of DataFrame?

狂风中的少年 submitted on 2019-12-29 09:17:14
Question: How can I replace empty values in the column Field1 of DataFrame df?

    Field1      Field2
                AA
    12          BB

This command does not give the expected result: df.na.fill("Field1",Seq("Anonymous")) The expected result:

    Field1      Field2
    Anonymous   AA
    12          BB

Answer 1: Fill: returns a new DataFrame that replaces null or NaN values in numeric columns with value. Two things: an empty string is not null or NaN, so you'll have to use a case statement for that. Fill seems to not work well when giving a text value into a numeric
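
A minimal sketch of the case-statement idea (not the answer's exact code): na.fill only touches null/NaN values, so empty strings are handled explicitly with when/otherwise:

    import org.apache.spark.sql.functions.{col, when}

    val filled = df.withColumn("Field1",
      when(col("Field1").isNull || col("Field1") === "", "Anonymous")
        .otherwise(col("Field1")))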

OUTER JOIN on 2 DATA FRAMES : Spark Scala SqlContext

非 Y 不嫁゛ submitted on 2019-12-29 09:13:07
Question: I am getting an error while doing outer joins on two data frames. I am trying to get the percentile.

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val df = sqlContext.jsonFile("temp.txt")
    val res = df.withColumn("visited", explode($"visited"))
    val result1 = res.groupBy($"customerId", $"visited.placeName").agg(count("*").alias("total"))
    val result2 = res
      .filter($"visited.rating" < 4)
      .groupBy($"customerId", $"visited.placeName")
      .agg(count("*").alias("top"))
    result1.show()
    result2.show()
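
A minimal sketch of the outer join itself (column names follow the aggregates above; the percentage formula is an assumption about what the asker wants): join the per-place totals with the "top" counts, fill missing counts with 0, and derive a ratio:

    import org.apache.spark.sql.functions.col

    val percent = result1
      .join(result2, Seq("customerId", "placeName"), "left_outer")
      .na.fill(0, Seq("top"))
      .withColumn("percent", col("top") * 100.0 / col("total"))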

How to force Spark to evaluate DataFrame operations inline

只愿长相守 submitted on 2019-12-29 08:36:07
Question: According to the Spark RDD docs: "All transformations in Spark are lazy, in that they do not compute their results right away... This design enables Spark to run more efficiently." There are times when I need to do certain operations on my dataframes right then and there. But because dataframe ops are "lazily evaluated" (per above), when I write these operations in the code, there's very little guarantee that Spark will actually execute those operations inline with the rest of the code. For
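
A minimal sketch of the usual way to force execution (the filter condition is just a placeholder): transformations stay lazy until an action runs, so calling an action such as count() is what actually triggers the work, and cache() keeps the computed result around for later steps:

    import org.apache.spark.sql.functions.col

    val prepared = df.filter(col("amount") > 0).cache()  // "amount" is a placeholder column
    prepared.count()  // action: forces the transformations to run here and now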

Array Intersection in Spark SQL

旧巷老猫 submitted on 2019-12-29 08:03:10
Question: I have a table with an array-type column named writer, which has values like array[value1, value2], array[value2, value3], etc. I am doing a self join to get results that have common values between the arrays. I tried:

    sqlContext.sql("SELECT R2.writer FROM table R1 JOIN table R2 ON R1.id != R2.id WHERE ARRAY_INTERSECTION(R1.writer, R2.writer)[0] is not null ")

And

    sqlContext.sql("SELECT R2.writer FROM table R1 JOIN table R2 ON R1.id != R2.id WHERE ARRAY_INTERSECT(R1.writer, R2.writer)[0] is
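
A minimal sketch, assuming Spark 2.4+ where array_intersect is a built-in SQL function (older versions would need a UDF for the intersection):

    val common = spark.sql("""
      SELECT R1.id, R2.writer
      FROM table R1 JOIN table R2 ON R1.id != R2.id
      WHERE size(array_intersect(R1.writer, R2.writer)) > 0
    """)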

Converting Pandas dataframe into Spark dataframe error

蹲街弑〆低调 submitted on 2019-12-28 08:17:10
Question: I'm trying to convert a Pandas DataFrame into a Spark one. DF head:

    10000001,1,0,1,12:35,OK,10002,1,0,9,f,NA,24,24,0,3,9,0,0,1,1,0,0,4,543
    10000001,2,0,1,12:36,OK,10002,1,0,9,f,NA,24,24,0,3,9,2,1,1,3,1,3,2,611
    10000002,1,0,4,12:19,PA,10003,1,1,7,f,NA,74,74,0,2,15,2,0,2,3,1,2,2,691

Code:

    dataset = pd.read_csv("data/AS/test_v2.csv")
    sc = SparkContext(conf=conf)
    sqlCtx = SQLContext(sc)
    sdf = sqlCtx.createDataFrame(dataset)

And I got an error: TypeError: Can not merge type <class 'pyspark.sql.types

How to improve performance for slow Spark jobs using DataFrame and JDBC connection?

耗尽温柔 submitted on 2019-12-27 11:45:26
Question: I am trying to access a mid-size Teradata table (~100 million rows) via JDBC in standalone mode on a single node (local[*]). I am using Spark 1.4.1, set up on a very powerful machine (2 CPUs, 24 cores, 126 GB RAM). I have tried several memory setups and tuning options to make it work faster, but none of them made a huge impact. I am sure there is something I am missing, and below is my final try, which took about 11 minutes to get this simple count, versus only 40 seconds using a JDBC
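
A minimal sketch of a partitioned JDBC read (URL, table name, and bound values are placeholders): giving Spark a numeric partitionColumn plus bounds lets it issue several parallel queries instead of pulling the whole table through a single connection:

    val jdbcDF = sqlContext.read
      .format("jdbc")
      .option("url", "jdbc:teradata://host/DATABASE=mydb")  // placeholder URL
      .option("dbtable", "my_table")
      .option("user", "user")
      .option("password", "password")
      .option("partitionColumn", "id")   // a numeric column to split the reads on
      .option("lowerBound", "1")
      .option("upperBound", "100000000")
      .option("numPartitions", "24")
      .load()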

Unable to read, and later query text file in Apache Spark

柔情痞子 submitted on 2019-12-25 17:17:20
Question: So I am trying to implement the example from the Spark programming guide using a dataset we have. It is a file whose fields are separated by |. However, it throws the following error even after following the instructions as given. I can see it is unable to "cast" an object of one class into another; any advice as to how to handle this scenario?

    Caused by: java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD
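
This does not address the ClassCastException itself, but as a minimal sketch of the file-reading part (assuming Spark 2.x; the path is a placeholder), a pipe-separated file can be read directly with the DataFrame CSV reader instead of splitting lines by hand:

    val df = spark.read
      .option("sep", "|")
      .option("header", "false")
      .csv("people.txt")  // placeholder path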

how to change header of a data frame with another data frame header?

ぐ巨炮叔叔 submitted on 2019-12-25 11:35:39
Question: I have a data set which looks like this:

    LineItem.organizationId|^|LineItem.lineItemId|^|StatementTypeCode|^|LineItemName|^|LocalLanguageLabel|^|FinancialConceptLocal|^|FinancialConceptGlobal|^|IsDimensional|^|InstrumentId|^|LineItemSequence|^|PhysicalMeasureId|^|FinancialConceptCodeGlobalSecondary|^|IsRangeAllowed|^|IsSegmentedByOrigin|^|SegmentGroupDescription|^|SegmentChildDescription|^|SegmentChildLocalLanguageLabel|^|LocalLanguageLabel.languageId|^|LineItemName.languageId|^
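
A minimal sketch of replacing one DataFrame's header with another's (dataDF and headerDF are placeholder names; both frames are assumed to have the same number of columns): take the column names of the frame that carries the header and apply them to the data frame:

    val renamed = dataDF.toDF(headerDF.columns: _*)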

Another instance of Derby may have already booted the database /home/cloudera/metastore_db

时光怂恿深爱的人放手 submitted on 2019-12-25 11:14:24
Question: I am trying to load a normal text file into a Hive table using Spark. I am using Spark version 2.0.2. I did this successfully in Spark version 1.6.0, and I am trying to do the same in version 2.x. I executed the steps below:

    import org.apache.spark.sql.SparkSession
    val spark = SparkSession.builder().appName("SparkHiveLoad").master("local").enableHiveSupport().getOrCreate()
    import spark.implicits._

There is no problem until now. But when I try to load the file into Spark:

    val partfile =
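
The Derby error in the title usually means another process (for example an earlier spark-shell or Hive session) still holds the lock on the embedded metastore directory, so closing that session first is the simplest cure. As a minimal sketch of an alternative (the config key and path here are assumptions, not the asker's setup), a session can be pointed at its own Derby database so it does not compete for /home/cloudera/metastore_db:

    val spark = SparkSession.builder()
      .appName("SparkHiveLoad")
      .master("local")
      .config("spark.hadoop.javax.jdo.option.ConnectionURL",              // assumed way to pass the Hive metastore property
              "jdbc:derby:;databaseName=/tmp/spark_metastore_db;create=true")  // placeholder path
      .enableHiveSupport()
      .getOrCreate()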