apache-spark

How to refer to a map column in a spark-sql query?

Submitted by 允我心安 on 2021-01-28 19:11:42
Question:

    scala> val map1 = spark.sql("select map('p1', 's1', 'p2', 's2')")
    map1: org.apache.spark.sql.DataFrame = [map(p1, s1, p2, s2): map<string,string>]

    scala> map1.show()
    +--------------------+
    | map(p1, s1, p2, s2)|
    +--------------------+
    |[p1 -> s1, p2 -> s2]|
    +--------------------+

    scala> spark.sql("select element_at(map1, 'p1')")
    org.apache.spark.sql.AnalysisException: cannot resolve '`map1`' given input columns: []; line 1 pos 18;
    'Project [unresolvedalias('element_at('map1, p1), None)]

How can I refer to the map column in the query?
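One likely cause (the question is truncated, so this is inferred): map1 is only the name of the Scala variable holding the DataFrame, so spark.sql has no table or column called map1 to resolve. A minimal PySpark sketch of one workaround, assuming Spark 2.4+ for element_at and using an illustrative view name m and column alias map1:

    # Alias the map column, register a temp view, then reference the alias in SQL.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.sql("select map('p1', 's1', 'p2', 's2') as map1")
    df.createOrReplaceTempView("m")

    # element_at(map, key) is available from Spark 2.4 onwards
    spark.sql("select element_at(map1, 'p1') from m").show()   # single row: s1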

RDD of BSONObject to a DataFrame

Submitted by 我怕爱的太早我们不能终老 on 2021-01-28 18:47:49
Question: I'm loading a BSON dump from Mongo into Spark as described here. It works, but what I get is:

    org.apache.spark.rdd.RDD[(Object, org.bson.BSONObject)]

It should basically be just JSON with all String fields. The rest of my code requires a DataFrame object to manipulate the data, but, of course, toDF fails on that RDD. How can I convert it to a Spark DataFrame with all fields as String? Something similar to spark.read.json would be great to have.

Answer 1:

    val datapath = "path_to_bson_file.bson"
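The answer is cut off after its first line, so it cannot be reconstructed here. Independent of it, one common pattern is to serialize each record to a JSON string and let spark.read.json parse it; a PySpark sketch under that assumption, where bson_rdd stands in for the loaded RDD and each value is assumed to behave like a dict of field -> value:

    import json

    # Turn every record into a JSON string with all values stringified, then let
    # Spark build a DataFrame from the RDD of JSON strings.
    json_rdd = bson_rdd.map(lambda kv: json.dumps({k: str(v) for k, v in kv[1].items()}))

    df = spark.read.json(json_rdd)   # all columns come out as strings here
    df.printSchema()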

Timezone conversion with pyspark from timestamp and country

Submitted by 爱⌒轻易说出口 on 2021-01-28 18:44:31
Question: I'm trying to convert a UTC date to a date in the local timezone (using the country) with PySpark. I have the country as a string and the date as a timestamp, so the input is:

    date = Timestamp('2016-11-18 01:45:55')  # type is pandas._libs.tslibs.timestamps.Timestamp
    country = "FR"                           # type is string

    import pytz
    import pandas as pd

    def convert_date_spark(date, country):
        timezone = pytz.country_timezones(country)[0]
        local_time = date.replace(tzinfo = pytz.utc).astimezone(timezone)
        date, time = local_time
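A common way to do this inside Spark itself is to wrap the pytz lookup in a UDF. A minimal sketch (not the author's full code), assuming a DataFrame df with a timestamp column "date" and a string column "country", and noting that Spark's session-timezone handling of timestamps can still shift results:

    import pytz
    from pyspark.sql import functions as F
    from pyspark.sql.types import TimestampType

    @F.udf(TimestampType())
    def to_local(ts, country):
        if ts is None or country is None:
            return None
        tz = pytz.timezone(pytz.country_timezones(country)[0])
        # treat the incoming value as UTC, convert, then drop tzinfo for Spark
        return ts.replace(tzinfo=pytz.utc).astimezone(tz).replace(tzinfo=None)

    df = df.withColumn("local_date", to_local("date", "country"))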

Spark: Dataframe action really slow when upgraded from 2.1.0 to 2.2.1

Submitted by 人走茶凉 on 2021-01-28 17:56:20
Question: I just upgraded Spark 2.1.0 to Spark 2.2.1. Has anyone seen extremely slow behavior on dataframe.filter(…).collect(), specifically a collect operation with a filter before it? dataframe.collect seems to run okay, but dataframe.filter(…).collect() takes forever, even though the DataFrame contains only 2 records and this is in a unit test. When I go back to Spark 2.1.0, it is back to normal speed. I have looked at the thread dump and could not find an obvious cause. I have made an effort to make sure all the libraries I
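For reference, the pattern being timed is just a filter followed by a collect; a tiny illustrative PySpark sketch of that pattern (it reproduces the shape of the workload, not the regression), assuming an active SparkSession named spark:

    # Two rows, a filter, then collect: the combination reported as slow on 2.2.1.
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

    rows_all = df.collect()                          # reported as fast
    rows_filtered = df.filter(df.id == 1).collect()  # reported as slow after the upgrade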

sbt/sbt : no such file or directory error

Submitted by 爱⌒轻易说出口 on 2021-01-28 12:42:12
Question: I'm trying to install Spark on my Ubuntu machine. I have installed sbt and Scala, and I'm able to view their versions. But when I try to build Spark using the 'sbt/sbt assembly' command, I get the error below:

    bash: sbt/sbt: No such file or directory

Can you please let me know where I am making a mistake? I have been stuck here since yesterday. Thank you for the help in advance.

Answer 1: You may have downloaded the pre-built version of Spark. If it is a pre-built distribution, you don't need to execute the build tool

Issue with creating a global list from map using PySpark

Submitted by 一笑奈何 on 2021-01-28 12:22:48
Question: I have this code where I am reading a file in IPython using PySpark. What I am trying to do is add a piece to it which builds a list from a particular column read from the file, but when I try to execute it the list comes out empty and nothing gets appended to it. My code is:

    list1 = []

    def file_read(line):
        list1.append(line[10])
        # bunch of other code which processes other column indexes on `line`

    inputData = sc.textFile(fileName).zipWithIndex().filter(lambda (line,rownum):
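The list stays empty because file_read runs on the executors, where each task appends to its own copy of list1; the driver's list is never touched. A minimal sketch of the usual fix, assuming the goal is just the values at column index 10 and (an assumption, since the question is truncated) that the file is comma-separated:

    # Transform on the executors, then bring the results back to the driver
    # explicitly with collect(); appending to a driver-side list from inside
    # map()/foreach() has no visible effect on the driver.
    rdd = sc.textFile(fileName).map(lambda line: line.split(','))
    list1 = rdd.map(lambda cols: cols[10]).collect()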

Creating a Random Feature Array in Spark DataFrames

Submitted by 自闭症网瘾萝莉.ら on 2021-01-28 12:15:56
Question: When creating an ALS model, we can extract a userFactors DataFrame and an itemFactors DataFrame. These DataFrames contain a column with an Array. I would like to generate some random data and union it with the userFactors DataFrame. Here is my code:

    val df1: DataFrame = Seq(
      (123, 456, 4.0),
      (123, 789, 5.0),
      (234, 456, 4.5),
      (234, 789, 1.0)).toDF("user", "item", "rating")

    val model1 = (new ALS()
      .setImplicitPrefs(true)
      .fit(df1))

    val iF = model1.itemFactors
    val uF = model1.userFactors

I then
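A PySpark sketch (the question itself is in Scala) of one way to build rows with a random feature array: generate an array of rand() columns whose length matches the ALS rank and cast it so the schema lines up with userFactors (id, features array<float>). The rank value and the new ids are illustrative, and an active SparkSession named spark is assumed:

    from pyspark.sql import functions as F

    rank = 10            # must match the rank of the fitted ALS model
    new_ids = spark.createDataFrame([(999,), (1000,)], ["id"])

    random_factors = new_ids.select(
        "id",
        F.array([F.rand() for _ in range(rank)]).cast("array<float>").alias("features"),
    )
    # random_factors can now be unioned with the userFactors DataFrame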

Spark Small ORC Stripes

Submitted by 跟風遠走 on 2021-01-28 11:58:32
Question: We use Spark to flatten out clickstream data and then write it to S3 in ORC+zlib format. I have tried changing many settings in Spark, but the resultant stripe sizes of the ORC files being created are still very small (<2MB). Things I have tried so far to increase the stripe size:

- Earlier each file was 20MB in size; using coalesce I am now creating files of 250-300MB, but there are still 200 stripes per file, i.e. each stripe is <2MB.
- Tried using hivecontext instead of
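One knob commonly suggested for this is the ORC stripe size passed as a writer option; whether it actually reaches the ORC writer, and whether the writer has enough memory to buffer a full stripe before flushing, depends on the Spark and ORC versions in use. A hedged PySpark sketch with illustrative values and an illustrative output path:

    # Ask for ~64MB stripes when writing ORC+zlib. Treat this as a sketch to try,
    # not a verified fix; the option may be ignored on some Spark/ORC versions.
    (df.coalesce(8)                                        # fewer, larger output files
       .write
       .option("compression", "zlib")
       .option("orc.stripe.size", str(64 * 1024 * 1024))
       .orc("s3://bucket/clickstream/"))                   # illustrative path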

Rolling average without timestamp in pyspark

Submitted by 我怕爱的太早我们不能终老 on 2021-01-28 11:42:08
Question: We can find the rolling/moving average of time series data using a window function in PySpark. The data I am dealing with doesn't have any timestamp column, but it does have a strictly increasing column frame_number. The data looks like this:

    d = [
        {'session_id': 1, 'frame_number': 1, 'rtd': 11.0, 'rtd2': 11.0,},
        {'session_id': 1, 'frame_number': 2, 'rtd': 12.0, 'rtd2': 6.0},
        {'session_id': 1, 'frame_number': 3, 'rtd': 4.0, 'rtd2': 233.0},
        {'session_id': 1, 'frame_number': 4, 'rtd': 110.0, 'rtd2'
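Because frame_number is strictly increasing, a row-based window can stand in for a time-based one. A minimal sketch assuming a trailing 3-row window (the question is truncated before it states the intended width) and an active SparkSession named spark:

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # Average over the previous 2 rows plus the current row, ordered by
    # frame_number within each session.
    w = (Window.partitionBy("session_id")
               .orderBy("frame_number")
               .rowsBetween(-2, 0))

    df = spark.createDataFrame(d)    # d as defined in the question
    df.withColumn("rtd_rolling_avg", F.avg("rtd").over(w)).show()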