apache-spark

How to refer to a map column in a spark-sql query?

Submitted by 允我心安 on 2021-01-28 19:11:42
Question:

    scala> val map1 = spark.sql("select map('p1', 's1', 'p2', 's2')")
    map1: org.apache.spark.sql.DataFrame = [map(p1, s1, p2, s2): map<string,string>]

    scala> map1.show()
    +--------------------+
    | map(p1, s1, p2, s2)|
    +--------------------+
    |[p1 -> s1, p2 -> s2]|
    +--------------------+

    scala> spark.sql("select element_at(map1, 'p1')")
    org.apache.spark.sql.AnalysisException: cannot resolve '`map1`' given input columns: []; line 1 pos 18;
    'Project [unresolvedalias('element_at('map1, p1), None)]

How can I refer to the map column in the query?
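One likely cause (the question is truncated, so this is inferred): map1 is only the name of the Scala variable holding the DataFrame, so spark.sql has no table or column called map1 to resolve. A minimal PySpark sketch of one workaround, assuming Spark 2.4+ for element_at and using an illustrative view name m and column alias map1:

    # Alias the map column, register a temp view, then reference the alias in SQL.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.sql("select map('p1', 's1', 'p2', 's2') as map1")
    df.createOrReplaceTempView("m")

    # element_at(map, key) is available from Spark 2.4 onwards
    spark.sql("select element_at(map1, 'p1') from m").show()   # single row: s1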

RDD of BSONObject to a DataFrame

Submitted by 我怕爱的太早我们不能终老 on 2021-01-28 18:47:49
Question: I'm loading a BSON dump from Mongo into Spark as described here. It works, but what I get is:

    org.apache.spark.rdd.RDD[(Object, org.bson.BSONObject)]

It should basically be just JSON with all String fields. The rest of my code requires a DataFrame object to manipulate the data, but, of course, toDF fails on that RDD. How can I convert it to a Spark DataFrame with all fields as String? Something similar to spark.read.json would be great to have.

Answer 1:

    val datapath = "path_to_bson_file.bson"
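The answer is cut off after its first line, so it cannot be reconstructed here. Independent of it, one common pattern is to serialize each record to a JSON string and let spark.read.json parse it; a PySpark sketch under that assumption, where bson_rdd stands in for the loaded RDD and each value is assumed to behave like a dict of field -> value:

    import json

    # Turn every record into a JSON string with all values stringified, then let
    # Spark build a DataFrame from the RDD of JSON strings.
    json_rdd = bson_rdd.map(lambda kv: json.dumps({k: str(v) for k, v in kv[1].items()}))

    df = spark.read.json(json_rdd)   # all columns come out as strings here
    df.printSchema()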

Timezone conversion with pyspark from timestamp and country

Submitted by 爱⌒轻易说出口 on 2021-01-28 18:44:31
Question: I'm trying to convert a UTC date to a date in the local timezone (using the country) with PySpark. I have the country as a string and the date as a timestamp, so the input is:

    date = Timestamp('2016-11-18 01:45:55')  # type is pandas._libs.tslibs.timestamps.Timestamp
    country = "FR"                           # type is string

    import pytz
    import pandas as pd

    def convert_date_spark(date, country):
        timezone = pytz.country_timezones(country)[0]
        local_time = date.replace(tzinfo = pytz.utc).astimezone(timezone)
        date, time = local_time
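A common way to do this inside Spark itself is to wrap the pytz lookup in a UDF. A minimal sketch (not the author's full code), assuming a DataFrame df with a timestamp column "date" and a string column "country", and noting that Spark's session-timezone handling of timestamps can still shift results:

    import pytz
    from pyspark.sql import functions as F
    from pyspark.sql.types import TimestampType

    @F.udf(TimestampType())
    def to_local(ts, country):
        if ts is None or country is None:
            return None
        tz = pytz.timezone(pytz.country_timezones(country)[0])
        # treat the incoming value as UTC, convert, then drop tzinfo for Spark
        return ts.replace(tzinfo=pytz.utc).astimezone(tz).replace(tzinfo=None)

    df = df.withColumn("local_date", to_local("date", "country"))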

Spark: Dataframe action really slow when upgraded from 2.1.0 to 2.2.1

Submitted by 人走茶凉 on 2021-01-28 17:56:20
Question: I just upgraded Spark 2.1.0 to Spark 2.2.1. Has anyone seen extremely slow behavior on dataframe.filter(…).collect(), specifically a collect operation with a filter before it? dataframe.collect seems to run okay, but dataframe.filter(…).collect() takes forever, even though the DataFrame contains only 2 records and this is in a unit test. When I go back to Spark 2.1.0, it is back to normal speed. I have looked at the thread dump and could not find an obvious cause. I have made an effort to make sure all the libraries I
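For reference, the pattern being timed is just a filter followed by a collect; a tiny illustrative PySpark sketch of that pattern (it reproduces the shape of the workload, not the regression), assuming an active SparkSession named spark:

    # Two rows, a filter, then collect: the combination reported as slow on 2.2.1.
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

    rows_all = df.collect()                          # reported as fast
    rows_filtered = df.filter(df.id == 1).collect()  # reported as slow after the upgrade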

sbt/sbt : no such file or directory error

Submitted by 爱⌒轻易说出口 on 2021-01-28 12:42:12
Question: I'm trying to install Spark on my Ubuntu machine. I have installed sbt and Scala, and I'm able to view their versions. But when I try to build Spark using the 'sbt/sbt assembly' command, I get the error below:

    bash: sbt/sbt: No such file or directory

Can you please let me know where I am making a mistake? I have been stuck here since yesterday. Thank you for the help in advance.

Answer 1: You may have downloaded the pre-built version of Spark. If it is a pre-built distribution, you don't need to execute the build tool

Issue with creating a global list from map using PySpark

Submitted by 一笑奈何 on 2021-01-28 12:22:48
Question: I have this code where I am reading a file in IPython using PySpark. What I am trying to do is add a piece to it which builds a list from a particular column read from the file, but when I try to execute it the list comes out empty and nothing gets appended to it. My code is:

    list1 = []

    def file_read(line):
        list1.append(line[10])
        # bunch of other code which processes other column indexes on `line`

    inputData = sc.textFile(fileName).zipWithIndex().filter(lambda (line,rownum):
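The list stays empty because file_read runs on the executors, where each task appends to its own copy of list1; the driver's list is never touched. A minimal sketch of the usual fix, assuming the goal is just the values at column index 10 and (an assumption, since the question is truncated) that the file is comma-separated:

    # Transform on the executors, then bring the results back to the driver
    # explicitly with collect(); appending to a driver-side list from inside
    # map()/foreach() has no visible effect on the driver.
    rdd = sc.textFile(fileName).map(lambda line: line.split(','))
    list1 = rdd.map(lambda cols: cols[10]).collect()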

Creating a Random Feature Array in Spark DataFrames

Submitted by 自闭症网瘾萝莉.ら on 2021-01-28 12:15:56
Question: When creating an ALS model, we can extract a userFactors DataFrame and an itemFactors DataFrame. These DataFrames contain a column with an Array. I would like to generate some random data and union it with the userFactors DataFrame. Here is my code:

    val df1: DataFrame = Seq(
      (123, 456, 4.0),
      (123, 789, 5.0),
      (234, 456, 4.5),
      (234, 789, 1.0)).toDF("user", "item", "rating")

    val model1 = (new ALS()
      .setImplicitPrefs(true)
      .fit(df1))

    val iF = model1.itemFactors
    val uF = model1.userFactors

I then
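A PySpark sketch (the question itself is in Scala) of one way to build rows with a random feature array: generate an array of rand() columns whose length matches the ALS rank and cast it so the schema lines up with userFactors (id, features array<float>). The rank value and the new ids are illustrative, and an active SparkSession named spark is assumed:

    from pyspark.sql import functions as F

    rank = 10            # must match the rank of the fitted ALS model
    new_ids = spark.createDataFrame([(999,), (1000,)], ["id"])

    random_factors = new_ids.select(
        "id",
        F.array([F.rand() for _ in range(rank)]).cast("array<float>").alias("features"),
    )
    # random_factors can now be unioned with the userFactors DataFrame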

Spark Small ORC Stripes

Submitted by 跟風遠走 on 2021-01-28 11:58:32
Question: We use Spark to flatten out clickstream data and then write it to S3 in ORC+zlib format. I have tried changing many settings in Spark, but the resultant stripe sizes of the ORC files being created are still very small (<2MB). Things I have tried so far to increase the stripe size:

- Earlier each file was 20MB in size; using coalesce I am now creating files of 250-300MB, but there are still 200 stripes per file, i.e. each stripe is <2MB.
- Tried using hivecontext instead of
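One knob commonly suggested for this is the ORC stripe size passed as a writer option; whether it actually reaches the ORC writer, and whether the writer has enough memory to buffer a full stripe before flushing, depends on the Spark and ORC versions in use. A hedged PySpark sketch with illustrative values and an illustrative output path:

    # Ask for ~64MB stripes when writing ORC+zlib. Treat this as a sketch to try,
    # not a verified fix; the option may be ignored on some Spark/ORC versions.
    (df.coalesce(8)                                        # fewer, larger output files
       .write
       .option("compression", "zlib")
       .option("orc.stripe.size", str(64 * 1024 * 1024))
       .orc("s3://bucket/clickstream/"))                   # illustrative path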

Rolling average without timestamp in pyspark

Submitted by 我怕爱的太早我们不能终老 on 2021-01-28 11:42:08
Question: We can find the rolling/moving average of time series data using a window function in PySpark. The data I am dealing with doesn't have any timestamp column, but it does have a strictly increasing column frame_number. The data looks like this:

    d = [
        {'session_id': 1, 'frame_number': 1, 'rtd': 11.0, 'rtd2': 11.0,},
        {'session_id': 1, 'frame_number': 2, 'rtd': 12.0, 'rtd2': 6.0},
        {'session_id': 1, 'frame_number': 3, 'rtd': 4.0, 'rtd2': 233.0},
        {'session_id': 1, 'frame_number': 4, 'rtd': 110.0, 'rtd2'
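Because frame_number is strictly increasing, a row-based window can stand in for a time-based one. A minimal sketch assuming a trailing 3-row window (the question is truncated before it states the intended width) and an active SparkSession named spark:

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # Average over the previous 2 rows plus the current row, ordered by
    # frame_number within each session.
    w = (Window.partitionBy("session_id")
               .orderBy("frame_number")
               .rowsBetween(-2, 0))

    df = spark.createDataFrame(d)    # d as defined in the question
    df.withColumn("rtd_rolling_avg", F.avg("rtd").over(w)).show()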