pyspark

How to do range lookup and search in PySpark

半城伤御伤魂 posted on 2019-12-24 08:59:58
Question: I am trying to write a PySpark function that performs a combined search and range lookup. Here is the detailed description. I have two data sets. One data set, say D1, is basically a lookup table:

MinValue  MaxValue  Value1  Value2
----------------------------------
1         1000      0.5     0.6
1001      2000      0.8     0.1
2001      4000      0.2     0.5
4001      9000      0.04    0.06

The other data set, say D2, is a table with millions of records, for example:

ID  InterestsRate  Days
----------------------- …
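
A minimal sketch of this kind of range lookup, assuming the D2 column matched against D1's [MinValue, MaxValue] intervals is Days (the excerpt is truncated, so the join key is an assumption). Because D1 is small, broadcasting it keeps the non-equi join cheap:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("range-lookup").getOrCreate()

d1 = spark.createDataFrame(
    [(1, 1000, 0.5, 0.6), (1001, 2000, 0.8, 0.1),
     (2001, 4000, 0.2, 0.5), (4001, 9000, 0.04, 0.06)],
    ["MinValue", "MaxValue", "Value1", "Value2"])

d2 = spark.createDataFrame(
    [(1, 0.05, 1500), (2, 0.03, 4200)],           # toy stand-in for the real D2
    ["ID", "InterestsRate", "Days"])

# Non-equi (range) join: each D2 row picks up the D1 row whose interval contains Days.
result = d2.join(
    F.broadcast(d1),
    (d2.Days >= d1.MinValue) & (d2.Days <= d1.MaxValue),
    "left")
result.show()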

name spark is not defined

耗尽温柔 posted on 2019-12-24 08:15:15
Question: I am trying to follow the Spark quick-start tutorial (https://spark.apache.org/docs/latest/quick-start.html) but get the following error: "name 'spark' is not defined".

Using Python version 2.6.6 (r266:84292, Nov 22 2013 12:16:22)
SparkContext available as sc.
>>> import pyspark
>>> textFile = spark.read.text("README.md")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'spark' is not defined

This is how I start it:

./bin/pyspark --master local[*]

Answer 1: If your spark …
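
A minimal sketch of the two usual fixes, assuming the code is run in the pyspark shell: on Spark 2.x and later the spark session can be created (or retrieved) explicitly, while on Spark 1.x no spark entry point exists at all and the shell's sc SparkContext is used instead.

# Spark 2.x and later: create or retrieve the SparkSession explicitly
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
textFile = spark.read.text("README.md")

# Spark 1.x: there is no SparkSession; use the SparkContext the shell provides
textFile = sc.textFile("README.md")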

How to connect to Presto JDBC in PySpark?

Deadly posted on 2019-12-24 08:12:38
Question: I want to connect to a Presto server using JDBC from PySpark. I followed a tutorial written in Java and am trying to do the same in my Python 3 code, but I am getting an error:

: java.sql.SQLException: No suitable driver

I have tried to execute the following code:

jdbcDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:presto://my_machine_ip:8080/hive/default") \
    .option("user", "airflow") \
    .option("dbtable", "may30_1") \
    .load()

It should be noted that I am using Spark on EMR and so, …
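
"No suitable driver" usually means the Presto JDBC jar is not on the classpath or the driver class was never named. A minimal sketch, assuming the jar is shipped with the job (e.g. spark-submit --jars presto-jdbc-<version>.jar) and that the Facebook Presto driver is in use; the class name differs for other Presto/Trino distributions:

jdbcDF = (spark.read
          .format("jdbc")
          .option("url", "jdbc:presto://my_machine_ip:8080/hive/default")
          .option("driver", "com.facebook.presto.jdbc.PrestoDriver")  # assumed driver class
          .option("user", "airflow")
          .option("dbtable", "may30_1")
          .load())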

Count most frequent word in row by R

别说谁变了你拦得住时间么 posted on 2019-12-24 08:04:38
Question: There is a table shown below:

   Name    Mon     Tue     Wed     Thu     Fri     Sat     Sun
1  John    Apple   Orange  Apple   Banana  Apple   Apple   Orange
2  Ricky   Banana  Apple   Banana  Banana  Banana  Banana  Apple
3  Alex    Apple   Orange  Orange  Apple   Apple   Orange  Orange
4  Robbin  Apple   Apple   Apple   Apple   Apple   Banana  Banana
5  Sunny   Banana  Banana  Apple   Apple   Apple   Banana  Banana

I want to count the most frequent fruit for each person and add that value in a new column. For example:

   Name    Mon     Tue     Wed     Thu     Fri     Sat     Sun     Max_Acc  Count
1  John    Apple   …
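
The question asks for R, but since this digest is tagged pyspark, here is a hedged sketch of the analogous row-wise "most frequent value" computation in PySpark, assuming the seven day columns shown above and a SparkSession named spark:

from pyspark.sql import functions as F
from pyspark.sql import Window

days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
df = spark.createDataFrame(
    [("John",  "Apple",  "Orange", "Apple",  "Banana", "Apple",  "Apple",  "Orange"),
     ("Ricky", "Banana", "Apple",  "Banana", "Banana", "Banana", "Banana", "Apple")],
    ["Name"] + days)

# Melt the day columns into rows, count each fruit per person, keep the top count.
melted = df.select("Name", F.explode(F.array(*days)).alias("fruit"))
counts = melted.groupBy("Name", "fruit").count()
w = Window.partitionBy("Name").orderBy(F.col("count").desc())
top = (counts.withColumn("rn", F.row_number().over(w))
             .filter("rn = 1")
             .select("Name",
                     F.col("fruit").alias("Max_Acc"),
                     F.col("count").alias("Count")))

df.join(top, "Name").show()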

PySpark: DataFrame - Convert Struct to Array

和自甴很熟 posted on 2019-12-24 07:53:01
Question: I have a dataframe with the following structure:

root
 |-- index: long (nullable = true)
 |-- text: string (nullable = true)
 |-- topicDistribution: struct (nullable = true)
 |    |-- type: long (nullable = true)
 |    |-- values: array (nullable = true)
 |    |    |-- element: double (containsNull = true)
 |-- wiki_index: string (nullable = true)

I need to change it to:

root
 |-- index: long (nullable = true)
 |-- text: string (nullable = true)
 |-- topicDistribution: array (nullable = true)
 |    |-- element: double …
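
A minimal sketch, assuming the goal is simply to replace the struct column by its values field (which already holds the array of doubles in the schema above):

from pyspark.sql import functions as F

df2 = df.withColumn("topicDistribution", F.col("topicDistribution.values"))
df2.printSchema()   # topicDistribution is now array<double>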

How to convert a list of array to Spark dataframe

半腔热情 posted on 2019-12-24 07:46:25
Question: Suppose I have a list:

x = [[1,10],[2,14],[3,17]]

I want to convert x to a Spark dataframe with two columns, id (1, 2, 3) and value (10, 14, 17). How could I do that? Thanks.

Answer 1:

x = [[1,10],[2,14],[3,17]]
df = sc.parallelize(x).toDF(['ID','VALUE'])
df.show()

Source: https://stackoverflow.com/questions/45858900/how-to-convert-a-list-of-array-to-spark-dataframe
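
An equivalent sketch that skips the RDD detour, assuming a Spark 2.x SparkSession named spark is available:

x = [[1, 10], [2, 14], [3, 17]]
df = spark.createDataFrame(x, ["ID", "VALUE"])  # schema inferred from the data
df.show()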

How to identify repeated occurrences of a string column in Hive?

谁说我不能喝 posted on 2019-12-24 07:24:06
Question: I have a view like this in Hive:

id         sequencenumber  appname
242539622  1               A
242539622  2               A
242539622  3               A
242539622  4               B
242539622  5               B
242539622  6               C
242539622  7               D
242539622  8               D
242539622  9               D
242539622  10              B
242539622  11              B
242539622  12              D
242539622  13              D
242539622  14              F

For each id, I'd like to have the following view:

id         sequencenumber  appname  appname_c
242539622  1               A        A
242539622  2               A        A
242539622  3               A        A
242539622  4               B        B_1
242539622  5               B        B_1
242539622  6               C        C
242539622  7               D        D_1
242539622  8               D        D_1
242539622  9               D        …
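
This is the classic gaps-and-islands pattern. A minimal sketch in Spark SQL (runnable against the Hive view from PySpark); the view name my_view is a placeholder, and because the excerpt is truncated the exact suffixing rule for appname_c is not fully visible, so the sketch only numbers each run of consecutive identical appname values per id:

runs = spark.sql("""
    WITH marked AS (
        SELECT id, sequencenumber, appname,
               CASE WHEN appname = LAG(appname)
                         OVER (PARTITION BY id ORDER BY sequencenumber)
                    THEN 0 ELSE 1 END AS new_run
        FROM my_view
    ),
    islands AS (
        SELECT *, SUM(new_run)
                      OVER (PARTITION BY id ORDER BY sequencenumber) AS run_id
        FROM marked
    )
    SELECT id, sequencenumber, appname,
           DENSE_RANK() OVER (PARTITION BY id, appname ORDER BY run_id)
               AS occurrence   -- distinguishes B_1 from B_2, D_1 from D_2, etc.
    FROM islands
""")
runs.show()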

TypeError: 'GroupedData' object is not iterable in pyspark

谁说胖子不能爱 posted on 2019-12-24 06:48:52
Question: I'm using Spark version 2.0.1 and Python 2.7. I'm running the following code:

# This will return a new DF with all the columns + id
data1 = data.withColumn("id", monotonically_increasing_id())  # Create an integer index
data1.show()

def create_indexes(df, fields=['country', 'state_id', 'airport', 'airport_id']):
    """ Create indexes for the different element ids for CMRs. This allows us to
    select CMRs that match a given element and element value very quickly.
    """
    if fields == None:
        print("No fields …
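
A minimal sketch of the likely cause (the excerpt is truncated before the groupBy call that raises the error): df.groupBy(...) returns a GroupedData object, which cannot be iterated directly; apply an aggregation first, then collect the resulting DataFrame. The column name country is taken from the fields list above and is only illustrative:

grouped = data1.groupBy("country").count()   # returns a DataFrame, not GroupedData
for row in grouped.collect():                # rows of a DataFrame can be iterated
    print(row["country"], row["count"])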

Pyspark saving is not working when called from inside a foreach

泄露秘密 posted on 2019-12-24 06:38:28
Question: I am building a pipeline that receives messages from Azure Event Hubs and saves them into Databricks Delta tables. All my tests with static data went well; see the code below:

body = 'A|B|C|D\n"False"|"253435564"|"14"|"2019-06-25 04:56:21.713"\n"True"|"253435564"|"13"|"2019-06-25 04:56:21.713"\n"
tableLocation = "/delta/tables/myTableName"

spark = SparkSession.builder.appName("CSV converter").getOrCreate()
csvData = spark.sparkContext.parallelize(body.split('\n'))
df = spark.read \
    .option("header", …
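
DataFrame and SparkSession operations cannot be called from inside a plain foreach, because that code runs on the executors where no session is available. A minimal sketch of the usual fix, assuming Structured Streaming is used to read from Event Hubs and streaming_df is that input stream: foreachBatch hands each micro-batch back as a regular DataFrame that can be written to Delta on the driver side.

def write_batch(batch_df, batch_id):
    (batch_df.write
        .format("delta")
        .mode("append")
        .save("/delta/tables/myTableName"))

query = (streaming_df.writeStream
         .foreachBatch(write_batch)
         .option("checkpointLocation", "/delta/checkpoints/myTableName")  # assumed path
         .start())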

Error using pyspark with WASB/Connecting Pyspark with Azure Blob

陌路散爱 posted on 2019-12-24 06:35:47
Question: I'm currently working on connecting an Azure blob to PySpark and am having difficulty getting the two connected and running. I have installed both required jar files (hadoop-azure-3.2.0-javadoc.jar and azure-storage-8.3.0-javadoc.jar). I set them to be read in my SparkConf using SparkConf().setAll(), and once I start the session I use:

spark._jsc.hadoopConfiguration().set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
spark._jsc.hadoopConfiguration().set("fs …
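
One likely culprit worth noting: the *-javadoc.jar artifacts contain documentation only, not classes, so the actual hadoop-azure and azure-storage library jars are the ones that need to be on the classpath. A minimal sketch of the rest of the setup, with the storage account, key and container as placeholders:

spark._jsc.hadoopConfiguration().set(
    "fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
spark._jsc.hadoopConfiguration().set(
    "fs.azure.account.key.<storage_account>.blob.core.windows.net",
    "<account_key>")

df = spark.read.csv(
    "wasbs://<container>@<storage_account>.blob.core.windows.net/path/to/file.csv",
    header=True)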