pyspark-sql

Spark SQL get max & min dynamically from datasource

我们两清 Submitted on 2019-11-30 09:29:55
Question: I am using Spark SQL and want to fetch the whole data set every day from an Oracle table (consisting of more than 1800k records). The application hangs when I read from Oracle, hence I used the concept of partitionColumn, lowerBound & upperBound. But the problem is: how can I get the lowerBound & upperBound values of the primary key column dynamically? The values of lowerBound & upperBound change every day, so how can I get the boundary values of the primary key column dynamically? Can anyone guide me
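A minimal sketch of one way to do this: push a small MIN/MAX aggregate down to Oracle first, then feed the result into the partitioned read. The JDBC URL, credentials, table name, and key column below are placeholders, not values from the original post.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("oracle-bounds").getOrCreate()

    jdbc_url = "jdbc:oracle:thin:@//dbhost:1521/service"  # placeholder connection string
    props = {"user": "app_user", "password": "secret",
             "driver": "oracle.jdbc.OracleDriver"}

    # Ask Oracle for today's boundary values of the primary key.
    min_id, max_id = spark.read.jdbc(
        url=jdbc_url,
        table="(SELECT MIN(id) AS min_id, MAX(id) AS max_id FROM my_table) b",
        properties=props,
    ).first()

    # Use the freshly computed bounds to split the full read into partitions.
    df = spark.read.jdbc(
        url=jdbc_url,
        table="my_table",
        column="id",
        lowerBound=int(min_id),
        upperBound=int(max_id) + 1,
        numPartitions=8,
        properties=props,
    )

The extra aggregate query is cheap on the database side, so the bounds track whatever the table holds on a given day.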

Count number of duplicate rows in SPARKSQL

牧云@^-^@ Submitted on 2019-11-30 08:36:11
Question: I have a requirement where I need to count the number of duplicate rows in SparkSQL for Hive tables.

    from pyspark import SparkContext, SparkConf
    from pyspark.sql import HiveContext
    from pyspark.sql.types import *
    from pyspark.sql import Row

    app_name = "test"
    conf = SparkConf().setAppName(app_name)
    sc = SparkContext(conf=conf)
    sqlContext = HiveContext(sc)
    df = sqlContext.sql("select * from DV_BDFRAWZPH_NOGBD_R000_SG.employee")

As of now I have hardcoded the table name, but it actually comes as
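A minimal sketch of one way to count them, assuming "duplicate rows" means rows repeated across all columns. It is shown with the Spark 2.x SparkSession entry point; the same groupBy/count logic applies to the HiveContext frame from the post, whose table name is reused here.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    df = spark.sql("select * from DV_BDFRAWZPH_NOGBD_R000_SG.employee")

    # Group on every column and keep only the groups that occur more than once.
    dupes = df.groupBy(df.columns).count().filter(F.col("count") > 1)

    duplicate_groups = dupes.count()                           # distinct rows that repeat
    surplus_copies = df.count() - df.dropDuplicates().count()  # extra copies beyond the first
    print(duplicate_groups, surplus_copies)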

How to cache a Spark data frame and reference it in another script

痴心易碎 Submitted on 2019-11-30 08:27:08
Question: Is it possible to cache a data frame and then reference (query) it in another script? My goal is as follows:

    1. In script 1, create a data frame (df).
    2. Run script 1 and cache df.
    3. In script 2, query the data in df.

Answer 1: Spark >= 2.1.0. Since Spark 2.1 you can create global temporary views (createGlobalTempView), which can be accessed across multiple sessions of the same Spark application, as long as that application is kept alive: "The lifetime of this temporary view is tied to this Spark application."
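A minimal sketch of that approach, under the same assumption that both scripts run inside one Spark application that stays alive; the view and object names are illustrative.

    from pyspark.sql import SparkSession

    # Script 1: build, cache, and publish the frame as a global temp view.
    spark = SparkSession.builder.appName("producer").getOrCreate()
    df = spark.range(0, 10)
    df.cache()
    df.createGlobalTempView("shared_df")

    # Script 2 (a different session of the same application): read it back
    # through the reserved global_temp database.
    spark2 = spark.newSession()
    spark2.sql("SELECT * FROM global_temp.shared_df WHERE id > 5").show()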

Pyspark connection to Postgres database in ipython notebook

痴心易碎 Submitted on 2019-11-30 07:42:45
Question: I've read previous posts on this, but I still cannot pinpoint why I am unable to connect my ipython notebook to a Postgres db. I am able to launch pyspark in an ipython notebook, and SparkContext is loaded as 'sc'. I have the following in my .bash_profile for finding the Postgres driver:

    export SPARK_CLASSPATH=/path/to/downloaded/jar

Here's what I am doing in the ipython notebook to connect to the db (based on this post):

    from pyspark.sql import DataFrameReader as dfr
    sqlContext = SQLContext(sc)
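A minimal sketch of how the JDBC read is usually wired up, assuming the Postgres driver jar is already visible to Spark (for example via --jars or the classpath setting from the post); the URL, table name, and credentials are placeholders.

    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)  # sc is the SparkContext the notebook already provides

    url = "jdbc:postgresql://localhost:5432/mydb"
    properties = {
        "user": "postgres",
        "password": "secret",
        "driver": "org.postgresql.Driver",
    }

    # Read the whole table through the DataFrameReader's jdbc method.
    df = sqlContext.read.jdbc(url=url, table="my_table", properties=properties)
    df.printSchema()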

Pyspark DataFrame UDF on Text Column

爷，独闯天下 Submitted on 2019-11-30 07:03:05
I'm trying to do some NLP text clean-up of some Unicode columns in a PySpark DataFrame. I've tried Spark 1.3, 1.5 and 1.6 and can't seem to get things to work for the life of me. I've also tried using Python 2.7 and Python 3.4. I've created an extremely simple UDF, as seen below, that should just return a string back for each record in a new column. Other functions will manipulate the text and then return the changed text back in a new column.

    import pyspark
    from pyspark.sql import SQLContext
    from pyspark.sql.types import *
    from pyspark.sql import SQLContext
    from pyspark.sql.functions import
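A minimal sketch of a string-returning UDF of the kind described, shown here with the Spark 2.x SparkSession entry point; the udf/StringType pattern is the same with the SQLContext versions the post mentions, and the column and function names are illustrative.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("Hello World!",), ("  Mixed CASE text  ",)], ["text"])

    # The UDF must declare its return type; here it always hands back a string
    # (or None), which matches StringType.
    clean_text = udf(lambda s: s.strip().lower() if s is not None else None, StringType())

    df.withColumn("clean", clean_text(df["text"])).show(truncate=False)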

Difference between createOrReplaceTempView and registerTempTable

时光总嘲笑我的痴心妄想 Submitted on 2019-11-30 06:44:18
I am new to Spark and was trying out a few commands in SparkSQL using Python when I came across these two commands: createOrReplaceTempView() and registerTempTable(). What is the difference between the two commands? They seem to have the same set of functionality. registerTempTable is part of the 1.x API and has been deprecated in Spark 2.0. createOrReplaceTempView and createTempView were introduced in Spark 2.0 as a replacement for registerTempTable. Other than that, registerTempTable and createOrReplaceTempView are functionally equivalent, and the former calls the latter. No
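A minimal sketch contrasting the two calls; the view name is illustrative.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(0, 5)

    # Spark 2.0+: replaces any existing temp view with the same name instead of erroring.
    df.createOrReplaceTempView("people")

    # Spark 1.x style, deprecated in 2.0; it delegates to the call above.
    # df.registerTempTable("people")

    spark.sql("SELECT * FROM people WHERE id > 2").show()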

How to pivot on multiple columns in Spark SQL?

不想你离开。 Submitted on 2019-11-30 04:55:41
I need to pivot more than one column in a pyspark dataframe. Sample dataframe:

    >>> d = [(100,1,23,10),(100,2,45,11),(100,3,67,12),(100,4,78,13),(101,1,23,10),(101,2,45,13),(101,3,67,14),(101,4,78,15),(102,1,23,10),(102,2,45,11),(102,3,67,16),(102,4,78,18)]
    >>> mydf = spark.createDataFrame(d,['id','day','price','units'])
    >>> mydf.show()
    +---+---+-----+-----+
    | id|day|price|units|
    +---+---+-----+-----+
    |100|  1|   23|   10|
    |100|  2|   45|   11|
    |100|  3|   67|   12|
    |100|  4|   78|   13|
    |101|  1|   23|   10|
    |101|  2|   45|   13|
    |101|  3|   67|   14|
    |101|  4|   78|   15|
    |102|  1|   23|   10|
    |102|  2|   45|   11|
    |102|  3|   67|   16|
    |102|  4|
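A minimal sketch of one common way to handle this (not necessarily the answer given in the thread): pivot once on day and aggregate both value columns, which yields columns such as 1_price, 1_units, 2_price, and so on.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    d = [(100,1,23,10),(100,2,45,11),(100,3,67,12),(100,4,78,13),
         (101,1,23,10),(101,2,45,13),(101,3,67,14),(101,4,78,15),
         (102,1,23,10),(102,2,45,11),(102,3,67,16),(102,4,78,18)]
    mydf = spark.createDataFrame(d, ['id','day','price','units'])

    # One row per id, one (price, units) pair of columns per day.
    pivoted = (mydf.groupBy('id')
                   .pivot('day')
                   .agg(F.first('price').alias('price'),
                        F.first('units').alias('units')))
    pivoted.show()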

Spark 2.0: Relative path in absolute URI (spark-warehouse)

感情迁移 Submitted on 2019-11-30 04:53:16
I'm trying to migrate from Spark 1.6.1 to Spark 2.0.0 and I am getting a weird error when trying to read a csv file into SparkSQL. Previously, when I would read a file from local disk in pyspark I would do:

Spark 1.6:

    df = sqlContext.read \
        .format('com.databricks.spark.csv') \
        .option('header', 'true') \
        .load('file:///C:/path/to/my/file.csv', schema=mySchema)

In the latest release I think it should look like this:

Spark 2.0:

    spark = SparkSession.builder \
        .master('local[*]') \
        .appName('My App') \
        .getOrCreate()
    df = spark.read \
        .format('csv') \
        .option('header', 'true') \
        .load('file:///C:
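A commonly suggested workaround for this Spark 2.0.0 error on Windows is to set spark.sql.warehouse.dir to an explicit file: URI when building the session. A minimal sketch follows; the warehouse path is a placeholder and the csv path is the one from the 1.6 snippet above.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .master('local[*]') \
        .appName('My App') \
        .config('spark.sql.warehouse.dir', 'file:///C:/tmp/spark-warehouse') \
        .getOrCreate()

    df = spark.read \
        .format('csv') \
        .option('header', 'true') \
        .load('file:///C:/path/to/my/file.csv')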

Pyspark convert a standard list to data frame [duplicate]

放肆的年华 Submitted on 2019-11-30 04:51:06
This question already has an answer here: Create Spark DataFrame. Can not infer schema for type: <type 'float'>

The case is really simple: I need to convert a Python list into a data frame with the following code:

    from pyspark.sql.types import StructType
    from pyspark.sql.types import StructField
    from pyspark.sql.types import StringType, IntegerType

    schema = StructType([StructField("value", IntegerType(), True)])
    my_list = [1, 2, 3, 4]
    rdd = sc.parallelize(my_list)
    df = sqlContext.createDataFrame(rdd, schema)
    df.show()

It failed with the following error:

    raise TypeError("StructType can not accept
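A minimal sketch of the usual fix for that error: each RDD element has to be a tuple (or Row) that lines up with the schema, not a bare integer. It reuses sc and sqlContext as set up in the question.

    from pyspark.sql.types import StructType, StructField, IntegerType

    schema = StructType([StructField("value", IntegerType(), True)])
    my_list = [1, 2, 3, 4]

    # Wrap every element in a one-field tuple so it matches the one-field schema.
    rdd = sc.parallelize(my_list).map(lambda x: (x,))
    df = sqlContext.createDataFrame(rdd, schema)
    df.show()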

How to convert type Row into Vector to feed to the KMeans

懵懂的女人 Submitted on 2019-11-30 04:17:46
When I try to feed df2 to KMeans I get the following error:

    clusters = KMeans.train(df2, 10, maxIterations=30, runs=10, initializationMode="random")

The error I get:

    Cannot convert type <class 'pyspark.sql.types.Row'> into Vector

df2 is a dataframe created as follows:

    df = sqlContext.read.json("data/ALS3.json")
    df2 = df.select('latitude','longitude')
    df2.show()

      latitude|  longitude|
    60.1643075| 24.9460844|
    60.4686748| 22.2774728|

How can I convert these two columns to a Vector and feed it to KMeans?

ML: The problem is that you missed the documentation's example, and it's pretty clear that the method
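A minimal sketch of the conversion, assuming df2 is the two-column frame built in the question: map each Row to an MLlib dense vector of floats before handing the RDD to KMeans.train.

    from pyspark.mllib.clustering import KMeans
    from pyspark.mllib.linalg import Vectors

    # df2 = df.select('latitude', 'longitude'), as built in the question.
    vectors = df2.rdd.map(
        lambda row: Vectors.dense([float(row.latitude), float(row.longitude)]))

    clusters = KMeans.train(vectors, 10, maxIterations=30,
                            initializationMode="random")

The truncated answer above appears to point at the DataFrame-based ML API instead, which expects a single vector-valued features column (typically built with VectorAssembler) rather than an RDD.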