pyspark-sql

Use “IS IN” between 2 Spark dataframe columns

≯℡__Kan透↙ submitted on 2019-12-02 04:55:59
Question: I have the following dataframe:

    from pyspark.sql.types import *
    rdd = sc.parallelize([
        ('ALT', ['chien', 'chat'],   'oiseau'),
        ('ALT', ['oiseau'],          'oiseau'),
        ('TDR', ['poule', 'poulet'], 'poule'),
        ('ALT', ['ours'],            'chien'),
        ('ALT', ['paon'],            'tigre'),
        ('TDR', ['tigre', 'lion'],   'lion'),
        ('ALT', ['chat'],            'chien'),
    ])
    schema = StructType([StructField("ClientId", StringType(), True),
                         StructField("Animaux", ArrayType(StringType(), True), True),
                         StructField("Animal", StringType(), True)])
    test = rdd.toDF(schema)
    test.show()

How can I apply an "IS IN"-style test between the Animal column and the Animaux array column, row by row?
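
The excerpt cuts off before any answer appears. As a minimal sketch (not taken from the thread), one UDF-free way to test the Animal column against the Animaux array is the SQL array_contains function called through expr(), which allows the value being searched for to be another column:

    # Sketch only: reuses the question's schema with two of its rows.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import expr
    from pyspark.sql.types import StructType, StructField, StringType, ArrayType

    spark = SparkSession.builder.getOrCreate()
    rows = [('ALT', ['chien', 'chat'], 'oiseau'),
            ('TDR', ['poule', 'poulet'], 'poule')]
    schema = StructType([StructField("ClientId", StringType(), True),
                         StructField("Animaux", ArrayType(StringType(), True), True),
                         StructField("Animal", StringType(), True)])
    test = spark.createDataFrame(rows, schema)

    # True on rows where the Animal value appears in that row's Animaux array
    test.withColumn("present", expr("array_contains(Animaux, Animal)")).show()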

spark.sql vs SqlContext

笑着哭i submitted on 2019-12-02 02:14:08
Question: I have used SQL in Spark, for example:

    results = spark.sql("select * from ventas")

where ventas is a dataframe, previously registered as a table:

    df.createOrReplaceTempView('ventas')

But I have seen other ways of working with SQL in Spark, using the class SQLContext:

    df = sqlContext.sql("SELECT * FROM table")

What is the difference between the two? Thanks in advance.

Answer 1: SparkSession is the preferred way of working with the Spark object now. Both HiveContext and SQLContext are available as part of this single object, SparkSession. You are using the latest syntax by creating a view with df.createOrReplaceTempView('ventas').
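
A hedged illustration of the answer's point (the tiny DataFrame below is a stand-in, not data from the thread): SparkSession wraps the older SQLContext and HiveContext entry points, so once the view is registered both calls run the same query.

    from pyspark.sql import SparkSession, SQLContext

    spark = SparkSession.builder.appName("demo").getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
    df.createOrReplaceTempView("ventas")

    results_new = spark.sql("SELECT * FROM ventas")       # preferred entry point, Spark 2.x+
    sqlContext = SQLContext(spark.sparkContext)           # legacy entry point, still available
    results_old = sqlContext.sql("SELECT * FROM ventas")  # same result as above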

Read range of files in pySpark

蹲街弑〆低调 submitted on 2019-12-02 01:38:54
I need to read contiguous files in pySpark. The following works for me:

    from pyspark.sql import SQLContext
    file = "events.parquet/exportDay=2015090[1-7]"
    df = sqlContext.read.load(file)

How do I read files 8-14?

Answer (kathleen): Use curly braces.

    file = "events.parquet/exportDay=201509{08,09,10,11,12,13,14}"

Here's a similar question on Stack Overflow: Pyspark select subset of files using regex glob. They suggest either using curly braces, or performing multiple reads and then unioning the objects (whether they are RDDs or data frames or whatever, there should be some way).

Answer (Barry Loper): It uses shell
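
A runnable sketch of the curly-brace suggestion, assuming the question's exportDay=YYYYMMDD layout (the paths come from the question; the session setup is added here):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hadoop-style globs accept brace alternatives, so days 08-14 become a single read
    df = spark.read.load("events.parquet/exportDay=201509{08,09,10,11,12,13,14}")

    # Alternative: spell out each day explicitly; load() also accepts a list of paths
    paths = ["events.parquet/exportDay=201509%02d" % day for day in range(8, 15)]
    df_all = spark.read.load(paths)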

Join two DataFrames where the join key is different and only select some columns

筅森魡賤 submitted on 2019-12-02 00:10:41
Question: What I would like to do is: join two DataFrames A and B using their respective id columns a_id and b_id, and select all columns from A plus two specific columns from B. I tried something like what I put below with different quotation marks, but it is still not working. I feel that in pyspark there should be a simple way to do this.

    A_B = A.join(B, A.id == B.id).select(A.*, B.b1, B.b2)

I know you could write

    A_B = sqlContext.sql("SELECT A.*, B.b1, B.b2 FROM A JOIN B ON A.a_id = B.b_id")

to do this.
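
A hedged sketch of one DataFrame-API way to do this (A and B below are stand-in frames; only the column names a_id, b_id, b1 and b2 come from the question). Aliasing both sides lets "A.*" be used directly in select():

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()
    A = spark.createDataFrame([(1, "x"), (2, "y")], ["a_id", "a_val"])
    B = spark.createDataFrame([(1, 10, 100, "z"), (2, 20, 200, "w")],
                              ["b_id", "b1", "b2", "b3"])

    A_B = (A.alias("A")
            .join(B.alias("B"), col("A.a_id") == col("B.b_id"))
            .select("A.*", "B.b1", "B.b2"))
    A_B.show()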

PySpark java.io.IOException: No FileSystem for scheme: https

☆樱花仙子☆ submitted on 2019-12-01 22:39:27
I am running Spark locally on Windows and trying to load an XML file with the following Python code, but I am getting this error. Does anyone know how to resolve it? This is the code:

    df1 = sqlContext.read.format("xml").options(rowTag="IRS990EZ").load("https://irs-form-990.s3.amazonaws.com/201611339349202661_public.xml")

and this is the error:

    Py4JJavaError                             Traceback (most recent call last)
    <ipython-input-7-4832eb48a4aa> in <module>()
    ----> 1 df1 = sqlContext.read.format("xml").options(rowTag="IRS990EZ").load("https://irs-form-990.s3.amazonaws.com/201611339349202661_public.xml")
    C:\SPARK_HOME\spark-2.2.0
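
The excerpt ends inside the traceback. As a hedged workaround sketch (not necessarily the thread's accepted answer): Spark's readers resolve Hadoop filesystem schemes such as file://, hdfs:// or s3a://, not plain https, so one common fix is to download the file first and point the reader at the local copy. The local path below is an assumption, and the "xml" format still requires the spark-xml package on the classpath.

    import urllib.request
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    url = "https://irs-form-990.s3.amazonaws.com/201611339349202661_public.xml"
    local_path = "C:/tmp/201611339349202661_public.xml"   # assumed local download location
    urllib.request.urlretrieve(url, local_path)

    df1 = (spark.read.format("xml")
                .option("rowTag", "IRS990EZ")
                .load("file:///" + local_path))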

Working with jdbc jar in pyspark

拥有回忆 submitted on 2019-12-01 20:54:45
I need to read from a PostgreSQL database in pyspark. I know this has been asked before, such as here, here and many other places; however, the solutions there either use a jar in the local running directory or copy it to all workers manually. I downloaded the postgresql-9.4.1208 jar and placed it in /tmp/jars. I then proceeded to call pyspark with the --jars and --driver-class-path switches:

    pyspark --master yarn --jars /tmp/jars/postgresql-9.4.1208.jar --driver-class-path /tmp/jars/postgresql-9.4.1208.jar

Inside pyspark I did:

    df = sqlContext.read.format("jdbc").options(url="jdbc
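
For reference, a hedged sketch of what the JDBC read typically looks like once the driver jar is visible to the driver and executors (host, port, database, table and credentials below are placeholders, not values from the question):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = (spark.read.format("jdbc")
              .option("url", "jdbc:postgresql://dbhost:5432/mydb")  # placeholder host/db
              .option("dbtable", "public.my_table")                 # placeholder table
              .option("user", "myuser")                             # placeholder credentials
              .option("password", "mypassword")
              .option("driver", "org.postgresql.Driver")
              .load())
    df.printSchema()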

Spark Dataframe column with last character of other column

∥☆過路亽.° submitted on 2019-12-01 18:06:40
I'm looking for a way to get the last character from a string in a dataframe column and place it into another column. I have a Spark dataframe that looks like this:

    animal
    ======
    cat
    mouse
    snake

I want something like this:

    lastchar
    ========
    t
    e
    e

Right now I can do this with a UDF that looks like:

    def get_last_letter(animal):
        return animal[-1]

    get_last_letter_udf = udf(get_last_letter, StringType())
    df.select(get_last_letter_udf("animal").alias("lastchar")).show()

I'm mainly curious if there's a better way to do this without a UDF. Thanks!

Answer: Just use the substring function from pyspark.sql.functions
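
A short sketch of the UDF-free approach the answer points at (the DataFrame is rebuilt here from the question's sample data): substring with a negative start position counts from the end of the string, so (-1, 1) takes the last character.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import substring

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("cat",), ("mouse",), ("snake",)], ["animal"])

    df.select(substring(df.animal, -1, 1).alias("lastchar")).show()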