pyspark-sql

Use “IS IN” between 2 Spark dataframe columns

≯℡__Kan透↙ submitted on 2019-12-02 04:55:59
Question: I have the following dataframe:

    from pyspark.sql.types import *
    rdd = sc.parallelize([
        ('ALT', ['chien', 'chat'],   'oiseau'),
        ('ALT', ['oiseau'],          'oiseau'),
        ('TDR', ['poule', 'poulet'], 'poule'),
        ('ALT', ['ours'],            'chien'),
        ('ALT', ['paon'],            'tigre'),
        ('TDR', ['tigre', 'lion'],   'lion'),
        ('ALT', ['chat'],            'chien'),
    ])
    schema = StructType([StructField("ClientId", StringType(), True),
                         StructField("Animaux", ArrayType(StringType(), True), True),
                         StructField("Animal", StringType(), True)])
    test = rdd.toDF(schema)
    test.show()

How can I apply an "IS IN"-style test between the Animal column and the Animaux array column, row by row?
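
The excerpt cuts off before any answer appears. As a minimal sketch (not taken from the thread), one UDF-free way to test the Animal column against the Animaux array is the SQL array_contains function called through expr(), which allows the value being searched for to be another column:

    # Sketch only: reuses the question's schema with two of its rows.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import expr
    from pyspark.sql.types import StructType, StructField, StringType, ArrayType

    spark = SparkSession.builder.getOrCreate()
    rows = [('ALT', ['chien', 'chat'], 'oiseau'),
            ('TDR', ['poule', 'poulet'], 'poule')]
    schema = StructType([StructField("ClientId", StringType(), True),
                         StructField("Animaux", ArrayType(StringType(), True), True),
                         StructField("Animal", StringType(), True)])
    test = spark.createDataFrame(rows, schema)

    # True on rows where the Animal value appears in that row's Animaux array
    test.withColumn("present", expr("array_contains(Animaux, Animal)")).show()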

spark.sql vs SqlContext

笑着哭i submitted on 2019-12-02 02:14:08
Question: I have used SQL in Spark, for example:

    results = spark.sql("select * from ventas")

where ventas is a dataframe, previously registered as a table:

    df.createOrReplaceTempView('ventas')

But I have seen other ways of working with SQL in Spark, using the class SQLContext:

    df = sqlContext.sql("SELECT * FROM table")

What is the difference between the two? Thanks in advance.

Answer 1: SparkSession is the preferred way of working with the Spark object now. Both HiveContext and SQLContext are available as part of this single object, SparkSession. You are using the latest syntax by creating a view with df.createOrReplaceTempView('ventas').
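
A hedged illustration of the answer's point (the tiny DataFrame below is a stand-in, not data from the thread): SparkSession wraps the older SQLContext and HiveContext entry points, so once the view is registered both calls run the same query.

    from pyspark.sql import SparkSession, SQLContext

    spark = SparkSession.builder.appName("demo").getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
    df.createOrReplaceTempView("ventas")

    results_new = spark.sql("SELECT * FROM ventas")       # preferred entry point, Spark 2.x+
    sqlContext = SQLContext(spark.sparkContext)           # legacy entry point, still available
    results_old = sqlContext.sql("SELECT * FROM ventas")  # same result as above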

Read range of files in pySpark

蹲街弑〆低调 submitted on 2019-12-02 01:38:54
I need to read contiguous files in pySpark. The following works for me:

    from pyspark.sql import SQLContext
    file = "events.parquet/exportDay=2015090[1-7]"
    df = sqlContext.read.load(file)

How do I read files 8-14?

Answer (kathleen): Use curly braces.

    file = "events.parquet/exportDay=201509{08,09,10,11,12,13,14}"

Here's a similar question on Stack Overflow: Pyspark select subset of files using regex glob. They suggest either using curly braces, or performing multiple reads and then unioning the objects (whether they are RDDs or data frames or whatever, there should be some way).

Answer (Barry Loper): It uses shell
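
A runnable sketch of the curly-brace suggestion, assuming the question's exportDay=YYYYMMDD layout (the paths come from the question; the session setup is added here):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hadoop-style globs accept brace alternatives, so days 08-14 become a single read
    df = spark.read.load("events.parquet/exportDay=201509{08,09,10,11,12,13,14}")

    # Alternative: spell out each day explicitly; load() also accepts a list of paths
    paths = ["events.parquet/exportDay=201509%02d" % day for day in range(8, 15)]
    df_all = spark.read.load(paths)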

Join two DataFrames where the join key is different and only select some columns

筅森魡賤 submitted on 2019-12-02 00:10:41
Question: What I would like to do is: join two DataFrames A and B using their respective id columns a_id and b_id, and select all columns from A plus two specific columns from B. I tried something like what I put below with different quotation marks, but it is still not working. I feel that in pyspark there should be a simple way to do this.

    A_B = A.join(B, A.id == B.id).select(A.*, B.b1, B.b2)

I know you could write

    A_B = sqlContext.sql("SELECT A.*, B.b1, B.b2 FROM A JOIN B ON A.a_id = B.b_id")

to do this.
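
A hedged sketch of one DataFrame-API way to do this (A and B below are stand-in frames; only the column names a_id, b_id, b1 and b2 come from the question). Aliasing both sides lets "A.*" be used directly in select():

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()
    A = spark.createDataFrame([(1, "x"), (2, "y")], ["a_id", "a_val"])
    B = spark.createDataFrame([(1, 10, 100, "z"), (2, 20, 200, "w")],
                              ["b_id", "b1", "b2", "b3"])

    A_B = (A.alias("A")
            .join(B.alias("B"), col("A.a_id") == col("B.b_id"))
            .select("A.*", "B.b1", "B.b2"))
    A_B.show()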

PySpark java.io.IOException: No FileSystem for scheme: https

☆樱花仙子☆ submitted on 2019-12-01 22:39:27
I am running Spark locally on Windows and trying to load an XML file with the following Python code, but I am getting this error. Does anyone know how to resolve it? This is the code:

    df1 = sqlContext.read.format("xml").options(rowTag="IRS990EZ").load("https://irs-form-990.s3.amazonaws.com/201611339349202661_public.xml")

and this is the error:

    Py4JJavaError                             Traceback (most recent call last)
    <ipython-input-7-4832eb48a4aa> in <module>()
    ----> 1 df1 = sqlContext.read.format("xml").options(rowTag="IRS990EZ").load("https://irs-form-990.s3.amazonaws.com/201611339349202661_public.xml")
    C:\SPARK_HOME\spark-2.2.0
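
The excerpt ends inside the traceback. As a hedged workaround sketch (not necessarily the thread's accepted answer): Spark's readers resolve Hadoop filesystem schemes such as file://, hdfs:// or s3a://, not plain https, so one common fix is to download the file first and point the reader at the local copy. The local path below is an assumption, and the "xml" format still requires the spark-xml package on the classpath.

    import urllib.request
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    url = "https://irs-form-990.s3.amazonaws.com/201611339349202661_public.xml"
    local_path = "C:/tmp/201611339349202661_public.xml"   # assumed local download location
    urllib.request.urlretrieve(url, local_path)

    df1 = (spark.read.format("xml")
                .option("rowTag", "IRS990EZ")
                .load("file:///" + local_path))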

Working with jdbc jar in pyspark

拥有回忆 submitted on 2019-12-01 20:54:45
I need to read from a PostgreSQL database in pyspark. I know this has been asked before, such as here, here and many other places; however, the solutions there either use a jar in the local running directory or copy it to all workers manually. I downloaded the postgresql-9.4.1208 jar and placed it in /tmp/jars. I then proceeded to call pyspark with the --jars and --driver-class-path switches:

    pyspark --master yarn --jars /tmp/jars/postgresql-9.4.1208.jar --driver-class-path /tmp/jars/postgresql-9.4.1208.jar

Inside pyspark I did:

    df = sqlContext.read.format("jdbc").options(url="jdbc
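
For reference, a hedged sketch of what the JDBC read typically looks like once the driver jar is visible to the driver and executors (host, port, database, table and credentials below are placeholders, not values from the question):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = (spark.read.format("jdbc")
              .option("url", "jdbc:postgresql://dbhost:5432/mydb")  # placeholder host/db
              .option("dbtable", "public.my_table")                 # placeholder table
              .option("user", "myuser")                             # placeholder credentials
              .option("password", "mypassword")
              .option("driver", "org.postgresql.Driver")
              .load())
    df.printSchema()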

Spark Dataframe column with last character of other column

∥☆過路亽.° submitted on 2019-12-01 18:06:40
I'm looking for a way to get the last character from a string in a dataframe column and place it into another column. I have a Spark dataframe that looks like this:

    animal
    ======
    cat
    mouse
    snake

I want something like this:

    lastchar
    ========
    t
    e
    e

Right now I can do this with a UDF that looks like:

    def get_last_letter(animal):
        return animal[-1]

    get_last_letter_udf = udf(get_last_letter, StringType())
    df.select(get_last_letter_udf("animal").alias("lastchar")).show()

I'm mainly curious if there's a better way to do this without a UDF. Thanks!

Answer: Just use the substring function from pyspark.sql.functions
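
A short sketch of the UDF-free approach the answer points at (the DataFrame is rebuilt here from the question's sample data): substring with a negative start position counts from the end of the string, so (-1, 1) takes the last character.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import substring

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("cat",), ("mouse",), ("snake",)], ["animal"])

    df.select(substring(df.animal, -1, 1).alias("lastchar")).show()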