pyspark-sql

Trying to connect to Oracle from Spark

我们两清 submitted on 2019-12-04 14:33:16
I am trying to connect Spark to Oracle and want to pull data from some tables and run SQL queries, but I am not able to connect to Oracle. I have tried different workarounds, but no luck. I have followed the steps below; please correct me if I need to make any changes. I am on a Windows 7 machine, using PySpark from a Jupyter notebook, with Python 2.7 and Spark 2.1.0. I have set a Spark classpath in the environment variables:

    SPARK_CLASS_PATH = C:\Oracle\Product\11.2.0\client_1\jdbc\lib\ojdbc6.jar

    jdbcDF = sqlContext.read.format("jdbc").option("driver", "oracle.jdbc.driver.OracleDriver")
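For reference, a minimal sketch of a JDBC read that has worked in similar setups, assuming the ojdbc6.jar path above; the host, port, service name, table and credentials below are hypothetical placeholders, not values from the question:

    # Make sure the Oracle driver is on the driver classpath when launching, e.g.:
    #   pyspark --driver-class-path ojdbc6.jar --jars ojdbc6.jar
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("oracle-jdbc-read")
             .config("spark.jars", "C:/Oracle/Product/11.2.0/client_1/jdbc/lib/ojdbc6.jar")
             .getOrCreate())

    jdbcDF = (spark.read.format("jdbc")
              .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")   # hypothetical host/port/service
              .option("driver", "oracle.jdbc.driver.OracleDriver")
              .option("dbtable", "SCHEMA.SOME_TABLE")                  # hypothetical table
              .option("user", "scott")                                 # hypothetical credentials
              .option("password", "tiger")
              .load())

    jdbcDF.show(5)

If the driver jar is not visible to both the driver and the executors, the read typically fails with a "No suitable driver" or ClassNotFoundException error, so the --jars / --driver-class-path launch options usually matter more than the environment variable alone.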

JSON file parsing in Pyspark

不羁岁月 submitted on 2019-12-04 13:38:41
Question: I am very new to PySpark. I tried parsing a JSON file using the following code:

    from pyspark.sql import SQLContext
    sqlContext = SQLContext(sc)
    df = sqlContext.read.json("file:///home/malwarehunter/Downloads/122116-path.json")
    df.printSchema()

The output is as follows:

    root
     |-- _corrupt_record: string (nullable = true)

df.show() produces output that looks like this:

    +--------------------+
    |     _corrupt_record|
    +--------------------+
    |                   {|
    |       "time1":"2...|
    |      "time2":"201...|
    |         "step":0.5,|
    |             "xyz":[|
    |                   {|
    |
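A schema containing only _corrupt_record usually means the file is pretty-printed (multi-line) JSON, while spark.read.json expects one JSON object per line by default. A minimal sketch of two common workarounds, assuming the same file path as in the question (the multiLine option requires Spark 2.2 or later; the wholeTextFiles route also works on older versions):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    path = "file:///home/malwarehunter/Downloads/122116-path.json"

    # Option 1 (Spark 2.2+): allow a single JSON record to span multiple lines.
    df_a = spark.read.option("multiLine", True).json(path)

    # Option 2 (older Spark): read the whole file as one string and parse that.
    raw = spark.sparkContext.wholeTextFiles(path).map(lambda kv: kv[1])
    df_b = spark.read.json(raw)

    df_a.printSchema()

Either way, the real field names (time1, time2, step, xyz, ...) should then appear in the schema instead of _corrupt_record.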

Extract date from a string column containing timestamp in Pyspark

你。 submitted on 2019-12-04 12:48:55
Question: I have a dataframe which has a date in the following format:

    +----------------------+
    |date                  |
    +----------------------+
    |May 6, 2016 5:59:34 AM|
    +----------------------+

I intend to extract the date from this in the format YYYY-MM-DD, so the result for the above date should be 2016-05-06. But when I extract it using the following:

    df.withColumn('part_date', from_unixtime(unix_timestamp(df.date, "MMM dd, YYYY hh:mm:ss aa"), "yyyy-MM-dd"))

I get the date 2015-12-27. Can anyone please
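The likely culprit is the uppercase YYYY in the parse pattern: in the Java SimpleDateFormat patterns that Spark uses here, YYYY is the week-based year while yyyy is the calendar year, which is exactly the kind of mismatch that produces shifted dates such as 2015-12-27. A minimal sketch with the corrected pattern, assuming a single-column dataframe like the one shown (tested behaviour is for Spark 2.x-style date parsing):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_unixtime, unix_timestamp

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("May 6, 2016 5:59:34 AM",)], ["date"])

    # Use lowercase yyyy (calendar year) when parsing, not YYYY (week-based year).
    result = df.withColumn(
        "part_date",
        from_unixtime(unix_timestamp(df.date, "MMM dd, yyyy hh:mm:ss aa"), "yyyy-MM-dd"))

    result.show(truncate=False)   # part_date -> 2016-05-06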

Create a dataframe from a list in pyspark.sql

北慕城南 submitted on 2019-12-04 10:26:18
I am totally lost in a weird situation. I have a list li:

    li = example_data.map(lambda x: get_labeled_prediction(w, x)).collect()
    print li, type(li)

The output is like:

    [(0.0, 59.0), (0.0, 51.0), (0.0, 81.0), (0.0, 8.0), (0.0, 86.0), (0.0, 86.0), (0.0, 60.0), (0.0, 54.0), (0.0, 54.0), (0.0, 84.0)] <type 'list'>

When I try to create a dataframe from this list:

    m = sqlContext.createDataFrame(l, ["prediction", "label"])

it throws this error message:

    TypeError                                 Traceback (most recent call last)
    <ipython-input-90-4a49f7f67700> in <module>()
         56 l = example_data.map(lambda x: get_labeled_prediction(w,x
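A common cause of this TypeError is that the values returned by get_labeled_prediction are numpy.float64 rather than plain Python floats, which Spark's schema inference does not accept. A minimal sketch of that workaround, reusing example_data, w, get_labeled_prediction and sqlContext from the question (whether this is the actual cause depends on what get_labeled_prediction returns):

    # Cast each element to a built-in float before handing the list to Spark.
    li = (example_data
          .map(lambda x: get_labeled_prediction(w, x))
          .map(lambda t: (float(t[0]), float(t[1])))
          .collect())

    m = sqlContext.createDataFrame(li, ["prediction", "label"])
    m.show()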

How to add sparse vectors after group by, using Spark SQL?

流过昼夜 submitted on 2019-12-04 08:09:56
I am building a news recommendation system and I need to build a table of users and the news they read. My raw data looks like this:

    001436800277225 ["9161492","9161787","9378531"]
    009092130698762 ["9394697"]
    010003000431538 ["9394697","9426473","9428530"]
    010156461231357 ["9350394","9414181"]
    010216216021063 ["9173862","9247870"]
    010720006581483 ["9018786"]
    011199797794333 ["9017977","9091134","9142852","9325464","9331913"]
    011337201765123 ["9161294","9198693"]
    011414545455156 ["9168185","9178348","9182782","9359776"]
    011425002581540 ["9083446","9161294","9309432"]

and I use Spark SQL to explode
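Spark SQL has no built-in aggregate for Vector columns, so one common pattern for "summing sparse vectors after a group by" is to collect the vectors per group and add them up in a UDF. A minimal sketch under the assumption that there is already a dataframe df with a user id column "user_id" and a SparseVector column "features" of known dimensionality (the column names and VEC_SIZE below are hypothetical, not from the question):

    from pyspark.ml.linalg import Vectors, VectorUDT
    from pyspark.sql.functions import collect_list, udf

    VEC_SIZE = 10  # hypothetical dimensionality of the news-id space

    def sum_vectors(vectors):
        # Sum a list of SparseVectors element-wise into one SparseVector.
        totals = {}
        for v in vectors:
            for i, val in zip(v.indices, v.values):
                totals[i] = totals.get(i, 0.0) + val
        return Vectors.sparse(VEC_SIZE, sorted(totals.items()))

    sum_vectors_udf = udf(sum_vectors, VectorUDT())

    aggregated = (df.groupBy("user_id")
                    .agg(collect_list("features").alias("vecs"))
                    .withColumn("features", sum_vectors_udf("vecs"))
                    .drop("vecs"))

For very large groups an RDD-level aggregateByKey can be cheaper than collect_list, but the UDF version is usually the simplest starting point.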

Pyspark: Filter dataframe based on multiple conditions

狂风中的少年 submitted on 2019-12-03 16:26:54
Question: I want to filter the dataframe according to the following conditions: firstly, d < 5, and secondly, the value of col2 must not equal its counterpart in col4 whenever the value in col1 equals its counterpart in col3. If the original dataframe DF is as follows:

    +----+----+----+----+---+
    |col1|col2|col3|col4|  d|
    +----+----+----+----+---+
    |   A|  xx|   D|  vv|  4|
    |   C| xxx|   D|  vv| 10|
    |   A|   x|   A|  xx|  3|
    |   E| xxx|   B|  vv|  3|
    |   E| xxx|   F| vvv|  6|
    |   F|xxxx|   F| vvv|  4|
    |   G| xxx|   G| xxx|  4|
    |   G| xxx|   G|  xx|  4|
    |   G| xxx|   G| xxx| 12|
    |   B
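Reading the second condition as "if col1 == col3 then col2 must differ from col4" (i.e. rows where col1 == col3 and col2 == col4 are dropped), a minimal sketch of the combined filter is below. Note the parentheses around each comparison and the bitwise &/| operators, which PySpark column expressions require instead of Python's and/or:

    from pyspark.sql.functions import col

    filtered = DF.filter(
        (col("d") < 5) &
        ((col("col1") != col("col3")) | (col("col2") != col("col4")))
    )
    filtered.show()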

Question about joining dataframes in Spark

社会主义新天地 submitted on 2019-12-03 16:17:51
Suppose I have two partitioned dataframes:

    df1 = spark.createDataFrame(
        [(x, x, x) for x in range(5)], ['key1', 'key2', 'time']
    ).repartition(3, 'key1', 'key2')

    df2 = spark.createDataFrame(
        [(x, x, x) for x in range(7)], ['key1', 'key2', 'time']
    ).repartition(3, 'key1', 'key2')

(Scenario 1) If I join them by [key1, key2], the join operation is performed within each partition without a shuffle (the number of partitions in the result dataframe is the same):

    x = df1.join(df2, on=['key1', 'key2'], how='left')
    assert x.rdd.getNumPartitions() == 3

(Scenario 2) But if I join them by [key1, key2, time], a shuffle
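A sketch of how one might inspect and, under some assumptions, avoid the shuffle in scenario 2: the existing hash partitioning on (key1, key2) does not satisfy the distribution Spark requires when the join keys widen to (key1, key2, time), so an Exchange appears in the plan; pre-partitioning both sides on the full join key set can remove it, although whether the optimizer actually elides the exchange depends on the Spark version and on broadcast-join thresholds for small inputs:

    # Scenario 2: joining on a wider key set than the one used for partitioning.
    y = df1.join(df2, on=['key1', 'key2', 'time'], how='left')
    y.explain()   # expect an Exchange (shuffle) before the join in this plan

    # Pre-partition both sides on all three join keys, then join.
    df1b = df1.repartition(3, 'key1', 'key2', 'time')
    df2b = df2.repartition(3, 'key1', 'key2', 'time')
    z = df1b.join(df2b, on=['key1', 'key2', 'time'], how='left')
    z.explain()   # the extra Exchange before the join should be gone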

PySpark - Add a new column with a Rank by User

£可爱£侵袭症+ submitted on 2019-12-03 15:12:21
I have this PySpark DataFrame:

    df = pd.DataFrame(np.array([
        ["aa@gmail.com", 2, 3],
        ["aa@gmail.com", 5, 5],
        ["bb@gmail.com", 8, 2],
        ["cc@gmail.com", 9, 3]
    ]), columns=['user', 'movie', 'rating'])
    sparkdf = sqlContext.createDataFrame(df, samplingRatio=0.1)

    user          movie  rating
    aa@gmail.com  2      3
    aa@gmail.com  5      5
    bb@gmail.com  8      2
    cc@gmail.com  9      3

I need to add a new column with a rank by user. I want to have this output:

    user          movie  rating  Rank
    aa@gmail.com  2      3       1
    aa@gmail.com  5      5       1
    bb@gmail.com  8      2       2
    cc@gmail.com  9      3       3

How can I do that?

There is really no elegant solution here as of now. If you have to, you can try
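One way to get exactly that output, which may or may not be what the truncated answer goes on to describe, is a dense_rank over a window ordered by user. Note that an unpartitioned window pulls all rows into a single partition, so this is only a sketch for small data:

    from pyspark.sql import Window
    from pyspark.sql.functions import dense_rank

    w = Window.orderBy("user")   # no partitionBy: fine for small data only
    ranked = sparkdf.withColumn("Rank", dense_rank().over(w))
    ranked.show()

Each distinct user gets the same Rank on all of its rows (aa -> 1, bb -> 2, cc -> 3), matching the desired output.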

Evaluating Spark DataFrame in loop slows down with every iteration, all work done by controller

巧了我就是萌 submitted on 2019-12-03 13:01:55
I am trying to use a Spark cluster (running on AWS EMR) to link groups of items that have common elements in them. Essentially, I have groups of elements, and if some of the elements appear in multiple groups, I want to make one group that contains the elements from all of those groups. I know about the GraphX library, and I tried to use the graphframes package (its ConnectedComponents algorithm) to solve this task, but it seems that the graphframes package is not yet mature enough and is very wasteful with resources. Running it on my data set (circa 60 GB) it just runs out of memory no matter how much I
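When a DataFrame is rebuilt from itself inside a loop, its logical plan grows every iteration and the driver spends more and more time analysing and optimising it, which matches the symptom in the title (each iteration slower, all work on the controller). A minimal sketch of the usual mitigation, truncating the lineage each iteration with checkpoint; the checkpoint directory, loop bound, linking step and convergence check below are placeholders for whatever the real algorithm does, not code from the question:

    spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")  # hypothetical directory

    df = initial_groups_df              # placeholder for the real input DataFrame
    for i in range(max_iterations):     # placeholder iteration bound
        df = one_linking_step(df)       # placeholder per-iteration transformation

        # Materialise the result and cut the lineage so the plan does not grow without bound.
        df = df.checkpoint(eager=True)

        if converged(df):               # placeholder convergence check
            break

localCheckpoint() is a cheaper alternative when fault tolerance of the intermediate results is not needed.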

Trim string column in PySpark dataframe

风格不统一 submitted on 2019-12-03 09:58:45
I'm a beginner with Python and Spark. After creating a DataFrame from a CSV file, I would like to know how I can trim a column. I've tried:

    df = df.withColumn("Product", df.Product.strip())

df is my data frame and Product is a column in my table, but I always see the error: Column object is not callable. Do you have any suggestions?

Starting from version 1.5, Spark SQL provides two specific functions for trimming white space, ltrim and rtrim (search for "trim" in the DataFrame documentation); you'll need to import pyspark.sql.functions first. Here is an example:

    from pyspark.sql import SQLContext
    from
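The answer's example is cut off above; a minimal sketch along the same lines, assuming a dataframe with a "Product" column as in the question (there is also a plain trim function that strips both sides):

    from pyspark.sql.functions import trim, ltrim, rtrim

    # strip() is a Python str method, not a Column method, hence the
    # "Column object is not callable" error; use the SQL functions instead.
    df = df.withColumn("Product", trim(df.Product))       # both sides
    # df = df.withColumn("Product", ltrim(df.Product))    # leading whitespace only
    # df = df.withColumn("Product", rtrim(df.Product))    # trailing whitespace only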