pyspark-sql

Trying to connect to Oracle from Spark

我们两清 submitted on 2019-12-04 14:33:16
I am trying to connect Spark to Oracle and want to pull data from some tables and run SQL queries, but I am not able to connect to Oracle. I have tried different workarounds, but no luck. I have followed the steps below; please correct me if I need to make any changes. I am on a Windows 7 machine, using PySpark from a Jupyter notebook, with Python 2.7 and Spark 2.1.0. I have set a Spark classpath in the environment variables:

    SPARK_CLASS_PATH = C:\Oracle\Product\11.2.0\client_1\jdbc\lib\ojdbc6.jar

    jdbcDF = sqlContext.read.format("jdbc").option("driver", "oracle.jdbc.driver.OracleDriver")
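For reference, a minimal sketch of a JDBC read that has worked in similar setups, assuming the ojdbc6.jar path above; the host, port, service name, table and credentials below are hypothetical placeholders, not values from the question:

    # Make sure the Oracle driver is on the driver classpath when launching, e.g.:
    #   pyspark --driver-class-path ojdbc6.jar --jars ojdbc6.jar
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("oracle-jdbc-read")
             .config("spark.jars", "C:/Oracle/Product/11.2.0/client_1/jdbc/lib/ojdbc6.jar")
             .getOrCreate())

    jdbcDF = (spark.read.format("jdbc")
              .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")   # hypothetical host/port/service
              .option("driver", "oracle.jdbc.driver.OracleDriver")
              .option("dbtable", "SCHEMA.SOME_TABLE")                  # hypothetical table
              .option("user", "scott")                                 # hypothetical credentials
              .option("password", "tiger")
              .load())

    jdbcDF.show(5)

If the driver jar is not visible to both the driver and the executors, the read typically fails with a "No suitable driver" or ClassNotFoundException error, so the --jars / --driver-class-path launch options usually matter more than the environment variable alone.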

JSON file parsing in Pyspark

不羁岁月 submitted on 2019-12-04 13:38:41
Question: I am very new to PySpark. I tried parsing a JSON file using the following code:

    from pyspark.sql import SQLContext
    sqlContext = SQLContext(sc)
    df = sqlContext.read.json("file:///home/malwarehunter/Downloads/122116-path.json")
    df.printSchema()

The output is as follows:

    root
     |-- _corrupt_record: string (nullable = true)

df.show() produces output that looks like this:

    +--------------------+
    |     _corrupt_record|
    +--------------------+
    |                   {|
    |       "time1":"2...|
    |      "time2":"201...|
    |         "step":0.5,|
    |             "xyz":[|
    |                   {|
    |
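A schema containing only _corrupt_record usually means the file is pretty-printed (multi-line) JSON, while spark.read.json expects one JSON object per line by default. A minimal sketch of two common workarounds, assuming the same file path as in the question (the multiLine option requires Spark 2.2 or later; the wholeTextFiles route also works on older versions):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    path = "file:///home/malwarehunter/Downloads/122116-path.json"

    # Option 1 (Spark 2.2+): allow a single JSON record to span multiple lines.
    df_a = spark.read.option("multiLine", True).json(path)

    # Option 2 (older Spark): read the whole file as one string and parse that.
    raw = spark.sparkContext.wholeTextFiles(path).map(lambda kv: kv[1])
    df_b = spark.read.json(raw)

    df_a.printSchema()

Either way, the real field names (time1, time2, step, xyz, ...) should then appear in the schema instead of _corrupt_record.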

Extract date from a string column containing timestamp in Pyspark

你。 submitted on 2019-12-04 12:48:55
Question: I have a dataframe which has a date in the following format:

    +----------------------+
    |date                  |
    +----------------------+
    |May 6, 2016 5:59:34 AM|
    +----------------------+

I intend to extract the date from this in the format YYYY-MM-DD, so the result for the above date should be 2016-05-06. But when I extract it using the following:

    df.withColumn('part_date', from_unixtime(unix_timestamp(df.date, "MMM dd, YYYY hh:mm:ss aa"), "yyyy-MM-dd"))

I get the date 2015-12-27. Can anyone please
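The likely culprit is the uppercase YYYY in the parse pattern: in the Java SimpleDateFormat patterns that Spark uses here, YYYY is the week-based year while yyyy is the calendar year, which is exactly the kind of mismatch that produces shifted dates such as 2015-12-27. A minimal sketch with the corrected pattern, assuming a single-column dataframe like the one shown (tested behaviour is for Spark 2.x-style date parsing):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_unixtime, unix_timestamp

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("May 6, 2016 5:59:34 AM",)], ["date"])

    # Use lowercase yyyy (calendar year) when parsing, not YYYY (week-based year).
    result = df.withColumn(
        "part_date",
        from_unixtime(unix_timestamp(df.date, "MMM dd, yyyy hh:mm:ss aa"), "yyyy-MM-dd"))

    result.show(truncate=False)   # part_date -> 2016-05-06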

Create a dataframe from a list in pyspark.sql

北慕城南 submitted on 2019-12-04 10:26:18
I am totally lost in a weird situation. I have a list li:

    li = example_data.map(lambda x: get_labeled_prediction(w, x)).collect()
    print li, type(li)

The output is like:

    [(0.0, 59.0), (0.0, 51.0), (0.0, 81.0), (0.0, 8.0), (0.0, 86.0), (0.0, 86.0), (0.0, 60.0), (0.0, 54.0), (0.0, 54.0), (0.0, 84.0)] <type 'list'>

When I try to create a dataframe from this list:

    m = sqlContext.createDataFrame(l, ["prediction", "label"])

it throws this error message:

    TypeError                                 Traceback (most recent call last)
    <ipython-input-90-4a49f7f67700> in <module>()
         56 l = example_data.map(lambda x: get_labeled_prediction(w,x
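A common cause of this TypeError is that the values returned by get_labeled_prediction are numpy.float64 rather than plain Python floats, which Spark's schema inference does not accept. A minimal sketch of that workaround, reusing example_data, w, get_labeled_prediction and sqlContext from the question (whether this is the actual cause depends on what get_labeled_prediction returns):

    # Cast each element to a built-in float before handing the list to Spark.
    li = (example_data
          .map(lambda x: get_labeled_prediction(w, x))
          .map(lambda t: (float(t[0]), float(t[1])))
          .collect())

    m = sqlContext.createDataFrame(li, ["prediction", "label"])
    m.show()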

How to add sparse vectors after group by, using Spark SQL?

流过昼夜 submitted on 2019-12-04 08:09:56
I am building a news recommendation system and I need to build a table of users and the news they read. My raw data looks like this:

    001436800277225 ["9161492","9161787","9378531"]
    009092130698762 ["9394697"]
    010003000431538 ["9394697","9426473","9428530"]
    010156461231357 ["9350394","9414181"]
    010216216021063 ["9173862","9247870"]
    010720006581483 ["9018786"]
    011199797794333 ["9017977","9091134","9142852","9325464","9331913"]
    011337201765123 ["9161294","9198693"]
    011414545455156 ["9168185","9178348","9182782","9359776"]
    011425002581540 ["9083446","9161294","9309432"]

and I use Spark SQL to explode
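Spark SQL has no built-in aggregate for Vector columns, so one common pattern for "summing sparse vectors after a group by" is to collect the vectors per group and add them up in a UDF. A minimal sketch under the assumption that there is already a dataframe df with a user id column "user_id" and a SparseVector column "features" of known dimensionality (the column names and VEC_SIZE below are hypothetical, not from the question):

    from pyspark.ml.linalg import Vectors, VectorUDT
    from pyspark.sql.functions import collect_list, udf

    VEC_SIZE = 10  # hypothetical dimensionality of the news-id space

    def sum_vectors(vectors):
        # Sum a list of SparseVectors element-wise into one SparseVector.
        totals = {}
        for v in vectors:
            for i, val in zip(v.indices, v.values):
                totals[i] = totals.get(i, 0.0) + val
        return Vectors.sparse(VEC_SIZE, sorted(totals.items()))

    sum_vectors_udf = udf(sum_vectors, VectorUDT())

    aggregated = (df.groupBy("user_id")
                    .agg(collect_list("features").alias("vecs"))
                    .withColumn("features", sum_vectors_udf("vecs"))
                    .drop("vecs"))

For very large groups an RDD-level aggregateByKey can be cheaper than collect_list, but the UDF version is usually the simplest starting point.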

Pyspark: Filter dataframe based on multiple conditions

狂风中的少年 submitted on 2019-12-03 16:26:54
Question: I want to filter the dataframe according to the following conditions: firstly, d < 5, and secondly, the value of col2 must not equal its counterpart in col4 whenever the value in col1 equals its counterpart in col3. If the original dataframe DF is as follows:

    +----+----+----+----+---+
    |col1|col2|col3|col4|  d|
    +----+----+----+----+---+
    |   A|  xx|   D|  vv|  4|
    |   C| xxx|   D|  vv| 10|
    |   A|   x|   A|  xx|  3|
    |   E| xxx|   B|  vv|  3|
    |   E| xxx|   F| vvv|  6|
    |   F|xxxx|   F| vvv|  4|
    |   G| xxx|   G| xxx|  4|
    |   G| xxx|   G|  xx|  4|
    |   G| xxx|   G| xxx| 12|
    |   B
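Reading the second condition as "if col1 == col3 then col2 must differ from col4" (i.e. rows where col1 == col3 and col2 == col4 are dropped), a minimal sketch of the combined filter is below. Note the parentheses around each comparison and the bitwise &/| operators, which PySpark column expressions require instead of Python's and/or:

    from pyspark.sql.functions import col

    filtered = DF.filter(
        (col("d") < 5) &
        ((col("col1") != col("col3")) | (col("col2") != col("col4")))
    )
    filtered.show()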

Question about joining dataframes in Spark

社会主义新天地 submitted on 2019-12-03 16:17:51
Suppose I have two partitioned dataframes:

    df1 = spark.createDataFrame(
        [(x, x, x) for x in range(5)], ['key1', 'key2', 'time']
    ).repartition(3, 'key1', 'key2')

    df2 = spark.createDataFrame(
        [(x, x, x) for x in range(7)], ['key1', 'key2', 'time']
    ).repartition(3, 'key1', 'key2')

(Scenario 1) If I join them by [key1, key2], the join operation is performed within each partition without a shuffle (the number of partitions in the result dataframe is the same):

    x = df1.join(df2, on=['key1', 'key2'], how='left')
    assert x.rdd.getNumPartitions() == 3

(Scenario 2) But if I join them by [key1, key2, time], a shuffle
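A sketch of how one might inspect and, under some assumptions, avoid the shuffle in scenario 2: the existing hash partitioning on (key1, key2) does not satisfy the distribution Spark requires when the join keys widen to (key1, key2, time), so an Exchange appears in the plan; pre-partitioning both sides on the full join key set can remove it, although whether the optimizer actually elides the exchange depends on the Spark version and on broadcast-join thresholds for small inputs:

    # Scenario 2: joining on a wider key set than the one used for partitioning.
    y = df1.join(df2, on=['key1', 'key2', 'time'], how='left')
    y.explain()   # expect an Exchange (shuffle) before the join in this plan

    # Pre-partition both sides on all three join keys, then join.
    df1b = df1.repartition(3, 'key1', 'key2', 'time')
    df2b = df2.repartition(3, 'key1', 'key2', 'time')
    z = df1b.join(df2b, on=['key1', 'key2', 'time'], how='left')
    z.explain()   # the extra Exchange before the join should be gone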

PySpark - Add a new column with a Rank by User

£可爱£侵袭症+ submitted on 2019-12-03 15:12:21
I have this PySpark DataFrame:

    df = pd.DataFrame(np.array([
        ["aa@gmail.com", 2, 3],
        ["aa@gmail.com", 5, 5],
        ["bb@gmail.com", 8, 2],
        ["cc@gmail.com", 9, 3]
    ]), columns=['user', 'movie', 'rating'])
    sparkdf = sqlContext.createDataFrame(df, samplingRatio=0.1)

    user          movie  rating
    aa@gmail.com  2      3
    aa@gmail.com  5      5
    bb@gmail.com  8      2
    cc@gmail.com  9      3

I need to add a new column with a rank by user. I want to have this output:

    user          movie  rating  Rank
    aa@gmail.com  2      3       1
    aa@gmail.com  5      5       1
    bb@gmail.com  8      2       2
    cc@gmail.com  9      3       3

How can I do that?

There is really no elegant solution here as of now. If you have to, you can try
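One way to get exactly that output, which may or may not be what the truncated answer goes on to describe, is a dense_rank over a window ordered by user. Note that an unpartitioned window pulls all rows into a single partition, so this is only a sketch for small data:

    from pyspark.sql import Window
    from pyspark.sql.functions import dense_rank

    w = Window.orderBy("user")   # no partitionBy: fine for small data only
    ranked = sparkdf.withColumn("Rank", dense_rank().over(w))
    ranked.show()

Each distinct user gets the same Rank on all of its rows (aa -> 1, bb -> 2, cc -> 3), matching the desired output.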

Evaluating Spark DataFrame in loop slows down with every iteration, all work done by controller

巧了我就是萌 submitted on 2019-12-03 13:01:55
I am trying to use a Spark cluster (running on AWS EMR) to link groups of items that have common elements in them. Essentially, I have groups of elements, and if some of the elements appear in multiple groups, I want to make one group that contains the elements from all of those groups. I know about the GraphX library, and I tried to use the graphframes package (its ConnectedComponents algorithm) to solve this task, but it seems that the graphframes package is not yet mature enough and is very wasteful with resources. Running it on my data set (circa 60 GB) it just runs out of memory no matter how much I
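When a DataFrame is rebuilt from itself inside a loop, its logical plan grows every iteration and the driver spends more and more time analysing and optimising it, which matches the symptom in the title (each iteration slower, all work on the controller). A minimal sketch of the usual mitigation, truncating the lineage each iteration with checkpoint; the checkpoint directory, loop bound, linking step and convergence check below are placeholders for whatever the real algorithm does, not code from the question:

    spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")  # hypothetical directory

    df = initial_groups_df              # placeholder for the real input DataFrame
    for i in range(max_iterations):     # placeholder iteration bound
        df = one_linking_step(df)       # placeholder per-iteration transformation

        # Materialise the result and cut the lineage so the plan does not grow without bound.
        df = df.checkpoint(eager=True)

        if converged(df):               # placeholder convergence check
            break

localCheckpoint() is a cheaper alternative when fault tolerance of the intermediate results is not needed.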

Trim string column in PySpark dataframe

风格不统一 submitted on 2019-12-03 09:58:45
I'm a beginner with Python and Spark. After creating a DataFrame from a CSV file, I would like to know how I can trim a column. I've tried:

    df = df.withColumn("Product", df.Product.strip())

df is my data frame and Product is a column in my table, but I always see the error: Column object is not callable. Do you have any suggestions?

Starting from version 1.5, Spark SQL provides two specific functions for trimming white space, ltrim and rtrim (search for "trim" in the DataFrame documentation); you'll need to import pyspark.sql.functions first. Here is an example:

    from pyspark.sql import SQLContext
    from
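The answer's example is cut off above; a minimal sketch along the same lines, assuming a dataframe with a "Product" column as in the question (there is also a plain trim function that strips both sides):

    from pyspark.sql.functions import trim, ltrim, rtrim

    # strip() is a Python str method, not a Column method, hence the
    # "Column object is not callable" error; use the SQL functions instead.
    df = df.withColumn("Product", trim(df.Product))       # both sides
    # df = df.withColumn("Product", ltrim(df.Product))    # leading whitespace only
    # df = df.withColumn("Product", rtrim(df.Product))    # trailing whitespace only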