pyspark-sql

pyspark mysql jdbc load An error occurred while calling o23.load No suitable driver

Submitted by 余生长醉 on 2019-11-26 17:12:32
Question: I use the docker image sequenceiq/spark on my Mac to study these Spark examples. During the study process I upgraded the Spark inside that image to 1.6.1 according to this answer, and the error occurred when I started the Simple Data Operations example. Here is what happened: when I run df = sqlContext.read.format("jdbc").option("url",url).option("dbtable","people").load() it raises an error, and the full stack trace from the pyspark console is as follows: Python 2.6.6 (r266:84292, Jul 23 2015, 15:22:56)
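The "No suitable driver" error usually means the MySQL JDBC connector is not on Spark's classpath. A minimal sketch of one common fix, assuming a MySQL database and a Connector/J JAR downloaded locally (the JAR path, URL, and credentials below are placeholders):

    # Start pyspark with the connector JAR on the classpath, e.g.:
    #   pyspark --jars /path/to/mysql-connector-java-5.1.38-bin.jar
    # then name the driver class explicitly when loading:
    df = (sqlContext.read.format("jdbc")
          .option("url", "jdbc:mysql://hostname:3306/dbname?user=xxx&password=xxx")
          .option("dbtable", "people")
          .option("driver", "com.mysql.jdbc.Driver")  # JDBC driver class to load
          .load())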

How to change dataframe column names in pyspark?

Submitted by ☆樱花仙子☆ on 2019-11-26 17:00:44
I come from a pandas background and am used to reading data from CSV files into a dataframe and then simply changing the column names to something useful using the simple command: df.columns = new_column_name_list However, the same doesn't work for pyspark dataframes created using sqlContext. The only solution I could figure out to do this easily is the following: df = sqlContext.read.format("com.databricks.spark.csv").options(header='false', inferschema='true', delimiter='\t').load("data.txt") oldSchema = df.schema for i,k in enumerate(oldSchema.fields): k.name = new_column_name_list[i] df =
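A minimal sketch of two common ways to rename columns on such a DataFrame, assuming new_column_name_list has one entry per column as in the question (the single-column names below are placeholders):

    # Rename every column at once by building a new DataFrame
    df = df.toDF(*new_column_name_list)

    # Or rename one column at a time
    df = df.withColumnRenamed("_c0", "first_useful_name")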

Cannot find col function in pyspark

Submitted by 佐手、 on 2019-11-26 15:44:35
Question: In pyspark 1.6.2 I can import the col function with from pyspark.sql.functions import col, but when I try to look it up in the GitHub source code I find no col function in the functions.py file. How can Python import a function that doesn't exist? Answer 1: It exists. It just isn't explicitly defined. Functions exported from pyspark.sql.functions are thin wrappers around JVM code and, with a few exceptions which require special treatment, are generated automatically using helper methods. If you carefully
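As a rough illustration of that answer (a simplified sketch, not the actual pyspark source), module-level functions like col can be produced by a factory and injected into the module namespace at import time, so they are importable even though no def col(...) appears in functions.py:

    from pyspark import SparkContext
    from pyspark.sql.column import Column

    def _create_function(name):
        # Build a thin wrapper that forwards to the JVM function of the same name
        def _(col_name):
            sc = SparkContext._active_spark_context
            jc = getattr(sc._jvm.functions, name)(col_name)
            return Column(jc)
        _.__name__ = name
        return _

    # Inject the generated wrappers into the module's namespace
    for _name in ["col", "lit", "asc", "desc"]:
        globals()[_name] = _create_function(_name)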

More than one hour to execute pyspark.sql.DataFrame.take(4)

Submitted by 你。 on 2019-11-26 06:48:29
Question: I am running Spark 1.6 on 3 VMs (i.e. 1x master, 2x slaves), all with 4 cores and 16 GB RAM. I can see the workers registered on the spark-master web UI. I want to retrieve data from my Vertica database to work on it. As I didn't manage to run complex queries, I tried dummy queries to understand. We consider here an easy task. My code is: df = sqlContext.read.format('jdbc').options(url='xxxx', dbtable='xxx', user='xxxx', password='xxxx').load() four = df.take(4) And the output is (note: I
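One hedged sketch of a workaround: with a plain dbtable option, Spark can end up pulling far more of the table than the four rows requested, so pushing the limit down to the database as a subquery keeps the work on the Vertica side. The connection details below are placeholders:

    df = (sqlContext.read.format("jdbc")
          .options(url="jdbc:vertica://host:5433/dbname",
                   dbtable="(SELECT * FROM my_table LIMIT 4) AS tmp",  # limit pushed to the database
                   user="xxxx", password="xxxx")
          .load())
    four = df.take(4)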

Split Spark Dataframe string column into multiple columns

Submitted by 余生颓废 on 2019-11-26 03:54:41
Question: I've seen various people suggesting that Dataframe.explode is a useful way to do this, but it results in more rows than the original dataframe, which isn't what I want at all. I simply want to do the Dataframe equivalent of the very simple: rdd.map(lambda row: row + [row.my_str_col.split('-')]) which takes something looking like:

    col1 | my_str_col
    -----+-----------
      18 | 856-yygrm
     201 | 777-psgdg

and converts it to this:

    col1 | my_str_col | _col3 | _col4
    -----+------------+-------+------
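A minimal sketch of the usual approach without explode, assuming the values always split into two parts on "-":

    from pyspark.sql.functions import split

    split_col = split(df["my_str_col"], "-")             # array column, e.g. ["856", "yygrm"]
    df = (df.withColumn("_col3", split_col.getItem(0))   # first piece
            .withColumn("_col4", split_col.getItem(1)))  # second piece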

Convert pyspark string to date format

Submitted by 折月煮酒 on 2019-11-25 23:49:37
Question: I have a pyspark dataframe with a string column in the format MM-dd-yyyy and I am attempting to convert this into a date column. I tried: df.select(to_date(df.STRING_COLUMN).alias('new_date')).show() and I get a column of nulls. Can anyone help? Answer 1: It is possible (preferable?) to do this without a udf: from pyspark.sql.functions import unix_timestamp, from_unixtime df = spark.createDataFrame( [("11/25/1991",), ("11/24/1991",), ("11/30/1991",)], ['date_str'] ) df2 = df.select(
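A sketch completing that select, assuming the sample data above where the strings are MM/dd/yyyy (the format string must match the real data):

    from pyspark.sql.functions import unix_timestamp, from_unixtime

    df2 = df.select(
        "date_str",
        from_unixtime(unix_timestamp("date_str", "MM/dd/yyyy"))  # parse with an explicit format
            .cast("date")
            .alias("date"),
    )
    df2.show()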

Using a column value as a parameter to a spark DataFrame function

Submitted by 风流意气都作罢 on 2019-11-25 23:13:10
Question: Consider the following DataFrame:

    #+------+---+
    #|letter|rpt|
    #+------+---+
    #|     X|  3|
    #|     Y|  1|
    #|     Z|  2|
    #+------+---+

which can be created using the following code: df = spark.createDataFrame([("X", 3), ("Y", 1), ("Z", 2)], ["letter", "rpt"]) Suppose I wanted to repeat each row the number of times specified in the column rpt, just like in this question. One way would be to replicate my solution to that question using the following pyspark-sql query: query = """ SELECT * FROM
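One sketch of the same idea without dropping into a SQL string, assuming Spark 2.4+ where array_repeat is available (expr lets the rpt column be used as the repeat count):

    from pyspark.sql.functions import explode, expr

    df = spark.createDataFrame([("X", 3), ("Y", 1), ("Z", 2)], ["letter", "rpt"])
    # Build an array with `rpt` copies of the letter, then explode it into rows
    repeated = df.withColumn("letter", explode(expr("array_repeat(letter, rpt)")))
    repeated.show()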

How to make good reproducible Apache Spark examples

Submitted by 試著忘記壹切 on 2019-11-25 22:14:04
Question: I've been spending a fair amount of time reading through some questions with the pyspark and spark-dataframe tags, and very often I find that posters don't provide enough information to truly understand their question. I usually comment asking them to post an MCVE, but sometimes getting them to show some sample input/output data is like pulling teeth. For example: see the comments on this question. Perhaps part of the problem is that people just don't know how to easily create an MCVE for
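As a sketch of the kind of snippet this question is after, a reproducible Spark example is just small inline data plus the code and the expected output, so anyone can paste and run it:

    # Small inline sample data instead of a reference to a private file or database
    df = spark.createDataFrame(
        [(1, "a"), (2, "b"), (3, "c")],
        ["id", "value"],
    )
    df.show()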