pyspark-sql

What's the difference between --archives, --files, py-files in pyspark job arguments

房东的猫 submitted on 2019-11-27 16:33:46
Question: --archives, --files, --py-files, sc.addFile, and sc.addPyFile are quite confusing; can someone explain these clearly?

Answer 1: These options are truly scattered all over the place. In general, add your data files via --files or --archives and your code files via --py-files. The latter will be added to the classpath (cf. here) so you can import and use them. As you can imagine, the CLI arguments are actually handled by the addFile and addPyFile functions (cf. here). From http://spark.apache.org
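
For illustration, a minimal sketch of the two routes the answer describes; the file names here (lookup.csv, deps.zip, my_job.py) are hypothetical:

    # Equivalent at submit time:
    #   spark-submit --files lookup.csv --py-files deps.zip my_job.py
    from pyspark import SparkFiles
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("deps-example").getOrCreate()
    sc = spark.sparkContext

    # Ship a data file to every executor; resolve its local path with SparkFiles.get()
    sc.addFile("lookup.csv")
    local_path = SparkFiles.get("lookup.csv")

    # Ship Python code (a .py, .zip, or .egg) so it can be imported on the executors
    sc.addPyFile("deps.zip")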

How to use a subquery for dbtable option in jdbc data source?

风流意气都作罢 submitted on 2019-11-27 16:31:02
Question: I want to use Spark to process some data from a JDBC source. But to begin with, instead of reading original tables from JDBC, I want to run some queries on the JDBC side to filter columns and join tables, and load the query result as a table in Spark SQL. The following syntax to load a raw JDBC table works for me:

    df_table1 = sqlContext.read.format('jdbc').options(
        url="jdbc:mysql://foo.com:3306",
        dbtable="mydb.table1",
        user="me",
        password="******",
        driver="com.mysql.jdbc.Driver"  # mysql JDBC
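
The usual technique (sketched below with the same connection options; the subquery itself is hypothetical) is to pass a parenthesized query with an alias as the dbtable option, so the database runs it and Spark loads only the result:

    subquery = """
        (SELECT t1.id, t1.col_a, t2.col_b
         FROM mydb.table1 t1
         JOIN mydb.table2 t2 ON t1.id = t2.id
         WHERE t2.col_b > 0) AS filtered
    """
    df = sqlContext.read.format('jdbc').options(
        url="jdbc:mysql://foo.com:3306",
        dbtable=subquery,
        user="me",
        password="******",
        driver="com.mysql.jdbc.Driver"
    ).load()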

GroupByKey and create lists of values pyspark sql dataframe

冷暖自知 submitted on 2019-11-27 16:10:52
So I have a spark dataframe that looks like:

    a | b | c
    5 | 2 | 1
    5 | 4 | 3
    2 | 4 | 2
    2 | 3 | 7

And I want to group by column a, create a list of values from column b, and forget about c. The output dataframe would be:

    a | b_list
    5 | (2,4)
    2 | (4,3)

How would I go about doing this with a pyspark sql dataframe? Thank you! :)

Here are the steps to get that DataFrame.

    >>> from pyspark.sql import functions as F
    >>>
    >>> d = [{'a': 5, 'b': 2, 'c': 1}, {'a': 5, 'b': 4, 'c': 3}, {'a': 2, 'b': 4, 'c': 2}, {'a': 2, 'b': 3, 'c': 7}]
    >>> df = spark.createDataFrame(d)
    >>> df.show()
    +---+---+---+
    |  a|  b|  c|
    +--
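
Continuing from the df built above, the grouping step is typically done with collect_list (a minimal sketch; note that collect_list does not guarantee element order):

    >>> grouped = df.groupBy('a').agg(F.collect_list('b').alias('b_list'))
    >>> grouped.show()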

Cannot find col function in pyspark

不打扰是莪最后的温柔 submitted on 2019-11-27 11:56:05
In pyspark 1.6.2, I can import the col function with from pyspark.sql.functions import col, but when I try to look it up in the GitHub source code I find no col function in the functions.py file. How can Python import a function that doesn't exist?

It exists. It just isn't explicitly defined. Functions exported from pyspark.sql.functions are thin wrappers around JVM code and, with a few exceptions which require special treatment, are generated automatically using helper methods. If you carefully check the source you'll find col listed among other _functions. This dictionary is further iterated and
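
Roughly, the generation pattern looks like the simplified sketch below; it mirrors the mechanism described above rather than reproducing the actual pyspark source:

    # Simplified sketch: function names are defined dynamically from a dict of
    # name -> docstring, then injected into the module's globals().
    _functions = {
        'col': 'Returns a Column based on the given column name.',
        'lit': 'Creates a Column of literal value.',
    }

    def _create_function(name, doc=""):
        def _(col_name):
            # The real wrapper forwards the call to the JVM; this stub only
            # illustrates the dynamic definition.
            raise NotImplementedError("illustration only")
        _.__name__ = name
        _.__doc__ = doc
        return _

    for _name, _doc in _functions.items():
        globals()[_name] = _create_function(_name, _doc)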

Spark SQL security considerations

感情迁移 submitted on 2019-11-27 08:30:28
Question: What are the security considerations when accepting and executing arbitrary Spark SQL queries? Imagine the following setup. Two files on HDFS are registered as the tables a_secrets and b_secrets:

    # must only be accessed by clients with access to all of customer a's data
    spark.read.csv("/customer_a/secrets.csv").createTempView("a_secrets")

    # must only be accessed by clients with access to all of customer b's data
    spark.read.csv("/customer_b/secrets.csv").createTempView("b_secrets")

These two views
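
To make the exposure concrete, a small sketch of why view-level separation alone is not enough (the query string stands in for untrusted user input):

    # Every temp view registered in a SparkSession is reachable from any SQL
    # string executed in that session, so a client that should only see
    # a_secrets can simply ask for the other view.
    untrusted_query = "SELECT * FROM b_secrets"  # hypothetical attacker input
    spark.sql(untrusted_query).show()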

How to skip lines while reading a CSV file as a dataFrame using PySpark?

我只是一个虾纸丫 submitted on 2019-11-27 07:47:25
Question: I have a CSV file that is structured this way:

    Header
    Blank Row
    "Col1","Col2"
    "1,200","1,456"
    "2,000","3,450"

I have two problems in reading this file: I want to ignore the header and the blank row, and the commas within the values are not separators. Here is what I tried:

    df = (sc.textFile("myFile.csv")
          .map(lambda line: line.split(","))               # split by comma
          .filter(lambda line: len(line) == 2).collect())  # this helped me ignore the first two rows

However, this did not work, because the commas
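
One commonly suggested direction (a sketch, not necessarily the accepted answer): drop the first two physical lines by index, then let the DataFrame CSV reader handle the quoted commas and the real header row. This assumes Spark 2.2+, where spark.read.csv accepts an RDD of strings:

    # Skip the junk header and the blank row by line index
    rdd = sc.textFile("myFile.csv") \
            .zipWithIndex() \
            .filter(lambda pair: pair[1] > 1) \
            .map(lambda pair: pair[0])

    # The CSV reader respects the quotes, so "1,200" stays a single value
    df = spark.read.csv(rdd, header=True, quote='"')
    df.show()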

Caching ordered Spark DataFrame creates unwanted job

旧城冷巷雨未停 submitted on 2019-11-27 07:39:15
Question: I want to convert an RDD to a DataFrame and want to cache the results of the RDD:

    from pyspark.sql import *
    from pyspark.sql.types import *
    import pyspark.sql.functions as fn

    schema = StructType([StructField('t', DoubleType()), StructField('value', DoubleType())])

    df = spark.createDataFrame(
        sc.parallelize([Row(t=float(i/10), value=float(i*i)) for i in range(1000)], 4),  #.cache(),
        schema=schema,
        verifySchema=False
    ).orderBy("t")  #.cache()

If you don't use a cache function, no job is generated.
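
For reference, a sketch of the two cache placements the commented-out calls hint at, using the same schema as above; whether either avoids the extra job triggered by caching the ordered DataFrame is exactly what the question asks:

    # Variant 1: cache the source RDD before building the ordered DataFrame
    rdd = sc.parallelize([Row(t=float(i/10), value=float(i*i)) for i in range(1000)], 4).cache()
    df1 = spark.createDataFrame(rdd, schema=schema, verifySchema=False).orderBy("t")

    # Variant 2: cache the ordered DataFrame itself
    df2 = spark.createDataFrame(rdd, schema=schema, verifySchema=False).orderBy("t").cache()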

Spark Structured Streaming using sockets, set SCHEMA, Display DATAFRAME in console

左心房为你撑大大i submitted on 2019-11-27 07:33:54
Question: How can I set a schema for a streaming DataFrame in PySpark?

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode
    from pyspark.sql.functions import split
    # Import data types
    from pyspark.sql.types import *

    spark = SparkSession\
        .builder\
        .appName("StructuredNetworkWordCount")\
        .getOrCreate()

    # Create DataFrame representing the stream of input lines from connection to localhost:5560
    lines = spark\
        .readStream\
        .format('socket')\
        .option('host', '192.168.0.113')\
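
Worth noting for context: the socket source always yields a single string column named value, so a schema is normally imposed afterwards by splitting and casting. A minimal sketch, assuming lines has been completed with a port and .load(), and assuming comma-separated "name,age" input:

    from pyspark.sql.functions import split, col

    parsed = lines.select(
        split(col("value"), ",").getItem(0).alias("name"),
        split(col("value"), ",").getItem(1).cast("int").alias("age"),
    )

    query = parsed.writeStream.outputMode("append").format("console").start()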

PySpark error: AttributeError: 'NoneType' object has no attribute '_jvm'

谁说胖子不能爱 submitted on 2019-11-27 06:46:45
Question: I have a timestamp dataset in the format shown below, and I have written a udf in pyspark to process it and return a Map of key-value pairs, but I am getting the error message below.

Dataset: df_ts_list

    +--------------------+
    |             ts_list|
    +--------------------+
    |[1477411200, 1477...|
    |[1477238400, 1477...|
    |[1477022400, 1477...|
    |[1477224000, 1477...|
    |[1477256400, 1477...|
    |[1477346400, 1476...|
    |[1476986400, 1477...|
    |[1477321200, 1477...|
    |[1477306800, 1477...|
    |[1477062000, 1477...|
    |[1477249200,
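
Since the udf body itself is cut off above, the sketch below only illustrates the usual cause of this error: calling pyspark.sql.functions (which need the driver-side JVM) inside a udf instead of plain Python:

    from pyspark.sql.functions import udf, col
    from pyspark.sql.types import MapType, StringType, LongType

    # Plain Python builtins inside the udf are fine; calling F.min/F.max or other
    # pyspark.sql.functions here is what typically raises
    # "'NoneType' object has no attribute '_jvm'" on the executors.
    # (Hypothetical udf body, for illustration only.)
    @udf(MapType(StringType(), LongType()))
    def ts_stats(ts_list):
        return {"first": min(ts_list), "last": max(ts_list)}

    result = df_ts_list.withColumn("stats", ts_stats(col("ts_list")))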

Apply a transformation to multiple columns pyspark dataframe

六眼飞鱼酱① submitted on 2019-11-27 06:19:29
Question: Suppose I have the following spark-dataframe:

    +-----+-------+
    | word|  label|
    +-----+-------+
    |  red|  color|
    |  red|  color|
    | blue|  color|
    | blue|feeling|
    |happy|feeling|
    +-----+-------+

Which can be created using the following code:

    sample_df = spark.createDataFrame([
        ('red', 'color'),
        ('red', 'color'),
        ('blue', 'color'),
        ('blue', 'feeling'),
        ('happy', 'feeling')
    ], ('word', 'label')
    )

I can perform a groupBy() to get the counts of each word-label pair:

    sample_df = sample_df.groupBy('word',
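
For the "multiple columns" part of the title, a common pattern is to build the expressions in a comprehension and apply them in one select; a sketch with an assumed upper-casing transformation (not the accepted answer to this question):

    from pyspark.sql import functions as F

    cols_to_transform = ['word', 'label']

    # Apply the same expression to every listed column in a single select
    transformed = sample_df.select(
        *[F.upper(F.col(c)).alias(c) for c in cols_to_transform]
    )
    transformed.show()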