pyspark

How to use an external database (PostgreSQL) as input in a streaming query?

Submitted by ♀尐吖头ヾ on 2019-12-24 06:27:40
Question: I am trying to implement streaming input updates from PostgreSQL. Specifically, I would like to use PostgreSQL as the data source of a streaming input into Spark. Looking at the documentation, I was not sure whether this is possible or not: https://spark.apache.org/docs/latest/streaming-programming-guide.html Would it be possible to stream input from PostgreSQL, perhaps as a micro-batch?

Answer 1: To stream your PostgreSQL data as a micro-batch, Kafka is the best way. You can use Kafka Connect (as a source) to establish …
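
A minimal sketch of the Kafka route the answer describes, assuming a Kafka Connect source (e.g. Debezium) already publishes the PostgreSQL changes to a topic; the broker address and topic name below are placeholders, and the spark-sql-kafka package must be on the classpath:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pg-stream-sketch").getOrCreate()

    # Read the topic that Kafka Connect fills from PostgreSQL as a micro-batch stream
    stream_df = (spark.readStream
                 .format("kafka")
                 .option("kafka.bootstrap.servers", "localhost:9092")   # placeholder broker
                 .option("subscribe", "pg_changes")                     # placeholder topic
                 .load())

    # Kafka delivers key/value as binary; cast to string before parsing further
    query = (stream_df.selectExpr("CAST(value AS STRING) AS payload")
             .writeStream
             .format("console")
             .outputMode("append")
             .start())

    query.awaitTermination()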

How to write a parquet file in partitions in Java, similar to PySpark?

Submitted by 梦想与她 on 2019-12-24 05:58:59
Question: I can write a parquet file into partitions in PySpark like this:

    rdd.write.partitionBy("created_year", "created_month").parquet("hdfs:///my_file")

The parquet file is automatically partitioned by created_year and created_month. How do I do the same in Java? I don't see an option in the ParquetWriter class. Is there another class that can do that? Thanks.

Answer 1: You have to convert your RDD into a DataFrame and then call its write parquet function.

    df = sql_context.createDataFrame(rdd)
    df.write.parquet("hdfs:///my_file")
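
To keep the created_year/created_month layout after the RDD-to-DataFrame conversion, the DataFrame writer's partitionBy can be chained in; a short PySpark sketch (column names taken from the question, sql_context and rdd assumed to exist):

    # Assumes rdd contains Row-like records that include created_year and created_month fields
    df = sql_context.createDataFrame(rdd)

    (df.write
       .partitionBy("created_year", "created_month")   # one sub-directory per (year, month) pair
       .parquet("hdfs:///my_file"))

The Java Dataset API exposes the same chain, i.e. write().partitionBy(...).parquet(...).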

Take top N elements from each group in PySpark RDD (without using groupByKey)

Submitted by 我怕爱的太早我们不能终老 on 2019-12-24 05:21:35
Question: I have an RDD like the following:

    dataSource = sc.parallelize(
        [("user1", (3, "blue")), ("user1", (4, "black")), ("user2", (5, "white")),
         ("user2", (3, "black")), ("user2", (6, "red")), ("user1", (1, "red"))]
    )

I want to use reduceByKey to find the top 2 colors for each user, so the output would be an RDD like:

    sc.parallelize([("user1", ["black", "blue"]), ("user2", ["red", "white"])])

So I need to reduce by key, then sort each key's values, i.e. (number, color), by number and return the top n colors.
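
One way to get the top N per key without groupByKey (and without collecting whole groups) is aggregateByKey, keeping at most two (number, color) pairs per user at every step; a sketch against the question's dataSource (tie-breaking behaviour is an assumption):

    import heapq

    TOP_N = 2

    def keep_top(acc, value):
        # acc holds at most TOP_N (number, color) tuples with the largest numbers seen so far
        return heapq.nlargest(TOP_N, acc + [value], key=lambda t: t[0])

    def merge_tops(acc1, acc2):
        return heapq.nlargest(TOP_N, acc1 + acc2, key=lambda t: t[0])

    top_colors = (dataSource
                  .aggregateByKey([], keep_top, merge_tops)
                  .mapValues(lambda pairs: [color for _, color in pairs]))

    # top_colors.collect() -> [('user1', ['black', 'blue']), ('user2', ['red', 'white'])] (ordering may vary)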

Retrieve data from Azure HDInsight with PySpark

Submitted by 空扰寡人 on 2019-12-24 05:04:57
Question: I have the credentials and the URL for accessing an Azure database. I want to read the data using PySpark, but I don't know how to do it. Is there a specific syntax to connect to an Azure database?

EDIT: After I used the shared code I received this kind of error; any suggestion? I saw that a sample I have on the machine uses the ODBC driver, maybe that is involved?

    2018-07-14 11:22:00 WARN SQLServerConnection:2141 - ConnectionID:1 ClientConnectionId: 7561d3ba-71ac-43b3-a35f-26ababef90cc…
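
Assuming the target is an Azure SQL Database (the SQLServerConnection line in the error suggests the SQL Server JDBC driver is in play), a minimal JDBC read sketch; every server, database, table and credential value below is a placeholder, and the mssql JDBC jar must be supplied to Spark (e.g. via --jars):

    jdbc_url = "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb"

    df = (spark.read
          .format("jdbc")
          .option("url", jdbc_url)
          .option("dbtable", "dbo.my_table")
          .option("user", "my_user")
          .option("password", "my_password")
          .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
          .load())

    df.show(5)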

LEFT and RIGHT functions in PySpark SQL

Submitted by ☆樱花仙子☆ on 2019-12-24 04:46:08
Question: I am new to PySpark. I pulled in a csv file using pandas and created a temp table using the registerTempTable function.

    from pyspark.sql import SQLContext
    from pyspark.sql import Row
    import pandas as pd

    sqlc = SQLContext(sc)
    aa1 = pd.read_csv("D:\mck1.csv")
    aa2 = sqlc.createDataFrame(aa1)
    aa2.show()

    +--------+-------+----------+------------+---------+------------+-------------------+
    |    City|     id|First_Name|Phone_Number| new_date|    new code|           New_date|
    +--------+-------+----------+------------+---------…
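
Older PySpark releases do not expose functions named left and right, but substring covers both cases; a small sketch on the question's aa2 DataFrame (which column to trim, the length 3, and the temp-table name are assumptions):

    from pyspark.sql import functions as F

    trimmed = aa2.select(
        F.substring("City", 1, 3).alias("city_left3"),    # equivalent of LEFT(City, 3)
        F.substring("City", -3, 3).alias("city_right3"),  # equivalent of RIGHT(City, 3)
    )
    trimmed.show()

    # The same works in SQL against the registered temp table (placeholder name):
    # sqlc.sql("SELECT substring(City, 1, 3), substring(City, -3, 3) FROM my_temp_table")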

Hadoop, when run under Spark, merges its stderr into stdout

Submitted by 泄露秘密 on 2019-12-24 04:20:31
Question: When I type

    hadoop fs -text /foo/bar/baz.bz2 2>err 1>out

I get two non-empty files. err contains:

    2015-05-26 15:33:49,786 INFO [main] bzip2.Bzip2Factory (Bzip2Factory.java:isNativeBzip2Loaded(70)) - Successfully loaded & initialized native-bzip2 library system-native
    2015-05-26 15:33:49,789 INFO [main] compress.CodecPool (CodecPool.java:getDecompressor(179)) - Got brand-new decompressor [.bz2]

and out contains the content of the file (as expected). When I call the same command from Python (2.6):

    from …
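
The question's Python snippet is truncated above, but the usual way to keep the two streams apart when shelling out is to give subprocess separate pipes; a sketch (not the asker's original code), compatible with Python 2.6:

    import subprocess

    proc = subprocess.Popen(
        ["hadoop", "fs", "-text", "/foo/bar/baz.bz2"],
        stdout=subprocess.PIPE,   # file contents
        stderr=subprocess.PIPE,   # log4j INFO lines
    )
    out, err = proc.communicate()

    # If the INFO lines still appear in `out`, the JVM's log4j configuration is writing
    # them to stdout, and the fix belongs on the Hadoop/log4j side rather than here.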

PySpark DataFrames: assert isinstance(dataType, DataType), "dataType should be DataType"

Submitted by 拟墨画扇 on 2019-12-24 04:18:17
Question: I want to generate my DataFrame schema dynamically. I get the following error:

    assert isinstance(dataType, DataType), "dataType should be DataType"
    AssertionError: dataType should be DataType

Code:

    filteredSchema = []
    for line in correctSchema:
        fieldName = line.split(',')
        if fieldName[1] == "decimal":
            filteredSchema.append([fieldName[0], "DecimalType()"])
        elif fieldName[1] == "string":
            filteredSchema.append([fieldName[0], "StringType()"])
        elif fieldName[1] == "integer":
            filteredSchema.append…
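
The assertion fires because the schema ends up holding the strings "DecimalType()" etc. rather than DataType instances. A sketch of the usual fix, mapping type names to real objects (the "name,type" format of correctSchema is assumed from the snippet; rdd and sqlContext are assumed to exist):

    from pyspark.sql.types import (DecimalType, IntegerType, StringType,
                                   StructField, StructType)

    TYPE_MAP = {
        "decimal": DecimalType(),
        "string": StringType(),
        "integer": IntegerType(),
    }

    fields = []
    for line in correctSchema:                    # each line assumed to look like "name,type"
        name, type_name = line.split(',')[:2]
        fields.append(StructField(name, TYPE_MAP[type_name], True))

    schema = StructType(fields)
    df = sqlContext.createDataFrame(rdd, schema)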

Add an external Python library in PySpark

Submitted by 强颜欢笑 on 2019-12-24 03:39:27
Question: I'm using PySpark (1.6) and I want to use the databricks:spark-csv library. I've tried different ways to do this, with no success.

1 - I tried to add a jar I downloaded from https://spark-packages.org/package/databricks/spark-csv and ran:

    pyspark --jars THE_NAME_OF_THE_JAR
    df = sqlContext.read.format('com.databricks:spark-csv').options(header='true', inferschema='true').load('/dlk/doaat/nsi_dev/utilisateur/referentiel/refecart.csv')

But I got this error:

    Traceback (most recent call last): File "…
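
Two details usually trip this up: the package is normally pulled in with --packages (Maven coordinates) rather than a hand-downloaded jar, and the reader format name uses dots, 'com.databricks.spark.csv', not 'com.databricks:spark-csv'. A sketch for Spark 1.6 (the exact spark-csv version is an assumption; any release built for the matching Scala version should do):

    # Launch the shell with the package instead of --jars:
    #   pyspark --packages com.databricks:spark-csv_2.10:1.5.0

    df = (sqlContext.read
          .format('com.databricks.spark.csv')   # dots, not a colon
          .options(header='true', inferschema='true')
          .load('/dlk/doaat/nsi_dev/utilisateur/referentiel/refecart.csv'))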

Cross-combine two RDDs using PySpark

Submitted by 感情迁移 on 2019-12-24 03:24:50
Question: How can I cross-combine (is this the correct way to describe it?) two RDDs?

Input:

    rdd1 = [a, b]
    rdd2 = [c, d]

Output:

    rdd3 = [(a, c), (a, d), (b, c), (b, d)]

I tried

    rdd3 = rdd1.flatMap(lambda x: rdd2.map(lambda y: (x, y)))

and it complains: "It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation." I guess that means you cannot nest actions as in a list comprehension, and one statement can only do one action.

Answer 1: So as you have …
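
The error appears because one RDD is being referenced inside another RDD's transformation, which Spark does not allow; the built-in cartesian transformation produces the cross product directly. A sketch with concrete strings standing in for the question's a/b/c/d:

    rdd1 = sc.parallelize(["a", "b"])
    rdd2 = sc.parallelize(["c", "d"])

    rdd3 = rdd1.cartesian(rdd2)   # every pairing of an rdd1 element with an rdd2 element

    print(rdd3.collect())
    # e.g. [('a', 'c'), ('a', 'd'), ('b', 'c'), ('b', 'd')] (ordering may vary)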

Apache Spark selects all rows

Submitted by 大憨熊 on 2019-12-24 03:07:56
Question: When I use a JDBC connection to feed Spark, even though I filter the DataFrame, when I inspect the query log on my Oracle data source I see Spark executing:

    SELECT [column_names] FROM MY_TABLE

Referring to https://stackoverflow.com/a/40870714/1941560, I was expecting Spark to lazily plan the query and execute something like:

    SELECT [column_names] FROM MY_TABLE WHERE [filter_predicate]

But Spark is not doing that. It takes all the data and filters it afterwards. I need this behaviour because I don't want to …
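
One way to guarantee the WHERE clause reaches Oracle is to push the whole query down yourself by passing a subquery as the dbtable option; a sketch with placeholder connection details, columns, and predicate:

    pushed_query = "(SELECT column_a, column_b FROM MY_TABLE WHERE column_a > 100) pushed"

    df = (spark.read
          .format("jdbc")
          .option("url", "jdbc:oracle:thin:@//dbhost:1521/service_name")
          .option("dbtable", pushed_query)   # the inner WHERE runs on the Oracle side
          .option("user", "my_user")
          .option("password", "my_password")
          .option("driver", "oracle.jdbc.OracleDriver")
          .load())

    # With a plain table name, df.filter(...) is normally pushed down too, provided the filter
    # is applied before any action and uses columns/operators the JDBC source can translate.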