pyspark

Spark 2.1 Structured Streaming - Using Kafka as source with Python (pyspark)

不打扰是莪最后的温柔 Submitted on 2020-01-04 08:21:49
Question: With Apache Spark version 2.1, I would like to use Kafka (0.10.0.2.5) as a source for Structured Streaming with pyspark. kafka_app.py:

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("TestKakfa").getOrCreate()
    kafka = spark.readStream.format("kafka") \
        .option("kafka.bootstrap.servers", "localhost:6667") \
        .option("subscribe", "mytopic").load()

I launched the app in the following way:

    ./bin/spark-submit kafka_app.py --master local[4] --jars spark-streaming-kafka-0-10
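One known pitfall (hedged, since the question is cut off): for Structured Streaming the Kafka source comes from the spark-sql-kafka-0-10 package rather than spark-streaming-kafka-0-10, and spark-submit options must appear before the script name. A minimal sketch, assuming Spark 2.1 built against Scala 2.11 and the broker/topic from the question:

    # ./bin/spark-submit --master local[4] \
    #     --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0 kafka_app.py
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("TestKafka").getOrCreate()

    kafka = (spark.readStream.format("kafka")
             .option("kafka.bootstrap.servers", "localhost:6667")  # broker from the question
             .option("subscribe", "mytopic")
             .load())

    # Kafka delivers key/value as binary; cast before writing to a sink.
    query = (kafka.selectExpr("CAST(value AS STRING) AS value")
             .writeStream
             .format("console")
             .start())
    query.awaitTermination()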

pyspark: aggregate on the most frequent value in a column

时光怂恿深爱的人放手 Submitted on 2020-01-04 06:20:14
Question:

    aggregrated_table = df_input.groupBy('city', 'income_bracket') \
        .agg(
            count('suburb').alias('suburb'),
            sum('population').alias('population'),
            sum('gross_income').alias('gross_income'),
            sum('no_households').alias('no_households'))

I would like to group by city and income bracket, but within each city certain suburbs have different income brackets. How do I group by the most frequently occurring income bracket per city? For example:

    city1 suburb1 income_bracket_10
    city1 suburb1 income_bracket_10
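One common approach is to count the (city, income_bracket) pairs, keep the top-ranked bracket per city with a window function, and join that back before aggregating. A sketch under that assumption, reusing the column names from the question:

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # Most frequent income_bracket per city.
    w = Window.partitionBy('city').orderBy(F.desc('cnt'))
    mode_bracket = (df_input.groupBy('city', 'income_bracket')
                    .agg(F.count('*').alias('cnt'))
                    .withColumn('rn', F.row_number().over(w))
                    .filter(F.col('rn') == 1)
                    .select('city', 'income_bracket'))

    # Aggregate only the rows that fall into each city's modal bracket.
    aggregated_table = (df_input.join(mode_bracket, ['city', 'income_bracket'])
                        .groupBy('city', 'income_bracket')
                        .agg(F.count('suburb').alias('suburb'),
                             F.sum('population').alias('population'),
                             F.sum('gross_income').alias('gross_income'),
                             F.sum('no_households').alias('no_households')))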

Spark 2.3 AsyncEventQueue Error and Warning

不打扰是莪最后的温柔 Submitted on 2020-01-04 05:48:08
Question: I'm running memory-intensive code where I've created a pipeline which consists of:

    1. Finding the best number of bins using Shimazaki and Shinomoto's bin-width algorithm.
    2. Creating a new column by bucketizing the same column with the respective bin values found above.
    3. Calculating a Weight of Evidence via 8 sequential SQL queries.

Config:

    Python - 3.6
    Spark - 2.3
    Environment - standalone machine (16 GB RAM and 500 GB HDD with an i7 processor)
    IDE - PyCharm

My doubt is, it is working as
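The AsyncEventQueue messages usually mean the listener bus is dropping events because its queue fills up during a long chain of jobs. A hedged mitigation sketch: spark.scheduler.listenerbus.eventqueue.capacity is a real Spark 2.3 setting (default 10000); the value 20000 and the app name here are only illustrative:

    from pyspark.sql import SparkSession

    # A larger event queue plus less retained UI state tends to silence the
    # "Dropping event from queue" warnings on a single machine.
    spark = (SparkSession.builder
             .appName("woe_pipeline")
             .master("local[*]")
             .config("spark.scheduler.listenerbus.eventqueue.capacity", "20000")
             .config("spark.ui.retainedJobs", "100")
             .config("spark.ui.retainedStages", "100")
             .getOrCreate())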

Count occurrences of a list of substrings in a pyspark df column

若如初见. Submitted on 2020-01-04 05:33:07
Question: I want to count the occurrences of a list of substrings and create a column, based on a column in the pyspark df which contains a long string.

Input:

    ID  History
    1   USA|UK|IND|DEN|MAL|SWE|AUS
    2   USA|UK|PAK|NOR
    3   NOR|NZE
    4   IND|PAK|NOR

    lst=['USA','IND','DEN']

Output:

    ID  History                     Count
    1   USA|UK|IND|DEN|MAL|SWE|AUS  3
    2   USA|UK|PAK|NOR              1
    3   NOR|NZE                     0
    4   IND|PAK|NOR                 1

Answer 1:

    # Importing requisite packages and creating a DataFrame
    from pyspark.sql.functions import split, col, size, regexp_replace
    values = [(1,
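The quoted answer is cut off, so here is one self-contained way to get the same Count column: split History on "|" and add 1 for every lookup value the resulting array contains. A sketch that recreates the sample data from the question:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, 'USA|UK|IND|DEN|MAL|SWE|AUS'),
         (2, 'USA|UK|PAK|NOR'),
         (3, 'NOR|NZE'),
         (4, 'IND|PAK|NOR')],
        ['ID', 'History'])
    lst = ['USA', 'IND', 'DEN']

    # split() takes a regex, so the pipe must be escaped.
    hist = F.split(F.col('History'), r'\|')
    count_col = sum(F.when(F.array_contains(hist, x), 1).otherwise(0) for x in lst)

    df.withColumn('Count', count_col).show(truncate=False)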

PySpark job fails when loading multiple files and one is missing [duplicate]

眉间皱痕 Submitted on 2020-01-04 05:26:09
Question: This question already has an answer here: Pyspark Invalid Input Exception try except error (1 answer). Closed 10 months ago.

When using PySpark to load multiple JSON files from S3, I get an error and the Spark job fails if a file is missing:

    Caused by: org.apache.hadoop.mapred.InvalidInputException: Input Pattern s3n://example/example/2017-02-18/*.json matches 0 files

This is how I add the 5 last days to my job with PySpark:

    days = 5
    x = 0
    files = []
    while x < days:
        filedate = (date.today() -
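In line with the linked duplicate, the usual workaround is to check each day's glob pattern before handing the list to Spark. A hedged sketch that reuses the bucket layout from the error message and probes paths through Hadoop's FileSystem via the py4j gateway (sc._jvm and sc._jsc are internal but commonly used for this):

    from datetime import date, timedelta

    days = 5
    candidates = ['s3n://example/example/{}/*.json'.format(date.today() - timedelta(days=x + 1))
                  for x in range(days)]

    sc = spark.sparkContext
    hadoop_conf = sc._jsc.hadoopConfiguration()
    Path = sc._jvm.org.apache.hadoop.fs.Path

    # Keep only patterns that match at least one file, so a missing day cannot kill the job.
    existing = [p for p in candidates
                if Path(p).getFileSystem(hadoop_conf).globStatus(Path(p))]

    if existing:
        df = spark.read.json(existing)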

How to add suffix and prefix to all columns in python/pyspark dataframe

匆匆过客 Submitted on 2020-01-04 05:14:49
Question: I have a data frame in pyspark with more than 100 columns. What I want to do is, for all the column names, add back ticks (`) at the start and at the end of the column name. For example: the column name is testing user. I want `testing user`. Is there a method to do this in pyspark/python? When we apply the code it should return a data frame.

Answer 1: You can use the withColumnRenamed method of the dataframe in combination with na to create a new dataframe

    df.na.withColumnRenamed('testing
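The quoted answer is truncated, and renaming 100+ columns one by one with withColumnRenamed gets verbose. A simpler sketch, assuming df is the existing dataframe: rename every column in one pass with toDF (an alias-based select works too), which returns a new dataframe:

    # Wrap every column name in backticks and return a new dataframe.
    df_backticked = df.toDF(*['`{}`'.format(c) for c in df.columns])

    # Alias-based alternative:
    # from pyspark.sql import functions as F
    # df_backticked = df.select([df[c].alias('`{}`'.format(c)) for c in df.columns])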

how to load a word2vec model and call its function into the mapper

半城伤御伤魂 Submitted on 2020-01-04 03:54:45
Question: I want to load a word2vec model and evaluate it by executing word analogy tasks (e.g. a is to b as c is to something?). To do this, first I load my w2v model:

    model = Word2VecModel.load(spark.sparkContext, str(sys.argv[1]))

and then I call the mapper to evaluate the model:

    rdd_lines = spark.read.text("questions-words.txt").rdd.map(getAnswers)

The getAnswers function reads one line at a time from questions-words.txt, in which each line contains the question and the answer to evaluate my model
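The model object wraps the SparkContext, so it cannot be used inside map. A common workaround, sketched here with a simplified stand-in for getAnswers: extract the vectors with getVectors(), broadcast them as a plain dict, and do the analogy arithmetic with numpy on the executors. The get_answers function and the accuracy calculation are illustrative, and the sketch assumes every question word is in the vocabulary:

    import sys
    import numpy as np
    from pyspark.mllib.feature import Word2VecModel

    model = Word2VecModel.load(spark.sparkContext, str(sys.argv[1]))

    # Materialise the vectors as a plain dict and broadcast it; the model itself
    # cannot be shipped to executors because it holds a reference to the JVM context.
    jmap = model.getVectors()
    vectors = {w: np.array(list(jmap[w]), dtype=float) for w in jmap}
    bc_vectors = spark.sparkContext.broadcast(vectors)

    def get_answers(row):
        a, b, c, expected = row.value.split()       # one analogy question per line
        vecs = bc_vectors.value
        target = vecs[b] - vecs[a] + vecs[c]
        # Nearest word by cosine similarity, skipping the three query words.
        best = max((w for w in vecs if w not in (a, b, c)),
                   key=lambda w: float(np.dot(vecs[w], target)) /
                                 (np.linalg.norm(vecs[w]) * np.linalg.norm(target) + 1e-9))
        return 1.0 if best == expected else 0.0

    accuracy = spark.read.text("questions-words.txt").rdd.map(get_answers).mean()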

KeyError: SPARK_HOME during SparkConf initialization

无人久伴 Submitted on 2020-01-04 02:32:06
Question: I am a Spark newbie and I want to run a Python script from the command line. I have tested pyspark interactively and it works. I get this error when trying to create the sc:

    File "test.py", line 10, in <module>
        conf=(SparkConf().setMaster('local').setAppName('a').setSparkHome('/home/dirk/spark-1.4.1-bin-hadoop2.6/bin'))
    File "/home/dirk/spark-1.4.1-bin-hadoop2.6/python/pyspark/conf.py", line 104, in __init__
        SparkContext._ensure_initialized()
    File "/home/dirk/spark-1.4.1-bin-hadoop2.6/python
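SparkConf needs a running JVM gateway, and the launcher reads the SPARK_HOME environment variable; setSparkHome() on the conf does not set it, and it should point at the Spark root rather than bin/. A minimal sketch assuming the install path from the traceback (alternatively, export SPARK_HOME in the shell before running the script):

    import os

    # Must be set before pyspark launches the JVM gateway.
    os.environ.setdefault("SPARK_HOME", "/home/dirk/spark-1.4.1-bin-hadoop2.6")

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster("local").setAppName("a")
    sc = SparkContext(conf=conf)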

Compare rows of two dataframes to find the matching column count of 1's

自闭症网瘾萝莉.ら Submitted on 2020-01-04 02:32:04
Question: I have 2 dataframes with the same schema. I need to compare the rows of the dataframes and keep a count of rows with at least one column with value 1 in both dataframes.

Right now I am making a list of the rows and then comparing the 2 lists to find whether even one value is equal in both lists and equal to 1:

    rowOgList = []
    for row in cat_og_df.rdd.toLocalIterator():
        rowOgDict = {}
        for cat in categories:
            rowOgDict[cat] = row[cat]
        rowOgList.append(rowOgDict)
    #print(rowOgList[0])

    rowPredList = []
    for
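Pulling both dataframes to the driver with toLocalIterator defeats the point of Spark. A hedged alternative sketch: attach a positional id to each frame with zipWithIndex, join on it, and count rows where at least one category column is 1 on both sides. It assumes the second dataframe is named cat_pred_df and that both frames have the same row order:

    from pyspark.sql import Row, functions as F

    def with_row_id(df):
        # Attach a positional id so the two frames can be joined row-by-row.
        return (df.rdd.zipWithIndex()
                .map(lambda pair: Row(row_id=pair[1], **pair[0].asDict()))
                .toDF())

    og = with_row_id(cat_og_df).select('row_id', *[F.col(c).alias('og_' + c) for c in categories])
    pred = with_row_id(cat_pred_df).select('row_id', *[F.col(c).alias('pred_' + c) for c in categories])

    # A row counts as a match when some category is 1 in both dataframes.
    match_flag = sum(((F.col('og_' + c) == 1) & (F.col('pred_' + c) == 1)).cast('int')
                     for c in categories) > 0

    matching_rows = og.join(pred, 'row_id').filter(match_flag).count()
    print(matching_rows)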