pyspark

pySpark (v2.4) DataFrameReader adds leading whitespace to column names

这一生的挚爱 Submitted on 2019-12-25 00:29:31
Question: Here is a snippet of a CSV file that I have:

    "Index", "Living Space (sq ft)", "Beds", "Baths", "Zip", "Year", "List Price ($)"
    1, 2222, 3, 3.5, 32312, 1981, 250000
    2, 1628, 3, 2, 32308, 2009, 185000
    3, 3824, 5, 4, 32312, 1954, 399000
    4, 1137, 3, 2, 32309, 1993, 150000
    5, 3560, 6, 4, 32309, 1973, 315000

Oddly, when I perform the following pySpark (v2.4) statements, the header column names (minus the first column) have leading whitespaces. I've tried different quote and escape options, but to no avail.
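Not an official answer, but a minimal sketch of one way to work around this, assuming the file is read with header=True. The whitespace options are real CSV reader options, though whether they also trim the header names can depend on the Spark version, so the explicit rename at the end is the reliable fallback (the file name houses.csv is a placeholder):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Try the CSV reader's whitespace options first; they trim values and,
    # depending on the version, may or may not trim the header names.
    df = spark.read.csv("houses.csv",
                        header=True,
                        ignoreLeadingWhiteSpace=True,
                        ignoreTrailingWhiteSpace=True)

    # Reliable fallback: strip whitespace from the column names explicitly.
    df = df.toDF(*[c.strip() for c in df.columns])
    df.printSchema()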

“'DataFrame' object has no attribute 'apply'” when trying to apply lambda to create new column

爱⌒轻易说出口 Submitted on 2019-12-25 00:09:03
Question: I aim to add a new column to a Pandas DataFrame, but I am facing a weird error. The new column is expected to be a transformation of an existing column, which can be done with a lookup in a dictionary/hashmap.

    # Loading data
    df = sqlContext.read.format(...).load(train_df_path)

    # Instantiating the map
    some_map = {
        'a': 0,
        'b': 1,
        'c': 1,
    }

    # Creating a new column using the map
    df['new_column'] = df.apply(lambda row: some_map(row.some_column_name), axis=1)

Which leads to the following error.
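Since df comes from sqlContext.read, it is a Spark DataFrame, not a Pandas one, and has no apply method. A minimal Spark-native sketch of the same lookup, reusing the some_map and column names from the question (one common approach, not necessarily the accepted answer):

    from itertools import chain
    from pyspark.sql.functions import create_map, lit, col

    some_map = {'a': 0, 'b': 1, 'c': 1}

    # Build a literal map column from the Python dict (alternating keys/values),
    # then look up the value of some_column_name in it.
    mapping = create_map(*[lit(x) for x in chain(*some_map.items())])
    df = df.withColumn("new_column", mapping.getItem(col("some_column_name")))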

Unable to see messages from Kafka Stream in Spark

落爺英雄遲暮 Submitted on 2019-12-25 00:08:24
Question: I have just started testing streaming from Kafka to Spark using the PySpark library. I have been running the whole setup in a Jupyter Notebook. I am trying to get data from the Twitter streaming API.

Twitter streaming code:

    import json
    import tweepy
    from uuid import uuid4
    import time
    from kafka import KafkaConsumer
    from kafka import KafkaProducer

    auth = tweepy.OAuthHandler("key", "key")
    auth.set_access_token("token", "token")
    api = tweepy.API(auth, wait_on_rate_limit=True, retry_count=3, retry_delay=5,
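For reference, a minimal sketch of the Spark side that would consume such a topic with Structured Streaming; the broker address and topic name are assumptions, and the spark-sql-kafka-0-10 package must be available to the session:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-twitter-test").getOrCreate()

    # Subscribe to the Kafka topic the producer writes to (topic name assumed).
    tweets = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "localhost:9092")
              .option("subscribe", "twitter")
              .option("startingOffsets", "latest")
              .load())

    # Kafka delivers bytes; cast the value to a string and print it to the console.
    query = (tweets.selectExpr("CAST(value AS STRING) AS json")
             .writeStream
             .format("console")
             .start())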

Is there a way to join databases from two different data sources (i.e. MySQL and PostgreSQL) using Logstash and index them into Elasticsearch?

拜拜、爱过 Submitted on 2019-12-24 23:24:24
Question: I am very new to the ELK stack and want to know whether there is a way to join two databases from different sources (i.e. MySQL and Postgres) and index the result into a single Elasticsearch index using Logstash. I am able to achieve this with PySpark, but I want to do it with Logstash if possible. Please also suggest other feasible ways to achieve this apart from Spark and Logstash. Thanks in advance!

Answer 1: You can definitely achieve this by sourcing data
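For comparison, a rough sketch of the PySpark route the question alludes to: join one MySQL table and one Postgres table over JDBC and write the result to Elasticsearch with the elasticsearch-hadoop connector. Every connection string, table name, join key, and index name below is an assumption, and the JDBC drivers plus the ES connector must be on the classpath:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("join-mysql-postgres").getOrCreate()

    mysql_df = (spark.read.format("jdbc")
                .option("url", "jdbc:mysql://mysql-host:3306/db1")
                .option("dbtable", "orders")
                .option("user", "user").option("password", "pass")
                .load())

    pg_df = (spark.read.format("jdbc")
             .option("url", "jdbc:postgresql://pg-host:5432/db2")
             .option("dbtable", "customers")
             .option("user", "user").option("password", "pass")
             .load())

    joined = mysql_df.join(pg_df, on="customer_id", how="inner")

    (joined.write
     .format("org.elasticsearch.spark.sql")
     .option("es.nodes", "localhost")
     .option("es.port", "9200")
     .mode("append")
     .save("joined_index"))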

Accessing elements of an array of Row objects and concatenating them - pySpark

大憨熊 Submitted on 2019-12-24 23:02:29
Question: I have a pyspark.sql.dataframe.DataFrame, where one of the columns holds an array of Row objects:

    +----------------------------------------------------------------+
    |column                                                          |
    +----------------------------------------------------------------+
    |[Row(arrival='2019-12-25 19:55', departure='2019-12-25 18:22'), |
    | Row(arrival='2019-12-26 14:56', departure='2019-12-26 08:52')] |
    +----------------------------------------------------------------+
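One possible approach, sketched with the Spark 2.4 higher-order functions transform and array_join; the "departure -> arrival" output format and the column name are assumptions based on the sample above:

    from pyspark.sql.functions import expr

    # For each struct in the array build a "departure -> arrival" string,
    # then join the resulting array of strings with "; ".
    df2 = df.withColumn(
        "trips",
        expr("array_join(transform(`column`, x -> concat(x.departure, ' -> ', x.arrival)), '; ')")
    )
    df2.select("trips").show(truncate=False)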

Dynamically setting schema for spark.createDataFrame

核能气质少年 Submitted on 2019-12-24 22:34:33
Question: I am trying to set the type of each field in the schema dynamically. I have seen the following code on Stack Overflow:

    schema = StructType([StructField(header[i], StringType(), True) for i in range(len(header))])

But how can I turn this into a conditional statement? For example: if the header is in list1 then IntegerType, if it is in list2 then DoubleType, else StringType?

Answer 1: A colleague answered this for me:

    schema = StructType([
        StructField(header[i], DateType(), True) if header[i] in dateFields else
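A minimal sketch of the conditional pattern the question asks about, using the list1/list2 example from the question (the header and list contents below are placeholders):

    from pyspark.sql.types import (StructType, StructField,
                                   IntegerType, DoubleType, StringType)

    header = ["id", "price", "name"]   # placeholder column names
    list1 = ["id"]                     # columns that should be integers
    list2 = ["price"]                  # columns that should be doubles

    def field_type(name):
        if name in list1:
            return IntegerType()
        if name in list2:
            return DoubleType()
        return StringType()

    schema = StructType([StructField(h, field_type(h), True) for h in header])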

Window Functions partitionBy over a list

╄→尐↘猪︶ㄣ Submitted on 2019-12-24 22:25:03
Question: I have a dataframe tableDS. In Scala I am able to remove duplicates over the primary keys using the following:

    import org.apache.spark.sql.expressions.Window.partitionBy
    import org.apache.spark.sql.functions.row_number

    val window = partitionBy(primaryKeySeq.map(k => tableDS(k)): _*).orderBy(tableDS(mergeCol).desc)
    tableDS.withColumn("rn", row_number.over(window)).where($"rn" === 1).drop("rn")

I need to write a similar thing in Python. primaryKeySeq is a list in Python. I tried the first statement
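A minimal PySpark sketch of the same deduplication, assuming primaryKeySeq and mergeCol hold the column names as in the Scala version:

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # Partition by every primary-key column, keep the top row per key
    # according to the merge column, then drop the helper column.
    window = (Window
              .partitionBy(*[F.col(k) for k in primaryKeySeq])
              .orderBy(F.col(mergeCol).desc()))

    deduped = (tableDS
               .withColumn("rn", F.row_number().over(window))
               .where(F.col("rn") == 1)
               .drop("rn"))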

How to use to_json and from_json to eliminate nested structfields in pyspark dataframe?

只谈情不闲聊 Submitted on 2019-12-24 21:53:20
Question: This solution, in theory, works perfectly for what I need, which is to create a new, copied version of a dataframe while excluding certain nested structfields. Here is a minimally reproducible artifact of my issue:

    >>> df.printSchema()
    root
     |-- big: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- keep: string (nullable = true)
     |    |    |-- delete: string (nullable = true)

which you can instantiate like such:

    schema = StructType([StructField("big", ArrayType(StructType([
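Since the question asks specifically about to_json and from_json, here is a minimal sketch of that route: serialize the array column to JSON and parse it back with a schema that omits the delete field. The target schema below is an assumption built from the snippet above:

    from pyspark.sql.functions import to_json, from_json
    from pyspark.sql.types import ArrayType, StructType, StructField, StringType

    # Target schema: the same array of structs, but without the 'delete' field.
    target_schema = ArrayType(StructType([StructField("keep", StringType(), True)]))

    # Round-trip through JSON; fields missing from the target schema are dropped.
    df2 = df.withColumn("big", from_json(to_json("big"), target_schema))
    df2.printSchema()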

Connecting Spark Streaming to Tableau

☆樱花仙子☆ Submitted on 2019-12-24 21:16:52
Question: I am streaming tweets from a Twitter app to Spark for analysis. I want to output the resulting Spark SQL table to Tableau for real-time analysis locally. I have already tried connecting to Databricks to run the program, but I haven't been able to connect the Twitter app to a Databricks notebook. My code for writing the stream looks like this:

    activityQuery = output.writeStream.trigger(processingTime='1 seconds').queryName("Places")\
        .format("memory")\
        .start()

Source: https://stackoverflow.com
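For context on what that snippet does: the memory sink combined with queryName("Places") registers an in-memory table named Places in the same SparkSession, which can be queried while the stream runs, e.g.:

    # Query the in-memory table created by the memory sink + queryName("Places").
    spark.sql("SELECT * FROM Places").show()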

PySpark: How to convert column with Ljava.lang.Object

十年热恋 Submitted on 2019-12-24 20:49:48
Question: I created a data frame in PySpark by reading data from HDFS like this:

    df = spark.read.parquet('path/to/parquet')

I expect the data frame to have two columns of strings:

    +------------+------------------+
    |my_column   |my_other_column   |
    +------------+------------------+
    |my_string_1 |my_other_string_1 |
    |my_string_2 |my_other_string_2 |
    |my_string_3 |my_other_string_3 |
    |my_string_4 |my_other_string_4 |
    |my_string_5 |my_other_string_5 |
    |my_string_6 |my_other_string_6 |
    |my_string_7 |my_other