pyspark

pySpark (v2.4) DataFrameReader adds leading whitespace to column names

这一生的挚爱 Submitted on 2019-12-25 00:29:31
Question: Here is a snippet of a CSV file that I have:

    "Index", "Living Space (sq ft)", "Beds", "Baths", "Zip", "Year", "List Price ($)"
    1, 2222, 3, 3.5, 32312, 1981, 250000
    2, 1628, 3, 2, 32308, 2009, 185000
    3, 3824, 5, 4, 32312, 1954, 399000
    4, 1137, 3, 2, 32309, 1993, 150000
    5, 3560, 6, 4, 32309, 1973, 315000

Oddly, when I perform the following pySpark (v2.4) statements, the header column names (minus the first column) have leading whitespaces. I've tried different quote and escape options, but to no avail.
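Not an official answer, but a minimal sketch of one way to work around this, assuming the file is read with header=True. The whitespace options are real CSV reader options, though whether they also trim the header names can depend on the Spark version, so the explicit rename at the end is the reliable fallback (the file name houses.csv is a placeholder):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Try the CSV reader's whitespace options first; they trim values and,
    # depending on the version, may or may not trim the header names.
    df = spark.read.csv("houses.csv",
                        header=True,
                        ignoreLeadingWhiteSpace=True,
                        ignoreTrailingWhiteSpace=True)

    # Reliable fallback: strip whitespace from the column names explicitly.
    df = df.toDF(*[c.strip() for c in df.columns])
    df.printSchema()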

“'DataFrame' object has no attribute 'apply'” when trying to apply lambda to create new column

爱⌒轻易说出口 Submitted on 2019-12-25 00:09:03
Question: I aim to add a new column to a Pandas DataFrame, but I am facing a weird error. The new column is expected to be a transformation of an existing column, which can be done with a lookup in a dictionary/hashmap.

    # Loading data
    df = sqlContext.read.format(...).load(train_df_path)

    # Instantiating the map
    some_map = {
        'a': 0,
        'b': 1,
        'c': 1,
    }

    # Creating a new column using the map
    df['new_column'] = df.apply(lambda row: some_map(row.some_column_name), axis=1)

Which leads to the following error.
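Since df comes from sqlContext.read, it is a Spark DataFrame, not a Pandas one, and has no apply method. A minimal Spark-native sketch of the same lookup, reusing the some_map and column names from the question (one common approach, not necessarily the accepted answer):

    from itertools import chain
    from pyspark.sql.functions import create_map, lit, col

    some_map = {'a': 0, 'b': 1, 'c': 1}

    # Build a literal map column from the Python dict (alternating keys/values),
    # then look up the value of some_column_name in it.
    mapping = create_map(*[lit(x) for x in chain(*some_map.items())])
    df = df.withColumn("new_column", mapping.getItem(col("some_column_name")))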

Unable to see messages from Kafka Stream in Spark

落爺英雄遲暮 Submitted on 2019-12-25 00:08:24
Question: I have just started testing streaming from Kafka to Spark using the PySpark library. I have been running the whole setup in a Jupyter Notebook. I am trying to get data from the Twitter streaming API.

Twitter streaming code:

    import json
    import tweepy
    from uuid import uuid4
    import time
    from kafka import KafkaConsumer
    from kafka import KafkaProducer

    auth = tweepy.OAuthHandler("key", "key")
    auth.set_access_token("token", "token")
    api = tweepy.API(auth, wait_on_rate_limit=True, retry_count=3, retry_delay=5,
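For reference, a minimal sketch of the Spark side that would consume such a topic with Structured Streaming; the broker address and topic name are assumptions, and the spark-sql-kafka-0-10 package must be available to the session:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-twitter-test").getOrCreate()

    # Subscribe to the Kafka topic the producer writes to (topic name assumed).
    tweets = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "localhost:9092")
              .option("subscribe", "twitter")
              .option("startingOffsets", "latest")
              .load())

    # Kafka delivers bytes; cast the value to a string and print it to the console.
    query = (tweets.selectExpr("CAST(value AS STRING) AS json")
             .writeStream
             .format("console")
             .start())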

Is there a way to join databases from two different data sources (i.e. MySQL and PostgreSQL) using Logstash and index them into Elasticsearch?

拜拜、爱过 Submitted on 2019-12-24 23:24:24
Question: I am very new to the ELK stack and want to know whether there is a way to join two databases from different sources (i.e. MySQL and Postgres) and index the result into a single Elasticsearch index using Logstash. I am able to achieve this with PySpark, but I want to do it with Logstash if possible. Please also suggest other feasible ways to achieve this apart from Spark and Logstash. Thanks in advance!

Answer 1: You can definitely achieve this by sourcing data
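For comparison, a rough sketch of the PySpark route the question alludes to: join one MySQL table and one Postgres table over JDBC and write the result to Elasticsearch with the elasticsearch-hadoop connector. Every connection string, table name, join key, and index name below is an assumption, and the JDBC drivers plus the ES connector must be on the classpath:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("join-mysql-postgres").getOrCreate()

    mysql_df = (spark.read.format("jdbc")
                .option("url", "jdbc:mysql://mysql-host:3306/db1")
                .option("dbtable", "orders")
                .option("user", "user").option("password", "pass")
                .load())

    pg_df = (spark.read.format("jdbc")
             .option("url", "jdbc:postgresql://pg-host:5432/db2")
             .option("dbtable", "customers")
             .option("user", "user").option("password", "pass")
             .load())

    joined = mysql_df.join(pg_df, on="customer_id", how="inner")

    (joined.write
     .format("org.elasticsearch.spark.sql")
     .option("es.nodes", "localhost")
     .option("es.port", "9200")
     .mode("append")
     .save("joined_index"))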

Accessing elements of an array of Row objects and concatenating them - pySpark

大憨熊 Submitted on 2019-12-24 23:02:29
Question: I have a pyspark.sql.dataframe.DataFrame, where one of the columns holds an array of Row objects:

    +----------------------------------------------------------------+
    |column                                                          |
    +----------------------------------------------------------------+
    |[Row(arrival='2019-12-25 19:55', departure='2019-12-25 18:22'), |
    | Row(arrival='2019-12-26 14:56', departure='2019-12-26 08:52')] |
    +----------------------------------------------------------------+
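One possible approach, sketched with the Spark 2.4 higher-order functions transform and array_join; the "departure -> arrival" output format and the column name are assumptions based on the sample above:

    from pyspark.sql.functions import expr

    # For each struct in the array build a "departure -> arrival" string,
    # then join the resulting array of strings with "; ".
    df2 = df.withColumn(
        "trips",
        expr("array_join(transform(`column`, x -> concat(x.departure, ' -> ', x.arrival)), '; ')")
    )
    df2.select("trips").show(truncate=False)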

Dynamically setting schema for spark.createDataFrame

核能气质少年 Submitted on 2019-12-24 22:34:33
Question: I am trying to set the type of each field in the schema dynamically. I have seen the following code on Stack Overflow:

    schema = StructType([StructField(header[i], StringType(), True) for i in range(len(header))])

But how can I turn this into a conditional statement? For example: if the header is in list1 then IntegerType, if it is in list2 then DoubleType, else StringType?

Answer 1: A colleague answered this for me:

    schema = StructType([
        StructField(header[i], DateType(), True) if header[i] in dateFields else
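A minimal sketch of the conditional pattern the question asks about, using the list1/list2 example from the question (the header and list contents below are placeholders):

    from pyspark.sql.types import (StructType, StructField,
                                   IntegerType, DoubleType, StringType)

    header = ["id", "price", "name"]   # placeholder column names
    list1 = ["id"]                     # columns that should be integers
    list2 = ["price"]                  # columns that should be doubles

    def field_type(name):
        if name in list1:
            return IntegerType()
        if name in list2:
            return DoubleType()
        return StringType()

    schema = StructType([StructField(h, field_type(h), True) for h in header])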

Window Functions partitionBy over a list

╄→尐↘猪︶ㄣ Submitted on 2019-12-24 22:25:03
Question: I have a dataframe tableDS. In Scala I am able to remove duplicates over the primary keys using the following:

    import org.apache.spark.sql.expressions.Window.partitionBy
    import org.apache.spark.sql.functions.row_number

    val window = partitionBy(primaryKeySeq.map(k => tableDS(k)): _*).orderBy(tableDS(mergeCol).desc)
    tableDS.withColumn("rn", row_number.over(window)).where($"rn" === 1).drop("rn")

I need to write a similar thing in Python. primaryKeySeq is a list in Python. I tried the first statement
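A minimal PySpark sketch of the same deduplication, assuming primaryKeySeq and mergeCol hold the column names as in the Scala version:

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # Partition by every primary-key column, keep the top row per key
    # according to the merge column, then drop the helper column.
    window = (Window
              .partitionBy(*[F.col(k) for k in primaryKeySeq])
              .orderBy(F.col(mergeCol).desc()))

    deduped = (tableDS
               .withColumn("rn", F.row_number().over(window))
               .where(F.col("rn") == 1)
               .drop("rn"))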

How to use to_json and from_json to eliminate nested structfields in pyspark dataframe?

只谈情不闲聊 Submitted on 2019-12-24 21:53:20
Question: This solution, in theory, works perfectly for what I need, which is to create a new, copied version of a dataframe while excluding certain nested structfields. Here is a minimally reproducible artifact of my issue:

    >>> df.printSchema()
    root
     |-- big: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- keep: string (nullable = true)
     |    |    |-- delete: string (nullable = true)

which you can instantiate like such:

    schema = StructType([StructField("big", ArrayType(StructType([
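Since the question asks specifically about to_json and from_json, here is a minimal sketch of that route: serialize the array column to JSON and parse it back with a schema that omits the delete field. The target schema below is an assumption built from the snippet above:

    from pyspark.sql.functions import to_json, from_json
    from pyspark.sql.types import ArrayType, StructType, StructField, StringType

    # Target schema: the same array of structs, but without the 'delete' field.
    target_schema = ArrayType(StructType([StructField("keep", StringType(), True)]))

    # Round-trip through JSON; fields missing from the target schema are dropped.
    df2 = df.withColumn("big", from_json(to_json("big"), target_schema))
    df2.printSchema()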

Connecting Spark Streaming to Tableau

☆樱花仙子☆ Submitted on 2019-12-24 21:16:52
Question: I am streaming tweets from a Twitter app to Spark for analysis. I want to output the resulting Spark SQL table to Tableau for real-time analysis locally. I have already tried connecting to Databricks to run the program, but I haven't been able to connect the Twitter app to a Databricks notebook. My code for writing the stream looks like this:

    activityQuery = output.writeStream.trigger(processingTime='1 seconds').queryName("Places")\
        .format("memory")\
        .start()

Source: https://stackoverflow.com
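For context on what that snippet does: the memory sink combined with queryName("Places") registers an in-memory table named Places in the same SparkSession, which can be queried while the stream runs, e.g.:

    # Query the in-memory table created by the memory sink + queryName("Places").
    spark.sql("SELECT * FROM Places").show()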

PySpark: How to convert column with Ljava.lang.Object

十年热恋 Submitted on 2019-12-24 20:49:48
Question: I created a data frame in PySpark by reading data from HDFS like this:

    df = spark.read.parquet('path/to/parquet')

I expect the data frame to have two columns of strings:

    +------------+------------------+
    |my_column   |my_other_column   |
    +------------+------------------+
    |my_string_1 |my_other_string_1 |
    |my_string_2 |my_other_string_2 |
    |my_string_3 |my_other_string_3 |
    |my_string_4 |my_other_string_4 |
    |my_string_5 |my_other_string_5 |
    |my_string_6 |my_other_string_6 |
    |my_string_7 |my_other