apache-spark

How to divide or multiply every non-string column of a PySpark dataframe by a float constant?

感情迁移 submitted on 2021-02-16 08:42:31
Question: My input dataframe looks like the below:

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Basics").getOrCreate()
df = spark.createDataFrame(data=[('Alice', 4.300, None), ('Bob', float('nan'), 897)], schema=['name', 'High', 'Low'])

+-----+----+----+
| name|High| Low|
+-----+----+----+
|Alice| 4.3|null|
|  Bob| NaN| 897|
+-----+----+----+

Expected output if divided by 10.0:

+-----+----+----+
| name|High| Low|
+-----+----+----+
|Alice|0.43|null|
|  Bob| NaN|89.7|
+-----+----+----+
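
One way to do this (a minimal sketch of my own, not taken from the question, continuing from the df created above) is to inspect each field's data type in df.schema and apply the arithmetic only to numeric columns, leaving string columns untouched:

from pyspark.sql import functions as F
from pyspark.sql.types import NumericType

factor = 10.0
# Divide every numeric column by the constant; pass string columns through unchanged.
scaled = df.select([
    (F.col(f.name) / factor).alias(f.name) if isinstance(f.dataType, NumericType)
    else F.col(f.name)
    for f in df.schema.fields
])
scaled.show()

Multiplication works the same way with the operator swapped; null and NaN values pass through the division unchanged.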

Apache Spark Kinesis Integration: connected, but no records received

為{幸葍}努か submitted on 2021-02-16 08:30:23
Question: tl;dr: I can't use the Kinesis Spark Streaming integration because it receives no data. A test stream is set up, and a Node.js app sends 1 simple record per second. A standard Spark 1.5.2 cluster is set up with master and worker nodes (4 cores) via docker-compose, with AWS credentials in the environment. spark-streaming-kinesis-asl-assembly_2.10-1.5.2.jar is downloaded and added to the classpath, and job.py or job.jar (which just reads and prints) is submitted. Everything seems to be okay, but no records whatsoever are received.
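
For reference, a minimal job.py for the Spark 1.5.x Kinesis receiver API looks roughly like the sketch below; the application name, stream name, endpoint, and region are placeholders, not values from the question. Two common causes of "connected but no records" are giving the job fewer cores than the receiver plus its tasks need, and reading from InitialPositionInStream.LATEST when the test records were sent earlier.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

sc = SparkContext(appName="KinesisTest")
ssc = StreamingContext(sc, 10)  # 10-second batches

# Placeholder stream/endpoint/region values; AWS credentials are picked up from
# the environment when not passed explicitly.
records = KinesisUtils.createStream(
    ssc,
    kinesisAppName="kinesis-test",      # also used as the DynamoDB checkpoint table name
    streamName="my-test-stream",
    endpointUrl="https://kinesis.us-east-1.amazonaws.com",
    regionName="us-east-1",
    initialPositionInStream=InitialPositionInStream.TRIM_HORIZON,
    checkpointInterval=10)

records.pprint()
ssc.start()
ssc.awaitTermination()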

Read a fixed-width file using a schema from a JSON file in PySpark

流过昼夜 submitted on 2021-02-16 05:33:52
Question: I have a fixed-width file as below:

00120181120xyz12341
00220180203abc56792
00320181203pqr25483

and a corresponding JSON file that specifies the schema:

{"Column":"id","From":"1","To":"3"}
{"Column":"date","From":"4","To":"8"}
{"Column":"name","From":"12","To":"3"}
{"Column":"salary","From":"15","To":"5"}

I read the schema file into a DataFrame using:

SchemaFile = spark.read\
    .format("json")\
    .option("header","true")\
    .json('C:\Temp\schemaFile\schema.json')
SchemaFile.show()
#+------+----+---+
#
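
Assuming "From" is the 1-based start position and "To" is the field length (which is consistent with the sample rows), one possible sketch is to read the data file with spark.read.text and slice each line with substring; the data file path below is a placeholder, not from the question:

from pyspark.sql import functions as F

schema_rows = SchemaFile.collect()                    # the schema file is tiny
raw = spark.read.text('C:\Temp\data\datafile.txt')    # placeholder data path

# Build one substring column per schema row, named after its "Column" field.
parsed = raw.select([
    F.substring('value', int(r['From']), int(r['To'])).alias(r['Column'])
    for r in schema_rows
])
parsed.show()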

PySpark Row objects: accessing row elements by variable names

末鹿安然 submitted on 2021-02-16 05:16:46
Question: One can access PySpark Row elements using dot notation: given r = Row(name="Alice", age=11), one can get the name or the age using r.name or r.age respectively. What happens when one needs to get an element whose name is stored in a variable element? One option is to do r.toDict()[element]. However, consider a situation where we have a large DataFrame and we wish to map a function on each row of that data frame. We can certainly do something like

def f(row, element1, element2): row =
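
For the lookup itself, a short sketch of the usual options (my illustration, not the asker's code) is below; note that the Row method is asDict(), not toDict():

from pyspark.sql import Row

r = Row(name="Alice", age=11)
element = "age"

print(r[element])            # Row supports indexing by field name
print(getattr(r, element))   # attribute lookup with a variable name
print(r.asDict()[element])   # via a plain Python dict

Any of these can be used inside a function that is mapped over the rows of a DataFrame.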

Joining two spark dataframes on time (TimestampType) in python

烂漫一生 submitted on 2021-02-16 03:30:32
Question: I have two dataframes and I would like to join them based on one column, with the caveat that this column is a timestamp, and that timestamp has to be within a certain offset (5 seconds) in order to join records. More specifically, a record in dates_df with date=1/3/2015:00:00:00 should be joined with events_df with time=1/3/2015:00:00:01 because both timestamps are within 5 seconds of each other. I'm trying to get this logic working with Python Spark, and it is extremely painful. How do
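
One way to express the 5-second tolerance (a sketch of my own, reusing the dataframe and column names from the description) is a join on a range condition over the Unix timestamps:

from pyspark.sql import functions as F

# Keep pairs whose timestamps differ by at most 5 seconds.
joined = dates_df.join(
    events_df,
    F.abs(F.unix_timestamp(dates_df["date"]) -
          F.unix_timestamp(events_df["time"])) <= 5,
    "inner")

Because this is a non-equi join, Spark may pick a much more expensive plan on large inputs, so bucketing the timestamps to a coarser key and joining on that first is a common refinement.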

PySpark explode list into multiple columns based on name

若如初见. submitted on 2021-02-15 12:01:02
Question: Hi, I'm dealing with a slightly difficult file format which I'm trying to clean for some future processing. I've been using PySpark to process the data into a dataframe. The file looks similar to this:

AA 1234 ZXYW
BB A 890
CC B 321
AA 1234 LMNO
BB D 123
CC E 321
AA 1234 ZXYW
CC E 456

Each 'AA' record defines the start of a logical group of records, and the data on each line is fixed length and has information encoded in it that I want to extract. There are at least 20-30 different record
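
One way to attach every line to its enclosing 'AA' group (a sketch of my own, assuming the file is read line by line with its order preserved; the path is a placeholder) is to mark the 'AA' lines and carry their id forward with a window:

from pyspark.sql import functions as F, Window

lines = spark.read.text("records.txt") \
    .withColumn("line_id", F.monotonically_increasing_id())

# Carry the most recent 'AA' line id forward to every following line.
w = Window.orderBy("line_id").rowsBetween(Window.unboundedPreceding, Window.currentRow)
grouped = lines.withColumn(
    "group_id",
    F.last(F.when(F.col("value").startswith("AA"), F.col("line_id")),
           ignorenulls=True).over(w))

# Assemble each logical group, e.g. as one list of raw lines per group.
records = grouped.groupBy("group_id").agg(F.collect_list("value").alias("lines"))

An unpartitioned window pulls all rows into a single partition, so this sketch is only suitable for files of moderate size.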

Spark Dataset transformation to array [duplicate]

↘锁芯ラ submitted on 2021-02-11 18:16:14
Question: This question already has answers here: How to aggregate values into collection after groupBy? (3 answers). Closed 8 months ago.

I have a dataset like below, with values of col1 repeating multiple times and unique values of col2. The original dataset can have almost a billion rows, so I do not want to use collect or collect_list as it will not scale out for my use case.

Original dataset:

+------+------+
| col1 | col2 |
+------+------+
|   AA |   11 |
|   BB |   21 |
|   AA |   12 |
|   AA |   13 |
|
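
For reference, the approach in the linked duplicate is a per-group aggregation: unlike DataFrame.collect(), which pulls everything to the driver, groupBy().agg(collect_list(...)) is computed on the executors. A minimal sketch, assuming the dataset is available as a DataFrame named df:

from pyspark.sql import functions as F

# One array of col2 values per distinct col1, built group by group on the executors.
result = df.groupBy("col1").agg(F.collect_list("col2").alias("col2"))
result.show(truncate=False)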

Computing First Day of Previous Quarter in Spark SQL

不想你离开。 submitted on 2021-02-11 17:55:52
Question: How do I derive the first day of the last quarter pertaining to any given date in a Spark SQL query using the SQL API? A few required samples are as below:

input_date | start_date
-----------------------
2020-01-21 | 2019-10-01
2020-02-06 | 2019-10-01
2020-04-15 | 2020-01-01
2020-07-10 | 2020-04-01
2020-10-20 | 2020-07-01
2021-02-04 | 2020-10-01

The quarters generally are:

1 | Jan - Mar
2 | Apr - Jun
3 | Jul - Sep
4 | Oct - Dec

Note: I am using Spark SQL v2.4. Any help is appreciated. Thanks.
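
One sketch that fits Spark 2.4 (my own, using a hypothetical dates view): date_trunc('quarter', ...) snaps a date to the first day of its quarter, and add_months(..., -3) then steps back one quarter and returns a date.

df_in = spark.createDataFrame([('2020-01-21',), ('2021-02-04',)], ['input_date'])
df_in.createOrReplaceTempView('dates')

spark.sql("""
    SELECT input_date,
           add_months(date_trunc('quarter', input_date), -3) AS start_date
    FROM dates
""").show()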

PySpark SQL with a column name containing a dash/hyphen in it

时间秒杀一切 submitted on 2021-02-11 17:37:33
Question: I have a PySpark dataframe df:

data = {'Passenger-Id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}, 'Age': {0: 22, 1: 38, 2: 26, 3: 35, 4: 35}}
df_pd = pd.DataFrame(data, columns=data.keys())
df = spark.createDataFrame(df_pd)

+------------+---+
|Passenger-Id|Age|
+------------+---+
|           1| 22|
|           2| 38|
|           3| 26|
|           4| 35|
|           5| 35|
+------------+---+

This works:

df.filter(df.Age == 22).show()

But the below doesn't work, due to the - in the column name:

df.filter(df.Passenger-Id == 2).show()

AttributeError: 'DataFrame' object
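
Python parses df.Passenger-Id as df.Passenger minus Id, so attribute access cannot express this column at all. A short sketch of the usual workarounds (my own, not the asker's code):

from pyspark.sql import functions as F

df.filter(df['Passenger-Id'] == 2).show()        # bracket indexing on the DataFrame
df.filter(F.col('Passenger-Id') == 2).show()     # functions.col with the literal name
df.filter(F.expr('`Passenger-Id` = 2')).show()   # backticks inside a SQL expression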