apache-spark

How to divide or multiply every non-string column of a PySpark dataframe by a float constant?

感情迁移 submitted on 2021-02-16 08:42:31
Question: My input dataframe looks like the below:

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Basics").getOrCreate()
df = spark.createDataFrame(data=[('Alice', 4.300, None), ('Bob', float('nan'), 897)], schema=['name', 'High', 'Low'])

+-----+----+----+
| name|High| Low|
+-----+----+----+
|Alice| 4.3|null|
|  Bob| NaN| 897|
+-----+----+----+

Expected output if divided by 10.0:

+-----+----+----+
| name|High| Low|
+-----+----+----+
|Alice|0.43|null|
|  Bob| NaN|89.7|
+-----+----+----+
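
One way to do this (a minimal sketch of my own, not taken from the question, continuing from the df created above) is to inspect each field's data type in df.schema and apply the arithmetic only to numeric columns, leaving string columns untouched:

from pyspark.sql import functions as F
from pyspark.sql.types import NumericType

factor = 10.0
# Divide every numeric column by the constant; pass string columns through unchanged.
scaled = df.select([
    (F.col(f.name) / factor).alias(f.name) if isinstance(f.dataType, NumericType)
    else F.col(f.name)
    for f in df.schema.fields
])
scaled.show()

Multiplication works the same way with the operator swapped; null and NaN values pass through the division unchanged.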

Apache Spark Kinesis Integration: connected, but no records received

為{幸葍}努か submitted on 2021-02-16 08:30:23
Question: tl;dr: I can't use the Kinesis Spark Streaming integration because it receives no data. A test stream is set up, and a Node.js app sends 1 simple record per second. A standard Spark 1.5.2 cluster is set up with master and worker nodes (4 cores) via docker-compose, with AWS credentials in the environment. spark-streaming-kinesis-asl-assembly_2.10-1.5.2.jar is downloaded and added to the classpath, and job.py or job.jar (which just reads and prints) is submitted. Everything seems to be okay, but no records whatsoever are received.
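
For reference, a minimal job.py for the Spark 1.5.x Kinesis receiver API looks roughly like the sketch below; the application name, stream name, endpoint, and region are placeholders, not values from the question. Two common causes of "connected but no records" are giving the job fewer cores than the receiver plus its tasks need, and reading from InitialPositionInStream.LATEST when the test records were sent earlier.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

sc = SparkContext(appName="KinesisTest")
ssc = StreamingContext(sc, 10)  # 10-second batches

# Placeholder stream/endpoint/region values; AWS credentials are picked up from
# the environment when not passed explicitly.
records = KinesisUtils.createStream(
    ssc,
    kinesisAppName="kinesis-test",      # also used as the DynamoDB checkpoint table name
    streamName="my-test-stream",
    endpointUrl="https://kinesis.us-east-1.amazonaws.com",
    regionName="us-east-1",
    initialPositionInStream=InitialPositionInStream.TRIM_HORIZON,
    checkpointInterval=10)

records.pprint()
ssc.start()
ssc.awaitTermination()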

Read a fixed-width file using a schema from a JSON file in PySpark

流过昼夜 submitted on 2021-02-16 05:33:52
Question: I have a fixed-width file as below:

00120181120xyz12341
00220180203abc56792
00320181203pqr25483

and a corresponding JSON file that specifies the schema:

{"Column":"id","From":"1","To":"3"}
{"Column":"date","From":"4","To":"8"}
{"Column":"name","From":"12","To":"3"}
{"Column":"salary","From":"15","To":"5"}

I read the schema file into a DataFrame using:

SchemaFile = spark.read\
    .format("json")\
    .option("header","true")\
    .json('C:\Temp\schemaFile\schema.json')
SchemaFile.show()
#+------+----+---+
#
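
Assuming "From" is the 1-based start position and "To" is the field length (which is consistent with the sample rows), one possible sketch is to read the data file with spark.read.text and slice each line with substring; the data file path below is a placeholder, not from the question:

from pyspark.sql import functions as F

schema_rows = SchemaFile.collect()                    # the schema file is tiny
raw = spark.read.text('C:\Temp\data\datafile.txt')    # placeholder data path

# Build one substring column per schema row, named after its "Column" field.
parsed = raw.select([
    F.substring('value', int(r['From']), int(r['To'])).alias(r['Column'])
    for r in schema_rows
])
parsed.show()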

PySpark Row objects: accessing row elements by variable names

末鹿安然 submitted on 2021-02-16 05:16:46
Question: One can access PySpark Row elements using dot notation: given r = Row(name="Alice", age=11), one can get the name or the age using r.name or r.age respectively. What happens when one needs to get an element whose name is stored in a variable element? One option is to do r.toDict()[element]. However, consider a situation where we have a large DataFrame and we wish to map a function on each row of that data frame. We can certainly do something like

def f(row, element1, element2): row =
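
For the lookup itself, a short sketch of the usual options (my illustration, not the asker's code) is below; note that the Row method is asDict(), not toDict():

from pyspark.sql import Row

r = Row(name="Alice", age=11)
element = "age"

print(r[element])            # Row supports indexing by field name
print(getattr(r, element))   # attribute lookup with a variable name
print(r.asDict()[element])   # via a plain Python dict

Any of these can be used inside a function that is mapped over the rows of a DataFrame.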

Joining two spark dataframes on time (TimestampType) in python

烂漫一生 submitted on 2021-02-16 03:30:32
Question: I have two dataframes and I would like to join them based on one column, with the caveat that this column is a timestamp, and that timestamp has to be within a certain offset (5 seconds) in order to join records. More specifically, a record in dates_df with date=1/3/2015:00:00:00 should be joined with events_df with time=1/3/2015:00:00:01 because both timestamps are within 5 seconds of each other. I'm trying to get this logic working with Python Spark, and it is extremely painful. How do
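
One way to express the 5-second tolerance (a sketch of my own, reusing the dataframe and column names from the description) is a join on a range condition over the Unix timestamps:

from pyspark.sql import functions as F

# Keep pairs whose timestamps differ by at most 5 seconds.
joined = dates_df.join(
    events_df,
    F.abs(F.unix_timestamp(dates_df["date"]) -
          F.unix_timestamp(events_df["time"])) <= 5,
    "inner")

Because this is a non-equi join, Spark may pick a much more expensive plan on large inputs, so bucketing the timestamps to a coarser key and joining on that first is a common refinement.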

PySpark explode list into multiple columns based on name

若如初见. submitted on 2021-02-15 12:01:02
Question: Hi, I'm dealing with a slightly difficult file format which I'm trying to clean for some future processing. I've been using PySpark to process the data into a dataframe. The file looks similar to this:

AA 1234 ZXYW
BB A 890
CC B 321
AA 1234 LMNO
BB D 123
CC E 321
AA 1234 ZXYW
CC E 456

Each 'AA' record defines the start of a logical group of records, and the data on each line is fixed length and has information encoded in it that I want to extract. There are at least 20-30 different record
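
One way to attach every line to its enclosing 'AA' group (a sketch of my own, assuming the file is read line by line with its order preserved; the path is a placeholder) is to mark the 'AA' lines and carry their id forward with a window:

from pyspark.sql import functions as F, Window

lines = spark.read.text("records.txt") \
    .withColumn("line_id", F.monotonically_increasing_id())

# Carry the most recent 'AA' line id forward to every following line.
w = Window.orderBy("line_id").rowsBetween(Window.unboundedPreceding, Window.currentRow)
grouped = lines.withColumn(
    "group_id",
    F.last(F.when(F.col("value").startswith("AA"), F.col("line_id")),
           ignorenulls=True).over(w))

# Assemble each logical group, e.g. as one list of raw lines per group.
records = grouped.groupBy("group_id").agg(F.collect_list("value").alias("lines"))

An unpartitioned window pulls all rows into a single partition, so this sketch is only suitable for files of moderate size.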

Spark Dataset transformation to array [duplicate]

↘锁芯ラ submitted on 2021-02-11 18:16:14
Question: This question already has answers here: How to aggregate values into collection after groupBy? (3 answers). Closed 8 months ago.

I have a dataset like below, with values of col1 repeating multiple times and unique values of col2. The original dataset can have almost a billion rows, so I do not want to use collect or collect_list as it will not scale out for my use case.

Original dataset:

+------+------+
| col1 | col2 |
+------+------+
|   AA |   11 |
|   BB |   21 |
|   AA |   12 |
|   AA |   13 |
|
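
For reference, the approach in the linked duplicate is a per-group aggregation: unlike DataFrame.collect(), which pulls everything to the driver, groupBy().agg(collect_list(...)) is computed on the executors. A minimal sketch, assuming the dataset is available as a DataFrame named df:

from pyspark.sql import functions as F

# One array of col2 values per distinct col1, built group by group on the executors.
result = df.groupBy("col1").agg(F.collect_list("col2").alias("col2"))
result.show(truncate=False)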

Computing First Day of Previous Quarter in Spark SQL

不想你离开。 submitted on 2021-02-11 17:55:52
Question: How do I derive the first day of the last quarter pertaining to any given date in a Spark SQL query using the SQL API? A few required samples are as below:

input_date | start_date
-----------------------
2020-01-21 | 2019-10-01
2020-02-06 | 2019-10-01
2020-04-15 | 2020-01-01
2020-07-10 | 2020-04-01
2020-10-20 | 2020-07-01
2021-02-04 | 2020-10-01

The quarters generally are:

1 | Jan - Mar
2 | Apr - Jun
3 | Jul - Sep
4 | Oct - Dec

Note: I am using Spark SQL v2.4. Any help is appreciated. Thanks.
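
One sketch that fits Spark 2.4 (my own, using a hypothetical dates view): date_trunc('quarter', ...) snaps a date to the first day of its quarter, and add_months(..., -3) then steps back one quarter and returns a date.

df_in = spark.createDataFrame([('2020-01-21',), ('2021-02-04',)], ['input_date'])
df_in.createOrReplaceTempView('dates')

spark.sql("""
    SELECT input_date,
           add_months(date_trunc('quarter', input_date), -3) AS start_date
    FROM dates
""").show()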

PySpark SQL with a column name containing a dash/hyphen in it

时间秒杀一切 submitted on 2021-02-11 17:37:33
Question: I have a PySpark dataframe df:

data = {'Passenger-Id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}, 'Age': {0: 22, 1: 38, 2: 26, 3: 35, 4: 35}}
df_pd = pd.DataFrame(data, columns=data.keys())
df = spark.createDataFrame(df_pd)

+------------+---+
|Passenger-Id|Age|
+------------+---+
|           1| 22|
|           2| 38|
|           3| 26|
|           4| 35|
|           5| 35|
+------------+---+

This works:

df.filter(df.Age == 22).show()

But the below doesn't work, due to the - in the column name:

df.filter(df.Passenger-Id == 2).show()

AttributeError: 'DataFrame' object
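
Python parses df.Passenger-Id as df.Passenger minus Id, so attribute access cannot express this column at all. A short sketch of the usual workarounds (my own, not the asker's code):

from pyspark.sql import functions as F

df.filter(df['Passenger-Id'] == 2).show()        # bracket indexing on the DataFrame
df.filter(F.col('Passenger-Id') == 2).show()     # functions.col with the literal name
df.filter(F.expr('`Passenger-Id` = 2')).show()   # backticks inside a SQL expression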