apache-spark

Rolling average without timestamp in pyspark

冷暖自知 submitted on 2021-01-28 11:25:33
Question: We can find the rolling/moving average of time series data using a window function in pyspark. The data I am dealing with doesn't have any timestamp column, but it does have a strictly increasing column frame_number. The data looks like this:

    d = [
        {'session_id': 1, 'frame_number': 1, 'rtd': 11.0, 'rtd2': 11.0},
        {'session_id': 1, 'frame_number': 2, 'rtd': 12.0, 'rtd2': 6.0},
        {'session_id': 1, 'frame_number': 3, 'rtd': 4.0, 'rtd2': 233.0},
        {'session_id': 1, 'frame_number': 4, 'rtd': 110.0, 'rtd2'
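
The question body is cut off above, but a minimal sketch of one way to compute a rolling average over frame_number instead of a timestamp is shown below, using a row-based window ordered by frame_number. The three-row window width and the sample rows are assumptions for illustration only.

    from pyspark.sql import SparkSession, Window, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Illustrative sample rows (the question's data is truncated above).
    d = [
        {'session_id': 1, 'frame_number': 1, 'rtd': 11.0},
        {'session_id': 1, 'frame_number': 2, 'rtd': 12.0},
        {'session_id': 1, 'frame_number': 3, 'rtd': 4.0},
        {'session_id': 1, 'frame_number': 4, 'rtd': 110.0},
    ]
    df = spark.createDataFrame(d)

    # Row-based window: the current row and the two preceding rows,
    # per session, ordered by the strictly increasing frame_number.
    w = (Window.partitionBy("session_id")
               .orderBy("frame_number")
               .rowsBetween(-2, Window.currentRow))

    df.withColumn("rtd_rolling_avg", F.avg("rtd").over(w)).show()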

Generate database schema diagram for Databricks

|▌冷眼眸甩不掉的悲伤 submitted on 2021-01-28 11:11:09
Question: I'm creating a Databricks application and the database schema is getting to be non-trivial. Is there a way I can generate a schema diagram for a Databricks database (something similar to the schema diagrams that can be generated from MySQL)?

Answer 1: There are two possible variants: (1) using Spark SQL with show databases, show tables in <database>, describe table ...; (2) using spark.catalog.listDatabases, spark.catalog.listTables, spark.catalog.listColumns. The second variant isn't very performant when you
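
The answer is cut off above; a minimal sketch of the catalog-based variant it mentions might look like the following. It only collects (database, table, column, type) metadata; turning that into an actual diagram (e.g. Graphviz) is left out as an assumption.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Walk the catalog and collect one row per column. This is the second
    # variant from the answer; it makes one catalog call per table, which
    # is why it can be slow on a large metastore.
    rows = []
    for db in spark.catalog.listDatabases():
        for tbl in spark.catalog.listTables(db.name):
            for col in spark.catalog.listColumns(tbl.name, db.name):
                rows.append((db.name, tbl.name, col.name, col.dataType))

    schema_df = spark.createDataFrame(rows, ["database", "table", "column", "type"])
    schema_df.show(truncate=False)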

Efficiently batching Spark dataframes to call an API

谁说胖子不能爱 submitted on 2021-01-28 10:52:55
Question: I am fairly new to Spark and I'm trying to call the Spotify API using Spotipy. I have a list of artist ids which can be used to fetch artist info. The Spotify API allows batch calls of up to 50 ids at once. I load the artist ids from a MySQL database and store them in a dataframe. My problem now is that I do not know how to efficiently batch that dataframe into pieces of 50 or fewer rows. In the example below I'm turning the dataframe into a regular Python list from which I can call the API
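
The question's own example is truncated above. As one hedged sketch, the ids can be grouped into batches of 50 inside each partition with mapPartitions, so the full id list never has to be collected to the driver. The call_spotify_batch helper is a hypothetical stand-in for the actual Spotipy call, and the sample ids are made up.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    BATCH_SIZE = 50  # Spotify's documented per-request maximum

    def batches(iterable, size):
        # Yield lists of at most `size` items from an iterator.
        batch = []
        for item in iterable:
            batch.append(item)
            if len(batch) == size:
                yield batch
                batch = []
        if batch:
            yield batch

    def call_spotify_batch(ids):
        # Hypothetical placeholder: the real job would call something like
        # spotipy.Spotify().artists(ids) here and return the artist records.
        return [{"artist_id": i} for i in ids]

    artist_ids_df = spark.createDataFrame([(str(i),) for i in range(120)], ["artist_id"])

    results = (artist_ids_df.rdd
               .map(lambda row: row.artist_id)
               .mapPartitions(lambda it: (rec
                                          for batch in batches(it, BATCH_SIZE)
                                          for rec in call_spotify_batch(batch))))
    print(results.take(5))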

Running a python Apache Beam Pipeline on Spark

两盒软妹~` submitted on 2021-01-28 10:34:59
Question: I am giving Apache Beam (with the Python SDK) a try here, so I created a simple pipeline and tried to deploy it on a Spark cluster.

    from apache_beam.options.pipeline_options import PipelineOptions
    import apache_beam as beam

    op = PipelineOptions(["--runner=DirectRunner"])

    with beam.Pipeline(options=op) as p:
        p | beam.Create([1, 2, 3]) | beam.Map(lambda x: x + 1) | beam.Map(print)

This pipeline works well with the DirectRunner. So to deploy the same code on Spark (as the portability is a key
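
The question is truncated above, but a minimal sketch of pointing the same pipeline at Beam's portable Spark runner might look like this. The job endpoint address and the LOOPBACK environment type are assumptions that depend on how the Spark job server was started.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Assumes a Beam Spark job server is already running, e.g. started via the
    # apache/beam_spark_job_server Docker image pointed at the Spark master.
    op = PipelineOptions([
        "--runner=PortableRunner",
        "--job_endpoint=localhost:8099",   # assumed job-server address
        "--environment_type=LOOPBACK",     # run the Python workers locally for testing
    ])

    with beam.Pipeline(options=op) as p:
        p | beam.Create([1, 2, 3]) | beam.Map(lambda x: x + 1) | beam.Map(print)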

Issue with df.show() in pyspark

﹥>﹥吖頭↗ submitted on 2021-01-28 09:19:37
Question: I have the following code:

    import pyspark
    import pandas as pd
    from pyspark.sql import SQLContext
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType, StringType

    sc = pyspark.SparkContext()
    sqlCtx = SQLContext(sc)

    df_pd = pd.DataFrame(
        data={'integers': [1, 2, 3],
              'floats': [-1.0, 0.5, 2.7],
              'integer_arrays': [[1, 2], [3, 4, 5], [6, 7, 8, 9]]}
    )

    df = sqlCtx.createDataFrame(df_pd)
    df.printSchema()

It runs fine until here, but when I run:

    df.show()

it gives this error: -
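
The error text is cut off above, so the cause cannot be pinned down here. Purely as an assumption, one thing worth ruling out is schema inference over the mixed pandas columns; the sketch below passes an explicit schema to createDataFrame, reusing df_pd and sqlCtx from the question's code.

    from pyspark.sql.types import (StructType, StructField, LongType,
                                   DoubleType, ArrayType)

    # Explicit schema so nothing is left to inference; types are assumed
    # from the sample data in the question.
    explicit_schema = StructType([
        StructField("integers", LongType(), True),
        StructField("floats", DoubleType(), True),
        StructField("integer_arrays", ArrayType(LongType()), True),
    ])

    df = sqlCtx.createDataFrame(df_pd, schema=explicit_schema)
    df.show()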

Input data to AWS Elastic Search using Glue

被刻印的时光 ゝ submitted on 2021-01-28 09:03:01
Question: I'm looking for a solution to insert data into AWS Elasticsearch using AWS Glue (Python or pyspark). I have seen the Boto3 SDK for Elasticsearch but could not find any function to insert data into Elasticsearch. Can anyone help me find a solution? Any useful links or code?

Answer 1: For AWS Glue you need to add an additional jar to the job. Download the jar from https://repo1.maven.org/maven2/org/elasticsearch/elasticsearch-hadoop/7.8.0/elasticsearch-hadoop-7.8.0.jar, save the jar on S3, and pass it
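
The answer cuts off after the jar step. Once the elasticsearch-hadoop connector is on the job's classpath (for Glue, via the --extra-jars job parameter / "Dependent jars path"), a hedged sketch of writing a DataFrame through it might look like the following; the domain endpoint, index name, and sample rows are assumptions.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

    # Write through the elasticsearch-hadoop connector; WAN-only mode and
    # port 443 are typical for an AWS-managed cluster but are assumptions here.
    (df.write
       .format("org.elasticsearch.spark.sql")
       .option("es.nodes", "https://my-domain.us-east-1.es.amazonaws.com")  # assumed endpoint
       .option("es.port", "443")
       .option("es.nodes.wan.only", "true")
       .option("es.resource", "my_index")   # assumed target index
       .mode("append")
       .save())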

Checking whether a column has proper decimal number

♀尐吖头ヾ submitted on 2021-01-28 08:55:14
Question: I have a dataframe (input_dataframe) that looks like this:

    id  test_column
    1   0.25
    2   1.1
    3   12
    4   test
    5   1.3334
    6   .11

I want to add a column result, which holds 1 if test_column has a decimal value and 0 if test_column has any other value. The data type of test_column is string. Below is the expected output:

    id  test_column  result
    1   0.25         1
    2   1.1          1
    3   12           0
    4   test         0
    5   1.3334       1
    6   .11          1

Can we achieve this using pySpark code?

Answer 1: You can parse a decimal token with decimal.Decimal(). Here we are
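
The answer is truncated above; a minimal sketch along the decimal.Decimal() line it suggests could look like this. The extra check for a '.' in the string is an assumption added so that a plain integer like "12" maps to 0, matching the expected output.

    from decimal import Decimal, InvalidOperation

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    spark = SparkSession.builder.getOrCreate()
    input_dataframe = spark.createDataFrame(
        [(1, "0.25"), (2, "1.1"), (3, "12"), (4, "test"), (5, "1.3334"), (6, ".11")],
        ["id", "test_column"],
    )

    def is_decimal(value):
        # 1 only if the string parses as a number AND contains a decimal point.
        try:
            Decimal(value)
        except (InvalidOperation, TypeError):
            return 0
        return 1 if "." in value else 0

    is_decimal_udf = udf(is_decimal, IntegerType())
    input_dataframe.withColumn("result", is_decimal_udf("test_column")).show()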

Is there a way to collect the names of all fields in a nested schema in pyspark

孤者浪人 submitted on 2021-01-28 08:40:33
Question: I wish to collect the names of all the fields in a nested schema. The data were imported from a JSON file. The schema looks like:

    root
     |-- column_a: string (nullable = true)
     |-- column_b: string (nullable = true)
     |-- column_c: struct (nullable = true)
     |    |-- nested_a: struct (nullable = true)
     |    |    |-- double_nested_a: string (nullable = true)
     |    |    |-- double_nested_b: string (nullable = true)
     |    |    |-- double_nested_c: string (nullable = true)
     |    |-- nested_b: string (nullable = true)
     |-- column_d:
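
The question ends mid-schema, but a recursive walk over df.schema is one common way to collect every field name, including the nested ones. The dotted-path output format and the small sample schema below are assumptions for illustration.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType

    def field_names(schema, prefix=""):
        # Recursively collect dotted field names from a StructType.
        names = []
        for field in schema.fields:
            full_name = prefix + field.name
            names.append(full_name)
            if isinstance(field.dataType, StructType):
                names.extend(field_names(field.dataType, full_name + "."))
        return names

    # Small schema mirroring the (truncated) one in the question.
    schema = StructType([
        StructField("column_a", StringType()),
        StructField("column_c", StructType([
            StructField("nested_a", StructType([
                StructField("double_nested_a", StringType()),
            ])),
            StructField("nested_b", StringType()),
        ])),
    ])

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([], schema)
    print(field_names(df.schema))
    # ['column_a', 'column_c', 'column_c.nested_a',
    #  'column_c.nested_a.double_nested_a', 'column_c.nested_b']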

Apache spark: map csv file to key: value format

旧街凉风 submitted on 2021-01-28 08:17:26
Question: I'm totally new to Apache Spark and Scala, and I'm having problems with mapping a .csv file into a key-value (JSON-like) structure. What I want to accomplish is to turn the .csv file:

    user, timestamp, event
    ec79fcac8c76ebe505b76090f03350a2,2015-03-06 13:52:56,USER_PURCHASED
    ad0e431a69cb3b445ddad7bb97f55665,2015-03-06 13:52:57,USER_SHARED
    83b2d8a2c549fbab0713765532b63b54,2015-03-06 13:52:57,USER_SUBSCRIBED
    ec79fcac8c76ebe505b76090f03350a2,2015-03-06 13:53:01,USER_ADDED_TO_PLAYLIST
    ...

into a
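
The target structure is cut off above, so only the stated goal (a key-value, JSON-like record per CSV row) can be illustrated. Since the other snippets on this page are pyspark, the sketch below is pyspark rather than Scala, and the file path is an assumption.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read the CSV with a header so each row becomes a keyed record.
    df = (spark.read
              .option("header", True)
              .option("inferSchema", True)
              .csv("events.csv"))   # assumed path

    # Each row rendered as a JSON string:
    # {"user": "...", "timestamp": "...", "event": "..."}
    for line in df.toJSON().take(3):
        print(line)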