apache-spark

Rolling average without timestamp in pyspark

冷暖自知 submitted on 2021-01-28 11:25:33
Question: We can find the rolling/moving average of time series data using a window function in pyspark. The data I am dealing with doesn't have any timestamp column, but it does have a strictly increasing column frame_number. The data looks like this:

    d = [
        {'session_id': 1, 'frame_number': 1, 'rtd': 11.0, 'rtd2': 11.0},
        {'session_id': 1, 'frame_number': 2, 'rtd': 12.0, 'rtd2': 6.0},
        {'session_id': 1, 'frame_number': 3, 'rtd': 4.0, 'rtd2': 233.0},
        {'session_id': 1, 'frame_number': 4, 'rtd': 110.0, 'rtd2'
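
The question body is cut off above, but a minimal sketch of one way to compute a rolling average over frame_number instead of a timestamp is shown below, using a row-based window ordered by frame_number. The three-row window width and the sample rows are assumptions for illustration only.

    from pyspark.sql import SparkSession, Window, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Illustrative sample rows (the question's data is truncated above).
    d = [
        {'session_id': 1, 'frame_number': 1, 'rtd': 11.0},
        {'session_id': 1, 'frame_number': 2, 'rtd': 12.0},
        {'session_id': 1, 'frame_number': 3, 'rtd': 4.0},
        {'session_id': 1, 'frame_number': 4, 'rtd': 110.0},
    ]
    df = spark.createDataFrame(d)

    # Row-based window: the current row and the two preceding rows,
    # per session, ordered by the strictly increasing frame_number.
    w = (Window.partitionBy("session_id")
               .orderBy("frame_number")
               .rowsBetween(-2, Window.currentRow))

    df.withColumn("rtd_rolling_avg", F.avg("rtd").over(w)).show()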

Generate database schema diagram for Databricks

|▌冷眼眸甩不掉的悲伤 submitted on 2021-01-28 11:11:09
Question: I'm creating a Databricks application and the database schema is getting to be non-trivial. Is there a way I can generate a schema diagram for a Databricks database (something similar to the schema diagrams that can be generated from MySQL)?

Answer 1: There are two possible variants: (1) using Spark SQL with show databases, show tables in <database>, describe table ...; (2) using spark.catalog.listDatabases, spark.catalog.listTables, spark.catalog.listColumns. The second variant isn't very performant when you
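
The answer is cut off above; a minimal sketch of the catalog-based variant it mentions might look like the following. It only collects (database, table, column, type) metadata; turning that into an actual diagram (e.g. Graphviz) is left out as an assumption.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Walk the catalog and collect one row per column. This is the second
    # variant from the answer; it makes one catalog call per table, which
    # is why it can be slow on a large metastore.
    rows = []
    for db in spark.catalog.listDatabases():
        for tbl in spark.catalog.listTables(db.name):
            for col in spark.catalog.listColumns(tbl.name, db.name):
                rows.append((db.name, tbl.name, col.name, col.dataType))

    schema_df = spark.createDataFrame(rows, ["database", "table", "column", "type"])
    schema_df.show(truncate=False)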

Efficiently batching Spark dataframes to call an API

谁说胖子不能爱 submitted on 2021-01-28 10:52:55
Question: I am fairly new to Spark and I'm trying to call the Spotify API using Spotipy. I have a list of artist ids which can be used to fetch artist info. The Spotify API allows batch calls of up to 50 ids at once. I load the artist ids from a MySQL database and store them in a dataframe. My problem now is that I do not know how to efficiently batch that dataframe into pieces of 50 or fewer rows. In the example below I'm turning the dataframe into a regular Python list from which I can call the API
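
The question's own example is truncated above. As one hedged sketch, the ids can be grouped into batches of 50 inside each partition with mapPartitions, so the full id list never has to be collected to the driver. The call_spotify_batch helper is a hypothetical stand-in for the actual Spotipy call, and the sample ids are made up.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    BATCH_SIZE = 50  # Spotify's documented per-request maximum

    def batches(iterable, size):
        # Yield lists of at most `size` items from an iterator.
        batch = []
        for item in iterable:
            batch.append(item)
            if len(batch) == size:
                yield batch
                batch = []
        if batch:
            yield batch

    def call_spotify_batch(ids):
        # Hypothetical placeholder: the real job would call something like
        # spotipy.Spotify().artists(ids) here and return the artist records.
        return [{"artist_id": i} for i in ids]

    artist_ids_df = spark.createDataFrame([(str(i),) for i in range(120)], ["artist_id"])

    results = (artist_ids_df.rdd
               .map(lambda row: row.artist_id)
               .mapPartitions(lambda it: (rec
                                          for batch in batches(it, BATCH_SIZE)
                                          for rec in call_spotify_batch(batch))))
    print(results.take(5))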

Running a python Apache Beam Pipeline on Spark

两盒软妹~` submitted on 2021-01-28 10:34:59
Question: I am giving Apache Beam (with the Python SDK) a try here, so I created a simple pipeline and tried to deploy it on a Spark cluster.

    from apache_beam.options.pipeline_options import PipelineOptions
    import apache_beam as beam

    op = PipelineOptions(["--runner=DirectRunner"])

    with beam.Pipeline(options=op) as p:
        p | beam.Create([1, 2, 3]) | beam.Map(lambda x: x + 1) | beam.Map(print)

This pipeline works well with the DirectRunner. So to deploy the same code on Spark (as the portability is a key
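
The question is truncated above, but a minimal sketch of pointing the same pipeline at Beam's portable Spark runner might look like this. The job endpoint address and the LOOPBACK environment type are assumptions that depend on how the Spark job server was started.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Assumes a Beam Spark job server is already running, e.g. started via the
    # apache/beam_spark_job_server Docker image pointed at the Spark master.
    op = PipelineOptions([
        "--runner=PortableRunner",
        "--job_endpoint=localhost:8099",   # assumed job-server address
        "--environment_type=LOOPBACK",     # run the Python workers locally for testing
    ])

    with beam.Pipeline(options=op) as p:
        p | beam.Create([1, 2, 3]) | beam.Map(lambda x: x + 1) | beam.Map(print)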

Issue with df.show() in pyspark

﹥>﹥吖頭↗ submitted on 2021-01-28 09:19:37
Question: I have the following code:

    import pyspark
    import pandas as pd
    from pyspark.sql import SQLContext
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType, StringType

    sc = pyspark.SparkContext()
    sqlCtx = SQLContext(sc)

    df_pd = pd.DataFrame(
        data={'integers': [1, 2, 3],
              'floats': [-1.0, 0.5, 2.7],
              'integer_arrays': [[1, 2], [3, 4, 5], [6, 7, 8, 9]]}
    )

    df = sqlCtx.createDataFrame(df_pd)
    df.printSchema()

It runs fine until here, but when I run:

    df.show()

it gives this error: -
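
The error text is cut off above, so the cause cannot be pinned down here. Purely as an assumption, one thing worth ruling out is schema inference over the mixed pandas columns; the sketch below passes an explicit schema to createDataFrame, reusing df_pd and sqlCtx from the question's code.

    from pyspark.sql.types import (StructType, StructField, LongType,
                                   DoubleType, ArrayType)

    # Explicit schema so nothing is left to inference; types are assumed
    # from the sample data in the question.
    explicit_schema = StructType([
        StructField("integers", LongType(), True),
        StructField("floats", DoubleType(), True),
        StructField("integer_arrays", ArrayType(LongType()), True),
    ])

    df = sqlCtx.createDataFrame(df_pd, schema=explicit_schema)
    df.show()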

Input data to AWS Elastic Search using Glue

被刻印的时光 ゝ submitted on 2021-01-28 09:03:01
Question: I'm looking for a solution to insert data into AWS Elasticsearch using AWS Glue (Python or pyspark). I have seen the Boto3 SDK for Elasticsearch but could not find any function to insert data into Elasticsearch. Can anyone help me find a solution? Any useful links or code?

Answer 1: For AWS Glue you need to add an additional jar to the job. Download the jar from https://repo1.maven.org/maven2/org/elasticsearch/elasticsearch-hadoop/7.8.0/elasticsearch-hadoop-7.8.0.jar, save the jar on S3, and pass it
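
The answer cuts off after the jar step. Once the elasticsearch-hadoop connector is on the job's classpath (for Glue, via the --extra-jars job parameter / "Dependent jars path"), a hedged sketch of writing a DataFrame through it might look like the following; the domain endpoint, index name, and sample rows are assumptions.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

    # Write through the elasticsearch-hadoop connector; WAN-only mode and
    # port 443 are typical for an AWS-managed cluster but are assumptions here.
    (df.write
       .format("org.elasticsearch.spark.sql")
       .option("es.nodes", "https://my-domain.us-east-1.es.amazonaws.com")  # assumed endpoint
       .option("es.port", "443")
       .option("es.nodes.wan.only", "true")
       .option("es.resource", "my_index")   # assumed target index
       .mode("append")
       .save())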

Checking whether a column has proper decimal number

♀尐吖头ヾ submitted on 2021-01-28 08:55:14
Question: I have a dataframe (input_dataframe) that looks like this:

    id  test_column
    1   0.25
    2   1.1
    3   12
    4   test
    5   1.3334
    6   .11

I want to add a column result, which holds 1 if test_column has a decimal value and 0 if test_column has any other value. The data type of test_column is string. Below is the expected output:

    id  test_column  result
    1   0.25         1
    2   1.1          1
    3   12           0
    4   test         0
    5   1.3334       1
    6   .11          1

Can we achieve this using pySpark code?

Answer 1: You can parse a decimal token with decimal.Decimal(). Here we are
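
The answer is truncated above; a minimal sketch along the decimal.Decimal() line it suggests could look like this. The extra check for a '.' in the string is an assumption added so that a plain integer like "12" maps to 0, matching the expected output.

    from decimal import Decimal, InvalidOperation

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    spark = SparkSession.builder.getOrCreate()
    input_dataframe = spark.createDataFrame(
        [(1, "0.25"), (2, "1.1"), (3, "12"), (4, "test"), (5, "1.3334"), (6, ".11")],
        ["id", "test_column"],
    )

    def is_decimal(value):
        # 1 only if the string parses as a number AND contains a decimal point.
        try:
            Decimal(value)
        except (InvalidOperation, TypeError):
            return 0
        return 1 if "." in value else 0

    is_decimal_udf = udf(is_decimal, IntegerType())
    input_dataframe.withColumn("result", is_decimal_udf("test_column")).show()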

Is there a way to collect the names of all fields in a nested schema in pyspark

孤者浪人 submitted on 2021-01-28 08:40:33
Question: I wish to collect the names of all the fields in a nested schema. The data were imported from a JSON file. The schema looks like:

    root
     |-- column_a: string (nullable = true)
     |-- column_b: string (nullable = true)
     |-- column_c: struct (nullable = true)
     |    |-- nested_a: struct (nullable = true)
     |    |    |-- double_nested_a: string (nullable = true)
     |    |    |-- double_nested_b: string (nullable = true)
     |    |    |-- double_nested_c: string (nullable = true)
     |    |-- nested_b: string (nullable = true)
     |-- column_d:
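
The question ends mid-schema, but a recursive walk over df.schema is one common way to collect every field name, including the nested ones. The dotted-path output format and the small sample schema below are assumptions for illustration.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType

    def field_names(schema, prefix=""):
        # Recursively collect dotted field names from a StructType.
        names = []
        for field in schema.fields:
            full_name = prefix + field.name
            names.append(full_name)
            if isinstance(field.dataType, StructType):
                names.extend(field_names(field.dataType, full_name + "."))
        return names

    # Small schema mirroring the (truncated) one in the question.
    schema = StructType([
        StructField("column_a", StringType()),
        StructField("column_c", StructType([
            StructField("nested_a", StructType([
                StructField("double_nested_a", StringType()),
            ])),
            StructField("nested_b", StringType()),
        ])),
    ])

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([], schema)
    print(field_names(df.schema))
    # ['column_a', 'column_c', 'column_c.nested_a',
    #  'column_c.nested_a.double_nested_a', 'column_c.nested_b']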

Apache spark: map csv file to key: value format

旧街凉风 submitted on 2021-01-28 08:17:26
Question: I'm totally new to Apache Spark and Scala, and I'm having problems with mapping a .csv file into a key-value (JSON-like) structure. What I want to accomplish is to turn the .csv file:

    user, timestamp, event
    ec79fcac8c76ebe505b76090f03350a2,2015-03-06 13:52:56,USER_PURCHASED
    ad0e431a69cb3b445ddad7bb97f55665,2015-03-06 13:52:57,USER_SHARED
    83b2d8a2c549fbab0713765532b63b54,2015-03-06 13:52:57,USER_SUBSCRIBED
    ec79fcac8c76ebe505b76090f03350a2,2015-03-06 13:53:01,USER_ADDED_TO_PLAYLIST
    ...

into a
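
The target structure is cut off above, so only the stated goal (a key-value, JSON-like record per CSV row) can be illustrated. Since the other snippets on this page are pyspark, the sketch below is pyspark rather than Scala, and the file path is an assumption.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read the CSV with a header so each row becomes a keyed record.
    df = (spark.read
              .option("header", True)
              .option("inferSchema", True)
              .csv("events.csv"))   # assumed path

    # Each row rendered as a JSON string:
    # {"user": "...", "timestamp": "...", "event": "..."}
    for line in df.toJSON().take(3):
        print(line)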