pyspark

ModuleNotFoundError: No module named 'pyspark.dbutils'

夙愿已清 submitted on 2020-06-17 09:59:11
Question: I am running pyspark from an Azure Machine Learning notebook and am trying to move a file using the dbutils module:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    def get_dbutils(spark):
        try:
            from pyspark.dbutils import DBUtils
            dbutils = DBUtils(spark)
        except ImportError:
            import IPython
            dbutils = IPython.get_ipython().user_ns["dbutils"]
        return dbutils

    dbutils = get_dbutils(spark)
    dbutils.fs.cp("file:source", "dbfs:destination")

I got this error: ModuleNotFoundError: No module named 'pyspark.dbutils'
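
A side note, not part of the question: pyspark.dbutils ships with the Databricks Runtime (and Databricks Connect) but not with a plain pip-installed pyspark such as the one on an Azure Machine Learning compute, which is why the import fails there. A minimal guard, with placeholder paths:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    try:
        # Available on Databricks clusters and via Databricks Connect only.
        from pyspark.dbutils import DBUtils
        dbutils = DBUtils(spark)
        dbutils.fs.cp("file:/tmp/example.csv", "dbfs:/tmp/example.csv")  # placeholder paths
    except ImportError:
        # Plain pyspark: fall back to another mechanism (Azure Storage SDK, DBFS REST API, ...).
        print("dbutils is not available in this environment")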

Databricks SQL Server connection across multiple notebooks

不羁岁月 submitted on 2020-06-17 09:45:14
Question: I found some resources on how to pass variables across pySpark Databricks notebooks. I'm curious whether we can pass a SQL Server connection, i.e. keep the host/database/port/user/password in Notebook A and use that connection from Notebook B.
Answer 1: Take a look at this part of the Databricks documentation: https://docs.databricks.com/notebooks/notebook-workflows.html#pass-structured-data. This way you can pass one or more strings across notebooks, but you will have to create the connection itself in Notebook B.
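
Not part of the answer, just a minimal sketch of the notebook-workflow pattern the linked page describes; the notebook path, widget names and connection values are placeholders, and dbutils/spark are the Databricks notebook built-ins:

    # Notebook A: call Notebook B and hand the connection details over as strings.
    # (A password is better passed via dbutils.secrets than as a plain argument.)
    dbutils.notebook.run(
        "NotebookB",    # placeholder notebook path
        600,            # timeout in seconds
        {"host": "myserver.database.windows.net",
         "port": "1433",
         "database": "mydb",
         "user": "etl_user"},
    )

    # Notebook B: read the arguments and rebuild the JDBC connection locally.
    host = dbutils.widgets.get("host")
    port = dbutils.widgets.get("port")
    database = dbutils.widgets.get("database")
    user = dbutils.widgets.get("user")
    password = dbutils.secrets.get(scope="sql", key="etl_password")  # placeholder secret
    jdbc_url = f"jdbc:sqlserver://{host}:{port};database={database}"

    df = (spark.read.format("jdbc")
          .option("url", jdbc_url)
          .option("dbtable", "dbo.some_table")  # placeholder table
          .option("user", user)
          .option("password", password)
          .load())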

pyspark sql Add different Qtr start_date, End_date for exploded rows

流过昼夜 submitted on 2020-06-17 09:41:51
Question: I have a dataframe with start_date, end_date and sales_target columns. I have added code to identify the number of quarters in the date range and, using a UDF, I am able to split the sales_target across those quarters accordingly.

    df = sqlContext.createDataFrame(
        [("2020-01-01", "2020-12-31", "15"),
         ("2020-04-01", "2020-12-31", "11"),
         ("2020-07-01", "2020-12-31", "3")],
        ["start_date", "end_date", "sales_target"])

    +----------+----------+------------+
    |start_date|end_date  |sales_target|
    +----------+----------+------------+
    |2020-01-01|2020-12-31|15          |
    |2020-04-01|2020-12-31|11          |
    |2020-07-01|2020-12-31|3           |
    +----------+----------+------------+
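
Not from the post, but a minimal sketch of one way to explode a date range into quarters without a UDF, using sequence() and explode() (Spark 2.4+); all column names beyond the three above are made up:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("2020-01-01", "2020-12-31", "15"),
         ("2020-04-01", "2020-12-31", "11"),
         ("2020-07-01", "2020-12-31", "3")],
        ["start_date", "end_date", "sales_target"])

    exploded = (df
        # number of quarters covered by the range
        .withColumn("noq", F.expr(
            "cast(floor(months_between(to_date(end_date), to_date(start_date)) / 3) + 1 as int)"))
        # one row per quarter start date
        .withColumn("qtr_start", F.explode(F.expr(
            "sequence(to_date(start_date), to_date(end_date), interval 3 months)")))
        # last day of that quarter
        .withColumn("qtr_end", F.expr("date_sub(add_months(qtr_start, 3), 1)"))
        # split the target evenly across the quarters
        .withColumn("qtr_target", F.col("sales_target").cast("double") / F.col("noq")))

    exploded.show(truncate=False)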

How to explode an array without duplicate records

為{幸葍}努か submitted on 2020-06-17 09:38:06
Question: This is a continuation of the question in pyspark sql Add different Qtr start_date, End_date for exploded rows. Thanks. I have the following dataframe, which has an array list as a column:

    +---------------+------------+----------+----------+---+---------+--------+----------+
    |customer_number|sales_target|start_date|end_date  |noq|cf_values|new_sdt |new_edate |
    +---------------+------------+----------+----------+---+---------+--------+----------+
    |A011021        |15          |2020-01-01|2020-12-31|4  |[4,4…
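
Not taken from the thread: when several parallel arrays have to be exploded together, a common pattern is arrays_zip() plus a single explode(), so each element keeps its matching quarter instead of producing a cartesian product of duplicate rows. The dataframe below is a hypothetical stand-in for the one shown above (its array contents are made up):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical stand-in: cf_values, new_sdt and new_edate are parallel arrays.
    df = spark.createDataFrame(
        [("A011021", "15",
          [4.0, 4.0, 4.0, 3.0],
          ["2020-01-01", "2020-04-01", "2020-07-01", "2020-10-01"],
          ["2020-03-31", "2020-06-30", "2020-09-30", "2020-12-31"])],
        ["customer_number", "sales_target", "cf_values", "new_sdt", "new_edate"])

    result = (df
        .withColumn("zipped", F.explode(F.arrays_zip("cf_values", "new_sdt", "new_edate")))
        .select(
            "customer_number",
            "sales_target",
            F.col("zipped.cf_values").alias("qtr_target"),
            F.col("zipped.new_sdt").alias("qtr_start"),
            F.col("zipped.new_edate").alias("qtr_end"),
        ))

    result.show(truncate=False)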

PySpark Kafka py4j.protocol.Py4JJavaError: An error occurred while calling o28.load

心已入冬 submitted on 2020-06-17 03:36:09
Question: While converting Kafka messages to a dataframe, I am getting an error when passing the packages as an argument:

    from pyspark.sql import SparkSession, Row
    from pyspark.context import SparkContext
    from kafka import KafkaConsumer
    import os

    os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars spark-sql-kafka-0-10_2.11-2.0.2.jar,spark-streaming-kafka-0-8-assembly_2.11-2.3.1.jar pyspark-shell'

    sc = SparkContext.getOrCreate()
    spark = SparkSession(sc)
    df = spark \
        .read \
        .format("kafka") \
        .option("kafka.bootstrap…
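
Not from the post: errors like this from o28.load are often a jar/Spark/Scala version mismatch, or PYSPARK_SUBMIT_ARGS being set after a SparkContext already exists in the process. A hedged sketch of the usual setup, with a placeholder broker, topic and package version that would have to match the installed Spark:

    import os

    # Must run before the first SparkContext/SparkSession is created.
    # The coordinates below are a placeholder for Spark 2.4.5 / Scala 2.11;
    # they need to match the Spark and Scala versions actually installed.
    os.environ["PYSPARK_SUBMIT_ARGS"] = (
        "--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5 pyspark-shell"
    )

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = (spark.read
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
          .option("subscribe", "my_topic")                      # placeholder topic
          .option("startingOffsets", "earliest")
          .load())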

Renaming spark output csv in azure blob storage

主宰稳场 submitted on 2020-06-17 02:56:48
Question: I have a Databricks notebook set up that works as follows:

- pyspark connection details to the Blob storage account
- read the file into a Spark dataframe
- convert to a pandas DataFrame
- data modelling on the pandas DataFrame
- convert back to a Spark dataframe
- write to Blob storage as a single file

My problem is that you cannot name the output file, and I need a static csv filename. Is there a way to rename it in pyspark?

    ## Blob Storage account information
    storage_account_name = ""
    storage_account_access_key = ""

    ## File…
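
No answer is included above; what follows is only a sketch of a common workaround: write a single part file with coalesce(1), then find it and copy it to a static name with dbutils.fs. The container, account and file names are placeholders, spark_df stands in for the modelled dataframe from the last step, and spark/dbutils are the Databricks notebook built-ins:

    output_dir = "wasbs://container@account.blob.core.windows.net/tmp_out"      # placeholder
    final_path = "wasbs://container@account.blob.core.windows.net/report.csv"   # placeholder

    spark_df = spark.createDataFrame([(1, "a")], ["id", "val"])  # stand-in for the real dataframe

    (spark_df.coalesce(1)                  # force a single part file
        .write.mode("overwrite")
        .option("header", "true")
        .csv(output_dir))

    # Spark names the file part-00000-...; locate it and copy it to the fixed name.
    part_file = [f.path for f in dbutils.fs.ls(output_dir) if f.name.startswith("part-")][0]
    dbutils.fs.cp(part_file, final_path)
    dbutils.fs.rm(output_dir, True)        # clean up the temporary directory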

convert parquet to json for dynamodb import

99封情书 submitted on 2020-06-17 00:09:28
Question: I am using AWS Glue jobs to back up DynamoDB tables to S3 in parquet format so that I can use them in Athena. If I want to use these parquet files in S3 to restore the table in DynamoDB, this is what I am thinking: read each parquet file, convert it to JSON, and then insert the JSON-formatted data into DynamoDB (using pyspark along the lines below).

    # set sql context
    parquetFile = sqlContext.read.parquet(input_file)
    parquetFile.write.json(output_path)

Convert normal json to…
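
Not from the post, but a hedged sketch of one way to push the parquet rows straight back into DynamoDB with boto3, skipping the intermediate JSON file; the bucket, path and table name are placeholders, and DynamoDB needs numbers as Decimal rather than float:

    import boto3
    from decimal import Decimal
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # collect() is only reasonable for small tables; large restores should
    # write per partition (foreachPartition) instead.
    rows = spark.read.parquet("s3://my-bucket/backup/table/").collect()  # placeholder path

    table = boto3.resource("dynamodb").Table("my_table")  # placeholder table name
    with table.batch_writer() as batch:
        for row in rows:
            item = {k: (Decimal(str(v)) if isinstance(v, float) else v)
                    for k, v in row.asDict(recursive=True).items()}
            batch.put_item(Item=item)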

Spark Read Json: how to read field that alternates between integer and struct

試著忘記壹切 submitted on 2020-06-16 17:27:37
Question: I am trying to read multiple json files into a dataframe. Both files have a "Value" node, but the type of this node alternates between integer and struct.

File 1:

    { "Value": 123 }

File 2:

    { "Value": { "Value": "On", "ValueType": "State", "IsSystemValue": true } }

My goal is to read the files into a dataframe like this:

    |------|-------|-----------|---------------|
    | File | Value | ValueType | IsSystemValue |
    |------|-------|-----------|---------------|
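
Not an answer from the thread, only a hedged sketch of one approach: read each file as raw text, pull the "Value" node out with get_json_object, and try to parse it as a struct, falling back to the scalar when it is not one. The input path is a placeholder:

    from pyspark.sql import SparkSession, functions as F, types as T

    spark = SparkSession.builder.getOrCreate()

    # One whole JSON document per row, plus the source file name.
    raw = (spark.read.text("path/to/*.json", wholetext=True)   # placeholder path
           .withColumn("File", F.input_file_name()))

    struct_schema = T.StructType([
        T.StructField("Value", T.StringType()),
        T.StructField("ValueType", T.StringType()),
        T.StructField("IsSystemValue", T.BooleanType()),
    ])

    parsed = (raw
        # "$.Value" is either a scalar like 123 or a nested JSON object string.
        .withColumn("raw_value", F.get_json_object("value", "$.Value"))
        .withColumn("as_struct", F.from_json("raw_value", struct_schema))
        .select(
            "File",
            F.coalesce(F.col("as_struct.Value"), F.col("raw_value")).alias("Value"),
            F.col("as_struct.ValueType").alias("ValueType"),
            F.col("as_struct.IsSystemValue").alias("IsSystemValue"),
        ))

    parsed.show(truncate=False)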
