databricks

Pyspark Data Frame: Access to a Column

别来无恙 submitted on 2020-05-09 15:12:42
Question: I hope every one of you is OK and that Covid-19 is not affecting your life too much. I am struggling with some PySpark code; in particular, I'd like to call a function on an object col which is not iterable.

from pyspark.sql.functions import col, lower, regexp_replace, split
from googletrans import Translator

def clean_text(c):
    c = lower(c)
    c = regexp_replace(c, r"^rt ", "")
    c = regexp_replace(c, r"(https?\://)\S+", "")
    c = regexp_replace(c, "[^a-zA-Z0-9\\s]", "")  # removePunctuation
    c = regexp
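A possible way to read this, offered as a sketch rather than the poster's intended solution: the functions imported from pyspark.sql.functions (lower, regexp_replace, ...) already accept and return Column objects, so clean_text itself can be applied with withColumn; it is only a plain Python library such as googletrans that cannot take a Column and would have to be wrapped in a UDF. Assuming a DataFrame df with a string column named text (both names hypothetical):

from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType
from googletrans import Translator

# Column expressions compose lazily, so the cleaning function can be applied directly.
df_clean = df.withColumn("text_clean", clean_text(col("text")))

# googletrans works on ordinary Python strings, so it must be wrapped in a UDF
# before it can be applied to a Column (one Translator call per row; slow, but illustrative).
@udf(StringType())
def translate_to_en(s):
    return Translator().translate(s, dest="en").text if s else s

df_en = df_clean.withColumn("text_en", translate_to_en(col("text_clean")))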

Azure Databricks: How to add Spark configuration in Databricks cluster

余生颓废 submitted on 2020-05-09 07:31:43
Question: I am using a Databricks Spark cluster and want to add a customized Spark configuration. There is Databricks documentation on this, but I am not getting any clue about how and what changes I should make. Can someone please share an example of how to configure a Databricks cluster? Is there any way to see the default Spark configuration in a Databricks cluster?

Answer 1: To fine-tune Spark jobs, you can provide custom Spark configuration properties in a cluster configuration. On the cluster configuration
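Not part of the answer above, but a small sketch of how to inspect what the cluster is currently using from a notebook; properties entered in the cluster's Spark config field show up in the same place (the property name below is just an example):

# List every Spark property the cluster was started with.
for key, value in sorted(spark.sparkContext.getConf().getAll()):
    print(key, "=", value)

# Read a single property, with a fallback if it is not explicitly set.
print(spark.conf.get("spark.sql.shuffle.partitions", "not set"))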

databricks configure using cmd and R

[亡魂溺海] submitted on 2020-05-07 09:17:09
Question: I am trying to use the databricks CLI and invoke databricks configure. This is how I do it from cmd:

somepath>databricks configure --token
Databricks Host (should begin with https://): my_https_address
Token: my_token

I want to invoke the same command using R, so I did:

tool.control <- c('databricks configure --token'
                 ,'my_https_address'
                 ,'my_token')
shell(tool.control)

I get the following error:

Error in system(command, as.integer(flag), f, stdout, stderr, timeout) :
  character string expected as
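Not a fix for the R call itself, but a hedged sketch of the underlying idea in Python: databricks configure --token reads the host and the token from standard input, so the two answers can be piped into a single invocation rather than issued as three separate commands (which is effectively what passing a character vector to shell() attempts). In R, the analogous route would be the input argument of system2.

import subprocess

# Answer the interactive prompts of `databricks configure --token` via stdin.
# my_https_address and my_token are placeholders, exactly as in the question.
subprocess.run(
    ["databricks", "configure", "--token"],
    input="my_https_address\nmy_token\n",
    text=True,
    check=True,
)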

Spark dataframe to numpy array via udf or without collecting to driver

旧街凉风 submitted on 2020-04-30 09:48:46
Question: The real-life df is a massive dataframe that cannot be loaded into driver memory. Can this be done using a regular or pandas UDF?

# Code to generate a sample dataframe
from pyspark.sql import functions as F
from pyspark.sql.types import *
import pandas as pd
import numpy as np

sample = [['123', [[0,1,0,0,0,1,1,1,1,1,1,0,1,0,0,0,1,1,1,1,1,1],
                   [0,1,0,0,0,1,1,1,1,1,1,0,1,0,0,0,1,1,1,1,1,1]]],
          ['345', [[1,0,0,0,0,1,1,1,0,1,1,0,1,0,0,0,1,1,1,1,1,1],
                   [0,1,0,0,0,1,1,1,1,1,1,0,1,0,0,0,1,1,1,1,1,1]]],
          ['425',
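A minimal sketch of the pandas UDF route, assuming Spark 3.x-style type-hinted pandas UDFs and assuming the goal is to run NumPy over each row's nested list on the executors rather than on the driver; the column names and the element-wise mean are made up for illustration:

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, DoubleType
import pandas as pd
import numpy as np

# Runs on the executors in Arrow batches, so nothing is collected to the driver.
@F.pandas_udf(ArrayType(DoubleType()))
def row_mean(features: pd.Series) -> pd.Series:
    # Each element is a list of lists; average the inner lists element-wise with NumPy.
    return features.apply(lambda x: np.asarray(x, dtype=float).mean(axis=0).tolist())

# df is the sample dataframe from the question, e.g. with columns ("id", "features").
result = df.withColumn("row_mean", row_mean(F.col("features")))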

Azure Databricks cluster init script - install python wheel

微笑、不失礼 submitted on 2020-04-18 04:00:52
Question: I have a Python script that mounts a storage account in Databricks and then installs a wheel from the storage account. I am trying to run it as a cluster init script, but it keeps failing. My script is of the form:

#/databricks/python/bin/python
mount_point = "/mnt/...."
configs = {....}
source = "...."
if not any(mount.mountPoint == mount_point for mount in dbutils.fs.mounts()):
    dbutils.fs.mount(source = source, mount_point = mount_point, extra_configs = configs)
dbutils.library.install("dbfs
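One likely cause, stated as an assumption: dbutils (and therefore dbutils.fs.mount and dbutils.library.install) is only available in notebooks and jobs, not inside a cluster init script, and the shebang as pasted is also missing its "!" (#/databricks/python/bin/python). A hedged sketch of an init script that avoids dbutils by calling pip against the local /dbfs path of an already-mounted location (the wheel path is hypothetical; the mount would have to be created beforehand, e.g. from a notebook):

#!/databricks/python/bin/python
import subprocess

# Mounted storage is visible to local processes on the node under /dbfs/mnt/...
wheel = "/dbfs/mnt/my-mount/my_package-1.0-py3-none-any.whl"  # hypothetical path
subprocess.check_call(["/databricks/python/bin/pip", "install", wheel])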

Azure Databricks cluster init script - Install wheel from mounted storage

吃可爱长大的小学妹 submitted on 2020-04-18 04:00:51
Question: I have a Python wheel uploaded to an Azure storage account that is mounted in a Databricks service. I'm trying to install the wheel using a cluster init script, as described in the Databricks documentation. My storage is definitely mounted and my file path looks correct to me. Running the command display(dbutils.fs.ls("/mnt/package-source")) in a notebook yields the result:

path: dbfs:/mnt/package-source/parser-3.0-py3-none-any.whl
name: parser-3.0-py3-none-any.whl

I have tried to install the
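The question is cut off before the init script itself, so the following is an assumed pattern rather than the poster's actual script: point pip at the /dbfs/... form of the mounted path, since pip does not understand dbfs:/ URIs. The wheel path is the one shown by dbutils.fs.ls above.

#!/databricks/python/bin/python
import subprocess

# dbfs:/mnt/package-source/... appears to local processes as /dbfs/mnt/package-source/...
subprocess.check_call([
    "/databricks/python/bin/pip", "install",
    "/dbfs/mnt/package-source/parser-3.0-py3-none-any.whl",
])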
