pyspark

Regular expression to find a specific character in a string

白昼怎懂夜的黑 submitted on 2020-08-08 05:13:19
Question: I have these sample values: prm_2020 P02 United Kingdom London 2 for 2, prm_2020 P2 United Kingdom London 2 for 2, prm_2020 P10 United Kingdom London 2 for 2, prm_2020 P11 United Kingdom London 2 for 2. I need to find codes like P2, P02, P11, p06, p05. I am trying to use the regexp_extract function in Databricks but am struggling to find the correct expression. Once I find P10, p6 in the string, I need to put the numbers in a new column called ID. select distinct promo_name ,regexp_extract(promo_name, '(?<=p\d+\s+)P\d+') as
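The snippet above is cut off. As a hedged sketch (not the asker's final query), the PySpark code below shows one way to pull out both the full P-code and just its digits with regexp_extract; the pattern [Pp](\d+) and the output column names p_code and ID are assumptions based on the sample values shown.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("prm_2020 P02 United Kingdom London 2 for 2",),
     ("prm_2020 P2 United Kingdom London 2 for 2",),
     ("prm_2020 P10 United Kingdom London 2 for 2",),
     ("prm_2020 P11 United Kingdom London 2 for 2",)],
    ["promo_name"],
)

result = df.select(
    "promo_name",
    # group 0 returns the whole match (e.g. 'P02'), group 1 only the digits
    F.regexp_extract("promo_name", r"\b[Pp](\d+)\b", 0).alias("p_code"),
    F.regexp_extract("promo_name", r"\b[Pp](\d+)\b", 1).alias("ID"),
)
result.show(truncate=False)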

Enable _metadata files in Spark 2.1.0

為{幸葍}努か submitted on 2020-08-07 08:17:09
Question: It seems that saving empty Parquet files is broken in Spark 2.1.0, as it is not possible to read them in again (due to faulty schema inference). I found that since Spark 2.0, writing the _metadata file is disabled by default when writing Parquet files, but I cannot find the configuration setting to turn this back on. I tried the following: spark_session = SparkSession.builder \ .master(url) \ .appName(name) \ .config('spark.hadoop.parquet.enable.summary-metadata', 'true') \ .getOrCreate() and
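The attempt above is cut off at "and". As a hedged sketch of the two places such a Hadoop-level Parquet option is usually set (whether Spark 2.1.0 then actually writes the _metadata summary files is exactly what the question is asking), with placeholder master and app names:

from pyspark.sql import SparkSession

spark_session = (SparkSession.builder
                 .master("local[*]")        # placeholder for the asker's url
                 .appName("metadata-test")  # placeholder for the asker's name
                 # spark.hadoop.* entries are copied into the Hadoop configuration
                 .config("spark.hadoop.parquet.enable.summary-metadata", "true")
                 .getOrCreate())

# Alternative: set the same option directly on the underlying Hadoop configuration
spark_session.sparkContext._jsc.hadoopConfiguration().set(
    "parquet.enable.summary-metadata", "true")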

Facing classnotfound exception while reading a snowflake table using spark

空扰寡人 submitted on 2020-08-06 05:54:26
Question: I am trying to read a Snowflake table from spark-shell. To do that, I did the following: pyspark --jars spark-snowflake_2.11-2.8.0-spark_2.4.jar,jackson-dataformat-xml-2.10.3.jar Using Python version 2.7.5 (default, Feb 20 2018 09:19:12) SparkSession available as 'spark'. >>> from pyspark import SparkConf, SparkContext >>> from pyspark.sql import SQLContext >>> from pyspark.sql.types import * >>> from pyspark import SparkConf, SparkContext >>> sc = SparkContext("local", "Simple App") >>>
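The transcript above is cut off. A common cause of ClassNotFoundException with this connector is that only the spark-snowflake jar is on the classpath while the Snowflake JDBC driver jar is missing. The sketch below is an assumption about the asker's setup, with placeholder jar versions and connection options.

# Launch with both the connector and the JDBC driver (versions are placeholders):
#   pyspark --jars spark-snowflake_2.11-2.8.0-spark_2.4.jar,snowflake-jdbc-3.12.8.jar

sf_options = {
    "sfURL": "<account>.snowflakecomputing.com",   # placeholder connection details
    "sfUser": "<user>",
    "sfPassword": "<password>",
    "sfDatabase": "<database>",
    "sfSchema": "<schema>",
    "sfWarehouse": "<warehouse>",
}

df = (spark.read                                   # 'spark' is the session provided by the pyspark shell
      .format("net.snowflake.spark.snowflake")
      .options(**sf_options)
      .option("dbtable", "<table>")
      .load())
df.show()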

Spark is only using one worker machine when more are available

核能气质少年 submitted on 2020-08-03 15:28:23
Question: I'm trying to parallelize a machine learning prediction task via Spark. I've used Spark successfully a number of times before on other tasks and have faced no parallelization issues. In this particular task, my cluster has 4 workers. I'm calling mapPartitions on an RDD with 4 partitions. The map function loads a model from disk (a bootstrap script distributes everything needed to do this; I've verified it exists on each slave machine) and performs prediction on data points in the
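The description above is cut off. As a hedged, self-contained sketch of the usual first check in this situation (confirming that the RDD really has more than one non-empty partition before mapPartitions runs), with a dummy "model" standing in for the one loaded from disk:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

data = list(range(1000))
rdd = sc.parallelize(data, numSlices=4)      # ask for 4 partitions explicitly
print(rdd.getNumPartitions())                # should print 4

# glom() shows how many elements landed in each partition, which makes an
# empty-partition or skew problem visible immediately.
print([len(p) for p in rdd.glom().collect()])

def predict_partition(rows):
    # In the real job, this is where the model would be loaded from local disk,
    # once per partition, before iterating over the rows.
    return (x * 2 for x in rows)             # dummy "prediction"

predictions = rdd.mapPartitions(predict_partition).collect()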

Dynamically rename multiple columns in PySpark DataFrame

夙愿已清 submitted on 2020-07-31 09:00:39
Question: I have a DataFrame in PySpark which has 15 columns. The column names are id, name, emp.dno, emp.sal, state, emp.city, zip, ..... Now I want to replace the '.' in the column names that contain it with '_', e.g. 'emp.dno' to 'emp_dno', and I would like to do it dynamically. How can I achieve that in PySpark? Answer 1: You can use something similar to this great solution from @zero323: df.toDF(*(c.replace('.', '_') for c in df.columns)) Alternatively: from pyspark.sql.functions import col replacements = {c:c
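The answer is cut off at "replacements = {c:c". A hedged completion of the idea on a toy DataFrame whose column names mirror the question; the truncated dictionary-based variant is replaced here by a select/alias variant, which is an assumption about where the answer was going.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "a", 10, 50000.0)],
    ["id", "name", "emp.dno", "emp.sal"])

# Variant 1 (from the answer): rename every column in one shot with toDF()
renamed = df.toDF(*(c.replace(".", "_") for c in df.columns))

# Variant 2: select with aliases; back-ticks stop Spark from reading the dot
# as a struct-field accessor
renamed2 = df.select(
    [col("`{}`".format(c)).alias(c.replace(".", "_")) for c in df.columns])

renamed.printSchema()   # id, name, emp_dno, emp_sal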

Syntax while setting schema for Pyspark.sql using StructType

无人久伴 submitted on 2020-07-31 07:09:47
Question: I am new to Spark and was playing around with pyspark.sql. According to the pyspark.sql documentation here, one can go about setting the Spark dataframe and schema like this: spark = SparkSession.builder.getOrCreate() from pyspark.sql.types import StringType, IntegerType, StructType, StructField rdd = sc.textFile('./some csv_to_play_around.csv' schema = StructType([StructField('Name', StringType(), True), StructField('DateTime', TimestampType(), True) StructField('Age', IntegerType(), True)])
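The snippet in the question is missing a closing parenthesis on textFile, a comma between two StructFields, and the TimestampType import, which is presumably what the syntax question is about. A hedged corrected sketch follows; the CSV path is kept as the asker's placeholder.

from pyspark.sql import SparkSession
from pyspark.sql.types import (StringType, IntegerType, TimestampType,
                               StructType, StructField)

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd = sc.textFile('./some csv_to_play_around.csv')   # note the closing ')'

schema = StructType([
    StructField('Name', StringType(), True),
    StructField('DateTime', TimestampType(), True),  # comma was missing here
    StructField('Age', IntegerType(), True),
])

# For a CSV file, spark.read is usually more direct than going through an RDD:
df = spark.read.csv('./some csv_to_play_around.csv', schema=schema, header=True)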

How to return rows with Null values in pyspark dataframe?

£可爱£侵袭症+ submitted on 2020-07-30 06:11:06
Question: I am trying to get the rows with null values from a PySpark DataFrame. In pandas, I can achieve this using isnull() on the dataframe: df = df[df.isnull().any(axis=1)] But in PySpark, when I run the command below it raises an AttributeError: df.filter(df.isNull()) AttributeError: 'DataFrame' object has no attribute 'isNull'. How can I get the rows with null values without checking each column individually? Answer 1: You can filter the rows with where, reduce and a list comprehension. For example,
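The answer is cut off at "For example,". A hedged completion of the where/reduce/list-comprehension idea it names, on a toy DataFrame:

from functools import reduce
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "a"), (2, None), (None, "c")],
    ["x", "y"])

# Keep a row if ANY of its columns is null, by OR-ing per-column conditions
rows_with_nulls = df.where(
    reduce(lambda a, b: a | b, [col(c).isNull() for c in df.columns]))
rows_with_nulls.show()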