pyspark

Optimal way of creating a cache in the PySpark environment

Submitted by 拥有回忆 on 2021-02-08 05:32:11
Question: I am using Spark Streaming to build a system that enriches incoming data from a Cloudant database. Example:

    Incoming message:  {"id": 123}
    Outgoing message:  {"id": 123, "data": "xxxxxxxxxxxxxxxxxxx"}

My code for the driver class is as follows:

    from Sample.Job import EnrichmentJob
    from Sample.Job import FunctionJob
    import pyspark
    from pyspark.streaming.kafka import KafkaUtils
    from pyspark import SparkContext, SparkConf, SQLContext
    from pyspark.streaming import StreamingContext
    from pyspark …
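The excerpt is cut off before the author's actual approach is shown. Purely as a hedged sketch (not the author's code), one common pattern for this kind of stream enrichment is to load the reference data once, cache it, and join every micro-batch against the in-memory copy; the file name, socket source, and function names below are placeholders:

    from pyspark.sql import SparkSession
    from pyspark.streaming import StreamingContext

    spark = SparkSession.builder.appName("enrichment-sketch").getOrCreate()
    sc = spark.sparkContext
    ssc = StreamingContext(sc, 5)

    # Hypothetical reference data pulled once from the external store and cached,
    # so each micro-batch joins against memory instead of re-reading the source.
    reference_df = spark.read.json("reference_data.json").cache()

    def enrich_batch(rdd):
        if rdd.isEmpty():
            return
        incoming = spark.read.json(rdd)            # each record like {"id": 123}
        enriched = incoming.join(reference_df, on="id", how="left")
        enriched.show()

    # Stand-in socket source; the question uses KafkaUtils instead.
    stream = ssc.socketTextStream("localhost", 9999)
    stream.foreachRDD(enrich_batch)

    ssc.start()
    ssc.awaitTermination()

If the reference data changes over time, the cached copy has to be refreshed explicitly; a plain cache() never re-reads the source.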

Faster Kmeans Clustering on High-dimensional Data with GPU Support

Submitted by 落爺英雄遲暮 on 2021-02-08 05:16:37
Question: We've been using k-means for clustering our logs. A typical dataset has 10 million samples with 100k+ features. To find the optimal k, we run multiple k-means fits in parallel and pick the one with the best silhouette score. In 90% of cases we end up with k between 2 and 100. Currently we use scikit-learn's KMeans. For such a dataset, clustering takes around 24 hours on an EC2 instance with 32 cores and 244 GB of RAM. I am currently looking for a faster solution. What I have already tested: …
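The excerpt ends before the list of tested alternatives. GPU routes that usually come up for this problem are faiss and RAPIDS cuML, but as a hedged, CPU-only illustration the sketch below uses scikit-learn's MiniBatchKMeans, which fits on small random batches instead of the full matrix; the data shape, batch size, and k range are placeholder values, not the question's:

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans
    from sklearn.metrics import silhouette_score

    # Placeholder feature matrix standing in for the log data.
    X = np.random.rand(50_000, 500).astype(np.float32)

    best_k, best_score = None, -1.0
    for k in range(2, 11):  # the question searches k in roughly [2, 100]
        model = MiniBatchKMeans(n_clusters=k, batch_size=10_000, n_init=3)
        labels = model.fit_predict(X)
        # Score on a subsample so the silhouette computation does not dominate runtime.
        score = silhouette_score(X, labels, sample_size=10_000)
        if score > best_score:
            best_k, best_score = k, score

    print(best_k, best_score)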

Why does pyspark fail with “Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder'”?

Submitted by 时光总嘲笑我的痴心妄想 on 2021-02-08 04:31:09
Question: For the life of me I cannot figure out what is wrong with my PySpark install. I have installed all dependencies, including Hadoop, but PySpark can't find it. Am I diagnosing this correctly? See the full error message below; it ultimately fails on PySpark SQL:

    pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':"

    nickeleres@Nicks-MBP:~$ pyspark
    Python 2.7.10 (default, Feb 7 2017, 00:08:15)
    [GCC 4.2.1 Compatible Apple …
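The traceback is cut off here. As a hedged observation only, this particular error is frequently reported when Spark cannot initialize its Hive session state (for example an unwritable /tmp/hive directory or a stale Derby metastore lock), not when Hadoop itself is missing. One quick way to check whether Hive support is the culprit is to start a session with the in-memory catalog instead:

    from pyspark.sql import SparkSession

    # Build a session that uses Spark's in-memory catalog instead of Hive.
    # If this starts cleanly, the failure is in the Hive session state,
    # not in the core PySpark install.
    spark = (SparkSession.builder
             .appName("hive-free-check")
             .config("spark.sql.catalogImplementation", "in-memory")
             .getOrCreate())

    spark.range(5).show()

On many setups the reported fix is simply making /tmp/hive writable for the current user, but that depends on the local environment.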

Remove null values and shift values from the next column in pyspark

Submitted by ☆樱花仙子☆ on 2021-02-08 03:59:53
Question: I need to port a Python script to PySpark and it is proving a tough task for me. I'm trying to remove null values from a dataframe (without removing the entire column or row) and shift the next value into the prior column. Example:

           CLIENT | ANIMAL_1 | ANIMAL_2 | ANIMAL_3 | ANIMAL_4
    ROW_1  1      | cow      | frog     | null     | dog
    ROW_2  2      | pig      | null     | cat      | null

My goal is to have:

           CLIENT | ANIMAL_1 | ANIMAL_2 | ANIMAL_3 | ANIMAL_4
    ROW_1  1      | cow      | frog     | dog      | null
    ROW_2  2      | pig      | cat      | null     | null

The code I …
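The excerpt cuts off before the author's attempt. As a hedged sketch only (it assumes Spark 2.4+ for the filter higher-order function and uses the column names from the example above), one way to compact the non-null values to the left is to collect the animal columns into an array, drop the nulls, and re-expand the array back by position; out-of-range getItem indexes come back as null:

    from pyspark.sql import functions as F

    animal_cols = ["ANIMAL_1", "ANIMAL_2", "ANIMAL_3", "ANIMAL_4"]

    # Pack the animal columns into an array, drop the nulls (Spark 2.4+ `filter`
    # higher-order function), then unpack the compacted array by position.
    compacted = df.withColumn(
        "animals",
        F.expr("filter(array({}), x -> x is not null)".format(", ".join(animal_cols)))
    )

    result = compacted.select(
        "CLIENT",
        *[F.col("animals").getItem(i).alias(name) for i, name in enumerate(animal_cols)]
    )
    result.show()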

Can't connect to Azure Data Lake Gen2 using PySpark and Databricks Connect

Submitted by 折月煮酒 on 2021-02-08 03:59:29
Question: Recently, Databricks launched Databricks Connect, which allows you to write jobs using native Spark APIs and have them execute remotely on an Azure Databricks cluster instead of in the local Spark session. It works fine except when I try to access files in Azure Data Lake Storage Gen2. When I execute this:

    spark.read.json("abfss://...").count()

I get this error:

    java.lang.RuntimeException: java.lang.ClassNotFoundException:
    Class shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs …
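The error is truncated at the missing shaded ABFS class. Purely as a hedged illustration of one commonly suggested workaround (not necessarily the accepted answer): mount the ADLS Gen2 container to DBFS from a notebook running on the cluster, then read through the mount point from Databricks Connect, so the ABFS filesystem class only has to be resolved on the remote cluster. The storage account, container, secret scope, and mount names below are all placeholders:

    # Run once in a notebook on the Azure Databricks cluster, where dbutils exists.
    dbutils.fs.mount(
        source="abfss://my-container@mystorageaccount.dfs.core.windows.net/",
        mount_point="/mnt/datalake",
        extra_configs={
            "fs.azure.account.key.mystorageaccount.dfs.core.windows.net":
                dbutils.secrets.get(scope="my-scope", key="storage-key")
        },
    )

    # Then, from the Databricks Connect session on the local machine:
    df = spark.read.json("dbfs:/mnt/datalake/path/to/data")
    print(df.count())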

Reading partition columns without partition column names

Submitted by 半城伤御伤魂 on 2021-02-08 03:36:05
Question: We have data stored in S3, partitioned with the following structure:

    bucket/directory/table/aaaa/bb/cc/dd/

where aaaa is the year, bb is the month, cc is the day and dd is the hour. As you can see, there are no partition keys in the path (year=aaaa, month=bb, day=cc, hour=dd). As a result, when I read the table into Spark, there are no year, month, day or hour columns. Is there any way I can read the table into Spark and include the partition columns without: changing the path names in S3 …
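The list of constraints is cut off. As a hedged sketch of one common approach (assuming Parquet files and the path layout quoted above; the bucket and prefix are placeholders), the partition values can be recovered from the file path itself with input_file_name() and regexp_extract(), without renaming anything in S3:

    from pyspark.sql import functions as F

    df = spark.read.parquet("s3://bucket/directory/table/*/*/*/*/")

    # input_file_name() exposes the full object key of each row's source file,
    # so the date parts can be parsed back out of the path.
    path_re = r"/table/(\d{4})/(\d{2})/(\d{2})/(\d{2})/"
    df = (df
          .withColumn("year",  F.regexp_extract(F.input_file_name(), path_re, 1))
          .withColumn("month", F.regexp_extract(F.input_file_name(), path_re, 2))
          .withColumn("day",   F.regexp_extract(F.input_file_name(), path_re, 3))
          .withColumn("hour",  F.regexp_extract(F.input_file_name(), path_re, 4)))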

java.lang.IllegalArgumentException when applying a Python UDF to a Spark dataframe

Submitted by 时光毁灭记忆、已成空白 on 2021-02-07 20:39:38
Question: I'm testing the example code provided in the documentation of pandas_udf (https://spark.apache.org/docs/2.3.1/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf), using PySpark 2.3.1 on my local machine:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    df = spark.createDataFrame(
        [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
        ("id", "v"))

    @pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
    def normalize(pdf):
        v = pdf.v …
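The excerpt ends inside the UDF body and the traceback is not shown, so this is only a hedged guess: a frequently reported cause of IllegalArgumentException with pandas_udf on Spark 2.3.x is a PyArrow version newer than the Arrow 0.15 IPC format change, which Spark 2.3/2.4 predate. The usual check is to inspect the installed version and either pin an older PyArrow or enable Spark's documented compatibility flag:

    # Check which Arrow version PySpark will pick up.
    import pyarrow
    print(pyarrow.__version__)

    # Spark 2.3.x is commonly reported to work with pyarrow < 0.15, e.g.:
    #   pip install "pyarrow==0.14.1"
    #
    # If a newer pyarrow must stay installed, the Spark docs describe setting
    # ARROW_PRE_0_15_IPC_FORMAT=1 in conf/spark-env.sh so that both the driver
    # and the executors use the legacy Arrow IPC format.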

Count including null in PySpark Dataframe Aggregation

Submitted by ╄→尐↘猪︶ㄣ on 2021-02-07 19:44:29
Question: I am trying to get some counts on a DataFrame using agg and count.

    from pyspark.sql import Row, functions as F

    row = Row("Cat", "Date")
    df = (sc.parallelize([
        row("A", '2017-03-03'),
        row('A', None),
        row('B', '2017-03-04'),
        row('B', 'Garbage'),
        row('A', '2016-03-04')
    ]).toDF())
    df = df.withColumn("Casted", df['Date'].cast('date'))
    df.show()

    (df.groupby(df['Cat'])
       .agg(
           # F.count(col('Date').isNull() | col('Date').isNotNull()).alias('Date_Count'),
           F.count('Date').alias('Date_Count'),
           F.count( …
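The aggregation is truncated mid-expression. As context for the title: F.count(column) skips nulls by design, so a null-inclusive total needs to count a literal (or use "*"). A hedged sketch against the example DataFrame above:

    from pyspark.sql import functions as F

    (df.groupby("Cat")
       .agg(
           F.count("Date").alias("Date_Count"),         # non-null Dates only
           F.count(F.lit(1)).alias("Total_Count"),      # every row, nulls included
           F.sum(F.col("Date").isNull().cast("int")).alias("Null_Count"),
       )
       .show())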