pyspark

Optimal way of creating a cache in the PySpark environment

Submitted by 拥有回忆 on 2021-02-08 05:32:11
Question: I am using Spark Streaming to build a system that enriches incoming data from a Cloudant database. Example:

    Incoming message:  {"id": 123}
    Outgoing message:  {"id": 123, "data": "xxxxxxxxxxxxxxxxxxx"}

My code for the driver class is as follows:

    from Sample.Job import EnrichmentJob
    from Sample.Job import FunctionJob
    import pyspark
    from pyspark.streaming.kafka import KafkaUtils
    from pyspark import SparkContext, SparkConf, SQLContext
    from pyspark.streaming import StreamingContext
    from pyspark …
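The excerpt is cut off before the author's actual approach is shown. Purely as a hedged sketch (not the author's code), one common pattern for this kind of stream enrichment is to load the reference data once, cache it, and join every micro-batch against the in-memory copy; the file name, socket source, and function names below are placeholders:

    from pyspark.sql import SparkSession
    from pyspark.streaming import StreamingContext

    spark = SparkSession.builder.appName("enrichment-sketch").getOrCreate()
    sc = spark.sparkContext
    ssc = StreamingContext(sc, 5)

    # Hypothetical reference data pulled once from the external store and cached,
    # so each micro-batch joins against memory instead of re-reading the source.
    reference_df = spark.read.json("reference_data.json").cache()

    def enrich_batch(rdd):
        if rdd.isEmpty():
            return
        incoming = spark.read.json(rdd)            # each record like {"id": 123}
        enriched = incoming.join(reference_df, on="id", how="left")
        enriched.show()

    # Stand-in socket source; the question uses KafkaUtils instead.
    stream = ssc.socketTextStream("localhost", 9999)
    stream.foreachRDD(enrich_batch)

    ssc.start()
    ssc.awaitTermination()

If the reference data changes over time, the cached copy has to be refreshed explicitly; a plain cache() never re-reads the source.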

Faster Kmeans Clustering on High-dimensional Data with GPU Support

Submitted by 落爺英雄遲暮 on 2021-02-08 05:16:37
Question: We've been using k-means for clustering our logs. A typical dataset has 10 million samples with 100k+ features. To find the optimal k, we run multiple k-means fits in parallel and pick the one with the best silhouette score. In 90% of cases we end up with k between 2 and 100. Currently we use scikit-learn's KMeans. For such a dataset, clustering takes around 24 hours on an EC2 instance with 32 cores and 244 GB of RAM. I am currently looking for a faster solution. What I have already tested: …
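The excerpt ends before the list of tested alternatives. GPU routes that usually come up for this problem are faiss and RAPIDS cuML, but as a hedged, CPU-only illustration the sketch below uses scikit-learn's MiniBatchKMeans, which fits on small random batches instead of the full matrix; the data shape, batch size, and k range are placeholder values, not the question's:

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans
    from sklearn.metrics import silhouette_score

    # Placeholder feature matrix standing in for the log data.
    X = np.random.rand(50_000, 500).astype(np.float32)

    best_k, best_score = None, -1.0
    for k in range(2, 11):  # the question searches k in roughly [2, 100]
        model = MiniBatchKMeans(n_clusters=k, batch_size=10_000, n_init=3)
        labels = model.fit_predict(X)
        # Score on a subsample so the silhouette computation does not dominate runtime.
        score = silhouette_score(X, labels, sample_size=10_000)
        if score > best_score:
            best_k, best_score = k, score

    print(best_k, best_score)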

Why does pyspark fail with “Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder'”?

Submitted by 时光总嘲笑我的痴心妄想 on 2021-02-08 04:31:09
Question: For the life of me I cannot figure out what is wrong with my PySpark install. I have installed all dependencies, including Hadoop, but PySpark can't find it. Am I diagnosing this correctly? See the full error message below; it ultimately fails on PySpark SQL:

    pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':"

    nickeleres@Nicks-MBP:~$ pyspark
    Python 2.7.10 (default, Feb 7 2017, 00:08:15)
    [GCC 4.2.1 Compatible Apple …
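The traceback is cut off here. As a hedged observation only, this particular error is frequently reported when Spark cannot initialize its Hive session state (for example an unwritable /tmp/hive directory or a stale Derby metastore lock), not when Hadoop itself is missing. One quick way to check whether Hive support is the culprit is to start a session with the in-memory catalog instead:

    from pyspark.sql import SparkSession

    # Build a session that uses Spark's in-memory catalog instead of Hive.
    # If this starts cleanly, the failure is in the Hive session state,
    # not in the core PySpark install.
    spark = (SparkSession.builder
             .appName("hive-free-check")
             .config("spark.sql.catalogImplementation", "in-memory")
             .getOrCreate())

    spark.range(5).show()

On many setups the reported fix is simply making /tmp/hive writable for the current user, but that depends on the local environment.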

Remove null values and shift values from the next column in pyspark

Submitted by ☆樱花仙子☆ on 2021-02-08 03:59:53
Question: I need to port a Python script to PySpark and it is proving a tough task for me. I'm trying to remove null values from a dataframe (without removing the entire column or row) and shift the next value into the prior column. Example:

           CLIENT | ANIMAL_1 | ANIMAL_2 | ANIMAL_3 | ANIMAL_4
    ROW_1  1      | cow      | frog     | null     | dog
    ROW_2  2      | pig      | null     | cat      | null

My goal is to have:

           CLIENT | ANIMAL_1 | ANIMAL_2 | ANIMAL_3 | ANIMAL_4
    ROW_1  1      | cow      | frog     | dog      | null
    ROW_2  2      | pig      | cat      | null     | null

The code I …
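The excerpt cuts off before the author's attempt. As a hedged sketch only (it assumes Spark 2.4+ for the filter higher-order function and uses the column names from the example above), one way to compact the non-null values to the left is to collect the animal columns into an array, drop the nulls, and re-expand the array back by position; out-of-range getItem indexes come back as null:

    from pyspark.sql import functions as F

    animal_cols = ["ANIMAL_1", "ANIMAL_2", "ANIMAL_3", "ANIMAL_4"]

    # Pack the animal columns into an array, drop the nulls (Spark 2.4+ `filter`
    # higher-order function), then unpack the compacted array by position.
    compacted = df.withColumn(
        "animals",
        F.expr("filter(array({}), x -> x is not null)".format(", ".join(animal_cols)))
    )

    result = compacted.select(
        "CLIENT",
        *[F.col("animals").getItem(i).alias(name) for i, name in enumerate(animal_cols)]
    )
    result.show()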

Can't connect to Azure Data Lake Gen2 using PySpark and Databricks Connect

Submitted by 折月煮酒 on 2021-02-08 03:59:29
Question: Recently, Databricks launched Databricks Connect, which allows you to write jobs using native Spark APIs and have them execute remotely on an Azure Databricks cluster instead of in the local Spark session. It works fine except when I try to access files in Azure Data Lake Storage Gen2. When I execute this:

    spark.read.json("abfss://...").count()

I get this error:

    java.lang.RuntimeException: java.lang.ClassNotFoundException:
    Class shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs …
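The error is truncated at the missing shaded ABFS class. Purely as a hedged illustration of one commonly suggested workaround (not necessarily the accepted answer): mount the ADLS Gen2 container to DBFS from a notebook running on the cluster, then read through the mount point from Databricks Connect, so the ABFS filesystem class only has to be resolved on the remote cluster. The storage account, container, secret scope, and mount names below are all placeholders:

    # Run once in a notebook on the Azure Databricks cluster, where dbutils exists.
    dbutils.fs.mount(
        source="abfss://my-container@mystorageaccount.dfs.core.windows.net/",
        mount_point="/mnt/datalake",
        extra_configs={
            "fs.azure.account.key.mystorageaccount.dfs.core.windows.net":
                dbutils.secrets.get(scope="my-scope", key="storage-key")
        },
    )

    # Then, from the Databricks Connect session on the local machine:
    df = spark.read.json("dbfs:/mnt/datalake/path/to/data")
    print(df.count())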

Reading partition columns without partition column names

Submitted by 半城伤御伤魂 on 2021-02-08 03:36:05
Question: We have data stored in S3, partitioned with the following structure:

    bucket/directory/table/aaaa/bb/cc/dd/

where aaaa is the year, bb is the month, cc is the day and dd is the hour. As you can see, there are no partition keys in the path (year=aaaa, month=bb, day=cc, hour=dd). As a result, when I read the table into Spark, there are no year, month, day or hour columns. Is there any way I can read the table into Spark and include the partition columns without: changing the path names in S3 …
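The list of constraints is cut off. As a hedged sketch of one common approach (assuming Parquet files and the path layout quoted above; the bucket and prefix are placeholders), the partition values can be recovered from the file path itself with input_file_name() and regexp_extract(), without renaming anything in S3:

    from pyspark.sql import functions as F

    df = spark.read.parquet("s3://bucket/directory/table/*/*/*/*/")

    # input_file_name() exposes the full object key of each row's source file,
    # so the date parts can be parsed back out of the path.
    path_re = r"/table/(\d{4})/(\d{2})/(\d{2})/(\d{2})/"
    df = (df
          .withColumn("year",  F.regexp_extract(F.input_file_name(), path_re, 1))
          .withColumn("month", F.regexp_extract(F.input_file_name(), path_re, 2))
          .withColumn("day",   F.regexp_extract(F.input_file_name(), path_re, 3))
          .withColumn("hour",  F.regexp_extract(F.input_file_name(), path_re, 4)))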

java.lang.IllegalArgumentException when applying a Python UDF to a Spark dataframe

Submitted by 时光毁灭记忆、已成空白 on 2021-02-07 20:39:38
Question: I'm testing the example code provided in the documentation of pandas_udf (https://spark.apache.org/docs/2.3.1/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf), using PySpark 2.3.1 on my local machine:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    df = spark.createDataFrame(
        [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
        ("id", "v"))

    @pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
    def normalize(pdf):
        v = pdf.v …
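The excerpt ends inside the UDF body and the traceback is not shown, so this is only a hedged guess: a frequently reported cause of IllegalArgumentException with pandas_udf on Spark 2.3.x is a PyArrow version newer than the Arrow 0.15 IPC format change, which Spark 2.3/2.4 predate. The usual check is to inspect the installed version and either pin an older PyArrow or enable Spark's documented compatibility flag:

    # Check which Arrow version PySpark will pick up.
    import pyarrow
    print(pyarrow.__version__)

    # Spark 2.3.x is commonly reported to work with pyarrow < 0.15, e.g.:
    #   pip install "pyarrow==0.14.1"
    #
    # If a newer pyarrow must stay installed, the Spark docs describe setting
    # ARROW_PRE_0_15_IPC_FORMAT=1 in conf/spark-env.sh so that both the driver
    # and the executors use the legacy Arrow IPC format.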

Count including null in PySpark Dataframe Aggregation

Submitted by ╄→尐↘猪︶ㄣ on 2021-02-07 19:44:29
Question: I am trying to get some counts on a DataFrame using agg and count.

    from pyspark.sql import Row, functions as F

    row = Row("Cat", "Date")
    df = (sc.parallelize([
        row("A", '2017-03-03'),
        row('A', None),
        row('B', '2017-03-04'),
        row('B', 'Garbage'),
        row('A', '2016-03-04')
    ]).toDF())
    df = df.withColumn("Casted", df['Date'].cast('date'))
    df.show()

    (df.groupby(df['Cat'])
       .agg(
           # F.count(col('Date').isNull() | col('Date').isNotNull()).alias('Date_Count'),
           F.count('Date').alias('Date_Count'),
           F.count( …
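The aggregation is truncated mid-expression. As context for the title: F.count(column) skips nulls by design, so a null-inclusive total needs to count a literal (or use "*"). A hedged sketch against the example DataFrame above:

    from pyspark.sql import functions as F

    (df.groupby("Cat")
       .agg(
           F.count("Date").alias("Date_Count"),         # non-null Dates only
           F.count(F.lit(1)).alias("Total_Count"),      # every row, nulls included
           F.sum(F.col("Date").isNull().cast("int")).alias("Null_Count"),
       )
       .show())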