pyspark

Unresolved dependency in spark-streaming-kafka-0-8_2.12;2.4.4

Submitted by 心已入冬 on 2020-01-24 21:50:40
Question: I use Spark 2.4.4 and I get an unresolved dependency error when I add the following package to spark-submit: spark-streaming-kafka-0-8_2.12;2.4.4. My submit command:

./bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.12:2.4.4

Answer 1: I had the same issue with Spark 2.4.4. I think it's a typo in the Scala version of the package, so use the following instead:

--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.4.4

Source: https://stackoverflow.com/questions/57962389/unresolved
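
A minimal sketch of using the connector from PySpark once the corrected Scala 2.11 coordinates resolve; the application file, topic name, and broker address below are hypothetical:

# Launch with the Scala 2.11 build (coordinates from the answer above), e.g.:
#   ./bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.4.4 my_app.py
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="kafka-0-8-sketch")
ssc = StreamingContext(sc, batchDuration=5)

# KafkaUtils ships with the 0-8 connector in Spark 2.4.x
stream = KafkaUtils.createDirectStream(
    ssc, topics=["my_topic"],
    kafkaParams={"metadata.broker.list": "localhost:9092"})
stream.map(lambda kv: kv[1]).pprint()

ssc.start()
ssc.awaitTermination()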

Drop if all entries in a spark dataframe's specific column is null

Submitted by 旧巷老猫 on 2020-01-24 21:50:09
Question: Using PySpark, how can I select/keep all columns of a DataFrame which contain at least one non-null value, or equivalently remove all columns which contain no data?

Edited: As per Suresh's request:

for column in media.columns:
    if media.select(media[column]).distinct().count() == 1:
        media = media.drop(media[column])

Here I assumed that if the distinct count is one, the column holds only NaN, but I want to actually check that the value is NaN. If there is any other built-in Spark function for this, let me know.

Answer 1: This is a function I
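
As a rough sketch of the "remove columns with no data" idea (not the answerer's function, which is cut off above), one can count the non-null entries of every column in a single pass and drop the columns whose count is zero; the media frame below is a made-up stand-in:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.getOrCreate()
schema = StructType([StructField("a", IntegerType()),
                     StructField("b", StringType()),
                     StructField("c", StringType())])
media = spark.createDataFrame([(1, None, "x"), (2, None, None)], schema)

# F.count() ignores nulls, so a zero count means the column is entirely null.
non_null_counts = media.select(
    [F.count(F.col(c)).alias(c) for c in media.columns]).first().asDict()
all_null_cols = [c for c, n in non_null_counts.items() if n == 0]

media = media.drop(*all_null_cols)
media.show()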

apache spark: Read large size files from a directory

Submitted by 南楼画角 on 2020-01-24 19:55:12
Question: I am reading each file of a directory using wholeTextFiles. After that I call a function on each element of the RDD using map. The whole program uses just the first 50 lines of each file. The code is as below:

def processFiles(fileNameContentsPair):
    fileName = fileNameContentsPair[0]
    result = "\n\n" + fileName
    resultEr = "\n\n" + fileName
    input = StringIO.StringIO(fileNameContentsPair[1])
    reader = csv.reader(input, strict=True)
    try:
        i = 0
        for row in reader:
            if i == 50:
                break
            # do some processing and get
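
A Python 3 sketch of the same pattern, keeping only the first 50 CSV rows of each (fileName, contents) pair returned by wholeTextFiles; the input path is hypothetical and the per-row processing is left as a placeholder:

import csv
import io

from pyspark import SparkContext

sc = SparkContext(appName="first-50-lines")

def first_fifty_rows(pair):
    # pair is a (fileName, fileContents) tuple from wholeTextFiles
    file_name, contents = pair
    reader = csv.reader(io.StringIO(contents), strict=True)
    rows = []
    for i, row in enumerate(reader):
        if i == 50:
            break
        rows.append((file_name, row))  # placeholder for the real processing
    return rows

results = sc.wholeTextFiles("/data/csv_dir").flatMap(first_fifty_rows)
print(results.take(5))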

Handle unseen categorical string Spark CountVectorizer

Submitted by 只谈情不闲聊 on 2020-01-24 16:32:21
Question: I have seen that StringIndexer has problems with unseen labels (see here). My questions are: Does CountVectorizer have the same limitation? How does it treat a string that is not in the vocabulary? Moreover, is the vocabulary size affected by the input data, or is it fixed by the vocabulary size parameter? Lastly, from an ML point of view, assuming a simple classifier such as logistic regression, shouldn't an unseen category be encoded as a row of zeros so that it is treated as "unknown" and so to get some sort
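
A small sketch one can run to observe the behaviour: CountVectorizer builds its vocabulary only from the data passed to fit (capped by the vocabSize parameter), and at transform time terms outside that vocabulary are simply not counted, so a document made only of unseen tokens becomes an all-zero vector. The token values below are made up:

from pyspark.ml.feature import CountVectorizer
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

train = spark.createDataFrame([(["a", "b", "c"],), (["a", "b"],)], ["tokens"])
test = spark.createDataFrame([(["a", "z"],), (["z", "z"],)], ["tokens"])  # "z" is unseen

cv = CountVectorizer(inputCol="tokens", outputCol="features", vocabSize=10)
model = cv.fit(train)

print(model.vocabulary)                      # learned from the training data only
model.transform(test).show(truncate=False)   # "z" contributes nothing to the vectors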

pyspark error: : java.io.IOException: No FileSystem for scheme: gs

Submitted by 不想你离开。 on 2020-01-24 13:53:08
Question: I am trying to read a JSON file from a Google Cloud Storage bucket into a PySpark DataFrame on a local Spark machine. Here's the code:

import pandas as pd
import numpy as np
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession, SQLContext

conf = SparkConf().setAll([('spark.executor.memory', '16g'),
                           ('spark.executor.cores', '4'),
                           ('spark.cores.max', '4')]).setMaster('local[*]')
spark = (SparkSession.
         builder.
         config(conf=conf).
         getOrCreate())
sc = spark.sparkContext
import glob
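
The error usually means the Hadoop GCS connector is not on the classpath. A sketch of one way to wire it in; the jar path, key-file path, and bucket name below are hypothetical:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .config("spark.jars", "/opt/jars/gcs-connector-hadoop2-latest.jar")
         .config("spark.hadoop.fs.gs.impl",
                 "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
         .config("spark.hadoop.fs.AbstractFileSystem.gs.impl",
                 "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
         .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
         .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile",
                 "/opt/keys/my-service-account.json")
         .getOrCreate())

# With the connector registered, gs:// paths resolve like any other filesystem.
df = spark.read.json("gs://my-bucket/path/to/file.json")
df.show()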

Group By and standardize in spark

Submitted by 試著忘記壹切 on 2020-01-24 12:57:05
Question: I have the following data frame:

import pandas as pd
import numpy as np

df = pd.DataFrame([[1, 2, 3], [1, 2, 1], [1, 2, 2],
                   [2, 2, 2], [2, 3, 2], [2, 4, 2]],
                  columns=["a", "b", "c"])
df = df.set_index("a")
df.groupby("a").mean()
df.groupby("a").std()

I want to standardize the data frame for each key and NOT standardize the whole column vector. So for this example the output would be:

a = 1:
Column b: (2 - 2) / 0.0, (2 - 2) / 0.0, (2 - 2) / 0.0
Column c: (3 - 2) / 1.0, (1 - 2) / 1.0, (2 - 2) / 1.0

And then I
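
A PySpark sketch of per-key standardization using window aggregates over the same example data; note that Spark's default (non-ANSI) division returns null where a group's standard deviation is zero:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 2, 3), (1, 2, 1), (1, 2, 2), (2, 2, 2), (2, 3, 2), (2, 4, 2)],
    ["a", "b", "c"])

# F.stddev is the sample standard deviation, matching pandas' default .std().
w = Window.partitionBy("a")
for col in ["b", "c"]:
    df = df.withColumn(
        col + "_std",
        (F.col(col) - F.mean(col).over(w)) / F.stddev(col).over(w))

df.show()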

Spark UDF with dictionary argument fails

Submitted by 冷暖自知 on 2020-01-24 12:29:05
Question: I have a column (myCol) in a Spark DataFrame that has the values 1 and 2, and I want to create a new column with a description of these values, e.g. 1 -> 'A', 2 -> 'B', etc. I know this can be done with a join, but I tried the following because it seems more elegant:

dictionary = {1: 'A', 2: 'B'}
add_descriptions = udf(lambda x, dictionary: dictionary[x] if x in dictionary.keys() else None)
df.withColumn("description", add_descriptions(df.myCol, dictionary))

And it fails with the error lib/py4j-0.10.4-src.zip/py4j
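
UDF arguments have to be Columns, which is why passing the plain dict as a second argument fails. A sketch of the closure/broadcast workaround; the example frame is made up:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (2,), (3,)], ["myCol"])

dictionary = {1: "A", 2: "B"}

# Capture the dict in the closure (broadcasting it so executors get one copy)
# instead of passing it as a UDF argument.
bc = spark.sparkContext.broadcast(dictionary)
add_descriptions = F.udf(lambda x: bc.value.get(x), StringType())

df.withColumn("description", add_descriptions(F.col("myCol"))).show()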

Random Forest Classifier :To which class corresponds the probabilities

Submitted by 我只是一个虾纸丫 on 2020-01-24 11:17:11
Question: I am using the RandomForestClassifier from pyspark.ml.classification. I run the model on a binary-class dataset and display the probabilities. I have the following in the probability column:

+-----+----------+---------------------------------------+
|label|prediction|probability                            |
+-----+----------+---------------------------------------+
|0.0  |0.0       |[0.9005918461098429,0.0994081538901571]|
|1.0  |1.0       |[0.6051335859900139,0.3948664140099861]|
+-----+----------+-----------------------------------
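
The entries of the probability vector follow the label indices, so element 0 is the probability of label 0.0 and element 1 that of label 1.0. A small sketch for pulling them into their own columns, assuming a predictions DataFrame like the one shown above:

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# Element i of the probability vector corresponds to label index i.
def prob_of(index):
    return F.udf(lambda v: float(v[index]), DoubleType())

predictions = (predictions
               .withColumn("p_label0", prob_of(0)("probability"))
               .withColumn("p_label1", prob_of(1)("probability")))
predictions.select("label", "prediction", "p_label0", "p_label1").show()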

grouping consecutive rows in PySpark Dataframe

Submitted by 白昼怎懂夜的黑 on 2020-01-24 05:38:25
Question: I have the following example Spark DataFrame:

rdd = sc.parallelize([(1, "19:00:00", "19:30:00", 30), (1, "19:30:00", "19:40:00", 10),
                      (1, "19:40:00", "19:43:00", 3), (2, "20:00:00", "20:10:00", 10),
                      (1, "20:05:00", "20:15:00", 10), (1, "20:15:00", "20:35:00", 20)])
df = spark.createDataFrame(rdd, ["user_id", "start_time", "end_time", "duration"])
df.show()

+-------+----------+--------+--------+
|user_id|start_time|end_time|duration|
+-------+----------+--------+--------+
|      1|  19:00:00|19:30:00|      30| |
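
One common way to group consecutive rows, sketched here under the assumption that rows belong together when a row's start_time equals the previous row's end_time for the same user: flag the start of each block with lag, then turn the flags into block ids with a running sum.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "19:00:00", "19:30:00", 30), (1, "19:30:00", "19:40:00", 10),
     (1, "19:40:00", "19:43:00", 3), (2, "20:00:00", "20:10:00", 10),
     (1, "20:05:00", "20:15:00", 10), (1, "20:15:00", "20:35:00", 20)],
    ["user_id", "start_time", "end_time", "duration"])

w = Window.partitionBy("user_id").orderBy("start_time")
df = (df
      .withColumn("prev_end", F.lag("end_time").over(w))
      # a new block starts when there is no previous row or a gap in time
      .withColumn("new_block",
                  (F.col("prev_end").isNull() |
                   (F.col("start_time") != F.col("prev_end"))).cast("int"))
      .withColumn("block_id", F.sum("new_block").over(w)))

df.groupBy("user_id", "block_id").agg(
    F.min("start_time").alias("start_time"),
    F.max("end_time").alias("end_time"),
    F.sum("duration").alias("duration")).show()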

Spark dynamic frame show method yields nothing

Submitted by 会有一股神秘感。 on 2020-01-24 02:58:46
Question: I am using AWS Glue auto-generated code to read a CSV file from S3 and write it to a table over a JDBC connection. Seems simple; the job runs successfully with no error, but it writes nothing. When I checked the Glue Spark DynamicFrame, it does contain all the rows (using .count()). But doing a .show() on it yields nothing, while .printSchema() works fine. I tried logging the error while using .show(), but no errors are printed. I converted the DynamicFrame to the data frame using .toDF and
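
A sketch of the usual way to inspect the data: convert the DynamicFrame to a Spark DataFrame and use its show(); the Glue catalog database and table names below are hypothetical:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_database", table_name="my_table")

print(dyf.count())           # row count on the DynamicFrame itself
dyf.printSchema()            # schema is available either way
dyf.toDF().show(20, False)   # display rows through the Spark DataFrame API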