pyspark

Unresolved dependency in spark-streaming-kafka-0-8_2.12;2.4.4

Submitted by 心已入冬 on 2020-01-24 21:50:40
Question: I use Spark 2.4.4 and I get an unresolved dependency error when I add the following package to spark-submit: spark-streaming-kafka-0-8_2.12;2.4.4. My submit command:

./bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.12:2.4.4

Answer 1: I had the same issue with Spark 2.4.4. I think it's a typo in the Scala version of the package, so use the following instead:

--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.4.4

Source: https://stackoverflow.com/questions/57962389/unresolved
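
A minimal sketch of using the connector from PySpark once the corrected Scala 2.11 coordinates resolve; the application file, topic name, and broker address below are hypothetical:

# Launch with the Scala 2.11 build (coordinates from the answer above), e.g.:
#   ./bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.4.4 my_app.py
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="kafka-0-8-sketch")
ssc = StreamingContext(sc, batchDuration=5)

# KafkaUtils ships with the 0-8 connector in Spark 2.4.x
stream = KafkaUtils.createDirectStream(
    ssc, topics=["my_topic"],
    kafkaParams={"metadata.broker.list": "localhost:9092"})
stream.map(lambda kv: kv[1]).pprint()

ssc.start()
ssc.awaitTermination()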

Drop if all entries in a spark dataframe's specific column is null

Submitted by 旧巷老猫 on 2020-01-24 21:50:09
Question: Using PySpark, how can I select/keep all columns of a DataFrame which contain at least one non-null value, or equivalently remove all columns which contain no data?

Edited: As per Suresh's request:

for column in media.columns:
    if media.select(media[column]).distinct().count() == 1:
        media = media.drop(media[column])

Here I assumed that if the distinct count is one, the column holds only NaN, but I want to actually check that the value is NaN. If there is any other built-in Spark function for this, let me know.

Answer 1: This is a function I
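
As a rough sketch of the "remove columns with no data" idea (not the answerer's function, which is cut off above), one can count the non-null entries of every column in a single pass and drop the columns whose count is zero; the media frame below is a made-up stand-in:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.getOrCreate()
schema = StructType([StructField("a", IntegerType()),
                     StructField("b", StringType()),
                     StructField("c", StringType())])
media = spark.createDataFrame([(1, None, "x"), (2, None, None)], schema)

# F.count() ignores nulls, so a zero count means the column is entirely null.
non_null_counts = media.select(
    [F.count(F.col(c)).alias(c) for c in media.columns]).first().asDict()
all_null_cols = [c for c, n in non_null_counts.items() if n == 0]

media = media.drop(*all_null_cols)
media.show()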

apache spark: Read large size files from a directory

Submitted by 南楼画角 on 2020-01-24 19:55:12
Question: I am reading each file of a directory using wholeTextFiles. After that I call a function on each element of the RDD using map. The whole program uses just the first 50 lines of each file. The code is as below:

def processFiles(fileNameContentsPair):
    fileName = fileNameContentsPair[0]
    result = "\n\n" + fileName
    resultEr = "\n\n" + fileName
    input = StringIO.StringIO(fileNameContentsPair[1])
    reader = csv.reader(input, strict=True)
    try:
        i = 0
        for row in reader:
            if i == 50:
                break
            # do some processing and get
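
A Python 3 sketch of the same pattern, keeping only the first 50 CSV rows of each (fileName, contents) pair returned by wholeTextFiles; the input path is hypothetical and the per-row processing is left as a placeholder:

import csv
import io

from pyspark import SparkContext

sc = SparkContext(appName="first-50-lines")

def first_fifty_rows(pair):
    # pair is a (fileName, fileContents) tuple from wholeTextFiles
    file_name, contents = pair
    reader = csv.reader(io.StringIO(contents), strict=True)
    rows = []
    for i, row in enumerate(reader):
        if i == 50:
            break
        rows.append((file_name, row))  # placeholder for the real processing
    return rows

results = sc.wholeTextFiles("/data/csv_dir").flatMap(first_fifty_rows)
print(results.take(5))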

Handle unseen categorical string Spark CountVectorizer

Submitted by 只谈情不闲聊 on 2020-01-24 16:32:21
Question: I have seen that StringIndexer has problems with unseen labels (see here). My questions are: Does CountVectorizer have the same limitation? How does it treat a string that is not in the vocabulary? Moreover, is the vocabulary size affected by the input data, or is it fixed by the vocabulary size parameter? Lastly, from an ML point of view, assuming a simple classifier such as logistic regression, shouldn't an unseen category be encoded as a row of zeros so that it is treated as "unknown" and so to get some sort
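
A small sketch one can run to observe the behaviour: CountVectorizer builds its vocabulary only from the data passed to fit (capped by the vocabSize parameter), and at transform time terms outside that vocabulary are simply not counted, so a document made only of unseen tokens becomes an all-zero vector. The token values below are made up:

from pyspark.ml.feature import CountVectorizer
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

train = spark.createDataFrame([(["a", "b", "c"],), (["a", "b"],)], ["tokens"])
test = spark.createDataFrame([(["a", "z"],), (["z", "z"],)], ["tokens"])  # "z" is unseen

cv = CountVectorizer(inputCol="tokens", outputCol="features", vocabSize=10)
model = cv.fit(train)

print(model.vocabulary)                      # learned from the training data only
model.transform(test).show(truncate=False)   # "z" contributes nothing to the vectors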

pyspark error: : java.io.IOException: No FileSystem for scheme: gs

Submitted by 不想你离开。 on 2020-01-24 13:53:08
Question: I am trying to read a JSON file from a Google Cloud Storage bucket into a PySpark DataFrame on a local Spark machine. Here's the code:

import pandas as pd
import numpy as np
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession, SQLContext

conf = SparkConf().setAll([('spark.executor.memory', '16g'),
                           ('spark.executor.cores', '4'),
                           ('spark.cores.max', '4')]).setMaster('local[*]')
spark = (SparkSession.
         builder.
         config(conf=conf).
         getOrCreate())
sc = spark.sparkContext
import glob
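
The error usually means the Hadoop GCS connector is not on the classpath. A sketch of one way to wire it in; the jar path, key-file path, and bucket name below are hypothetical:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .config("spark.jars", "/opt/jars/gcs-connector-hadoop2-latest.jar")
         .config("spark.hadoop.fs.gs.impl",
                 "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
         .config("spark.hadoop.fs.AbstractFileSystem.gs.impl",
                 "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
         .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
         .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile",
                 "/opt/keys/my-service-account.json")
         .getOrCreate())

# With the connector registered, gs:// paths resolve like any other filesystem.
df = spark.read.json("gs://my-bucket/path/to/file.json")
df.show()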

Group By and standardize in spark

Submitted by 試著忘記壹切 on 2020-01-24 12:57:05
Question: I have the following data frame:

import pandas as pd
import numpy as np

df = pd.DataFrame([[1, 2, 3], [1, 2, 1], [1, 2, 2],
                   [2, 2, 2], [2, 3, 2], [2, 4, 2]],
                  columns=["a", "b", "c"])
df = df.set_index("a")
df.groupby("a").mean()
df.groupby("a").std()

I want to standardize the data frame for each key and NOT standardize the whole column vector. So for this example the output would be:

a = 1:
Column b: (2 - 2) / 0.0, (2 - 2) / 0.0, (2 - 2) / 0.0
Column c: (3 - 2) / 1.0, (1 - 2) / 1.0, (2 - 2) / 1.0

And then I
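
A PySpark sketch of per-key standardization using window aggregates over the same example data; note that Spark's default (non-ANSI) division returns null where a group's standard deviation is zero:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 2, 3), (1, 2, 1), (1, 2, 2), (2, 2, 2), (2, 3, 2), (2, 4, 2)],
    ["a", "b", "c"])

# F.stddev is the sample standard deviation, matching pandas' default .std().
w = Window.partitionBy("a")
for col in ["b", "c"]:
    df = df.withColumn(
        col + "_std",
        (F.col(col) - F.mean(col).over(w)) / F.stddev(col).over(w))

df.show()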

Spark UDF with dictionary argument fails

Submitted by 冷暖自知 on 2020-01-24 12:29:05
Question: I have a column (myCol) in a Spark DataFrame that has the values 1 and 2, and I want to create a new column with a description of these values, e.g. 1 -> 'A', 2 -> 'B', etc. I know this can be done with a join, but I tried the following because it seems more elegant:

dictionary = {1: 'A', 2: 'B'}
add_descriptions = udf(lambda x, dictionary: dictionary[x] if x in dictionary.keys() else None)
df.withColumn("description", add_descriptions(df.myCol, dictionary))

And it fails with the error lib/py4j-0.10.4-src.zip/py4j
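
UDF arguments have to be Columns, which is why passing the plain dict as a second argument fails. A sketch of the closure/broadcast workaround; the example frame is made up:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (2,), (3,)], ["myCol"])

dictionary = {1: "A", 2: "B"}

# Capture the dict in the closure (broadcasting it so executors get one copy)
# instead of passing it as a UDF argument.
bc = spark.sparkContext.broadcast(dictionary)
add_descriptions = F.udf(lambda x: bc.value.get(x), StringType())

df.withColumn("description", add_descriptions(F.col("myCol"))).show()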

Random Forest Classifier :To which class corresponds the probabilities

Submitted by 我只是一个虾纸丫 on 2020-01-24 11:17:11
Question: I am using the RandomForestClassifier from pyspark.ml.classification. I run the model on a binary-class dataset and display the probabilities. I have the following in the probability column:

+-----+----------+---------------------------------------+
|label|prediction|probability                            |
+-----+----------+---------------------------------------+
|0.0  |0.0       |[0.9005918461098429,0.0994081538901571]|
|1.0  |1.0       |[0.6051335859900139,0.3948664140099861]|
+-----+----------+-----------------------------------
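
The entries of the probability vector follow the label indices, so element 0 is the probability of label 0.0 and element 1 that of label 1.0. A small sketch for pulling them into their own columns, assuming a predictions DataFrame like the one shown above:

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# Element i of the probability vector corresponds to label index i.
def prob_of(index):
    return F.udf(lambda v: float(v[index]), DoubleType())

predictions = (predictions
               .withColumn("p_label0", prob_of(0)("probability"))
               .withColumn("p_label1", prob_of(1)("probability")))
predictions.select("label", "prediction", "p_label0", "p_label1").show()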

grouping consecutive rows in PySpark Dataframe

Submitted by 白昼怎懂夜的黑 on 2020-01-24 05:38:25
Question: I have the following example Spark DataFrame:

rdd = sc.parallelize([(1, "19:00:00", "19:30:00", 30), (1, "19:30:00", "19:40:00", 10),
                      (1, "19:40:00", "19:43:00", 3), (2, "20:00:00", "20:10:00", 10),
                      (1, "20:05:00", "20:15:00", 10), (1, "20:15:00", "20:35:00", 20)])
df = spark.createDataFrame(rdd, ["user_id", "start_time", "end_time", "duration"])
df.show()

+-------+----------+--------+--------+
|user_id|start_time|end_time|duration|
+-------+----------+--------+--------+
|      1|  19:00:00|19:30:00|      30| |
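
One common way to group consecutive rows, sketched here under the assumption that rows belong together when a row's start_time equals the previous row's end_time for the same user: flag the start of each block with lag, then turn the flags into block ids with a running sum.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "19:00:00", "19:30:00", 30), (1, "19:30:00", "19:40:00", 10),
     (1, "19:40:00", "19:43:00", 3), (2, "20:00:00", "20:10:00", 10),
     (1, "20:05:00", "20:15:00", 10), (1, "20:15:00", "20:35:00", 20)],
    ["user_id", "start_time", "end_time", "duration"])

w = Window.partitionBy("user_id").orderBy("start_time")
df = (df
      .withColumn("prev_end", F.lag("end_time").over(w))
      # a new block starts when there is no previous row or a gap in time
      .withColumn("new_block",
                  (F.col("prev_end").isNull() |
                   (F.col("start_time") != F.col("prev_end"))).cast("int"))
      .withColumn("block_id", F.sum("new_block").over(w)))

df.groupBy("user_id", "block_id").agg(
    F.min("start_time").alias("start_time"),
    F.max("end_time").alias("end_time"),
    F.sum("duration").alias("duration")).show()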

Spark dynamic frame show method yields nothing

Submitted by 会有一股神秘感。 on 2020-01-24 02:58:46
Question: I am using AWS Glue auto-generated code to read a CSV file from S3 and write it to a table over a JDBC connection. Seems simple; the job runs successfully with no error, but it writes nothing. When I checked the Glue Spark DynamicFrame, it does contain all the rows (using .count()). But doing a .show() on it yields nothing, while .printSchema() works fine. I tried logging the error while using .show(), but no errors are printed. I converted the DynamicFrame to the data frame using .toDF and
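
A sketch of the usual way to inspect the data: convert the DynamicFrame to a Spark DataFrame and use its show(); the Glue catalog database and table names below are hypothetical:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_database", table_name="my_table")

print(dyf.count())           # row count on the DynamicFrame itself
dyf.printSchema()            # schema is available either way
dyf.toDF().show(20, False)   # display rows through the Spark DataFrame API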