pyspark

How to optimize percentage check and cols drop in large pyspark dataframe?

删除回忆录丶 submitted on 2020-01-15 09:48:08
Question: I have a sample pandas dataframe like the one shown below, but my real data is 40 million rows and 5,200 columns.

    df = pd.DataFrame({
        'subject_id': [1,1,1,1,2,2,2,2,3,3,4,4,4,4,4],
        'readings': ['READ_1','READ_2','READ_1','READ_3',np.nan,'READ_5',np.nan,'READ_8','READ_10','READ_12','READ_11','READ_14','READ_09','READ_08','READ_07'],
        'val': [5,6,7,np.nan,np.nan,7,np.nan,12,13,56,32,13,45,43,46],
    })

    from pyspark.sql.types import *
    from pyspark.sql.functions import isnan, when, count, col

    mySchema =
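A minimal sketch of one way to attack this (not the asker's code): compute the null/NaN count of every column in a single aggregation pass, then drop the columns whose missing fraction exceeds a threshold. The 80% cut-off, the tiny explicit schema, and the sample rows below are assumptions for illustration; the question's own data and the truncated mySchema would take their place.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, count, isnan, when
    from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

    spark = SparkSession.builder.getOrCreate()

    # Small stand-in for the real data; None plays the role of np.nan here.
    mySchema = StructType([
        StructField("subject_id", IntegerType()),
        StructField("readings", StringType()),
        StructField("val", DoubleType()),
    ])
    rows = [(1, "READ_1", 5.0), (1, "READ_3", None), (2, None, None), (2, "READ_8", 12.0)]
    sdf = spark.createDataFrame(rows, schema=mySchema)

    threshold = 0.8        # hypothetical cut-off: drop columns more than 80% missing
    total = sdf.count()

    def missing_count(field):
        # isnan() is only valid on float/double columns; other types only need isNull()
        c = col(field.name)
        cond = (c.isNull() | isnan(c)) if isinstance(field.dataType, DoubleType) else c.isNull()
        return count(when(cond, field.name)).alias(field.name)

    # A single select aggregates all columns in one pass, which matters far more
    # at 5,200 columns than filtering column by column.
    missing = sdf.select([missing_count(f) for f in sdf.schema.fields]).first().asDict()
    cols_to_drop = [name for name, n in missing.items() if n / total > threshold]
    sdf = sdf.drop(*cols_to_drop)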

Online (incremental) logistic regression in Spark [duplicate]

笑着哭i submitted on 2020-01-15 08:16:10
Question: This question already has answers here: Whether we can update existing model in spark-ml/spark-mllib? (2 answers). Closed 11 months ago.

In Spark MLlib (the RDD-based API) there is StreamingLogisticRegressionWithSGD for incremental training of a logistic regression model. However, this class has been deprecated and offers little functionality (e.g. no access to model coefficients and output probabilities). In Spark ML (the DataFrame-based API) I only find the class LogisticRegression, having only
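For reference, a minimal sketch of the deprecated RDD-based API the question mentions (StreamingLogisticRegressionWithSGD); the queue-based toy streams, two-feature initial weights, and hyperparameters are illustrative assumptions only.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.mllib.classification import StreamingLogisticRegressionWithSGD
    from pyspark.mllib.regression import LabeledPoint

    sc = SparkContext("local[2]", "StreamingLR")
    ssc = StreamingContext(sc, batchDuration=1)

    # Toy micro-batches fed through a queue stream; a real job would read from
    # Kafka, a socket, or a monitored directory instead.
    train_batches = [
        sc.parallelize([LabeledPoint(1.0, [1.0, 0.5]), LabeledPoint(0.0, [-1.0, -0.3])]),
        sc.parallelize([LabeledPoint(1.0, [0.8, 0.9]), LabeledPoint(0.0, [-0.7, -1.1])]),
    ]
    test_batches = [sc.parallelize([LabeledPoint(1.0, [0.9, 0.6])])]

    model = StreamingLogisticRegressionWithSGD(stepSize=0.1, numIterations=25)
    model.setInitialWeights([0.0, 0.0])               # assumes two features

    model.trainOn(ssc.queueStream(train_batches))     # model is updated on every micro-batch
    model.predictOnValues(
        ssc.queueStream(test_batches).map(lambda lp: (lp.label, lp.features))
    ).pprint()

    ssc.start()
    ssc.awaitTerminationOrTimeout(10)
    ssc.stop()

After at least one batch has been processed, model.latestModel() returns the underlying LogisticRegressionModel, whose weights attribute holds the current coefficients.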

How to efficiently upload a large .tsv file to a Hive table with split columns in pyspark?

我怕爱的太早我们不能终老 submitted on 2020-01-15 07:59:29
Question: I have a large (~10 million lines) .tsv file with two columns, 'id' and 'group'. The 'group' column is actually a list of all groups a certain id belongs to, so the file looks like this:

    id1    group1,group2
    id2    group2,group3,group4
    id3    group1
    ...

I need to upload it to a Hive table using pyspark, but I want to split the group column so that there is only one group per row, so the resulting table looks like this:

    id1    group1
    id1    group2
    id2    group2
    id2    group3
    id2    group4
    id3    group1

I have tried
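A minimal sketch of one way to do this (the input path and the Hive database/table name are assumptions): read the TSV into two columns, split the comma-separated group list, explode it to one group per row, and write the result to Hive.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, explode, split

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Hypothetical input path; the file is tab-separated with no header.
    raw = (spark.read
           .option("sep", "\t")
           .csv("hdfs:///data/groups.tsv")
           .toDF("id", "group"))

    # split() turns "group2,group3,group4" into an array; explode() emits one row per element.
    one_per_row = raw.select("id", explode(split(col("group"), ",")).alias("group"))

    # Hypothetical target table; requires Hive support on the session.
    one_per_row.write.mode("overwrite").saveAsTable("mydb.id_groups")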

repartition a dense matrix in pyspark

无人久伴 submitted on 2020-01-15 07:50:27
Question: I have a dense matrix (100 x 100) in pyspark, and I want to repartition it into ten groups, each containing 10 rows.

    from pyspark import SparkContext, SparkConf
    from pyspark.mllib.linalg import Matrices
    from pyspark.mllib.random import RandomRDDs

    sc = SparkContext("local", "Simple App")
    dm2 = Matrices.dense(100, 100, RandomRDDs.uniformRDD(sc, 10000).collect())
    newRdd = sc.parallelize(dm2.toArray())
    rerdd = newRdd.repartition(10)

The above code results in rerdd containing 100 elements. I want to present this matrix dm2 as row-wise partitioned blocks
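One possible approach (a sketch, not the accepted answer): key each row of the matrix by its row index and use partitionBy with a partition function that sends rows 0-9 to partition 0, rows 10-19 to partition 1, and so on, so every partition holds exactly ten consecutive rows.

    import numpy as np
    from pyspark import SparkContext
    from pyspark.mllib.linalg import Matrices
    from pyspark.mllib.random import RandomRDDs

    sc = SparkContext("local", "Simple App")
    dm2 = Matrices.dense(100, 100, RandomRDDs.uniformRDD(sc, 10000).collect())

    # One (row_index, row_values) pair per matrix row.
    indexed_rows = sc.parallelize(list(enumerate(dm2.toArray())))

    # partitionBy needs key/value pairs; the partition function maps a row index
    # to a block id, giving ten partitions of ten consecutive rows each.
    partitioned = indexed_rows.partitionBy(10, partitionFunc=lambda i: i // 10)

    # Optionally rebuild each block as a 10 x 100 NumPy array.
    blocks = partitioned.mapPartitions(
        lambda rows: [np.vstack([r for _, r in sorted(rows, key=lambda kv: kv[0])])]
    ).collect()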

spark-submit with specific python libraries

一曲冷凌霜 submitted on 2020-01-15 07:21:55
Question: I have PySpark code that depends on third-party libraries, and I want to execute it on my cluster, which runs under Mesos. I have a zipped version of my Python environment on an HTTP server reachable by my cluster. I am having trouble telling my spark-submit invocation to use this environment. I use both --archives to load the zip file and --conf 'spark.pyspark.driver.python=path/to/my/env/bin/python' plus --conf 'spark.pyspark.python=path/to/my/env/bin/python' to specify the
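A sketch of the usual shape of such a submission, assuming the zipped environment is relocatable (e.g. built with conda-pack or venv-pack); the URL, the #environment unpack alias, the master address, and the script name are placeholders. The important detail is that spark.pyspark.python has to point at the interpreter inside the directory the archive is unpacked into on the executors (a relative ./... path), while spark.pyspark.driver.python has to be a path that actually exists on the driver machine.

    spark-submit \
      --master mesos://<master-url> \
      --archives http://myserver/envs/pyspark_env.zip#environment \
      --conf spark.pyspark.python=./environment/bin/python \
      --conf spark.pyspark.driver.python=/local/path/to/env/bin/python \
      my_job.py

The #environment rename alias follows the convention documented for --archives; depending on the cluster manager, the archive may instead be unpacked under its own name in the executor work directory, in which case the relative path changes accordingly.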

pyspark replace multiple values with null in dataframe

隐身守侯 submitted on 2020-01-15 06:43:09
Question: I have a dataframe (df), and within the dataframe I have a column user_id.

    df = sc.parallelize([(1, "not_set"), (2, "user_001"), (3, "user_002"),
                         (4, "n/a"), (5, "N/A"), (6, "userid_not_set"),
                         (7, "user_003"), (8, "user_004")]).toDF(["key", "user_id"])

df:

    +---+--------------+
    |key|       user_id|
    +---+--------------+
    |  1|       not_set|
    |  2|      user_001|
    |  3|      user_002|
    |  4|           n/a|
    |  5|           N/A|
    |  6|userid_not_set|
    |  7|      user_003|
    |  8|      user_004|
    +---+--------------+

I would like to replace the following values: not
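A minimal sketch of one way to do this (not necessarily the accepted answer): treat the sentinel strings as missing and map them to null with when/otherwise. The sentinel list is inferred from the values visible in the question.

    from pyspark.sql.functions import col, lit, when

    sentinels = ["not_set", "n/a", "N/A", "userid_not_set"]

    df_clean = df.withColumn(
        "user_id",
        when(col("user_id").isin(sentinels), lit(None)).otherwise(col("user_id"))
    )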

PySpark DataFrames - filtering using comparisons between columns of different types

徘徊边缘 submitted on 2020-01-15 04:41:53
Question: Suppose you have a dataframe with columns of various types (string, double, ...) and a special value "miss" that represents a "missing value" in string-typed columns.

    from pyspark.sql import SparkSession
    import pandas as pd

    spark = SparkSession.builder.getOrCreate()
    pdf = pd.DataFrame([
        [1, 'miss'],
        [2, 'x'],
        [None, 'y']
    ], columns=['intcol', 'strcol'])
    df = spark.createDataFrame(data=pdf)

I am trying to count the number of non-missing values for each column, using filtering like this:

    col = df[
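One possible way to finish the idea (a sketch, not the asker's exact code): build a per-column "non-missing" predicate that depends on the column type, so string columns are compared against the 'miss' sentinel while other columns only need a null check, then count everything in one aggregation.

    from pyspark.sql.functions import col, count, isnan, when
    from pyspark.sql.types import DoubleType, FloatType, StringType

    def non_missing(field):
        c = col(field.name)
        if isinstance(field.dataType, StringType):
            return c.isNotNull() & (c != "miss")
        if isinstance(field.dataType, (DoubleType, FloatType)):
            return c.isNotNull() & ~isnan(c)      # NaN can sneak in via pandas
        return c.isNotNull()

    counts = df.select([
        count(when(non_missing(f), f.name)).alias(f.name) for f in df.schema.fields
    ])
    counts.show()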
