pyspark

Cannot save model using PySpark xgboost4j

六眼飞鱼酱① submitted on 2021-02-07 08:09:33
Question: I have a small PySpark program that uses xgboost4j and xgboost4j-spark to train a given dataset held in a Spark DataFrame. The training finishes, but it seems I cannot save the model. Current library versions: PySpark 2.4.0, xgboost4j 0.90, xgboost4j-spark 0.90. Spark submit args:
os.environ['PYSPARK_SUBMIT_ARGS'] = "--py-files dist/DNA-0.0.2-py3.6.egg " \
                                    "--jars dna/resources/xgboost4j-spark-0.90.jar," \
                                    "dna/resources/xgboost4j-0.90.jar pyspark-shell"
The training process is as
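For context only, a workaround sometimes suggested for the 0.90 wrapper is to bypass the Python layer and persist the model through its underlying JVM object. The fragment below is an assumption-laden sketch, not a confirmed fix for this setup: it assumes model is the fitted wrapper produced by the training step in the question, that the wrapped Scala class implements Spark's MLWritable, and that it exposes a nativeBooster accessor; the paths are purely illustrative.

# Assumption: `model` is the fitted xgboost4j-spark model wrapper from the
# training step in the question; the calls below go through py4j and only
# work if the underlying Scala class actually supports them.

model_path = "hdfs:///tmp/xgb_model"            # hypothetical output path

# Option 1: delegate to the Scala model's ML writer.
model._java_obj.write().overwrite().save(model_path)

# Option 2: save only the native booster in XGBoost's own binary format
# (written to the driver's local filesystem).
model._java_obj.nativeBooster().saveModel("/tmp/xgb_native.model")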

How to delete rows in a table created from a Spark dataframe?

不羁岁月 submitted on 2021-02-07 07:01:47
Question: Basically, I would like to do a simple delete using SQL statements, but when I execute the SQL script it throws the following error: pyspark.sql.utils.ParseException: u"\nmissing 'FROM' at 'a'(line 2, pos 23)\n\n== SQL ==\n\n DELETE a.* FROM adsquare a \n-----------------------^^^\n" This is the script that I'm using: sq = SparkSession.builder.config('spark.rpc.message.maxSize','1536').config("spark.sql.shuffle.partitions",str(shuffle_value)).getOrCreate() adsquare = sq.read.csv(f, schema
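For reference, Spark SQL 2.4 has no DELETE statement for temporary views built on DataFrames, which is why the parser rejects the statement; the usual workaround is to express the delete as a filter and re-register the result. A minimal sketch, with a hypothetical id column and sample data standing in for the question's adsquare table:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical data standing in for the adsquare table in the question.
adsquare = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "value"])
adsquare.createOrReplaceTempView("adsquare")

# DELETE is not supported on a temp view, so express it as a SELECT that
# keeps everything the delete would not have removed:
kept = spark.sql("SELECT * FROM adsquare a WHERE a.id <> 2")

# Equivalent DataFrame API form, without SQL:
kept_df = adsquare.filter(F.col("id") != 2)

# Re-register the filtered result under the same name if later SQL relies on it.
kept.createOrReplaceTempView("adsquare")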

Apply custom function to cells of selected columns of a data frame in PySpark

和自甴很熟 submitted on 2021-02-07 03:32:39
Question: Let's say I have a data frame which looks like this:
+---+-----------+-----------+
| id|   address1|   address2|
+---+-----------+-----------+
|  1|address 1.1|address 1.2|
|  2|address 2.1|address 2.2|
+---+-----------+-----------+
I would like to apply a custom function directly to the strings in the address1 and address2 columns, for example:
def example(string1, string2):
    name_1 = string1.lower().split(' ')
    name_2 = string2.lower().split(' ')
    intersection_count = len(set(name_1) & set(name_2))
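One common way to apply such a function cell-wise is to wrap it as a UDF and pass it the two columns. A minimal sketch, assuming the function is meant to return the intersection count (the snippet above is cut off before any return statement, so that is an assumption):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "address 1.1", "address 1.2"), (2, "address 2.1", "address 2.2")],
    ["id", "address1", "address2"],
)

# The comparison from the question, completed with an assumed return value.
def example(string1, string2):
    name_1 = string1.lower().split(" ")
    name_2 = string2.lower().split(" ")
    return len(set(name_1) & set(name_2))

example_udf = F.udf(example, IntegerType())

# Apply the function to the two address columns, row by row.
df = df.withColumn("intersection_count", example_udf("address1", "address2"))
df.show()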

Performance decrease for huge amount of columns. Pyspark

我们两清 submitted on 2021-02-06 20:18:54
Question: I ran into a problem processing a wide Spark dataframe (about 9000 columns, sometimes more). Task: create the wide DF via groupBy and pivot, transform the columns into a vector, and feed it into KMeans from pyspark.ml. So I built the wide frame, created the vector with VectorAssembler, cached it, and trained KMeans on it. It took about 11 minutes for assembling and 2 minutes for KMeans with 7 different cluster counts on my PC in standalone mode, for a frame of ~500x9000. On the other side, this processing
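For reference, a minimal sketch of the pipeline described here (groupBy/pivot into a wide frame, VectorAssembler into a single vector column, then KMeans from pyspark.ml); the column names, data, and k below are illustrative rather than taken from the question:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.getOrCreate()

# Illustrative long-format input: (row id, column key, value).
long_df = spark.createDataFrame(
    [(1, "f1", 1.0), (1, "f2", 2.0), (2, "f1", 3.0), (2, "f2", 4.0)],
    ["id", "feature", "value"],
)

# 1. Wide frame via groupBy + pivot (one column per distinct feature key).
wide_df = long_df.groupBy("id").pivot("feature").agg(F.first("value")).fillna(0.0)

# 2. Assemble all feature columns into a single vector column.
feature_cols = [c for c in wide_df.columns if c != "id"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
assembled = assembler.transform(wide_df).cache()

# 3. Fit KMeans on the assembled vectors.
model = KMeans(k=2, featuresCol="features", seed=42).fit(assembled)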

How do you create merge_asof functionality in PySpark?

…衆ロ難τιáo~ submitted on 2021-02-06 20:01:47
Question: Table A has many columns including a date column; Table B has a datetime and a value. The data in both tables is generated sporadically, with no regular interval. Table A is small, Table B is massive. I need to join B to A under the condition that a given element a of A.datetime corresponds to B[B['datetime'] <= a]['datetime'].max(). There are a couple of ways to do this, but I would like the most efficient one. Option 1: Broadcast the small dataset as a Pandas DataFrame. Set up a Spark UDF that
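One way to express this as-of condition in plain PySpark is a non-equi join followed by keeping, for each A row, the largest B.datetime that does not exceed it. A minimal sketch under simplifying assumptions: the tables and column names are illustrative, there is no extra join key, and A's datetimes are assumed unique so they can serve as the window partition:

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative stand-ins for tables A and B from the question.
a = spark.createDataFrame([("2021-01-05",), ("2021-01-10",)], ["datetime"]) \
         .withColumn("datetime", F.col("datetime").cast("timestamp"))
b = spark.createDataFrame(
    [("2021-01-03", 1.0), ("2021-01-07", 2.0), ("2021-01-09", 3.0)],
    ["datetime", "value"],
).withColumn("datetime", F.col("datetime").cast("timestamp"))

# Non-equi join: attach every B row whose datetime is at or before the A row...
joined = a.join(
    b.withColumnRenamed("datetime", "b_datetime"),
    F.col("b_datetime") <= F.col("datetime"),
    "left",
)

# ...then keep only the latest matching B row per A row (the as-of match).
w = Window.partitionBy("datetime").orderBy(F.col("b_datetime").desc())
asof = joined.withColumn("rn", F.row_number().over(w)).filter("rn = 1").drop("rn")
asof.show()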
