pyspark

How to set port for pyspark jupyter notebook?

痴心易碎 submitted on 2020-04-30 09:59:22

Question: I am starting a pyspark jupyter notebook with a script:

    #!/bin/bash
    ipaddress=...
    echo "Start notebook server at IP address $ipaddress"

    function snotebook () {
        # Spark path (based on your computer)
        SPARK_PATH=/home/.../software/spark-2.3.1-bin-hadoop2.7

        export PYSPARK_DRIVER_PYTHON="jupyter"
        export PYSPARK_DRIVER_PYTHON_OPTS="notebook"

        # For Python 3 users, you have to add the line below or you will get an error
        export PYSPARK_PYTHON=python3

        $SPARK_PATH/bin/pyspark --master local[10]
    }
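One common way to control the port is to pass it (and optionally the bind address) to Jupyter through PYSPARK_DRIVER_PYTHON_OPTS. A minimal Python sketch of the same launch, with a hypothetical port 8889 and the placeholder Spark path from the question:

    import os
    import subprocess

    # Sketch only: mirror the bash script from Python and add --port/--ip to the
    # options Jupyter receives. The Spark path and port 8889 are placeholders.
    spark_path = "/home/.../software/spark-2.3.1-bin-hadoop2.7"
    env = dict(
        os.environ,
        PYSPARK_DRIVER_PYTHON="jupyter",
        PYSPARK_DRIVER_PYTHON_OPTS="notebook --ip=0.0.0.0 --port=8889 --no-browser",
        PYSPARK_PYTHON="python3",
    )
    subprocess.run([f"{spark_path}/bin/pyspark", "--master", "local[10]"], env=env)

The equivalent one-line change in the original bash function is to append --port=8889 (and, if needed, --ip=$ipaddress) to PYSPARK_DRIVER_PYTHON_OPTS.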

Spark dataframe to numpy array via udf or without collecting to driver

旧街凉风 submitted on 2020-04-30 09:48:46

Question: The real-life df is a massive dataframe that cannot be loaded into driver memory. Can this be done using a regular or a pandas udf?

    # Code to generate a sample dataframe
    from pyspark.sql import functions as F
    from pyspark.sql.types import *
    import pandas as pd
    import numpy as np

    sample = [['123', [[0,1,0,0,0,1,1,1,1,1,1,0,1,0,0,0,1,1,1,1,1,1],
                       [0,1,0,0,0,1,1,1,1,1,1,0,1,0,0,0,1,1,1,1,1,1]]],
              ['345', [[1,0,0,0,0,1,1,1,0,1,1,0,1,0,0,0,1,1,1,1,1,1],
                       [0,1,0,0,0,1,1,1,1,1,1,0,1,0,0,0,1,1,1,1,1,1]]],
              ['425',
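Not the asker's full data, but a hedged sketch of the usual pattern for Spark 3.x: keep the numpy work inside a pandas UDF so each executor converts its own rows and nothing large is collected to the driver. The column names (id, features) and the column-wise mean reduction are illustrative assumptions.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import ArrayType, DoubleType
    import numpy as np
    import pandas as pd

    spark = SparkSession.builder.getOrCreate()

    sample = [("123", [[0, 1, 0, 1], [1, 1, 0, 0]]),
              ("345", [[1, 0, 0, 1], [0, 1, 1, 1]])]
    df = spark.createDataFrame(sample, ["id", "features"])

    @F.pandas_udf(ArrayType(DoubleType()))
    def rowwise_mean(features: pd.Series) -> pd.Series:
        # Build a numpy matrix per row on the executor and reduce it there;
        # only the small result vector travels back into the dataframe.
        return features.apply(
            lambda rows: np.asarray(rows, dtype=float).mean(axis=0).tolist())

    df.withColumn("mean_vector", rowwise_mean("features")).show(truncate=False)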

Spark sql Optimization Techniques loading csv to orc format of hive

独自空忆成欢 submitted on 2020-04-30 07:15:04

Question: Hi, I have 90 GB of data in a CSV file. I'm loading this data into one temp table and then from the temp table into an ORC table using an insert-select command, but converting and loading the data into ORC format takes 4 hours in Spark SQL. Is there any optimization technique I can use to reduce this time? As of now I'm not using any optimization technique; I'm just using Spark SQL to load data from the CSV file into a table (text format) and then from this temp table into the ORC table (using select
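One frequently suggested direction (a sketch under assumptions, not a guaranteed fix for this cluster): skip the intermediate text table, supply an explicit schema so Spark does not scan the 90 GB file just to infer types, and write ORC directly with a controlled number of output files. The paths, columns, and repartition count below are placeholders.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = (SparkSession.builder
             .appName("csv-to-orc")
             .enableHiveSupport()
             .getOrCreate())

    # Placeholder schema: list the real columns of the 90 GB file here.
    schema = StructType([
        StructField("id", StringType()),
        StructField("amount", DoubleType()),
    ])

    (spark.read
          .option("header", "true")
          .schema(schema)            # avoids a separate schema-inference pass
          .csv("/data/input/")       # placeholder path
          .repartition(200)          # tune to cores / desired ORC file size
          .write
          .mode("overwrite")
          .format("orc")
          .saveAsTable("mydb.orc_table"))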

Hive Table getting created but not able to see using hive shell

穿精又带淫゛_ submitted on 2020-04-30 07:10:27

Question: Hi, I'm saving my dataframe as a Hive table using spark-sql:

    mydf.write().format("orc").saveAsTable("myTableName")

I'm able to see that the table is getting created using

    hadoop fs -ls /apps/hive/warehouse/dbname.db

and I can also see the data using spark-shell:

    spark.sql("use dbname")
    spark.sql("show tables").show(false)

but I'm not able to see the same tables using the hive shell. I have placed my hive-site.xml file using

    sudo cp /etc/hive/conf.dist/hive-site.xml /etc/spark/conf/

but I still can't see them. Can
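A common cause (an assumption here, since the environment details are cut off) is that Spark is talking to its own local metastore rather than the one the hive shell uses. A minimal sketch of pinning the session to the shared metastore before saveAsTable; the thrift host and database name are placeholders.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("save-to-hive")
             # Placeholder URI: must match hive.metastore.uris in hive-site.xml
             .config("hive.metastore.uris", "thrift://metastore-host:9083")
             .config("spark.sql.warehouse.dir", "/apps/hive/warehouse")
             .enableHiveSupport()
             .getOrCreate())

    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    df.write.format("orc").mode("overwrite").saveAsTable("dbname.myTableName")

    spark.sql("show tables in dbname").show(truncate=False)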

Pyspark - Create new column with the RMSE of two other columns in dataframe

柔情痞子 submitted on 2020-04-30 06:27:29

Question: I am fairly new to Pyspark. I have a dataframe, and I would like to create a 3rd column with the calculated RMSE between col1 and col2. I am using a user-defined lambda function to make the RMSE calculation, but keep getting this error:

    AttributeError: 'int' object has no attribute 'mean'

    from pyspark.sql.functions import udf, col
    from pyspark.sql.types import IntegerType
    from pyspark import SparkContext
    from pyspark.sql import SparkSession
    from pyspark.sql import SQLContext

    spark =
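A hedged sketch of how this is often resolved: the error comes from calling numpy-style .mean() on plain Python ints inside the lambda, and for two scalar columns the per-row "RMSE" collapses to sqrt((col1 - col2)^2), so built-in column functions can replace the UDF entirely. Column names and sample values below are assumptions.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(3, 5), (10, 7), (4, 4)], ["col1", "col2"])

    # Per-row error column (square root of the squared difference).
    df = df.withColumn("rmse", F.sqrt(F.pow(F.col("col1") - F.col("col2"), 2)))
    df.show()

    # If a single RMSE over the whole dataframe is wanted instead, aggregate:
    df.select(F.sqrt(F.avg(F.pow(F.col("col1") - F.col("col2"), 2)))
               .alias("rmse")).show()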

Spark marking duplicate user login within 24 hour after first login

人盡茶涼 submitted on 2020-04-30 06:24:08

Question: I have a dataset with users and login times. I need to mark a login as a duplicate if there are additional logins within the 24-hour period AFTER the first login; the activity window opens with that first login. For example, here is a sample data set:

    user     login
    -----------------------------
    user1    12/1/19 8:00
    user1    12/1/19 10:00
    user1    12/1/19 23:00
    user1    12/2/19 7:00
    user1    12/2/19 8:00
    user1    12/2/19 10:00
    user1    12/3/19 23:00
    user1    12/4/19 7:00
    user2    12/4/19 8:00
    user2    12/5/19 5:00
    user2    12/6/19 0:00

Expected result:

    user     login
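One way to express the "window opens at the first login and runs for 24 hours" rule, which plain lag/lead window functions cannot capture directly, is to walk each user's sorted logins in a grouped-map pandas UDF (Spark 3.x applyInPandas). A minimal sketch with a trimmed sample and assumed output columns:

    import pandas as pd
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    data = [("user1", "12/1/19 8:00"), ("user1", "12/1/19 10:00"),
            ("user1", "12/2/19 10:00"), ("user2", "12/4/19 8:00"),
            ("user2", "12/5/19 5:00")]
    df = (spark.createDataFrame(data, ["user", "login"])
               .withColumn("login_ts", F.to_timestamp("login", "M/d/yy H:mm")))

    def mark_dups(pdf: pd.DataFrame) -> pd.DataFrame:
        # Sequentially open a new 24-hour window at the first login that falls
        # outside the current one; every login inside an open window is a duplicate.
        pdf = pdf.sort_values("login_ts").copy()
        flags, window_start = [], None
        for ts in pdf["login_ts"]:
            if window_start is None or ts >= window_start + pd.Timedelta(hours=24):
                window_start, dup = ts, "N"
            else:
                dup = "Y"
            flags.append(dup)
        pdf["duplicate"] = flags
        return pdf

    result = df.groupBy("user").applyInPandas(
        mark_dups,
        schema="user string, login string, login_ts timestamp, duplicate string")
    result.orderBy("user", "login_ts").show(truncate=False)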