pyspark

How to set port for pyspark jupyter notebook?

痴心易碎 submitted on 2020-04-30 09:59:22

Question: I am starting a pyspark jupyter notebook with a script:

    #!/bin/bash
    ipaddress=...
    echo "Start notebook server at IP address $ipaddress"

    function snotebook () {
        # Spark path (based on your computer)
        SPARK_PATH=/home/.../software/spark-2.3.1-bin-hadoop2.7

        export PYSPARK_DRIVER_PYTHON="jupyter"
        export PYSPARK_DRIVER_PYTHON_OPTS="notebook"

        # For Python 3 users, you have to add the line below or you will get an error
        export PYSPARK_PYTHON=python3

        $SPARK_PATH/bin/pyspark --master local[10]
    }
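One common way to control the port is to pass it (and optionally the bind address) to Jupyter through PYSPARK_DRIVER_PYTHON_OPTS. A minimal Python sketch of the same launch, with a hypothetical port 8889 and the placeholder Spark path from the question:

    import os
    import subprocess

    # Sketch only: mirror the bash script from Python and add --port/--ip to the
    # options Jupyter receives. The Spark path and port 8889 are placeholders.
    spark_path = "/home/.../software/spark-2.3.1-bin-hadoop2.7"
    env = dict(
        os.environ,
        PYSPARK_DRIVER_PYTHON="jupyter",
        PYSPARK_DRIVER_PYTHON_OPTS="notebook --ip=0.0.0.0 --port=8889 --no-browser",
        PYSPARK_PYTHON="python3",
    )
    subprocess.run([f"{spark_path}/bin/pyspark", "--master", "local[10]"], env=env)

The equivalent one-line change in the original bash function is to append --port=8889 (and, if needed, --ip=$ipaddress) to PYSPARK_DRIVER_PYTHON_OPTS.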

Spark dataframe to numpy array via udf or without collecting to driver

旧街凉风 submitted on 2020-04-30 09:48:46

Question: The real-life df is a massive dataframe that cannot be loaded into driver memory. Can this be done using a regular or a pandas udf?

    # Code to generate a sample dataframe
    from pyspark.sql import functions as F
    from pyspark.sql.types import *
    import pandas as pd
    import numpy as np

    sample = [['123', [[0,1,0,0,0,1,1,1,1,1,1,0,1,0,0,0,1,1,1,1,1,1],
                       [0,1,0,0,0,1,1,1,1,1,1,0,1,0,0,0,1,1,1,1,1,1]]],
              ['345', [[1,0,0,0,0,1,1,1,0,1,1,0,1,0,0,0,1,1,1,1,1,1],
                       [0,1,0,0,0,1,1,1,1,1,1,0,1,0,0,0,1,1,1,1,1,1]]],
              ['425',
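Not the asker's full data, but a hedged sketch of the usual pattern for Spark 3.x: keep the numpy work inside a pandas UDF so each executor converts its own rows and nothing large is collected to the driver. The column names (id, features) and the column-wise mean reduction are illustrative assumptions.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import ArrayType, DoubleType
    import numpy as np
    import pandas as pd

    spark = SparkSession.builder.getOrCreate()

    sample = [("123", [[0, 1, 0, 1], [1, 1, 0, 0]]),
              ("345", [[1, 0, 0, 1], [0, 1, 1, 1]])]
    df = spark.createDataFrame(sample, ["id", "features"])

    @F.pandas_udf(ArrayType(DoubleType()))
    def rowwise_mean(features: pd.Series) -> pd.Series:
        # Build a numpy matrix per row on the executor and reduce it there;
        # only the small result vector travels back into the dataframe.
        return features.apply(
            lambda rows: np.asarray(rows, dtype=float).mean(axis=0).tolist())

    df.withColumn("mean_vector", rowwise_mean("features")).show(truncate=False)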

Spark sql Optimization Techniques loading csv to orc format of hive

独自空忆成欢 submitted on 2020-04-30 07:15:04

Question: Hi, I have 90 GB of data in a CSV file. I'm loading this data into one temp table and then from the temp table into an ORC table using an insert-select command, but converting and loading the data into ORC format takes 4 hours in Spark SQL. Is there any optimization technique I can use to reduce this time? As of now I'm not using any optimization technique; I'm just using Spark SQL to load data from the CSV file into a table (text format) and then from this temp table into the ORC table (using select
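One frequently suggested direction (a sketch under assumptions, not a guaranteed fix for this cluster): skip the intermediate text table, supply an explicit schema so Spark does not scan the 90 GB file just to infer types, and write ORC directly with a controlled number of output files. The paths, columns, and repartition count below are placeholders.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = (SparkSession.builder
             .appName("csv-to-orc")
             .enableHiveSupport()
             .getOrCreate())

    # Placeholder schema: list the real columns of the 90 GB file here.
    schema = StructType([
        StructField("id", StringType()),
        StructField("amount", DoubleType()),
    ])

    (spark.read
          .option("header", "true")
          .schema(schema)            # avoids a separate schema-inference pass
          .csv("/data/input/")       # placeholder path
          .repartition(200)          # tune to cores / desired ORC file size
          .write
          .mode("overwrite")
          .format("orc")
          .saveAsTable("mydb.orc_table"))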

Hive Table getting created but not able to see using hive shell

穿精又带淫゛_ submitted on 2020-04-30 07:10:27

Question: Hi, I'm saving my dataframe as a Hive table using spark-sql:

    mydf.write().format("orc").saveAsTable("myTableName")

I'm able to see that the table is getting created using

    hadoop fs -ls /apps/hive/warehouse/dbname.db

and I can also see the data using spark-shell:

    spark.sql("use dbname")
    spark.sql("show tables").show(false)

but I'm not able to see the same tables using the hive shell. I have placed my hive-site.xml file using

    sudo cp /etc/hive/conf.dist/hive-site.xml /etc/spark/conf/

but I still can't see them. Can
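A common cause (an assumption here, since the environment details are cut off) is that Spark is talking to its own local metastore rather than the one the hive shell uses. A minimal sketch of pinning the session to the shared metastore before saveAsTable; the thrift host and database name are placeholders.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("save-to-hive")
             # Placeholder URI: must match hive.metastore.uris in hive-site.xml
             .config("hive.metastore.uris", "thrift://metastore-host:9083")
             .config("spark.sql.warehouse.dir", "/apps/hive/warehouse")
             .enableHiveSupport()
             .getOrCreate())

    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    df.write.format("orc").mode("overwrite").saveAsTable("dbname.myTableName")

    spark.sql("show tables in dbname").show(truncate=False)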

Pyspark - Create new column with the RMSE of two other columns in dataframe

柔情痞子 submitted on 2020-04-30 06:27:29

Question: I am fairly new to Pyspark. I have a dataframe, and I would like to create a 3rd column with the calculated RMSE between col1 and col2. I am using a user-defined lambda function to make the RMSE calculation, but keep getting this error:

    AttributeError: 'int' object has no attribute 'mean'

    from pyspark.sql.functions import udf, col
    from pyspark.sql.types import IntegerType
    from pyspark import SparkContext
    from pyspark.sql import SparkSession
    from pyspark.sql import SQLContext

    spark =
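A hedged sketch of how this is often resolved: the error comes from calling numpy-style .mean() on plain Python ints inside the lambda, and for two scalar columns the per-row "RMSE" collapses to sqrt((col1 - col2)^2), so built-in column functions can replace the UDF entirely. Column names and sample values below are assumptions.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(3, 5), (10, 7), (4, 4)], ["col1", "col2"])

    # Per-row error column (square root of the squared difference).
    df = df.withColumn("rmse", F.sqrt(F.pow(F.col("col1") - F.col("col2"), 2)))
    df.show()

    # If a single RMSE over the whole dataframe is wanted instead, aggregate:
    df.select(F.sqrt(F.avg(F.pow(F.col("col1") - F.col("col2"), 2)))
               .alias("rmse")).show()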

Spark marking duplicate user login within 24 hour after first login

人盡茶涼 submitted on 2020-04-30 06:24:08

Question: I have a dataset with users and login times. I need to mark a login as a duplicate if there are additional logins within the 24-hour period AFTER the first login; the activity window opens with that first login. For example, here is a sample data set:

    user     login
    -----------------------------
    user1    12/1/19 8:00
    user1    12/1/19 10:00
    user1    12/1/19 23:00
    user1    12/2/19 7:00
    user1    12/2/19 8:00
    user1    12/2/19 10:00
    user1    12/3/19 23:00
    user1    12/4/19 7:00
    user2    12/4/19 8:00
    user2    12/5/19 5:00
    user2    12/6/19 0:00

Expected result:

    user     login
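One way to express the "window opens at the first login and runs for 24 hours" rule, which plain lag/lead window functions cannot capture directly, is to walk each user's sorted logins in a grouped-map pandas UDF (Spark 3.x applyInPandas). A minimal sketch with a trimmed sample and assumed output columns:

    import pandas as pd
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    data = [("user1", "12/1/19 8:00"), ("user1", "12/1/19 10:00"),
            ("user1", "12/2/19 10:00"), ("user2", "12/4/19 8:00"),
            ("user2", "12/5/19 5:00")]
    df = (spark.createDataFrame(data, ["user", "login"])
               .withColumn("login_ts", F.to_timestamp("login", "M/d/yy H:mm")))

    def mark_dups(pdf: pd.DataFrame) -> pd.DataFrame:
        # Sequentially open a new 24-hour window at the first login that falls
        # outside the current one; every login inside an open window is a duplicate.
        pdf = pdf.sort_values("login_ts").copy()
        flags, window_start = [], None
        for ts in pdf["login_ts"]:
            if window_start is None or ts >= window_start + pd.Timedelta(hours=24):
                window_start, dup = ts, "N"
            else:
                dup = "Y"
            flags.append(dup)
        pdf["duplicate"] = flags
        return pdf

    result = df.groupBy("user").applyInPandas(
        mark_dups,
        schema="user string, login string, login_ts timestamp, duplicate string")
    result.orderBy("user", "login_ts").show(truncate=False)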