pyspark

Counting words after grouping records

不羁的心 submitted on 2019-12-23 01:46:08
Question: Note: although the provided answer works, it can get rather slow on larger data sets; take a look at this for a faster solution. I have a data frame consisting of labelled documents, such as this one:

    df_ = spark.createDataFrame([
        ('1', 'hello how are are you today'),
        ('1', 'hello how are you'),
        ('2', 'hello are you here'),
        ('2', 'how is it'),
        ('3', 'hello how are you'),
        ('3', 'hello how are you'),
        ('4', 'hello how is it you today')
    ], schema=['label', 'text'])

What I want is to …
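
A minimal sketch of one way to count words per label, assuming the df_ above and nothing beyond pyspark.sql.functions; it illustrates the technique, not the answer referenced in the question:

    from pyspark.sql import functions as F

    # one row per (label, word), then count occurrences of each word within each label
    word_counts = (
        df_
        .withColumn('word', F.explode(F.split(F.col('text'), ' ')))
        .groupBy('label', 'word')
        .count()
    )
    word_counts.show()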

Add list as column to Dataframe in pyspark

帅比萌擦擦* submitted on 2019-12-23 01:01:09
Question: I have a list of integers and a SQLContext dataframe with the number of rows equal to the length of the list. I want to add the list as a column to this dataframe while maintaining the order. I feel like this should be really simple, but I can't find an elegant solution.

Answer 1: You cannot simply add a list as a dataframe column, since a list is a local object and a dataframe is distributed. You can try one of the following approaches: convert the dataframe to a local one with collect() or toLocalIterator() and for each …
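
A hedged sketch of another commonly suggested approach: index both the dataframe and the list by position and join on that index. The names df and values are hypothetical, and this assumes zipWithIndex reflects the intended row order:

    from pyspark.sql import Row

    values = [10, 20, 30]  # hypothetical list, same length as df

    # attach a positional index to the existing rows and to the list, then join
    df_indexed = df.rdd.zipWithIndex().map(
        lambda pair: Row(idx=pair[1], **pair[0].asDict())
    ).toDF()
    values_df = spark.createDataFrame(
        [(i, v) for i, v in enumerate(values)], ['idx', 'new_col']
    )
    result = df_indexed.join(values_df, on='idx').drop('idx')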

reading data from URL using spark databricks platform

自作多情 submitted on 2019-12-22 18:45:20
Question: I am trying to read data from a URL using Spark on the Databricks Community Edition platform. I tried to use spark.read.csv and SparkFiles, but I am still missing some simple point.

    url = "https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv"
    from pyspark import SparkFiles
    spark.sparkContext.addFile(url)
    # sc.addFile(url)
    # sqlContext = SQLContext(sc)
    # df = sqlContext.read.csv(SparkFiles.get("adult.csv"), header=True, inferSchema= True)
    df = spark.read.csv(SparkFiles.get( …
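
A sketch of the pattern that usually works here, assuming the same url: SparkFiles.get returns a driver-local path, and prefixing it with file:// keeps the reader from interpreting it as a DBFS/HDFS path:

    from pyspark import SparkFiles

    url = "https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/adult.csv"
    spark.sparkContext.addFile(url)

    # read the downloaded copy from the driver's local filesystem
    df = spark.read.csv("file://" + SparkFiles.get("adult.csv"),
                        header=True, inferSchema=True)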

Error with training logistic regression model on Apache Spark. SPARK-5063

て烟熏妆下的殇ゞ submitted on 2019-12-22 18:30:43
Question: I am trying to build a logistic regression model with Apache Spark. Here is the code:

    parsedData = raw_data.map(mapper)  # mapper generates a (label, feature vector) pair as a LabeledPoint object
    featureVectors = parsedData.map(lambda point: point.features)  # get feature vectors from parsed data
    scaler = StandardScaler(True, True).fit(featureVectors)  # this creates a standardization model to scale the features
    scaledData = parsedData.map(lambda lp: LabeledPoint(lp.label, …
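
SPARK-5063 is raised when an RDD or SparkContext is referenced from inside another transformation. A hedged sketch of the usual workaround with the MLlib RDD API, assuming the parsedData above: transform the whole features RDD on the driver and zip it back with the labels, instead of calling the scaler inside map():

    from pyspark.mllib.feature import StandardScaler
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.classification import LogisticRegressionWithLBFGS

    labels = parsedData.map(lambda lp: lp.label)
    features = parsedData.map(lambda lp: lp.features)

    scaler = StandardScaler(withMean=True, withStd=True).fit(features)
    # scale the features RDD as a whole, then rebuild the LabeledPoints
    scaledData = labels.zip(scaler.transform(features)).map(
        lambda pair: LabeledPoint(pair[0], pair[1])
    )
    model = LogisticRegressionWithLBFGS.train(scaledData)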

unable to execute pyspark after installation

故事扮演 submitted on 2019-12-22 18:02:03
Question: I have manually copied spark-2.4.0-bin-hadoop2.7.tgz and extracted it. Then I made the following entries in .bash_profile:

    export SPARK_HOME=/Users/suman/Pyspark/spark-2.4.0-bin-hadoop2.7
    export PATH=$SPARK_HOME/bin:$PATH

I'm sure that I have installed the JDK. Response below:

    ABCDEFGH:bin suman$ java -version
    java version "11" 2018-09-25
    Java(TM) SE Runtime Environment 18.9 (build 11+28)
    Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11+28, mixed mode)

Error below:

    ABCDEFGH:bin suman$ pyspark …
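
One likely culprit, offered as an assumption rather than a confirmed diagnosis: Spark 2.4.0 targets Java 8 and does not support Java 11, so pointing JAVA_HOME at a JDK 8 install before launching pyspark is the usual fix. A sketch of the .bash_profile change on macOS, assuming a JDK 8 is installed:

    # Spark 2.4.x runs on Java 8; Java 11 support only arrived in Spark 3.0
    export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)
    export SPARK_HOME=/Users/suman/Pyspark/spark-2.4.0-bin-hadoop2.7
    export PATH=$SPARK_HOME/bin:$PATH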

Curried UDF - Pyspark

北城以北 submitted on 2019-12-22 17:55:05
Question: I am trying to implement a UDF in Spark that can take both a literal and a column as an argument. To achieve this, I believe I can use a curried UDF. The function is used to match a string literal against each value in a DataFrame column. I have summarized the code below:

    def matching(match_string_1):
        def matching_inner(match_string_2):
            return difflib.SequenceMatcher(None, match_string_1, match_string_2).ratio()
        return matching

    hc.udf.register("matching", matching)
    matching_udf = F.udf( …
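
A minimal sketch of the curried-UDF pattern, assuming a DataFrame df with a string column name (both hypothetical): the outer function captures the literal, and only the inner function is wrapped as a UDF:

    import difflib
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    def matching(match_string_1):
        def matching_inner(match_string_2):
            # similarity between the captured literal and the column value
            return difflib.SequenceMatcher(None, match_string_1, match_string_2).ratio()
        return F.udf(matching_inner, DoubleType())

    # bake the literal "hello" into the UDF, then apply it to the column
    result = df.withColumn('score', matching('hello')(F.col('name')))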

PySpark/Hive: how to CREATE TABLE with LazySimpleSerDe to convert boolean 't' / 'f'?

拟墨画扇 submitted on 2019-12-22 17:46:25
Question: Hello dear Stack Overflow community, here is my problem: A) I have data in CSV with some boolean columns; unfortunately, the values in these columns are t or f (single letters); this is an artifact (from Redshift) that I cannot control. B) I need to create a Spark dataframe from this data, ideally converting t -> true and f -> false. For that, I create a Hive DB and a temp Hive table and then SELECT * from it, like this:

    sql_str = """SELECT * FROM {db}.{s}_{t} """.format( db=hive_db_name, s …
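
One hedged alternative, shown purely as a sketch, sidesteps the SerDe question entirely: read the flag columns as strings and convert them in Spark. The path and the column name bool_col are hypothetical:

    from pyspark.sql import functions as F

    # read the CSV with the flag column coming in as a plain string
    df = spark.read.csv('/path/to/data.csv', header=True)

    # map 't' -> true and 'f' -> false; anything else becomes null
    df = df.withColumn(
        'bool_col',
        F.when(F.col('bool_col') == 't', F.lit(True))
         .when(F.col('bool_col') == 'f', F.lit(False))
         .otherwise(F.lit(None).cast('boolean'))
    )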

access cassandra from pyspark

纵然是瞬间 submitted on 2019-12-22 14:55:18
Question: I am working on an Azure Data Lake. I want to access Cassandra from my pyspark script. I tried:

    > pyspark --packages anguenot/pyspark-cassandra:0.7.0 --conf spark.cassandra.connection.host=12.34.56.78
    SPARK_MAJOR_VERSION is set to 2, using Spark2
    Python 2.7.12 |Anaconda custom (64-bit)| (default, Jul 2 2016, 17:42:40)
    [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    Anaconda is brought to you by Continuum Analytics. Please …
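
For context, a sketch of one common way to read a Cassandra table once a connector package is on the classpath; the connector coordinates and the keyspace/table names below are illustrative assumptions, not taken from the question:

    # e.g. launched with:
    # pyspark --packages com.datastax.spark:spark-cassandra-connector_2.11:2.4.0 \
    #         --conf spark.cassandra.connection.host=12.34.56.78

    df = (spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(keyspace="my_keyspace", table="my_table")  # hypothetical names
          .load())
    df.show()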

Pyspark - how to backfill a DataFrame?

拟墨画扇 submitted on 2019-12-22 13:50:24
Question: How can you do the same thing as df.fillna(method='bfill') for a pandas dataframe with a pyspark.sql.DataFrame? The pyspark dataframe has the pyspark.sql.DataFrame.fillna method, but there is no support for a method parameter. In pandas you can use the following to backfill a time series:

Create data

    import pandas as pd
    index = pd.date_range('2017-01-01', '2017-01-05')
    data = [1, 2, 3, None, 5]
    df = pd.DataFrame({'data': data}, index=index)

Giving

    Out[1]:
                data
    2017-01-01   1.0
    2017-01-02   2 …
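
A sketch of one way to emulate a backfill in PySpark: order a window by the time column and take the first non-null value at or after each row with first(..., ignorenulls=True). The Spark DataFrame sdf and its columns date and data are assumptions based on the pandas example above:

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # for each row, look forward and take the first non-null value
    # (a window without partitionBy pulls all rows into one partition,
    #  which is fine for small data but worth partitioning for large data)
    w = Window.orderBy('date').rowsBetween(0, Window.unboundedFollowing)
    filled = sdf.withColumn('data', F.first('data', ignorenulls=True).over(w))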