pyspark-sql

Why Mongo Spark connector returns different and incorrect counts for a query?

落花浮王杯 submitted on 2019-12-01 04:32:48
I'm evaluating the Mongo Spark connector for a project and I'm getting inconsistent results. I use MongoDB server version 3.4.5, Spark (via PySpark) version 2.2.0, and Mongo Spark connector version 2.11:2.2.0 locally on my laptop. For my test DB I use the Enron dataset http://mongodb-enron-email.s3-website-us-east-1.amazonaws.com/ . I'm interested in Spark SQL queries, and when I started to run simple test queries for count I received different counts on each run. Here is output from my mongo shell:

    > db.messages.count({'headers.To': 'eric.bass@enron.com'})
    203

Here is some output from my PySpark
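For context, below is a minimal sketch of how such a count is typically issued through the Mongo Spark connector from PySpark. The connection URI and database name are assumptions; only the messages collection and the headers.To predicate come from the question above.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("mongo-count-check")
             # Placeholder URI: host and database name are assumptions
             .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/enron_mail.messages")
             .getOrCreate())

    # Load the collection through the connector's DataFrame source
    df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()

    # Same predicate as the mongo shell query above
    print(df.filter(F.col("headers.To") == "eric.bass@enron.com").count())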

Spark Pipeline error

让人想犯罪 __ submitted on 2019-12-01 04:30:07
Question: I am trying to run a Multinomial Logistic Regression model.

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName('prepare_data').getOrCreate()
    from pyspark.sql.types import *
    spark.sql("DROP TABLE IF EXISTS customers")
    spark.sql("CREATE TABLE customers ( Customer_ID DOUBLE, Name STRING, Gender STRING, Address STRING, Nationality DOUBLE, Account_Type STRING, Age DOUBLE, Education STRING, Employment STRING, Salary DOUBLE, Employer_Stability STRING, Customer_Loyalty DOUBLE,
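For reference, here is a minimal sketch of how a multinomial logistic regression pipeline is usually assembled with pyspark.ml; the feature and label columns are assumptions picked from the truncated CREATE TABLE statement above, not the asker's actual pipeline:

    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import VectorAssembler

    # Combine assumed numeric columns into the single vector column Spark ML expects
    assembler = VectorAssembler(inputCols=["Age", "Salary", "Nationality"], outputCol="features")

    # family="multinomial" requests multinomial (softmax) logistic regression
    lr = LogisticRegression(featuresCol="features", labelCol="Customer_Loyalty", family="multinomial")

    pipeline = Pipeline(stages=[assembler, lr])
    model = pipeline.fit(spark.table("customers"))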

Date difference between consecutive rows - Pyspark Dataframe

二次信任 submitted on 2019-12-01 01:37:10
Question: I have a table with the following structure:

    USER_ID  Tweet_ID  Date
    1        1001      Thu Aug 05 19:11:39 +0000 2010
    1        6022      Mon Aug 09 17:51:19 +0000 2010
    1        1041      Sun Aug 19 11:10:09 +0000 2010
    2        9483      Mon Jan 11 10:51:23 +0000 2012
    2        4532      Fri May 21 11:11:11 +0000 2012
    3        4374      Sat Jul 10 03:21:23 +0000 2013
    3        4334      Sun Jul 11 04:53:13 +0000 2013

Basically, what I would like to do is have a PySpark SQL query that calculates the date difference (in seconds) for consecutive records with the same user_id number. The
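The usual approach to this kind of problem is a window function: partition by USER_ID, order by the parsed timestamp, and subtract lag(). A sketch under the assumption that the Date column follows the Twitter-style format shown in the table, with df standing for the table above as a DataFrame:

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    w = Window.partitionBy("USER_ID").orderBy("ts")

    result = (df
              # Parse the string date into epoch seconds
              .withColumn("ts", F.unix_timestamp("Date", "EEE MMM dd HH:mm:ss Z yyyy"))
              # Seconds since the previous tweet by the same user (null for each user's first row)
              .withColumn("diff_seconds", F.col("ts") - F.lag("ts").over(w)))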

List to DataFrame in pyspark

无人久伴 submitted on 2019-12-01 01:35:46
Can someone tell me how to convert a list containing strings to a DataFrame in PySpark? I am using Python 3.6 with Spark 2.2.1. I have just started learning the Spark environment and my data looks like this:

    my_data = [['apple','ball','ballon'], ['cat','camel','james'], ['none','focus','cake']]

Now, I want to create a DataFrame as follows:

    ID | words
    1  | ['apple','ball','ballon']
    2  | ['cat','camel','james']

I even want to add the ID column, which is not present in the data.

Answer: You can convert the list to a list of Row objects, then use
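A minimal sketch of that approach, using enumerate to attach a 1-based ID to each inner list before wrapping it in a Row (the column names ID and words come from the desired output above):

    from pyspark.sql import Row, SparkSession

    spark = SparkSession.builder.getOrCreate()

    my_data = [['apple', 'ball', 'ballon'], ['cat', 'camel', 'james'], ['none', 'focus', 'cake']]

    # Pair each inner list with an ID and wrap it in a Row
    rows = [Row(ID=i, words=words) for i, words in enumerate(my_data, start=1)]

    df = spark.createDataFrame(rows)
    df.show(truncate=False)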

How to use matplotlib to plot pyspark sql results

耗尽温柔 submitted on 2019-11-30 13:47:47
I am new to PySpark. I want to plot the result using matplotlib, but I'm not sure which function to use. I searched for a way to convert the SQL result to pandas and then use plot.

Hi team, I have found the solution for this. I converted the SQL DataFrame to a pandas DataFrame and then I was able to plot the graphs. Below is the sample code:

    from pyspark.sql import Row
    from pyspark.sql import HiveContext
    import pyspark
    from IPython.display import display
    import matplotlib
    import matplotlib.pyplot as plt
    %matplotlib inline
    sc = pyspark.SparkContext()
    sqlContext = HiveContext(sc)
    test_list = [(1, 'hasan'),(2,
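Since the sample above is cut off mid-list, here is a self-contained sketch of the same toPandas-then-plot idea with made-up data (the values and column names are illustrative only, not the asker's test_list):

    from pyspark.sql import SparkSession
    import matplotlib.pyplot as plt

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical data standing in for the truncated test_list above
    sdf = spark.createDataFrame([(1, 10.0), (2, 15.0), (3, 7.0)], ["id", "value"])

    # Convert the Spark DataFrame to pandas, then plot with matplotlib
    pdf = sdf.toPandas()
    pdf.plot(x="id", y="value", kind="bar")
    plt.show()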

PySpark: Take average of a column after using filter function

余生颓废 submitted on 2019-11-30 11:29:29
I am using the following code to get the average age of people whose salary is greater than some threshold:

    dataframe.filter(df['salary'] > 100000).agg({"avg": "age"})

The column age is numeric (float), but I am still getting this error:

    py4j.protocol.Py4JJavaError: An error occurred while calling o86.agg.
    : scala.MatchError: age (of class java.lang.String)

Do you know any other way to obtain the average etc. without using the groupBy function and SQL queries?

Answer (zero323): The aggregation function should be the value and the column name the key:

    dataframe.filter(df['salary'] > 100000).agg({"age": "avg"})
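The same aggregation can also be written with the functions API; a sketch assuming dataframe and df refer to the same DataFrame, as in the snippet above:

    from pyspark.sql import functions as F

    dataframe.filter(df['salary'] > 100000).agg(F.avg('age').alias('avg_age')).show()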

Filtering Spark DataFrame on new column

空扰寡人 submitted on 2019-11-30 09:58:30
Question: Context: I have a dataset too large to fit in memory that I am training a Keras RNN on. I am using PySpark on an AWS EMR cluster to train the model in batches that are small enough to be stored in memory. I was not able to implement the model as distributed using elephas, and I suspect this is related to my model being stateful; I'm not entirely sure, though. The dataframe has a row for every user and the days elapsed since the day of install, from 0 to 29. After querying the database I do a number of
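Although the question body is cut off above, the pattern named in the title usually looks like the following sketch: derive a new column with withColumn, then filter on it. The column name days_elapsed and the weekly grouping are assumptions based on the description, not the asker's code:

    from pyspark.sql import functions as F

    # Hypothetical illustration: add a derived column, then filter the DataFrame on it
    batch = (df
             .withColumn("week", (F.col("days_elapsed") / 7).cast("int"))
             .filter(F.col("week") == 0))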