pyspark

TensorFrames on IBM's Data Science Experience

Submitted by 我们两清 on 2019-12-25 15:31:12
Question: This is a follow-up to this question. I want to implement TensorFrames on IBM's Data Science Experience. I will consider it working if I can run all of the examples in the user guide for TensorFrames. I've had to install the following packages to do anything at all with TensorFrames:

pixiedust.installPackage("http://central.maven.org/maven2/com/typesafe/scala-logging/scala-logging-slf4j_2.10/2.1.2/scala-logging-slf4j_2.10-2.1.2.jar")
pixiedust.installPackage("http://central.maven.org/maven2
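
A minimal sketch of the first example from the TensorFrames user guide, assuming the tensorframes package and its dependencies have already been installed (for instance via pixiedust calls like the ones above) and that sqlContext is available in the notebook:

import tensorflow as tf
import tensorframes as tfs
from pyspark.sql import Row

# A one-column DataFrame of doubles; sqlContext is assumed to exist in the notebook.
df = sqlContext.createDataFrame([Row(x=float(i)) for i in range(10)])

with tf.Graph().as_default():
    # Map the "x" column into the TensorFlow graph and add 3 to every element.
    x = tfs.block(df, "x")
    z = tf.add(x, 3, name="z")
    df2 = tfs.map_blocks(z, df)

df2.show()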

ValueError: could not convert string to float

Submitted by 放肆的年华 on 2019-12-25 14:04:29
Question: I have a text file which contains some data. The data looks like this:

join2_train = sc.textFile('join2_train.csv', 4)
join2_train.take(3)

[u'21.9059,TA-00002,S-0066,7/7/2013,0,0,Yes,1,SP-0019,6.35,0.71,137,8,19.05,N,N,N,N,EF-008,EF-008,0,0,0', u'12.3412,TA-00002,S-0066,7/7/2013,0,0,Yes,2,SP-0019,6.35,0.71,137,8,19.05,N,N,N,N,EF-008,EF-008,0,0,0', u'6.60183,TA-00002,S-0066,7/7/2013,0,0,Yes,5,SP-0019,6.35,0.71,137,8,19.05,N,N,N,N,EF-008,EF-008,0,0,0']

Now I am trying to parse this string into a
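
A minimal parsing sketch, assuming only the leading field of each comma-separated line is meant to become a float and the remaining fields stay as strings; the helper name parse_line is hypothetical:

def parse_line(line):
    # Split the CSV line and convert only the leading numeric field to float;
    # the remaining fields (IDs, dates, Yes/No flags) are kept as strings.
    fields = line.split(',')
    return (float(fields[0]), fields[1:])

parsed = join2_train.map(parse_line)
parsed.take(3)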

slice function in DStream Spark Streaming does not work

Submitted by 陌路散爱 on 2019-12-25 10:02:10
Question: Spark Streaming provides a sliding window function to get an RDD for the last k seconds. But I want to try the slice function to get the RDDs for the last k seconds, for the case where I want to query RDDs over a time range before the current time.

delta = timedelta(seconds=30)
datates = datamap.slice(datetime.now()-delta, datetime.now())

And I get this error when I execute the code:

---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
/home/hduser/spark-1.5.0/<ipython
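
A hedged sketch of one possible workaround, assuming the problem is that slice can only return RDDs that the StreamingContext still remembers and whose times line up with the batch interval; the batch interval and remember duration below are assumptions, not values from the question:

from datetime import datetime, timedelta

batch_interval = 10                       # seconds; assumed to match StreamingContext(sc, 10)
ssc.remember(6 * batch_interval)          # keep generated RDDs around long enough to slice

def last_k_seconds(dstream, k):
    # Align the end of the window to a batch boundary before slicing.
    now = datetime.now()
    end = now - timedelta(seconds=now.second % batch_interval,
                          microseconds=now.microsecond)
    begin = end - timedelta(seconds=k)
    return dstream.slice(begin, end)

rdds = last_k_seconds(datamap, 30)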

How do I convert multiple Pandas DFs into a single Spark DF?

Submitted by 谁说我不能喝 on 2019-12-25 09:41:04
Question: I have several Excel files that I need to load and pre-process before loading them into a Spark DF. I have a list of these files that need to be processed. I do something like this to read them in:

file_list_rdd = sc.emptyRDD()

for file_path in file_list:
    current_file_rdd = sc.binaryFiles(file_path)
    print(current_file_rdd.count())
    file_list_rdd = file_list_rdd.union(current_file_rdd)

I then have some mapper function that turns file_list_rdd from a set of (path, bytes) tuples to (path, Pandas
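
A minimal sketch of an alternative route that skips binaryFiles entirely, assuming the Excel files share a schema and can be read on the driver with pandas; file_list and sqlContext come from the question, while pd.read_excel stands in for whatever pre-processing is actually needed:

from functools import reduce
import pandas as pd

spark_dfs = []
for file_path in file_list:
    pdf = pd.read_excel(file_path)        # any pandas-side pre-processing goes here
    spark_dfs.append(sqlContext.createDataFrame(pdf))

# Union the per-file DataFrames into a single Spark DataFrame
# (unionAll on Spark 1.x; union on Spark 2.x).
spark_df = reduce(lambda a, b: a.unionAll(b), spark_dfs)
print(spark_df.count())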

Why do my weights get normalized when I perform Logistic Regression with SGD in Spark?

Submitted by 核能气质少年 on 2019-12-25 09:27:22
Question: I recently asked a question because I was confused about the weights I was getting for the synthetic dataset I created. The answer I received was that the weights are being normalized. You can look at the details here. I'm wondering why LogisticRegressionWithSGD gives normalized weights, whereas everything is fine with LBFGS in the same Spark implementation. Is it possible that the weights simply weren't converging after all? The weights I'm getting:

[0.466521045342,0.699614292387,0.932673108363,0
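
Not a fix, just a sketch of how one might compare the two mllib trainers side by side with regularization switched off; the RDD name points is hypothetical and assumed to hold LabeledPoint rows built from the synthetic data:

from pyspark.mllib.classification import (LogisticRegressionWithSGD,
                                          LogisticRegressionWithLBFGS)

# `points` is assumed to be an RDD of LabeledPoint built from the synthetic data.
sgd_model = LogisticRegressionWithSGD.train(points, iterations=200,
                                            regType=None)   # no L2, to rule regularization out
lbfgs_model = LogisticRegressionWithLBFGS.train(points, iterations=200)

print("SGD weights:   ", sgd_model.weights)
print("LBFGS weights: ", lbfgs_model.weights)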

PySpark filter operation on DStream

Submitted by 霸气de小男生 on 2019-12-25 09:17:29
Question: I have been trying to extend the network word count to be able to filter lines based on a certain keyword. I am using Spark 1.6.2.

from __future__ import print_function
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: network_wordcount.py <hostname> <port>", file=sys.stderr)
        exit(-1)
    sc = SparkContext(appName="PythonStreamingNetworkWordCount")
    ssc = StreamingContext(sc, 5)
    lines = ssc
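
A minimal, self-contained sketch of adding a keyword filter to the network word count; the host, port and keyword are hypothetical:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="FilteredNetworkWordCount")
ssc = StreamingContext(sc, 5)

lines = ssc.socketTextStream("localhost", 9999)    # hypothetical host/port
keyword = "error"                                  # hypothetical filter term

# Keep only lines containing the keyword, then count words as in the original example.
counts = (lines.filter(lambda line: keyword in line)
               .flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()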

Spark SQL RDD loads in pyspark but not in spark-submit: “JDBCRDD: closed connection”

Submitted by 若如初见. on 2019-12-25 09:01:45
Question: I have the following simple code for loading a table from my Postgres database into an RDD.

# this setup is just for spark-submit, will be ignored in pyspark
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf().setAppName("GA")#.setMaster("localhost")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# func for loading table
def get_db_rdd(table):
    url = "jdbc:postgresql://localhost:5432/harvest?user=postgres"
    print(url)
    lower = 0
    upper =
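
A sketch of the partitioned JDBC read together with the spark-submit flags that are typically needed so the Postgres driver is on the classpath outside the interactive shell; the partition column, bounds and jar path are assumptions:

def get_db_rdd(table, lower=0, upper=1000, num_partitions=4):
    url = "jdbc:postgresql://localhost:5432/harvest?user=postgres"
    df = sqlContext.read.jdbc(
        url=url, table=table,
        column="id",                       # hypothetical numeric partition column
        lowerBound=lower, upperBound=upper,
        numPartitions=num_partitions,
        properties={"driver": "org.postgresql.Driver"})
    return df.rdd

# When running with spark-submit, ship the JDBC driver explicitly, e.g.:
#   spark-submit --jars /path/to/postgresql.jar \
#                --driver-class-path /path/to/postgresql.jar my_job.py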

Cloudant database not connecting using Spark Python

Submitted by 你说的曾经没有我的故事 on 2019-12-25 08:58:46
Question: I am using Spark version 2.0.1 and trying to connect to a Cloudant database using Python code, but I am getting an error. The error is thrown at "load(cloudant_credentials['db_name'])", so is there any library I am missing to import? I am sure that I am using the correct Cloudant credentials. I tried using Java code but got the same error. Here is my Python code:

import pandas
import pyspark
from pyspark.mllib.regression import LabeledPoint
from pyspark.ml.evaluation import
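
A minimal sketch using the spark-cloudant connector's DataFrame reader, assuming that connector package is on the classpath; the option names follow its documented configuration, and cloudant_credentials is assumed to hold host, username, password and db_name:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cloudant-test").getOrCreate()

# cloudant_credentials is assumed to hold host, username, password and db_name.
df = (spark.read
      .format("com.cloudant.spark")                        # spark-cloudant data source
      .option("cloudant.host", cloudant_credentials['host'])
      .option("cloudant.username", cloudant_credentials['username'])
      .option("cloudant.password", cloudant_credentials['password'])
      .load(cloudant_credentials['db_name']))

df.printSchema()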

Issues with Logistic Regression for multiclass classification using PySpark

Submitted by 女生的网名这么多〃 on 2019-12-25 08:58:10
Question: I am trying to use Logistic Regression to classify datasets whose feature vectors are SparseVectors. For the full code base and error log, please check my GitHub repo.

Case 1: I tried using the ML pipeline as follows:

# imported library from ML
from pyspark.ml.feature import HashingTF
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression

print(type(trainingData))    # for checking only
print(trainingData.take(2))  # for checking the data type
lr = LogisticRegression
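
A hedged sketch of the ML-pipeline route, assuming trainingData is already a DataFrame with a double label column and a SparseVector features column; the column names and parameters are assumptions, and the multinomial family flag needs Spark 2.1 or later:

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression

# trainingData is assumed to be a DataFrame with columns "label" (double)
# and "features" (a SparseVector / Vector column).
lr = LogisticRegression(maxIter=50, regParam=0.01,
                        featuresCol="features", labelCol="label",
                        family="multinomial")       # multiclass; requires Spark >= 2.1

pipeline = Pipeline(stages=[lr])
model = pipeline.fit(trainingData)

model.transform(trainingData).select("label", "prediction").show(5)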

PySpark: processing stream data and saving processed data to a file

Submitted by 坚强是说给别人听的谎言 on 2019-12-25 08:04:31
Question: I am trying to replicate a device that is streaming its location coordinates, then process the data and save it to a text file. I am using Kafka and Spark Streaming (on PySpark). This is my architecture:

1 - A Kafka producer emits data to a topic named test in the following string format: "LG float LT float", for example: LG 8100.25191107 LT 8406.43141483

Producer code:

from kafka import KafkaProducer
import random

producer = KafkaProducer(bootstrap_servers='localhost:9092')
for i in range(0
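
A consumer-side sketch, assuming the old DStream Kafka integration (the spark-streaming-kafka-0-8 package) is available; it parses the "LG float LT float" strings and writes each batch out with saveAsTextFiles under a hypothetical output prefix:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="LocationStream")
ssc = StreamingContext(sc, 5)

stream = KafkaUtils.createDirectStream(
    ssc, ["test"], {"metadata.broker.list": "localhost:9092"})

def parse(msg):
    # msg is a (key, value) pair; the value looks like "LG 8100.25 LT 8406.43".
    parts = msg[1].split()
    return (float(parts[1]), float(parts[3]))

stream.map(parse).saveAsTextFiles("/tmp/coords")   # hypothetical output prefix

ssc.start()
ssc.awaitTermination()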