pyspark

TensorFrames on IBM's Data Science Experience

Submitted by 我们两清 on 2019-12-25 15:31:12
Question: This is a follow-up to this question. I want to implement TensorFrames on IBM's Data Science Experience. I will consider it working if I can run all of the examples in the user guide for TensorFrames. I've had to install the following packages to do anything at all with TensorFrames:

pixiedust.installPackage("http://central.maven.org/maven2/com/typesafe/scala-logging/scala-logging-slf4j_2.10/2.1.2/scala-logging-slf4j_2.10-2.1.2.jar")
pixiedust.installPackage("http://central.maven.org/maven2
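
A minimal sketch of the first example from the TensorFrames user guide, assuming the tensorframes package and its dependencies have already been installed (for instance via pixiedust calls like the ones above) and that sqlContext is available in the notebook:

import tensorflow as tf
import tensorframes as tfs
from pyspark.sql import Row

# A one-column DataFrame of doubles; sqlContext is assumed to exist in the notebook.
df = sqlContext.createDataFrame([Row(x=float(i)) for i in range(10)])

with tf.Graph().as_default():
    # Map the "x" column into the TensorFlow graph and add 3 to every element.
    x = tfs.block(df, "x")
    z = tf.add(x, 3, name="z")
    df2 = tfs.map_blocks(z, df)

df2.show()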

ValueError: could not convert string to float

Submitted by 放肆的年华 on 2019-12-25 14:04:29
Question: I have a text file which contains some data. The data looks like this:

join2_train = sc.textFile('join2_train.csv', 4)
join2_train.take(3)

[u'21.9059,TA-00002,S-0066,7/7/2013,0,0,Yes,1,SP-0019,6.35,0.71,137,8,19.05,N,N,N,N,EF-008,EF-008,0,0,0', u'12.3412,TA-00002,S-0066,7/7/2013,0,0,Yes,2,SP-0019,6.35,0.71,137,8,19.05,N,N,N,N,EF-008,EF-008,0,0,0', u'6.60183,TA-00002,S-0066,7/7/2013,0,0,Yes,5,SP-0019,6.35,0.71,137,8,19.05,N,N,N,N,EF-008,EF-008,0,0,0']

Now I am trying to parse this string into a
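
A minimal parsing sketch, assuming only the leading field of each comma-separated line is meant to become a float and the remaining fields stay as strings; the helper name parse_line is hypothetical:

def parse_line(line):
    # Split the CSV line and convert only the leading numeric field to float;
    # the remaining fields (IDs, dates, Yes/No flags) are kept as strings.
    fields = line.split(',')
    return (float(fields[0]), fields[1:])

parsed = join2_train.map(parse_line)
parsed.take(3)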

slice function in DStream Spark Streaming does not work

Submitted by 陌路散爱 on 2019-12-25 10:02:10
Question: Spark Streaming provides a sliding window function to get an RDD for the last k seconds. But I want to try the slice function to get the RDDs for the last k seconds, for the case where I want to query RDDs over a time range before the current time.

delta = timedelta(seconds=30)
datates = datamap.slice(datetime.now()-delta, datetime.now())

And I get this error when I execute the code:

---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
/home/hduser/spark-1.5.0/<ipython
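
A hedged sketch of one possible workaround, assuming the problem is that slice can only return RDDs that the StreamingContext still remembers and whose times line up with the batch interval; the batch interval and remember duration below are assumptions, not values from the question:

from datetime import datetime, timedelta

batch_interval = 10                       # seconds; assumed to match StreamingContext(sc, 10)
ssc.remember(6 * batch_interval)          # keep generated RDDs around long enough to slice

def last_k_seconds(dstream, k):
    # Align the end of the window to a batch boundary before slicing.
    now = datetime.now()
    end = now - timedelta(seconds=now.second % batch_interval,
                          microseconds=now.microsecond)
    begin = end - timedelta(seconds=k)
    return dstream.slice(begin, end)

rdds = last_k_seconds(datamap, 30)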

How do I convert multiple Pandas DFs into a single Spark DF?

Submitted by 谁说我不能喝 on 2019-12-25 09:41:04
Question: I have several Excel files that I need to load and pre-process before loading them into a Spark DF. I have a list of these files that need to be processed. I do something like this to read them in:

file_list_rdd = sc.emptyRDD()

for file_path in file_list:
    current_file_rdd = sc.binaryFiles(file_path)
    print(current_file_rdd.count())
    file_list_rdd = file_list_rdd.union(current_file_rdd)

I then have some mapper function that turns file_list_rdd from a set of (path, bytes) tuples to (path, Pandas
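
A minimal sketch of an alternative route that skips binaryFiles entirely, assuming the Excel files share a schema and can be read on the driver with pandas; file_list and sqlContext come from the question, while pd.read_excel stands in for whatever pre-processing is actually needed:

from functools import reduce
import pandas as pd

spark_dfs = []
for file_path in file_list:
    pdf = pd.read_excel(file_path)        # any pandas-side pre-processing goes here
    spark_dfs.append(sqlContext.createDataFrame(pdf))

# Union the per-file DataFrames into a single Spark DataFrame
# (unionAll on Spark 1.x; union on Spark 2.x).
spark_df = reduce(lambda a, b: a.unionAll(b), spark_dfs)
print(spark_df.count())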

Why do my weights get normalized when I perform Logistic Regression with SGD in Spark?

Submitted by 核能气质少年 on 2019-12-25 09:27:22
Question: I recently asked a question because I was confused about the weights I was getting for the synthetic dataset I created. The answer I received was that the weights are being normalized. You can look at the details here. I'm wondering why LogisticRegressionWithSGD gives normalized weights, whereas everything is fine with LBFGS in the same Spark implementation. Is it possible that the weights simply weren't converging after all? The weights I'm getting:

[0.466521045342,0.699614292387,0.932673108363,0
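
Not a fix, just a sketch of how one might compare the two mllib trainers side by side with regularization switched off; the RDD name points is hypothetical and assumed to hold LabeledPoint rows built from the synthetic data:

from pyspark.mllib.classification import (LogisticRegressionWithSGD,
                                          LogisticRegressionWithLBFGS)

# `points` is assumed to be an RDD of LabeledPoint built from the synthetic data.
sgd_model = LogisticRegressionWithSGD.train(points, iterations=200,
                                            regType=None)   # no L2, to rule regularization out
lbfgs_model = LogisticRegressionWithLBFGS.train(points, iterations=200)

print("SGD weights:   ", sgd_model.weights)
print("LBFGS weights: ", lbfgs_model.weights)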

PySpark filter operation on DStream

Submitted by 霸气de小男生 on 2019-12-25 09:17:29
Question: I have been trying to extend the network word count to be able to filter lines based on a certain keyword. I am using Spark 1.6.2.

from __future__ import print_function
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: network_wordcount.py <hostname> <port>", file=sys.stderr)
        exit(-1)
    sc = SparkContext(appName="PythonStreamingNetworkWordCount")
    ssc = StreamingContext(sc, 5)
    lines = ssc
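
A minimal, self-contained sketch of adding a keyword filter to the network word count; the host, port and keyword are hypothetical:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="FilteredNetworkWordCount")
ssc = StreamingContext(sc, 5)

lines = ssc.socketTextStream("localhost", 9999)    # hypothetical host/port
keyword = "error"                                  # hypothetical filter term

# Keep only lines containing the keyword, then count words as in the original example.
counts = (lines.filter(lambda line: keyword in line)
               .flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()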

Spark SQL RDD loads in pyspark but not in spark-submit: “JDBCRDD: closed connection”

Submitted by 若如初见. on 2019-12-25 09:01:45
Question: I have the following simple code for loading a table from my Postgres database into an RDD.

# this setup is just for spark-submit, will be ignored in pyspark
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf().setAppName("GA")#.setMaster("localhost")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# func for loading table
def get_db_rdd(table):
    url = "jdbc:postgresql://localhost:5432/harvest?user=postgres"
    print(url)
    lower = 0
    upper =
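
A sketch of the partitioned JDBC read together with the spark-submit flags that are typically needed so the Postgres driver is on the classpath outside the interactive shell; the partition column, bounds and jar path are assumptions:

def get_db_rdd(table, lower=0, upper=1000, num_partitions=4):
    url = "jdbc:postgresql://localhost:5432/harvest?user=postgres"
    df = sqlContext.read.jdbc(
        url=url, table=table,
        column="id",                       # hypothetical numeric partition column
        lowerBound=lower, upperBound=upper,
        numPartitions=num_partitions,
        properties={"driver": "org.postgresql.Driver"})
    return df.rdd

# When running with spark-submit, ship the JDBC driver explicitly, e.g.:
#   spark-submit --jars /path/to/postgresql.jar \
#                --driver-class-path /path/to/postgresql.jar my_job.py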

Cloudant database not connecting using Spark Python

Submitted by 你说的曾经没有我的故事 on 2019-12-25 08:58:46
Question: I am using Spark version 2.0.1 and trying to connect to a Cloudant database using Python code, but I am getting an error. The error is thrown at "load(cloudant_credentials['db_name'])", so is there any library I am missing to import? I am sure that I am using the correct Cloudant credentials. I tried using Java code but got the same error. Here is my Python code:

import pandas
import pyspark
from pyspark.mllib.regression import LabeledPoint
from pyspark.ml.evaluation import
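
A minimal sketch using the spark-cloudant connector's DataFrame reader, assuming that connector package is on the classpath; the option names follow its documented configuration, and cloudant_credentials is assumed to hold host, username, password and db_name:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cloudant-test").getOrCreate()

# cloudant_credentials is assumed to hold host, username, password and db_name.
df = (spark.read
      .format("com.cloudant.spark")                        # spark-cloudant data source
      .option("cloudant.host", cloudant_credentials['host'])
      .option("cloudant.username", cloudant_credentials['username'])
      .option("cloudant.password", cloudant_credentials['password'])
      .load(cloudant_credentials['db_name']))

df.printSchema()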

Issues with Logistic Regression for multiclass classification using PySpark

Submitted by 女生的网名这么多〃 on 2019-12-25 08:58:10
Question: I am trying to use Logistic Regression to classify datasets whose feature vectors are SparseVectors. For the full code base and error log, please check my GitHub repo.

Case 1: I tried using the ML pipeline as follows:

# imported library from ML
from pyspark.ml.feature import HashingTF
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression

print(type(trainingData))    # for checking only
print(trainingData.take(2))  # for checking the data type
lr = LogisticRegression
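
A hedged sketch of the ML-pipeline route, assuming trainingData is already a DataFrame with a double label column and a SparseVector features column; the column names and parameters are assumptions, and the multinomial family flag needs Spark 2.1 or later:

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression

# trainingData is assumed to be a DataFrame with columns "label" (double)
# and "features" (a SparseVector / Vector column).
lr = LogisticRegression(maxIter=50, regParam=0.01,
                        featuresCol="features", labelCol="label",
                        family="multinomial")       # multiclass; requires Spark >= 2.1

pipeline = Pipeline(stages=[lr])
model = pipeline.fit(trainingData)

model.transform(trainingData).select("label", "prediction").show(5)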

PySpark: processing stream data and saving processed data to a file

Submitted by 坚强是说给别人听的谎言 on 2019-12-25 08:04:31
Question: I am trying to replicate a device that is streaming its location coordinates, then process the data and save it to a text file. I am using Kafka and Spark Streaming (on PySpark). This is my architecture:

1 - A Kafka producer emits data to a topic named test in the following string format: "LG float LT float", for example: LG 8100.25191107 LT 8406.43141483

Producer code:

from kafka import KafkaProducer
import random

producer = KafkaProducer(bootstrap_servers='localhost:9092')
for i in range(0
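
A consumer-side sketch, assuming the old DStream Kafka integration (the spark-streaming-kafka-0-8 package) is available; it parses the "LG float LT float" strings and writes each batch out with saveAsTextFiles under a hypothetical output prefix:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="LocationStream")
ssc = StreamingContext(sc, 5)

stream = KafkaUtils.createDirectStream(
    ssc, ["test"], {"metadata.broker.list": "localhost:9092"})

def parse(msg):
    # msg is a (key, value) pair; the value looks like "LG 8100.25 LT 8406.43".
    parts = msg[1].split()
    return (float(parts[1]), float(parts[3]))

stream.map(parse).saveAsTextFiles("/tmp/coords")   # hypothetical output prefix

ssc.start()
ssc.awaitTermination()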