pyspark

How to correctly get the weights using spark for synthetic dataset?

烈酒焚心 submitted on 2020-01-07 03:14:12
Question: I'm running LogisticRegressionWithSGD on Spark on a synthetic dataset. I've calculated the error in MATLAB using vanilla gradient descent and in R, and it's ~5%; I also recovered weights close to the ones used in the model that generated y. The dataset was generated using this example. While I can get a very similar error rate in Spark by tuning the step size, the weights for the individual features aren't the same; in fact, they vary a lot. I tried LBFGS on Spark and it's able to predict both…
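
A minimal sketch of the comparison I'm describing, assuming synthetic labels drawn from known weights w_true (this generator is my own stand-in, not the linked example); the same RDD is trained with both SGD and LBFGS so the recovered weights can be compared:

    import numpy as np
    from pyspark import SparkContext
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.classification import (LogisticRegressionWithSGD,
                                              LogisticRegressionWithLBFGS)

    sc = SparkContext(appName="synthetic-logreg")

    # Stand-in synthetic data: labels sampled from a logistic model with known weights.
    np.random.seed(42)
    w_true = np.array([1.5, -2.0, 0.5])
    X = np.random.randn(5000, 3)
    probs = 1.0 / (1.0 + np.exp(-X.dot(w_true)))
    y = (np.random.rand(len(probs)) < probs).astype(float)

    points = sc.parallelize(
        [LabeledPoint(label, features) for label, features in zip(y, X)]).cache()

    # SGD is sensitive to the step size; LBFGS typically lands much closer
    # to w_true without any tuning.
    sgd_model = LogisticRegressionWithSGD.train(points, iterations=200, step=1.0)
    lbfgs_model = LogisticRegressionWithLBFGS.train(points, iterations=200)

    print("true weights :", w_true)
    print("SGD weights  :", sgd_model.weights)
    print("LBFGS weights:", lbfgs_model.weights)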

PySpark : KeyError when converting a DataFrame column of String type to Double

生来就可爱ヽ(ⅴ<●) submitted on 2020-01-07 03:00:16
Question: I'm trying to learn machine learning with PySpark. I have a dataset with a couple of String columns whose values are either True or False, or Yes or No. I'm working with DecisionTree and I want to convert these String values to the corresponding Double values, i.e. True and Yes should become 1.0, while False and No should become 0.0. I saw a tutorial where they did the same thing and I came up with this code: df = sqlContext.read.csv("C:/../churn-bigml-20.csv", inferSchema=True, header…
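
A minimal sketch of the conversion I'm after, using when/otherwise; the column names here are made up to stand in for the churn CSV's columns:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("yes-no-to-double").getOrCreate()

    # Stand-in for the churn dataframe; column names are hypothetical.
    df = spark.createDataFrame([("Yes", "True"), ("No", "False")],
                               ["International plan", "Churn"])

    def to_double(col_name):
        # Map the textual flags to 1.0 / 0.0; anything else becomes null.
        return (F.when(F.col(col_name).isin("Yes", "True"), 1.0)
                 .when(F.col(col_name).isin("No", "False"), 0.0)
                 .cast("double"))

    df = (df.withColumn("International plan", to_double("International plan"))
            .withColumn("Churn", to_double("Churn")))
    df.show()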

pyspark: In spite of adding winutils to HADOOP_HOME, getting error: Could not locate executable null\bin\winutils.exe in the Hadoop binaries

江枫思渺然 submitted on 2020-01-07 02:36:52
Question: I set the winutils.exe path in the HADOOP_HOME environment variable. I also set other paths such as Python, Spark, and Java, and added all of them to the PATH variable for pyspark. When running pyspark from the command prompt I still get the error: ERROR Shell: Failed to locate the winutils binary in the hadoop binary path java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries. at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:379) at org.apache.hadoop…
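
A minimal sketch of the workaround that usually helps here, assuming winutils.exe actually sits under C:\hadoop\bin (that path is an assumption): set HADOOP_HOME inside the Python process before the JVM is launched, so the Hadoop Shell class no longer sees a null home directory:

    import os

    # Assumed location; winutils.exe must live at %HADOOP_HOME%\bin\winutils.exe.
    os.environ["HADOOP_HOME"] = r"C:\hadoop"
    os.environ["PATH"] = os.environ["HADOOP_HOME"] + r"\bin;" + os.environ["PATH"]

    from pyspark.sql import SparkSession

    # The JVM started by pyspark inherits this environment, so HADOOP_HOME is visible to it.
    spark = (SparkSession.builder
             .master("local[*]")
             .appName("winutils-check")
             .getOrCreate())
    print(spark.range(5).count())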

No module named py4j.protocol on Eclipse (PyDev)

穿精又带淫゛_ submitted on 2020-01-06 21:43:09
Question: I configured Eclipse to develop with Spark and Python. I configured: 1. PyDev with the Python interpreter, 2. PyDev with the Spark Python sources, 3. PyDev with the Spark environment variables. My Libraries configuration and my Environment configuration are attached as screenshots. I created a project named CompensationStudy and I want to run a small example to make sure everything goes smoothly. This is my code: from pyspark import SparkConf, SparkContext import os sparkConf =…
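
A minimal sketch of the sys.path setup I'd expect to fix the missing py4j.protocol module, assuming SPARK_HOME points at a local Spark install (the /opt/spark fallback is just a placeholder); the same two entries can instead be added to PYTHONPATH in the PyDev run configuration:

    import glob
    import os
    import sys

    # Placeholder fallback; point SPARK_HOME at your own Spark installation.
    spark_home = os.environ.get("SPARK_HOME", "/opt/spark")

    # Spark ships py4j as a zip under python/lib; both the pyspark sources and
    # that zip have to be importable, otherwise py4j.protocol cannot be found.
    sys.path.insert(0, os.path.join(spark_home, "python"))
    sys.path.extend(glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*.zip")))

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("CompensationStudy").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    print(sc.parallelize(range(10)).sum())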

pyspark streaming restore from checkpoint

∥☆過路亽.° submitted on 2020-01-06 20:11:03
Question: I use pyspark streaming with checkpoints enabled. The first launch succeeds, but on restart it crashes with the error: INFO scheduler.DAGScheduler: ResultStage 6 (runJob at PythonRDD.scala:441) failed in 1,160 s due to Job aborted due to stage failure: Task 0 in stage 6.0 failed 4 times, most recent failure: Lost task 0.3 in stage 6.0 (TID 86, h-1.e-contenta.com, executor 2): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/data1/yarn/nm/usercache…
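
A minimal sketch of the recovery pattern in question, with a hypothetical socket source and checkpoint directory; the important part is that the whole DStream graph is built inside the function passed to StreamingContext.getOrCreate, since on restart Spark restores the graph from the checkpoint instead of calling it:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    # Hypothetical checkpoint directory; on YARN this would normally live on HDFS.
    CHECKPOINT_DIR = "hdfs:///tmp/streaming-checkpoint"

    def create_context():
        # All DStream setup must happen here, not outside.
        sc = SparkContext(appName="checkpointed-stream")
        ssc = StreamingContext(sc, batchDuration=10)
        ssc.checkpoint(CHECKPOINT_DIR)

        lines = ssc.socketTextStream("localhost", 9999)  # hypothetical source
        (lines.map(lambda line: (line, 1))
              .reduceByKey(lambda a, b: a + b)
              .pprint())
        return ssc

    ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, create_context)
    ssc.start()
    ssc.awaitTermination()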

Spark 1.6 DirectFileOutputCommitter

青春壹個敷衍的年華 submitted on 2020-01-06 19:59:37
Question: I am having a problem saving text files to S3 using pyspark. I am able to save to S3, but the job first uploads to a _temporary location on S3 and then copies to the intended location, which increases the job's run time significantly. I have attempted to compile a DirectFileOutputCommitter, which should write directly to the intended S3 URL, but I cannot get Spark to use this class. Example: someRDD.saveAsTextFile("s3a://somebucket/savefolder") this creates a s3a://somebucket/savefolder/…
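
A minimal sketch of how one might try to wire the committer in, assuming the compiled committer jar is already on the driver and executor classpaths; the fully qualified class name below is a placeholder for whatever the compiled DirectFileOutputCommitter is actually called (saveAsTextFile goes through the old mapred API, so the old-API property is the one that matters):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("s3-direct-commit")
            # spark.hadoop.* settings are copied into the Hadoop Configuration;
            # the class name here is a placeholder for the compiled committer.
            .set("spark.hadoop.mapred.output.committer.class",
                 "com.example.hadoop.DirectFileOutputCommitter"))

    sc = SparkContext(conf=conf)
    sc.parallelize(["a", "b", "c"]).saveAsTextFile("s3a://somebucket/savefolder")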

How to batch up items from a PySpark DataFrame

佐手、 submitted on 2020-01-06 15:01:17
Question: I have a PySpark data frame and for each batch of records I want to call an API. So basically, say I have 100000k records; I want to batch the items into groups of, say, 1000 and call the API once per group. How can I do this with PySpark? The reason for the batching is that the API probably won't accept a huge chunk of data from a Big Data system. I first thought of LIMIT, but that won't be "deterministic", and it also seems like it would be inefficient. Answer 1: df.foreachPartition { ele => ele.grouped…
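
A PySpark sketch of the idea in the (truncated) Scala answer above: iterate each partition, slice it into fixed-size chunks with itertools.islice, and call the API once per chunk. Here call_api and the batch size are placeholders:

    from itertools import islice

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("batched-api-calls").getOrCreate()
    df = spark.range(100000)  # stand-in for the real DataFrame

    BATCH_SIZE = 1000

    def call_api(batch):
        # Placeholder for the real call, e.g. requests.post(url, json=batch).
        print("sending %d records" % len(batch))

    def send_partition(rows):
        # rows is an iterator over one partition; never materialize it all at once.
        it = iter(rows)
        while True:
            batch = [row.asDict() for row in islice(it, BATCH_SIZE)]
            if not batch:
                break
            call_api(batch)

    df.foreachPartition(send_partition)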

What is the best possible way of interacting with Hbase using Pyspark

被刻印的时光 ゝ submitted on 2020-01-06 12:51:14
Question: I am using pyspark [spark 2.3.1] and HBase 1.2.1, and I am wondering what the best possible way of accessing HBase from pyspark is. I did some initial searching and found that a few options are available, such as using shc-core:1.1.1-2.1-s_2.11.jar, but wherever I look for examples, most of the code is written in Scala or the examples are Scala based. I tried implementing basic code in pyspark: from pyspark import SparkContext from pyspark…
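
A minimal pyspark sketch of reading through shc-core, assuming the connector jar is supplied at launch (e.g. via --packages com.hortonworks:shc-core:1.1.1-2.1-s_2.11); the catalog describes a made-up HBase table "books" with a single column family "info":

    import json

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hbase-read").getOrCreate()

    # Hypothetical catalog: HBase table "books", string row key, one column family "info".
    catalog = json.dumps({
        "table": {"namespace": "default", "name": "books"},
        "rowkey": "key",
        "columns": {
            "id":    {"cf": "rowkey", "col": "key",   "type": "string"},
            "title": {"cf": "info",   "col": "title", "type": "string"},
            "price": {"cf": "info",   "col": "price", "type": "double"},
        },
    })

    df = (spark.read
          .options(catalog=catalog)
          .format("org.apache.spark.sql.execution.datasources.hbase")
          .load())
    df.show()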

conditional aggregation using pyspark

 ̄綄美尐妖づ submitted on 2020-01-06 11:40:27
Question: Consider the below as the dataframe:

    a       b    c  d   e
    africa  123  1  10  121.2
    africa  123  1  10  321.98
    africa  123  2  12  43.92
    africa  124  2  12  43.92
    usa     121  1  12  825.32
    usa     121  1  12  89.78
    usa     123  2  10  32.24
    usa     123  5  21  43.92
    canada  132  2  13  63.21
    canada  132  2  13  89.23
    canada  132  3  21  85.32
    canada  131  3  10  43.92

Now I want to convert the case statement below into an equivalent expression in PySpark using DataFrames. We can use the case statement directly via HiveContext/SQLContext, but I'm looking for the…
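
Since the original case statement is cut off above, here is a sketch of the general pattern with a made-up condition (sum e only where c = 1, per value of a, i.e. SUM(CASE WHEN c = 1 THEN e ELSE 0 END) in SQL) expressed with when/otherwise inside an aggregation:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("conditional-agg").getOrCreate()

    df = spark.createDataFrame(
        [("africa", 123, 1, 10, 121.2), ("africa", 123, 1, 10, 321.98),
         ("africa", 123, 2, 12, 43.92), ("usa", 121, 1, 12, 825.32),
         ("canada", 132, 2, 13, 63.21), ("canada", 131, 3, 10, 43.92)],
        ["a", "b", "c", "d", "e"])

    # Hypothetical CASE: SUM(CASE WHEN c = 1 THEN e ELSE 0 END) grouped by a.
    result = (df.groupBy("a")
                .agg(F.sum(F.when(F.col("c") == 1, F.col("e")).otherwise(0.0))
                      .alias("e_sum_c1")))
    result.show()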

Ambiguous behavior while adding new column to StructType

丶灬走出姿态 submitted on 2020-01-06 09:55:31
Question: I defined a function in PySpark which is:

    from pyspark.sql.types import LongType

    def add_ids(X):
        # Extend the existing schema with an id_col field.
        schema_new = X.schema.add("id_col", LongType(), False)
        # Zip every row with its index and rebuild the DataFrame.
        _X = X.rdd.zipWithIndex().map(lambda l: list(l[0]) + [l[1]]).toDF(schema_new)
        # Move id_col to the front.
        cols_arranged = [_X.columns[-1]] + _X.columns[0:len(_X.columns) - 1]
        return _X.select(*cols_arranged)

In the function above, I'm creating a new column (named id_col) that gets appended to the dataframe and is basically just the index number of each row, and it finally moves the id_col to the…
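
For what it's worth, a small sketch of the behavior I suspect is behind the ambiguity: in the Spark versions I've checked, StructType.add appends to the same StructType object and returns it, so the original dataframe's cached schema ends up reporting the extra field as well (this is my reading of the behavior, not an authoritative statement):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import LongType

    spark = SparkSession.builder.appName("structtype-add").getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["x", "y"])

    fields_before = len(df.schema.fields)
    schema_new = df.schema.add("id_col", LongType(), False)

    # add() appears to mutate and return the same object, so df.schema changes too.
    print(schema_new is df.schema)               # expected: True
    print(fields_before, len(df.schema.fields))  # expected: 2 3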