pyspark

How to correctly get the weights using spark for synthetic dataset?

烈酒焚心 submitted on 2020-01-07 03:14:12
Question: I'm running LogisticRegressionWithSGD on Spark on a synthetic dataset. I've calculated the error in MATLAB using vanilla gradient descent and in R, and it's ~5%; I also recovered weights close to the ones used in the model that generated y. The dataset was generated using this example. While I can get a very similar error rate in Spark by tuning the step size, the weights for the individual features aren't the same; in fact, they vary a lot. I tried LBFGS on Spark and it's able to predict both…
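
A minimal sketch of the comparison I'm describing, assuming synthetic labels drawn from known weights w_true (this generator is my own stand-in, not the linked example); the same RDD is trained with both SGD and LBFGS so the recovered weights can be compared:

    import numpy as np
    from pyspark import SparkContext
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.classification import (LogisticRegressionWithSGD,
                                              LogisticRegressionWithLBFGS)

    sc = SparkContext(appName="synthetic-logreg")

    # Stand-in synthetic data: labels sampled from a logistic model with known weights.
    np.random.seed(42)
    w_true = np.array([1.5, -2.0, 0.5])
    X = np.random.randn(5000, 3)
    probs = 1.0 / (1.0 + np.exp(-X.dot(w_true)))
    y = (np.random.rand(len(probs)) < probs).astype(float)

    points = sc.parallelize(
        [LabeledPoint(label, features) for label, features in zip(y, X)]).cache()

    # SGD is sensitive to the step size; LBFGS typically lands much closer
    # to w_true without any tuning.
    sgd_model = LogisticRegressionWithSGD.train(points, iterations=200, step=1.0)
    lbfgs_model = LogisticRegressionWithLBFGS.train(points, iterations=200)

    print("true weights :", w_true)
    print("SGD weights  :", sgd_model.weights)
    print("LBFGS weights:", lbfgs_model.weights)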

PySpark : KeyError when converting a DataFrame column of String type to Double

生来就可爱ヽ(ⅴ<●) submitted on 2020-01-07 03:00:16
Question: I'm trying to learn machine learning with PySpark. I have a dataset with a couple of String columns whose values are either True or False, or Yes or No. I'm working with DecisionTree and I want to convert these String values to the corresponding Double values, i.e. True and Yes should become 1.0, while False and No should become 0.0. I saw a tutorial where they did the same thing and I came up with this code: df = sqlContext.read.csv("C:/../churn-bigml-20.csv", inferSchema=True, header…
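
A minimal sketch of the conversion I'm after, using when/otherwise; the column names here are made up to stand in for the churn CSV's columns:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("yes-no-to-double").getOrCreate()

    # Stand-in for the churn dataframe; column names are hypothetical.
    df = spark.createDataFrame([("Yes", "True"), ("No", "False")],
                               ["International plan", "Churn"])

    def to_double(col_name):
        # Map the textual flags to 1.0 / 0.0; anything else becomes null.
        return (F.when(F.col(col_name).isin("Yes", "True"), 1.0)
                 .when(F.col(col_name).isin("No", "False"), 0.0)
                 .cast("double"))

    df = (df.withColumn("International plan", to_double("International plan"))
            .withColumn("Churn", to_double("Churn")))
    df.show()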

pyspark: In spite of adding winutils to HADOOP_HOME, getting error: Could not locate executable null\bin\winutils.exe in the Hadoop binaries

江枫思渺然 submitted on 2020-01-07 02:36:52
Question: I set the winutils.exe path in the HADOOP_HOME environment variable. I also set other paths such as Python, Spark, and Java, and added all of them to the PATH variable for pyspark. When running pyspark from the command prompt I still get the error: ERROR Shell: Failed to locate the winutils binary in the hadoop binary path java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries. at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:379) at org.apache.hadoop…
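
A minimal sketch of the workaround that usually helps here, assuming winutils.exe actually sits under C:\hadoop\bin (that path is an assumption): set HADOOP_HOME inside the Python process before the JVM is launched, so the Hadoop Shell class no longer sees a null home directory:

    import os

    # Assumed location; winutils.exe must live at %HADOOP_HOME%\bin\winutils.exe.
    os.environ["HADOOP_HOME"] = r"C:\hadoop"
    os.environ["PATH"] = os.environ["HADOOP_HOME"] + r"\bin;" + os.environ["PATH"]

    from pyspark.sql import SparkSession

    # The JVM started by pyspark inherits this environment, so HADOOP_HOME is visible to it.
    spark = (SparkSession.builder
             .master("local[*]")
             .appName("winutils-check")
             .getOrCreate())
    print(spark.range(5).count())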

No module named py4j.protocol on Eclipse (PyDev)

穿精又带淫゛_ submitted on 2020-01-06 21:43:09
Question: I configured Eclipse to develop with Spark and Python. I configured: 1. PyDev with the Python interpreter, 2. PyDev with the Spark Python sources, 3. PyDev with the Spark environment variables. My Libraries configuration and my Environment configuration are attached as screenshots. I created a project named CompensationStudy and I want to run a small example to make sure everything goes smoothly. This is my code: from pyspark import SparkConf, SparkContext import os sparkConf =…
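
A minimal sketch of the sys.path setup I'd expect to fix the missing py4j.protocol module, assuming SPARK_HOME points at a local Spark install (the /opt/spark fallback is just a placeholder); the same two entries can instead be added to PYTHONPATH in the PyDev run configuration:

    import glob
    import os
    import sys

    # Placeholder fallback; point SPARK_HOME at your own Spark installation.
    spark_home = os.environ.get("SPARK_HOME", "/opt/spark")

    # Spark ships py4j as a zip under python/lib; both the pyspark sources and
    # that zip have to be importable, otherwise py4j.protocol cannot be found.
    sys.path.insert(0, os.path.join(spark_home, "python"))
    sys.path.extend(glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*.zip")))

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("CompensationStudy").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    print(sc.parallelize(range(10)).sum())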

pyspark streaming restore from checkpoint

∥☆過路亽.° submitted on 2020-01-06 20:11:03
Question: I use pyspark streaming with checkpoints enabled. The first launch succeeds, but on restart it crashes with the error: INFO scheduler.DAGScheduler: ResultStage 6 (runJob at PythonRDD.scala:441) failed in 1,160 s due to Job aborted due to stage failure: Task 0 in stage 6.0 failed 4 times, most recent failure: Lost task 0.3 in stage 6.0 (TID 86, h-1.e-contenta.com, executor 2): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/data1/yarn/nm/usercache…
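
A minimal sketch of the recovery pattern in question, with a hypothetical socket source and checkpoint directory; the important part is that the whole DStream graph is built inside the function passed to StreamingContext.getOrCreate, since on restart Spark restores the graph from the checkpoint instead of calling it:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    # Hypothetical checkpoint directory; on YARN this would normally live on HDFS.
    CHECKPOINT_DIR = "hdfs:///tmp/streaming-checkpoint"

    def create_context():
        # All DStream setup must happen here, not outside.
        sc = SparkContext(appName="checkpointed-stream")
        ssc = StreamingContext(sc, batchDuration=10)
        ssc.checkpoint(CHECKPOINT_DIR)

        lines = ssc.socketTextStream("localhost", 9999)  # hypothetical source
        (lines.map(lambda line: (line, 1))
              .reduceByKey(lambda a, b: a + b)
              .pprint())
        return ssc

    ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, create_context)
    ssc.start()
    ssc.awaitTermination()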

Spark 1.6 DirectFileOutputCommitter

青春壹個敷衍的年華 submitted on 2020-01-06 19:59:37
Question: I am having a problem saving text files to S3 using pyspark. I am able to save to S3, but the job first uploads to a _temporary location on S3 and then copies to the intended location, which increases the job's run time significantly. I have attempted to compile a DirectFileOutputCommitter, which should write directly to the intended S3 URL, but I cannot get Spark to use this class. Example: someRDD.saveAsTextFile("s3a://somebucket/savefolder") this creates a s3a://somebucket/savefolder/…
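
A minimal sketch of how one might try to wire the committer in, assuming the compiled committer jar is already on the driver and executor classpaths; the fully qualified class name below is a placeholder for whatever the compiled DirectFileOutputCommitter is actually called (saveAsTextFile goes through the old mapred API, so the old-API property is the one that matters):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("s3-direct-commit")
            # spark.hadoop.* settings are copied into the Hadoop Configuration;
            # the class name here is a placeholder for the compiled committer.
            .set("spark.hadoop.mapred.output.committer.class",
                 "com.example.hadoop.DirectFileOutputCommitter"))

    sc = SparkContext(conf=conf)
    sc.parallelize(["a", "b", "c"]).saveAsTextFile("s3a://somebucket/savefolder")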

How to batch up items from a PySpark DataFrame

佐手、 submitted on 2020-01-06 15:01:17
Question: I have a PySpark data frame and for each batch of records I want to call an API. So basically, say I have 100000k records; I want to batch the items into groups of, say, 1000 and call the API once per group. How can I do this with PySpark? The reason for the batching is that the API probably won't accept a huge chunk of data from a Big Data system. I first thought of LIMIT, but that won't be "deterministic", and it also seems like it would be inefficient. Answer 1: df.foreachPartition { ele => ele.grouped…
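
A PySpark sketch of the idea in the (truncated) Scala answer above: iterate each partition, slice it into fixed-size chunks with itertools.islice, and call the API once per chunk. Here call_api and the batch size are placeholders:

    from itertools import islice

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("batched-api-calls").getOrCreate()
    df = spark.range(100000)  # stand-in for the real DataFrame

    BATCH_SIZE = 1000

    def call_api(batch):
        # Placeholder for the real call, e.g. requests.post(url, json=batch).
        print("sending %d records" % len(batch))

    def send_partition(rows):
        # rows is an iterator over one partition; never materialize it all at once.
        it = iter(rows)
        while True:
            batch = [row.asDict() for row in islice(it, BATCH_SIZE)]
            if not batch:
                break
            call_api(batch)

    df.foreachPartition(send_partition)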

What is the best possible way of interacting with Hbase using Pyspark

被刻印的时光 ゝ submitted on 2020-01-06 12:51:14
Question: I am using pyspark [spark 2.3.1] and HBase 1.2.1, and I am wondering what the best possible way of accessing HBase from pyspark is. I did some initial searching and found that a few options are available, such as using shc-core:1.1.1-2.1-s_2.11.jar, but wherever I look for examples, most of the code is written in Scala or the examples are Scala based. I tried implementing basic code in pyspark: from pyspark import SparkContext from pyspark…
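
A minimal pyspark sketch of reading through shc-core, assuming the connector jar is supplied at launch (e.g. via --packages com.hortonworks:shc-core:1.1.1-2.1-s_2.11); the catalog describes a made-up HBase table "books" with a single column family "info":

    import json

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hbase-read").getOrCreate()

    # Hypothetical catalog: HBase table "books", string row key, one column family "info".
    catalog = json.dumps({
        "table": {"namespace": "default", "name": "books"},
        "rowkey": "key",
        "columns": {
            "id":    {"cf": "rowkey", "col": "key",   "type": "string"},
            "title": {"cf": "info",   "col": "title", "type": "string"},
            "price": {"cf": "info",   "col": "price", "type": "double"},
        },
    })

    df = (spark.read
          .options(catalog=catalog)
          .format("org.apache.spark.sql.execution.datasources.hbase")
          .load())
    df.show()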

conditional aggregation using pyspark

 ̄綄美尐妖づ submitted on 2020-01-06 11:40:27
Question: Consider the below as the dataframe:

    a       b    c  d   e
    africa  123  1  10  121.2
    africa  123  1  10  321.98
    africa  123  2  12  43.92
    africa  124  2  12  43.92
    usa     121  1  12  825.32
    usa     121  1  12  89.78
    usa     123  2  10  32.24
    usa     123  5  21  43.92
    canada  132  2  13  63.21
    canada  132  2  13  89.23
    canada  132  3  21  85.32
    canada  131  3  10  43.92

Now I want to convert the case statement below into an equivalent expression in PySpark using DataFrames. We can use the case statement directly via HiveContext/SQLContext, but I'm looking for the…
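
Since the original case statement is cut off above, here is a sketch of the general pattern with a made-up condition (sum e only where c = 1, per value of a, i.e. SUM(CASE WHEN c = 1 THEN e ELSE 0 END) in SQL) expressed with when/otherwise inside an aggregation:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("conditional-agg").getOrCreate()

    df = spark.createDataFrame(
        [("africa", 123, 1, 10, 121.2), ("africa", 123, 1, 10, 321.98),
         ("africa", 123, 2, 12, 43.92), ("usa", 121, 1, 12, 825.32),
         ("canada", 132, 2, 13, 63.21), ("canada", 131, 3, 10, 43.92)],
        ["a", "b", "c", "d", "e"])

    # Hypothetical CASE: SUM(CASE WHEN c = 1 THEN e ELSE 0 END) grouped by a.
    result = (df.groupBy("a")
                .agg(F.sum(F.when(F.col("c") == 1, F.col("e")).otherwise(0.0))
                      .alias("e_sum_c1")))
    result.show()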

Ambiguous behavior while adding new column to StructType

丶灬走出姿态 submitted on 2020-01-06 09:55:31
Question: I defined a function in PySpark which is:

    from pyspark.sql.types import LongType

    def add_ids(X):
        # Extend the existing schema with an id_col field.
        schema_new = X.schema.add("id_col", LongType(), False)
        # Zip every row with its index and rebuild the DataFrame.
        _X = X.rdd.zipWithIndex().map(lambda l: list(l[0]) + [l[1]]).toDF(schema_new)
        # Move id_col to the front.
        cols_arranged = [_X.columns[-1]] + _X.columns[0:len(_X.columns) - 1]
        return _X.select(*cols_arranged)

In the function above, I'm creating a new column (named id_col) that gets appended to the dataframe and is basically just the index number of each row, and it finally moves the id_col to the…
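
For what it's worth, a small sketch of the behavior I suspect is behind the ambiguity: in the Spark versions I've checked, StructType.add appends to the same StructType object and returns it, so the original dataframe's cached schema ends up reporting the extra field as well (this is my reading of the behavior, not an authoritative statement):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import LongType

    spark = SparkSession.builder.appName("structtype-add").getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["x", "y"])

    fields_before = len(df.schema.fields)
    schema_new = df.schema.add("id_col", LongType(), False)

    # add() appears to mutate and return the same object, so df.schema changes too.
    print(schema_new is df.schema)               # expected: True
    print(fields_before, len(df.schema.fields))  # expected: 2 3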