apache-spark-mllib

Spark data type guesser UDAF

强颜欢笑 posted on 2019-11-28 06:47:23
Question: I wanted to take something like this https://github.com/fitzscott/AirQuality/blob/master/HiveDataTypeGuesser.java and create a Hive UDAF, i.e. an aggregate function that returns a data type guess. Does Spark have something like this already built in? It would be very useful for exploring new, wide datasets. It would be helpful for ML too, e.g. to decide between categorical and numerical variables. How do you normally determine data types in Spark? P.S. Frameworks like h2o automatically determine data
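Spark's closest built-in answer is the inferSchema option of the DataFrame CSV reader. For a rough, hand-rolled guess over a DataFrame read with all columns as strings, a sketch along these lines might work; the file path is hypothetical, and approx_count_distinct assumes a reasonably recent Spark version:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical input: every column read as a string.
df = spark.read.option("header", True).csv("/path/to/wide_dataset.csv")

# For each column, count how many non-null values survive a cast to double.
# A column where the castable count equals the non-null count is a numeric
# candidate; a low distinct count relative to rows suggests a categorical.
stats = df.select(
    [F.count(F.col(c)).alias(c + "_nonnull") for c in df.columns]
    + [F.count(F.col(c).cast("double")).alias(c + "_as_double") for c in df.columns]
    + [F.approx_count_distinct(F.col(c)).alias(c + "_distinct") for c in df.columns]
).first()

for c in df.columns:
    print(c, stats[c + "_nonnull"], stats[c + "_as_double"], stats[c + "_distinct"])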

How can I evaluate the implicit feedback ALS algorithm for recommendations in Apache Spark?

守給你的承諾、 posted on 2019-11-28 06:39:20
How can you evaluate the implicit-feedback collaborative filtering algorithm of Apache Spark, given that the implicit "ratings" can vary from zero to anything, so a simple MSE or RMSE does not have much meaning? To answer this question, you'll need to go back to the original paper that defined implicit feedback and the ALS algorithm: Collaborative Filtering for Implicit Feedback Datasets by Yifan Hu, Yehuda Koren and Chris Volinsky. What is implicit feedback? In the absence of explicit ratings, recommender systems can infer user preferences from the more abundant implicit feedback,
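One practical way to act on the paper's advice is to evaluate as a ranking problem rather than a rating-prediction problem. A minimal PySpark sketch, assuming DataFrame-based ALS (Spark 2.x), hypothetical column names userId/movieId/clicks, and pre-split train/test DataFrames:

from pyspark.ml.recommendation import ALS
from pyspark.mllib.evaluation import RankingMetrics
from pyspark.sql import functions as F

als = ALS(implicitPrefs=True, rank=10, regParam=0.1, alpha=40.0,
          userCol="userId", itemCol="movieId", ratingCol="clicks")
model = als.fit(train)

k = 10
# recommendForAllUsers returns (userId, recommendations); extract the item ids.
recs = model.recommendForAllUsers(k) \
    .select("userId", F.col("recommendations.movieId").alias("pred"))

# Ground truth: the items each user actually interacted with in the held-out set.
truth = test.groupBy("userId").agg(F.collect_list("movieId").alias("actual"))

pred_and_labels = recs.join(truth, "userId") \
    .rdd.map(lambda row: (row["pred"], row["actual"]))

metrics = RankingMetrics(pred_and_labels)
print(metrics.meanAveragePrecision, metrics.precisionAt(k), metrics.ndcgAt(k))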

Spark Multiclass Classification Example

故事扮演 posted on 2019-11-28 06:06:40
Does anyone know where I can find examples of multiclass classification in Spark? I spent a lot of time searching in books and on the web, and so far I only know that it is possible as of the latest version, according to the documentation. zero323: ML (recommended in Spark 2.0+). We'll use the same data as in the MLlib example below. There are two basic options. If an Estimator supports multiclass classification out of the box (for example, random forest), you can use it directly: val trainRawDf = trainRaw.toDF import org.apache.spark.ml.feature.{Tokenizer, CountVectorizer, StringIndexer} import org.apache
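For reference, a PySpark equivalent of the pipeline the answer sketches in Scala might look like the following; the column names text/category and the DataFrames trainDF/testDF are hypothetical:

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, CountVectorizer, StringIndexer
from pyspark.ml.classification import RandomForestClassifier

# Map the string class column to a numeric label, turn text into token counts,
# then fit a classifier that handles more than two classes out of the box.
indexer = StringIndexer(inputCol="category", outputCol="label")
tokenizer = Tokenizer(inputCol="text", outputCol="words")
vectorizer = CountVectorizer(inputCol="words", outputCol="features")
rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=50)

pipeline = Pipeline(stages=[indexer, tokenizer, vectorizer, rf])
model = pipeline.fit(trainDF)          # hypothetical training DataFrame
predictions = model.transform(testDF)  # adds prediction / probability columns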

Create labeledPoints from Spark DataFrame in Python

一世执手 posted on 2019-11-28 06:04:25
What .map() function in Python do I use to create a set of labeledPoints from a Spark DataFrame? What is the notation if the label/outcome is not the first column, but I can refer to it by its column name, 'status'? I create the Python DataFrame with this .map() function: def parsePoint(line): listmp = list(line.split('\t')) dataframe = pd.DataFrame(pd.get_dummies(listmp[1:]).sum()).transpose() dataframe.insert(0, 'status', dataframe['accepted']) if 'NULL' in dataframe.columns: dataframe = dataframe.drop('NULL', axis=1) if '' in dataframe.columns: dataframe = dataframe.drop('', axis=1) if 'rejected'
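A minimal sketch of the kind of .map() the question asks for, assuming df is a Spark DataFrame whose 'status' column is the label and whose remaining columns are all numeric features (this uses the RDD-based MLlib LabeledPoint):

from pyspark.mllib.regression import LabeledPoint

# Everything except the label column is treated as a feature.
feature_cols = [c for c in df.columns if c != "status"]

labeled = df.rdd.map(
    lambda row: LabeledPoint(float(row["status"]),
                             [float(row[c]) for c in feature_cols]))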

How to update Spark MatrixFactorizationModel for ALS

廉价感情. posted on 2019-11-28 05:04:55
I built a simple recommendation system for the MovieLens DB, inspired by https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html . I also have problems with explicit training, like here: "Apache Spark ALS collaborative filtering results. They don't make sense". Using implicit training (on both explicit and implicit data) gives me reasonable results, but explicit training doesn't. While this is OK for me for now, I'm curious how to update a model. While my current solution works like: have all user ratings, generate the model, get recommendations for the user; I want to have a flow
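ALS in Spark has no incremental-update API, so a common workaround is to "fold in" a new user against the fixed item factors and only retrain the full model periodically. A rough NumPy sketch of that idea, assuming the item factors have already been collected from a trained model; this is an approximation, not a Spark API:

import numpy as np

# item_factors: dict itemId -> np.array, e.g. built from a trained model via
#   {row["id"]: np.array(row["features"]) for row in als_model.itemFactors.collect()}
# new_ratings: list of (itemId, rating) pairs for the new user.

def fold_in_user(item_factors, new_ratings, reg=0.1):
    Y = np.array([item_factors[i] for i, _ in new_ratings])  # factors of rated items
    r = np.array([rating for _, rating in new_ratings])
    k = Y.shape[1]
    # Ridge-regularized normal equations: (Y^T Y + reg*I) x = Y^T r
    x = np.linalg.solve(Y.T @ Y + reg * np.eye(k), Y.T @ r)
    return x  # score a candidate item j as item_factors[j] @ x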

RDD to LabeledPoint conversion

痞子三分冷 posted on 2019-11-28 04:19:59
Question: I have an RDD with about 500 columns and 200 million rows, and RDD.columns.indexOf("target", 0) shows Int = 77, which tells me that my target dependent variable is at column number 77. But I don't have enough knowledge of how to select the desired (partial) columns as features (say I want columns 23 to 59, 111 to 357, and 399 to 489). I am wondering if I can apply something like: val data = rdd.map(col => new LabeledPoint( col(77).toDouble, Vectors.dense(??.map(x => x.toDouble).toArray)) Any suggestions or
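The question is in Scala, but the indexing idea is the same in any language. A hedged PySpark sketch, assuming rdd yields indexable sequences of strings and treating the quoted column ranges as 0-based and inclusive:

from pyspark.mllib.regression import LabeledPoint

# Hypothetical index ranges taken from the question text.
feature_idx = list(range(23, 60)) + list(range(111, 358)) + list(range(399, 490))
label_idx = 77

data = rdd.map(lambda cols: LabeledPoint(
    float(cols[label_idx]),
    [float(cols[i]) for i in feature_idx]))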

Forward fill missing values in Spark/Python

老子叫甜甜 posted on 2019-11-28 03:58:45
Question: I am attempting to fill in missing values in my Spark DataFrame with the previous non-null value (if it exists). I've done this type of thing in Python/Pandas, but my data is too big for Pandas (on a small cluster) and I'm a Spark noob. Is this something Spark can do? Can it do it for multiple columns? If so, how? If not, any suggestions for alternative approaches within the whole Hadoop suite of tools? Thanks! Answer 1: I've found a solution that works without additional coding, by using a Window here.
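The Window-based approach the answer refers to typically uses last() with ignorenulls over an unbounded-preceding frame. A sketch with hypothetical column names ts/value/id; for multiple columns, repeat the withColumn over a list of column names:

from pyspark.sql import Window, functions as F

# "ts" orders the rows; "value" has the gaps to forward-fill.
w = Window.orderBy("ts").rowsBetween(Window.unboundedPreceding, 0)
# Partition by an id column instead if the fill must not cross entity boundaries
# (also avoids pulling all rows into a single partition):
# w = Window.partitionBy("id").orderBy("ts").rowsBetween(Window.unboundedPreceding, 0)

filled = df.withColumn("value_filled", F.last("value", ignorenulls=True).over(w))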

How to use the PySpark CountVectorizer on columns that maybe null

折月煮酒 posted on 2019-11-28 02:17:53
I have a column in my Spark DataFrame: |-- topics_A: array (nullable = true) | |-- element: string (containsNull = true) I'm using CountVectorizer on it: topic_vectorizer_A = CountVectorizer(inputCol="topics_A", outputCol="topics_vec_A") I get NullPointerExceptions because sometimes the topics_A column contains null. Is there a way around this? Filling it with a zero-length array would work OK (although it will blow out the data size quite a lot), but I can't work out how to do a fillNa on an array column in PySpark. Personally I would drop columns with NULL values because there is no useful
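One way to fill the null arrays without dropping the column is coalesce with an empty, correctly typed array, since fillna itself does not accept array columns. A sketch, assuming the same column names as the question:

from pyspark.sql import functions as F
from pyspark.ml.feature import CountVectorizer

# Replace null arrays with an empty string array before vectorizing.
df_clean = df.withColumn(
    "topics_A",
    F.coalesce(F.col("topics_A"), F.array().cast("array<string>")))

topic_vectorizer_A = CountVectorizer(inputCol="topics_A", outputCol="topics_vec_A")
model = topic_vectorizer_A.fit(df_clean)
vectorized = model.transform(df_clean)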

Spark MLlib linear regression (linear least squares) giving random results

心已入冬 posted on 2019-11-28 02:01:44
I'm new to Spark and machine learning in general. I have successfully followed some of the MLlib tutorials, but I can't get this one working. I found the sample code here: https://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression (section LinearRegressionWithSGD). Here is the code: import org.apache.spark.mllib.regression.LabeledPoint import org.apache.spark.mllib.regression.LinearRegressionModel import org.apache.spark.mllib.regression.LinearRegressionWithSGD import org.apache.spark.mllib.linalg.Vectors // Load and parse the data val data = sc
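The usual cause of apparently random results from LinearRegressionWithSGD is an untuned step size on unscaled features. A hedged sketch of two common fixes, standardizing the features and switching to the DataFrame-based LinearRegression, which does not need an SGD step size; a trainDF with label/features columns (e.g. from a VectorAssembler) is assumed:

from pyspark.ml.feature import StandardScaler
from pyspark.ml.regression import LinearRegression

# Standardize features so all dimensions are on a comparable scale.
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                        withMean=True, withStd=True)
scaled = scaler.fit(trainDF).transform(trainDF)

# DataFrame-based linear regression; no stepSize to tune.
lr = LinearRegression(featuresCol="scaledFeatures", labelCol="label",
                      maxIter=100, regParam=0.0)
lrModel = lr.fit(scaled)
print(lrModel.coefficients, lrModel.intercept)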

Why does StandardScaler not attach metadata to the output column?

妖精的绣舞 posted on 2019-11-28 01:38:36
I noticed that the ml StandardScaler does not attach metadata to the output column: import org.apache.spark.ml.Pipeline import org.apache.spark.ml.feature._ val df = spark.read.option("header", true) .option("inferSchema", true) .csv("/path/to/cars.data") val strId1 = new StringIndexer() .setInputCol("v7") .setOutputCol("v7_IDX") val strId2 = new StringIndexer() .setInputCol("v8") .setOutputCol("v8_IDX") val assmbleFeatures: VectorAssembler = new VectorAssembler() .setInputCols(Array("v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7_IDX")) .setOutputCol("featuresRaw") val scalerModel = new
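Since the scaler works per feature, the attribute metadata on the assembler output still describes the scaled column, so one workaround is to copy it across by hand. A PySpark sketch, assuming the scaler output column is named featuresScaled (the actual name is cut off in the excerpt) and that Column.alias accepts a metadata keyword (Spark 2.2+); note that numeric attribute ranges in the copied metadata no longer reflect the scaled values:

from pyspark.sql import functions as F

# "featuresRaw" carries the ML attribute metadata written by VectorAssembler;
# "featuresScaled" is the StandardScaler output that lost it.
raw_meta = scaled_df.schema["featuresRaw"].metadata

with_meta = scaled_df.withColumn(
    "featuresScaled",
    F.col("featuresScaled").alias("featuresScaled", metadata=raw_meta))

print(with_meta.schema["featuresScaled"].metadata)  # attributes are now attached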