apache-spark-mllib

Spark data type guesser UDAF

Submitted by 徘徊边缘 on 2019-11-29 12:54:31
I wanted to take something like https://github.com/fitzscott/AirQuality/blob/master/HiveDataTypeGuesser.java and create a Hive UDAF, i.e. an aggregate function that returns a guess of a column's data type. Does Spark have something like this already built in? It would be very useful when exploring new, wide datasets, and helpful for ML too, e.g. to decide between categorical and numerical variables. How do you normally determine data types in Spark? P.S. Frameworks like H2O automatically determine the data type by scanning a sample of the data, or the whole dataset, so that one can then decide, e.g., whether a variable should be …
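For reference, a minimal sketch of such a guesser written as a plain driver-side function rather than a UDAF. This is not a built-in Spark API; the function name guessType, the sample fraction, and the cast-trying order are all illustrative assumptions, and it assumes the column holds strings:

```scala
import org.apache.spark.sql.DataFrame
import scala.util.Try

// Hypothetical sketch: guess a string column's type from a sample by
// attempting progressively looser parses (long, then double, else string).
def guessType(df: DataFrame, colName: String, fraction: Double = 0.01): String = {
  val sample = df.select(colName)
    .sample(withReplacement = false, fraction)
    .na.drop()
    .collect()
    .map(_.getString(0))
  if (sample.nonEmpty && sample.forall(s => Try(s.toLong).isSuccess)) "integer"
  else if (sample.nonEmpty && sample.forall(s => Try(s.toDouble).isSuccess)) "double"
  else "string"
}
```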

RDD to LabeledPoint conversion

Submitted by 有些话、适合烂在心里 on 2019-11-29 10:56:51
I have an RDD with about 500 columns and 200 million rows, and RDD.columns.indexOf("target", 0) shows Int = 77, which tells me my target (dependent) variable is at column 77. But I don't know enough to select the desired (partial) columns as features (say, columns 23 to 59, 111 to 357, and 399 to 489). I am wondering if I can apply something like: val data = rdd.map(col => new LabeledPoint(col(77).toDouble, Vectors.dense(??.map(x => x.toDouble).toArray))) Any suggestions or guidance would be much appreciated. Maybe I mixed up RDD with DataFrame; I can convert the RDD to …
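A minimal sketch of one way to fill in the ??: build the list of wanted indices once and reuse it per row. It assumes the rows are arrays of strings (the ranges come from the question; adjust for 0- vs 1-based counting):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Feature column indices, built once and reused for every row.
val featureIdx = ((23 to 59) ++ (111 to 357) ++ (399 to 489)).toArray

// Assumes rdd: RDD[Array[String]] with the label at index 77.
val data = rdd.map { row =>
  LabeledPoint(row(77).toDouble,
    Vectors.dense(featureIdx.map(i => row(i).toDouble)))
}
```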

Sparse Vector vs Dense Vector

Submitted by 喜欢而已 on 2019-11-29 03:06:02
Question: How do I create the SparseVector and DenseVector representations if the dense vector is denseV = np.array([0., 3., 0., 4.])? What would the sparse vector representation be?

Answer 1: Unless I have thoroughly misunderstood your question, the MLlib data types documentation illustrates this quite clearly:

import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

// Create a dense vector (1.0, 0.0, 3.0).
Vector dv = Vectors.dense(1.0, 0.0, 3.0);
// Create a sparse vector (1.0, 0 …
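Applied to the vector in the question, a short Scala sketch; a sparse vector is specified as (size, indices of non-zeros, values of non-zeros):

```scala
import org.apache.spark.mllib.linalg.Vectors

// Dense: every entry is stored explicitly.
val dense = Vectors.dense(0.0, 3.0, 0.0, 4.0)

// Sparse: size 4, non-zeros at indices 1 and 3 with values 3.0 and 4.0.
val sparse = Vectors.sparse(4, Array(1, 3), Array(3.0, 4.0))

// Both represent the same vector (0.0, 3.0, 0.0, 4.0).
```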

How to use mllib.recommendation if the user ids are string instead of contiguous integers?

Submitted by ⅰ亾dé卋堺 on 2019-11-29 01:56:42
I want to use Spark's mllib.recommendation library to build a prototype recommender system. However, the user data I have is in the following format: AB123XY45678 CD234WZ12345 EF345OOO1234 GH456XY98765 .... If I want to use the mllib.recommendation library, then according to the API of the Rating class the user ids have to be integers (do they also have to be contiguous?). It looks like some kind of conversion between the real user ids and the numeric ones used by Spark must be done, but how should I do this? Spark doesn't really require a numeric id, it just needs to be some unique …
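A sketch of the usual workaround: assign each string id a unique integer with zipWithUniqueId and keep the reverse mapping to translate recommendations back. The names and the assumption that product ids are already numeric are mine; note the generated ids are unique but not contiguous, which ALS does not require:

```scala
import org.apache.spark.mllib.recommendation.Rating

// Assumes ratings: RDD[(String, Int, Double)] of (userId, productId, rating).
val userIdToInt = ratings.map(_._1).distinct().zipWithUniqueId()
  .mapValues(_.toInt)

// Reverse mapping, for translating model output back to the original ids.
val intToUserId = userIdToInt.map(_.swap)

val mllibRatings = ratings
  .map { case (user, product, rating) => (user, (product, rating)) }
  .join(userIdToInt)
  .map { case (_, ((product, rating), userInt)) => Rating(userInt, product, rating) }
```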

Calculate Cosine Similarity Spark Dataframe

Submitted by 我的未来我决定 on 2019-11-29 01:51:54
I am using Spark Scala to calculate cosine similarity between the DataFrame rows. The DataFrame schema is:

root
 |-- SKU: double (nullable = true)
 |-- Features: vector (nullable = true)

A sample of the DataFrame:

+-------+--------------------+
|    SKU|            Features|
+-------+--------------------+
| 9970.0|[4.7143,0.0,5.785...|
|19676.0|[5.5,0.0,6.4286,4...|
| 3296.0|[4.7143,1.4286,6....|
|13658.0|[6.2857,0.7143,4....|
|    1.0|[4.2308,0.7692,5....|
|  513.0|[3.0,0.0,4.9091,5...|
| 3753.0|[5.9231,0.0,4.846...|
|14967.0|[4.5833,0.8333,5....|
| 2803.0|[4.2308,0.0,4.846...|
|11879.0|[3.1429,0.0,4.5,4...|
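One approach, as a hedged sketch: self-join the DataFrame and compute the cosine in a UDF. Column names follow the schema above; the assumption that Features is an org.apache.spark.ml.linalg.Vector is mine:

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions._

// Cosine similarity between two feature vectors: dot(a, b) / (|a| * |b|).
val cosine = udf { (a: Vector, b: Vector) =>
  val dot = a.toArray.zip(b.toArray).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.toArray.map(x => x * x).sum)
  val normB = math.sqrt(b.toArray.map(x => x * x).sum)
  if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
}

// All distinct pairs; the SKU < SKU filter avoids self-pairs and duplicates.
val pairs = df.as("l").crossJoin(df.as("r"))
  .where(col("l.SKU") < col("r.SKU"))
  .select(col("l.SKU").as("sku1"), col("r.SKU").as("sku2"),
    cosine(col("l.Features"), col("r.Features")).as("cosSim"))
```

Bear in mind this cross join is O(n²) in the number of rows; for large tables an approximate approach (e.g. normalizing the rows and using Spark ML's LSH feature transformers) scales better.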

PySpark & MLLib: Class Probabilities of Random Forest Predictions

Submitted by 会有一股神秘感。 on 2019-11-29 01:00:37
Question: I'm trying to extract the class probabilities of a random forest object I have trained using PySpark. However, I do not see an example of this anywhere in the documentation, nor is it a method of RandomForestModel. How can I extract class probabilities from a RandomForestModel classifier in PySpark? Here's the sample code from the documentation, which only provides the final class (not the probability): from pyspark.mllib.tree import RandomForest from pyspark.mllib.util import MLUtils # …
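The usual workaround, sketched here in Scala for consistency with the rest of this page: the old MLlib ensemble exposes its individual trees, so the per-class vote fraction can stand in for a probability. Treat this as an approximation, not the calibrated probabilities of spark.ml's newer RandomForestClassifier:

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.tree.model.RandomForestModel

// Fraction of trees voting for each class, as a stand-in for P(class | x).
def classVoteFractions(model: RandomForestModel, features: Vector): Map[Double, Double] = {
  val votes = model.trees.map(_.predict(features))
  votes.groupBy(identity).map { case (cls, vs) =>
    cls -> vs.length.toDouble / votes.length
  }
}
```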

Spark Word2vec vector mathematics

Submitted by 我怕爱的太早我们不能终老 on 2019-11-29 01:00:25
Question: I was looking at the Word2Vec example on the Spark site:

val input = sc.textFile("text8").map(line => line.split(" ").toSeq)
val word2vec = new Word2Vec()
val model = word2vec.fit(input)
val synonyms = model.findSynonyms("country name here", 40)

How do I do interesting vector arithmetic such as king - man + woman = queen? I can use model.getVectors, but am not sure how to proceed further.

Answer 1: Here is an example in pyspark, which I guess is straightforward to port to Scala - the key is the use of …
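A minimal Scala sketch of the analogy trick: pull the raw word vectors out with getVectors, do the arithmetic element-wise, and hand the result to the Vector overload of findSynonyms. It assumes all three words are in the vocabulary:

```scala
import org.apache.spark.mllib.feature.Word2VecModel
import org.apache.spark.mllib.linalg.Vectors

// king - man + woman, then look up the nearest words to the result.
def analogy(model: Word2VecModel, a: String, b: String, c: String, n: Int) = {
  val vecs = model.getVectors  // Map[String, Array[Float]]
  val target = vecs(a).indices
    .map(i => (vecs(a)(i) - vecs(b)(i) + vecs(c)(i)).toDouble)
    .toArray
  model.findSynonyms(Vectors.dense(target), n)
}

// analogy(model, "king", "man", "woman", 5) should rank "queen" highly.
```

Note that findSynonyms does not filter out the input words, so "king" itself may appear near the top of the results.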

How to encode categorical features in Apache Spark

Submitted by 点点圈 on 2019-11-29 00:24:23
I have a set of data from which I want to create a classification model. Each row has the following form:

user1,class1,product1
user1,class1,product2
user1,class1,product5
user2,class1,product2
user2,class1,product5
user3,class2,product1

There are about 1M users, 2 classes, and 1M products. What I would like to do next is create the sparse vectors (something already supported by MLlib), BUT in order to apply that function I first have to create the dense vectors (with the 0s). In other words, I have to binarize my data. What's the easiest (or most elegant) way of doing that? Given that I …
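You can skip the dense intermediate entirely: Vectors.sparse takes the size plus only the non-zero (index, value) pairs. A sketch under the assumption that the data is an RDD[(String, String, String)] of (user, class, product):

```scala
import org.apache.spark.mllib.linalg.Vectors

// Assumes rows: RDD[(String, String, String)] of (user, class, product).
// Assign each product a stable column index.
val productIndex = rows.map(_._3).distinct().zipWithIndex()
  .mapValues(_.toInt).collectAsMap()
val numProducts = productIndex.size

// One sparse binary vector per (user, class): 1.0 at each owned product.
val userVectors = rows
  .map { case (user, cls, product) => ((user, cls), productIndex(product)) }
  .groupByKey()
  .mapValues { idxs =>
    val nonZeros = idxs.toArray.distinct.sorted.map(i => (i, 1.0))
    Vectors.sparse(numProducts, nonZeros)
  }
```

With 1M products the driver-side lookup map should really be a broadcast variable, but the shape of the solution stays the same.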

Optimal way to create a ml pipeline in Apache Spark for dataset with high number of columns

Submitted by 人盡茶涼 on 2019-11-28 22:02:45
Question: I am working with Spark 2.1.1 on a dataset with ~2000 features and trying to create a basic ML Pipeline consisting of some Transformers and a Classifier. Let's assume, for the sake of simplicity, that the Pipeline I am working with consists of a VectorAssembler, a StringIndexer and a Classifier, which would be a fairly common use case.

// Pipeline elements
val assmbleFeatures: VectorAssembler = new VectorAssembler()
  .setInputCols(featureColumns)
  .setOutputCol("featuresRaw")
val labelIndexer: …
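For context, a completed version of the pipeline the question describes; the classifier choice, the body of the truncated labelIndexer, and the trainingDf name are my assumptions:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

val assmbleFeatures = new VectorAssembler()
  .setInputCols(featureColumns)
  .setOutputCol("featuresRaw")

val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("labelIndexed")

val classifier = new RandomForestClassifier()
  .setFeaturesCol("featuresRaw")
  .setLabelCol("labelIndexed")

// Fitting the pipeline runs each stage in order over the training data.
val pipeline = new Pipeline()
  .setStages(Array(assmbleFeatures, labelIndexer, classifier))
val model = pipeline.fit(trainingDf)
```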

How to map variable names to features after pipeline

Submitted by 非 Y 不嫁゛ on 2019-11-28 21:59:06
I have modified the OneHotEncoder example to actually train a LogisticRegression model. My question is: how do I map the generated weights back to the categorical variables?

def oneHotEncoderExample(sqlContext: SQLContext): Unit = {
  val df = sqlContext.createDataFrame(Seq(
    (0, "a", 1.0),
    (1, "b", 1.0),
    (2, "c", 0.0),
    (3, "d", 1.0),
    (4, "e", 1.0),
    (5, "f", 0.0)
  )).toDF("id", "category", "label")
  df.show()

  val indexer = new StringIndexer()
    .setInputCol("category")
    .setOutputCol("categoryIndex")
    .fit(df)
  val indexed = indexer.transform(df)
  indexed.select("id", "categoryIndex").show()

  val encoder = new …
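A sketch of the mapping step, assuming the (truncated) code above goes on to fit a LogisticRegressionModel named lrModel on the one-hot encoded column: StringIndexerModel.labels gives the category sitting at each index position, and the model's coefficients line up with the one-hot positions:

```scala
// indexer.labels(i) is the category encoded at one-hot position i.
val categories = indexer.labels
val weights = lrModel.coefficients.toArray

// With OneHotEncoder's default dropLast = true, the last category has no
// weight of its own (it is the reference level), so zip simply drops it.
categories.zip(weights).foreach { case (category, weight) =>
  println(s"category $category -> weight $weight")
}
```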