apache-spark-mllib

Spark data type guesser UDAF

Submitted by 徘徊边缘 on 2019-11-29 12:54:31
I wanted to take something like https://github.com/fitzscott/AirQuality/blob/master/HiveDataTypeGuesser.java and create a Hive UDAF, i.e. an aggregate function that returns a guess of a column's data type. Does Spark have something like this already built in? It would be very useful when exploring new, wide datasets, and helpful for ML too, e.g. to decide between categorical and numerical variables. How do you normally determine data types in Spark? P.S. Frameworks like H2O automatically determine the data type by scanning a sample of the data, or the whole dataset, so that one can then decide, e.g., whether a variable should be …
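For reference, a minimal sketch of such a guesser written as a plain driver-side function rather than a UDAF. This is not a built-in Spark API; the function name guessType, the sample fraction, and the cast-trying order are all illustrative assumptions, and it assumes the column holds strings:

```scala
import org.apache.spark.sql.DataFrame
import scala.util.Try

// Hypothetical sketch: guess a string column's type from a sample by
// attempting progressively looser parses (long, then double, else string).
def guessType(df: DataFrame, colName: String, fraction: Double = 0.01): String = {
  val sample = df.select(colName)
    .sample(withReplacement = false, fraction)
    .na.drop()
    .collect()
    .map(_.getString(0))
  if (sample.nonEmpty && sample.forall(s => Try(s.toLong).isSuccess)) "integer"
  else if (sample.nonEmpty && sample.forall(s => Try(s.toDouble).isSuccess)) "double"
  else "string"
}
```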

RDD to LabeledPoint conversion

Submitted by 有些话、适合烂在心里 on 2019-11-29 10:56:51
I have an RDD with about 500 columns and 200 million rows, and RDD.columns.indexOf("target", 0) shows Int = 77, which tells me my target (dependent) variable is at column 77. But I don't know enough to select the desired (partial) columns as features (say, columns 23 to 59, 111 to 357, and 399 to 489). I am wondering if I can apply something like: val data = rdd.map(col => new LabeledPoint(col(77).toDouble, Vectors.dense(??.map(x => x.toDouble).toArray))) Any suggestions or guidance would be much appreciated. Maybe I mixed up RDD with DataFrame; I can convert the RDD to …
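A minimal sketch of one way to fill in the ??: build the list of wanted indices once and reuse it per row. It assumes the rows are arrays of strings (the ranges come from the question; adjust for 0- vs 1-based counting):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Feature column indices, built once and reused for every row.
val featureIdx = ((23 to 59) ++ (111 to 357) ++ (399 to 489)).toArray

// Assumes rdd: RDD[Array[String]] with the label at index 77.
val data = rdd.map { row =>
  LabeledPoint(row(77).toDouble,
    Vectors.dense(featureIdx.map(i => row(i).toDouble)))
}
```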

Sparse Vector vs Dense Vector

Submitted by 喜欢而已 on 2019-11-29 03:06:02
Question: How do I create the SparseVector and DenseVector representations if the dense vector is denseV = np.array([0., 3., 0., 4.])? What would the sparse vector representation be?

Answer 1: Unless I have thoroughly misunderstood your question, the MLlib data types documentation illustrates this quite clearly:

import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

// Create a dense vector (1.0, 0.0, 3.0).
Vector dv = Vectors.dense(1.0, 0.0, 3.0);
// Create a sparse vector (1.0, 0 …
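Applied to the vector in the question, a short Scala sketch; a sparse vector is specified as (size, indices of non-zeros, values of non-zeros):

```scala
import org.apache.spark.mllib.linalg.Vectors

// Dense: every entry is stored explicitly.
val dense = Vectors.dense(0.0, 3.0, 0.0, 4.0)

// Sparse: size 4, non-zeros at indices 1 and 3 with values 3.0 and 4.0.
val sparse = Vectors.sparse(4, Array(1, 3), Array(3.0, 4.0))

// Both represent the same vector (0.0, 3.0, 0.0, 4.0).
```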

How to use mllib.recommendation if the user ids are string instead of contiguous integers?

Submitted by ⅰ亾dé卋堺 on 2019-11-29 01:56:42
I want to use Spark's mllib.recommendation library to build a prototype recommender system. However, the user data I have is in the following format: AB123XY45678 CD234WZ12345 EF345OOO1234 GH456XY98765 .... If I want to use the mllib.recommendation library, then according to the API of the Rating class the user ids have to be integers (do they also have to be contiguous?). It looks like some kind of conversion between the real user ids and the numeric ones used by Spark must be done, but how should I do this? Spark doesn't really require a numeric id, it just needs to be some unique …
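A sketch of the usual workaround: assign each string id a unique integer with zipWithUniqueId and keep the reverse mapping to translate recommendations back. The names and the assumption that product ids are already numeric are mine; note the generated ids are unique but not contiguous, which ALS does not require:

```scala
import org.apache.spark.mllib.recommendation.Rating

// Assumes ratings: RDD[(String, Int, Double)] of (userId, productId, rating).
val userIdToInt = ratings.map(_._1).distinct().zipWithUniqueId()
  .mapValues(_.toInt)

// Reverse mapping, for translating model output back to the original ids.
val intToUserId = userIdToInt.map(_.swap)

val mllibRatings = ratings
  .map { case (user, product, rating) => (user, (product, rating)) }
  .join(userIdToInt)
  .map { case (_, ((product, rating), userInt)) => Rating(userInt, product, rating) }
```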

Calculate Cosine Similarity Spark Dataframe

Submitted by 我的未来我决定 on 2019-11-29 01:51:54
I am using Spark Scala to calculate cosine similarity between the DataFrame rows. The DataFrame schema is:

root
 |-- SKU: double (nullable = true)
 |-- Features: vector (nullable = true)

A sample of the DataFrame:

+-------+--------------------+
|    SKU|            Features|
+-------+--------------------+
| 9970.0|[4.7143,0.0,5.785...|
|19676.0|[5.5,0.0,6.4286,4...|
| 3296.0|[4.7143,1.4286,6....|
|13658.0|[6.2857,0.7143,4....|
|    1.0|[4.2308,0.7692,5....|
|  513.0|[3.0,0.0,4.9091,5...|
| 3753.0|[5.9231,0.0,4.846...|
|14967.0|[4.5833,0.8333,5....|
| 2803.0|[4.2308,0.0,4.846...|
|11879.0|[3.1429,0.0,4.5,4...|
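One approach, as a hedged sketch: self-join the DataFrame and compute the cosine in a UDF. Column names follow the schema above; the assumption that Features is an org.apache.spark.ml.linalg.Vector is mine:

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions._

// Cosine similarity between two feature vectors: dot(a, b) / (|a| * |b|).
val cosine = udf { (a: Vector, b: Vector) =>
  val dot = a.toArray.zip(b.toArray).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.toArray.map(x => x * x).sum)
  val normB = math.sqrt(b.toArray.map(x => x * x).sum)
  if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
}

// All distinct pairs; the SKU < SKU filter avoids self-pairs and duplicates.
val pairs = df.as("l").crossJoin(df.as("r"))
  .where(col("l.SKU") < col("r.SKU"))
  .select(col("l.SKU").as("sku1"), col("r.SKU").as("sku2"),
    cosine(col("l.Features"), col("r.Features")).as("cosSim"))
```

Bear in mind this cross join is O(n²) in the number of rows; for large tables an approximate approach (e.g. normalizing the rows and using Spark ML's LSH feature transformers) scales better.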

PySpark & MLLib: Class Probabilities of Random Forest Predictions

Submitted by 会有一股神秘感。 on 2019-11-29 01:00:37
Question: I'm trying to extract the class probabilities of a random forest object I have trained using PySpark. However, I do not see an example of this anywhere in the documentation, nor is it a method of RandomForestModel. How can I extract class probabilities from a RandomForestModel classifier in PySpark? Here's the sample code from the documentation, which only provides the final class (not the probability): from pyspark.mllib.tree import RandomForest from pyspark.mllib.util import MLUtils # …
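The usual workaround, sketched here in Scala for consistency with the rest of this page: the old MLlib ensemble exposes its individual trees, so the per-class vote fraction can stand in for a probability. Treat this as an approximation, not the calibrated probabilities of spark.ml's newer RandomForestClassifier:

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.tree.model.RandomForestModel

// Fraction of trees voting for each class, as a stand-in for P(class | x).
def classVoteFractions(model: RandomForestModel, features: Vector): Map[Double, Double] = {
  val votes = model.trees.map(_.predict(features))
  votes.groupBy(identity).map { case (cls, vs) =>
    cls -> vs.length.toDouble / votes.length
  }
}
```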

Spark Word2vec vector mathematics

Submitted by 我怕爱的太早我们不能终老 on 2019-11-29 01:00:25
Question: I was looking at the Word2Vec example on the Spark site:

val input = sc.textFile("text8").map(line => line.split(" ").toSeq)
val word2vec = new Word2Vec()
val model = word2vec.fit(input)
val synonyms = model.findSynonyms("country name here", 40)

How do I do interesting vector arithmetic such as king - man + woman = queen? I can use model.getVectors, but am not sure how to proceed further.

Answer 1: Here is an example in pyspark, which I guess is straightforward to port to Scala - the key is the use of …
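A minimal Scala sketch of the analogy trick: pull the raw word vectors out with getVectors, do the arithmetic element-wise, and hand the result to the Vector overload of findSynonyms. It assumes all three words are in the vocabulary:

```scala
import org.apache.spark.mllib.feature.Word2VecModel
import org.apache.spark.mllib.linalg.Vectors

// king - man + woman, then look up the nearest words to the result.
def analogy(model: Word2VecModel, a: String, b: String, c: String, n: Int) = {
  val vecs = model.getVectors  // Map[String, Array[Float]]
  val target = vecs(a).indices
    .map(i => (vecs(a)(i) - vecs(b)(i) + vecs(c)(i)).toDouble)
    .toArray
  model.findSynonyms(Vectors.dense(target), n)
}

// analogy(model, "king", "man", "woman", 5) should rank "queen" highly.
```

Note that findSynonyms does not filter out the input words, so "king" itself may appear near the top of the results.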

How to encode categorical features in Apache Spark

Submitted by 点点圈 on 2019-11-29 00:24:23
I have a set of data from which I want to create a classification model. Each row has the following form:

user1,class1,product1
user1,class1,product2
user1,class1,product5
user2,class1,product2
user2,class1,product5
user3,class2,product1

There are about 1M users, 2 classes, and 1M products. What I would like to do next is create the sparse vectors (something already supported by MLlib), BUT in order to apply that function I first have to create the dense vectors (with the 0s). In other words, I have to binarize my data. What's the easiest (or most elegant) way of doing that? Given that I …
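You can skip the dense intermediate entirely: Vectors.sparse takes the size plus only the non-zero (index, value) pairs. A sketch under the assumption that the data is an RDD[(String, String, String)] of (user, class, product):

```scala
import org.apache.spark.mllib.linalg.Vectors

// Assumes rows: RDD[(String, String, String)] of (user, class, product).
// Assign each product a stable column index.
val productIndex = rows.map(_._3).distinct().zipWithIndex()
  .mapValues(_.toInt).collectAsMap()
val numProducts = productIndex.size

// One sparse binary vector per (user, class): 1.0 at each owned product.
val userVectors = rows
  .map { case (user, cls, product) => ((user, cls), productIndex(product)) }
  .groupByKey()
  .mapValues { idxs =>
    val nonZeros = idxs.toArray.distinct.sorted.map(i => (i, 1.0))
    Vectors.sparse(numProducts, nonZeros)
  }
```

With 1M products the driver-side lookup map should really be a broadcast variable, but the shape of the solution stays the same.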

Optimal way to create a ml pipeline in Apache Spark for dataset with high number of columns

Submitted by 人盡茶涼 on 2019-11-28 22:02:45
Question: I am working with Spark 2.1.1 on a dataset with ~2000 features and trying to create a basic ML Pipeline consisting of some Transformers and a Classifier. Let's assume, for the sake of simplicity, that the Pipeline I am working with consists of a VectorAssembler, a StringIndexer and a Classifier, which would be a fairly common use case.

// Pipeline elements
val assmbleFeatures: VectorAssembler = new VectorAssembler()
  .setInputCols(featureColumns)
  .setOutputCol("featuresRaw")
val labelIndexer: …
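For context, a completed version of the pipeline the question describes; the classifier choice, the body of the truncated labelIndexer, and the trainingDf name are my assumptions:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

val assmbleFeatures = new VectorAssembler()
  .setInputCols(featureColumns)
  .setOutputCol("featuresRaw")

val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("labelIndexed")

val classifier = new RandomForestClassifier()
  .setFeaturesCol("featuresRaw")
  .setLabelCol("labelIndexed")

// Fitting the pipeline runs each stage in order over the training data.
val pipeline = new Pipeline()
  .setStages(Array(assmbleFeatures, labelIndexer, classifier))
val model = pipeline.fit(trainingDf)
```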

How to map variable names to features after pipeline

Submitted by 非 Y 不嫁゛ on 2019-11-28 21:59:06
I have modified the OneHotEncoder example to actually train a LogisticRegression model. My question is: how do I map the generated weights back to the categorical variables?

def oneHotEncoderExample(sqlContext: SQLContext): Unit = {
  val df = sqlContext.createDataFrame(Seq(
    (0, "a", 1.0),
    (1, "b", 1.0),
    (2, "c", 0.0),
    (3, "d", 1.0),
    (4, "e", 1.0),
    (5, "f", 0.0)
  )).toDF("id", "category", "label")
  df.show()

  val indexer = new StringIndexer()
    .setInputCol("category")
    .setOutputCol("categoryIndex")
    .fit(df)
  val indexed = indexer.transform(df)
  indexed.select("id", "categoryIndex").show()

  val encoder = new …
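A sketch of the mapping step, assuming the (truncated) code above goes on to fit a LogisticRegressionModel named lrModel on the one-hot encoded column: StringIndexerModel.labels gives the category sitting at each index position, and the model's coefficients line up with the one-hot positions:

```scala
// indexer.labels(i) is the category encoded at one-hot position i.
val categories = indexer.labels
val weights = lrModel.coefficients.toArray

// With OneHotEncoder's default dropLast = true, the last category has no
// weight of its own (it is the reference level), so zip simply drops it.
categories.zip(weights).foreach { case (category, weight) =>
  println(s"category $category -> weight $weight")
}
```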