apache-spark-mllib

How can I vectorize Tweets using Spark's MLlib?

Submitted by 流过昼夜 on 2019-12-14 04:10:50

Question: I'd like to turn tweets into vectors for machine learning so that I can categorize them by content using Spark's K-Means clustering. For example, all tweets relating to Amazon get put into one category. I have tried splitting each tweet into words and creating a vector using HashingTF, which wasn't very successful. Are there any other ways to vectorize tweets?

Answer 1: You can try this pipeline: First, tokenize the input tweet (located in the column text). Basically, it creates a new column
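
A minimal Scala sketch of the kind of pipeline the answer describes: tokenize, drop stop words, hash term frequencies, reweight with IDF, then cluster. The column names ("text", "words", "filtered", "rawFeatures", "features"), the feature count, and k = 10 are illustrative assumptions, not values from the question.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.{HashingTF, IDF, RegexTokenizer, StopWordsRemover}

// tweets is assumed to be a DataFrame with a string column "text"
val tokenizer = new RegexTokenizer().setInputCol("text").setOutputCol("words").setPattern("\\W+")
val remover   = new StopWordsRemover().setInputCol("words").setOutputCol("filtered")
val tf        = new HashingTF().setInputCol("filtered").setOutputCol("rawFeatures").setNumFeatures(1 << 16)
val idf       = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val kmeans    = new KMeans().setK(10).setFeaturesCol("features")

val pipeline  = new Pipeline().setStages(Array(tokenizer, remover, tf, idf, kmeans))
val model     = pipeline.fit(tweets)
val clustered = model.transform(tweets)   // adds a "prediction" column with the cluster id

IDF reweighting tends to work noticeably better than raw hashed counts on short, noisy text like tweets, which may be why HashingTF alone "wasn't very successful".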

Preparing data for LDA in spark

Submitted by 半世苍凉 on 2019-12-14 03:44:37

Question: I'm working on implementing a Spark LDA model (via the Scala API), and am having trouble with the necessary formatting steps for my data. My raw data (stored in a text file) is essentially a list of tokens and the documents they correspond to, in the following format. A simplified example:

doc XXXXX term XXXXX
1   x     'a'  x
1   x     'a'  x
1   x     'b'  x
2   x     'b'  x
2   x     'd'  x
...

where the XXXXX columns are garbage data I don't care about. I realize this is an atypical way of storing corpus data, but it's
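
One way to get from that token list to the RDD[(Long, Vector)] that mllib's LDA expects, sketched in Scala. The file name, the column positions (doc id in field 0, term in field 2), and k = 5 are assumptions based on the simplified example; adjust them for the real data.

import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors

val raw = sc.textFile("tokens.txt")
  .map(_.split("\\s+"))
  .filter(f => f.length >= 3 && f(0).forall(_.isDigit))   // skip the header row
  .map(f => (f(0).toLong, f(2)))                          // (docId, term)

// Index the vocabulary so each term maps to a fixed vector position.
val vocab     = raw.map(_._2).distinct().collect().zipWithIndex.toMap
val vocabSize = vocab.size

// One term-count vector per document: the (docId, Vector) corpus LDA expects.
val corpus = raw
  .map { case (doc, term) => ((doc, vocab(term)), 1.0) }
  .reduceByKey(_ + _)
  .map { case ((doc, termIdx), cnt) => (doc, (termIdx, cnt)) }
  .groupByKey()
  .map { case (doc, counts) => (doc, Vectors.sparse(vocabSize, counts.toSeq)) }

val ldaModel = new LDA().setK(5).run(corpus)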

How to get Spark MLlib RandomForestModel.predict response as text value YES/NO?

Submitted by 只谈情不闲聊 on 2019-12-14 02:51:21

Question: Hi, I am trying to implement the RandomForest algorithm using Apache Spark MLlib. I have the dataset in CSV format with the following features:

DayOfWeek(int),AlertType(String),Application(String),Router(String),Symptom(String),Action(String)
0,Network1,App1,Router1,Not reachable,YES
0,Network1,App2,Router5,Not reachable,NO

I want to use MLlib's RandomForest to predict the last field, Action, and I want the response as YES/NO. I am following code from GitHub to create the RandomForest model. Since I
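
The mllib RandomForestModel only ever predicts the numeric class index it was trained on, so the usual pattern is to index the string label yourself and keep the reverse mapping to translate predictions back into YES/NO. A Scala sketch under that assumption; rawRows and newFeatures are hypothetical placeholders for whatever CSV parsing and feature encoding is already being done.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest

val labelToIndex = Map("NO" -> 0.0, "YES" -> 1.0)
val indexToLabel = labelToIndex.map(_.swap)     // 0.0 -> "NO", 1.0 -> "YES"

// rawRows: RDD[(Array[Double], String)] of (already-encoded features, Action) -- hypothetical
val training = rawRows.map { case (features, action) =>
  LabeledPoint(labelToIndex(action), Vectors.dense(features))
}

val model = RandomForest.trainClassifier(
  training, numClasses = 2, categoricalFeaturesInfo = Map[Int, Int](),
  numTrees = 50, featureSubsetStrategy = "auto", impurity = "gini",
  maxDepth = 5, maxBins = 32)

// predict() returns 0.0 or 1.0; map it back to the original text label
val answer: String = indexToLabel(model.predict(Vectors.dense(newFeatures)))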

SPARK: How to create categoricalFeaturesInfo for decision trees from LabeledPoint?

Submitted by 懵懂的女人 on 2019-12-13 15:40:27

Question: I've got a LabeledPoint on which I want to run a decision tree (and later a random forest):

scala> transformedData.collect
res8: Array[org.apache.spark.mllib.regression.LabeledPoint] = Array((0.0,(400036,[7744],[2.0])), (0.0,(400036,[7744,8608],[3.0,3.0])), (0.0,(400036,[7744],[2.0])), (0.0,(400036,[133,218,2162,7460,7744,9567],[1.0,1.0,2.0,1.0,42.0,21.0])), (0.0,(400036,[133,218,1589,2162,2784,2922,3274,6914,7008,7131,7460,8608,9437,9567,199999,200021,200035,200048,200051,200056,200058,200064
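
categoricalFeaturesInfo is just a Map[Int, Int] from feature index to the number of categories (arity); any feature left out of the map is treated as continuous, so an empty map is always a valid starting point. A Scala sketch; the index 7744 and arity 42 are purely illustrative, and categorical values must already be coded as 0, 1, ..., arity-1.

import org.apache.spark.mllib.tree.DecisionTree

// Hypothetical: declare feature 7744 categorical with 42 distinct values.
val categoricalFeaturesInfo = Map[Int, Int](7744 -> 42)

val model = DecisionTree.trainClassifier(
  input = transformedData,                        // the RDD[LabeledPoint] from the question
  numClasses = 2,
  categoricalFeaturesInfo = categoricalFeaturesInfo,
  impurity = "gini",
  maxDepth = 5,
  maxBins = 42)                                   // maxBins must be >= the largest arity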

PySpark: Calculate grouped-by AUC

Submitted by 社会主义新天地 on 2019-12-13 15:11:59

Question: Spark version: 1.6.0. I tried computing AUC (area under the ROC curve) grouped by the field id. Given the following data:

# Within each key-value pair:
#   key is "id"
#   value is a list of (score, label)
data = sc.parallelize(
    [('id1', [(0.5, 1.0), (0.6, 0.0), (0.7, 1.0), (0.8, 0.0)]),
     ('id2', [(0.5, 1.0), (0.6, 0.0), (0.7, 1.0), (0.8, 0.0)])])

The BinaryClassificationMetrics class can calculate the AUC given a list of (score, label) pairs. I want to compute AUC by key (i.e. id1, id2). But how to "map" a class
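
BinaryClassificationMetrics needs an RDD, so it cannot be instantiated inside a transformation that runs per key; the workaround is to compute the area under the ROC curve directly on each group's list of (score, label) pairs. A Scala sketch of that idea using the rank-based (Mann-Whitney) formulation; groupedScores is a hypothetical RDD[(String, Seq[(Double, Double)])] shaped like the question's data, and each group is assumed to contain at least one positive and one negative label.

// AUC for one group of (score, label) pairs, with average ranks for tied scores.
def auc(pairs: Seq[(Double, Double)]): Double = {
  val nPos = pairs.count(_._2 == 1.0).toDouble
  val nNeg = pairs.size - nPos
  val ranked = pairs.sortBy(_._1).zipWithIndex                       // rank by score, 0-based
  val avgRankByScore = ranked.groupBy(_._1._1)
    .mapValues(g => g.map(_._2 + 1.0).sum / g.size)                  // 1-based average rank
  val posRankSum = pairs.filter(_._2 == 1.0).map(p => avgRankByScore(p._1)).sum
  (posRankSum - nPos * (nPos + 1) / 2) / (nPos * nNeg)
}

val aucByKey = groupedScores.mapValues(auc).collect()                // Array[(String, Double)] of per-id AUC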

Spark - IllegalArgumentException in KMeans.train

Submitted by ╄→尐↘猪︶ㄣ on 2019-12-13 13:24:34

Question: I am running into an exception inside KMeans.train(), like below:

java.lang.IllegalArgumentException: requirement failed
  at scala.Predef$.require(Predef.scala:212)
  at org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:487)
  at org.apache.spark.mllib.clustering.KMeans$.fastSquaredDistance(KMeans.scala:589)
  at org.apache.spark.mllib.clustering.KMeans$$anonfun$runAlgorithm$3.apply(KMeans.scala:304)
  at org.apache.spark.mllib.clustering.KMeans$$anonfun$runAlgorithm$3.apply
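
The require that fails inside MLUtils.fastSquaredDistance checks, among other things, that the two vectors have the same dimension, so a frequent cause of this exception is an input RDD whose feature vectors are not all the same size. A quick Scala sanity check under that assumption; vectors, k and maxIterations are placeholders.

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector

// vectors: RDD[Vector] is a placeholder for the data passed to KMeans.train
val sizes = vectors.map(_.size).distinct().collect()
require(sizes.length == 1,
  s"feature vectors have inconsistent dimensions: ${sizes.mkString(", ")}")

val model = KMeans.train(vectors, k = 8, maxIterations = 20)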

How to use the PySpark CountVectorizer on columns that may be null

Submitted by 梦想与她 on 2019-12-13 05:47:27

Question: I have a column in my Spark DataFrame:

|-- topics_A: array (nullable = true)
|    |-- element: string (containsNull = true)

I'm using CountVectorizer on it:

topic_vectorizer_A = CountVectorizer(inputCol="topics_A", outputCol="topics_vec_A")

I get NullPointerExceptions because sometimes the topics_A column contains null. Is there a way around this? Filling it with a zero-length array would work OK (although it will blow out the data size quite a lot), but I can't work out how to do a fillNa on
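
One workaround, sketched here in Scala (the same idea can be expressed from PySpark with a UDF or a when/otherwise expression): replace null entries with an empty array before running the vectorizer, so CountVectorizer never sees a null row. df is a placeholder for the DataFrame holding the topics_A column.

import org.apache.spark.ml.feature.CountVectorizer
import org.apache.spark.sql.functions.{col, udf}

// Map null arrays to empty arrays; non-null rows pass through unchanged.
val emptyIfNull = udf((xs: Seq[String]) => Option(xs).getOrElse(Seq.empty[String]))

val cleaned = df.withColumn("topics_A", emptyIfNull(col("topics_A")))

val cvModel = new CountVectorizer()
  .setInputCol("topics_A")
  .setOutputCol("topics_vec_A")
  .fit(cleaned)

val vectorized = cvModel.transform(cleaned)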

How to resolve java.lang.NoSuchMethodError org.apache.spark.ml.util.SchemaUtils$.checkColumnType

Submitted by 风流意气都作罢 on 2019-12-13 04:46:20

Question: I am trying to run the CountVectorizerDemo program provided here: https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/ml/JavaCountVectorizerExample.java I'm getting the following error and don't know what the problem is:

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.ml.util.SchemaUtils$.checkColumnType$default$4()Ljava/lang/String;
  at org.apache.spark.ml.feature.CountVectorizerParams$class.validateAndTransformSchema
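
A NoSuchMethodError on a Spark-internal symbol like SchemaUtils almost always means the code was compiled against one Spark version but is running against another (for example, an example copied from the master branch being run on an older cluster, or mixed Spark artifacts on the classpath). Under that assumption, the fix is to pin every Spark dependency, and the Spark you submit to, to the same release. A hypothetical build.sbt sketch; the version numbers are placeholders, and a Maven pom works the same way.

// build.sbt: keep all Spark artifacts on one version that matches the cluster
scalaVersion := "2.11.12"

val sparkVersion = "2.4.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql"   % sparkVersion % "provided",
  "org.apache.spark" %% "spark-mllib" % sparkVersion % "provided"
)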

Is there a parameter in PySpark equivalent to scikit-learn's sample_weight?

Submitted by 匆匆过客 on 2019-12-13 04:38:41

Question: I am currently using the SGDClassifier provided by the scikit-learn library. When I use the fit method I can set the sample_weight parameter:

"Weights applied to individual samples. If not provided, uniform weights are assumed. These weights will be multiplied with class_weight (passed through the constructor) if class_weight is specified."

I want to switch to PySpark and use the LogisticRegression class, but I cannot find a parameter similar to sample_weight. There is a weightCol
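
Spark ML's closest analogue to sample_weight is indeed a per-row weight column passed via weightCol. A Scala sketch of the idea (PySpark's LogisticRegression accepts the same weightCol parameter); training is a placeholder DataFrame assumed to have label, features, and a Double column named weight.

import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setWeightCol("weight")   // each row's weight scales its contribution to the loss

val model = lr.fit(training)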

Create a Diagonal Matrix with specified number of rows and columns in Scala

Submitted by 本秂侑毒 on 2019-12-13 04:11:29

Question: I have an input mllib block matrix named matrix, like:

matrix: org.apache.spark.mllib.linalg.Matrix =
0.0  2.0  1.0  2.0
2.0  0.0  2.0  4.0
1.0  2.0  0.0  3.0
2.0  4.0  3.0  0.0

As per my Scala code, the diagonals will be zero for sure; I need the diagonals of the matrix to be 1. If I have a diagonal matrix with diagonal values of 1, like:

diagonalMatrix: org.apache.spark.mllib.linalg.Matrix =
1.0  0.0  0.0  0.0
0.0  1.0  0.0  0.0
0.0  0.0  1.0  0.0
0.0  0.0  0.0  1.0

I can add those matrices, so the diagonals of matrix
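
For the local mllib.linalg case, the identity matrix shown above can be constructed directly rather than by hand. A short Scala sketch; the size 4 matches the example, and Matrices.diag covers arbitrary diagonal values.

import org.apache.spark.mllib.linalg.{DenseMatrix, Matrices, Vectors}

// 4 x 4 identity matrix: ones on the diagonal, zeros elsewhere
val eye = DenseMatrix.eye(4)

// Same thing via an explicit diagonal vector
val diagonalMatrix = Matrices.diag(Vectors.dense(Array.fill(4)(1.0)))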