apache-spark-mllib

MatchError while accessing vector column in Spark 2.0

拈花ヽ惹草 submitted on 2019-11-26 07:49:19
Question: I am trying to create an LDA model on a JSON file. Creating a Spark session with the JSON file: import org.apache.spark.sql.SparkSession val spark = SparkSession.builder .master("local") .appName("my-spark-app") .config("spark.some.config.option", "config-value") .getOrCreate() val df = spark.read.json("dbfs:/mnt/JSON6/JSON/sampleDoc.txt") Displaying the df should show the DataFrame: display(df) Tokenize the text: import org.apache.spark.ml.feature.RegexTokenizer // Set params
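A minimal sketch of the pipeline the question describes, assuming the JSON documents have a string column named "text" (a hypothetical name; adjust to your schema). Note that the MatchError in the title typically comes from pattern-matching a DataFrame vector column against the old org.apache.spark.mllib.linalg.Vector type: in Spark 2.0, ml-pipeline output columns hold org.apache.spark.ml.linalg.Vector instead.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.{RegexTokenizer, CountVectorizer}
import org.apache.spark.ml.clustering.LDA

val spark = SparkSession.builder
  .master("local")
  .appName("my-spark-app")
  .getOrCreate()

val df = spark.read.json("dbfs:/mnt/JSON6/JSON/sampleDoc.txt")

// Split the raw text into tokens on non-word characters.
val tokenizer = new RegexTokenizer()
  .setInputCol("text")      // assumed column name
  .setOutputCol("words")
  .setPattern("\\W+")
val tokens = tokenizer.transform(df)

// Turn token lists into term-count vectors, the input LDA expects.
val vectorizer = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
val vectorized = vectorizer.fit(tokens).transform(tokens)

// Fit a small topic model on the count vectors.
val lda = new LDA().setK(5).setMaxIter(10)
val model = lda.fit(vectorized)
```

When inspecting the resulting "features" column row by row, match against org.apache.spark.ml.linalg.Vector (not the mllib one) to avoid the MatchError.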

Save ML model for future usage

二次信任 submitted on 2019-11-26 06:35:54
Question: I was applying some machine learning algorithms like Linear Regression, Logistic Regression, and Naive Bayes to some data, but I was trying to avoid using RDDs and start using DataFrames, because RDDs are slower than DataFrames under PySpark (see pic 1). The other reason I am using DataFrames is that the ml library has a very useful class for tuning models, CrossValidator. This class returns a model after fitting it; obviously this method has to test several scenarios, and after
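A sketch of how tuning with CrossValidator and persisting the winning model fits together (Scala ml API; the question uses PySpark, but the calls mirror each other). The save path and the assumption that `training` is a DataFrame with "features" and "label" columns are illustrative.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression()
val pipeline = new Pipeline().setStages(Array(lr))

// Try a couple of regularization strengths.
val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .build()

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(grid)
  .setNumFolds(3)

val cvModel = cv.fit(training)

// Persist the best model for future use (supported since Spark 2.0).
cvModel.bestModel.asInstanceOf[PipelineModel]
  .write.overwrite().save("/tmp/lr-model")

// In a later job, reload it without refitting:
val reloaded = PipelineModel.load("/tmp/lr-model")
```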

Matrix Multiplication in Apache Spark [closed]

别等时光非礼了梦想. submitted on 2019-11-26 06:35:47
Question: I am trying to perform matrix multiplication using Apache Spark and Java. I have 2 main questions: How do I create an RDD that can represent a matrix in Apache Spark? How do I multiply two such RDDs? Answer 1: It all depends on the input data and dimensions, but generally speaking what you want is not an RDD but one of the distributed data structures from org.apache.spark.mllib.linalg.distributed. At this moment it provides four different implementations of DistributedMatrix: IndexedRowMatrix - can be
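A small sketch of the route the answer points at, assuming an existing SparkContext `sc`: build CoordinateMatrix instances from (row, column, value) entries, then convert to BlockMatrix, the only distributed implementation that supports multiplication.

```scala
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

// Two small matrices expressed as sparse (row, col, value) entries.
val entriesA = sc.parallelize(Seq(
  MatrixEntry(0, 0, 1.0), MatrixEntry(0, 1, 2.0),
  MatrixEntry(1, 0, 3.0), MatrixEntry(1, 1, 4.0)))
val entriesB = sc.parallelize(Seq(
  MatrixEntry(0, 0, 5.0), MatrixEntry(1, 1, 6.0)))

// Convert to block form; caching helps because multiply reuses the blocks.
val a = new CoordinateMatrix(entriesA).toBlockMatrix().cache()
val b = new CoordinateMatrix(entriesB).toBlockMatrix().cache()

// Distributed matrix product, itself a BlockMatrix.
val product = a.multiply(b)
```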

Encode and assemble multiple features in PySpark

血红的双手。 submitted on 2019-11-26 02:19:40
Question: I have a Python class that I'm using to load and process some data in Spark. Among various things I need to do, I'm generating a list of dummy variables derived from various columns in a Spark DataFrame. My problem is that I'm not sure how to properly define a User Defined Function to accomplish what I need. I do currently have a method that, when mapped over the underlying DataFrame RDD, solves half the problem (remember that this is a method in a larger data_processor class): def build
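Rather than a hand-written UDF, this task is usually solved with the built-in feature transformers chained in a Pipeline. A sketch using the Spark 2.x single-column API, with hypothetical columns "gender" (categorical) and "age" (numeric); the question is PySpark, but the pyspark.ml classes have the same names and parameters.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, OneHotEncoder, VectorAssembler}

// Map category strings to indices, one-hot encode them as dummy
// variables, then assemble everything into a single vector column.
val indexer = new StringIndexer()
  .setInputCol("gender")
  .setOutputCol("gender_idx")
val encoder = new OneHotEncoder()
  .setInputCol("gender_idx")
  .setOutputCol("gender_vec")
val assembler = new VectorAssembler()
  .setInputCols(Array("gender_vec", "age"))
  .setOutputCol("features")

val pipeline = new Pipeline().setStages(Array(indexer, encoder, assembler))
val featurized = pipeline.fit(df).transform(df)
```

This keeps the whole transformation on DataFrames, avoiding the drop back to the RDD that the question's build method relies on.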

How to serve a Spark MLlib model?

我的未来我决定 submitted on 2019-11-26 01:39:56
Question: I'm evaluating tools for production ML-based applications, and one of our options is Spark MLlib, but I have some questions about how to serve a model once it's trained. For example, in Azure ML, once trained, the model is exposed as a web service which can be consumed from any application, and it's a similar case with Amazon ML. How do you serve/deploy ML models in Apache Spark? Answer 1: On the one hand, a machine learning model built with Spark can't be served the way you serve in Azure ML or
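One common do-it-yourself pattern, sketched here with a hypothetical model path and feature schema: keep a SparkSession alive inside a long-running service, load the persisted pipeline once at startup, and score each incoming request as a tiny DataFrame.

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master("local[2]")
  .appName("model-server")
  .getOrCreate()

// Load the trained pipeline once; reuse it for every request.
val model = PipelineModel.load("/models/my-pipeline")   // hypothetical path

def score(age: Double, income: Double): Double = {
  import spark.implicits._
  // Wrap the single request in a one-row DataFrame and run the pipeline.
  val request = Seq((age, income)).toDF("age", "income")
  model.transform(request).select("prediction").first().getDouble(0)
}
```

The per-request Spark overhead is noticeable, which is why exporting the model to an external format (e.g. PMML, which some older mllib models support via toPMML) and serving it outside Spark is often preferred for low-latency use.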

Calling Java/Scala function from a task

一世执手 submitted on 2019-11-26 00:33:14
Question: Background My original question here was Why does using DecisionTreeModel.predict inside a map function raise an exception? and is related to How to generate tuples of (original label, predicted label) on Spark with MLlib? When we use the Scala API, the recommended way of getting predictions for an RDD[LabeledPoint] using DecisionTreeModel is to simply map over the RDD: val labelAndPreds = testData.map { point => val prediction = model.predict(point.features) (point.label, prediction) } Unfortunately similar
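A sketch contrasting the two APIs, assuming `model: DecisionTreeModel` and `testData: RDD[LabeledPoint]` already exist. In Scala the per-point map works because the model is an ordinary serializable object; the PySpark wrapper, by contrast, holds a reference to a JVM object that cannot be shipped to workers inside a task, which is what raises the exception.

```scala
import org.apache.spark.mllib.regression.LabeledPoint

// Scala: mapping over the RDD and predicting point by point is fine.
val labelAndPreds = testData.map { point =>
  (point.label, model.predict(point.features))
}

// The usual PySpark workaround is to predict on a whole RDD of feature
// vectors at once, driver-side, then zip labels back on:
//   predictions   = model.predict(testData.map(lambda p: p.features))
//   labelAndPreds = testData.map(lambda p: p.label).zip(predictions)
```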