apache-spark-mllib

MatchError while accessing vector column in Spark 2.0

拈花ヽ惹草 submitted on 2019-11-26 07:49:19
Question: I am trying to create an LDA model on a JSON file. Creating a Spark session with the JSON file: import org.apache.spark.sql.SparkSession val spark = SparkSession.builder .master("local") .appName("my-spark-app") .config("spark.some.config.option", "config-value") .getOrCreate() val df = spark.read.json("dbfs:/mnt/JSON6/JSON/sampleDoc.txt") Displaying the df should show the DataFrame: display(df) Tokenize the text: import org.apache.spark.ml.feature.RegexTokenizer // Set params
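A minimal sketch of the pipeline the question describes, assuming the JSON documents have a string column named "text" (a hypothetical name; adjust to your schema). Note that the MatchError in the title typically comes from pattern-matching a DataFrame vector column against the old org.apache.spark.mllib.linalg.Vector type: in Spark 2.0, ml-pipeline output columns hold org.apache.spark.ml.linalg.Vector instead.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.{RegexTokenizer, CountVectorizer}
import org.apache.spark.ml.clustering.LDA

val spark = SparkSession.builder
  .master("local")
  .appName("my-spark-app")
  .getOrCreate()

val df = spark.read.json("dbfs:/mnt/JSON6/JSON/sampleDoc.txt")

// Split the raw text into tokens on non-word characters.
val tokenizer = new RegexTokenizer()
  .setInputCol("text")      // assumed column name
  .setOutputCol("words")
  .setPattern("\\W+")
val tokens = tokenizer.transform(df)

// Turn token lists into term-count vectors, the input LDA expects.
val vectorizer = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
val vectorized = vectorizer.fit(tokens).transform(tokens)

// Fit a small topic model on the count vectors.
val lda = new LDA().setK(5).setMaxIter(10)
val model = lda.fit(vectorized)
```

When inspecting the resulting "features" column row by row, match against org.apache.spark.ml.linalg.Vector (not the mllib one) to avoid the MatchError.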

Save ML model for future usage

二次信任 submitted on 2019-11-26 06:35:54
Question: I was applying some machine learning algorithms like Linear Regression, Logistic Regression, and Naive Bayes to some data, but I was trying to avoid using RDDs and start using DataFrames, because RDDs are slower than DataFrames under PySpark (see pic 1). The other reason I am using DataFrames is that the ml library has a very useful class for tuning models, CrossValidator. This class returns a model after fitting it; obviously this method has to test several scenarios, and after
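A sketch of how tuning with CrossValidator and persisting the winning model fits together (Scala ml API; the question uses PySpark, but the calls mirror each other). The save path and the assumption that `training` is a DataFrame with "features" and "label" columns are illustrative.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression()
val pipeline = new Pipeline().setStages(Array(lr))

// Try a couple of regularization strengths.
val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .build()

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(grid)
  .setNumFolds(3)

val cvModel = cv.fit(training)

// Persist the best model for future use (supported since Spark 2.0).
cvModel.bestModel.asInstanceOf[PipelineModel]
  .write.overwrite().save("/tmp/lr-model")

// In a later job, reload it without refitting:
val reloaded = PipelineModel.load("/tmp/lr-model")
```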

Matrix Multiplication in Apache Spark [closed]

别等时光非礼了梦想. submitted on 2019-11-26 06:35:47
Question: I am trying to perform matrix multiplication using Apache Spark and Java. I have 2 main questions: How do I create an RDD that can represent a matrix in Apache Spark? How do I multiply two such RDDs? Answer 1: It all depends on the input data and dimensions, but generally speaking what you want is not an RDD but one of the distributed data structures from org.apache.spark.mllib.linalg.distributed. At this moment it provides four different implementations of DistributedMatrix: IndexedRowMatrix - can be
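A small sketch of the route the answer points at, assuming an existing SparkContext `sc`: build CoordinateMatrix instances from (row, column, value) entries, then convert to BlockMatrix, the only distributed implementation that supports multiplication.

```scala
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

// Two small matrices expressed as sparse (row, col, value) entries.
val entriesA = sc.parallelize(Seq(
  MatrixEntry(0, 0, 1.0), MatrixEntry(0, 1, 2.0),
  MatrixEntry(1, 0, 3.0), MatrixEntry(1, 1, 4.0)))
val entriesB = sc.parallelize(Seq(
  MatrixEntry(0, 0, 5.0), MatrixEntry(1, 1, 6.0)))

// Convert to block form; caching helps because multiply reuses the blocks.
val a = new CoordinateMatrix(entriesA).toBlockMatrix().cache()
val b = new CoordinateMatrix(entriesB).toBlockMatrix().cache()

// Distributed matrix product, itself a BlockMatrix.
val product = a.multiply(b)
```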

Encode and assemble multiple features in PySpark

血红的双手。 submitted on 2019-11-26 02:19:40
Question: I have a Python class that I'm using to load and process some data in Spark. Among various things I need to do, I'm generating a list of dummy variables derived from various columns in a Spark DataFrame. My problem is that I'm not sure how to properly define a User Defined Function to accomplish what I need. I do currently have a method that, when mapped over the underlying DataFrame RDD, solves half the problem (remember that this is a method in a larger data_processor class): def build
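Rather than a hand-written UDF, this task is usually solved with the built-in feature transformers chained in a Pipeline. A sketch using the Spark 2.x single-column API, with hypothetical columns "gender" (categorical) and "age" (numeric); the question is PySpark, but the pyspark.ml classes have the same names and parameters.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, OneHotEncoder, VectorAssembler}

// Map category strings to indices, one-hot encode them as dummy
// variables, then assemble everything into a single vector column.
val indexer = new StringIndexer()
  .setInputCol("gender")
  .setOutputCol("gender_idx")
val encoder = new OneHotEncoder()
  .setInputCol("gender_idx")
  .setOutputCol("gender_vec")
val assembler = new VectorAssembler()
  .setInputCols(Array("gender_vec", "age"))
  .setOutputCol("features")

val pipeline = new Pipeline().setStages(Array(indexer, encoder, assembler))
val featurized = pipeline.fit(df).transform(df)
```

This keeps the whole transformation on DataFrames, avoiding the drop back to the RDD that the question's build method relies on.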

How to serve a Spark MLlib model?

我的未来我决定 submitted on 2019-11-26 01:39:56
Question: I'm evaluating tools for production ML-based applications, and one of our options is Spark MLlib, but I have some questions about how to serve a model once it's trained. For example, in Azure ML, once trained, the model is exposed as a web service which can be consumed from any application, and it's a similar case with Amazon ML. How do you serve/deploy ML models in Apache Spark? Answer 1: On the one hand, a machine learning model built with Spark can't be served the way you serve in Azure ML or
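One common do-it-yourself pattern, sketched here with a hypothetical model path and feature schema: keep a SparkSession alive inside a long-running service, load the persisted pipeline once at startup, and score each incoming request as a tiny DataFrame.

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master("local[2]")
  .appName("model-server")
  .getOrCreate()

// Load the trained pipeline once; reuse it for every request.
val model = PipelineModel.load("/models/my-pipeline")   // hypothetical path

def score(age: Double, income: Double): Double = {
  import spark.implicits._
  // Wrap the single request in a one-row DataFrame and run the pipeline.
  val request = Seq((age, income)).toDF("age", "income")
  model.transform(request).select("prediction").first().getDouble(0)
}
```

The per-request Spark overhead is noticeable, which is why exporting the model to an external format (e.g. PMML, which some older mllib models support via toPMML) and serving it outside Spark is often preferred for low-latency use.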

Calling Java/Scala function from a task

一世执手 submitted on 2019-11-26 00:33:14
Question: Background My original question here was Why does using DecisionTreeModel.predict inside a map function raise an exception? and is related to How to generate tuples of (original label, predicted label) on Spark with MLlib? When we use the Scala API, the recommended way of getting predictions for an RDD[LabeledPoint] using DecisionTreeModel is to simply map over the RDD: val labelAndPreds = testData.map { point => val prediction = model.predict(point.features) (point.label, prediction) } Unfortunately similar
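A sketch contrasting the two APIs, assuming `model: DecisionTreeModel` and `testData: RDD[LabeledPoint]` already exist. In Scala the per-point map works because the model is an ordinary serializable object; the PySpark wrapper, by contrast, holds a reference to a JVM object that cannot be shipped to workers inside a task, which is what raises the exception.

```scala
import org.apache.spark.mllib.regression.LabeledPoint

// Scala: mapping over the RDD and predicting point by point is fine.
val labelAndPreds = testData.map { point =>
  (point.label, model.predict(point.features))
}

// The usual PySpark workaround is to predict on a whole RDD of feature
// vectors at once, driver-side, then zip labels back on:
//   predictions   = model.predict(testData.map(lambda p: p.features))
//   labelAndPreds = testData.map(lambda p: p.label).zip(predictions)
```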