apache-spark-mllib

Split Contents of String column in PySpark Dataframe

时光毁灭记忆、已成空白 submitted on 2019-11-26 21:44:46
Question: I have a PySpark data frame which has a column containing strings. I want to split this column into words.

Code:

    >>> sentenceData = sqlContext.read.load('file://sample1.csv',
    ...     format='com.databricks.spark.csv', header='true', inferSchema='true')
    >>> sentenceData.show(truncate=False)
    +---+---------------------------+
    |key|desc                       |
    +---+---------------------------+
    |1  |Virat is good batsman      |
    |2  |sachin was good            |
    |3  |but modi sucks big big time|
    |4  |I love the formulas        |
    +---+----------------------
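
The excerpt is cut off before the answer; a minimal sketch of one common approach, using the built-in split and explode functions (the key/desc column names are taken from the sample output above, so the thread's actual answer may differ):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = SparkSession.builder.appName("split-demo").getOrCreate()
    df = spark.createDataFrame(
        [(1, "Virat is good batsman"), (2, "sachin was good")],
        ["key", "desc"])

    # split() turns the string column into an array of words;
    # explode() then yields one row per word
    with_words = df.withColumn("words", split(df["desc"], "\\s+"))
    with_words.select("key", explode("words").alias("word")).show()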

MatchError while accessing vector column in Spark 2.0

天涯浪子 submitted on 2019-11-26 21:06:51
I am trying to create an LDA model from a JSON file.

Creating a Spark session and reading the JSON file:

    import org.apache.spark.sql.SparkSession

    val sparkSession = SparkSession.builder
      .master("local")
      .appName("my-spark-app")
      .config("spark.some.config.option", "config-value")
      .getOrCreate()

    // The session was bound to sparkSession, so read through that reference
    val df = sparkSession.read.json("dbfs:/mnt/JSON6/JSON/sampleDoc.txt")

Displaying the df should show the DataFrame:

    display(df)

Tokenize the text:

    import org.apache.spark.ml.feature.RegexTokenizer

    // Set params for RegexTokenizer
    val tokenizer = new RegexTokenizer()
      .setPattern("[\\W_]+")
      .setMinTokenLength(4) // Filter
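
The question's code is Scala, but this MatchError in Spark 2.0 usually comes from mixing the old org.apache.spark.mllib vector type with the new org.apache.spark.ml API. A minimal PySpark sketch that stays entirely on the ml side (data and parameters here are made up):

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import RegexTokenizer, CountVectorizer
    from pyspark.ml.clustering import LDA

    spark = SparkSession.builder.master("local").appName("lda-demo").getOrCreate()
    df = spark.createDataFrame(
        [(0, "spark mllib topic modelling example text")], ["id", "text"])

    tokenizer = RegexTokenizer(inputCol="text", outputCol="tokens",
                               pattern="[\\W_]+", minTokenLength=4)
    tokens = tokenizer.transform(df)

    # CountVectorizer emits ml-package vectors, which is what
    # ml.clustering.LDA expects in Spark 2.x
    cv_model = CountVectorizer(inputCol="tokens", outputCol="features").fit(tokens)
    vectorized = cv_model.transform(tokens)

    lda_model = LDA(k=2, maxIter=10).fit(vectorized)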

Matrix Multiplication in Apache Spark [closed]

纵饮孤独 submitted on 2019-11-26 20:09:58
I am trying to perform matrix multiplication using Apache Spark and Java. I have two main questions: how do I create an RDD that can represent a matrix in Apache Spark, and how do I multiply two such RDDs?

It all depends on the input data and dimensions, but generally speaking what you want is not an RDD but one of the distributed data structures from org.apache.spark.mllib.linalg.distributed. At this moment it provides four different implementations of DistributedMatrix:

IndexedRowMatrix - can be created directly from an RDD[IndexedRow], where an IndexedRow consists of a row index and an org.apache.spark.mllib.linalg
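
The answer is cut off mid-list, but the multiplication step can be sketched with BlockMatrix, the DistributedMatrix implementation that supports a distributed multiply(). A small PySpark example (the thread's answer may use the Scala/Java API instead):

    from pyspark.sql import SparkSession
    from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

    spark = SparkSession.builder.appName("matmul-demo").getOrCreate()
    sc = spark.sparkContext

    # Two 2x2 matrices represented as RDD[IndexedRow]
    a = IndexedRowMatrix(sc.parallelize(
        [IndexedRow(0, [1.0, 2.0]), IndexedRow(1, [3.0, 4.0])]))
    b = IndexedRowMatrix(sc.parallelize(
        [IndexedRow(0, [5.0, 6.0]), IndexedRow(1, [7.0, 8.0])]))

    # BlockMatrix supports distributed matrix-matrix multiplication
    product = a.toBlockMatrix().multiply(b.toBlockMatrix())
    print(product.toLocalMatrix())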

How to assign unique contiguous numbers to elements in a Spark RDD

谁都会走 submitted on 2019-11-26 19:08:08
Question: I have a dataset of (user, product, review), and want to feed it into mllib's ALS algorithm. The algorithm needs users and products to be numbers, while mine are String usernames and String SKUs. Right now, I get the distinct users and SKUs, then assign numeric IDs to them outside of Spark. I was wondering whether there was a better way of doing this. The one approach I've thought of is to write a custom RDD that essentially enumerates 1 through n, then call zip on the two RDDs.

Answer 1:
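
The answer text is cut off; one standard way to assign the IDs inside Spark is RDD.zipWithIndex, sketched here on a hypothetical RDD of usernames:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("id-demo").getOrCreate()
    sc = spark.sparkContext

    users = sc.parallelize(["alice", "bob", "carol", "alice"]).distinct()

    # zipWithIndex assigns contiguous indices 0..n-1, avoiding the manual
    # "enumerate then zip" workaround described in the question
    user_ids = users.zipWithIndex().collectAsMap()
    print(user_ids)  # e.g. {'alice': 0, 'bob': 1, 'carol': 2}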

How to create correct data frame for classification in Spark ML

笑着哭i submitted on 2019-11-26 18:57:16
Question: I am trying to run random forest classification using the Spark ML API, but I am having issues with creating the right data frame input for the pipeline. Here is sample data:

    age,hours_per_week,education,sex,salaryRange
    38,40,"hs-grad","male","A"
    28,40,"bachelors","female","A"
    52,45,"hs-grad","male","B"
    31,50,"masters","female","B"
    42,40,"bachelors","male","B"

age and hours_per_week are integers, while the other features, including the label salaryRange, are categorical (String). Loading this csv file (let's call
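
The excerpt ends before the loading step; a hedged sketch of one way to turn those columns into the label/features layout Spark ML expects, using StringIndexer and VectorAssembler (the thread's accepted answer may differ):

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, VectorAssembler
    from pyspark.ml.classification import RandomForestClassifier

    spark = SparkSession.builder.appName("rf-input").getOrCreate()
    df = spark.createDataFrame(
        [(38, 40, "hs-grad", "male", "A"),
         (28, 40, "bachelors", "female", "A"),
         (52, 45, "hs-grad", "male", "B")],
        ["age", "hours_per_week", "education", "sex", "salaryRange"])

    # Index the categorical feature columns and the label, then assemble
    # numeric + indexed columns into the single vector column ML expects
    indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx")
                for c in ["education", "sex"]]
    label_indexer = StringIndexer(inputCol="salaryRange", outputCol="label")
    assembler = VectorAssembler(
        inputCols=["age", "hours_per_week", "education_idx", "sex_idx"],
        outputCol="features")
    rf = RandomForestClassifier(labelCol="label", featuresCol="features")

    pipeline = Pipeline(stages=indexers + [label_indexer, assembler, rf])
    model = pipeline.fit(df)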

Save ML model for future usage

筅森魡賤 submitted on 2019-11-26 18:49:47
I was applying some machine learning algorithms like Linear Regression, Logistic Regression, and Naive Bayes to some data, but I was trying to avoid using RDDs and start using DataFrames, because RDDs are slower than DataFrames under PySpark (see pic 1). The other reason I am using DataFrames is that the ml library has a class that is very useful for tuning models, CrossValidator. This class returns a model after fitting it; obviously it has to test several scenarios, and after that it returns a fitted model (with the best combination of parameters). The cluster I use isn't so
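
For the saving itself, a minimal sketch of persisting and reloading a fitted ml model (the path is a placeholder, train_df is an assumed DataFrame with label and features columns, and Python-side save/load for ml models requires Spark 2.0+):

    from pyspark.ml.classification import LogisticRegression, LogisticRegressionModel

    lr = LogisticRegression(maxIter=10)
    model = lr.fit(train_df)

    # Persist the fitted model to disk, then reload it later without retraining
    model.save("/tmp/lr-model")
    restored = LogisticRegressionModel.load("/tmp/lr-model")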

Spark mllib predicting weird number or NaN

谁说胖子不能爱 submitted on 2019-11-26 17:48:30
Question: I am new to Apache Spark and trying to use the machine learning library to predict some data. My dataset right now is only about 350 points. Here are 7 of those points:

    "365","4",41401.387,5330569
    "364","3",51517.886,5946290
    "363","2",55059.838,6097388
    "362","1",43780.977,5304694
    "361","7",46447.196,5471836
    "360","6",50656.121,5849862
    "359","5",44494.476,5460289

Here's my code:

    def parsePoint(line):
        split = map(sanitize, line.split(','))
        rev = split.pop(-2)
        return LabeledPoint(rev, split)

    def
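
The post is cut off before the model call, but NaN predictions from mllib's SGD-based regressors are most often a feature-scaling/step-size problem. A hedged sketch of standardizing first (assumes an active SparkContext sc; the rows mirror the sample above, with the third field as the label):

    from pyspark.mllib.feature import StandardScaler
    from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

    points = sc.parallelize([
        LabeledPoint(41401.387, [365.0, 4.0, 5330569.0]),
        LabeledPoint(51517.886, [364.0, 3.0, 5946290.0]),
        LabeledPoint(55059.838, [363.0, 2.0, 6097388.0])])

    # Large, unscaled features can make SGD diverge to NaN;
    # standardize to zero mean and unit variance first
    features = points.map(lambda p: p.features)
    scaler = StandardScaler(withMean=True, withStd=True).fit(features)
    scaled = points.zip(scaler.transform(features)).map(
        lambda pv: LabeledPoint(pv[0].label, pv[1]))

    model = LinearRegressionWithSGD.train(scaled, iterations=100, step=0.1)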

Spark CrossValidatorModel access other models than the bestModel?

你说的曾经没有我的故事 submitted on 2019-11-26 17:24:24
Question: I am using Spark 1.6.1. Currently I am using a CrossValidator to train my ML Pipeline with various parameters. After the training process I can use the bestModel property of the CrossValidatorModel to get the model that performed best during the cross validation. Are the other models of the cross validation automatically discarded, or can I select a model that performed worse than the bestModel? I am asking because I am using the F1 score metric for the cross validation, but I am also
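
CrossValidatorModel keeps only the single refit bestModel, but newer Spark versions expose avgMetrics, the cross-validated score per parameter map, so a worse-scoring configuration can be refit manually. A sketch where est, evaluator, and train_df are assumed to exist:

    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    grid = ParamGridBuilder().addGrid(est.maxIter, [10, 50]).build()
    cv = CrossValidator(estimator=est, estimatorParamMaps=grid,
                        evaluator=evaluator, numFolds=3)
    cv_model = cv.fit(train_df)

    # avgMetrics[i] is the cross-validated metric for grid[i];
    # rank them and refit any non-best parameter map on the full data
    ranked = sorted(zip(cv_model.avgMetrics, grid),
                    key=lambda t: t[0], reverse=True)
    second_best_params = ranked[1][1]
    second_model = est.fit(train_df, second_best_params)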

Spark ALS predictAll returns empty

旧街凉风 submitted on 2019-11-26 17:07:51
Question: I have the following Python test code (the arguments to ALS.train are defined elsewhere):

    r1 = (2, 1)
    r2 = (3, 1)
    test = sc.parallelize([r1, r2])
    model = ALS.train(ratings, rank, numIter, lmbda)
    predictions = model.predictAll(test)
    print test.take(1)
    print predictions.count()
    print predictions

This works: predictions has a count of 1, and the output is:

    [(2, 1)]
    1
    ParallelCollectionRDD[2691] at parallelize at PythonRDD.scala:423

However, when I try and use an RDD I
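
predictAll typically returns an empty RDD when the (user, product) pairs passed in don't match IDs seen during training, for example because they are strings or floats rather than ints. A self-contained sketch with matching integer IDs (assumes an active SparkContext sc; the ratings are made up):

    from pyspark.mllib.recommendation import ALS, Rating

    ratings = sc.parallelize([Rating(2, 1, 5.0), Rating(3, 1, 3.0),
                              Rating(2, 2, 1.0)])
    model = ALS.train(ratings, rank=10, iterations=5, lambda_=0.01)

    # predictAll expects an RDD of (user, product) pairs whose integer IDs
    # occurred in the training ratings
    test = sc.parallelize([(2, 1), (3, 1)])
    print(model.predictAll(test).collect())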

How to vectorize DataFrame columns for ML algorithms?

大城市里の小女人 submitted on 2019-11-26 17:07:49
Question: I have a DataFrame with some categorical string values (e.g. uuid|url|browser). I would like to convert them to doubles to execute an ML algorithm that accepts a double matrix. As the conversion method I used StringIndexer (Spark 1.4), which maps my string values to double values, so I defined a function like this:

    def str(arg: String, df: DataFrame): DataFrame = {
      val indexer = new StringIndexer().setInputCol(arg).setOutputCol(arg + "_index")
      val newDF = indexer.fit(df).transform(df)
      newDF
    }

Now the issue
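
The issue text is cut off; for the overall vectorization, a PySpark sketch of the usual end-to-end pattern with one StringIndexer per column followed by a VectorAssembler (column names follow the uuid|url|browser example above):

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, VectorAssembler

    spark = SparkSession.builder.appName("vectorize-demo").getOrCreate()
    df = spark.createDataFrame(
        [("u1", "http://a", "chrome"), ("u2", "http://b", "firefox")],
        ["uuid", "url", "browser"])

    cols = ["uuid", "url", "browser"]
    indexers = [StringIndexer(inputCol=c, outputCol=c + "_index") for c in cols]

    # Assemble the indexed doubles into the single vector column that
    # ML estimators consume
    assembler = VectorAssembler(inputCols=[c + "_index" for c in cols],
                                outputCol="features")
    Pipeline(stages=indexers + [assembler]).fit(df).transform(df).show(truncate=False)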