apache-spark-mllib

Split Contents of String column in PySpark Dataframe

时光毁灭记忆、已成空白 submitted on 2019-11-26 21:44:46
Question: I have a PySpark data frame which has a column containing strings. I want to split this column into words.

Code:

    >>> sentenceData = sqlContext.read.load('file://sample1.csv',
    ...     format='com.databricks.spark.csv', header='true', inferSchema='true')
    >>> sentenceData.show(truncate=False)
    +---+---------------------------+
    |key|desc                       |
    +---+---------------------------+
    |1  |Virat is good batsman      |
    |2  |sachin was good            |
    |3  |but modi sucks big big time|
    |4  |I love the formulas        |
    +---+----------------------
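
The excerpt is cut off before the answer; a minimal sketch of one common approach, using the built-in split and explode functions (the key/desc column names are taken from the sample output above, so the thread's actual answer may differ):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = SparkSession.builder.appName("split-demo").getOrCreate()
    df = spark.createDataFrame(
        [(1, "Virat is good batsman"), (2, "sachin was good")],
        ["key", "desc"])

    # split() turns the string column into an array of words;
    # explode() then yields one row per word
    with_words = df.withColumn("words", split(df["desc"], "\\s+"))
    with_words.select("key", explode("words").alias("word")).show()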

MatchError while accessing vector column in Spark 2.0

天涯浪子 submitted on 2019-11-26 21:06:51
I am trying to create an LDA model from a JSON file.

Creating a Spark session and reading the JSON file:

    import org.apache.spark.sql.SparkSession

    val sparkSession = SparkSession.builder
      .master("local")
      .appName("my-spark-app")
      .config("spark.some.config.option", "config-value")
      .getOrCreate()

    // The session was bound to sparkSession, so read through that reference
    val df = sparkSession.read.json("dbfs:/mnt/JSON6/JSON/sampleDoc.txt")

Displaying the df should show the DataFrame:

    display(df)

Tokenize the text:

    import org.apache.spark.ml.feature.RegexTokenizer

    // Set params for RegexTokenizer
    val tokenizer = new RegexTokenizer()
      .setPattern("[\\W_]+")
      .setMinTokenLength(4) // Filter
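
The question's code is Scala, but this MatchError in Spark 2.0 usually comes from mixing the old org.apache.spark.mllib vector type with the new org.apache.spark.ml API. A minimal PySpark sketch that stays entirely on the ml side (data and parameters here are made up):

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import RegexTokenizer, CountVectorizer
    from pyspark.ml.clustering import LDA

    spark = SparkSession.builder.master("local").appName("lda-demo").getOrCreate()
    df = spark.createDataFrame(
        [(0, "spark mllib topic modelling example text")], ["id", "text"])

    tokenizer = RegexTokenizer(inputCol="text", outputCol="tokens",
                               pattern="[\\W_]+", minTokenLength=4)
    tokens = tokenizer.transform(df)

    # CountVectorizer emits ml-package vectors, which is what
    # ml.clustering.LDA expects in Spark 2.x
    cv_model = CountVectorizer(inputCol="tokens", outputCol="features").fit(tokens)
    vectorized = cv_model.transform(tokens)

    lda_model = LDA(k=2, maxIter=10).fit(vectorized)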

Matrix Multiplication in Apache Spark [closed]

纵饮孤独 submitted on 2019-11-26 20:09:58
I am trying to perform matrix multiplication using Apache Spark and Java. I have two main questions: how do I create an RDD that can represent a matrix in Apache Spark, and how do I multiply two such RDDs?

It all depends on the input data and dimensions, but generally speaking what you want is not an RDD but one of the distributed data structures from org.apache.spark.mllib.linalg.distributed. At this moment it provides four different implementations of DistributedMatrix:

IndexedRowMatrix - can be created directly from an RDD[IndexedRow], where an IndexedRow consists of a row index and an org.apache.spark.mllib.linalg
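
The answer is cut off mid-list, but the multiplication step can be sketched with BlockMatrix, the DistributedMatrix implementation that supports a distributed multiply(). A small PySpark example (the thread's answer may use the Scala/Java API instead):

    from pyspark.sql import SparkSession
    from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

    spark = SparkSession.builder.appName("matmul-demo").getOrCreate()
    sc = spark.sparkContext

    # Two 2x2 matrices represented as RDD[IndexedRow]
    a = IndexedRowMatrix(sc.parallelize(
        [IndexedRow(0, [1.0, 2.0]), IndexedRow(1, [3.0, 4.0])]))
    b = IndexedRowMatrix(sc.parallelize(
        [IndexedRow(0, [5.0, 6.0]), IndexedRow(1, [7.0, 8.0])]))

    # BlockMatrix supports distributed matrix-matrix multiplication
    product = a.toBlockMatrix().multiply(b.toBlockMatrix())
    print(product.toLocalMatrix())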

How to assign unique contiguous numbers to elements in a Spark RDD

谁都会走 submitted on 2019-11-26 19:08:08
Question: I have a dataset of (user, product, review), and want to feed it into mllib's ALS algorithm. The algorithm needs users and products to be numbers, while mine are String usernames and String SKUs. Right now, I get the distinct users and SKUs, then assign numeric IDs to them outside of Spark. I was wondering whether there was a better way of doing this. The one approach I've thought of is to write a custom RDD that essentially enumerates 1 through n, then call zip on the two RDDs.

Answer 1:
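
The answer text is cut off; one standard way to assign the IDs inside Spark is RDD.zipWithIndex, sketched here on a hypothetical RDD of usernames:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("id-demo").getOrCreate()
    sc = spark.sparkContext

    users = sc.parallelize(["alice", "bob", "carol", "alice"]).distinct()

    # zipWithIndex assigns contiguous indices 0..n-1, avoiding the manual
    # "enumerate then zip" workaround described in the question
    user_ids = users.zipWithIndex().collectAsMap()
    print(user_ids)  # e.g. {'alice': 0, 'bob': 1, 'carol': 2}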

How to create correct data frame for classification in Spark ML

笑着哭i submitted on 2019-11-26 18:57:16
Question: I am trying to run random forest classification using the Spark ML API, but I am having issues with creating the right data frame input for the pipeline. Here is sample data:

    age,hours_per_week,education,sex,salaryRange
    38,40,"hs-grad","male","A"
    28,40,"bachelors","female","A"
    52,45,"hs-grad","male","B"
    31,50,"masters","female","B"
    42,40,"bachelors","male","B"

age and hours_per_week are integers, while the other features, including the label salaryRange, are categorical (String). Loading this csv file (let's call
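
The excerpt ends before the loading step; a hedged sketch of one way to turn those columns into the label/features layout Spark ML expects, using StringIndexer and VectorAssembler (the thread's accepted answer may differ):

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, VectorAssembler
    from pyspark.ml.classification import RandomForestClassifier

    spark = SparkSession.builder.appName("rf-input").getOrCreate()
    df = spark.createDataFrame(
        [(38, 40, "hs-grad", "male", "A"),
         (28, 40, "bachelors", "female", "A"),
         (52, 45, "hs-grad", "male", "B")],
        ["age", "hours_per_week", "education", "sex", "salaryRange"])

    # Index the categorical feature columns and the label, then assemble
    # numeric + indexed columns into the single vector column ML expects
    indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx")
                for c in ["education", "sex"]]
    label_indexer = StringIndexer(inputCol="salaryRange", outputCol="label")
    assembler = VectorAssembler(
        inputCols=["age", "hours_per_week", "education_idx", "sex_idx"],
        outputCol="features")
    rf = RandomForestClassifier(labelCol="label", featuresCol="features")

    pipeline = Pipeline(stages=indexers + [label_indexer, assembler, rf])
    model = pipeline.fit(df)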

Save ML model for future usage

筅森魡賤 submitted on 2019-11-26 18:49:47
I was applying some machine learning algorithms like Linear Regression, Logistic Regression, and Naive Bayes to some data, but I was trying to avoid using RDDs and start using DataFrames, because RDDs are slower than DataFrames under PySpark (see pic 1). The other reason I am using DataFrames is that the ml library has a class that is very useful for tuning models, CrossValidator. This class returns a model after fitting it; obviously it has to test several scenarios, and after that it returns a fitted model (with the best combination of parameters). The cluster I use isn't so
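
For the saving itself, a minimal sketch of persisting and reloading a fitted ml model (the path is a placeholder, train_df is an assumed DataFrame with label and features columns, and Python-side save/load for ml models requires Spark 2.0+):

    from pyspark.ml.classification import LogisticRegression, LogisticRegressionModel

    lr = LogisticRegression(maxIter=10)
    model = lr.fit(train_df)

    # Persist the fitted model to disk, then reload it later without retraining
    model.save("/tmp/lr-model")
    restored = LogisticRegressionModel.load("/tmp/lr-model")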

Spark mllib predicting weird number or NaN

谁说胖子不能爱 submitted on 2019-11-26 17:48:30
Question: I am new to Apache Spark and trying to use the machine learning library to predict some data. My dataset right now is only about 350 points. Here are 7 of those points:

    "365","4",41401.387,5330569
    "364","3",51517.886,5946290
    "363","2",55059.838,6097388
    "362","1",43780.977,5304694
    "361","7",46447.196,5471836
    "360","6",50656.121,5849862
    "359","5",44494.476,5460289

Here's my code:

    def parsePoint(line):
        split = map(sanitize, line.split(','))
        rev = split.pop(-2)
        return LabeledPoint(rev, split)

    def
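
The post is cut off before the model call, but NaN predictions from mllib's SGD-based regressors are most often a feature-scaling/step-size problem. A hedged sketch of standardizing first (assumes an active SparkContext sc; the rows mirror the sample above, with the third field as the label):

    from pyspark.mllib.feature import StandardScaler
    from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

    points = sc.parallelize([
        LabeledPoint(41401.387, [365.0, 4.0, 5330569.0]),
        LabeledPoint(51517.886, [364.0, 3.0, 5946290.0]),
        LabeledPoint(55059.838, [363.0, 2.0, 6097388.0])])

    # Large, unscaled features can make SGD diverge to NaN;
    # standardize to zero mean and unit variance first
    features = points.map(lambda p: p.features)
    scaler = StandardScaler(withMean=True, withStd=True).fit(features)
    scaled = points.zip(scaler.transform(features)).map(
        lambda pv: LabeledPoint(pv[0].label, pv[1]))

    model = LinearRegressionWithSGD.train(scaled, iterations=100, step=0.1)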

Spark CrossValidatorModel access other models than the bestModel?

你说的曾经没有我的故事 submitted on 2019-11-26 17:24:24
Question: I am using Spark 1.6.1. Currently I am using a CrossValidator to train my ML Pipeline with various parameters. After the training process I can use the bestModel property of the CrossValidatorModel to get the model that performed best during the cross validation. Are the other models of the cross validation automatically discarded, or can I select a model that performed worse than the bestModel? I am asking because I am using the F1 score metric for the cross validation, but I am also
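
CrossValidatorModel keeps only the single refit bestModel, but newer Spark versions expose avgMetrics, the cross-validated score per parameter map, so a worse-scoring configuration can be refit manually. A sketch where est, evaluator, and train_df are assumed to exist:

    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    grid = ParamGridBuilder().addGrid(est.maxIter, [10, 50]).build()
    cv = CrossValidator(estimator=est, estimatorParamMaps=grid,
                        evaluator=evaluator, numFolds=3)
    cv_model = cv.fit(train_df)

    # avgMetrics[i] is the cross-validated metric for grid[i];
    # rank them and refit any non-best parameter map on the full data
    ranked = sorted(zip(cv_model.avgMetrics, grid),
                    key=lambda t: t[0], reverse=True)
    second_best_params = ranked[1][1]
    second_model = est.fit(train_df, second_best_params)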

Spark ALS predictAll returns empty

旧街凉风 submitted on 2019-11-26 17:07:51
Question: I have the following Python test code (the arguments to ALS.train are defined elsewhere):

    r1 = (2, 1)
    r2 = (3, 1)
    test = sc.parallelize([r1, r2])
    model = ALS.train(ratings, rank, numIter, lmbda)
    predictions = model.predictAll(test)
    print test.take(1)
    print predictions.count()
    print predictions

This works: predictions has a count of 1, and the output is:

    [(2, 1)]
    1
    ParallelCollectionRDD[2691] at parallelize at PythonRDD.scala:423

However, when I try and use an RDD I
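
predictAll typically returns an empty RDD when the (user, product) pairs passed in don't match IDs seen during training, for example because they are strings or floats rather than ints. A self-contained sketch with matching integer IDs (assumes an active SparkContext sc; the ratings are made up):

    from pyspark.mllib.recommendation import ALS, Rating

    ratings = sc.parallelize([Rating(2, 1, 5.0), Rating(3, 1, 3.0),
                              Rating(2, 2, 1.0)])
    model = ALS.train(ratings, rank=10, iterations=5, lambda_=0.01)

    # predictAll expects an RDD of (user, product) pairs whose integer IDs
    # occurred in the training ratings
    test = sc.parallelize([(2, 1), (3, 1)])
    print(model.predictAll(test).collect())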

How to vectorize DataFrame columns for ML algorithms?

大城市里の小女人 submitted on 2019-11-26 17:07:49
Question: I have a DataFrame with some categorical string values (e.g. uuid|url|browser). I would like to convert them to doubles to execute an ML algorithm that accepts a double matrix. As the conversion method I used StringIndexer (Spark 1.4), which maps my string values to double values, so I defined a function like this:

    def str(arg: String, df: DataFrame): DataFrame = {
      val indexer = new StringIndexer().setInputCol(arg).setOutputCol(arg + "_index")
      val newDF = indexer.fit(df).transform(df)
      newDF
    }

Now the issue
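
The issue text is cut off; for the overall vectorization, a PySpark sketch of the usual end-to-end pattern with one StringIndexer per column followed by a VectorAssembler (column names follow the uuid|url|browser example above):

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, VectorAssembler

    spark = SparkSession.builder.appName("vectorize-demo").getOrCreate()
    df = spark.createDataFrame(
        [("u1", "http://a", "chrome"), ("u2", "http://b", "firefox")],
        ["uuid", "url", "browser"])

    cols = ["uuid", "url", "browser"]
    indexers = [StringIndexer(inputCol=c, outputCol=c + "_index") for c in cols]

    # Assemble the indexed doubles into the single vector column that
    # ML estimators consume
    assembler = VectorAssembler(inputCols=[c + "_index" for c in cols],
                                outputCol="features")
    Pipeline(stages=indexers + [assembler]).fit(df).transform(df).show(truncate=False)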