apache-spark-mllib

How to convert a mix of text and numerical data to feature data in Apache Spark

Submitted by 怎甘沉沦 on 2020-01-07 16:34:48
Question: I have a CSV containing both textual and numerical data, and I need to convert it to feature vector data (Double values) in Spark. Is there any way to do that? I have seen examples where each keyword is mapped to some double value and used for the conversion, but with multiple keywords this approach becomes unwieldy. Is there another way? I see that Spark provides extractors that convert data into feature vectors. Could someone please give an example? A sample row: 48, Private, 105808, 9th, 5, Widowed, Transport
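A minimal PySpark sketch of the usual approach, assuming an existing SparkSession `spark`; the column names are hypothetical, guessed from the sample row: index each text column with StringIndexer, then combine everything with VectorAssembler.

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, VectorAssembler

    # Hypothetical names for rows like: 48, Private, 105808, 9th, 5, Widowed, Transport
    df = spark.read.csv("data.csv", inferSchema=True).toDF(
        "age", "workclass", "fnlwgt", "education", "education_num",
        "marital_status", "occupation")

    text_cols = ["workclass", "education", "marital_status", "occupation"]
    # Map each text column to a Double-valued index
    indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx") for c in text_cols]

    # Assemble the numeric columns and the indexed text columns into one vector
    assembler = VectorAssembler(
        inputCols=["age", "fnlwgt", "education_num"] + [c + "_idx" for c in text_cols],
        outputCol="features")

    features = Pipeline(stages=indexers + [assembler]).fit(df).transform(df)

If the downstream model treats features as continuous, adding a OneHotEncoder stage between the indexers and the assembler avoids implying an ordering among categories.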

Exception when using VectorAssembler in Apache Spark ML

Submitted by 十年热恋 on 2020-01-07 08:25:26
Question: I'm trying to create a VectorAssembler to build the input for a logistic regression, using the following code:

    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.mllib.linalg.{Vector, Vectors, VectorUDT}

    val assembler = new VectorAssembler()
      .setInputCols(flattenedPath.columns.diff(Seq("userid", "Conversion")))
      .setOutputCol("features")
    val output = assembler.transform(flattenedPath)
    println(output.select("features", "Conversion").first())

I'm …
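The exception text is cut off above, so this is only a guess at the cause, but two common failure modes with this exact snippet are (a) mixing the old org.apache.spark.mllib.linalg imports with the DataFrame-based spark.ml API, which in Spark 2.x expects org.apache.spark.ml.linalg vectors, and (b) passing non-numeric input columns, which VectorAssembler rejects. A PySpark rendering of the same steps, keeping everything in the ml namespace (flattenedPath is the asker's DataFrame):

    from pyspark.ml.feature import VectorAssembler
    # With the DataFrame API, use pyspark.ml.linalg, not pyspark.mllib.linalg

    # All remaining input columns must be numeric, boolean, or vector-typed
    assembler = VectorAssembler(
        inputCols=[c for c in flattenedPath.columns
                   if c not in ("userid", "Conversion")],
        outputCol="features")

    output = assembler.transform(flattenedPath)
    print(output.select("features", "Conversion").first())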

Using Spark ML Pipelines just for Transformations

Submitted by 谁说我不能喝 on 2020-01-06 03:34:32
Question: I am working on a project where configurable pipelines and lineage tracking of alterations to Spark DataFrames are both essential. The endpoints of this pipeline are usually just modified DataFrames (think of it as an ETL task). What made the most sense to me was to leverage the existing Spark ML Pipeline API to track these alterations. In particular, the alterations (adding columns based on others, etc.) are implemented as custom Spark ML Transformers. However, we are now having an …
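A minimal sketch of the custom-Transformer idea in PySpark (the column names and input_df are invented for illustration): subclassing Transformer and implementing _transform is enough for a stage that only alters a DataFrame, and such stages compose in a Pipeline like any other.

    from pyspark.ml import Pipeline, Transformer
    from pyspark.sql import functions as F

    class AddTotalColumn(Transformer):
        """Derives a 'total' column from two hypothetical input columns."""
        def _transform(self, df):
            return df.withColumn("total", F.col("price") * F.col("quantity"))

    # Transformer-only pipelines work: fit() has nothing to estimate,
    # and transform() applies each stage in order
    etl = Pipeline(stages=[AddTotalColumn()]).fit(input_df)
    result = etl.transform(input_df)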

XGBoost Spark One Model Per Worker Integration

Submitted by 走远了吗. on 2020-01-05 04:08:11
Question: I'm trying to work through this notebook: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1526931011080774/3624187670661048/latest.html, using Spark 2.4.3 and XGBoost 0.90. I keep getting the error ValueError: bad input shape () when trying to execute:

    features = inputTrainingDF.select("features").collect()
    lables = inputTrainingDF.select("label").collect()
    X = np.asarray(map(lambda v: v[0].toArray(), features))
    Y = np …
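The snippet is truncated, but one known Python 3 pitfall in exactly this code is that map() returns a lazy iterator, and np.asarray() on an iterator produces a 0-dimensional object array, which then surfaces as "bad input shape ()". A sketch of the fix, materializing the rows first:

    import numpy as np

    features = inputTrainingDF.select("features").collect()
    labels = inputTrainingDF.select("label").collect()

    # A list comprehension (or list(map(...))) forces evaluation,
    # so numpy sees a real 2-D sequence rather than an iterator object
    X = np.asarray([row[0].toArray() for row in features])
    Y = np.asarray([row[0] for row in labels])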

How to load a word2vec model and call its functions inside the mapper

Submitted by 半城伤御伤魂 on 2020-01-04 03:54:45
Question: I want to load a word2vec model and evaluate it by executing word analogy tasks (e.g., a is to b as c is to what?). To do this, I first load my w2v model:

    model = Word2VecModel.load(spark.sparkContext, str(sys.argv[1]))

and then call the mapper to evaluate the model:

    rdd_lines = spark.read.text("questions-words.txt").rdd.map(getAnswers)

The getAnswers function reads one line at a time from questions-words.txt, where each line contains the question and the answer used to evaluate my model …
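A common stumbling block here is that the loaded model holds a reference to the SparkContext, so it cannot be called from inside rdd.map(). One workaround, sketched below under the assumption that getVectors() yields plain word-to-vector pairs and that each line holds exactly four words, is to pull the vectors to the driver, broadcast them, and do the analogy arithmetic inside the mapper:

    import sys
    import numpy as np
    from pyspark.mllib.feature import Word2VecModel

    model = Word2VecModel.load(spark.sparkContext, str(sys.argv[1]))

    # Broadcast a plain dict instead of the model itself
    vectors = {w: np.array(v) for w, v in model.getVectors().items()}
    bc = spark.sparkContext.broadcast(vectors)

    def getAnswers(row):
        a, b, c, expected = row[0].split()   # "a is to b as c is to expected"
        vecs = bc.value
        target = vecs[b] - vecs[a] + vecs[c]
        # Nearest word by cosine similarity, excluding the question words
        best = max((w for w in vecs if w not in (a, b, c)),
                   key=lambda w: float(np.dot(vecs[w], target)) /
                                 (np.linalg.norm(vecs[w]) * np.linalg.norm(target) + 1e-9))
        return expected, best

    rdd_lines = spark.read.text("questions-words.txt").rdd.map(getAnswers)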

Spark Model to use in Java Application

Submitted by 岁酱吖の on 2020-01-03 13:59:31
Question: For analysis: I know we can use the save function and load the model in a Spark application, but that works only within Spark applications (Java, Scala, Python). We can also use PMML to export the model to other types of application. Is there any way to use a Spark model in a Java application?

Answer 1: I am one of the creators of MLeap. Check us out; it is meant for exactly your use case. If there is a transformer you need that is not currently supported, get in touch with me and we will get it in …
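For reference, the save/load path the asker mentions looks like the sketch below (model is assumed to be an already-fitted PipelineModel); the catch, and the reason PMML and MLeap exist, is that load() still needs Spark on the classpath and a live session, so it does not help a plain Java application:

    from pyspark.ml import PipelineModel

    # Persist the fitted model from the training job
    model.write().overwrite().save("/models/my_pipeline")

    # Reloading requires a running SparkSession
    reloaded = PipelineModel.load("/models/my_pipeline")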

Getting wrong recommendations with ALS.recommendation

Submitted by 放肆的年华 on 2020-01-03 11:56:06
Question: I wrote a Spark program for making recommendations using the ALS.recommendation library, and I ran a small test with the following dataset, called trainData:

    (u1, m1, 1)
    (u1, m4, 1)
    (u2, m2, 1)
    (u2, m3, 1)
    (u3, m1, 1)
    (u3, m3, 1)
    (u3, m4, 1)
    (u4, m3, 1)
    (u4, m4, 1)
    (u5, m2, 1)
    (u5, m4, 1)

The first column contains the user, the second the item rated by that user, and the third the rating. In my Scala code I trained the model using:

    myModel = ALS.trainImplicit …
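The question is cut off before the actual predictions, but one frequent source of "wrong" results with trainImplicit is reading its output as reconstructed ratings: with implicit feedback, ALS factorizes a confidence matrix, so predictions are relative preference scores (often well below 1.0), not the original 1s. A PySpark sketch of the same experiment, with the ids mapped to integers:

    from pyspark.mllib.recommendation import ALS, Rating

    # The (user, item, rating) triples from the question, ids mapped u1->1, m1->1, ...
    ratings = spark.sparkContext.parallelize([
        Rating(1, 1, 1.0), Rating(1, 4, 1.0),
        Rating(2, 2, 1.0), Rating(2, 3, 1.0),
        Rating(3, 1, 1.0), Rating(3, 3, 1.0), Rating(3, 4, 1.0),
        Rating(4, 3, 1.0), Rating(4, 4, 1.0),
        Rating(5, 2, 1.0), Rating(5, 4, 1.0),
    ])

    model = ALS.trainImplicit(ratings, rank=10, iterations=10, alpha=0.01)
    # Scores are for ranking items per user; don't interpret them as ratings
    print(model.recommendProducts(1, 2))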

Tagging columns as Categorical in Spark

Submitted by 五迷三道 on 2020-01-02 10:18:34
Question: I am currently using StringIndexer to convert many columns into unique integers for classification with a RandomForestModel, and I am using a pipeline for the ML process. Some questions: How does the RandomForestModel know which columns are categorical? StringIndexer converts non-numerical values to numerical ones, but does it add metadata of some sort to indicate that the result is a categorical column? In mllib.tree.RF there was a parameter called categoricalFeaturesInfo that indicated which columns are categorical.
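To the metadata question: yes, StringIndexer attaches nominal-attribute metadata to its output column, VectorAssembler carries it into the assembled vector, and the tree learners in spark.ml read it to decide which features are categorical, replacing mllib's explicit categoricalFeaturesInfo map. A quick way to see this (df and the column names are hypothetical):

    from pyspark.ml.feature import StringIndexer

    indexed = (StringIndexer(inputCol="color", outputCol="color_idx")
               .fit(df).transform(df))

    # The ml_attr entry marks the column as nominal, i.e. categorical
    print(indexed.schema["color_idx"].metadata)
    # e.g. {'ml_attr': {'type': 'nominal', 'vals': [...], 'name': 'color_idx'}}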

Spark MLLib 2.0 Categorical Features in pipeline

Submitted by 删除回忆录丶 on 2020-01-02 07:46:28
Question: I'm trying to build a decision tree from log files. Some feature sets are large, containing thousands of unique values, and I'm trying to use the new pipeline and DataFrame idioms in Java. I've built a pipeline with a StringIndexer stage for each of the categorical feature columns, followed by a VectorAssembler to create the features vector. The resulting data frame looks perfect to me after the VectorAssembler stage. My pipeline looks approximately like StringIndexer -> …
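A sketch of that pipeline shape in PySpark (the log-field names and train_df are invented). With thousands of unique values per feature, the usual trap is maxBins: tree learners require it to be at least as large as the biggest category count, so it often has to be raised explicitly:

    from pyspark.ml import Pipeline
    from pyspark.ml.classification import DecisionTreeClassifier
    from pyspark.ml.feature import StringIndexer, VectorAssembler

    cat_cols = ["status", "method", "user_agent"]      # hypothetical log fields
    indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx") for c in cat_cols]
    assembler = VectorAssembler(inputCols=[c + "_idx" for c in cat_cols],
                                outputCol="features")

    # maxBins must cover the largest number of distinct categories in any feature,
    # otherwise training fails with an IllegalArgumentException
    tree = DecisionTreeClassifier(labelCol="label", featuresCol="features",
                                  maxBins=4096)

    model = Pipeline(stages=indexers + [assembler, tree]).fit(train_df)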