prediction | 易学教程

Spark ML examples

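I. Spark ML pipelines and machine learning

Building a typical machine learning application involves several steps:

1. ETL of the source data
2. Data preprocessing
3. Feature selection
4. Model training and validation

These four steps can be abstracted into a multi-stage pipeline that runs from data collection to the final result we need. Modeling the steps as a pipeline workflow is therefore feasible, and for users doing machine learning on Spark, a pipeline is more efficient and easier to use than modeling each step independently.

Inspired by the scikit-learn project, and drawing on MLlib's shortcomings when handling complex machine learning problems (mainly tedious work and an unclear process), Spark ML pipelines aim to give users a higher-level API library built on top of DataFrame, making it easier to build complex machine learning workflow applications.

Structurally, a pipeline contains one or more Stages, and each Stage performs one task, such as transforming a dataset, training a model, setting parameters, or making predictions. ML defines and implements such Stages according to the type of problem they handle. The two main kinds of stage are Transformer and Estimator. A Transformer operates on one DataFrame and produces another DataFrame; an SVM model or a feature extraction tool, for example, can each be abstracted as a Transformer. An Estimator is mainly used to fit a model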
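The Transformer and Estimator roles described above can be illustrated with a small library-free sketch. Everything here is an assumption made for illustration: the classes `Scale`, `SlopeEstimator`, `SlopeModel`, `Pipeline`, and `PipelineModel` are toy stand-ins operating on lists of dicts, not Spark's actual classes or DataFrames, and the data is made up.

```python
# Toy sketch only: hypothetical classes that mimic the Transformer/Estimator
# idea on plain lists of dicts. This is NOT the Spark ML API.

class Scale:
    """Transformer: takes rows, returns new rows with 'x' rescaled."""
    def __init__(self, factor):
        self.factor = factor

    def transform(self, rows):
        return [{**row, "x": row["x"] * self.factor} for row in rows]


class SlopeModel:
    """Fitted model; itself a Transformer that adds a 'prediction' field."""
    def __init__(self, slope):
        self.slope = slope

    def transform(self, rows):
        return [{**row, "prediction": self.slope * row["x"]} for row in rows]


class SlopeEstimator:
    """Estimator: fit() learns from rows and returns a Transformer."""
    def fit(self, rows):
        # Least-squares slope for a line through the origin.
        num = sum(r["x"] * r["label"] for r in rows)
        den = sum(r["x"] ** 2 for r in rows)
        return SlopeModel(num / den)


class PipelineModel:
    """Result of fitting a pipeline: a chain of Transformers."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Runs stages in order; Estimators are fitted on the data seen so far."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):       # Estimator: replace by its model
                stage = stage.fit(rows)
            fitted.append(stage)
            rows = stage.transform(rows)    # feed transformed rows onward
        return PipelineModel(fitted)


train = [{"x": 2.0, "label": 1.0}, {"x": 4.0, "label": 2.0}, {"x": 6.0, "label": 3.0}]
model = Pipeline([Scale(0.5), SlopeEstimator()]).fit(train)
predictions = model.transform([{"x": 8.0}])
```

Fitting the pipeline returns an object that contains only Transformers, so the same preprocessing is applied to new data before the fitted model makes its prediction.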

Predict next event occurrence, based on past occurrences

I'm looking for an algorithm or example material to study for predicting future events based on known patterns. Perhaps there is a name for this, and I just don't know or remember it. Something this general may not exist, but I'm not a master of math or algorithms, so I'm here asking for direction. An example, as I understand it, would be something like this: a static event occurs on January 1st, February 1st, March 3rd, and April 4th. A simple solution would be to average the days/hours/minutes/something between each occurrence, add that number to the last known occurrence, and have the prediction.
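The "simple solution" in the question, averaging the gaps and adding the mean to the last occurrence, might look like the sketch below. The function name `predict_next` is hypothetical, and the year 2023 is an assumption because the question gives only months and days.

```python
from datetime import datetime, timedelta


def predict_next(occurrences):
    """Predict the next occurrence as: last occurrence + mean gap."""
    times = sorted(occurrences)
    if len(times) < 2:
        raise ValueError("need at least two occurrences to measure a gap")
    # Gaps between consecutive occurrences, as timedelta objects.
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    mean_gap = sum(gaps, timedelta()) / len(gaps)
    return times[-1] + mean_gap


# Dates from the question; the year is assumed.
events = [
    datetime(2023, 1, 1),
    datetime(2023, 2, 1),
    datetime(2023, 3, 3),
    datetime(2023, 4, 4),
]
next_event = predict_next(events)
```

With these dates the gaps are 31, 30, and 32 days, so the mean gap is 31 days and the predicted date is May 5. This baseline ignores trends or seasonality in the gaps; it is only the averaging idea from the question.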

Pybrain time series prediction using LSTM recurrent nets

I have a question about using pybrain to do regression on a time series. I plan to use the LSTM layer in pybrain to train on and predict a time series. I found example code in the question linked below: "Request for example: Recurrent neural network for predicting next value in a sequence". In that example, the network is able to predict a sequence after it has been trained. But the issue is that the network takes in all the sequential data by feeding it in one go to the input

xgboost in R: how does xgb.cv pass the optimal parameters into xgb.train

I've been exploring the xgboost package in R and went through several demos as well as tutorials, but this still confuses me: after using xgb.cv to do cross validation, how do the optimal parameters get passed to xgb.train? Or should I calculate the ideal parameters (such as nround, max.depth) based on the output of xgb.cv?

param <- list("objective" = "multi:softprob", "eval_metric" = "mlogloss", "num_class" = 12)
cv.nround <- 11
cv.nfold <- 5
mdcv <- xgb.cv(data=dtrain, params = param, nthread=6, nfold = cv.nfold, nrounds = cv.nround, verbose = T)
md <-xgb.train(data=dtrain,params = param

Adding statsmodels 'predict' results to a Pandas dataframe

It is common to want to append the results of predictions to the dataset used to make the predictions, but the statsmodels predict function returns (non-indexed) results of a potentially different length than the dataset on which predictions are based. For example, if the test dataset, test, contains any null entries, then

mod_fit = sm.Logit.from_formula('Y ~ A + B + C', train).fit()
preds = mod_fit.predict(test)

will produce an array that is shorter than the length of test, and cannot be usefully appended with

test['preds'] = preds

And since the result of predict is not indexed, there is no way

ValueError: Wrong number of items passed - Meaning and suggestions?

I am receiving the error "ValueError: Wrong number of items passed 3, placement implies 1", and I am struggling to figure out where and how I may begin addressing the problem. I don't really understand the meaning of the error, which is making it difficult for me to troubleshoot. I have also included the block of code that is triggering the error in my Jupyter Notebook. The data is tough to attach, so I am not looking for anyone to try and re-create this error for me. I am just looking for some feedback on how I could address this error. KeyError Traceback (most recent call last) C:\Users

What is the difference between lm(offense$R ~ offense$OBP) and lm(R ~ OBP)?

I am trying to use R to create a linear model and use that to predict some values. The subject matter is baseball stats. If I do this:

obp <- lm(offense$R ~ offense$OBP)
predict(obp, newdata=data.frame(OBP=0.5), interval="predict")

I get the error:

Warning message: 'newdata' had 1 row but variables found have 20 rows.

However, if I do this:

attach(offense)
obp <- lm(R ~ OBP)
predict(obp, newdata=data.frame(OBP=0.5), interval="predict")

It works as expected and I get one result. What is the

Predicting a multiple forward time step of a time series using LSTM

I want to predict certain values that are weekly predictable (low SNR). I need to predict the whole time series of a year, formed by the weeks of the year (52 values - Figure 1). My first idea was to develop a many-to-many LSTM model (Figure 2) using Keras over TensorFlow. I'm training the model with a 52-input layer (the given time series of the previous year) and a 52-output prediction layer (the time series of the next year). The shape of train_X is (X_examples, 52, 1); in other words, X_examples to train, with 52 timesteps of 1 feature each. I understand that Keras will consider the 52 inputs as a time

Prediction using Recurrent Neural Network on Time series dataset

Description: Given a dataset that has 10 sequences, where a sequence corresponds to a day of stock value recordings, each sequence consists of 50 sample recordings of stock values separated by 5-minute intervals, starting in the morning at 9:05 am. However, there is one extra recording (the 51st sample), available only in the training set, which is taken 2 hours, not 5 minutes, after the last of the 50 sample recordings. That 51st sample must be predicted for the testing set, where the first 50 samples are also given. I am using the pybrain recurrent neural

Keep same dummy variable in training and testing data

I am building a prediction model in Python with two separate training and testing sets. The training data contains numeric categorical variables, e.g., zip code, [91521, 23151, 12355, ...], and also string categorical variables, e.g., city, ['Chicago', 'New York', 'Los Angeles', ...]. To train the model, I first use 'pd.get_dummies' to get dummy variables for these variables, and then fit the model with the transformed training data. I apply the same transformation to my test data and predict the result using the trained model. However, I got the error 'ValueError: Number of features of the
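One common way to keep the dummy columns identical across both sets is to learn the category list from the training data only and reuse it for the test data. The sketch below shows that idea in plain Python rather than with pandas; the helper names `fit_categories` and `one_hot` are hypothetical, and the unseen test value 'Boston' is made up for illustration.

```python
def fit_categories(values):
    """Learn a fixed, ordered category list from the training data only."""
    return sorted(set(values))


def one_hot(values, categories):
    """Encode values against a fixed category list.

    Every row has len(categories) columns, so training and test data
    always share the same layout. Unseen values become all-zero rows.
    """
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        if value in index:
            row[index[value]] = 1
        rows.append(row)
    return rows


train_city = ["Chicago", "New York", "Los Angeles", "Chicago"]
test_city = ["New York", "Boston"]  # 'Boston' is a made-up unseen category

categories = fit_categories(train_city)   # learned once, from training data
train_X = one_hot(train_city, categories)
test_X = one_hot(test_city, categories)   # same columns as train_X
```

Because the test rows are encoded against the training categories, the number of features cannot drift between fitting and prediction. How unseen categories should be handled (all zeros here) is a design choice that depends on the model.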
