prediction

An example of spark ml

核能气质少年 submitted on 2019-11-28 22:34:43
1. On the Spark ML pipeline and machine learning. A typical machine-learning build comprises several stages: 1. source-data ETL; 2. data preprocessing; 3. feature selection; 4. model training and validation. These four steps can be abstracted as a multi-step, pipeline-style workflow that runs from data collection to the final output we need. Modeling these steps as a single pipeline is therefore feasible, and for users doing machine learning on Spark, a pipeline-style workflow is more efficient and easier to use than modeling each step independently. Inspired by the scikit-learn project, and having taken stock of MLlib's shortcomings on complex machine-learning problems (mainly tedious work and unclear workflows), Spark ML aims to offer users a higher-level API built on DataFrames, making it easier to assemble complex machine-learning workflow applications. A pipeline structurally contains one or more Stages, each of which completes one task, such as dataset transformation, model training, parameter setting, or prediction; ML defines and implements such Stages for the different kinds of problems it handles. The two main kinds of Stage are the Transformer and the Estimator. A Transformer operates on one DataFrame and produces another DataFrame; for example, an SVM model or a feature-extraction tool can each be abstracted as a Transformer. An Estimator is used for model fitting.
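The Transformer/Estimator/Pipeline pattern described above can be sketched in a few lines of plain Python. This is an illustrative toy, not the actual pyspark.ml API; all class names below are made up, and the real Spark ML stages operate on DataFrames rather than lists.

```python
# Toy sketch of the pipeline pattern: Transformers map data to data,
# Estimators are fit on data and yield a Transformer (a fitted model).

class Transformer:
    """A stage that maps one dataset to another."""
    def transform(self, rows):
        raise NotImplementedError

class Estimator:
    """A stage that is fit on data and produces a Transformer."""
    def fit(self, rows):
        raise NotImplementedError

class Scaler(Transformer):
    def __init__(self, factor):
        self.factor = factor
    def transform(self, rows):
        return [x * self.factor for x in rows]

class MeanCenterer(Estimator):
    def fit(self, rows):
        mean = sum(rows) / len(rows)
        class Centered(Transformer):
            def transform(self, inner_rows):
                return [x - mean for x in inner_rows]
        return Centered()

class Pipeline:
    def __init__(self, stages):
        self.stages = stages
    def fit(self, rows):
        fitted, data = [], rows
        for stage in self.stages:
            # fitting an Estimator stage yields a Transformer; a Transformer
            # stage is passed through unchanged
            model = stage.fit(data) if isinstance(stage, Estimator) else stage
            data = model.transform(data)
            fitted.append(model)
        return fitted

models = Pipeline([Scaler(2.0), MeanCenterer()]).fit([1.0, 2.0, 3.0])
out = [1.0, 2.0, 3.0]
for m in models:
    out = m.transform(out)
print(out)  # scaled by 2, then centered around the training mean 4.0
```

In actual Spark ML, `Pipeline.fit` returns a `PipelineModel` whose `transform` applies all fitted stages in order, which is what the loop at the end imitates.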

Predict next event occurrence, based on past occurrences

不想你离开。 submitted on 2019-11-28 18:51:12
I'm looking for an algorithm or example material to study for predicting future events based on known patterns. Perhaps there is a name for this, and I just don't know or remember it. Something this general may not exist, but I'm not a master of math or algorithms, so I'm here asking for direction. An example, as I understand it, would be something like this: a static event occurs on January 1st, February 1st, March 3rd, and April 4th. A simple solution would be to average the days/hours/minutes/whatever between each occurrence, add that interval to the last known occurrence, and take that as the prediction.
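The simple averaging approach the question describes can be sketched directly with the stdlib, using the question's own example dates:

```python
# Predict the next occurrence as: last occurrence + mean gap between
# past occurrences. Day-level granularity only, for simplicity.
from datetime import date, timedelta

def predict_next(occurrences):
    """Average the gaps between consecutive events and project one more."""
    gaps = [(b - a).days for a, b in zip(occurrences, occurrences[1:])]
    mean_gap = sum(gaps) / len(gaps)
    return occurrences[-1] + timedelta(days=round(mean_gap))

past = [date(2019, 1, 1), date(2019, 2, 1), date(2019, 3, 3), date(2019, 4, 4)]
print(predict_next(past))  # gaps are 31, 30, 32 days -> mean 31 -> 2019-05-05
```

This is essentially a degenerate moving-average forecast; the usual names to search for are "time series forecasting", "moving average", and "exponential smoothing".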

Pybrain time series prediction using LSTM recurrent nets

二次信任 submitted on 2019-11-28 17:46:10
I have a question in mind which relates to the usage of pybrain to do regression of a time series. I plan to use the LSTM layer in pybrain to train and predict a time series. I found an example in the link below: Request for example: Recurrent neural network for predicting next value in a sequence. In the example above, the network is able to predict a sequence after it has been trained. But the issue is that the network takes in all the sequential data by feeding it in one go to the input…
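The alternative to feeding the whole series at once is to slice it into (window, next value) pairs and present them one at a time. This is a generic windowing sketch in plain Python, not pybrain API:

```python
# Turn a flat series into supervised training pairs: each window of
# `window` consecutive values is paired with the value that follows it.

def make_windows(series, window):
    """Return a list of (input_window, target_next_value) pairs."""
    pairs = []
    for i in range(len(series) - window):
        pairs.append((series[i:i + window], series[i + window]))
    return pairs

pairs = make_windows([1, 2, 3, 4, 5, 6], window=3)
print(pairs)  # [([1, 2, 3], 4), ([2, 3, 4], 5), ([3, 4, 5], 6)]
```

With an LSTM the window can instead be fed one timestep at a time, letting the recurrent state carry the history, but the pair construction above is the same either way.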

xgboost in R: how does xgb.cv pass the optimal parameters into xgb.train

寵の児 submitted on 2019-11-28 15:44:31
I've been exploring the xgboost package in R and went through several demos as well as tutorials, but this still confuses me: after using xgb.cv to do cross-validation, how do the optimal parameters get passed to xgb.train? Or should I calculate the ideal parameters (such as nround, max.depth) based on the output of xgb.cv? param <- list("objective" = "multi:softprob", "eval_metric" = "mlogloss", "num_class" = 12) cv.nround <- 11 cv.nfold <- 5 mdcv <- xgb.cv(data = dtrain, params = param, nthread = 6, nfold = cv.nfold, nrounds = cv.nround, verbose = T) md <- xgb.train(data = dtrain, params = param…
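The short answer is that xgb.cv passes nothing automatically: you read the best boosting round off its evaluation log and supply it to xgb.train yourself. The selection step is just an argmin, sketched here in plain Python (the nested dicts stand in for xgb.cv's evaluation log; the numbers are made up, not real output):

```python
# Pick the round with the lowest cross-validated test metric; that round
# count is what you would then pass as nrounds to xgb.train.

cv_log = [  # one entry per boosting round (illustrative values)
    {"round": 1, "test-mlogloss-mean": 1.90},
    {"round": 2, "test-mlogloss-mean": 1.40},
    {"round": 3, "test-mlogloss-mean": 1.25},
    {"round": 4, "test-mlogloss-mean": 1.31},  # metric worsens: overfitting
]

best = min(cv_log, key=lambda row: row["test-mlogloss-mean"])
best_nround = best["round"]
print(best_nround)  # -> 3
```

Parameters like max.depth are not chosen by xgb.cv at all; to tune those you run xgb.cv once per candidate setting (a grid search) and keep the setting whose best round has the lowest test metric.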

Adding statsmodels 'predict' results to a Pandas dataframe

十年热恋 submitted on 2019-11-28 12:57:32
It is common to want to append the results of predictions to the dataset used to make the predictions, but the statsmodels predict function returns (non-indexed) results of a potentially different length than the dataset on which predictions are based. For example, if the test dataset, test, contains any null entries, then mod_fit = sm.Logit.from_formula('Y ~ A + B + C', train).fit() preds = mod_fit.predict(test) will produce an array that is shorter than the length of test, and cannot be usefully appended with test['preds'] = preds. And since the result of predict is not indexed, there is no way…
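A common workaround is to predict only on the rows the model will actually keep (those without nulls) and rejoin by index, so the shorter result lands on the right rows. A sketch with pandas, using a toy doubling function in place of mod_fit.predict:

```python
# Keep the index of the complete rows, wrap predictions in a Series with
# that index, and let pandas align on assignment; dropped rows get NaN.
import numpy as np
import pandas as pd

test = pd.DataFrame({"A": [1.0, np.nan, 3.0, 4.0]}, index=[10, 11, 12, 13])

complete = test.dropna()                      # the rows predict() would use
preds = pd.Series(complete["A"] * 2,          # stand-in for the model's output
                  index=complete.index)

test["preds"] = preds                         # index-aligned; row 11 gets NaN
print(test)
```

The key point is that the assignment aligns on the index rather than on position, which is exactly what a raw, non-indexed array cannot do.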

ValueError: Wrong number of items passed - Meaning and suggestions?

微笑、不失礼 submitted on 2019-11-28 08:54:43
I am receiving the error ValueError: Wrong number of items passed 3, placement implies 1, and I am struggling to figure out where, and how, I may begin addressing the problem. I don't really understand the meaning of the error, which makes it difficult for me to troubleshoot. I have also included the block of code that is triggering the error in my Jupyter Notebook. The data is tough to attach, so I am not looking for anyone to try and re-create this error for me; I am just looking for some feedback on how I could address it. KeyError Traceback (most recent call last) C:\Users…
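"Wrong number of items passed 3, placement implies 1" generally means a 3-column block of values is being forced into a single column, for example assigning a 2-D, 3-column array to df["col"]. A minimal reproduction (the exact message varies across pandas versions, so only the ValueError itself is checked here):

```python
# Assigning a (rows, 3) array to a single DataFrame column: pandas has
# 3 columns of values but only 1 column of placement, hence the error.
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1, 2]})

raised = False
try:
    df["y"] = np.ones((2, 3))   # 3 columns passed, one column expected
except ValueError as e:
    raised = True
    msg = str(e)
print(raised, msg)
```

The fix is usually to assign one column at a time, or to check the shape of whatever expression is on the right-hand side of the failing assignment.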

What is the difference between lm(offense$R ~ offense$OBP) and lm(R ~ OBP)?

我怕爱的太早我们不能终老 submitted on 2019-11-28 07:42:41
I am trying to use R to create a linear model and use that to predict some values. The subject matter is baseball stats. If I do this: obp <- lm(offense$R ~ offense$OBP) predict(obp, newdata=data.frame(OBP=0.5), interval="predict") I get the error: Warning message: 'newdata' had 1 row but variables found have 20 rows. However, if I do this: attach(offense) obp <- lm(R ~ OBP) predict(obp, newdata=data.frame(OBP=0.5), interval="predict") it works as expected and I get one result. What is the…
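The difference is the variable name the fitted model remembers. With lm(offense$R ~ offense$OBP), the predictor is recorded under the name "offense$OBP", so newdata's column "OBP" never matches and predict falls back to the 20 training rows; with the bare name "OBP" (via attach, or better, lm(R ~ OBP, data=offense)), the lookup succeeds. A toy name-resolution sketch of that mechanism in plain Python (this is an analogy, not R semantics):

```python
# Pretend model: stores the predictor's *name* from the formula, plus a
# fixed slope of 2.0 and the training values, to imitate lm's fallback.

def fit(formula_var, training_values):
    return {"var": formula_var, "slope": 2.0, "train": training_values}

def predict(model, newdata):
    col = newdata.get(model["var"])
    if col is None:                 # name not found: fall back to training data
        col = model["train"]
    return [model["slope"] * x for x in col]

m_bad = fit("offense$OBP", [0.30, 0.35])   # name baked in with the prefix
m_good = fit("OBP", [0.30, 0.35])          # bare name resolves against newdata

newdata = {"OBP": [0.5]}
print(predict(m_bad, newdata))   # ignores newdata: one value per training row
print(predict(m_good, newdata))  # matches newdata: a single prediction
```

In R itself, the idiomatic fix is the data= argument rather than attach(), since attach() can leave stale copies of variables in the search path.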

Predicting a multiple forward time step of a time series using LSTM

安稳与你 submitted on 2019-11-28 06:01:35
I want to predict certain values that are weekly predictable (low SNR). I need to predict the whole time series of a year, formed by the weeks of the year (52 values - Figure 1). My first idea was to develop a many-to-many LSTM model (Figure 2) using Keras over TensorFlow. I'm training the model with a 52-step input (the given time series of the previous year) and a 52-step predicted output (the time series of the next year). The shape of train_X is (X_examples, 52, 1); in other words, X_examples to train, with 52 timesteps of 1 feature each. I understand that Keras will consider the 52 inputs as a time…
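The shapes involved can be sketched without Keras: each training example is a full year of 52 weekly values, reshaped to (examples, timesteps, features) as an LSTM layer expects. Pure numpy, with made-up array contents:

```python
# Build the (N, 52, 1) input tensor and (N, 52) target matrix described
# above from a flat weekly series; the values are synthetic placeholders.
import numpy as np

n_examples, n_weeks = 4, 52
years = np.arange(n_examples * n_weeks, dtype=float)   # flat weekly series
train_X = years.reshape(n_examples, n_weeks, 1)        # (examples, 52 steps, 1 feature)
train_Y = train_X[:, :, 0] + 1.0                       # stand-in next-year targets

print(train_X.shape, train_Y.shape)  # (4, 52, 1) (4, 52)
```

With shapes like these, a many-to-many Keras model would use an LSTM with return_sequences=True so that every one of the 52 timesteps emits an output, rather than only the last.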

Prediction using Recurrent Neural Network on Time series dataset

你说的曾经没有我的故事 submitted于 2019-11-28 03:52:16
Description Given a dataset that has 10 sequences - a sequence corresponds to a day of stock-value recordings - where each consists of 50 sample recordings of stock values separated by 5-minute intervals, starting in the morning at 9:05 am. However, there is one extra recording (the 51st sample) that is only available in the training set, taken 2 hours, not 5 minutes, after the last of the 50 samples. That 51st sample is required to be predicted for the testing set, where the first 50 samples are also given. I am using the pybrain recurrent neural…
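The timing layout in that description can be checked with a quick stdlib sketch: 50 samples at 5-minute intervals starting at 9:05 am, plus a 51st target 2 hours after the last one. The stock values themselves are synthetic placeholders:

```python
# Lay out one day's sample times as described and locate the 51st target.
from datetime import datetime, timedelta

start = datetime(2019, 1, 2, 9, 5)                       # date is arbitrary
times = [start + timedelta(minutes=5 * i) for i in range(50)]
target_time = times[-1] + timedelta(hours=2)             # the 51st sample

inputs = [100.0 + 0.1 * i for i in range(50)]            # placeholder values
print(times[0].strftime("%H:%M"),    # 09:05, first recording
      times[-1].strftime("%H:%M"),   # 13:10, last of the 50 inputs
      target_time.strftime("%H:%M")) # 15:10, the value to predict
```

So each training sequence is a (50-value input, 1 target) pair; the 2-hour gap only changes what the target means, not how the pair is fed to the network.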

Keep same dummy variable in training and testing data

佐手、 submitted on 2019-11-27 17:52:07
I am building a prediction model in Python with two separate training and testing sets. The training data contains numerical categorical variables, e.g., zip codes [91521, 23151, 12355, ...], and also string categorical variables, e.g., city ['Chicago', 'New York', 'Los Angeles', ...]. To train the model, I first use pd.get_dummies to get dummy variables for these columns, and then fit the model with the transformed training data. I do the same transformation on my test data and predict the result using the trained model. However, I got the error 'ValueError: Number of features of the…
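The usual fix for that mismatch is to build the dummy columns on the training set and then force the test dummies into the same schema with reindex, filling categories unseen in test with 0 (and dropping test-only categories). A sketch with toy city data:

```python
# Align test-set dummies to the training-set dummy columns so both
# matrices have identical width and column order.
import pandas as pd

train = pd.DataFrame({"city": ["Chicago", "New York", "Los Angeles"]})
test = pd.DataFrame({"city": ["Chicago", "Houston"]})   # Houston unseen in train

train_d = pd.get_dummies(train, columns=["city"])
test_d = pd.get_dummies(test, columns=["city"])

# force the test matrix into the training schema
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)

print(list(test_d.columns))  # same columns as train_d; Houston row is all zeros
```

For the zip codes, cast them to strings before calling get_dummies so they are treated as categories rather than numbers; and note that the Houston row becoming all zeros silently discards that category, which is worth handling explicitly if unseen categories matter to the model.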