data-science

Difference between random forest implementation

Submitted by ぃ、小莉子 on 2019-12-10 12:21:20
Question: Is there a performance difference between the implementation of Random Forest in H2O and the standard Random Forest library? Has anybody performed any analysis of these two implementations?

Answer 1: Here is an open benchmark you can start with: https://github.com/szilard/benchm-ml

Answer 2: I suppose you are looking for this: http://www.wise.io/tech/benchmarking-random-forest-part-1

Source: https://stackoverflow.com/questions/45190787/difference-between-random-forest-implementation
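As a starting point for such a comparison, here is a minimal timing sketch, assuming the h2o and scikit-learn packages are installed; the synthetic dataset, tree counts, and all variable names are illustrative, not from the question or the linked benchmarks:

```python
import time

import h2o
import pandas as pd
from h2o.estimators.random_forest import H2ORandomForestEstimator
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary-classification data shared by both implementations.
X, y = make_classification(n_samples=10000, n_features=20, random_state=0)
cols = [f"f{i}" for i in range(20)]

# scikit-learn random forest.
t0 = time.time()
RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0).fit(X, y)
print(f"sklearn fit: {time.time() - t0:.2f}s")

# H2O random forest on the same data.
h2o.init()
frame = h2o.H2OFrame(pd.DataFrame(X, columns=cols).assign(target=y))
frame["target"] = frame["target"].asfactor()  # binary label as a factor
t0 = time.time()
H2ORandomForestEstimator(ntrees=100, seed=0).train(
    x=cols, y="target", training_frame=frame
)
print(f"h2o fit: {time.time() - t0:.2f}s")
```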

Pandas multiindex dataframe - Selecting max from one index within multiindex

Submitted by 这一生的挚爱 on 2019-12-10 10:09:22
Question: I've got a dataframe with a multi-index of Year and Month, like the following:

```
             Value
Year  Month
1992  1      3
      2      5
      3      8
      ...    ...
1993  1      2
      ...    ...
```

I'm trying to select the maximum Value for each year and put that in a DataFrame like this:

```
       Max
Year
1992   5
1993   2
...
```

There's not much info on multi-indexes; should I simply do a group-by and apply, or something similar, to keep it simple?

Answer 1: Exactly right: df.groupby(level=0).apply(max). In my sample DataFrame: 0 Caps …
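A runnable sketch of that approach on a small made-up MultiIndex frame (the column rename at the end is just to match the asker's desired output):

```python
import pandas as pd

# Build a small DataFrame with a (Year, Month) MultiIndex.
idx = pd.MultiIndex.from_tuples(
    [(1992, 1), (1992, 2), (1992, 3), (1993, 1)], names=["Year", "Month"]
)
df = pd.DataFrame({"Value": [3, 5, 8, 2]}, index=idx)

# Group on the outer level ("Year") and take the max per group.
result = df.groupby(level=0).max()  # same as df.groupby(level="Year").max()
result = result.rename(columns={"Value": "Max"})
print(result)
```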

Combining heuristics when ranking social network news feed items

Submitted by 痞子三分冷 on 2019-12-08 15:44:44
Question: We have a news feed, and we want to surface items to the user based on a number of criteria. Certain items will be surfaced because of factor A, others because of factor B, and yet others because of factor C. We can create an individual heuristic for each factor, but we then need to combine these heuristics in a way that promotes the best content for each factor while still giving a mix of content across factors. Our naive approach is to load the top n from each factor, take …
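One common way to combine per-factor heuristics is a weighted sum over normalized scores; a minimal sketch of that idea (the factor names, weights, items, and scores below are all invented for illustration):

```python
from typing import Dict, List, Tuple

def combine_scores(
    factor_scores: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
) -> List[Tuple[str, float]]:
    """Rank items by a weighted sum of min-max-normalized factor scores."""
    combined: Dict[str, float] = {}
    for factor, scores in factor_scores.items():
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid division by zero for constant scores
        for item, score in scores.items():
            normalized = (score - lo) / span
            combined[item] = combined.get(item, 0.0) + weights[factor] * normalized
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

# Illustrative factors A/B/C scoring three feed items.
ranking = combine_scores(
    {
        "A": {"item1": 0.9, "item2": 0.2, "item3": 0.4},
        "B": {"item1": 0.1, "item2": 0.8, "item3": 0.3},
        "C": {"item1": 0.5, "item2": 0.5, "item3": 0.9},
    },
    weights={"A": 0.5, "B": 0.3, "C": 0.2},
)
print(ranking)
```

Tuning the weights trades off per-factor quality against mix; setting them equal approximates a fair blend of the three lists.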

Improving loop performance with function call inside

Submitted by 元气小坏坏 on 2019-12-08 06:45:22
Question:

```r
library(plyr); library(sqldf); library(data.table)
library(stringi); library(RODBC)

dbhandle <- odbcDriverConnect('driver={SQL Server};server=.;database=TEST_DB;trusted_connection=true')
res <- sqlQuery(dbhandle, 'Select Company_ID, AsOfDate, CashFlow FROM dbo.Accounts')
resdatatable = as.data.table(res)
odbcCloseAll();

# Single-period present-value factor at rate i (in percent) over n periods.
sppv <- function(i, n) {
  return((1 + i / 100) ^ (-n))
}

# Net present value of cash flows x, computed once per rate in i.
npv <- function(x, i) {
  npv = c()
  for (k in 1:length(i)) {
    pvs = x * sppv(i[k], 1:length(x))
    npv = c(npv, sum(pvs))
  }
  # …
```
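The loop-heavy NPV computation vectorizes naturally. As a sketch of the idea in Python/NumPy rather than the asker's R (illustrative only), broadcasting removes both the per-period multiplication loop and the outer loop over rates:

```python
import numpy as np

def npv(cash_flows: np.ndarray, rates_pct: np.ndarray) -> np.ndarray:
    """NPV of one cash-flow series at each rate, in one broadcasted step."""
    periods = np.arange(1, len(cash_flows) + 1)            # 1..n
    # discount[j, t] = (1 + r_j / 100) ** -t, shape (len(rates), n)
    discount = (1 + rates_pct[:, None] / 100.0) ** -periods
    return discount @ cash_flows                           # one NPV per rate

print(npv(np.array([100.0, 200.0, 300.0]), np.array([5.0, 10.0])))
```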

Diagram/Graphical options to display cartesian product in Jupyter Notebook/Python/Matplotlib?

Submitted by 谁说胖子不能爱 on 2019-12-08 05:09:43
Question: I'm working with 49 options (7 rows, 7 columns). Here is an example. I'm observing what people do (position × actions) in public plazas (four in total) for a school project. I will probably show it over a large range of times: each hour from 8 AM to 8 PM, on working days and on weekend days. The main idea is to understand how the plaza is used by people. I notice the most common situations are: standing and talking, sitting and talking, sitting and reading, standing and recreation. But I found some option …
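A 7×7 heatmap is one natural way to display counts over such a position × action product; a minimal matplotlib sketch with made-up labels and random counts (the real study would substitute its own categories and observation data):

```python
import matplotlib.pyplot as plt
import numpy as np

# Illustrative labels and counts standing in for the study's 7 positions
# and 7 actions.
positions = [f"position {i}" for i in range(1, 8)]
actions = [f"action {j}" for j in range(1, 8)]
counts = np.random.default_rng(0).integers(0, 20, size=(7, 7))

fig, ax = plt.subplots()
im = ax.imshow(counts, cmap="viridis")
ax.set_xticks(range(7))
ax.set_xticklabels(actions, rotation=45, ha="right")
ax.set_yticks(range(7))
ax.set_yticklabels(positions)
# Annotate each cell with its count so the grid reads as a table too.
for i in range(7):
    for j in range(7):
        ax.text(j, i, counts[i, j], ha="center", va="center", color="white")
fig.colorbar(im, ax=ax, label="observations")
fig.tight_layout()
plt.show()
```

One such panel per hour (or per plaza) laid out in a grid would then show how usage shifts over the day.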

Get ImageNet label for a specific index in the 1000-dimensional output tensor in torch

Submitted by 守給你的承諾、 on 2019-12-08 01:50:51
Question: I have the output tensor of a forward pass of a Facebook implementation of the ResNet model on a cat image. It is a 1000-dimensional tensor with the classification probabilities. Using torch.topk I can obtain the top-5 probabilities and their indexes in the output tensor. Now I want to see the human-readable labels for those most-probable indexes. I searched online for the list of labels (which apparently are also called synsets) and only found this: http://image-net.org/challenges/LSVRC …
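One way to get an index-to-label mapping without hunting for files is torchvision's weights metadata; a sketch, assuming a recent torchvision (the weights API shown here was added in torchvision 0.13 and is not from the original question, which predates it):

```python
import torch
from torchvision.models import ResNet50_Weights, resnet50

weights = ResNet50_Weights.IMAGENET1K_V1
model = resnet50(weights=weights).eval()

# weights.meta["categories"] is the 1000-entry human-readable label list,
# ordered to match the model's output dimensions.
labels = weights.meta["categories"]

x = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed cat image
probs = model(x).softmax(dim=1)
top = probs.topk(5)
for p, idx in zip(top.values[0], top.indices[0]):
    print(f"{labels[idx]}: {p.item():.4f}")
```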

lightgbm: ValueError: Circular reference detected

Submitted by 不想你离开。 on 2019-12-07 22:02:52
Question: Train the model:

```python
import lightgbm as lgb

lgb_train = lgb.Dataset(x_train, y_train)
lgb_val = lgb.Dataset(x_test, y_test)

parameters = {
    'application': 'binary',
    'objective': 'binary',
    'metric': 'auc',
    'is_unbalance': 'true',
    'boosting': 'gbdt',
    'num_leaves': 31,
    'feature_fraction': 0.5,
    'bagging_fraction': 0.5,
    'bagging_freq': 20,
    'learning_rate': 0.05,
    'verbose': 0
}

model = lgb.train(parameters,
                  train_data,
                  valid_sets=test_data,
                  num_boost_round=5000,
                  early_stopping_rounds=100)
y_pred = model  # …
```
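Note that the excerpt builds lgb_train/lgb_val but then trains on train_data/test_data. For reference, a self-contained version of the same flow with consistent names and synthetic data; passing early_stopping_rounds directly to train() matches the older LightGBM API used in the question (newer releases moved it to callbacks):

```python
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

lgb_train = lgb.Dataset(x_train, y_train)
lgb_val = lgb.Dataset(x_test, y_test, reference=lgb_train)

parameters = {'objective': 'binary', 'metric': 'auc', 'num_leaves': 31,
              'learning_rate': 0.05, 'verbose': 0}

# Validation AUC decides when early stopping halts boosting.
model = lgb.train(parameters, lgb_train, valid_sets=[lgb_val],
                  num_boost_round=5000, early_stopping_rounds=100)
y_pred = model.predict(x_test, num_iteration=model.best_iteration)
print(y_pred[:5])
```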

How to calculate p-values in Spark's Logistic Regression?

Submitted by 梦想与她 on 2019-12-07 15:03:58
Question: We are using LogisticRegressionWithSGD and would like to figure out which of our variables are predictive, and with what significance. Some stats packages (e.g. StatsModels) return p-values for each term; a low p-value (< 0.05) indicates a meaningful addition to the model. How can we get/calculate p-values from a LogisticRegressionWithSGD model? Any help with this is appreciated.

Answer 1: This is a very old question, but some guidance for people coming to it late might be valuable. LogisticRegressionWithSGD is …
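LogisticRegressionWithSGD itself does not report p-values, but pyspark.ml's GeneralizedLinearRegression with a binomial family exposes them in its training summary. A sketch under that substitution; the toy data and column names are invented:

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import GeneralizedLinearRegression
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1.0, 2.3, 0.1), (0.0, 1.1, 0.8), (1.0, 3.0, 0.9),
     (0.0, 2.5, 0.2), (1.0, 0.9, 0.7), (0.0, 0.5, 0.3)],
    ["label", "x1", "x2"],
)
data = VectorAssembler(inputCols=["x1", "x2"], outputCol="features").transform(df)

# Binomial family with logit link = logistic regression, fit by IRLS.
glr = GeneralizedLinearRegression(family="binomial", link="logit",
                                  labelCol="label", featuresCol="features")
model = glr.fit(data)

# pValues lists one entry per coefficient, with the intercept last.
print(model.summary.pValues)
```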

Real-Time streaming prediction in Flink using Scala

Submitted by 允我心安 on 2019-12-07 09:22:41
Question: Flink version: 1.2.0. Scala version: 2.11.8. I want to use a DataStream to make predictions with a model in Flink using Scala. I have a DataStream[String] in Flink which contains JSON-formatted data from a Kafka source. I want to use this datastream to predict with a Flink-ML model which is already trained. The problem is that all the Flink-ML examples use the DataSet API to predict. I am relatively new to Flink and Scala, so any help in the form of a code solution would be appreciated. Input: { …
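A common workaround for this DataSet-only limitation is to extract the trained model's parameters and apply them manually inside the stream's map function. The question is about Scala, but the pattern itself is small; here is a language-neutral sketch in Python over an ordinary iterator, where the weights, field names, and JSON shape are all invented for illustration (in Flink this map would live in a MapFunction applied to the DataStream):

```python
import json
import math

# Pretend these parameters came from an already-trained linear model.
weights = [0.4, -1.2, 0.7]
bias = 0.1

def predict(record: str) -> float:
    """Score one JSON record with the extracted model parameters."""
    fields = json.loads(record)
    z = bias + sum(w * fields[k] for w, k in zip(weights, ["f1", "f2", "f3"]))
    return 1.0 / (1.0 + math.exp(-z))  # logistic score

# Stand-in for the Kafka-backed DataStream[String].
stream = ['{"f1": 1.0, "f2": 0.5, "f3": 2.0}',
          '{"f1": 0.2, "f2": 1.5, "f3": 0.1}']
for rec in stream:
    print(predict(rec))
```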

Append Multiple Excel Files (xlsx) together in Python

Submitted by 蓝咒 on 2019-12-07 06:24:20
Question:

```python
import pandas as pd
import os
import glob

all_data = pd.DataFrame()
for f in glob.glob("output/test*.xlsx"):
    df = pd.read_excel(f)
    all_data = all_data.append(df, ignore_index=True)
```

I want to put multiple xlsx files into one xlsx file. The Excel files are in the output/test folder. The columns are the same in all of them, but I want to concatenate the rows. The above code doesn't seem to work.

Answer 1: Let all_data be a list:

```python
all_data = []
for f in glob.glob("output/test/*.xlsx"):
    all_data.append(pd.read_excel(f))
```

Now, …
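The answer cuts off here, but the natural completion of that list-based approach is a single concat followed by one write; a sketch of the whole flow (the output filename is illustrative):

```python
import glob
import pandas as pd

# Read every workbook in the folder into a list of DataFrames.
all_data = [pd.read_excel(f) for f in glob.glob("output/test/*.xlsx")]

# One concat over the list is much cheaper than appending inside the loop,
# and DataFrame.append was removed entirely in pandas 2.0.
combined = pd.concat(all_data, ignore_index=True)
combined.to_excel("output/combined.xlsx", index=False)
```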