data-science

Difference between random forest implementation

Submitted by ぃ、小莉子 on 2019-12-10 12:21:20
Question: Is there a performance difference between the implementation of Random Forest in H2O and the standard Random Forest library? Has anybody performed any analysis of these two implementations?

Answer 1: Here is an open benchmark you can start with: https://github.com/szilard/benchm-ml

Answer 2: I suppose you are looking for this: http://www.wise.io/tech/benchmarking-random-forest-part-1

Source: https://stackoverflow.com/questions/45190787/difference-between-random-forest-implementation
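As a starting point for such a comparison, here is a minimal timing sketch, assuming the h2o and scikit-learn packages are installed; the synthetic dataset, tree counts, and all variable names are illustrative, not from the question or the linked benchmarks:

```python
import time

import h2o
import pandas as pd
from h2o.estimators.random_forest import H2ORandomForestEstimator
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary-classification data shared by both implementations.
X, y = make_classification(n_samples=10000, n_features=20, random_state=0)
cols = [f"f{i}" for i in range(20)]

# scikit-learn random forest.
t0 = time.time()
RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0).fit(X, y)
print(f"sklearn fit: {time.time() - t0:.2f}s")

# H2O random forest on the same data.
h2o.init()
frame = h2o.H2OFrame(pd.DataFrame(X, columns=cols).assign(target=y))
frame["target"] = frame["target"].asfactor()  # binary label as a factor
t0 = time.time()
H2ORandomForestEstimator(ntrees=100, seed=0).train(
    x=cols, y="target", training_frame=frame
)
print(f"h2o fit: {time.time() - t0:.2f}s")
```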

Pandas multiindex dataframe - Selecting max from one index within multiindex

Submitted by 这一生的挚爱 on 2019-12-10 10:09:22
Question: I've got a dataframe with a multi-index of Year and Month, like the following:

```
             Value
Year  Month
1992  1      3
      2      5
      3      8
      ...    ...
1993  1      2
      ...    ...
```

I'm trying to select the maximum Value for each year and put that in a DataFrame like this:

```
       Max
Year
1992   5
1993   2
...
```

There's not much info on multi-indexes; should I simply do a group-by and apply, or something similar, to keep it simple?

Answer 1: Exactly right: df.groupby(level=0).apply(max). In my sample DataFrame: 0 Caps …
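A runnable sketch of that approach on a small made-up MultiIndex frame (the column rename at the end is just to match the asker's desired output):

```python
import pandas as pd

# Build a small DataFrame with a (Year, Month) MultiIndex.
idx = pd.MultiIndex.from_tuples(
    [(1992, 1), (1992, 2), (1992, 3), (1993, 1)], names=["Year", "Month"]
)
df = pd.DataFrame({"Value": [3, 5, 8, 2]}, index=idx)

# Group on the outer level ("Year") and take the max per group.
result = df.groupby(level=0).max()  # same as df.groupby(level="Year").max()
result = result.rename(columns={"Value": "Max"})
print(result)
```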

Combining heuristics when ranking social network news feed items

Submitted by 痞子三分冷 on 2019-12-08 15:44:44
Question: We have a news feed, and we want to surface items to the user based on a number of criteria. Certain items will be surfaced because of factor A, others because of factor B, and yet others because of factor C. We can create an individual heuristic for each factor, but we then need to combine these heuristics in a way that promotes the best content for each factor while still giving a mix of content across factors. Our naive approach is to load the top n from each factor, take …
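One common way to combine per-factor heuristics is a weighted sum over normalized scores; a minimal sketch of that idea (the factor names, weights, items, and scores below are all invented for illustration):

```python
from typing import Dict, List, Tuple

def combine_scores(
    factor_scores: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
) -> List[Tuple[str, float]]:
    """Rank items by a weighted sum of min-max-normalized factor scores."""
    combined: Dict[str, float] = {}
    for factor, scores in factor_scores.items():
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid division by zero for constant scores
        for item, score in scores.items():
            normalized = (score - lo) / span
            combined[item] = combined.get(item, 0.0) + weights[factor] * normalized
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

# Illustrative factors A/B/C scoring three feed items.
ranking = combine_scores(
    {
        "A": {"item1": 0.9, "item2": 0.2, "item3": 0.4},
        "B": {"item1": 0.1, "item2": 0.8, "item3": 0.3},
        "C": {"item1": 0.5, "item2": 0.5, "item3": 0.9},
    },
    weights={"A": 0.5, "B": 0.3, "C": 0.2},
)
print(ranking)
```

Tuning the weights trades off per-factor quality against mix; setting them equal approximates a fair blend of the three lists.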

Improving loop performance with function call inside

Submitted by 元气小坏坏 on 2019-12-08 06:45:22
Question:

```r
library(plyr); library(sqldf); library(data.table)
library(stringi); library(RODBC)

dbhandle <- odbcDriverConnect('driver={SQL Server};server=.;database=TEST_DB;trusted_connection=true')
res <- sqlQuery(dbhandle, 'Select Company_ID, AsOfDate, CashFlow FROM dbo.Accounts')
resdatatable = as.data.table(res)
odbcCloseAll();

# Single-period present-value factor at rate i (in percent) over n periods.
sppv <- function(i, n) {
  return((1 + i / 100) ^ (-n))
}

# Net present value of cash flows x, computed once per rate in i.
npv <- function(x, i) {
  npv = c()
  for (k in 1:length(i)) {
    pvs = x * sppv(i[k], 1:length(x))
    npv = c(npv, sum(pvs))
  }
  # …
```
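The loop-heavy NPV computation vectorizes naturally. As a sketch of the idea in Python/NumPy rather than the asker's R (illustrative only), broadcasting removes both the per-period multiplication loop and the outer loop over rates:

```python
import numpy as np

def npv(cash_flows: np.ndarray, rates_pct: np.ndarray) -> np.ndarray:
    """NPV of one cash-flow series at each rate, in one broadcasted step."""
    periods = np.arange(1, len(cash_flows) + 1)            # 1..n
    # discount[j, t] = (1 + r_j / 100) ** -t, shape (len(rates), n)
    discount = (1 + rates_pct[:, None] / 100.0) ** -periods
    return discount @ cash_flows                           # one NPV per rate

print(npv(np.array([100.0, 200.0, 300.0]), np.array([5.0, 10.0])))
```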

Diagram/Graphical options to display cartesian product in Jupyter Notebook/Python/Matplotlib?

Submitted by 谁说胖子不能爱 on 2019-12-08 05:09:43
Question: I'm working with 49 options (7 rows, 7 columns). Here is an example. I'm observing what people do (position × actions) in public plazas (four in total) for a school project. I will probably show it over a large range of times: each hour from 8 AM to 8 PM, on working days and on weekend days. The main idea is to understand how the plaza is used by people. I notice the most common situations are: standing and talking, sitting and talking, sitting and reading, standing and recreation. But I found some option …
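A 7×7 heatmap is one natural way to display counts over such a position × action product; a minimal matplotlib sketch with made-up labels and random counts (the real study would substitute its own categories and observation data):

```python
import matplotlib.pyplot as plt
import numpy as np

# Illustrative labels and counts standing in for the study's 7 positions
# and 7 actions.
positions = [f"position {i}" for i in range(1, 8)]
actions = [f"action {j}" for j in range(1, 8)]
counts = np.random.default_rng(0).integers(0, 20, size=(7, 7))

fig, ax = plt.subplots()
im = ax.imshow(counts, cmap="viridis")
ax.set_xticks(range(7))
ax.set_xticklabels(actions, rotation=45, ha="right")
ax.set_yticks(range(7))
ax.set_yticklabels(positions)
# Annotate each cell with its count so the grid reads as a table too.
for i in range(7):
    for j in range(7):
        ax.text(j, i, counts[i, j], ha="center", va="center", color="white")
fig.colorbar(im, ax=ax, label="observations")
fig.tight_layout()
plt.show()
```

One such panel per hour (or per plaza) laid out in a grid would then show how usage shifts over the day.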

Get ImageNet label for a specific index in the 1000-dimensional output tensor in torch

Submitted by 守給你的承諾、 on 2019-12-08 01:50:51
Question: I have the output tensor of a forward pass of a Facebook implementation of the ResNet model on a cat image. It is a 1000-dimensional tensor with the classification probabilities. Using torch.topk I can obtain the top-5 probabilities and their indexes in the output tensor. Now I want to see the human-readable labels for those most-probable indexes. I searched online for the list of labels (which apparently are also called synsets) and only found this: http://image-net.org/challenges/LSVRC …
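One way to get an index-to-label mapping without hunting for files is torchvision's weights metadata; a sketch, assuming a recent torchvision (the weights API shown here was added in torchvision 0.13 and is not from the original question, which predates it):

```python
import torch
from torchvision.models import ResNet50_Weights, resnet50

weights = ResNet50_Weights.IMAGENET1K_V1
model = resnet50(weights=weights).eval()

# weights.meta["categories"] is the 1000-entry human-readable label list,
# ordered to match the model's output dimensions.
labels = weights.meta["categories"]

x = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed cat image
probs = model(x).softmax(dim=1)
top = probs.topk(5)
for p, idx in zip(top.values[0], top.indices[0]):
    print(f"{labels[idx]}: {p.item():.4f}")
```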

lightgbm: ValueError: Circular reference detected

Submitted by 不想你离开。 on 2019-12-07 22:02:52
Question: Train the model:

```python
import lightgbm as lgb

lgb_train = lgb.Dataset(x_train, y_train)
lgb_val = lgb.Dataset(x_test, y_test)

parameters = {
    'application': 'binary',
    'objective': 'binary',
    'metric': 'auc',
    'is_unbalance': 'true',
    'boosting': 'gbdt',
    'num_leaves': 31,
    'feature_fraction': 0.5,
    'bagging_fraction': 0.5,
    'bagging_freq': 20,
    'learning_rate': 0.05,
    'verbose': 0
}

model = lgb.train(parameters,
                  train_data,
                  valid_sets=test_data,
                  num_boost_round=5000,
                  early_stopping_rounds=100)
y_pred = model  # …
```
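Note that the excerpt builds lgb_train/lgb_val but then trains on train_data/test_data. For reference, a self-contained version of the same flow with consistent names and synthetic data; passing early_stopping_rounds directly to train() matches the older LightGBM API used in the question (newer releases moved it to callbacks):

```python
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

lgb_train = lgb.Dataset(x_train, y_train)
lgb_val = lgb.Dataset(x_test, y_test, reference=lgb_train)

parameters = {'objective': 'binary', 'metric': 'auc', 'num_leaves': 31,
              'learning_rate': 0.05, 'verbose': 0}

# Validation AUC decides when early stopping halts boosting.
model = lgb.train(parameters, lgb_train, valid_sets=[lgb_val],
                  num_boost_round=5000, early_stopping_rounds=100)
y_pred = model.predict(x_test, num_iteration=model.best_iteration)
print(y_pred[:5])
```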

How to calculate p-values in Spark's Logistic Regression?

Submitted by 梦想与她 on 2019-12-07 15:03:58
Question: We are using LogisticRegressionWithSGD and would like to figure out which of our variables are predictive, and with what significance. Some stats packages (e.g. StatsModels) return p-values for each term; a low p-value (< 0.05) indicates a meaningful addition to the model. How can we get/calculate p-values from a LogisticRegressionWithSGD model? Any help with this is appreciated.

Answer 1: This is a very old question, but some guidance for people coming to it late might be valuable. LogisticRegressionWithSGD is …
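LogisticRegressionWithSGD itself does not report p-values, but pyspark.ml's GeneralizedLinearRegression with a binomial family exposes them in its training summary. A sketch under that substitution; the toy data and column names are invented:

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import GeneralizedLinearRegression
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1.0, 2.3, 0.1), (0.0, 1.1, 0.8), (1.0, 3.0, 0.9),
     (0.0, 2.5, 0.2), (1.0, 0.9, 0.7), (0.0, 0.5, 0.3)],
    ["label", "x1", "x2"],
)
data = VectorAssembler(inputCols=["x1", "x2"], outputCol="features").transform(df)

# Binomial family with logit link = logistic regression, fit by IRLS.
glr = GeneralizedLinearRegression(family="binomial", link="logit",
                                  labelCol="label", featuresCol="features")
model = glr.fit(data)

# pValues lists one entry per coefficient, with the intercept last.
print(model.summary.pValues)
```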

Real-Time streaming prediction in Flink using Scala

Submitted by 允我心安 on 2019-12-07 09:22:41
Question: Flink version: 1.2.0. Scala version: 2.11.8. I want to use a DataStream to make predictions with a model in Flink using Scala. I have a DataStream[String] in Flink which contains JSON-formatted data from a Kafka source. I want to use this datastream to predict with a Flink-ML model which is already trained. The problem is that all the Flink-ML examples use the DataSet API to predict. I am relatively new to Flink and Scala, so any help in the form of a code solution would be appreciated. Input: { …
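A common workaround for this DataSet-only limitation is to extract the trained model's parameters and apply them manually inside the stream's map function. The question is about Scala, but the pattern itself is small; here is a language-neutral sketch in Python over an ordinary iterator, where the weights, field names, and JSON shape are all invented for illustration (in Flink this map would live in a MapFunction applied to the DataStream):

```python
import json
import math

# Pretend these parameters came from an already-trained linear model.
weights = [0.4, -1.2, 0.7]
bias = 0.1

def predict(record: str) -> float:
    """Score one JSON record with the extracted model parameters."""
    fields = json.loads(record)
    z = bias + sum(w * fields[k] for w, k in zip(weights, ["f1", "f2", "f3"]))
    return 1.0 / (1.0 + math.exp(-z))  # logistic score

# Stand-in for the Kafka-backed DataStream[String].
stream = ['{"f1": 1.0, "f2": 0.5, "f3": 2.0}',
          '{"f1": 0.2, "f2": 1.5, "f3": 0.1}']
for rec in stream:
    print(predict(rec))
```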

Append Multiple Excel Files (xlsx) together in Python

Submitted by 蓝咒 on 2019-12-07 06:24:20
Question:

```python
import pandas as pd
import os
import glob

all_data = pd.DataFrame()
for f in glob.glob("output/test*.xlsx"):
    df = pd.read_excel(f)
    all_data = all_data.append(df, ignore_index=True)
```

I want to put multiple xlsx files into one xlsx file. The Excel files are in the output/test folder. The columns are the same in all of them, but I want to concatenate the rows. The above code doesn't seem to work.

Answer 1: Let all_data be a list:

```python
all_data = []
for f in glob.glob("output/test/*.xlsx"):
    all_data.append(pd.read_excel(f))
```

Now, …
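The answer cuts off here, but the natural completion of that list-based approach is a single concat followed by one write; a sketch of the whole flow (the output filename is illustrative):

```python
import glob
import pandas as pd

# Read every workbook in the folder into a list of DataFrames.
all_data = [pd.read_excel(f) for f in glob.glob("output/test/*.xlsx")]

# One concat over the list is much cheaper than appending inside the loop,
# and DataFrame.append was removed entirely in pandas 2.0.
combined = pd.concat(all_data, ignore_index=True)
combined.to_excel("output/combined.xlsx", index=False)
```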