data-science

"RelationRecord object of apyori module" (apriori algorithm in Python)

Submitted by 我只是一个虾纸丫 on 2020-01-01 18:35:02
Question: Excuse my English. I'm trying to recognize properties that come up frequently in a set of data, in order to deduce a categorization, using the apyori package for Python. I'm practicing on a DataFrame of 20,772 transactions, the largest of which has 543 items. I converted this DataFrame into a list:

    liste = df.astype(str).values.tolist()

I then used the apriori function of the apyori library to generate the association rules:

    from apyori import apriori
    rules = apriori
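The snippet above is cut off, but one pitfall is worth flagging: with rows of unequal length, df.astype(str).values.tolist() pads short transactions with the literal string 'nan', which apriori will happily count as an item. Below is a minimal stdlib sketch with made-up transactions (it does not use apyori itself) showing the cleanup and the kind of itemset support counting apriori performs internally:

```python
from collections import Counter
from itertools import combinations

# Hypothetical ragged rows, as df.astype(str).values.tolist() would
# produce them: short transactions are padded with the string 'nan'.
raw = [
    ["milk", "bread", "nan"],
    ["milk", "bread", "butter"],
    ["bread", "nan", "nan"],
    ["milk", "butter", "nan"],
]

# Drop the 'nan' padding so it is not counted as an item.
transactions = [[item for item in row if item != "nan"] for row in raw]

def support_counts(transactions, size):
    """Count how many transactions contain each itemset of a given size."""
    counts = Counter()
    for t in transactions:
        for itemset in combinations(sorted(set(t)), size):
            counts[itemset] += 1
    return counts

pairs = support_counts(transactions, 2)
# Support of an itemset = count / number of transactions.
print(pairs[("bread", "milk")] / len(transactions))  # 0.5
```

On the cleaned list, apyori's RelationRecord results can then be iterated to read off items, support, and the ordered statistics per rule.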

rvest: how to get NA values from html_nodes when creating data tables

Submitted by 守給你的承諾、 on 2019-12-31 02:32:26
Question: I'm trying to make a data table from some information on a website. This is what I've done so far:

    library(rvest)
    url <- 'https://uws-community.symplicity.com/index.php?s=student_group'
    page <- html_session(url)
    name_nodes <- html_nodes(page, ".grpl-name a")
    name_text <- html_text(name_nodes)
    df <- data.frame(matrix(unlist(name_text)), stringsAsFactors = FALSE)
    library(tidyverse)
    df <- df %>% mutate(id = row_number())
    desc_nodes <- html_nodes(page, ".grpl-purpose")
    desc_text <- html_text(desc
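A common cause of this problem is that html_nodes() silently drops groups that lack the child element, so the name and description vectors come back with different lengths and the rows misalign. The usual fix is to select each group container first and then use html_node() (singular) inside it, which yields NA for a missing child. The same idea in a stdlib Python sketch, with hypothetical records standing in for the scraped groups:

```python
# Hypothetical per-group records, as you would get by selecting each
# group container first and then looking up child nodes inside it.
groups = [
    {"name": "Chess Club", "desc": "Weekly games"},
    {"name": "Robotics Society"},              # no description node
    {"name": "Film Group", "desc": "Student film screenings"},
]

# Build aligned columns, inserting None (Python's NA analogue)
# whenever a field is missing, so rows never shift out of step.
rows = [(g["name"], g.get("desc")) for g in groups]
for name, desc in rows:
    print(name, desc)
```

The key design point is per-container extraction: one lookup per group keeps each name paired with its own (possibly missing) description, instead of zipping two independently collected lists of different lengths.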

Why did PCA reduce the performance of Logistic Regression?

Submitted by 牧云@^-^@ on 2019-12-30 07:18:08
Question: I performed logistic regression on a binary classification problem with data of 50,000 × 370 dimensions and got an accuracy of about 90%. But when I did PCA + logistic regression on the data, my accuracy dropped to 10%. I was very shocked to see this result. Can anybody explain what could have gone wrong?

Answer 1: There is no guarantee that PCA will ever help, or not harm, the learning process. In particular, if you use PCA to reduce the number of dimensions, you are removing information from your data, thus everything
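The answer's point can be made concrete with a tiny example. The made-up data below is deliberately axis-aligned, so a one-component "PCA" reduces to keeping the higher-variance axis; real PCA rotates the data first, but the conclusion is the same: the discarded low-variance direction was the only one separating the classes. Stdlib only:

```python
import statistics

# Toy 2-D data: the classes are separated along y (low variance),
# while x (high variance) is pure noise shared by both classes.
class0 = [(-9.0, 0.0), (-3.0, 0.1), (3.0, -0.1), (9.0, 0.05)]
class1 = [(-9.0, 1.0), (-3.0, 0.9), (3.0, 1.1), (9.0, 0.95)]
points = class0 + class1

var_x = statistics.pvariance([p[0] for p in points])
var_y = statistics.pvariance([p[1] for p in points])

# Keeping one component means keeping the higher-variance axis (x here)
# and throwing away y, the only informative direction.
keep = 0 if var_x > var_y else 1
projected0 = [p[keep] for p in class0]
projected1 = [p[keep] for p in class1]

print(var_x > var_y)                              # True: x dominates the variance
print(sorted(projected0) == sorted(projected1))   # True: the classes now overlap exactly
```

Variance is not the same thing as discriminative power, which is why PCA, an unsupervised step, can destroy a supervised model's signal.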

ValueError: Must pass DataFrame with boolean values only

Submitted by *爱你&永不变心* on 2019-12-30 06:25:39
Question: In this data file, the United States is broken up into four regions using the "REGION" column. Create a query that finds the counties that belong to regions 1 or 2, whose name starts with 'Washington', and whose POPESTIMATE2015 was greater than their POPESTIMATE2014. This function should return a 5×2 DataFrame with columns = ['STNAME', 'CTYNAME'] and the same index ID as census_df (sorted ascending by index).

Code:

    def answer_eight():
        counties = census_df[census_df['SUMLEV'] == 50
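The ValueError in the title typically means a whole DataFrame of booleans was passed into df[...] instead of a single boolean Series; in pandas, each of the three conditions should produce one Series and they should be combined with & inside parentheses. The combined filter can be sketched with stdlib Python over hypothetical census rows (column names follow the question; the data is invented):

```python
# Hypothetical census rows (column names follow the question).
rows = [
    {"STNAME": "Wisconsin", "CTYNAME": "Washington County", "REGION": 2,
     "POPESTIMATE2014": 132000, "POPESTIMATE2015": 133000},
    {"STNAME": "Texas", "CTYNAME": "Washington County", "REGION": 3,
     "POPESTIMATE2014": 34000, "POPESTIMATE2015": 35000},
    {"STNAME": "Iowa", "CTYNAME": "Adams County", "REGION": 2,
     "POPESTIMATE2014": 3800, "POPESTIMATE2015": 3700},
]

# One boolean per row, combining all three conditions -- the stdlib
# analogue of (df['REGION'].isin([1, 2]))
#           & (df['CTYNAME'].str.startswith('Washington'))
#           & (df['POPESTIMATE2015'] > df['POPESTIMATE2014']) in pandas.
result = [
    (r["STNAME"], r["CTYNAME"])
    for r in rows
    if r["REGION"] in (1, 2)
    and r["CTYNAME"].startswith("Washington")
    and r["POPESTIMATE2015"] > r["POPESTIMATE2014"]
]
print(result)  # [('Wisconsin', 'Washington County')]
```

Each row yields exactly one boolean, which is the property the pandas indexer requires of its mask as well.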

How to perform feature selection with GridSearchCV in sklearn in Python

Submitted by 这一生的挚爱 on 2019-12-28 06:23:24
Question: I am using recursive feature elimination with cross-validation (RFECV) as a feature selector for a random forest classifier, as follows:

    X = df[[my_features]]  # all my features
    y = df['gold_standard']  # labels
    clf = RandomForestClassifier(random_state=42, class_weight="balanced")
    rfecv = RFECV(estimator=clf, step=1, cv=StratifiedKFold(10), scoring='roc_auc')
    rfecv.fit(X, y)
    print("Optimal number of features : %d" % rfecv.n_features_)
    features = list(X.columns[rfecv.support_])

I am also performing
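For intuition, the elimination loop that RFECV wraps in cross-validation can be sketched in a few lines of plain Python. This is a conceptual sketch with a made-up dataset and a crude importance score standing in for the classifier's feature importances, not sklearn's implementation:

```python
import statistics

# Toy dataset: feature 0 separates the classes, features 1-2 are noise.
X = [
    [0.1, 5.0, -2.0], [0.2, 4.0, 3.0], [0.0, 6.0, 1.0],   # class 0
    [0.9, 5.5, -1.0], [1.0, 4.5, 2.0], [1.1, 5.0, 0.0],   # class 1
]
y = [0, 0, 0, 1, 1, 1]

def feature_score(X, y, j):
    """Crude stand-in for a model's importance: separation of the class
    means relative to the feature's overall spread."""
    a = [row[j] for row, label in zip(X, y) if label == 0]
    b = [row[j] for row, label in zip(X, y) if label == 1]
    spread = statistics.pstdev(a + b) or 1.0
    return abs(statistics.mean(a) - statistics.mean(b)) / spread

def recursive_eliminate(X, y, n_keep):
    """Drop the weakest remaining feature one step at a time, as RFE does."""
    remaining = list(range(len(X[0])))
    while len(remaining) > n_keep:
        worst = min(remaining, key=lambda j: feature_score(X, y, j))
        remaining.remove(worst)
    return remaining

print(recursive_eliminate(X, y, 1))  # [0]
```

In sklearn itself, the usual way to tune hyperparameters alongside this selection is to put the RFECV-wrapped estimator inside a Pipeline and hand that to GridSearchCV, so the selection is refit within each fold rather than once on the full data.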

The simplest way to convert a list of various-length vectors to a data.frame in R

Submitted by 大憨熊 on 2019-12-28 03:04:25
Question: Here I have a list of vectors with different lengths, and I want to get a data.frame. I've seen lots of posts about this on SO (see refs), but none of them are as simple as I expected, even though this is a really common task in data preprocessing. Thank you. Here "simplest" means as.data.frame(aa), if that worked; so a single function from the base package of R would be great. sapply(aa, "length<-", max(lengths(aa))) actually uses four functions. An example is shown below.

Input:

    aa <- list(A=c(1, 3, 4), B=c(3,5,7
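A Python analogue of the padding idea behind sapply(aa, "length<-", max(lengths(aa))): extend every vector with None (the NA counterpart) up to the longest length, then read the result row-wise. Stdlib only; the input mirrors the question's list, with B completed to a plausible four-element vector for illustration:

```python
# Hypothetical input, mirroring the R list
# aa <- list(A = c(1, 3, 4), B = c(3, 5, 7, 9)).
aa = {"A": [1, 3, 4], "B": [3, 5, 7, 9]}

# Pad every column with None (R's NA) up to the longest vector --
# the same idea as sapply(aa, "length<-", max(lengths(aa))).
width = max(len(v) for v in aa.values())
padded = {k: v + [None] * (width - len(v)) for k, v in aa.items()}

# Row-wise view of the resulting rectangular table.
rows = list(zip(*padded.values()))
print(padded)
print(rows)   # [(1, 3), (3, 5), (4, 7), (None, 9)]
```

Rectangularizing first is the essential step in both languages: a data.frame (or any table of rows) requires equal-length columns, so the NA padding is what makes the one-call conversion possible.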

Pandas Groupby using time frequency

Submitted by 痞子三分冷 on 2019-12-25 04:09:04
Question: My question is about a groupby on a pandas DataFrame. A sample dataset looks like this:

    cust_id | date       | category
    A0001   | 20/02/2016 | cat1
    A0001   | 24/02/2016 | cat2
    A0001   | 02/03/2016 | cat3
    A0002   | 03/04/2015 | cat2

Now I want to group by cust_id, then find events that occur within 30 days of each other and compile the list of categories for those. What I have figured out so far is to use pd.Grouper in the following manner:

    df.groupby(['cust_id', pd.Grouper(key='date', freq='30D')])[
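One caveat: pd.Grouper(freq='30D') cuts fixed 30-day bins anchored at the first timestamp, which is not quite the same as "events within 30 days of each other". A gap-based pass, starting a new group whenever a customer has been silent for more than 30 days, is closer to that wording. A stdlib sketch using the sample rows above (dates read as day/month/year):

```python
from datetime import date, timedelta

# Events per customer, matching the sample table in the question.
events = [
    ("A0001", date(2016, 2, 20), "cat1"),
    ("A0001", date(2016, 2, 24), "cat2"),
    ("A0001", date(2016, 3, 2), "cat3"),
    ("A0002", date(2015, 4, 3), "cat2"),
]

def sessionize(events, gap=timedelta(days=30)):
    """Group each customer's events chronologically, starting a new
    group whenever the gap since the previous event exceeds 30 days."""
    groups = {}
    last_seen = {}
    for cust, d, cat in sorted(events, key=lambda e: (e[0], e[1])):
        if cust not in last_seen or d - last_seen[cust] > gap:
            groups.setdefault(cust, []).append([])
        groups[cust][-1].append(cat)
        last_seen[cust] = d
    return groups

print(sessionize(events))
# {'A0001': [['cat1', 'cat2', 'cat3']], 'A0002': [['cat2']]}
```

This chains events: each one must be within 30 days of the previous event in the same group, which matches the question's phrasing better than fixed calendar bins.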

I keep getting an AttributeError in RandomizedSearchCV

Submitted by 痴心易碎 on 2019-12-25 02:13:48
Question:

    x_tu = data_cls_tu.iloc[:, 1:].values
    y_tu = data_cls_tu.iloc[:, 0].values
    classifier = DecisionTreeClassifier()
    parameters = [{"max_depth": [3, None],
                   "min_samples_leaf": np.random.randint(1, 9),
                   "criterion": ["gini", "entropy"]}]
    randomcv = RandomizedSearchCV(estimator=classifier, param_distributions=parameters,
                                  scoring='accuracy', cv=10, n_jobs=-1, random_state=0)
    randomcv.fit(x_tu, y_tu)

    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most
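A plausible cause, offered as a guess from the snippet alone: np.random.randint(1, 9) returns a single int, but param_distributions expects either a list to sample from or a distribution object with an .rvs() method, so the search can end up calling .rvs() on a bare int. That contract can be sketched in stdlib Python (sample_param is a hypothetical stand-in for what the search does per parameter, not sklearn's code):

```python
import random

def sample_param(spec, rng=random):
    """Mimic how a randomized search draws one value per parameter:
    lists are sampled with choice(); anything else must provide .rvs()."""
    if isinstance(spec, list):
        return rng.choice(spec)
    return spec.rvs()   # a bare int lands here and raises AttributeError

params = {"max_depth": [3, None], "criterion": ["gini", "entropy"]}
print({k: sample_param(v) for k, v in params.items()})

# A scalar spec (like the result of np.random.randint(1, 9)) breaks it:
try:
    sample_param(4)
except AttributeError as e:
    print("AttributeError:", e)
```

The fix implied by the contract is to pass a list such as list(range(1, 9)), or a scipy distribution like randint(1, 9) from scipy.stats, rather than one pre-drawn number.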

Is it acceptable to scale target values for regressors?

Submitted by 可紊 on 2019-12-25 01:04:21
Question: I am getting very high RMSE and MAE for MLPRegressor, ForestRegression, and linear regression with only the input variables scaled (targets are 30,000+); however, when I scale the target values as well, I get an RMSE of 0.2. I would like to know whether that is an acceptable thing to do. Secondly, is it normal to have a better R-squared value for test (0.98) than for train (0.85)? Thank you.

Answer 1: It is actually a common practice to scale target values in many cases. For example, a highly skewed target may give better results if it
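One part of the answer can be checked numerically: standardizing the target does not by itself make a model more accurate, it only changes the units the error is reported in. An RMSE of 0.2 on a scaled target must be multiplied back by the target's standard deviation before being compared with the raw-scale RMSE. A stdlib sketch with made-up targets and predictions:

```python
import math
import statistics

# Hypothetical targets and predictions on the original (30,000+) scale.
y_true = [31000.0, 45000.0, 52000.0, 38000.0, 60000.0]
y_pred = [30000.0, 47000.0, 50000.0, 40000.0, 58000.0]

def rmse(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

# Standardize the targets (and the matching predictions) with the
# target mean and standard deviation, as a scaler would.
mu = statistics.mean(y_true)
sigma = statistics.pstdev(y_true)
y_true_s = [(v - mu) / sigma for v in y_true]
y_pred_s = [(v - mu) / sigma for v in y_pred]

# The error did not shrink -- it is just expressed in different units:
print(rmse(y_true, y_pred))                # raw-scale RMSE, ~1843.9
print(rmse(y_true_s, y_pred_s) * sigma)    # identical after un-scaling
```

So the drop from 30,000+ to 0.2 is mostly a change of units; the legitimate reasons to scale targets (skew, optimizer conditioning for neural nets) are separate from how the error happens to be reported.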