data-science

"RelationRecord object of apyori module" (apriori algorithm in Python)

Submitted by 我只是一个虾纸丫 on 2020-01-01 18:35:02
Question: Excuse my English. I'm trying to recognize properties that come up frequently in a set of data, in order to deduce a categorization, using the apyori package for Python. I'm practicing on a DataFrame of 20,772 transactions, the largest of which has 543 items. I converted this DataFrame into a list:

    liste = df.astype(str).values.tolist()

I then used the apriori function of the apyori library to generate the association rules:

    from apyori import apriori
    rules = apriori
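The snippet above is cut off, but one pitfall is worth flagging: with rows of unequal length, df.astype(str).values.tolist() pads short transactions with the literal string 'nan', which apriori will happily count as an item. Below is a minimal stdlib sketch with made-up transactions (it does not use apyori itself) showing the cleanup and the kind of itemset support counting apriori performs internally:

```python
from collections import Counter
from itertools import combinations

# Hypothetical ragged rows, as df.astype(str).values.tolist() would
# produce them: short transactions are padded with the string 'nan'.
raw = [
    ["milk", "bread", "nan"],
    ["milk", "bread", "butter"],
    ["bread", "nan", "nan"],
    ["milk", "butter", "nan"],
]

# Drop the 'nan' padding so it is not counted as an item.
transactions = [[item for item in row if item != "nan"] for row in raw]

def support_counts(transactions, size):
    """Count how many transactions contain each itemset of a given size."""
    counts = Counter()
    for t in transactions:
        for itemset in combinations(sorted(set(t)), size):
            counts[itemset] += 1
    return counts

pairs = support_counts(transactions, 2)
# Support of an itemset = count / number of transactions.
print(pairs[("bread", "milk")] / len(transactions))  # 0.5
```

On the cleaned list, apyori's RelationRecord results can then be iterated to read off items, support, and the ordered statistics per rule.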

rvest: how to get NA values from html_nodes when creating data tables

Submitted by 守給你的承諾、 on 2019-12-31 02:32:26
Question: I'm trying to make a data table from some information on a website. This is what I've done so far:

    library(rvest)
    url <- 'https://uws-community.symplicity.com/index.php?s=student_group'
    page <- html_session(url)
    name_nodes <- html_nodes(page, ".grpl-name a")
    name_text <- html_text(name_nodes)
    df <- data.frame(matrix(unlist(name_text)), stringsAsFactors = FALSE)
    library(tidyverse)
    df <- df %>% mutate(id = row_number())
    desc_nodes <- html_nodes(page, ".grpl-purpose")
    desc_text <- html_text(desc
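A common cause of this problem is that html_nodes() silently drops groups that lack the child element, so the name and description vectors come back with different lengths and the rows misalign. The usual fix is to select each group container first and then use html_node() (singular) inside it, which yields NA for a missing child. The same idea in a stdlib Python sketch, with hypothetical records standing in for the scraped groups:

```python
# Hypothetical per-group records, as you would get by selecting each
# group container first and then looking up child nodes inside it.
groups = [
    {"name": "Chess Club", "desc": "Weekly games"},
    {"name": "Robotics Society"},              # no description node
    {"name": "Film Group", "desc": "Student film screenings"},
]

# Build aligned columns, inserting None (Python's NA analogue)
# whenever a field is missing, so rows never shift out of step.
rows = [(g["name"], g.get("desc")) for g in groups]
for name, desc in rows:
    print(name, desc)
```

The key design point is per-container extraction: one lookup per group keeps each name paired with its own (possibly missing) description, instead of zipping two independently collected lists of different lengths.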

Why did PCA reduce the performance of Logistic Regression?

Submitted by 牧云@^-^@ on 2019-12-30 07:18:08
Question: I performed logistic regression on a binary classification problem with data of 50,000 × 370 dimensions and got an accuracy of about 90%. But when I did PCA + logistic regression on the data, my accuracy dropped to 10%. I was very shocked to see this result. Can anybody explain what could have gone wrong?

Answer 1: There is no guarantee that PCA will ever help, or not harm, the learning process. In particular, if you use PCA to reduce the number of dimensions, you are removing information from your data, thus everything
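The answer's point can be made concrete with a tiny example. The made-up data below is deliberately axis-aligned, so a one-component "PCA" reduces to keeping the higher-variance axis; real PCA rotates the data first, but the conclusion is the same: the discarded low-variance direction was the only one separating the classes. Stdlib only:

```python
import statistics

# Toy 2-D data: the classes are separated along y (low variance),
# while x (high variance) is pure noise shared by both classes.
class0 = [(-9.0, 0.0), (-3.0, 0.1), (3.0, -0.1), (9.0, 0.05)]
class1 = [(-9.0, 1.0), (-3.0, 0.9), (3.0, 1.1), (9.0, 0.95)]
points = class0 + class1

var_x = statistics.pvariance([p[0] for p in points])
var_y = statistics.pvariance([p[1] for p in points])

# Keeping one component means keeping the higher-variance axis (x here)
# and throwing away y, the only informative direction.
keep = 0 if var_x > var_y else 1
projected0 = [p[keep] for p in class0]
projected1 = [p[keep] for p in class1]

print(var_x > var_y)                              # True: x dominates the variance
print(sorted(projected0) == sorted(projected1))   # True: the classes now overlap exactly
```

Variance is not the same thing as discriminative power, which is why PCA, an unsupervised step, can destroy a supervised model's signal.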

ValueError: Must pass DataFrame with boolean values only

Submitted by *爱你&永不变心* on 2019-12-30 06:25:39
Question: In this data file, the United States is broken up into four regions using the "REGION" column. Create a query that finds the counties that belong to regions 1 or 2, whose name starts with 'Washington', and whose POPESTIMATE2015 was greater than their POPESTIMATE2014. This function should return a 5×2 DataFrame with columns = ['STNAME', 'CTYNAME'] and the same index ID as census_df (sorted ascending by index).

Code:

    def answer_eight():
        counties = census_df[census_df['SUMLEV'] == 50
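The ValueError in the title typically means a whole DataFrame of booleans was passed into df[...] instead of a single boolean Series; in pandas, each of the three conditions should produce one Series and they should be combined with & inside parentheses. The combined filter can be sketched with stdlib Python over hypothetical census rows (column names follow the question; the data is invented):

```python
# Hypothetical census rows (column names follow the question).
rows = [
    {"STNAME": "Wisconsin", "CTYNAME": "Washington County", "REGION": 2,
     "POPESTIMATE2014": 132000, "POPESTIMATE2015": 133000},
    {"STNAME": "Texas", "CTYNAME": "Washington County", "REGION": 3,
     "POPESTIMATE2014": 34000, "POPESTIMATE2015": 35000},
    {"STNAME": "Iowa", "CTYNAME": "Adams County", "REGION": 2,
     "POPESTIMATE2014": 3800, "POPESTIMATE2015": 3700},
]

# One boolean per row, combining all three conditions -- the stdlib
# analogue of (df['REGION'].isin([1, 2]))
#           & (df['CTYNAME'].str.startswith('Washington'))
#           & (df['POPESTIMATE2015'] > df['POPESTIMATE2014']) in pandas.
result = [
    (r["STNAME"], r["CTYNAME"])
    for r in rows
    if r["REGION"] in (1, 2)
    and r["CTYNAME"].startswith("Washington")
    and r["POPESTIMATE2015"] > r["POPESTIMATE2014"]
]
print(result)  # [('Wisconsin', 'Washington County')]
```

Each row yields exactly one boolean, which is the property the pandas indexer requires of its mask as well.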

How to perform feature selection with GridSearchCV in sklearn in Python

Submitted by 这一生的挚爱 on 2019-12-28 06:23:24
Question: I am using recursive feature elimination with cross-validation (RFECV) as a feature selector for a random forest classifier, as follows:

    X = df[[my_features]]  # all my features
    y = df['gold_standard']  # labels
    clf = RandomForestClassifier(random_state=42, class_weight="balanced")
    rfecv = RFECV(estimator=clf, step=1, cv=StratifiedKFold(10), scoring='roc_auc')
    rfecv.fit(X, y)
    print("Optimal number of features : %d" % rfecv.n_features_)
    features = list(X.columns[rfecv.support_])

I am also performing
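For intuition, the elimination loop that RFECV wraps in cross-validation can be sketched in a few lines of plain Python. This is a conceptual sketch with a made-up dataset and a crude importance score standing in for the classifier's feature importances, not sklearn's implementation:

```python
import statistics

# Toy dataset: feature 0 separates the classes, features 1-2 are noise.
X = [
    [0.1, 5.0, -2.0], [0.2, 4.0, 3.0], [0.0, 6.0, 1.0],   # class 0
    [0.9, 5.5, -1.0], [1.0, 4.5, 2.0], [1.1, 5.0, 0.0],   # class 1
]
y = [0, 0, 0, 1, 1, 1]

def feature_score(X, y, j):
    """Crude stand-in for a model's importance: separation of the class
    means relative to the feature's overall spread."""
    a = [row[j] for row, label in zip(X, y) if label == 0]
    b = [row[j] for row, label in zip(X, y) if label == 1]
    spread = statistics.pstdev(a + b) or 1.0
    return abs(statistics.mean(a) - statistics.mean(b)) / spread

def recursive_eliminate(X, y, n_keep):
    """Drop the weakest remaining feature one step at a time, as RFE does."""
    remaining = list(range(len(X[0])))
    while len(remaining) > n_keep:
        worst = min(remaining, key=lambda j: feature_score(X, y, j))
        remaining.remove(worst)
    return remaining

print(recursive_eliminate(X, y, 1))  # [0]
```

In sklearn itself, the usual way to tune hyperparameters alongside this selection is to put the RFECV-wrapped estimator inside a Pipeline and hand that to GridSearchCV, so the selection is refit within each fold rather than once on the full data.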

The simplest way to convert a list of various-length vectors to a data.frame in R

Submitted by 大憨熊 on 2019-12-28 03:04:25
Question: Here I have a list of vectors with different lengths, and I want to get a data.frame. I've seen lots of posts about this on SO (see refs), but none of them are as simple as I expected, even though this is a really common task in data preprocessing. Thank you. Here "simplest" means as.data.frame(aa), if that worked; so a single function from the base package of R would be great. sapply(aa, "length<-", max(lengths(aa))) actually uses four functions. An example is shown below.

Input:

    aa <- list(A=c(1, 3, 4), B=c(3,5,7
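A Python analogue of the padding idea behind sapply(aa, "length<-", max(lengths(aa))): extend every vector with None (the NA counterpart) up to the longest length, then read the result row-wise. Stdlib only; the input mirrors the question's list, with B completed to a plausible four-element vector for illustration:

```python
# Hypothetical input, mirroring the R list
# aa <- list(A = c(1, 3, 4), B = c(3, 5, 7, 9)).
aa = {"A": [1, 3, 4], "B": [3, 5, 7, 9]}

# Pad every column with None (R's NA) up to the longest vector --
# the same idea as sapply(aa, "length<-", max(lengths(aa))).
width = max(len(v) for v in aa.values())
padded = {k: v + [None] * (width - len(v)) for k, v in aa.items()}

# Row-wise view of the resulting rectangular table.
rows = list(zip(*padded.values()))
print(padded)
print(rows)   # [(1, 3), (3, 5), (4, 7), (None, 9)]
```

Rectangularizing first is the essential step in both languages: a data.frame (or any table of rows) requires equal-length columns, so the NA padding is what makes the one-call conversion possible.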

Pandas Groupby using time frequency

Submitted by 痞子三分冷 on 2019-12-25 04:09:04
Question: My question is about a groupby on a pandas DataFrame. A sample dataset looks like this:

    cust_id | date       | category
    A0001   | 20/02/2016 | cat1
    A0001   | 24/02/2016 | cat2
    A0001   | 02/03/2016 | cat3
    A0002   | 03/04/2015 | cat2

Now I want to group by cust_id, then find events that occur within 30 days of each other and compile the list of categories for those. What I have figured out so far is to use pd.Grouper in the following manner:

    df.groupby(['cust_id', pd.Grouper(key='date', freq='30D')])[
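One caveat: pd.Grouper(freq='30D') cuts fixed 30-day bins anchored at the first timestamp, which is not quite the same as "events within 30 days of each other". A gap-based pass, starting a new group whenever a customer has been silent for more than 30 days, is closer to that wording. A stdlib sketch using the sample rows above (dates read as day/month/year):

```python
from datetime import date, timedelta

# Events per customer, matching the sample table in the question.
events = [
    ("A0001", date(2016, 2, 20), "cat1"),
    ("A0001", date(2016, 2, 24), "cat2"),
    ("A0001", date(2016, 3, 2), "cat3"),
    ("A0002", date(2015, 4, 3), "cat2"),
]

def sessionize(events, gap=timedelta(days=30)):
    """Group each customer's events chronologically, starting a new
    group whenever the gap since the previous event exceeds 30 days."""
    groups = {}
    last_seen = {}
    for cust, d, cat in sorted(events, key=lambda e: (e[0], e[1])):
        if cust not in last_seen or d - last_seen[cust] > gap:
            groups.setdefault(cust, []).append([])
        groups[cust][-1].append(cat)
        last_seen[cust] = d
    return groups

print(sessionize(events))
# {'A0001': [['cat1', 'cat2', 'cat3']], 'A0002': [['cat2']]}
```

This chains events: each one must be within 30 days of the previous event in the same group, which matches the question's phrasing better than fixed calendar bins.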

I keep getting an AttributeError in RandomizedSearchCV

Submitted by 痴心易碎 on 2019-12-25 02:13:48
Question:

    x_tu = data_cls_tu.iloc[:, 1:].values
    y_tu = data_cls_tu.iloc[:, 0].values
    classifier = DecisionTreeClassifier()
    parameters = [{"max_depth": [3, None],
                   "min_samples_leaf": np.random.randint(1, 9),
                   "criterion": ["gini", "entropy"]}]
    randomcv = RandomizedSearchCV(estimator=classifier, param_distributions=parameters,
                                  scoring='accuracy', cv=10, n_jobs=-1, random_state=0)
    randomcv.fit(x_tu, y_tu)

    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most
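A plausible cause, offered as a guess from the snippet alone: np.random.randint(1, 9) returns a single int, but param_distributions expects either a list to sample from or a distribution object with an .rvs() method, so the search can end up calling .rvs() on a bare int. That contract can be sketched in stdlib Python (sample_param is a hypothetical stand-in for what the search does per parameter, not sklearn's code):

```python
import random

def sample_param(spec, rng=random):
    """Mimic how a randomized search draws one value per parameter:
    lists are sampled with choice(); anything else must provide .rvs()."""
    if isinstance(spec, list):
        return rng.choice(spec)
    return spec.rvs()   # a bare int lands here and raises AttributeError

params = {"max_depth": [3, None], "criterion": ["gini", "entropy"]}
print({k: sample_param(v) for k, v in params.items()})

# A scalar spec (like the result of np.random.randint(1, 9)) breaks it:
try:
    sample_param(4)
except AttributeError as e:
    print("AttributeError:", e)
```

The fix implied by the contract is to pass a list such as list(range(1, 9)), or a scipy distribution like randint(1, 9) from scipy.stats, rather than one pre-drawn number.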

Is it acceptable to scale target values for regressors?

Submitted by 可紊 on 2019-12-25 01:04:21
Question: I am getting very high RMSE and MAE for MLPRegressor, ForestRegression, and linear regression with only the input variables scaled (targets are 30,000+); however, when I scale the target values as well, I get an RMSE of 0.2. I would like to know whether that is an acceptable thing to do. Secondly, is it normal to have a better R-squared value for test (0.98) than for train (0.85)? Thank you.

Answer 1: It is actually a common practice to scale target values in many cases. For example, a highly skewed target may give better results if it
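One part of the answer can be checked numerically: standardizing the target does not by itself make a model more accurate, it only changes the units the error is reported in. An RMSE of 0.2 on a scaled target must be multiplied back by the target's standard deviation before being compared with the raw-scale RMSE. A stdlib sketch with made-up targets and predictions:

```python
import math
import statistics

# Hypothetical targets and predictions on the original (30,000+) scale.
y_true = [31000.0, 45000.0, 52000.0, 38000.0, 60000.0]
y_pred = [30000.0, 47000.0, 50000.0, 40000.0, 58000.0]

def rmse(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

# Standardize the targets (and the matching predictions) with the
# target mean and standard deviation, as a scaler would.
mu = statistics.mean(y_true)
sigma = statistics.pstdev(y_true)
y_true_s = [(v - mu) / sigma for v in y_true]
y_pred_s = [(v - mu) / sigma for v in y_pred]

# The error did not shrink -- it is just expressed in different units:
print(rmse(y_true, y_pred))                # raw-scale RMSE, ~1843.9
print(rmse(y_true_s, y_pred_s) * sigma)    # identical after un-scaling
```

So the drop from 30,000+ to 0.2 is mostly a change of units; the legitimate reasons to scale targets (skew, optimizer conditioning for neural nets) are separate from how the error happens to be reported.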