data-science

Best way to subset a pandas dataframe [closed]

◇◆丶佛笑我妖孽 · submitted on 2019-12-04 11:40:35
Question: I'm new to Pandas and I just came across df.query(). Why would people use df.query() when you can filter your DataFrames directly with bracket notation? The official pandas tutorial also seems to prefer the latter approach. With bracket notation:

    df[df['age'] <= 21]
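A minimal sketch of the two notations side by side (toy DataFrame invented for illustration); they produce the same result, and query() mainly pays off in long method chains where repeating the DataFrame variable gets noisy:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Ann', 'Bob', 'Cid'], 'age': [19, 25, 21]})

# Bracket (boolean-mask) notation: build a mask, then index with it
young_brackets = df[df['age'] <= 21]

# query(): the same filter written as a string expression
young_query = df.query('age <= 21')

assert young_brackets.equals(young_query)
```

Both keep only Ann and Cid here; which to prefer is largely the stylistic question the post asks about.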

sklearn partial_fit() not showing accurate results as fit()

梦想的初衷 · submitted on 2019-12-04 05:03:21
Question: I am training on 3 lists of data: L1, L2, L3. First I train on all of them with SGDClassifier's fit(), and later instance by instance with partial_fit(). Then I test with L4 and L5. (The data in the lists is image data, and the L4, L5 images are the same as L2.) The predictions with fit() are correct and are what I expect from partial_fit() as well. However, the output of the code below shows that the two behave differently, even with 10,000 iterations of partial_fit(). Output: fit [1] #Tested L1.
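The usual reason for the mismatch is that fit() runs many shuffled epochs internally, while each partial_fit() call is a single pass, so the caller must loop (and pass the full label set on the first call). A sketch on synthetic data (invented here, not the poster's image lists):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
X = rng.randn(200, 10)
y = (X[:, 0] > 0).astype(int)   # linearly separable toy labels

# fit() performs up to max_iter full passes with shuffling
clf_fit = SGDClassifier(max_iter=1000, tol=1e-3, random_state=0).fit(X, y)

# partial_fit() does ONE pass per call: loop yourself, and give the
# complete set of classes on the first call
clf_partial = SGDClassifier(random_state=0)
for _ in range(1000):
    clf_partial.partial_fit(X, y, classes=np.array([0, 1]))

# with enough passes over the same data the two converge to similar
# (not bit-identical) models; exact equality is not guaranteed
```

So "10,000 iterations" only matches fit() if each iteration really sweeps the same training data, in a comparable order, with a comparable learning-rate schedule.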

Is my python implementation of the Davies-Bouldin Index correct?

☆樱花仙子☆ · submitted on 2019-12-03 17:03:43
I'm trying to calculate the Davies-Bouldin index in Python. Here are the 5 steps the code below tries to reproduce:

1. For each cluster, compute the Euclidean distance from each point to the centroid.
2. For each cluster, compute the average of these distances.
3. For each pair of clusters, compute the Euclidean distance between their centroids.
4. For each pair of clusters, take the sum of the average distances to their respective centroids (computed at step 2) and divide it by the distance separating them (computed at step 3).
5. Finally, compute the mean of all these divisions (= all indexes) to
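For reference, the textbook definition differs from step 5 above in one detail: for each cluster you take the worst (maximum) ratio against any other cluster and then average those maxima, rather than averaging over all pairs. A self-contained sketch (toy data invented here; scikit-learn also ships `sklearn.metrics.davies_bouldin_score` as a cross-check):

```python
import numpy as np

def davies_bouldin(X, labels):
    clusters = np.unique(labels)
    centroids = np.array([X[labels == k].mean(axis=0) for k in clusters])
    # steps 1-2: mean distance of each point to its own centroid
    s = np.array([np.linalg.norm(X[labels == k] - c, axis=1).mean()
                  for k, c in zip(clusters, centroids)])
    n = len(clusters)
    db = 0.0
    for i in range(n):
        # steps 3-4: (s_i + s_j) / distance between centroids, then
        # take the MAX over j (the standard definition), not the mean
        db += max((s[i] + s[j]) / np.linalg.norm(centroids[i] - centroids[j])
                  for j in range(n) if j != i)
    return db / n

# toy check: two clusters with spread 1 each, centroids 10 apart,
# so the index is (1 + 1) / 10 = 0.2
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
score = davies_bouldin(X, labels)
```

Averaging over all pairs instead (as step 5 states) gives a different, usually smaller, number than the standard index.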

pandas reset_index after groupby.value_counts()

杀马特。学长 韩版系。学妹 · submitted on 2019-12-03 08:31:53
Question: I am trying to group by one column and compute value counts on another column.

    import pandas as pd
    dftest = pd.DataFrame({'A': [1,1,1,1,1,1,1,1,1,2,2,2,2,2],
                           'Amt': [20,20,20,30,30,30,30,40,40,10,10,40,40,40]})
    print(dftest)

dftest looks like:

        A  Amt
    0   1   20
    1   1   20
    2   1   20
    3   1   30
    4   1   30
    5   1   30
    6   1   30
    7   1   40
    8   1   40
    9   2   10
    10  2   10
    11  2   40
    12  2   40
    13  2   40

Perform the grouping:

    grouper = dftest.groupby('A')
    df_grouped = grouper['Amt'].value_counts()

which gives:

    A  Amt
    1  30     4
       20     3
       40     2
    2  40     3
       10     2
    Name: Amt,
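A sketch of flattening those grouped counts back into ordinary columns. On some pandas versions a bare reset_index() raises "cannot insert Amt, already exists" because the Series name clashes with the 'Amt' index level; renaming the counts first sidesteps the collision:

```python
import pandas as pd

dftest = pd.DataFrame({'A': [1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                       'Amt': [20, 20, 20, 30, 30, 30, 30, 40, 40,
                               10, 10, 40, 40, 40]})

counts = dftest.groupby('A')['Amt'].value_counts()

# rename the Series so its name no longer collides with the 'Amt'
# index level, then reset_index() turns the MultiIndex into columns
flat = counts.rename('count').reset_index()
```

flat now has the columns A, Amt, count, with one row per (group, value) pair.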

Best way to subset a pandas dataframe [closed]

核能气质少年 · submitted on 2019-12-03 07:09:17
I'm new to Pandas and I just came across df.query(). Why would people use df.query() when you can filter your DataFrames directly with bracket notation? The official pandas tutorial also seems to prefer the latter approach.

With bracket notation:

    df[df['age'] <= 21]

With the pandas query method:

    df.query('age <= 21')

Besides some of the stylistic or flexibility differences

How to tell which Keras model is better?

◇◆丶佛笑我妖孽 · submitted on 2019-12-03 00:39:46
Question: I don't understand which accuracy in the output to use to compare my two Keras models to see which one is better. Do I use "acc" (from the training data) or "val_acc" (from the validation data)? There are different acc and val_acc values for each epoch. How do I know the acc or val_acc for my model as a whole? Do I average all of the epochs' acc or val_acc values to find the acc or val_acc of the model as a whole?

Model 1 output:

    Train on 970 samples, validate on 243 samples
    Epoch 1/20 0s
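The rule of thumb usually given: compare models on validation accuracy (val_acc), not training accuracy, and take the best (or final) epoch rather than an average, since early epochs just reflect an undertrained model. A sketch with made-up history dicts shaped like `History.history` (older Keras names the key 'val_acc', newer versions 'val_accuracy'):

```python
# Hypothetical per-epoch accuracies for two models (invented numbers)
history_model1 = {'acc': [0.70, 0.85, 0.92], 'val_acc': [0.68, 0.80, 0.79]}
history_model2 = {'acc': [0.65, 0.78, 0.88], 'val_acc': [0.70, 0.82, 0.84]}

def best_val_acc(history):
    # best validation accuracy across epochs; averaging would punish
    # a model merely for having weak early epochs
    return max(history['val_acc'])

better = max([history_model1, history_model2], key=best_val_acc)
```

Here model 1 wins on training accuracy but model 2 generalizes better, which is what val_acc is for.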

pandas reset_index after groupby.value_counts()

匆匆过客 · submitted on 2019-12-02 23:54:31
I am trying to group by one column and compute value counts on another column.

    import pandas as pd
    dftest = pd.DataFrame({'A': [1,1,1,1,1,1,1,1,1,2,2,2,2,2],
                           'Amt': [20,20,20,30,30,30,30,40,40,10,10,40,40,40]})
    print(dftest)

dftest looks like:

        A  Amt
    0   1   20
    1   1   20
    2   1   20
    3   1   30
    4   1   30
    5   1   30
    6   1   30
    7   1   40
    8   1   40
    9   2   10
    10  2   10
    11  2   40
    12  2   40
    13  2   40

Perform the grouping:

    grouper = dftest.groupby('A')
    df_grouped = grouper['Amt'].value_counts()

which gives:

    A  Amt
    1  30     4
       20     3
       40     2
    2  40     3
       10     2
    Name: Amt, dtype: int64

What I want is to keep the top two rows of each group. Also, I was perplexed by an error when I
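For the "top two rows of each group" part, one sketch: value_counts already sorts each group in descending order, so grouping the counts by the outer index level and taking head(2) keeps exactly the two most frequent values per group:

```python
import pandas as pd

dftest = pd.DataFrame({'A': [1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                       'Amt': [20, 20, 20, 30, 30, 30, 30, 40, 40,
                               10, 10, 40, 40, 40]})

counts = dftest.groupby('A')['Amt'].value_counts()

# each group is already sorted by count descending, so the first two
# rows per outer index level are the top two
top2 = counts.groupby(level=0).head(2)
```

For the data above this keeps (A=1, Amt=30) and (A=1, Amt=20), plus (A=2, Amt=40) and (A=2, Amt=10).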

Plotting decision boundary for High Dimension Data

限于喜欢 · submitted on 2019-12-02 22:56:43
I am building a model for a binary classification problem where each of my data points has 300 dimensions (I am using 300 features). I am using a PassiveAggressiveClassifier from sklearn. The model is performing really well. I wish to plot the decision boundary of the model. How can I do so? To get a sense of the data, I am plotting it in 2D using t-SNE. I reduced the dimensionality of the data in two steps: from 300 to 50, then from 50 to 2 (this is a common recommendation). Below is the code snippet:

    from sklearn.manifold import TSNE
    from sklearn.decomposition import TruncatedSVD
    X
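One caveat worth sketching: t-SNE has no transform() for new points, so the 300-dimensional boundary cannot be projected into the 2D plot directly. A common workaround is to fit a second classifier on the 2D embedded points and draw that classifier's boundary on a mesh grid; this is an approximation for visualization only, not the true 300-D boundary. The 2D blob data below is a stand-in for the t-SNE output:

```python
import numpy as np
from sklearn.linear_model import PassiveAggressiveClassifier

rng = np.random.RandomState(0)
X2d = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 4])  # stand-in for t-SNE output
y = np.array([0] * 50 + [1] * 50)

# fit a classifier in the 2D space purely to visualize a boundary there
clf2d = PassiveAggressiveClassifier(random_state=0).fit(X2d, y)

# evaluate on a mesh grid; plt.contourf(xx, yy, Z) over a scatter of
# X2d would then show the approximate boundary
xx, yy = np.meshgrid(np.linspace(X2d[:, 0].min() - 1, X2d[:, 0].max() + 1, 200),
                     np.linspace(X2d[:, 1].min() - 1, X2d[:, 1].max() + 1, 200))
Z = clf2d.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
```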

How to manually change the tick labels of the margin plots on a Seaborn jointplot

末鹿安然 · submitted on 2019-12-02 13:11:37
I am trying to use a log scale for the margin plots of my seaborn jointplot. I am using set_xticks() and set_yticks(), but my changes do not appear. Here is my code and the resulting graph:

    import matplotlib.pyplot as plt
    %matplotlib inline
    import numpy as np
    import seaborn as sns
    import pandas as pd

    tips = sns.load_dataset('tips')
    female_waiters = tips[tips['sex']=='Female']

    def graph_joint_histograms(df1):
        g = sns.jointplot(x='total_bill', y='tip', data=tips, space=0.3, ratio=3)
        g.ax_joint.cla()
        g.ax_marg_x.cla()
        g.ax_marg_y.cla()
        for xlabel_i in g.ax_marg_x.get_xticklabels():
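The jointplot margins (g.ax_marg_x, g.ax_marg_y) are ordinary matplotlib Axes, so the tick mechanics can be sketched with matplotlib alone (toy data; the same calls would go on the margin axes). The key point: cla() wipes any ticks set earlier, so scale and ticks must be applied after clearing and re-plotting:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend, just for this sketch
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()
ax.set_xticks([1, 10, 100])   # wiped by the cla() below, so it "does not appear"
ax.cla()
ax.hist(np.random.lognormal(size=200), bins=30)
ax.set_xscale('log')          # log-scale margin: set AFTER plotting
ax.set_xticks([1, 10, 100])
```

In the jointplot code above, moving the set_xscale/set_xticks calls to after the cla() and plotting calls should make the changes stick.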

How to transform a key/value string into distinct rows?

蹲街弑〆低调 · submitted on 2019-12-02 06:06:19
Question: I have an R dataset with key/value strings that looks like this:

    quest <- data.frame(
      city = c("Atlanta", "New York", "Atlanta", "Tampa"),
      key_value = c("rev=63;qty=1;zip=45987",
                    "rev=10.60|34;qty=1|2;zip=12686|12694",
                    "rev=12;qty=1;zip=74268",
                    "rev=3|24|8;qty=1|6|3;zip=33684|36842|30254"))

which translates to:

          city                                  key_value
    1  Atlanta                     rev=63;qty=1;zip=45987
    2 New York       rev=10.60|34;qty=1|2;zip=12686|12694
    3  Atlanta                     rev=12;qty=1;zip=74268
    4    Tampa rev=3|24|8;qty=1|6|3;zip=33684|36842|30254

Based on the
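The excerpt cuts off before the desired output, but the transformation being asked for (';' separates keys, '|' separates parallel values, one output row per '|' slot) can be sketched. The question is in R; the sketch below uses pandas instead, like the other examples on this page, with a hypothetical parse helper:

```python
import pandas as pd

quest = pd.DataFrame({
    'city': ['Atlanta', 'New York', 'Atlanta', 'Tampa'],
    'key_value': ['rev=63;qty=1;zip=45987',
                  'rev=10.60|34;qty=1|2;zip=12686|12694',
                  'rev=12;qty=1;zip=74268',
                  'rev=3|24|8;qty=1|6|3;zip=33684|36842|30254'],
})

def parse(kv):
    # 'rev=10.60|34;qty=1|2;...' -> {'rev': ['10.60', '34'], 'qty': ['1', '2'], ...}
    return {k: v.split('|') for k, v in
            (pair.split('=') for pair in kv.split(';'))}

records = []
for _, row in quest.iterrows():
    parsed = parse(row['key_value'])
    for i in range(len(parsed['rev'])):          # one output row per '|' slot
        records.append({'city': row['city'],
                        **{k: v[i] for k, v in parsed.items()}})

tidy = pd.DataFrame(records)   # 7 rows: 1 + 2 + 1 + 3 per input city
```

In R itself, the same reshaping is typically done by splitting on ';' and '|' and stacking, e.g. with tidyr's separate_rows.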