data-science

Python Pandas — Forward filling entire rows with value of one previous column

﹥>﹥吖頭↗ posted on 2019-12-22 08:24:25
Question: I'm new to pandas development. How do I forward-fill a DataFrame with the value contained in one previously seen column? Self-contained example:

import pandas as pd
import numpy as np

O = [1, np.nan, 5, np.nan]
H = [5, np.nan, 5, np.nan]
L = [1, np.nan, 2, np.nan]
C = [5, np.nan, 2, np.nan]
timestamps = ["2017-07-23 03:13:00", "2017-07-23 03:14:00",
              "2017-07-23 03:15:00", "2017-07-23 03:16:00"]
dict = {'Open': O, 'High': H, 'Low': L, 'Close': C}
df = pd.DataFrame(index=timestamps, data=dict)
ohlc
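A minimal sketch of one way to approach this (not necessarily the accepted answer), assuming the goal is to fill each all-NaN row with the previous row's Close value in every column:

import pandas as pd
import numpy as np

O = [1, np.nan, 5, np.nan]
H = [5, np.nan, 5, np.nan]
L = [1, np.nan, 2, np.nan]
C = [5, np.nan, 2, np.nan]
timestamps = ["2017-07-23 03:13:00", "2017-07-23 03:14:00",
              "2017-07-23 03:15:00", "2017-07-23 03:16:00"]
df = pd.DataFrame(index=timestamps,
                  data={'Open': O, 'High': H, 'Low': L, 'Close': C})

# Fill every NaN with the last previously seen Close value. fillna with a
# Series aligns on the index, so each column borrows the forward-filled
# Close at the same timestamp.
prev_close = df['Close'].ffill()
filled = df.apply(lambda col: col.fillna(prev_close))
print(filled)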

How do I add limiting conditions when using GPyOpt?

孤人 posted on 2019-12-21 21:29:09
Question: Currently I am trying to minimize a function and get the optimized parameters using GPyOpt.

import GPy
import GPyOpt
from math import log

def f(x):
    x0,x1,x2,x3,x4,x5 = x[:,0],x[:,1],x[:,2],x[:,3],x[:,4],x[:,5]
    f0 = 0.2 * log(x0)
    f1 = 0.3 * log(x1)
    f2 = 0.4 * log(x2)
    f3 = 0.2 * log(x3)
    f4 = 0.5 * log(x4)
    f5 = 0.2 * log(x5)
    return -(f0 + f1 + f2 + f3 + f4 + f5)

bounds = [
    {'name': 'x0', 'type': 'discrete', 'domain': (1,1000000)},
    {'name': 'x1', 'type': 'discrete', 'domain': (1,1000000)},
    {'name': 'x2'
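For context, GPyOpt accepts a constraints argument next to the domain, where each constraint is a string expression that must evaluate to <= 0 at feasible points. The sketch below illustrates the idea on a trimmed two-variable version of the problem; the constraint name and budget value are made up for the example, and the exact semantics are worth checking against your installed GPyOpt version.

import GPyOpt
import numpy as np

def f(x):
    # Toy objective over two variables, mirroring the -sum(w * log(x)) shape above.
    return -0.2 * np.sum(np.log(x), axis=1, keepdims=True)

bounds = [
    {'name': 'x0', 'type': 'continuous', 'domain': (1, 1000000)},
    {'name': 'x1', 'type': 'continuous', 'domain': (1, 1000000)},
]

# Example limiting condition: x0 + x1 <= 1000000 (feasible when the expression <= 0).
constraints = [
    {'name': 'budget', 'constraint': 'x[:,0] + x[:,1] - 1000000'},
]

opt = GPyOpt.methods.BayesianOptimization(f, domain=bounds, constraints=constraints)
opt.run_optimization(max_iter=20)
print(opt.x_opt, opt.fx_opt)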

Is my Python implementation of the Davies-Bouldin Index correct?

给你一囗甜甜゛ posted on 2019-12-21 02:45:37
Question: I'm trying to calculate the Davies-Bouldin Index in Python. Here are the 5 steps the code below tries to reproduce:

1. For each cluster, compute the Euclidean distance from each point to the centroid.
2. For each cluster, compute the average of these distances.
3. For each pair of clusters, compute the Euclidean distance between their centroids.
4. Then, for each pair of clusters, take the sum of the average distances to their respective centroids (computed at step 2) and divide it by the distance
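For reference, a compact sketch of those steps, assuming X is an (n_samples, n_features) array and labels assigns each row to a cluster; recent scikit-learn also ships sklearn.metrics.davies_bouldin_score to compare against.

import numpy as np

def davies_bouldin(X, labels):
    clusters = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in clusters])
    # Steps 1-2: average distance of each cluster's points to its own centroid.
    scatter = np.array([
        np.mean(np.linalg.norm(X[labels == c] - centroids[i], axis=1))
        for i, c in enumerate(clusters)
    ])
    k = len(clusters)
    db = 0.0
    for i in range(k):
        # Steps 3-5: for each other cluster, (scatter_i + scatter_j) / centroid distance,
        # then keep the worst-case ratio for cluster i.
        ratios = [
            (scatter[i] + scatter[j]) / np.linalg.norm(centroids[i] - centroids[j])
            for j in range(k) if j != i
        ]
        db += max(ratios)
    return db / k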

How to manually change the tick labels of the margin plots on a Seaborn jointplot

怎甘沉沦 posted on 2019-12-20 06:43:28
Question: I am trying to use a log scale on the margin plots of my seaborn jointplot. I am using set_xticks() and set_yticks(), but my changes do not appear. Here is my code below and the resulting graph:

import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import seaborn as sns
import pandas as pd

tips = sns.load_dataset('tips')
female_waiters = tips[tips['sex']=='Female']

def graph_joint_histograms(df1):
    g = sns.jointplot(x = 'total_bill', y = 'tip', data = tips, space = 0.3, ratio = 3
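A hedged sketch of one usual route (not taken from an accepted answer): jointplot returns a JointGrid whose marginal axes are exposed as ax_marg_x and ax_marg_y, and their tick labels are hidden by default, which is why plain set_xticks()/set_yticks() calls can appear to have no effect.

import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset('tips')
g = sns.jointplot(x='total_bill', y='tip', data=tips, space=0.3, ratio=3)

# The top and right marginal plots are regular matplotlib Axes.
g.ax_marg_x.set_yscale('log')    # counts of the top histogram
g.ax_marg_y.set_xscale('log')    # counts of the right histogram

# Marginal tick labels are hidden by default, so re-enable them.
g.ax_marg_x.yaxis.set_tick_params(labelleft=True)
g.ax_marg_y.xaxis.set_tick_params(labelbottom=True)
plt.show()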

How to get a normalised slope of a trend

天涯浪子 posted on 2019-12-20 05:40:30
Question: I am analysing the distances of users to userx over 6 weeks in a social network. Note: 'No path' means the two users are not connected yet (at least by friends of friends).

        week1    week2    week3    week4    week5  week6
user1   No path  No path  No path  No path  3      1
user2   No path  No path  No path  5        3      1
user3   5        4        4        4        4      3
userN   ...

I want to see how well the users connect with userx. For that I initially thought of using the value of the regression slope for the interpretation (i.e. the lower the regression slope, the
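A hedged sketch of one way to get a comparable ("normalised") slope per user: fit a least-squares line to the weekly distances, ignoring 'No path' weeks, and divide the slope by the user's mean distance so users on different scales can be compared. Dividing by the mean is just one normalisation choice for illustration, not the accepted answer.

import numpy as np
import pandas as pd

data = {
    'user1': [np.nan, np.nan, np.nan, np.nan, 3, 1],
    'user2': [np.nan, np.nan, np.nan, 5, 3, 1],
    'user3': [5, 4, 4, 4, 4, 3],
}
df = pd.DataFrame(data, index=[f'week{i}' for i in range(1, 7)]).T

def normalised_slope(row):
    y = row.dropna().to_numpy(dtype=float)   # drop 'No path' (NaN) weeks
    if len(y) < 2:
        return np.nan
    x = np.arange(len(y))
    slope = np.polyfit(x, y, 1)[0]            # least-squares slope over observed weeks
    return slope / y.mean()                   # scale-free: slope relative to the average distance

print(df.apply(normalised_slope, axis=1))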

InvalidArgumentError: Expected dimension in the range [-1, 1) but got 1

柔情痞子 posted on 2019-12-19 17:27:10
Question: I'm not sure what this error means. The error occurs when I try to calculate acc:

acc = accuracy.eval(feed_dict = {x: batch_images, y: batch_labels, keep_prob: 1.0})

I've tried looking up solutions, but I couldn't find any online. Any ideas on what's causing my error? Here's a link to my full code.

Answer 1: I had a similar error, but the problem for me was that I was trying to use argmax on a 1-dimensional vector. The shape of my label was (50,) and I was trying to do a tf.argmax(y,1) on that
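A minimal reproduction of the failure mode the answer describes, written against current TensorFlow; the original question uses the TF1 session API, but the axis rule is the same: asking tf.argmax for axis 1 on a rank-1 tensor is what produces "Expected dimension in the range [-1, 1) but got 1".

import tensorflow as tf

y = tf.constant([0, 2, 1, 2])               # rank-1 labels, e.g. shape (50,) in the question
# tf.argmax(y, 1)                           # would fail: axis 1 does not exist on a rank-1 tensor
y_one_hot = tf.one_hot(y, depth=3)          # rank 2: shape (4, 3)
labels = tf.argmax(y_one_hot, axis=1)       # fine: reduces over the class axis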

LightGBM - Python API vs Scikit-learn API

隐身守侯 posted on 2019-12-19 04:02:16
Question: I was trying to apply LightGBM to one of my problems. For that I was going through "http://lightgbm.readthedocs.io/en/latest/Python-API.html". However, I have a basic question: is there any difference between the Training API and the Scikit-learn API? Can we use both APIs to achieve the same result on the same problem? Thanks, Dipanjan.

Answer 1: The short answer: yes, they will provide identical results if you configure them in identical ways. The reason is that the sklearn API is just a wrapper around the
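A small sketch contrasting the two interfaces on the same data; with matching parameters they should train the same underlying booster, up to any defaults that differ between the two entry points. The data here is random and purely illustrative.

import lightgbm as lgb
import numpy as np

X = np.random.rand(200, 5)
y = np.random.rand(200)
params = {'objective': 'regression', 'learning_rate': 0.1, 'num_leaves': 31}

# Native training API
booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=50)

# Scikit-learn wrapper with the same settings
model = lgb.LGBMRegressor(learning_rate=0.1, num_leaves=31, n_estimators=50)
model.fit(X, y)

print(booster.predict(X)[:3])
print(model.predict(X)[:3])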

Plotting a decision boundary for high-dimensional data

£可爱£侵袭症+ posted on 2019-12-18 11:55:17
Question: I am building a model for a binary classification problem where each of my data points has 300 dimensions (I am using 300 features). I am using a PassiveAggressiveClassifier from sklearn. The model is performing really well. I wish to plot the decision boundary of the model. How can I do so? To get a sense of the data, I am plotting it in 2D using TSNE. I reduced the dimensions of the data in 2 steps - from 300 to 50, then from 50 to 2 (this is a common recommendation). Below is the code
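A hedged sketch of one common workaround (not necessarily the accepted answer): embed the data in 2D, fit a classifier in the embedded space, and draw its decision regions with a mesh grid. The boundary is then only an approximation of the original 300-dimensional model, and the data below is synthetic for illustration.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.linear_model import PassiveAggressiveClassifier

X = np.random.rand(500, 300)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

X50 = PCA(n_components=50).fit_transform(X)        # 300 -> 50
X2 = TSNE(n_components=2).fit_transform(X50)       # 50 -> 2

clf2d = PassiveAggressiveClassifier().fit(X2, y)   # surrogate model trained in 2D

# Evaluate the surrogate on a grid covering the embedded points.
xx, yy = np.meshgrid(np.linspace(X2[:, 0].min(), X2[:, 0].max(), 300),
                     np.linspace(X2[:, 1].min(), X2[:, 1].max(), 300))
Z = clf2d.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X2[:, 0], X2[:, 1], c=y, s=10)
plt.show()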

GridSearchCV - XGBoost - Early Stopping

我的梦境 posted on 2019-12-18 10:37:14
Question: I am trying to do a hyperparameter search using scikit-learn's GridSearchCV on XGBoost. During the grid search I'd like it to stop early, since that reduces search time drastically and (I expect) gives better results on my prediction/regression task. I am using XGBoost via its Scikit-Learn API.

model = xgb.XGBRegressor()
GridSearchCV(model, paramGrid, verbose=verbose,
             fit_params={'early_stopping_rounds': 42},
             cv=TimeSeriesSplit(n_splits=cv).get_n_splits([trainX, trainY]),
             n_jobs=n_jobs, iid=iid)
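A hedged sketch of the usual shape of this setup: early stopping needs an eval_set to monitor, and on current scikit-learn the fit parameters are passed to GridSearchCV.fit(...) rather than the constructor. Depending on your XGBoost version, early_stopping_rounds may instead need to be set on the estimator constructor. The data, grid, and eval split below are made up for illustration.

import numpy as np
import xgboost as xgb
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit, train_test_split

X = np.random.rand(300, 10)
y = np.random.rand(300)
trainX, evalX, trainY, evalY = train_test_split(X, y, test_size=0.2, shuffle=False)

paramGrid = {'max_depth': [3, 5]}
fit_params = {'early_stopping_rounds': 42,     # needs an eval_set to monitor
              'eval_set': [(evalX, evalY)],
              'verbose': False}

search = GridSearchCV(xgb.XGBRegressor(n_estimators=500), paramGrid,
                      cv=TimeSeriesSplit(n_splits=3))
search.fit(trainX, trainY, **fit_params)       # fit params are forwarded to XGBRegressor.fit
print(search.best_params_)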

Spark MLlib Decision Trees: Probability of labels by features?

Deadly posted on 2019-12-18 07:07:29
Question: I could manage to display the total probabilities of my labels; for example, after displaying my decision tree, I have a table:

Total Predictions :
65% impressions
30% clicks
5% conversions

But my issue is to find the probabilities (or the counts) by feature (by node), for example:

if feature1 > 5
    if feature2 < 10
        Predict Impressions
        samples : 30 Impressions
    else feature2 >= 10
        Predict Clicks
        samples : 5 Clicks

Scikit does it automatically; I am trying to find a way to do it with Spark.

Answer 1: Note:
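A hedged PySpark sketch of one possible workaround (not the accepted answer): print the learned splits with toDebugString, then run the training data back through the model and inspect rawPrediction, which for Spark's decision tree classifier is, to my understanding, the vector of per-class training-instance counts at the leaf a row falls into. That behaviour is worth verifying for your Spark version; the tiny DataFrame below is made up for illustration.

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(6.0, 8.0, 0.0), (7.0, 12.0, 1.0), (2.0, 3.0, 2.0), (6.5, 9.0, 0.0)],
    ['feature1', 'feature2', 'label'])

assembler = VectorAssembler(inputCols=['feature1', 'feature2'], outputCol='features')
data = assembler.transform(df)

model = DecisionTreeClassifier(labelCol='label', featuresCol='features').fit(data)
print(model.toDebugString)   # the split rules, like the if/else example above

# rawPrediction shows the per-leaf class counts for each row.
predictions = model.transform(data)
predictions.select('features', 'rawPrediction', 'prediction').show(truncate=False)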