data-analysis

How do you stack two Pandas Dataframe columns on top of each other?

久未见 submitted on 2019-12-08 00:49:45
Question: Is there a library function or a correct way of stacking two Pandas data frame columns on top of each other? For example, turning 4 columns into 2:

    a1  b1  a2  b2
     1   2   3   4
     5   6   7   8

to

    c  d
    1  2
    5  6
    3  4
    7  8

The Pandas DataFrame documentation I have read mostly deals with concatenating rows and row manipulation, but I'm sure there has to be a way to do what I described, and I am sure it's very simple. Any help would be great.

Answer 1: You can select the first two and second two columns
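The answer above is cut off; as a minimal sketch of one way to do this with pandas.concat, assuming the column names from the example (a1/b1 stacked on top of a2/b2 into c/d):

    import pandas as pd

    df = pd.DataFrame({'a1': [1, 5], 'b1': [2, 6],
                       'a2': [3, 7], 'b2': [4, 8]})

    # Rename each pair to the common target names, then stack them row-wise.
    top = df[['a1', 'b1']].rename(columns={'a1': 'c', 'b1': 'd'})
    bottom = df[['a2', 'b2']].rename(columns={'a2': 'c', 'b2': 'd'})
    stacked = pd.concat([top, bottom], ignore_index=True)
    print(stacked)
    #    c  d
    # 0  1  2
    # 1  5  6
    # 2  3  4
    # 3  7  8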

How to make Python decision tree more understandable?

╄→尐↘猪︶ㄣ submitted on 2019-12-08 00:03:57
Question: I have a data file. The last column of the data holds +1 and -1 as the distinguishing variable. I also have the id names of each column in a separate file. For example:

    1 2 3 4  1
    5 6 7 8  1
    9 1 2 3 -1
    4 5 6 7 -1
    8 9 1 2 -1

and the columns are named Q1, Q2, Q3, Q4, Q5 respectively. I want to implement a decision tree classifier, so I wrote the following code:

    import numpy
    from sklearn import tree

    print('Reading data from ' + fileName)
    data = numpy.loadtxt(fileName)
    print('Getting ids from ', idFile)
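The code above is cut off. To make the learned tree itself more understandable, here is a minimal sketch using the sample rows shown above and sklearn's export_text; the Q1..Q4 feature names and the depth limit are assumptions made for illustration:

    import numpy as np
    from sklearn import tree

    # Sample data from the question: four feature columns plus a +1/-1 label.
    data = np.array([
        [1, 2, 3, 4,  1],
        [5, 6, 7, 8,  1],
        [9, 1, 2, 3, -1],
        [4, 5, 6, 7, -1],
        [8, 9, 1, 2, -1],
    ])
    X, y = data[:, :-1], data[:, -1]
    feature_names = ['Q1', 'Q2', 'Q3', 'Q4']

    # A shallow tree is easier to read; max_depth is purely a readability choice.
    clf = tree.DecisionTreeClassifier(max_depth=3)
    clf.fit(X, y)

    # Plain-text rendering of the learned rules using the real column names.
    print(tree.export_text(clf, feature_names=feature_names))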

Analyze Data Frames In A List And Bind The Results

◇◆丶佛笑我妖孽 submitted on 2019-12-07 17:24:43
Question: The data I have is a list of data frames. I want to loop through each data frame to check whether it has columns with duplicate column names. If it does, I want to merge those columns using rbind() into a parent data frame called output and remove all other columns of such data frames. I also want to check whether any data frame has no duplicate columns; if so, remove all of its columns except the first one, then cbind() it with output such that if rows are more or less than what

Trend analysis using iterative value increments

坚强是说给别人听的谎言 submitted on 2019-12-07 07:29:06
Question: We have configured iReport to generate the following graph: The real data points are in blue, the trend line is green. The problems include: too many data points for the trend line, and a trend line that does not follow a Bezier curve (spline). The source of the problem is the incrementer class. The incrementer is provided with the data points iteratively; there does not appear to be a way to get the whole set of data at once. The code that calculates the trend line looks as follows:

    import java.math.BigDecimal;

plotting a timeseries graph in python using matplotlib from a csv file

梦想的初衷 submitted on 2019-12-06 16:36:23
I have some csv data in the following format:

    Ln  Dr  Tag  Lab  0:01  0:02  0:03  0:04  0:05  0:06  0:07  0:08  0:09
    L0  St  vT   4R   0     0     0     0     0     0     0     0     0
    L2  Tx  st   4R   8     8     8     8     8     8     8     8     8
    L2  Tx  ss   4R   1     1     9     6     1     0     0     6     7

I want to plot a time-series graph using the columns (Ln, Dr, Tag, Lab) as the keys and the 0:0n fields as the values. I have the following code:

    #!/usr/bin/env python
    import matplotlib.pyplot as plt
    import datetime
    import numpy as np
    import csv
    import sys

    with open("test.csv", 'r', newline='') as fin:
        reader = csv.DictReader(fin)
        for row in reader:
            key = (row['Ln'], row['Dr'], row[
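The code above is cut off; as a minimal sketch of one way to draw such a plot, assuming a whitespace-separated test.csv laid out exactly as shown (one line per key, the 0:01 ... 0:09 columns holding the values):

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("test.csv", sep=r"\s+")
    key_cols = ["Ln", "Dr", "Tag", "Lab"]
    time_cols = [c for c in df.columns if c not in key_cols]

    fig, ax = plt.subplots()
    for _, row in df.iterrows():
        # Build one legend label per row, e.g. "L2-Tx-ss-4R".
        label = "-".join(str(row[c]) for c in key_cols)
        ax.plot(time_cols, row[time_cols].astype(float), marker="o", label=label)

    ax.set_xlabel("time")
    ax.set_ylabel("value")
    ax.legend()
    plt.show()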

Rolling comparison between a value and a past window, with percentile/quantile

自作多情 submitted on 2019-12-06 15:35:47
Question: I'd like to compare each value x of an array with a rolling window of the n previous values. More precisely, I'd like to see at which percentile this new value x would sit if we added it to the previous window:

    import numpy as np

    A = np.array([1, 4, 9, 28, 28.5, 2, 283, 3.2, 7, 15])
    print A
    n = 4  # window width
    for i in range(len(A)-n):
        W = A[i:i+n]
        x = A[i+n]
        q = sum(W <= x) * 1.0 / n
        print 'Value:', x, ' Window before this value:', W, ' Quantile:', q

    [   1.    4.    9.   28.   28.5   2.  283.    3.2   7.   15. ]
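As a minimal vectorized sketch of the same computation (assuming the A and n above), using a pandas rolling window of width n+1 so that each value is ranked against the n values that precede it:

    import numpy as np
    import pandas as pd

    A = np.array([1, 4, 9, 28, 28.5, 2, 283, 3.2, 7, 15])
    n = 4
    s = pd.Series(A)

    def quantile_vs_prev(w):
        # w holds the n previous values followed by the current value.
        return np.mean(w[:-1] <= w[-1])

    q = s.rolling(n + 1).apply(quantile_vs_prev, raw=True)
    print(q.dropna().values)   # same quantiles as the explicit loop above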

[Statsmodels]: How can I get statsmodel to return the pvalue of an OLS object?

馋奶兔 submitted on 2019-12-06 14:03:52
Question: I'm quite new to programming and I'm jumping into Python to get some familiarity with data analysis and machine learning. I am following a tutorial on backward elimination for a multiple linear regression. Here is the code right now:

    # Importing the libraries
    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd

    # Importing the dataset
    dataset = pd.read_csv('50_Startups.csv')
    X = dataset.iloc[:, :-1].values
    y = dataset.iloc[:, 4].values

    # Taking care of missing data
    #np.set
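The code above is cut off, but the part the title asks about, reading p-values off a fitted statsmodels OLS result, can be sketched as follows; the synthetic X and y here are a stand-in for the 50_Startups data, which is not reproduced in the question:

    import numpy as np
    import statsmodels.api as sm

    # Synthetic stand-in data (hypothetical values: 50 rows, 3 predictors).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))
    y = X @ np.array([1.5, 0.0, -2.0]) + rng.normal(size=50)

    X_opt = sm.add_constant(X)          # add the intercept column
    results = sm.OLS(y, X_opt).fit()    # ordinary least squares fit

    print(results.summary())            # full regression table, including P>|t|
    print(results.pvalues)              # p-value of each coefficient as an array
    print(results.pvalues.argmax())     # predictor with the largest p-value, the
                                        # usual candidate for backward elimination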

Tensorflow gradient and hessian evaluation

本小妞迷上赌 submitted on 2019-12-06 11:37:17
I have found a problem in the evaluation of the TensorFlow r1.2 gradient and hessian functions. In particular, I take it for granted that a gradient is evaluated numerically at the current values of the defined variables, probing the response of the placeholder-fed function. However, when I try to evaluate the hessian function (and thus gradients) before and after training the model, I always get the same results (probably related to how the placeholders are fed). I use the following function:

    def eval_Consts(sess):
        a_v_fin, a_s_fin, a_C_fin, a_a_fin, a_p_fin, loss_fin = sess.run([a
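The snippet above is cut off. As a minimal TF 1.x sketch (with hypothetical variable names) of the expected behaviour: tf.gradients and tf.hessians only build graph nodes, and their numerical values come from sess.run at the current variable values, so they should change after training:

    import numpy as np
    import tensorflow as tf   # TF 1.x graph mode, matching the r1.2 question

    x = tf.placeholder(tf.float32, shape=[None])
    w = tf.Variable([2.0, -1.0])                          # hypothetical parameters
    loss = tf.reduce_mean((w[0] * x + w[1] - 1.0) ** 2)

    grad = tf.gradients(loss, [w])[0]
    hess = tf.hessians(loss, [w])[0]
    train = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

    feed = {x: np.array([1.0, 2.0, 3.0], dtype=np.float32)}
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        print(sess.run([grad, hess], feed))   # values at the initial w
        for _ in range(100):
            sess.run(train, feed)
        print(sess.run([grad, hess], feed))   # the gradient changes after training;
                                              # this toy loss is quadratic, so its
                                              # Hessian happens to stay constant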

Row Aggregation after Cross Join in BigQuery

末鹿安然 submitted on 2019-12-06 11:12:31
Question: Say you have the following table in BigQuery:

    A =
    user1 | 0 0 |
    user2 | 0 3 |
    user3 | 4 0 |

After a cross join, you have

    dist =
    | user1 user2   0 0 , 0 3 |   # the comma just shows the separation between the users' values
    | user1 user3   0 0 , 4 0 |
    | user2 user3   0 3 , 4 0 |

How can you perform row aggregation in BigQuery to compute a pairwise aggregation across rows? As a typical use case, you could compute the Euclidean distance between the two users. I want to compute the following metric between the two users: sum(min

R - Calculate difference (similarity measure) between similar datasets

时间秒杀一切 submitted on 2019-12-06 06:08:41
I have seen many questions that touch on this topic but haven't yet found an answer. If I have missed a question that does answer this one, please mark this as a duplicate and point us to that question. Scenario: We have a benchmark dataset and several imputation methods; we systematically delete values from the benchmark and apply two different imputation methods. Thus we have the benchmark, imputedData1 and imputedData2. Question: Is there a function that can produce a number representing the difference between the benchmark and imputedData1 and/or the difference between the benchmark and imputedData2?
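The question is asked in an R context, but the underlying idea is language-agnostic. As a minimal sketch (in Python, with made-up toy frames) of one common measure, RMSE between the benchmark and an imputed copy over the cells that were deleted and then imputed:

    import numpy as np
    import pandas as pd

    # Toy stand-ins: benchmark values, one imputed copy, and a mask of the
    # cells that were deleted and then imputed (all hypothetical).
    benchmark = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})
    imputedData1 = pd.DataFrame({"a": [1.0, 2.1, 3.0], "b": [4.0, 5.0, 5.4]})
    deleted = pd.DataFrame({"a": [False, True, False], "b": [False, False, True]})

    def rmse(bench, imputed, mask):
        # Compare only the cells that were actually imputed.
        diff = (bench - imputed).values[mask.values]
        return float(np.sqrt(np.mean(diff ** 2)))

    # A smaller value means the imputation is closer to the benchmark.
    print(rmse(benchmark, imputedData1, deleted))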