data-analysis

How do you stack two Pandas Dataframe columns on top of each other?

久未见 submitted on 2019-12-08 00:49:45
Question: Is there a library function or a correct way of stacking two Pandas data frame columns on top of each other? For example, turning 4 columns into 2:

    a1  b1  a2  b2
     1   2   3   4
     5   6   7   8

to

    c  d
    1  2
    5  6
    3  4
    7  8

The Pandas DataFrame documentation I have read mostly deals with concatenating rows and row manipulation, but I'm sure there has to be a way to do what I described, and I am sure it's very simple. Any help would be great.

Answer 1: You can select the first two and second two columns
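The answer above is cut off; as a minimal sketch of one way to do this with pandas.concat, assuming the column names from the example (a1/b1 stacked on top of a2/b2 into c/d):

    import pandas as pd

    df = pd.DataFrame({'a1': [1, 5], 'b1': [2, 6],
                       'a2': [3, 7], 'b2': [4, 8]})

    # Rename each pair to the common target names, then stack them row-wise.
    top = df[['a1', 'b1']].rename(columns={'a1': 'c', 'b1': 'd'})
    bottom = df[['a2', 'b2']].rename(columns={'a2': 'c', 'b2': 'd'})
    stacked = pd.concat([top, bottom], ignore_index=True)
    print(stacked)
    #    c  d
    # 0  1  2
    # 1  5  6
    # 2  3  4
    # 3  7  8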

How to make Python decision tree more understandable?

╄→尐↘猪︶ㄣ submitted on 2019-12-08 00:03:57
Question: I have a data file. The last column of the data holds +1 and -1 as the distinguishing variable. I also have the id names of each column in a separate file. For example:

    1 2 3 4  1
    5 6 7 8  1
    9 1 2 3 -1
    4 5 6 7 -1
    8 9 1 2 -1

and the columns are named Q1, Q2, Q3, Q4, Q5 respectively. I want to implement a decision tree classifier, so I wrote the following code:

    import numpy
    from sklearn import tree

    print('Reading data from ' + fileName)
    data = numpy.loadtxt(fileName)
    print('Getting ids from ', idFile)
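The code above is cut off. To make the learned tree itself more understandable, here is a minimal sketch using the sample rows shown above and sklearn's export_text; the Q1..Q4 feature names and the depth limit are assumptions made for illustration:

    import numpy as np
    from sklearn import tree

    # Sample data from the question: four feature columns plus a +1/-1 label.
    data = np.array([
        [1, 2, 3, 4,  1],
        [5, 6, 7, 8,  1],
        [9, 1, 2, 3, -1],
        [4, 5, 6, 7, -1],
        [8, 9, 1, 2, -1],
    ])
    X, y = data[:, :-1], data[:, -1]
    feature_names = ['Q1', 'Q2', 'Q3', 'Q4']

    # A shallow tree is easier to read; max_depth is purely a readability choice.
    clf = tree.DecisionTreeClassifier(max_depth=3)
    clf.fit(X, y)

    # Plain-text rendering of the learned rules using the real column names.
    print(tree.export_text(clf, feature_names=feature_names))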

Analyze Data Frames In A List And Bind The Results

◇◆丶佛笑我妖孽 submitted on 2019-12-07 17:24:43
Question: The data I have is a list of data frames. I want to loop through each data frame to check whether it has columns with duplicate column names. If it does, I want to merge those columns using rbind() into a parent data frame called output and remove all other columns of such data frames. I also want to check whether any data frame has no duplicate columns; if so, remove all of its columns except the first one, then cbind() it with output such that if rows are more or less than what

Trend analysis using iterative value increments

坚强是说给别人听的谎言 submitted on 2019-12-07 07:29:06
Question: We have configured iReport to generate the following graph: The real data points are in blue, the trend line is green. The problems include: too many data points for the trend line, and a trend line that does not follow a Bezier curve (spline). The source of the problem is the incrementer class. The incrementer is provided with the data points iteratively; there does not appear to be a way to get the whole set of data at once. The code that calculates the trend line looks as follows:

    import java.math.BigDecimal;

plotting a timeseries graph in python using matplotlib from a csv file

梦想的初衷 submitted on 2019-12-06 16:36:23
I have some csv data in the following format:

    Ln  Dr  Tag  Lab  0:01  0:02  0:03  0:04  0:05  0:06  0:07  0:08  0:09
    L0  St  vT   4R   0     0     0     0     0     0     0     0     0
    L2  Tx  st   4R   8     8     8     8     8     8     8     8     8
    L2  Tx  ss   4R   1     1     9     6     1     0     0     6     7

I want to plot a time-series graph using the columns (Ln, Dr, Tag, Lab) as the keys and the 0:0n fields as the values. I have the following code:

    #!/usr/bin/env python
    import matplotlib.pyplot as plt
    import datetime
    import numpy as np
    import csv
    import sys

    with open("test.csv", 'r', newline='') as fin:
        reader = csv.DictReader(fin)
        for row in reader:
            key = (row['Ln'], row['Dr'], row[
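The code above is cut off; as a minimal sketch of one way to draw such a plot, assuming a whitespace-separated test.csv laid out exactly as shown (one line per key, the 0:01 ... 0:09 columns holding the values):

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("test.csv", sep=r"\s+")
    key_cols = ["Ln", "Dr", "Tag", "Lab"]
    time_cols = [c for c in df.columns if c not in key_cols]

    fig, ax = plt.subplots()
    for _, row in df.iterrows():
        # Build one legend label per row, e.g. "L2-Tx-ss-4R".
        label = "-".join(str(row[c]) for c in key_cols)
        ax.plot(time_cols, row[time_cols].astype(float), marker="o", label=label)

    ax.set_xlabel("time")
    ax.set_ylabel("value")
    ax.legend()
    plt.show()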

Rolling comparison between a value and a past window, with percentile/quantile

自作多情 submitted on 2019-12-06 15:35:47
Question: I'd like to compare each value x of an array with a rolling window of the n previous values. More precisely, I'd like to see at which percentile this new value x would sit if we added it to the previous window:

    import numpy as np

    A = np.array([1, 4, 9, 28, 28.5, 2, 283, 3.2, 7, 15])
    print A
    n = 4  # window width
    for i in range(len(A)-n):
        W = A[i:i+n]
        x = A[i+n]
        q = sum(W <= x) * 1.0 / n
        print 'Value:', x, ' Window before this value:', W, ' Quantile:', q

    [   1.    4.    9.   28.   28.5   2.  283.    3.2   7.   15. ]
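As a minimal vectorized sketch of the same computation (assuming the A and n above), using a pandas rolling window of width n+1 so that each value is ranked against the n values that precede it:

    import numpy as np
    import pandas as pd

    A = np.array([1, 4, 9, 28, 28.5, 2, 283, 3.2, 7, 15])
    n = 4
    s = pd.Series(A)

    def quantile_vs_prev(w):
        # w holds the n previous values followed by the current value.
        return np.mean(w[:-1] <= w[-1])

    q = s.rolling(n + 1).apply(quantile_vs_prev, raw=True)
    print(q.dropna().values)   # same quantiles as the explicit loop above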

[Statsmodels]: How can I get statsmodel to return the pvalue of an OLS object?

馋奶兔 submitted on 2019-12-06 14:03:52
Question: I'm quite new to programming and I'm jumping into Python to get some familiarity with data analysis and machine learning. I am following a tutorial on backward elimination for a multiple linear regression. Here is the code right now:

    # Importing the libraries
    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd

    # Importing the dataset
    dataset = pd.read_csv('50_Startups.csv')
    X = dataset.iloc[:, :-1].values
    y = dataset.iloc[:, 4].values

    # Taking care of missing data
    #np.set
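The code above is cut off, but the part the title asks about, reading p-values off a fitted statsmodels OLS result, can be sketched as follows; the synthetic X and y here are a stand-in for the 50_Startups data, which is not reproduced in the question:

    import numpy as np
    import statsmodels.api as sm

    # Synthetic stand-in data (hypothetical values: 50 rows, 3 predictors).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))
    y = X @ np.array([1.5, 0.0, -2.0]) + rng.normal(size=50)

    X_opt = sm.add_constant(X)          # add the intercept column
    results = sm.OLS(y, X_opt).fit()    # ordinary least squares fit

    print(results.summary())            # full regression table, including P>|t|
    print(results.pvalues)              # p-value of each coefficient as an array
    print(results.pvalues.argmax())     # predictor with the largest p-value, the
                                        # usual candidate for backward elimination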

Tensorflow gradient and hessian evaluation

本小妞迷上赌 submitted on 2019-12-06 11:37:17
I have found a problem in the evaluation of the TensorFlow r1.2 gradient and hessian functions. In particular, I take it for granted that a gradient is evaluated numerically at the current values of the defined variables, probing the response of the placeholder-fed function. However, when I try to evaluate the hessian function (and thus gradients) before and after training the model, I always get the same results (probably related to how the placeholders are fed). I use the following function:

    def eval_Consts(sess):
        a_v_fin, a_s_fin, a_C_fin, a_a_fin, a_p_fin, loss_fin = sess.run([a
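The snippet above is cut off. As a minimal TF 1.x sketch (with hypothetical variable names) of the expected behaviour: tf.gradients and tf.hessians only build graph nodes, and their numerical values come from sess.run at the current variable values, so they should change after training:

    import numpy as np
    import tensorflow as tf   # TF 1.x graph mode, matching the r1.2 question

    x = tf.placeholder(tf.float32, shape=[None])
    w = tf.Variable([2.0, -1.0])                          # hypothetical parameters
    loss = tf.reduce_mean((w[0] * x + w[1] - 1.0) ** 2)

    grad = tf.gradients(loss, [w])[0]
    hess = tf.hessians(loss, [w])[0]
    train = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

    feed = {x: np.array([1.0, 2.0, 3.0], dtype=np.float32)}
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        print(sess.run([grad, hess], feed))   # values at the initial w
        for _ in range(100):
            sess.run(train, feed)
        print(sess.run([grad, hess], feed))   # the gradient changes after training;
                                              # this toy loss is quadratic, so its
                                              # Hessian happens to stay constant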

Row Aggregation after Cross Join in BigQuery

末鹿安然 submitted on 2019-12-06 11:12:31
Question: Say you have the following table in BigQuery:

    A =
    user1 | 0 0 |
    user2 | 0 3 |
    user3 | 4 0 |

After a cross join, you have

    dist =
    | user1 user2   0 0 , 0 3 |   # the comma just shows the separation between the users' values
    | user1 user3   0 0 , 4 0 |
    | user2 user3   0 3 , 4 0 |

How can you perform row aggregation in BigQuery to compute a pairwise aggregation across rows? As a typical use case, you could compute the Euclidean distance between the two users. I want to compute the following metric between the two users: sum(min

R - Calculate difference (similarity measure) between similar datasets

时间秒杀一切 submitted on 2019-12-06 06:08:41
I have seen many questions that touch on this topic but haven't yet found an answer. If I have missed a question that does answer this one, please mark this as a duplicate and point us to that question. Scenario: We have a benchmark dataset and several imputation methods; we systematically delete values from the benchmark and apply two different imputation methods. Thus we have the benchmark, imputedData1 and imputedData2. Question: Is there a function that can produce a number representing the difference between the benchmark and imputedData1 and/or the difference between the benchmark and imputedData2?
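The question is asked in an R context, but the underlying idea is language-agnostic. As a minimal sketch (in Python, with made-up toy frames) of one common measure, RMSE between the benchmark and an imputed copy over the cells that were deleted and then imputed:

    import numpy as np
    import pandas as pd

    # Toy stand-ins: benchmark values, one imputed copy, and a mask of the
    # cells that were deleted and then imputed (all hypothetical).
    benchmark = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})
    imputedData1 = pd.DataFrame({"a": [1.0, 2.1, 3.0], "b": [4.0, 5.0, 5.4]})
    deleted = pd.DataFrame({"a": [False, True, False], "b": [False, False, True]})

    def rmse(bench, imputed, mask):
        # Compare only the cells that were actually imputed.
        diff = (bench - imputed).values[mask.values]
        return float(np.sqrt(np.mean(diff ** 2)))

    # A smaller value means the imputation is closer to the benchmark.
    print(rmse(benchmark, imputedData1, deleted))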