data-analysis

Find Lines in a cloud of points

Submitted by 家住魔仙堡 on 2019-12-06 05:04:09
Question: I have an array of points. I KNOW that these points represent many lines in my page. How can I find them? Do I need to find the spacing between clouds of points? Thanks, Jonathan

Answer 1: Maybe the Hough transform is what you are looking for? Or a linear regression? [EDIT] As it turns out, the problem is to identify lines inside a list of 2D coordinates, so I would proceed this way: a linear regression can only be used to make the best linear fit to a single set of points, not to detect many lines.
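For reference, a minimal sketch of the Hough-transform route, assuming scikit-image is installed; the point cloud, image size and rasterisation scale below are all invented for illustration:

```python
import numpy as np
from skimage.transform import hough_line, hough_line_peaks

# Two noisy lines as a stand-in point cloud (coordinates are made up).
rng = np.random.default_rng(0)
x = rng.uniform(5, 95, 100)
points = np.vstack([
    np.column_stack([x, 0.5 * x + 10 + rng.normal(0, 1, 100)]),   # y = 0.5x + 10
    np.column_stack([x, -0.5 * x + 80 + rng.normal(0, 1, 100)]),  # y = -0.5x + 80
])

# Rasterise the points into a binary image so the Hough transform can vote.
img = np.zeros((100, 100), dtype=bool)
rows = np.clip(points[:, 1].round().astype(int), 0, 99)
cols = np.clip(points[:, 0].round().astype(int), 0, 99)
img[rows, cols] = True

# Peaks in the (angle, distance) accumulator correspond to detected lines.
accum, angles, dists = hough_line(img)
for votes, angle, dist in zip(*hough_line_peaks(accum, angles, dists, num_peaks=2)):
    print("rho = %.1f, theta = %.1f deg, votes = %d" % (dist, np.degrees(angle), votes))
```

Each peak gives a line in normal form rho = x*cos(theta) + y*sin(theta); strong peaks in the accumulator are the lines present in the cloud.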

Removing case-insensitive duplicates and summing their counts in a pandas DataFrame in Python

Submitted by 半腔热情 on 2019-12-06 04:35:51
I have a df:

Name   Count
Ram    1
ram    2
raM    1
Arjun  3
arjun  4

My desired output df:

Name   Count
Ram    4
Arjun  7

I tried groupby but I cannot achieve the desired output; please help.

Answer: Use agg on the values of Name converted to lowercase, taking 'first' for Name and 'sum' for Count:

df = (df.groupby(df['Name'].str.lower(), as_index=False, sort=False)
        .agg({'Name':'first', 'Count':'sum'}))
print (df)
    Name  Count
0    Ram      4
1  Arjun      7

Detail:

print (df['Name'].str.lower())
0      ram
1      ram
2      ram
3    arjun
4    arjun
Name: Name, dtype: object

Another approach:

In [71]: df.assign(Name=df['Name'].str.capitalize()).groupby('Name', as_index=False).sum()
Out[71]:
   Name  Count
0
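For reference, a self-contained sketch of the same idea (a slight variation of the answer's code: the group key is dropped via reset_index rather than as_index=False):

```python
import pandas as pd

# Rebuild the question's frame.
df = pd.DataFrame({"Name": ["Ram", "ram", "raM", "Arjun", "arjun"],
                   "Count": [1, 2, 1, 3, 4]})

# Group on the lower-cased names so case variants fall into one group,
# keep the first spelling seen and sum the counts.
out = (df.groupby(df["Name"].str.lower(), sort=False)
         .agg({"Name": "first", "Count": "sum"})
         .reset_index(drop=True))
print(out)
#     Name  Count
# 0    Ram      4
# 1  Arjun      7
```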

Analyze Data Frames In A List And Bind The Results

Submitted by 不羁的心 on 2019-12-05 19:21:44
The data I have is a list of data frames. I want to loop through each data frame to find:

(1) Whether there are columns with duplicate column names. If yes, I want to merge them by using rbind() into a parent data frame called output and remove all other columns of such data frames.

(2) Whether there is any data frame that doesn't have duplicate columns. If yes, remove all the columns except the first one, then cbind() with output, such that if the rows are more or fewer than what was created by (1), zeros should be added.

I tried using lapply(), but my logic to get the above two

Reverse the order of a DataFrame's rows with pandas [duplicate]

Submitted by 佐手、 on 2019-12-05 17:16:28
Question: This question already has answers here: Right way to reverse pandas.DataFrame? (2 answers). Closed 9 months ago.

How can I reverse the order of the rows in my pandas.DataFrame? I've looked everywhere and the only thing people are talking about is sorting the columns or reversing the order of the columns... What I want is simple: if my DataFrame looks like this:

A     B    C
------------------
LOVE  IS   ALL
THAT  MAT  TERS

I want it to become this:

A     B    C
------------------
THAT  MAT  TERS
LOVE  IS   ALL

I
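A minimal sketch of the usual answer (slice the rows with a negative step), using a frame rebuilt from the example above:

```python
import pandas as pd

df = pd.DataFrame({"A": ["LOVE", "THAT"], "B": ["IS", "MAT"], "C": ["ALL", "TERS"]})

reversed_df = df.iloc[::-1]                    # rows reversed, original index kept
tidy = df.iloc[::-1].reset_index(drop=True)    # rows reversed, index renumbered 0..n-1
print(tidy)
#       A    B     C
# 0  THAT  MAT  TERS
# 1  LOVE   IS   ALL
```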

Trend analysis using iterative value increments

Submitted by 强颜欢笑 on 2019-12-05 14:30:48
We have configured iReport to generate the following graph (the real data points are in blue, the trend line is green). The problems include:

- Too many data points for the trend line
- The trend line does not follow a Bezier curve (spline)

The source of the problem is with the incrementer class. The incrementer is provided with the data points iteratively; there does not appear to be a way to get the full set of data. The code that calculates the trend line looks as follows:

import java.math.BigDecimal;
import net.sf.jasperreports.engine.fill.*;

/**
 * Used by an iReport variable to increment its average.
 *

Combine date column and time column into datetime column

Submitted by 强颜欢笑 on 2019-12-05 06:33:43
I have a Pandas DataFrame like this (obtained by parsing an Excel file):

|       | COMPANY NAME            | MEETING DATE        | MEETING TIME |
|-------|-------------------------|---------------------|--------------|
| YKSGR | YAPI KREDİ SİGORTA A.Ş. | 2013-12-16 00:00:00 | 14:00:00     |
| TRCAS | TURCAS PETROL A.Ş.      | 2013-12-12 00:00:00 | 13:30:00     |

The column MEETING DATE is a timestamp with a representation like Timestamp('2013-12-20 00:00:00', tz=None), and MEETING TIME is a datetime.time object with a representation like datetime.time(14, 0). I want to combine MEETING DATE and MEETING TIME into one column. datetime.combine
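A sketch of the datetime.combine route the question mentions, with the two example rows rebuilt by hand:

```python
import datetime
import pandas as pd

df = pd.DataFrame({
    "MEETING DATE": [pd.Timestamp("2013-12-16"), pd.Timestamp("2013-12-12")],
    "MEETING TIME": [datetime.time(14, 0), datetime.time(13, 30)],
})

# datetime.datetime.combine(date, time) glues a date and a time into one datetime;
# applying it row-wise yields a single datetime64 column.
df["MEETING DATETIME"] = [
    datetime.datetime.combine(d, t)
    for d, t in zip(df["MEETING DATE"], df["MEETING TIME"])
]
print(df["MEETING DATETIME"])
# 0   2013-12-16 14:00:00
# 1   2013-12-12 13:30:00
# Name: MEETING DATETIME, dtype: datetime64[ns]
```

An equivalent route is to convert both columns to strings and pass their concatenation to pd.to_datetime.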

Python: How to use Multinomial Logistic Regression using SKlearn

Submitted by 余生颓废 on 2019-12-05 04:50:01
I have a test dataset and a train dataset as below. I have provided sample data with a minimal number of records, but my data has more than 1000 records. Here E is my target variable, which I need to predict using an algorithm. It has only four categories, 1, 2, 3 and 4, and can take only one of these values.

Training dataset:

A   B   C   D   E
1   20  30  1   1
2   22  12  33  2
3   45  65  77  3
12  43  55  65  4
11  25  30  1   1
22  23  19  31  2
31  41  11  70  3
1   48  23  60  4

Test dataset:

A   B   C   D
11  21  12  11
1   2   3   4
5   6   7   8
99  87  65  34
11  21  24  12

Since E has only 4 categories, I thought of predicting this using Multinomial Logistic Regression
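A hedged sketch of fitting scikit-learn's logistic regression to the rows above. The flattened excerpt is read here as a test set with four columns A-D (E is what is being predicted), and the sample is far too small for a meaningful model; this only illustrates the API.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Training rows transcribed from the question; E (last column) is the 4-class target.
train = np.array([
    [ 1, 20, 30,  1, 1],
    [ 2, 22, 12, 33, 2],
    [ 3, 45, 65, 77, 3],
    [12, 43, 55, 65, 4],
    [11, 25, 30,  1, 1],
    [22, 23, 19, 31, 2],
    [31, 41, 11, 70, 3],
    [ 1, 48, 23, 60, 4],
])
X_train, y_train = train[:, :4], train[:, 4]

X_test = np.array([
    [11, 21, 12, 11],
    [ 1,  2,  3,  4],
    [ 5,  6,  7,  8],
    [99, 87, 65, 34],
    [11, 21, 24, 12],
])

# With the lbfgs solver, recent scikit-learn versions fit a single softmax
# (multinomial) model over the four classes by default; older versions may need
# multi_class='multinomial' passed explicitly.
clf = LogisticRegression(solver="lbfgs", max_iter=1000)
clf.fit(X_train, y_train)
print(clf.predict(X_test))        # one label from {1, 2, 3, 4} per test row
print(clf.predict_proba(X_test))  # class probabilities, columns ordered as clf.classes_
```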

How do you deal with missing data using numpy/scipy?

Submitted by 隐身守侯 on 2019-12-05 02:14:31
One of the things I deal with most in data cleaning is missing values. R deals with this well using its "NA" missing-data label. In Python, it appears that I'll have to deal with masked arrays, which seem to be a major pain to set up and don't seem to be well documented. Any suggestions on making this process easier in Python? This is becoming a deal-breaker in moving to Python for data analysis. Thanks

Update: It's obviously been a while since I've looked at the methods in the numpy.ma module. It appears that at least the basic analysis functions are available for masked arrays, and the
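For reference, a minimal numpy.ma sketch of the masked-array route the update mentions, plus the NaN-based alternative (values invented for illustration):

```python
import numpy as np

# Mask a sentinel value; the usual reductions then skip the masked entries.
raw = np.array([1.0, 2.0, -999.0, 4.0, -999.0, 6.0])   # -999 marks missing data
data = np.ma.masked_values(raw, -999.0)
print(data.mean())           # 3.25 -- the two masked entries are ignored
print(data.filled(np.nan))   # convert back to a plain array, missing -> NaN

# NaN-based alternative: keep NaN in a float array and use the nan-aware reducers.
arr = np.array([1.0, 2.0, np.nan, 4.0, np.nan, 6.0])
print(np.nanmean(arr), np.nanstd(arr))
```

pandas, built on top of NumPy, also treats NaN as missing throughout and is usually the less painful route for this kind of cleaning.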

Rolling comparison between a value and a past window, with percentile/quantile

Submitted by 时间秒杀一切 on 2019-12-04 21:16:45
I'd like to compare each value x of an array with a rolling window of the n previous values. More precisely, I'd like to see at which percentile this new value x would be if we added it to the previous window:

import numpy as np
A = np.array([1, 4, 9, 28, 28.5, 2, 283, 3.2, 7, 15])
print A
n = 4  # window width
for i in range(len(A)-n):
    W = A[i:i+n]
    x = A[i+n]
    q = sum(W <= x) * 1.0 / n
    print 'Value:', x, ' Window before this value:', W, ' Quantile:', q

[ 1. 4. 9. 28. 28.5 2. 283. 3.2 7. 15. ]
Value: 28.5  Window before this value: [ 1. 4. 9. 28.]  Quantile: 1.0
Value: 2.0  Window before this
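A possible vectorised variant of the same computation, assuming NumPy 1.20+ for sliding_window_view (a sketch, not the asker's code):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

A = np.array([1, 4, 9, 28, 28.5, 2, 283, 3.2, 7, 15])
n = 4  # window width

windows = sliding_window_view(A, n)[:-1]   # the n values preceding each candidate x
x = A[n:]                                  # the values being ranked
q = (windows <= x[:, None]).mean(axis=1)   # fraction of each window <= its new value

for value, quantile in zip(x, q):
    print("Value: %s  Quantile: %s" % (value, quantile))
```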

[Statsmodels]: How can I get statsmodels to return the p-value of an OLS object?

Submitted by 怎甘沉沦 on 2019-12-04 20:12:53
I'm quite new to programming and I'm jumping into Python to get some familiarity with data analysis and machine learning. I am following a tutorial on backward elimination for a multiple linear regression. Here is the code right now:

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values

#Taking care of missin' data
#np.set_printoptions(threshold=100)
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values =
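On the question in the title: a fitted statsmodels OLS results object exposes the per-coefficient p-values as results.pvalues (the same numbers appear in the P>|t| column of results.summary()). A minimal sketch with synthetic data, since the 50_Startups.csv file isn't shown here:

```python
import numpy as np
import statsmodels.api as sm

# Synthetic stand-in for the regression data (three predictors, one with no effect).
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 3))
y = 2.0 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(size=50)

X = sm.add_constant(X)          # statsmodels does not add an intercept automatically
results = sm.OLS(y, X).fit()

print(results.pvalues)          # one p-value per coefficient (const, x1, x2, x3)
print(results.summary())        # full table: coef, std err, t, P>|t|, conf. int.
```

In the backward-elimination loop from such tutorials, one typically drops the predictor with the largest p-value above the chosen significance level and refits until every remaining p-value is below it.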