pandas | 易学教程

Filtering multiple conditions from a Dataframe in Python

Question: I want to filter rows of a DataFrame using multiple conditions on multiple columns. I tried doing so like this:

    arrival_delayed_weather = [[[flight_data_finalcopy["ArrDelay"] > 0]] & [[flight_data_finalcopy["WeatherDelay"]>0]]]
    arrival_delayed_weather_filter = arrival_delayed_weather[["UniqueCarrier", "AirlineID"]]
    print(arrival_delayed_weather_filter)

However, I get this error message:

    TypeError: unsupported operand type(s) for &: 'list' and 'list'

How do I solve this? Thanks in
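
The error in the question above comes from the extra square brackets: they wrap each boolean Series in a Python list, and `&` is not defined for lists. A minimal sketch of one common fix follows, assuming a small stand-in DataFrame (only the column names come from the question; the values are invented for illustration). Each comparison goes in parentheses, the two masks are combined with `&`, and `.loc` selects the rows and columns in one step.

```python
import pandas as pd

# Hypothetical stand-in for flight_data_finalcopy; the values are made up.
flight_data_finalcopy = pd.DataFrame({
    "UniqueCarrier": ["AA", "UA", "DL", "WN"],
    "AirlineID": [19805, 19977, 19790, 19393],
    "ArrDelay": [12.0, -3.0, 45.0, 7.0],
    "WeatherDelay": [5.0, 0.0, 0.0, 2.0],
})

# Parentheses are required because & binds more tightly than >.
mask = (flight_data_finalcopy["ArrDelay"] > 0) & (flight_data_finalcopy["WeatherDelay"] > 0)

# Keep only the matching rows and the two columns of interest.
arrival_delayed_weather_filter = flight_data_finalcopy.loc[mask, ["UniqueCarrier", "AirlineID"]]
print(arrival_delayed_weather_filter)
```

Use `|` instead of `&` for "or", and `~` to negate a condition; Python's `and`/`or` keywords do not work element-wise on Series.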

Using multiple custom classes with Pipeline sklearn (Python)

Question: I am trying to write a tutorial on Pipeline for students, but I am stuck. I'm not an expert, but I'm trying to improve, so thank you for your patience. I am trying to use a pipeline to run several steps that prepare a dataframe for a classifier:

Step 1: Description of the dataframe
Step 2: Fill NaN values
Step 3: Transform categorical values into numbers

Here is my code:

    class Descr_df(object):
        def transform (self, X):
            print ("Structure of the data: \n {}".format(X.head(5)))
            print ("Features names:
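
The three steps listed in the question can be sketched as custom transformers. This is not the asker's code (which is truncated above) but a minimal illustration under stated assumptions: the class names `DescribeFrame`, `FillMissing`, and `EncodeCategoricals` are hypothetical, the sample data is invented, and the fill and encoding strategies are arbitrary choices. The key point is that every intermediate Pipeline step needs both `fit` and `transform`; inheriting from `TransformerMixin` supplies `fit_transform`.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class DescribeFrame(BaseEstimator, TransformerMixin):
    """Step 1 (hypothetical name): print a description, return the data unchanged."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        print("Structure of the data:\n{}".format(X.head(5)))
        print("Feature names: {}".format(list(X.columns)))
        return X


class FillMissing(BaseEstimator, TransformerMixin):
    """Step 2 (hypothetical name): fill NaN with the column mean or 'missing'."""

    def fit(self, X, y=None):
        # Learn the means of the numeric columns from the training data.
        self.means_ = X.select_dtypes(include="number").mean()
        return self

    def transform(self, X):
        X = X.copy()
        for col in X.columns:
            if col in self.means_.index:
                X[col] = X[col].fillna(self.means_[col])
            else:
                X[col] = X[col].fillna("missing")
        return X


class EncodeCategoricals(BaseEstimator, TransformerMixin):
    """Step 3 (hypothetical name): replace non-numeric columns with integer codes.

    Codes are derived at transform time, so they are only consistent within
    one DataFrame; a real tutorial would learn the categories in fit().
    """

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        for col in X.select_dtypes(exclude="number").columns:
            X[col] = X[col].astype("category").cat.codes
        return X


# Invented sample data with one missing value per column.
raw = pd.DataFrame({"age": [22.0, None, 35.0], "sex": ["m", "f", None]})

pipeline = Pipeline([
    ("describe", DescribeFrame()),
    ("fill", FillMissing()),
    ("encode", EncodeCategoricals()),
])
prepared = pipeline.fit_transform(raw)
print(prepared)
```

A classifier could be appended as a final step; the sketch stops at the prepared DataFrame because the question's own code is cut off before that point.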

Using Pandas to sample DataFrame using a specific column's weight

Question: I have a DataFrame which looks like:

    index  name   city
    0      Yam    Hadera
    1      Meow   Hadera
    2      Don    Hadera
    3      Jazz   Hadera
    4      Bond   Tel Aviv
    5      James  Tel Aviv

I want pandas to randomly choose rows, weighted by the number of appearances in the city column (something like using df.city.value_counts()), so that the result of my magic function, say:

    df.magic_sample(3, weight_column='city')

might look like:

    0  Yam   Hadera
    1  Meow  Hadera
    2  Bond  Tel Aviv

Thanks! :)

Answer 1: You can group by city and then sample each group based on
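
Because the answer above is cut off, the following is only a sketch of two possible readings of the request, not the answerer's code. The first takes the same fraction from every city group, so the sample keeps the city proportions exactly; the second draws rows at random with a per-row weight equal to how often that row's city appears. The `random_state` values are arbitrary and only make the run repeatable.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Yam", "Meow", "Don", "Jazz", "Bond", "James"],
    "city": ["Hadera", "Hadera", "Hadera", "Hadera", "Tel Aviv", "Tel Aviv"],
})

# Reading 1: stratified sample. 3 of 6 rows means half of each city group,
# i.e. 2 rows from Hadera and 1 row from Tel Aviv.
stratified = df.groupby("city").sample(frac=0.5, random_state=0)
print(stratified)

# Reading 2: weighted sample. Each row's weight is the frequency of its city,
# so rows from common cities are more likely to be drawn.
weights = df["city"].map(df["city"].value_counts())
weighted = df.sample(n=3, weights=weights, random_state=0)
print(weighted)
```

The stratified version guarantees the 2:1 split shown in the question's example output, while the weighted version only makes that split more likely.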

pandas: Keep only top n values and set others to 0

Question: In a pandas dataframe, for every row, I want to keep only the top N values and set everything else to 0. I can iterate through the rows and do it, but I am sure Python/pandas can do it elegantly in a single line. For example, for N = 2:

Input:

    A   B   C   D
    4  10  10   6
    5  20  50  90
    6  30   6   4
    7  40  12   9

Output:

    A   B   C   D
    0  10  10   0
    0   0  50  90
    6  30   6   0
    0  40  12   0

Answer 1: Use rank with the parameters axis=1, method='min', and ascending=False:

    N = 2
    df = df.mask(df.rank(axis=1, method='min', ascending=False) > N, 0)
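
To make the answer above easy to try, here is a self-contained sketch that rebuilds the question's input table and applies the same `rank` and `mask` expression. The intermediate name `ranks` and the result name `top_n` are introduced here for readability only.

```python
import pandas as pd

# The input table from the question.
df = pd.DataFrame({
    "A": [4, 5, 6, 7],
    "B": [10, 20, 30, 40],
    "C": [10, 50, 6, 12],
    "D": [6, 90, 4, 9],
})

N = 2

# Rank within each row, largest value first; tied values share the lowest rank.
ranks = df.rank(axis=1, method="min", ascending=False)

# Replace every cell whose rank is worse than N with 0.
top_n = df.mask(ranks > N, 0)
print(top_n)
```

With `method='min'`, tied values share a rank, so a row can keep more than N values: in the third row both 6s are ranked second and both survive, which matches the expected output in the question.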

Filling date gaps in pandas dataframe

Question: I have a pandas DataFrame (loaded from a .csv) with a date-time index, where there should be one entry per day. The problem is that I have gaps, i.e. there are days for which I have no data at all. What is the easiest way to insert rows (days) in the gaps? Also, is there a way to control what is inserted in the columns as data: say 0, OR a copy of the previous day's values, OR sliding increasing/decreasing values ranging from the previous date's values toward the next date's values? Thanks. Here is example 01-03
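
A minimal sketch of the three fill strategies the question asks about, assuming a small invented series with two missing days (the asker's own example is truncated above, so the data and the column name `value` are hypothetical). The gaps are inserted by reindexing against a complete daily `date_range`.

```python
import pandas as pd

# Hypothetical daily data; 2020-01-03 and 2020-01-04 are missing.
df = pd.DataFrame(
    {"value": [1.0, 2.0, 5.0]},
    index=pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-05"]),
)

# Every calendar day between the first and last observed dates.
full_index = pd.date_range(df.index.min(), df.index.max(), freq="D")

# Option 1: insert the missing days and fill them with 0.
filled_zero = df.reindex(full_index, fill_value=0)

# Option 2: insert the missing days and copy the previous day's values.
filled_prev = df.reindex(full_index).ffill()

# Option 3: insert the missing days and interpolate linearly between neighbours.
filled_interp = df.reindex(full_index).interpolate()

print(filled_zero, filled_prev, filled_interp, sep="\n\n")
```

`df.asfreq("D")` is a shorter way to insert the missing days when the index is already sorted and free of duplicates.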

add a different random number to every cell in a pandas dataframe

Question: I need to add some 'noise' to my data, so I would like to add a different random number to every cell in my pandas dataframe. This code works, but seems unpythonic. Is there a better way?

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(0.0, index=[1,2,3,4,5], columns=list('ABC') )
    print(df)
    for x,line in df.iterrows():
        for col in df:
            line[col] = line[col] + (np.random.rand()-0.5)/1000.0
    print(df)

Answer 1:

    df + np.random.rand(*df.shape) / 10000.0

OR let's use applymap:

    df = pd.DataFrame(1.0
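
A vectorized sketch that keeps the noise range from the question, i.e. values in [-0.0005, 0.0005), rather than the one-sided range in the answer's one-liner. It assumes NumPy's `default_rng` generator is available; the seed and the names `rng`, `noise`, and `noisy` are choices made here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(0.0, index=[1, 2, 3, 4, 5], columns=list("ABC"))

# Seeded generator so the example is repeatable; drop the seed for fresh noise.
rng = np.random.default_rng(seed=0)

# One independent draw per cell, centred on zero and scaled as in the question.
noise = (rng.random(df.shape) - 0.5) / 1000.0

# Adding an array of the same shape leaves the index and columns untouched.
noisy = df + noise
print(noisy)
```

This avoids the row-by-row loop entirely and does not rely on `iterrows`, whose rows are not guaranteed to write back to the original DataFrame.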

Python & Pandas: Combine columns into a date

Question: In my dataframe, the time is split across 3 columns: year, month, day, like this: How can I convert them into a date, so I can do time series analysis? I can do this:

    df.apply(lambda x:'%s %s %s' % (x['year'],x['month'], x['day']),axis=1)

which gives:

    1095    1954 1 1
    1096    1954 1 2
    1097    1954 1 3
    1098    1954 1 4
    1099    1954 1 5
    1100    1954 1 6
    1101    1954 1 7
    1102    1954 1 8
    1103    1954 1 9
    1104    1954 1 10
    1105    1954 1 11
    1106    1954 1 12
    1107    1954 1 13

But what follows? EDIT: This is what I end up with: from
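
One way to go from the three columns to real dates, sketched under the assumption that the columns are named exactly `year`, `month`, and `day` as in the question: `pd.to_datetime` accepts a DataFrame with those column names and assembles the dates directly, with no string formatting. The three sample rows mirror the first lines of the question's output; the new column name `date` is a choice made here.

```python
import pandas as pd

# First three rows from the question's output, keeping its index labels.
df = pd.DataFrame(
    {"year": [1954, 1954, 1954], "month": [1, 1, 1], "day": [1, 2, 3]},
    index=[1095, 1096, 1097],
)

# to_datetime builds one timestamp per row from the year/month/day columns.
df["date"] = pd.to_datetime(df[["year", "month", "day"]])

# Optional: use the dates as the index for time series work.
ts = df.set_index("date")
print(ts)
```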

Julia Dataframes vs Python pandas

Question: I am currently using Python pandas and want to know if there is a way to output the data from pandas into Julia DataFrames and vice versa. (I think you can call Python from Julia with PyCall, but I am not sure if it works with dataframes.) Is there a way to call Julia from Python and have it take in pandas dataframes (without saving to another file format like csv)? When would it be advantageous to use Julia DataFrames rather than pandas, other than extremely large datasets and running things with

Accessing a Pandas index like a regular column

Question: I have a pandas DataFrame with a named index. I want to pass it to a piece of code that takes a DataFrame, a column name, and some other stuff, and does a bunch of work involving that column. Only in this case the column I want to highlight is the index, but giving the index's label to this piece of code doesn't work because you can't extract an index like you can a regular column. For example, I can construct a DataFrame like this:

    import pandas as pd, numpy as np
    df=pd.DataFrame({'name
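
Two ways around the problem described above, sketched with invented data because the question's own DataFrame is truncated. The helper `get_column_or_index` is hypothetical (it is not a pandas function): it returns a Series for a given label whether that label names a column or an index level. The simpler alternative is `reset_index()`, which turns the named index into an ordinary column before the DataFrame is handed to the other code.

```python
import pandas as pd


def get_column_or_index(df, name):
    """Hypothetical helper: look up `name` among columns first, then index levels."""
    if name in df.columns:
        return df[name]
    if name in df.index.names:
        # Wrap the index values in a Series aligned with the original index.
        return df.index.get_level_values(name).to_series(index=df.index)
    raise KeyError(name)


# Invented example: the index is named "name", the only column is "value".
df = pd.DataFrame(
    {"value": [10, 20, 30]},
    index=pd.Index(["a", "b", "c"], name="name"),
)

from_index = get_column_or_index(df, "name")
from_column = get_column_or_index(df, "value")

# Alternative: promote the index to a regular column.
flat = df.reset_index()
print(flat["name"])
```

`reset_index()` returns a copy, so the original DataFrame keeps its index; the helper is more useful when the downstream code must not see a reshaped frame.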