pandas

Read csv with pandas with commented header

限于喜欢 · submitted on 2021-02-07 13:32:34
Question: I have CSV files that have # in the header line:

s = '#one two three\n1 2 3'

If I use pd.read_csv, the # sign ends up in the first header name:

import pandas as pd
from io import StringIO
pd.read_csv(StringIO(s), delim_whitespace=True)

   #one  two  three
0     1    2      3

If I set the argument comment='#', then pandas ignores the header line completely. Is there an easy way to handle this case? A second, related issue is how to handle quoting in this case; it works when there is no #:

s = '"one one" two three\n1 2 3'
print(pd.read_csv(StringIO(s), delim_whitespace=True))
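
One possible workaround, sketched below on the question's sample string (not an answer from the original thread): consume the header line yourself, strip the leading #, and pass the parsed names to read_csv via names=. Using shlex.split for the header also keeps quoted column names such as "one one" intact.

```python
import shlex
from io import StringIO

import pandas as pd

s = '#one two three\n1 2 3'

# Read the header line ourselves, drop the leading '#', and parse the column
# names with shlex.split so quoted names like "one one" stay in one piece.
buf = StringIO(s)
names = shlex.split(buf.readline().lstrip('#'))

# The buffer is now positioned at the data rows, so read_csv only parses those
# and we supply the cleaned column names explicitly.
df = pd.read_csv(buf, delim_whitespace=True, names=names)
print(df)
#    one  two  three
# 0    1    2      3
```

On recent pandas releases, sep=r'\s+' is the non-deprecated equivalent of delim_whitespace=True.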

Drop duplicates, but ignore nulls

戏子无情 · submitted on 2021-02-07 13:26:58
Question: I know you can use something like this to drop duplicate rows:

the_data.drop_duplicates(subset=['the_key'])

However, the_key is null for some rows, like below:

   the_key  C  D
1      NaN  *  *
2      NaN     *
3      111  *  *
4      111

It will keep only the rows marked in the C column. Is it possible to get drop_duplicates to treat all NaN values as distinct and get an output that keeps the rows marked in the D column?

Answer 1: Use duplicated chained with isna and filter by boolean indexing:

df = df[(~df['the_key'].duplicated()) | df['the_key'].isna()]
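
A short demonstration of that filter on made-up data mirroring the table above:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: two rows with a NaN key, two rows sharing the key 111.
df = pd.DataFrame({'the_key': [np.nan, np.nan, 111, 111],
                   'value': ['a', 'b', 'c', 'd']})

# duplicated() treats the NaN keys as duplicates of each other, so OR in
# isna() to keep every row whose key is missing.
out = df[(~df['the_key'].duplicated()) | df['the_key'].isna()]
print(out)
#    the_key value
# 0      NaN     a
# 1      NaN     b
# 2    111.0     c
```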

Replace NaN values of pandas.DataFrame with values from list

允我心安 · submitted on 2021-02-07 13:21:33
Question: In a Python script using the pandas library, I have a dataset of, say, 100 rows with a feature "X" containing 36 NaN values, and a list of size 36. I want to replace the 36 missing values of column "X" with the 36 values from my list. It's probably a dumb question, but I went through all the docs and couldn't find a way to do it. Here's an example:

Input data:
   X  Y
   1  8
   2  3
 NaN  2
 NaN  7
   1  2
 NaN  2

Filler list: [8, 6, 3]

Output data:
   X  Y
   1  8
   2  3
   8  2
   6  7
   1  2
   3  2

Answer 1: Start
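
A minimal sketch of one common approach (assuming the filler values are meant to fill the NaNs in order), assigning the list to the missing positions with .loc and a boolean mask:

```python
import numpy as np
import pandas as pd

# Frame and filler list from the question's example.
df = pd.DataFrame({'X': [1, 2, np.nan, np.nan, 1, np.nan],
                   'Y': [8, 3, 2, 7, 2, 2]})
filler = [8, 6, 3]

# Assign the filler values, in order, to the rows where X is missing.
# The list length must equal the number of NaNs in X.
df.loc[df['X'].isna(), 'X'] = filler
print(df)
#      X  Y
# 0  1.0  8
# 1  2.0  3
# 2  8.0  2
# 3  6.0  7
# 4  1.0  2
# 5  3.0  2
```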

Implement a classic martingale using Python and Pandas

陌路散爱 · submitted on 2021-02-07 13:11:50
Question: I want to implement a classic martingale in a betting system using Python and pandas. Let's say the DataFrame is defined like this:

df = pd.DataFrame(np.random.randint(0,2,100)*2-1, columns=['TossResults'])

so it contains toss results (-1 = lose, 1 = win). I would like to change the stake (the amount I bet on every bet) using a classic martingale: the initial stake is 1; if I lose, the next stake is 2 times the previous stake (multiplier=2); if I win, the stake goes back to stake_initial. I wrote a function def stake_martingale
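
A minimal sketch of such a stake_martingale function (an illustration, not the asker's original code):

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # reproducible example
df = pd.DataFrame(np.random.randint(0, 2, 100) * 2 - 1, columns=['TossResults'])

def stake_martingale(results, stake_initial=1, multiplier=2):
    """Return the stake wagered on each toss under a classic martingale."""
    stakes = []
    stake = stake_initial
    for result in results:
        stakes.append(stake)            # amount wagered on this toss
        if result == -1:
            stake *= multiplier         # multiply the stake after a loss
        else:
            stake = stake_initial       # reset the stake after a win
    return pd.Series(stakes, index=results.index)

df['Stake'] = stake_martingale(df['TossResults'])
df['CumulativeProfit'] = (df['Stake'] * df['TossResults']).cumsum()
print(df.head(10))
```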

Using pandas to plot barplots with error bars

℡╲_俬逩灬. · submitted on 2021-02-07 13:11:31
Question: I'm trying to generate bar plots from a DataFrame like this:

          Pre  Post
Measure1  0.4   1.9

These values are medians I calculated elsewhere, and I also have their variance and standard deviation (and standard error, too). I would like to plot the results as a bar plot with the proper error bars, but specifying more than one error value for yerr raises an exception:

# Data is a DataFrame instance
fig = data.plot(kind="bar", yerr=[0.1, 0.3])
[...]
ValueError: In safezip, len(args[0])=1 but
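
One approach that should work is to give yerr the same shape as the data, e.g. a DataFrame (or dict keyed by column), so each bar gets its own error value. Sketched below with the medians and errors quoted in the question; everything else is illustrative:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Medians and their errors, shaped exactly like the frame in the question.
data = pd.DataFrame({'Pre': [0.4], 'Post': [1.9]}, index=['Measure1'])
errors = pd.DataFrame({'Pre': [0.1], 'Post': [0.3]}, index=['Measure1'])

# yerr with the same columns/index as the data gives one error bar per bar.
ax = data.plot(kind='bar', yerr=errors, capsize=4, rot=0)
ax.set_ylabel('Median')
plt.tight_layout()
plt.show()
```

A plain dict such as {'Pre': [0.1], 'Post': [0.3]} should be accepted for yerr in the same way.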

Python pandas dataframe: find max for each unique value of another column

倖福魔咒の · submitted on 2021-02-07 12:51:22
Question: I have a large DataFrame (from 500k to 1M rows) which contains, for example, these 3 numeric columns: ID, A, B. I want to filter the results in order to obtain a table like the one in the image below, where, for each unique value of column ID, I have the maximum and minimum value of A and B. How can I do this?

EDIT: I have updated the image below to be clearer: when I get the max or min from a column, I also need the data from the other columns associated with it.

Answer 1: Sample data (note
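
A hedged sketch of the usual approaches, using hypothetical sample data:

```python
import pandas as pd

# Hypothetical sample data with the three numeric columns from the question.
df = pd.DataFrame({'ID': [1, 1, 2, 2, 2],
                   'A': [10, 30, 5, 25, 15],
                   'B': [100, 50, 80, 60, 90]})

# Plain per-ID min/max of each column (loses the link between A and B values):
summary = df.groupby('ID').agg(A_min=('A', 'min'), A_max=('A', 'max'),
                               B_min=('B', 'min'), B_max=('B', 'max'))
print(summary)

# To also keep the other columns of the row where the max (or min) occurs,
# locate those rows with idxmax/idxmin and select them.
rows_at_max_A = df.loc[df.groupby('ID')['A'].idxmax()]
print(rows_at_max_A)
```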

pandas get unique values from column of lists

一曲冷凌霜 · submitted on 2021-02-07 12:37:41
Question: How do I get the unique values of a column of lists in pandas or numpy, such that a Genre column of lists would yield 'action', 'crime', 'drama'? The closest (but non-functional) solutions I could come up with were:

genres = data['Genre'].unique()

But this predictably results in a TypeError saying that lists aren't hashable:

TypeError: unhashable type: 'list'

A set seemed like a good idea, but

genres = data.apply(set(), columns=['Genre'], axis=1)

also results in a TypeError: set() takes no
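
A sketch of two common ways to do this (the Genre data below is made up):

```python
import pandas as pd

# Hypothetical column of lists, similar to the Genre column in the question.
data = pd.DataFrame({'Genre': [['crime', 'drama'],
                               ['action', 'crime'],
                               ['drama']]})

# explode() turns each list element into its own row; the values are then
# plain strings, so unique() works again.
genres = data['Genre'].explode().unique()
print(genres)              # e.g. ['crime' 'drama' 'action']

# Equivalent without explode: flatten the lists into a set.
genres_set = set(g for row in data['Genre'] for g in row)
print(sorted(genres_set))  # ['action', 'crime', 'drama']
```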