pandas

Filtering rows of a dataframe based on values in columns

时光总嘲笑我的痴心妄想 · submitted on 2021-02-20 19:08:01
Question: I want to filter the rows of a dataframe so that only rows containing values less than, say, 10 are kept.

```python
import numpy as np
import pandas as pd
from pprint import pprint

df = pd.DataFrame(np.random.randint(0, 100, size=(10, 4)), columns=list('ABCD'))
df = df[df < 10]
```

gives

```
     A    B    C    D
0  5.0  NaN  NaN  NaN
1  NaN  NaN  NaN  NaN
2  0.0  NaN  6.0  NaN
3  NaN  NaN  NaN  NaN
4  NaN  NaN  NaN  NaN
5  6.0  NaN  NaN  NaN
6  NaN  NaN  NaN  NaN
7  NaN  NaN  NaN  7.0
8  NaN  NaN  NaN  NaN
9  NaN  NaN  NaN  NaN
```

Expected:

```
    A   B   C   D
0   5  57  87  95
2   0  80   6  82
5   6  33  74  75
7  71  44  60   7
```
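A minimal sketch of one way to get the expected output, assuming the goal is to keep whole rows that contain at least one value below 10: build a row-wise boolean mask with `any(axis=1)` instead of masking individual cells.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # seeded only so this sketch is reproducible
df = pd.DataFrame(np.random.randint(0, 100, size=(10, 4)), columns=list('ABCD'))

# df[df < 10] masks cell-by-cell, turning the rest into NaN;
# a row-wise mask instead keeps the qualifying rows intact.
filtered = df[(df < 10).any(axis=1)]
print(filtered)
```

Comparing with `(df < 10).all(axis=1)` would instead require every column in a row to be below 10.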

Pandas Dataframe - for each row, return count of other rows with overlapping dates

非 Y 不嫁゛ · submitted on 2021-02-20 19:05:33
Question: I've got a dataframe with projects, start dates, and end dates. For each row I would like to return the number of other projects in process when the project started. How do you nest loops when using df.apply()? I've tried using a for loop, but my dataframe is large and it takes way too long.

```python
import datetime as dt

data = {'project': ['A', 'B', 'C'],
        'pr_start_date': [dt.datetime(2018, 9, 1), dt.datetime(2019, 4, 1), dt.datetime(2019, 6, 8)],
        'pr_end_date': [dt.datetime(2019, 6, 15), dt.datetime
```
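A possible vectorized sketch using NumPy broadcasting instead of nested loops (the last two end dates below are invented, since the sample data is cut off): for each start date, count the other projects whose interval contains it.

```python
import datetime as dt

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'project': ['A', 'B', 'C'],
    'pr_start_date': [dt.datetime(2018, 9, 1), dt.datetime(2019, 4, 1), dt.datetime(2019, 6, 8)],
    # The question's data is truncated; these last two end dates are assumptions.
    'pr_end_date': [dt.datetime(2019, 6, 15), dt.datetime(2019, 7, 1), dt.datetime(2019, 12, 1)],
})

starts = df['pr_start_date'].to_numpy()
ends = df['pr_end_date'].to_numpy()

# running[i, j] is True when project j was already running as project i started.
running = (starts[:, None] >= starts[None, :]) & (starts[:, None] <= ends[None, :])
np.fill_diagonal(running, False)  # a project does not overlap itself
df['open_projects'] = running.sum(axis=1)
```

The broadcast builds an n-by-n matrix, so this trades memory for speed; for very large frames an interval-tree or sorted-events approach would bound memory better.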

efficient function to find harmonic mean across different pandas dataframes

两盒软妹~` · submitted on 2021-02-20 19:01:49
Question: I have several dataframes with identical shapes/types but slightly different numeric values. I can easily produce a new dataframe with the mean of all input dataframes via:

```python
df = pd.concat(input_dataframes)
df = df.groupby(df.index).mean()
```

I want to do the same with the harmonic mean (probably the scipy.stats.hmean function). I have attempted to do this using:

```python
.groupby(df.index).apply(scipy.stats.hmean)
```

But this alters the structure of the dataframe. Is there a better way to do this, or do I
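One sketch that preserves the frame's structure without groupby at all: apply the reciprocal identity hmean = n / Σ(1/x) elementwise across the frames (valid, like scipy.stats.hmean itself, only for strictly positive values). The two small frames below are hypothetical stand-ins.

```python
import pandas as pd

# Two stand-in frames with identical shape (hypothetical data).
dfa = pd.DataFrame({'x': [1.0, 2.0], 'y': [4.0, 4.0]})
dfb = pd.DataFrame({'x': [3.0, 6.0], 'y': [4.0, 12.0]})
dfs = [dfa, dfb]

# Elementwise harmonic mean: n divided by the elementwise sum of reciprocals.
# The result has exactly the same index/columns as each input frame.
hmean_df = len(dfs) / sum(1.0 / df for df in dfs)
```

This keeps each cell aligned by index and column, which is what the groupby-apply version loses when hmean collapses a whole group to one array.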

Merging Dataframe chunks in Pandas

纵然是瞬间 · submitted on 2021-02-20 18:54:42
Question: I currently have a script that combines multiple CSV files into one. The script works fine, except that we run out of RAM very quickly when larger files start being used. This is an issue for one reason: the script runs on an AWS server, and running out of RAM means a server crash. Currently the file size limit is around 250 MB each, which limits us to 2 files; however, as the company I work for is in biotech and we're using genetic-sequencing files, the files we use can range in size from
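A memory-bounded sketch (file names and chunk size here are placeholders): stream each CSV through `pd.read_csv(chunksize=...)` and append each chunk to the output file, so only one chunk ever sits in RAM.

```python
import pandas as pd

# Tiny demo inputs standing in for the real sequencing CSVs.
pd.DataFrame({'id': [1, 2], 'value': [10, 20]}).to_csv('part_a.csv', index=False)
pd.DataFrame({'id': [3], 'value': [30]}).to_csv('part_b.csv', index=False)

input_files = ['part_a.csv', 'part_b.csv']
output_file = 'combined.csv'

first = True
for path in input_files:
    # chunksize would be large (e.g. hundreds of thousands of rows) in practice;
    # 1 row per chunk here only demonstrates the streaming behaviour.
    for chunk in pd.read_csv(path, chunksize=1):
        # Write the header once, then append without it.
        chunk.to_csv(output_file, mode='w' if first else 'a', header=first, index=False)
        first = False
```

This assumes all inputs share the same columns; if pandas-level processing isn't needed at all, plain file-level concatenation (copying lines, skipping repeated headers) uses even less memory.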

pandas merge with MultiIndex, when only one level of index is to be used as key

不问归期 · submitted on 2021-02-20 17:56:35
Question: I have a data frame called df1 with a 2-level MultiIndex (levels: '_Date' and '_ItemId'). There are multiple instances of each value of '_ItemId', like this:

```
                    _SomeOtherLabel
_Date      _ItemId
2014-10-05 6588921  AA
           6592520  AB
           6836143  BA
2014-10-11 6588921  CA
           6592520  CB
           6836143  DA
```

I have a second data frame called df2 with '_ItemId' used as a key (not the index). In this df, there is only one occurrence of each value of _ItemId:

```
   _ItemId  _Cat
0  6588921  6_1
1  6592520  6_1
2  6836143  7_1
```

I want to
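A sketch of the reset_index round trip (data reconstructed from the excerpt above): drop the MultiIndex to columns, merge on the shared '_ItemId' key, then rebuild the index.

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [['2014-10-05', '2014-10-11'], [6588921, 6592520, 6836143]],
    names=['_Date', '_ItemId'],
)
df1 = pd.DataFrame({'_SomeOtherLabel': ['AA', 'AB', 'BA', 'CA', 'CB', 'DA']}, index=idx)
df2 = pd.DataFrame({'_ItemId': [6588921, 6592520, 6836143],
                    '_Cat': ['6_1', '6_1', '7_1']})

# Move the index levels into columns, join on the key, restore the MultiIndex.
merged = (df1.reset_index()
             .merge(df2, on='_ItemId', how='left')
             .set_index(['_Date', '_ItemId']))
```

In recent pandas, `df1.merge(df2, on='_ItemId')` can also match an index level name directly, but the reset_index round trip makes the resulting index explicit.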

Pandas: find maximum value, when and if conditions

纵然是瞬间 · submitted on 2021-02-20 10:09:56
Question: I have a dataframe, df:

```
id           volume  saturation   time_delay_normalised  speed        BPR_free_speed  BPR_speed    Volume  time_normalised
27WESTBOUND  580     0.351515152  57                     6.54248366   17.88           15.91366177  580     1.59375
27WESTBOUND  588     0.356363636  100                    5.107142857  17.88           15.86519847  588     2.041666667
27WESTBOUND  475     0.287878788  64                     6.25625      17.88           16.51161331  475     0.666666667
27EASTBOUND  401     0.243030303  59                     6.458064516  17.88           16.88283672  401     1.0914583333
27EASTBOUND  438     0.265454545  46                     7.049295775  17.88           16.70300418  438     1
```
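The question text is cut off before the actual ask; one plausible reading of the title ("find maximum value, when and if conditions") is a per-id maximum restricted by a condition. A hedged sketch under that assumption, rebuilding only the columns needed and using an arbitrary saturation threshold:

```python
import pandas as pd

# Rows rebuilt from the excerpt above (only the columns needed here).
df = pd.DataFrame({
    'id': ['27WESTBOUND', '27WESTBOUND', '27WESTBOUND', '27EASTBOUND', '27EASTBOUND'],
    'volume': [580, 588, 475, 401, 438],
    'saturation': [0.351515152, 0.356363636, 0.287878788, 0.243030303, 0.265454545],
})

# Hypothetical task: maximum volume per id, counting only rows that satisfy
# a condition (here, saturation below an arbitrarily chosen 0.355).
result = df[df['saturation'] < 0.355].groupby('id')['volume'].max()
```

`idxmax()` in place of `max()` would instead return the row label of each per-group maximum, which helps when the other columns of the winning row are needed.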

Python pandas: how to obtain the datatypes of objects in a mixed-datatype column?

陌路散爱 · submitted on 2021-02-20 09:29:31
Question: Given a pandas.DataFrame with a column holding mixed datatypes, e.g.

```python
df = pd.DataFrame({'mixed': [pd.Timestamp('2020-10-04'), 999, 'a string']})
```

I was wondering how to obtain the datatypes of the individual objects in the column (Series)? Suppose I want to modify all entries in the Series that are of a certain type, like multiply all integers by some factor. I could iteratively derive a mask and use it in loc, like

```python
m = np.array([isinstance(v, int) for v in df['mixed']])
df.loc[m, 'mixed
```
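One sketch: `Series.map(type)` reports each element's class, and an isinstance-based `map` builds the boolean mask directly as a Series, replacing the explicit list comprehension.

```python
import pandas as pd

df = pd.DataFrame({'mixed': [pd.Timestamp('2020-10-04'), 999, 'a string']})

# Per-element types of the object-dtype column.
types = df['mixed'].map(type)

# Boolean mask selecting the integers, then update only those entries
# (multiplying by 10 as an example factor).
m = df['mixed'].map(lambda v: isinstance(v, int))
df.loc[m, 'mixed'] = df.loc[m, 'mixed'] * 10
```

Note that `isinstance(v, int)` also matches Python bools; `type(v) is int` is the stricter test if bools must be excluded.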
