pandas

Python: pandas.cut labels are ignored

こ雲淡風輕ζ posted on 2021-02-10 18:39:08
Question: I want to cut one column in my pandas.DataFrame using pandas.cut(), but the labels I pass in the labels argument are not applied. Let me show you an example. I have the following data frame:
    >>> import pandas as pd
    >>> df = pd.DataFrame({'x': [-0.009, 0.089, 0.095, 0.096, 0.198]})
    >>> print(df)
           x
    0 -0.009
    1  0.089
    2  0.095
    3  0.096
    4  0.198
And I cut the x column like this:
    >>> bins = pd.IntervalIndex.from_tuples([(-0.2, -0.1), (-0.1, 0.0), (0.0, 0.1), (0.1, 0.2)])
    >>> labels = [100, 200, 300, 400]
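The pandas documentation for `cut` notes that `labels` is ignored when `bins` is an `IntervalIndex`, which would explain the behaviour described. Below is a minimal sketch of two possible workarounds, assuming the intervals are contiguous as in the question. The names `edges`, `by_edges`, and `by_rename` are illustrative and do not come from the original post.

```python
import pandas as pd

df = pd.DataFrame({"x": [-0.009, 0.089, 0.095, 0.096, 0.198]})
labels = [100, 200, 300, 400]

# Option 1: pass plain bin edges, in which case `labels` is applied.
edges = [-0.2, -0.1, 0.0, 0.1, 0.2]
by_edges = pd.cut(df["x"], bins=edges, labels=labels)

# Option 2: keep the IntervalIndex and rename the resulting categories.
bins = pd.IntervalIndex.from_tuples(
    [(-0.2, -0.1), (-0.1, 0.0), (0.0, 0.1), (0.1, 0.2)]
)
by_rename = pd.cut(df["x"], bins).cat.rename_categories(labels)

print(by_edges.tolist())
print(by_rename.tolist())
```

Option 1 only applies when the intervals share edges. Option 2 also covers non-contiguous intervals, since it relabels whatever categories `cut` produced.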

How to convert panda df to sparse df

生来就可爱ヽ(ⅴ<●) posted on 2021-02-10 18:30:32
Question: I have a huge sparse dataset in a DataFrame and have been using df.to_sparse, but it will be deprecated soon, so I want to switch to pd.Series(pd.SparseArray()). I am not sure how to do that for an entire DataFrame. My final df is 100K rows by 49K columns, so I need an automated way.
Answer 1: You could try something like this:
    dtype = {key: pd.SparseDtype(df.dtypes[key].type, fill_value=df[key].value_counts().idxmax()) for key in df.dtypes.keys()}
    df = df.astype(dtype)
And then check the density with
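A small, self-contained sketch of the per-column conversion described in the answer, on toy data. It assumes the most frequent value of each column is a sensible `fill_value`. The density check through the `DataFrame.sparse` accessor is one way to inspect the result and is not necessarily what the truncated answer went on to show.

```python
import pandas as pd

# Toy frame standing in for the large sparse dataset (illustrative data).
df = pd.DataFrame({"a": [0, 0, 1, 0], "b": [0.0, 0.0, 0.0, 2.5]})

# One SparseDtype per column; the most frequent value becomes the fill value.
sparse_dtypes = {
    col: pd.SparseDtype(df[col].dtype, fill_value=df[col].value_counts().idxmax())
    for col in df.columns
}
sparse_df = df.astype(sparse_dtypes)

# Ratio of explicitly stored values to all values.
density = sparse_df.sparse.density
print(sparse_df.dtypes)
print(density)
```

Note that `idxmax()` returns the most frequent value itself, whereas `argmax()` in current pandas returns its position.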

Adding spaces between strings after sum()

此生再无相见时 posted on 2021-02-10 18:26:08
Question: Assuming that I have the following pandas DataFrame:
    >>> data = pd.DataFrame({'X': ['a', 'b'], 'Y': ['c', 'd'], 'Z': ['e', 'f']})
       X  Y  Z
    0  a  c  e
    1  b  d  f
The desired output is:
    0    a c e
    1    b d f
When I run the following code, I get:
    >>> data.sum(axis=1)
    0    ace
    1    bdf
So how do I add columns of strings with a space between them?
Answer 1: Use apply per row with axis=1 and join:
    a = data.apply(' '.join, axis=1)
    print(a)
    0    a c e
    1    b d f
    dtype: object
Another solution adds spaces, sums, and finally applies str.rstrip:
    a =
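For reference, a runnable version of the row-wise join from the answer, alongside `Series.str.cat` as an alternative. The `str.cat` variant is an addition here and is not part of the quoted answer.

```python
import pandas as pd

data = pd.DataFrame({"X": ["a", "b"], "Y": ["c", "d"], "Z": ["e", "f"]})

# Row-wise join: each row is passed to ' '.join as a sequence of strings.
joined = data.apply(" ".join, axis=1)

# Alternative: concatenate the columns element-wise with a separator.
joined_cat = data["X"].str.cat([data["Y"], data["Z"]], sep=" ")

print(joined.tolist())
print(joined_cat.tolist())
```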

Use keywords from dataframe to detect if any present in another dataframe or string

佐手、 posted on 2021-02-10 18:22:46
Question: I have two problems. The first is this: I have one DataFrame with categories and keywords like this:
      Category  Keywords
    0 Fruit     ['apple', 'pear', 'plum', 'grape']
    1 Color     ['red', 'purple', 'green']
Another DataFrame like this:
      Summary
    0 This is a basket of red apples. They are sour.
    1 We found a bushel of fruit. They are red.
    2 There is a peck of pears that taste sweet.
    3 We have a box of plums.
I want the end result to look like this:
      Category      Summary
    0 Fruit, Color  This is a basket of red apples. They are
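One possible approach to the first problem, sketched under the assumption that a case-insensitive substring match is acceptable (so "apples" matches the keyword "apple"). The helper `matching_categories` is a hypothetical name introduced for illustration.

```python
import pandas as pd

categories = pd.DataFrame({
    "Category": ["Fruit", "Color"],
    "Keywords": [["apple", "pear", "plum", "grape"], ["red", "purple", "green"]],
})
summaries = pd.DataFrame({
    "Summary": [
        "This is a basket of red apples. They are sour.",
        "We found a bushel of fruit. They are red.",
        "There is a peck of pears that taste sweet.",
        "We have a box of plums.",
    ]
})

def matching_categories(text):
    """Return a comma-separated list of categories with a keyword in `text`."""
    lowered = text.lower()
    hits = [
        category
        for category, keywords in zip(categories["Category"], categories["Keywords"])
        if any(keyword in lowered for keyword in keywords)
    ]
    return ", ".join(hits)

summaries["Category"] = summaries["Summary"].apply(matching_categories)
print(summaries[["Category", "Summary"]])
```

Substring matching is deliberately loose: a keyword such as "red" would also match inside a longer word like "bored". Matching on word boundaries with a regular expression would be stricter, at the cost of no longer matching plurals unless they are handled explicitly.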

How do I efficiently apply pandas.Timestamp functions to a full dataframe/column?

一个人想着一个人 posted on 2021-02-10 18:22:03
Question: Pandas is a great tool for a number of data tasks. Many functions have been streamlined so that they can be applied efficiently to whole columns rather than to individual cells or rows. One such function is to_datetime(), which I use as an example later in this question. However, there are a number of commands in pandas that, as best I can tell from the documentation, do not apply directly to DataFrames. The specific function I am interested in is pandas.Timestamp.isocalendar(), but there
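For the specific case of `isocalendar()`, pandas 1.1 and later expose a column-wise version through the `.dt` accessor, which returns a DataFrame with `year`, `week`, and `day` columns. A minimal sketch, with the column name `when` chosen purely for illustration:

```python
import pandas as pd

df = pd.DataFrame({"when": pd.to_datetime(["2021-01-01", "2021-02-10"])})

# Vectorized ISO calendar lookup for the whole column (pandas >= 1.1).
iso = df["when"].dt.isocalendar()

df["iso_year"] = iso["year"]
df["iso_week"] = iso["week"]
df["iso_day"] = iso["day"]
print(df)
```

Other `Timestamp` attributes and methods often have a counterpart under `Series.dt`, so checking the accessor first is usually cheaper than calling `apply` once per row.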

reshape a pandas dataframe with multiple columns

坚强是说给别人听的谎言 posted on 2021-02-10 18:21:45
Question: I have an issue reshaping a pandas DataFrame. It looks like this (the number of rows and columns can vary):
    columns  col1        col2        col3        col4
    Species
    sp1      218.000000  521.000000  533.000000  793.000000
    sp1      0.105569    0.252300    0.258111    0.384019
    sp1      2           2           2           3
    sp2      225.000000  521.000000  540.000000  800.000000
    sp2      0.107862    0.249760    0.258869    0.383509
    sp2      2           2           2           3
    sp3      217.000000  477.000000  512.000000  725.000000
    sp3      0.112377    0.247022    0.265148    0.375453
    sp3      1           1           3           3
The Species column is my index. I want to

How do I efficiently apply pandas.Timestamp functions to a full dataframe/column?

久未见 posted on 2021-02-10 18:20:53
Question: Pandas is a great tool for a number of data tasks. Many functions have been streamlined so that they can be applied efficiently to whole columns rather than to individual cells or rows. One such function is to_datetime(), which I use as an example later in this question. However, there are a number of commands in pandas that, as best I can tell from the documentation, do not apply directly to DataFrames. The specific function I am interested in is pandas.Timestamp.isocalendar(), but there

Excel Datetime SN Conversion in Python

Deadly posted on 2021-02-10 18:20:37
Question: My CSV input file sometimes has Excel serial numbers in the date field. I am using the following code, since my input file should never contain dates prior to 01/2000. However, this solution is quite time-consuming, and I am hoping to find a better one. Thank you.
    from datetime import datetime
    import pandas as pd

    def DateCorrection(x):
        # Values that parse to a date before 2000 are treated as Excel serial numbers.
        if pd.to_datetime(x) < pd.to_datetime('2000-01-01'):
            return pd.to_datetime(datetime.fromordinal(datetime(1900, 1, 1).toordinal() + int(x) - 2))
        else:
            return pd.to_datetime(x)
Answer 1: Assuming your input looks
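A vectorized sketch of the same correction, assuming the column arrives as strings in which Excel serial numbers are purely numeric and real dates are ISO-formatted text. These input assumptions and the variable names are illustrative and do not come from the truncated answer. The origin `1899-12-30` matches the question's own arithmetic, since 1900-01-01 minus 2 days is 1899-12-30.

```python
import pandas as pd

# Illustrative input: two text dates and two Excel serial numbers.
raw = pd.Series(["2020-03-15", "43905", "2019-12-31", "44197"])

# Numeric entries are taken to be Excel serials; everything else becomes NaN.
serial = pd.to_numeric(raw, errors="coerce")

# Convert serials as day counts from Excel's effective origin.
from_serial = pd.to_datetime(serial, unit="D", origin="1899-12-30")

# Parse the remaining entries as ordinary date strings.
from_text = pd.to_datetime(raw.where(serial.isna()), errors="coerce")

# Prefer the serial interpretation where present, otherwise the parsed text.
dates = from_serial.fillna(from_text)
print(dates)
```

Because every step operates on the whole column, this avoids calling `pd.to_datetime` once per cell, which is the likely source of the slowness in the original function.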

Use keywords from dataframe to detect if any present in another dataframe or string

末鹿安然 posted on 2021-02-10 18:19:19
Question: I have two problems. The first is this: I have one DataFrame with categories and keywords like this:
      Category  Keywords
    0 Fruit     ['apple', 'pear', 'plum', 'grape']
    1 Color     ['red', 'purple', 'green']
Another DataFrame like this:
      Summary
    0 This is a basket of red apples. They are sour.
    1 We found a bushel of fruit. They are red.
    2 There is a peck of pears that taste sweet.
    3 We have a box of plums.
I want the end result to look like this:
      Category      Summary
    0 Fruit, Color  This is a basket of red apples. They are

How to efficiently load mixed-type pandas DataFrame into an Oracle DB

删除回忆录丶 posted on 2021-02-10 18:18:18
Question: Happy new year everyone! I'm currently struggling with ETL performance issues as I try to write larger pandas DataFrames (1-2 million rows, 150 columns) into an Oracle database. Even for just 1000 rows, pandas' default to_sql() method runs well over 2 minutes (see code snippet below). My strong hypothesis is that these performance issues are in some way related to the underlying data types (mostly strings). I ran the same job on 1000 rows of random strings (benchmark: 3 min) and 1000 rows
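A commonly reported cause of this pattern is that `to_sql()` maps string columns to a generic text type, which the Oracle dialect creates as CLOB; whether that applies to this exact setup is an assumption. If it does, passing explicit bounded string types through the `dtype` argument of `to_sql()` is a usual remedy. The sketch below only computes the column lengths needed for that; `max_string_lengths` is a hypothetical helper, and the commented-out call assumes SQLAlchemy and an existing `engine`.

```python
import pandas as pd

def max_string_lengths(frame):
    """Return {column: longest string length} for string-like columns."""
    lengths = {}
    for col in frame.columns:
        dtype = frame[col].dtype
        # Cover both classic object columns and pandas' dedicated string dtype.
        if pd.api.types.is_object_dtype(dtype) or isinstance(dtype, pd.StringDtype):
            longest = frame[col].dropna().astype(str).str.len().max()
            lengths[col] = int(longest) if pd.notna(longest) else 1
    return lengths

# Illustrative data, not taken from the question.
sample = pd.DataFrame({"name": ["alpha", "be", None], "qty": [1, 2, 3]})
lengths = max_string_lengths(sample)
print(lengths)

# Assumed usage with SQLAlchemy (not executed here):
# from sqlalchemy import types
# dtype_map = {col: types.VARCHAR(n) for col, n in lengths.items()}
# sample.to_sql("target_table", engine, dtype=dtype_map, if_exists="append", index=False)
```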