series

Pandas Multiindex count on levels

谁都会走 submitted on 2021-02-05 10:59:47
Question: The data:

    index = [('A', 'aa', 'aaa'),
             ('A', 'aa', 'aab'),
             ('B', 'bb', 'bbb'),
             ('B', 'bb', 'bbc'),
             ('C', 'cc', 'ccc')]
    values = [0.07, 0.04, 0.04, 0.06, 0.07]
    s = pd.Series(data=values, index=pd.MultiIndex.from_tuples(index))
    s

    A  aa  aaa    0.07
           aab    0.04
    B  bb  bbb    0.04
           bbc    0.06
    C  cc  ccc    0.07

Getting a mean over the first two levels is easy:

    s.mean(level=[0,1])

Result:

    A  aa    0.055
    B  bb    0.050
    C  cc    0.070

But getting a count over the first two levels does not work the same way:

    #s.count(level=[0,1]) # does not work

I
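One possible way to get both aggregates is to group on the index levels and then reduce. The sketch below reuses the series from the question; the names `level_means` and `level_counts` are illustrative only. Grouping with `groupby(level=...)` also sidesteps the `level=` argument on reductions such as `mean`, which is no longer accepted in recent pandas versions.

```python
import pandas as pd

index = [("A", "aa", "aaa"),
         ("A", "aa", "aab"),
         ("B", "bb", "bbb"),
         ("B", "bb", "bbc"),
         ("C", "cc", "ccc")]
values = [0.07, 0.04, 0.04, 0.06, 0.07]
s = pd.Series(data=values, index=pd.MultiIndex.from_tuples(index))

# Group on the first two index levels, then apply any reduction.
grouped = s.groupby(level=[0, 1])
level_means = grouped.mean()    # same result the question gets from s.mean(level=[0, 1])
level_counts = grouped.count()  # number of non-null values per (level 0, level 1) pair

print(level_counts)
```

Because the grouping is done once, other reductions (`sum`, `size`, `agg`) can be applied to the same `grouped` object.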

How to delete text before a specific character - Python (Pandas)

人盡茶涼 submitted on 2021-02-05 07:43:14
Question: I have a column in a larger dataset that looks like:

    Name
    ----
    Mr. John Doe
    Jack Daw
    Prof. Charles Winchester
    Jane Shaw
    ... etc.

(Names anonymized.) Basically, it's a list of names that have prefixes mixed in. All prefixes end with a dot. So far, the prefixes have been limited to: Mr., Mrs., Ms., Dr., and Prof. The output I would like is:

    Name
    ----
    John Doe
    Jack Daw
    Charles Winchester
    Jane Shaw
    ... etc.

Ideally, I would like a solution that relies on the position of the dot instead of having to
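A minimal sketch of a dot-based approach, assuming the column is called `Name` as in the question and that the only dot in a name belongs to the prefix. The regular expression removes everything up to and including the first dot, plus any whitespace after it; names without a dot are left untouched.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Mr. John Doe", "Jack Daw",
                            "Prof. Charles Winchester", "Jane Shaw"]})

# ^[^.]*\.  matches from the start of the string through the first dot;
# \s*       also consumes the space that follows the prefix.
df["Name"] = df["Name"].str.replace(r"^[^.]*\.\s*", "", regex=True)

print(df)
```

Note that a name containing a later dot (for example a middle initial) would also be cut at its first dot, so this only fits data where the stated assumption holds.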

Numpy Where with more than 2 conditions

可紊 submitted on 2021-02-05 05:52:07
Question: Good morning, I have a dataframe with two columns of integers and a Series (diff) computed as:

    diff = (df["col_1"] - df["col_2"]) / (df["col_2"])

I would like to create a column of the dataframe whose values are:

    equal to 0, if (diff >= 0) & (diff <= 0.35)
    equal to 1, if (diff > 0.35)
    equal to 2, if (diff < 0) & (diff >= - 0.35)
    equal to 3, if (diff < - 0.35)

I tried with:

    df["Class"] = np.where( (diff >= 0) & (diff <= 0.35), 0,
                  np.where( (diff > 0.35), 1,
                  np.where( (diff < 0) &
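For more than two conditions, `np.select` is often easier to read than nested `np.where` calls. The sketch below applies the four rules from the question; the sample values in `df` and the `default=-1` fallback (used for rows where no rule matches, such as a NaN `diff`) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data only; the question does not show the actual values.
df = pd.DataFrame({"col_1": [10, 15, 9, 5, 10],
                   "col_2": [10, 10, 10, 10, 8]})
diff = (df["col_1"] - df["col_2"]) / df["col_2"]

# One boolean mask per class, in the same order as the class labels.
conditions = [
    (diff >= 0) & (diff <= 0.35),
    diff > 0.35,
    (diff < 0) & (diff >= -0.35),
    diff < -0.35,
]
choices = [0, 1, 2, 3]

# Rows matching none of the masks get the default value.
df["Class"] = np.select(conditions, choices, default=-1)

print(df)
```

`np.select` picks the first matching condition for each row, so the order of `conditions` matters only if masks overlap, which they do not here.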

Adding new column to pandas df based on condition

邮差的信 submitted on 2021-01-29 14:41:55
Question: I have the following dataset:

    ID  Asset  Boolean
    1   "A"    True
    1   "B"    False
    1   "B"    False
    2   "A"    True
    3   "A"    True
    3   "A"    True
    3   "B"    False
    3   "B"    False
    4   "A"    True
    4   "A"    True
    5   "A"    True
    5   "B"    False

I want to add another column, which should evaluate to True only if all values in the column Boolean evaluate to True for the same ID. So something like this:

    ID  Asset  Boolean  Check
    1   "A"    True     False
    1   "B"    False    False
    1   "B"    False    False
    2   "A"    True     True
    3   "A"    True     False
    3   "A"    True     False
    3   "B"    False    False
    3   "B"
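One way to express "all values are True within the same ID" is a grouped `transform`, which broadcasts the per-group result back onto every row. This is a sketch using the data shown in the question; the column name `Check` follows the desired output.

```python
import pandas as pd

df = pd.DataFrame({
    "ID":      [1, 1, 1, 2, 3, 3, 3, 3, 4, 4, 5, 5],
    "Asset":   ["A", "B", "B", "A", "A", "A", "B", "B", "A", "A", "A", "B"],
    "Boolean": [True, False, False, True, True, True, False, False,
                True, True, True, False],
})

# all() is evaluated once per ID; transform repeats that result on each row of the group.
df["Check"] = df.groupby("ID")["Boolean"].transform("all")

print(df)
```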

Keep elements with pattern in pandas series without converting them to list

让人想犯罪 __ submitted on 2021-01-28 06:25:35
Question: I have the following dataframe:

    df = pd.DataFrame(["Air type:1, Space kind:2, water",
                       "something, Space blu:3, somethingelse"],
                      columns = ['A'])

and I want to create a new column that contains, for each row, all the elements that have a ":" in them. So, for example, in the first row I want to return "type:1, kind:2" and for the second row I want "blu:3". I managed this by using a list comprehension in the following way:

    df['new'] = [[y for y in x if ":" in y] for x in df['A'].str.split(",")]

But my
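A possible alternative that stays within the `.str` accessor: extract every token that contains a colon with `str.findall`, then join the matches back into one string per row. The pattern is an assumption about what counts as an "element" (a run of characters without spaces or commas on both sides of the colon), chosen so that it reproduces the outputs stated in the question.

```python
import pandas as pd

df = pd.DataFrame(["Air type:1, Space kind:2, water",
                   "something, Space blu:3, somethingelse"],
                  columns=["A"])

# [^\s,]+ matches a run of characters that are neither whitespace nor commas,
# so each match is a single "key:value" token such as "type:1".
tokens = df["A"].str.findall(r"[^\s,]+:[^\s,]+")

# Join the per-row matches so the new column holds strings, not lists.
df["new"] = tokens.str.join(", ")

print(df["new"].tolist())
```

`findall` still produces a list per row as an intermediate step, but the final column contains plain strings.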

Counting the amount of times a boolean goes from True to False in a column

微笑、不失礼 submitted on 2021-01-27 04:53:08
Question: I have a column in a dataframe which is filled with booleans, and I want to count how many times it changes from True to False. I can do this when I convert the booleans to 1's and 0's, then use df.diff, and then divide that answer by 2:

    import pandas as pd

    d = {'Col1': [True, True, True, False, False, False, True, True, True,
                  True, False, False, False, True, True, False, False, True, ]}
    df = pd.DataFrame(data=d)
    print(df)

    0   True
    1   True
    2   True
    3   False
    4   False
    5   False
    6   True
    7   True
    8   True
    9   True
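A sketch of a direct count that avoids the integer conversion: compare each value with the previous one and keep only the positions where the previous value is True and the current one is False. The variable names are illustrative.

```python
import pandas as pd

d = {"Col1": [True, True, True, False, False, False, True, True, True,
              True, False, False, False, True, True, False, False, True]}
df = pd.DataFrame(data=d)

col = df["Col1"]
# previous[i] is col[i - 1]; the first row has no predecessor, so fill it with False.
previous = col.shift(fill_value=False)

# A True -> False change is a row that is False while the row before it was True.
true_to_false = int((previous & ~col).sum())

print(true_to_false)
```

Swapping the mask to `~previous & col` would count False to True changes instead (with the caveat that a leading True row would then count as a change because of the fill value).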

Hausdorff distance for large dataset in a fastest way

前提是你 submitted on 2021-01-22 06:48:11
Question: The number of rows in my dataset is 500000+. I need the Hausdorff distance between every id and all the others, repeated for the whole dataset. I have a huge dataset. Here is a small part:

    df =
       id_easy  ordinal  latitude  longitude  epoch           day_of_week
    0  aaa      1.0      22.0701   2.6685     01-01-11 07:45  Friday
    1  aaa      2.0      22.0716   2.6695     01-01-11 07:45  Friday
    2  aaa      3.0      22.0722   2.6696     01-01-11 07:46  Friday
    3  bbb      1.0      22.1166   2.6898     01-01-11 07:58  Friday
    4  bbb      2.0      22.1162   2.6951     01-01-11 07:59  Friday
    5  ccc      1.0      22
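As a starting point, the symmetric Hausdorff distance between two point sets can be written with NumPy broadcasting, and the per-id point sets can be collected with `groupby`. This is only a sketch under stated assumptions: it uses the complete rows visible in the excerpt, treats latitude and longitude as planar coordinates (plain Euclidean distance), and the helper name `hausdorff` is hypothetical. It makes no claim to be the fastest option for 500000+ rows.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Only the complete rows from the excerpt are used here.
df = pd.DataFrame({
    "id_easy":   ["aaa", "aaa", "aaa", "bbb", "bbb"],
    "latitude":  [22.0701, 22.0716, 22.0722, 22.1166, 22.1162],
    "longitude": [2.6685, 2.6695, 2.6696, 2.6898, 2.6951],
})


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two (n, 2) arrays of points."""
    # d[i, j] is the Euclidean distance between a[i] and b[j].
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # Directed distances: the farthest that any point is from its nearest neighbour.
    return max(d.min(axis=1).max(), d.min(axis=0).max())


# One coordinate array per id.
tracks = {key: group[["latitude", "longitude"]].to_numpy()
          for key, group in df.groupby("id_easy")}

# Distance for every unordered pair of ids.
distances = {(i, j): hausdorff(tracks[i], tracks[j])
             for i, j in combinations(tracks, 2)}

print(distances)
```

The pairwise loop grows quadratically with the number of ids, and the full distance matrix `d` is held in memory for each pair, so a large dataset would likely need spatial pre-filtering or chunking on top of this.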

TypeError: string indices must be integers using pandas apply with lambda

北城以北 submitted on 2021-01-19 05:02:30
Question: I have a dataframe; one column is a URL, the other is a name. I'm simply trying to add a third column that takes the URL and creates an HTML link. The column newsSource has the link name, and url has the URL. For each row in the dataframe, I want to create a column that has:

    <a href="[the url]">[newsSource name]</a>

Trying the below throws the error:

    File "C:\Users\AwesomeMan\Documents\Python\MISC\News Alerts\simple_news.py", line 254, in
    df['sourceURL'] = df['url'].apply(lambda x: '{1}'
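The code in the question is cut off, so the exact cause cannot be confirmed, but this error commonly appears when `apply` is called on a single column: the lambda then receives each URL as a plain string, and indexing that string with a column name fails. A sketch of two ways around it, using made-up sample rows and a hypothetical second column name `sourceURL_vec`:

```python
import pandas as pd

# Sample rows are invented for illustration.
df = pd.DataFrame({
    "newsSource": ["Example News", "Sample Times"],
    "url": ["https://example.com/a", "https://example.org/b"],
})

# Option 1: apply over rows (axis=1), so the lambda receives a whole row
# and can look up both columns by name.
df["sourceURL"] = df.apply(
    lambda row: '<a href="{0}">{1}</a>'.format(row["url"], row["newsSource"]),
    axis=1,
)

# Option 2: vectorised string concatenation, with no apply at all.
df["sourceURL_vec"] = '<a href="' + df["url"] + '">' + df["newsSource"] + "</a>"

print(df["sourceURL"].tolist())
```

Both options assume the two columns hold strings; missing values would need to be filled or dropped first.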

How to plot a bar graph from a pandas series?

冷暖自知 submitted on 2021-01-17 17:37:17
Question: Consider my series as below. The first column is article_id and the second column is the frequency count.

    article_id
    1         39
    2         49
    3        187
    4        159
    5        158
            ...
    16947     14
    16948      7
    16976      2
    16977      1
    16978      1
    16980      1
    Name: article_id, dtype: int64

I got this series from a dataframe with the following command:

    logs.loc[logs['article_id'] <= 17029].groupby('article_id')['article_id'].count()

logs is the dataframe here and article_id is one of the columns in it. How do I plot a bar chart (using Matplotlib) such that the
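The question is cut off before it states the exact layout wanted, so the sketch below assumes the common case: article_id on the x-axis and the frequency count on the y-axis. The small `logs` frame is invented so the example is self-contained, and the `Agg` backend is selected only so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook or GUI session
import matplotlib.pyplot as plt
import pandas as pd

# Invented stand-in for the real logs dataframe.
logs = pd.DataFrame({"article_id": [1, 1, 1, 2, 2, 3, 3, 3, 3, 5]})

# Same command as in the question: one count per article_id.
counts = (logs.loc[logs["article_id"] <= 17029]
              .groupby("article_id")["article_id"]
              .count())

fig, ax = plt.subplots(figsize=(8, 4))
# The series index supplies the x positions, the values supply the bar heights.
ax.bar(counts.index, counts.values)
ax.set_xlabel("article_id")
ax.set_ylabel("frequency")
fig.tight_layout()
```

With many thousands of distinct ids the individual bars become very thin, so it may help to plot only the most frequent ids or to bin the ids first.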

How to plot a bar graph from a pandas series?

允我心安 submitted on 2021-01-17 17:34:14
Question: Consider my series as below. The first column is article_id and the second column is the frequency count.

    article_id
    1         39
    2         49
    3        187
    4        159
    5        158
            ...
    16947     14
    16948      7
    16976      2
    16977      1
    16978      1
    16980      1
    Name: article_id, dtype: int64

I got this series from a dataframe with the following command:

    logs.loc[logs['article_id'] <= 17029].groupby('article_id')['article_id'].count()

logs is the dataframe here and article_id is one of the columns in it. How do I plot a bar chart (using Matplotlib) such that the