dataframe

Using dplyr to create new dataframe depending on thresholds

六月ゝ 毕业季﹏ submitted on 2021-02-19 08:57:11
Question:

   Groups Names COL1  COL2  COL3  COL4
1  G1     SP1   1     0.400 0.500 Sequence1
2  G1     SP1   1     0.004 0.005 Sequence2
3  G1     SP1   0     0.004 0.005 Sequence3
4  G1     SP2   0     0.400 0.005 Sequence123
5  G1     SP2   0     0.004 0.500 Sequence14
6  G1     SP3   0     0.005 0.006 Sequence15
7  G1     SP5   1     0.400 0.006 Sequence16
8  G1     SP6   1     0.008 0.002 Sequence20
10 G2     Sp1   0     0.004 0.005 Sequence17
11 G2     SP1   0     0.050 0.600 Sequence18
12 G2     SP1   0     0.400 0.600 Sequence3
13 G2     SP2   0     0.004 0.005 Sequence22
14 G2     SP2   0     0.004 0.005 Sequence23
15 G2     SP5   0     0.004 0

When using cut in a pandas dataframe to bin it, why is the binning not properly done?

为君一笑 submitted on 2021-02-19 07:40:29
Question: I have a dataframe that I want to bin (i.e., group into sub-ranges) by one column, and take the mean of the second column for each of the bins:

import pandas as pd
import numpy as np

data = pd.DataFrame(columns=['Score', 'Age'])
data.Score = [1, 1, 1, 1, 0, 1, 2, 1, 0, 1, 1, 0, 2, 1, 1, 2, 1, 0, 1, 1, -1, 1, 0, 1, 1, 0, 1, 0, -2, 1]
data.Age = [29, 59, 44, 52, 60, 53, 45, 47, 57, 54, 35, 32, 48, 31, 49, 43, 67, 32, 31, 42, 37, 45, 52, 59, 56, 57, 48, 45, 56, 31]

_, bins = np.histogram(data
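The snippet above is cut off at the np.histogram call, so the exact failure is unknown. A minimal sketch of the usual pattern (binning Age into histogram edges with pd.cut and averaging Score per bin; the bins=5 choice and the include_lowest flag are assumptions here, not taken from the question):

```python
import pandas as pd
import numpy as np

data = pd.DataFrame({
    'Score': [1, 1, 1, 1, 0, 1, 2, 1, 0, 1, 1, 0, 2, 1, 1,
              2, 1, 0, 1, 1, -1, 1, 0, 1, 1, 0, 1, 0, -2, 1],
    'Age':   [29, 59, 44, 52, 60, 53, 45, 47, 57, 54, 35, 32, 48, 31, 49,
              43, 67, 32, 31, 42, 37, 45, 52, 59, 56, 57, 48, 45, 56, 31],
})

# Derive bin edges from the Age distribution, then label each row with its bin.
# include_lowest=True keeps the minimum Age inside the first interval, which is
# a common reason rows silently drop out of pd.cut results.
_, bin_edges = np.histogram(data['Age'], bins=5)
data['age_bin'] = pd.cut(data['Age'], bins=bin_edges, include_lowest=True)

# Mean Score per Age bin.
print(data.groupby('age_bin', observed=True)['Score'].mean())
```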

Create distribution in Pandas

瘦欲@ submitted on 2021-02-19 07:34:38
Question: I want to generate a random/simulated data set with a specific distribution. As an example, the distribution has the following properties: a population of 1000; a gender mix of male 49%, female 50%, other 1%; and an age distribution of 0-30 (30%), 31-60 (40%), 61-100 (30%). The resulting data frame would have 1000 rows and two columns called gender and age (with the above value distributions). Is there a way to do this in Pandas or another library?

Answer 1: You may try:

N = 1000
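The answer is truncated after N = 1000. A sketch of one way to finish the idea with numpy's random generator (sampling an age band first and then a uniform age inside that band is an assumption; the original answer may proceed differently):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
N = 1000

# Sample gender with the stated probabilities.
gender = rng.choice(['male', 'female', 'other'], size=N, p=[0.49, 0.50, 0.01])

# Sample an age band with the stated probabilities, then draw a uniform
# integer age inside that band (the within-band distribution is assumed).
bands = [(0, 30), (31, 60), (61, 100)]
band_idx = rng.choice(len(bands), size=N, p=[0.30, 0.40, 0.30])
age = np.array([rng.integers(lo, hi + 1) for lo, hi in (bands[i] for i in band_idx)])

df = pd.DataFrame({'gender': gender, 'age': age})
print(df['gender'].value_counts(normalize=True))
```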

Pyspark Dataframe get unique elements from column with string as list of elements

我的未来我决定 submitted on 2021-02-19 07:34:05
Question: I have a dataframe (which is created by loading from multiple blobs in Azure) where I have a column which is a list of IDs. Now, I want a list of unique IDs from this entire column. Here is an example:

df -
| col1 | col2 | col3  |
| "a"  | "b"  |"[q,r]"|
| "c"  | "f"  |"[s,r]"|

Here is my expected response: resp = [q, r, s]

Any idea how to get there? My current approach is to convert the strings in col3 to python lists and then maybe flatten them out somehow. But so far I am not able to do so. I
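A sketch of one PySpark approach, assuming a Spark session is available and col3 really is a plain string such as "[q,r]" (the exact schema loaded from the Azure blobs is unknown):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("a", "b", "[q,r]"), ("c", "f", "[s,r]")],
    ["col1", "col2", "col3"],
)

# Strip the surrounding brackets, split on commas into an array column,
# explode to one ID per row, then keep only the distinct IDs.
ids = (
    df.select(F.explode(F.split(F.regexp_replace("col3", r"[\[\]]", ""), ",")).alias("id"))
      .distinct()
)
resp = [row["id"] for row in ids.collect()]
print(resp)  # e.g. ['q', 'r', 's'] in some order
```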

How to find the count of consecutive same string values in a pandas dataframe?

别说谁变了你拦得住时间么 submitted on 2021-02-19 07:20:26
Question: Assume that we have the following pandas dataframe:

df = pd.DataFrame({'col1': ['A>G','C>T','C>T','G>T','C>T','A>G','A>G','A>G'],
                   'col2': ['TCT','ACA','TCA','TCA','GCT','ACT','CTG','ATG'],
                   'start': [1000,2000,3000,4000,5000,6000,10000,20000]})

input:

    col1 col2  start
0    A>G  TCT   1000
1    C>T  ACA   2000
2    C>T  TCA   3000
3    G>T  TCA   4000
4    C>T  GCT   5000
5    A>G  ACT   6000
6    A>G  CTG  10000
7    A>G  ATG  20000
8    C>A  TCT  10000
9    C>T  ACA   2000
10   C>T  TCA   3000
11   C>T  TCA   4000

What I want to get is the number of consecutive
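The question text stops mid-sentence, so the exact output the asker wants is unknown. A common sketch for counting runs of consecutive identical values in a column (here col1, using the shift/cumsum trick) looks like this:

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['A>G', 'C>T', 'C>T', 'G>T', 'C>T', 'A>G', 'A>G', 'A>G'],
    'col2': ['TCT', 'ACA', 'TCA', 'TCA', 'GCT', 'ACT', 'CTG', 'ATG'],
    'start': [1000, 2000, 3000, 4000, 5000, 6000, 10000, 20000],
})

# A new run starts whenever col1 differs from the previous row; the cumulative
# sum of those change points gives every run its own id.
run_id = (df['col1'] != df['col1'].shift()).cumsum()

# Length of the run each row belongs to, plus one summary row per run.
df['run_length'] = df.groupby(run_id)['col1'].transform('size')
runs = df.groupby(run_id).agg(value=('col1', 'first'), count=('col1', 'size'))
print(runs)
```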

Pandas dataframe to dict, while keeping duplicate rows

一笑奈何 submitted on 2021-02-19 07:02:47
Question: I have a dataframe that looks like this:

  kenteken status code
0 XYZ      A      123
1 XYZ      B      456
2 ABC      C      789

And I want to convert it to a dictionary in a dictionary like this:

{'XYZ': {'code': '123', 'status': 'A'}, {'code': '456', 'status': 'B'}, 'ABC': {'code': '789', 'status': 'C'}}

The closest I've been able to come was the following:

df.groupby('kenteken')['status', 'code'].apply(lambda x: x.to_dict()).to_dict()

Which yields:

{'ABC': {'status': {2: 'C'}, 'code': {2: '789'}}, 'XYZ': {'status': {0: 'A', 1: 'B
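The desired output shown above cannot exist literally (a dict cannot repeat the 'XYZ' key), so one hedged reading is that the asker would accept a list of row dicts per kenteken. A sketch of that interpretation using to_dict('records'):

```python
import pandas as pd

df = pd.DataFrame({
    'kenteken': ['XYZ', 'XYZ', 'ABC'],
    'status':   ['A', 'B', 'C'],
    'code':     ['123', '456', '789'],
})

# groupby + to_dict('records') keeps every row: each kenteken maps to a list
# with one dict per original row instead of collapsing duplicates.
result = {
    key: grp[['status', 'code']].to_dict('records')
    for key, grp in df.groupby('kenteken')
}
print(result)
# {'ABC': [{'status': 'C', 'code': '789'}],
#  'XYZ': [{'status': 'A', 'code': '123'}, {'status': 'B', 'code': '456'}]}
```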

Finding rows with same column values in pandas dataframe

帅比萌擦擦* submitted on 2021-02-19 06:37:07
Question: I have two dataframes with different numbers of columns, where four columns can have the same values in both dataframes. I want to make a new column in df1 that takes the value 1 if there is a row in df2 with the same values for columns 'A', 'B', 'C', and 'D' as a row in df1. If there isn't such a row, I want the value to be 0. Columns 'E' and 'F' are not important for checking the values. Is there a pandas function that can do this, or do I have to do this in a loop? For example:

df1 =
   A B C D E F
   1 1
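The example frames are cut off, so the frames below are made-up stand-ins. One common sketch uses a left merge with indicator=True on the four key columns:

```python
import pandas as pd

# Illustrative frames only; the real data in the question is truncated.
df1 = pd.DataFrame({'A': [1, 2], 'B': [1, 3], 'C': [0, 1], 'D': [5, 6],
                    'E': [9, 9], 'F': [8, 8]})
df2 = pd.DataFrame({'A': [1, 4], 'B': [1, 4], 'C': [0, 4], 'D': [5, 4]})

key_cols = ['A', 'B', 'C', 'D']

# A left merge with indicator marks which df1 rows have a matching
# A/B/C/D combination somewhere in df2; deduplicating df2's keys first
# prevents the merge from multiplying rows.
merged = df1.merge(df2[key_cols].drop_duplicates(), on=key_cols,
                   how='left', indicator=True)
df1['match'] = (merged['_merge'] == 'both').to_numpy().astype(int)
print(df1)
```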

Creating percentage stacked bar chart using groupby

六眼飞鱼酱① submitted on 2021-02-19 06:20:08
Question: I'm looking at home ownership within levels of different loan statuses, and I'd like to display this using a stacked bar chart in percentages. I've been able to create a frequency stacked bar chart using this code:

df_trunc1 = df[['loan_status', 'home_ownership', 'id']]
sub_df1 = df_trunc1.groupby(['loan_status', 'home_ownership'])['id'].count()
sub_df1.unstack().plot(kind='bar', stacked=True, rot=1, figsize=(8, 8),
                       title="Home ownership across Loan Types")

which gives me this picture: [frequency stacked bar chart screenshot] but I can't
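The question is cut off here; the visible text says the goal is a percentage version of the same chart. A sketch of the usual normalization step (the tiny df below is a made-up stand-in for the loan data in the question):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in for the asker's loan data; column names match the question.
df = pd.DataFrame({
    'loan_status':    ['Fully Paid', 'Fully Paid', 'Charged Off', 'Charged Off', 'Current'],
    'home_ownership': ['RENT', 'OWN', 'RENT', 'MORTGAGE', 'RENT'],
    'id':             [1, 2, 3, 4, 5],
})

# Count home_ownership within each loan_status, then scale each loan_status
# row to percentages so every stacked bar sums to 100.
counts = (df.groupby(['loan_status', 'home_ownership'])['id']
            .count()
            .unstack(fill_value=0))
percent = counts.div(counts.sum(axis=1), axis=0) * 100

percent.plot(kind='bar', stacked=True, rot=1, figsize=(8, 8),
             title="Home ownership across Loan Types (%)")
plt.ylabel('percent')
plt.show()
```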