pandas

Cosine Similarity rows in a dataframe of pandas

て烟熏妆下的殇ゞ submitted on 2021-02-10 06:45:09
Question: I have a CSV file with the content shown below, and I want to calculate the cosine similarity between one ID and the remaining IDs in the file. I have loaded it into a pandas DataFrame as follows:
    old_df['Vector'] = old_df.apply(lambda row: np.array(np.matrix(row.Vector)).ravel(), axis=1)
    l = []
    for a in old_df['Vector']:
        l.append(a)
    A = np.array(l)
    similarities = cosine_similarity(A)
The output looks fine. However, I do not know how to find which GUID (or ID) is similar to which other GUID (or ID), and I
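One way to tie the similarity matrix back to the IDs is to label its rows and columns with them. The following is only a minimal sketch: the `ID` column name and the vectors are invented for illustration, and the cosine similarity is computed directly with NumPy rather than with the `cosine_similarity` function used in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the "ID" values and vectors are made up for illustration.
df = pd.DataFrame({
    "ID": ["a", "b", "c"],
    "Vector": [np.array([1.0, 0.0]), np.array([1.0, 1.0]), np.array([0.0, 1.0])],
})

# Stack the per-row vectors into one 2-D array (one row per ID).
A = np.vstack(df["Vector"].tolist())

# Cosine similarity: normalize each row, then take all pairwise dot products.
unit = A / np.linalg.norm(A, axis=1, keepdims=True)
similarities = unit @ unit.T

# Label rows and columns with the IDs so every score can be traced to a pair.
sim_df = pd.DataFrame(similarities, index=df["ID"], columns=df["ID"])

# Hide the diagonal (each ID compared with itself), then pick the best match per row.
off_diagonal = sim_df.where(~np.eye(len(sim_df), dtype=bool))
most_similar = off_diagonal.idxmax(axis=1)
print(most_similar)
```

Here `most_similar` maps each ID to the other ID with the highest score; ties are resolved in favor of the first column.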

pandas read_csv. How to ignore delimiter before line break

拜拜、爱过 submitted on 2021-02-10 06:38:29
Question: I'm reading a file with numerical values.
    data = pd.read_csv('data.dat', sep=' ', header=None)
In the text file, each row ends with a space, so pandas waits for a value that is not there and adds a "nan" at the end of each row. For example:
    2.343 4.234
is read as:
    [2.343, 4.234, nan]
I can avoid it using usecols=[0, 1], but I would prefer a more general solution.
Answer 1: You can use regular expressions in your sep argument. Instead of specifying the separator to be one space, you can ask it to use
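The answer above is cut off before it names a pattern, so the sketch below assumes the whitespace regex `\s+`, which makes `read_csv` treat any run of whitespace as one delimiter. The sample rows are invented and read from an in-memory string instead of `data.dat`.

```python
import io

import pandas as pd

# Hypothetical file content: every row ends with a trailing space.
raw = "2.343 4.234 \n1.500 2.500 \n"

# Assumption: r"\s+" stands in for the pattern the truncated answer goes on to name.
data = pd.read_csv(io.StringIO(raw), sep=r"\s+", header=None)
print(data)
```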

How to use rolling in pandas?

我怕爱的太早我们不能终老 submitted on 2021-02-10 06:38:18
Question: I am working on the code below:
    # Resample, interpolate and inspect ozone data here
    data = data.resample('D').interpolate()
    data.info()
    # Create the rolling window
    ***rolling = data.rolling(360)['Ozone']
    # Insert the rolling quantiles to the monthly returns
    data['q10'] = rolling.quantile(.1)
    data['q50'] = rolling.quantile(.5)
    data['q90'] = rolling.quantile(.9)
    # Plot the data
    data.plot()
    plt.show()
For the starred line (***), I was wondering, can I use the following instead?
    data['Ozone']
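The alternative in the question is cut off, so this sketch assumes it selects the column first and then calls `rolling` on it. On invented data with a small window, both orderings give the same rolling quantile; the `Other` column and the window size of 3 are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical daily data; "Other" is only there to show that column selection matters.
idx = pd.date_range("2021-01-01", periods=10, freq="D")
data = pd.DataFrame(
    {"Ozone": np.arange(10, dtype=float), "Other": np.arange(10, dtype=float) * 2},
    index=idx,
)

window = 3  # small window for illustration; the question uses 360

# Roll over the whole frame, then select the column (as in the starred line).
by_frame = data.rolling(window)["Ozone"].quantile(0.5)

# Assumed alternative: select the column first, then roll.
by_column = data["Ozone"].rolling(window).quantile(0.5)

same = by_frame.equals(by_column)
print(same)
```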

pandas.io.common.CParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file

99封情书 submitted on 2021-02-10 06:36:41
Question: I have more than 50 large CSV files, each over 10 MB in size, with more than 25 columns and more than 50K rows. All of them have the same headers, and I am trying to merge them into one CSV in which the header appears only once.
Option one (works for small CSV files with 25+ columns but sizes in KB):
    import pandas as pd
    import glob
    interesting_files = glob.glob("*.csv")
    df_list = []
    for filename in sorted(interesting_files):
        df_list.append(pd.read
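For the merging goal described above, one option is to stream each file in chunks and append to the output, writing the header only once. This is a sketch under assumptions: `merge_csv_files` is a hypothetical helper, the demo files are invented, and chunked reading limits memory use but does not by itself repair a malformed input file.

```python
import glob
import os
import tempfile

import pandas as pd


def merge_csv_files(pattern, out_path, chunksize=10_000):
    """Hypothetical helper: append all files matching pattern to out_path."""
    first = True
    for filename in sorted(glob.glob(pattern)):
        # Skip the output file in case it matches the pattern itself.
        if os.path.abspath(filename) == os.path.abspath(out_path):
            continue
        for chunk in pd.read_csv(filename, chunksize=chunksize):
            # The first chunk creates the file with a header; later chunks append rows only.
            chunk.to_csv(out_path, mode="w" if first else "a", header=first, index=False)
            first = False


# Small demonstration with invented files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"a": [1, 2], "b": [3, 4]}).to_csv(os.path.join(tmp, "part1.csv"), index=False)
    pd.DataFrame({"a": [5], "b": [6]}).to_csv(os.path.join(tmp, "part2.csv"), index=False)
    out = os.path.join(tmp, "merged.out")
    merge_csv_files(os.path.join(tmp, "part*.csv"), out, chunksize=1)
    merged = pd.read_csv(out)

print(merged)
```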

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc9 in position 388: invalid continuation byte

*爱你&永不变心* submitted on 2021-02-10 06:36:35
Question: I am just beginning with Python, but I have been stuck on this line for hours and can't get anywhere without fixing it.
    cadastro_2019_10 = pd.read_csv("inf_cadastral_fi_20191015.csv", delimiter=";")[["CNPJ_FUNDO", "DENOM_SOCIAL", "CLASSE"]]
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc9 in position 49: invalid continuation byte
    cadastro_2019_10 = pd.read_csv("inf_cadastral_fi_20191015.csv", delimiter=";")[["CNPJ_FUNDO", "DENOM_SOCIAL", "CLASSE"]]
again:
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc9
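An error like this usually means the file is not UTF-8 encoded. The sketch below assumes the file is Latin-1 (ISO-8859-1), in which byte 0xc9 is "É"; that encoding is a guess, not something the question confirms, and the row content is invented.

```python
import io

import pandas as pd

# Hypothetical file content, encoded as Latin-1 so that it contains the byte 0xc9 ("É").
raw = "CNPJ_FUNDO;DENOM_SOCIAL;CLASSE\n1;FUNDO DE AÇÕES É;Fundo de Ações\n".encode("latin-1")

# Assumption: the real file is Latin-1; pass the matching encoding to read_csv.
cadastro = pd.read_csv(io.BytesIO(raw), delimiter=";", encoding="latin-1")[
    ["CNPJ_FUNDO", "DENOM_SOCIAL", "CLASSE"]
]
print(cadastro)
```

If the decoded text shows odd characters, the file probably uses a different encoding, which would have to be determined from its source.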

Copy argument vs Series.Copy()

扶醉桌前 submitted on 2021-02-10 06:32:12
Question:
    y = pd.Series(x, copy=True, dtype=float)
    z = pd.Series(x, copy=True)
    a = pd.Series(x)
    f = pd.Series.copy(x)
All of the above expressions give the same output as x, and even after x is updated the change is not reflected in them. So I would like to know what the copy argument and Series.copy() are for, and also how to copy the Series x to another Series such that any change made to x is reflected in the new Series as well. If anything is wrong or not possible, please forgive me... I'm a
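The part of this that can be shown reliably is that both `copy=True` and `Series.copy()` produce data independent of `x`. The sketch uses invented values. Whether a Series built without copying shares data with `x` depends on the pandas version and its Copy-on-Write setting, so that case is left out.

```python
import pandas as pd

# Hypothetical starting Series.
x = pd.Series([1.0, 2.0, 3.0])

z = pd.Series(x, copy=True)  # explicit copy through the constructor
f = x.copy()                 # deep copy through the method (equivalent to pd.Series.copy(x))

# Modify the original after the copies were taken.
x.iloc[0] = 99.0

print(x.iloc[0], z.iloc[0], f.iloc[0])
```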

How do I create a new column in pandas from the difference of two string columns?

被刻印的时光 ゝ submitted on 2021-02-10 06:32:09
Question: How can I create a new column in pandas that is the result of the difference of two other columns consisting of strings? I have one column titled "Good_Address", which has entries like "123 Fake Street Apt 101", and another column titled "Bad_Address", which has entries like "123 Fake Street". I want the output in column "Address_Difference" to be " Apt101". I've tried doing:
    import pandas as pd
    data = pd.read_csv("AddressFile.csv")
    data['Address Difference'] = data['GOOD_ADR1'].replace(data[
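A row-wise string replacement is one way to get such a difference. This sketch uses the column titles from the question text (`Good_Address`, `Bad_Address`) rather than the names in the truncated code, and the rows are invented. Note that it keeps the inner space, so it yields " Apt 101" rather than " Apt101".

```python
import pandas as pd

# Hypothetical rows; column names follow the question's description.
data = pd.DataFrame({
    "Good_Address": ["123 Fake Street Apt 101", "9 Main Road"],
    "Bad_Address": ["123 Fake Street", "9 Main Road"],
})

# For each row, remove the first occurrence of the bad address from the good one.
data["Address_Difference"] = [
    good.replace(bad, "", 1)
    for good, bad in zip(data["Good_Address"], data["Bad_Address"])
]
print(data["Address_Difference"].tolist())
```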

How to left align a dataframe column in python?

纵饮孤独 submitted on 2021-02-10 06:31:41
Question: I have to left-align a description column in a pandas DataFrame in Python, similar to left- or right-aligning a cell in an Excel sheet. Is there any solution for this? Image attached for reference.
Answer 1: Try this:
    df.style.set_properties(subset=["col1", "col2"], **{'text-align': 'right'})
Answer 2: I think you can just remove the leading spaces.
    df.Description = df.Description.apply(lambda row: row.lstrip(' '))
Source: https://stackoverflow.com/questions/53460941/how-to-left-align-a-dataframe-column
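For plain-text output, an approach not mentioned in the answers above is to pad every string to the same width with `str.ljust`, so the values line up on the left when printed. A minimal sketch with an invented `Description` column:

```python
import pandas as pd

# Hypothetical frame with descriptions of different lengths.
df = pd.DataFrame({
    "Description": ["short", "a much longer description"],
    "Value": [1, 2],
})

# Pad each description on the right up to the longest one.
width = df["Description"].str.len().max()
df["Description"] = df["Description"].str.ljust(width)

# justify="left" left-aligns the column headers in the text output.
text = df.to_string(index=False, justify="left")
print(text)
```

The padding changes the stored strings, so it is best applied to a copy used only for display.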

How do I create a new column in pandas from the difference of two string columns?

一曲冷凌霜 submitted on 2021-02-10 06:31:23
Question: How can I create a new column in pandas that is the result of the difference of two other columns consisting of strings? I have one column titled "Good_Address", which has entries like "123 Fake Street Apt 101", and another column titled "Bad_Address", which has entries like "123 Fake Street". I want the output in column "Address_Difference" to be " Apt101". I've tried doing:
    import pandas as pd
    data = pd.read_csv("AddressFile.csv")
    data['Address Difference'] = data['GOOD_ADR1'].replace(data[

Element-wise mean of a list of pandas DataFrames

感情迁移 submitted on 2021-02-10 06:29:06
Question: Is there a canonical way to compute the element-wise mean of a list of DataFrames with identical columns and indices? The best way I can think of is:
    from functools import reduce
    dfs = [df1, df2, df3, df4, df5]
    reduce(lambda x, y: x.add(y), dfs) / len(dfs)
Answer 1: Use concat with mean per index values:
    df1 = pd.DataFrame({'C': [7, 8, 9], 'D': [1, 3, 5]})
    df2 = pd.DataFrame({'C': [4, 2, 3], 'D': [7, 1, 0]})
    df3 = pd.DataFrame({'C': [9, 4, 2], 'D': [1, 7, 1]})
    from functools import reduce
    dfs = [df1, df2,
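The answer is cut off, so the following is a sketch of what "concat with mean per index values" could look like, assuming it means grouping the concatenated frames by index label. It reuses the three sample frames from the answer and compares the result with the `reduce` approach from the question.

```python
from functools import reduce

import pandas as pd

df1 = pd.DataFrame({"C": [7, 8, 9], "D": [1, 3, 5]})
df2 = pd.DataFrame({"C": [4, 2, 3], "D": [7, 1, 0]})
df3 = pd.DataFrame({"C": [9, 4, 2], "D": [1, 7, 1]})
dfs = [df1, df2, df3]

# Stack the frames vertically, then average all rows that share an index label.
mean_concat = pd.concat(dfs).groupby(level=0).mean()

# The reduce-based approach from the question, for comparison.
mean_reduce = reduce(lambda x, y: x.add(y), dfs) / len(dfs)

print(mean_concat)
```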