pandas

pandas:calculate jaccard similarity for every row based on the value in another column

≡放荡痞女 Submitted on 2021-02-10 18:17:43
Question: I have a dataframe as follows, only with more rows: import pandas as pd data = {'First': ['First value', 'Second value', 'Third value'], 'Second': [['old','new','gold','door'], ['old','view','bold','door'], ['new','view','world','window']]} df = pd.DataFrame(data, columns=['First','Second']) To calculate the Jaccard similarity I found this piece online (not my solution): def lexical_overlap(doc1, doc2): words_doc1 = set(doc1) words_doc2 = set(doc2) intersection = words_doc1.intersection(words
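
The preview cuts off mid-function and the exact goal isn't shown; the sketch below assumes the intent is a pairwise Jaccard score between the word lists in 'Second', labelled by the 'First' column. The jaccard helper and the result layout are assumptions, not the accepted answer.

    import pandas as pd
    from itertools import combinations

    data = {'First': ['First value', 'Second value', 'Third value'],
            'Second': [['old', 'new', 'gold', 'door'],
                       ['old', 'view', 'bold', 'door'],
                       ['new', 'view', 'world', 'window']]}
    df = pd.DataFrame(data, columns=['First', 'Second'])

    def jaccard(doc1, doc2):
        # Jaccard similarity: |intersection| / |union| of the two word sets
        w1, w2 = set(doc1), set(doc2)
        return len(w1 & w2) / len(w1 | w2)

    # Score every pair of rows and label each pair by its 'First' value
    pairs = [(a['First'], b['First'], jaccard(a['Second'], b['Second']))
             for (_, a), (_, b) in combinations(df.iterrows(), 2)]
    print(pd.DataFrame(pairs, columns=['row_a', 'row_b', 'jaccard']))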

Order string sequences within a cell

只愿长相守 Submitted on 2021-02-10 18:17:39
Question: I have the following data in a column of a Pandas dataframe: col_1 ,B91-10,B7A-00,B7B-00,B0A-01,B0A-00,B64-03,B63-00,B7B-01 ,B8A-01,B5H-02,B32-02,B57-00 ,B83-01,B83-00,B5H-00 ,B83-01,B83-00 ,B83-00,B83-01 ,B83-00,B92-00,B92-01,B0N-02 ,B91-16 FYI: each of these strings begins with a comma, so the above example has 7 rows. The order of these different codes within a row does not matter. Rows 3 and 4 (assuming the index starts at 0) are identical for my purpose. I need to order these different codes in
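
The question is truncated before the desired output, but one reading is that each cell's codes should be put into a canonical order so that rows like ',B83-01,B83-00' and ',B83-00,B83-01' compare equal. A minimal sketch under that assumption (the sort_codes helper and col_1_sorted name are mine):

    import pandas as pd

    df = pd.DataFrame({'col_1': [',B83-01,B83-00,B5H-00',
                                 ',B83-01,B83-00',
                                 ',B83-00,B83-01']})

    def sort_codes(cell):
        # Split on commas, drop the empty token created by the leading comma,
        # sort alphabetically, and rebuild the string with its leading comma.
        codes = [c for c in cell.split(',') if c]
        return ',' + ','.join(sorted(codes))

    df['col_1_sorted'] = df['col_1'].apply(sort_codes)
    print(df)  # rows 1 and 2 now hold the identical string ',B83-00,B83-01'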

Sorting rows in python pandas

房东的猫 Submitted on 2021-02-10 18:17:30
Question: I have a dataframe (the sample looks like this): Type SKU Description FullDescription Size Price Variable 2 Boots Shoes on sale XL,S,M Variation 2.5 Boots XL XL 330 Variation 2.6 Boots S S 330 Variation 2.7 Boots M M 330 Variable 3 Helmet Helmet Sizes E42,E41 Variation 3.8 Helmet E42 E42 89 Variation 3.2 Helmet E41 E41 89 What I want to do is sort the values based on Size, so the final data frame should look like this: Type SKU Description FullDescription Size Price Variable 2 Boots Shoes on
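
The desired output is cut off, so the sketch below guesses at the intent: keep each 'Variable' parent row first and sort its 'Variation' rows by Size using an explicit size ranking. The parent/is_variation/size_rank helper columns and the size_order mapping are assumptions for illustration only.

    import pandas as pd

    df = pd.DataFrame({
        'Type': ['Variable', 'Variation', 'Variation', 'Variation',
                 'Variable', 'Variation', 'Variation'],
        'SKU': [2, 2.5, 2.6, 2.7, 3, 3.8, 3.2],
        'Description': ['Boots'] * 4 + ['Helmet'] * 3,
        'Size': ['XL,S,M', 'XL', 'S', 'M', 'E42,E41', 'E42', 'E41'],
        'Price': [None, 330, 330, 330, None, 89, 89],
    })

    # Explicit size ranking; extend it for whatever size codes the real data uses
    size_order = {'S': 0, 'M': 1, 'L': 2, 'XL': 3, 'E41': 0, 'E42': 1}

    df['parent'] = df['SKU'].astype(int)               # group variations with their parent
    df['is_variation'] = (df['Type'] == 'Variation').astype(int)
    df['size_rank'] = df['Size'].map(size_order)       # NaN for the parent rows

    out = (df.sort_values(['parent', 'is_variation', 'size_rank'])
             .drop(columns=['parent', 'is_variation', 'size_rank'])
             .reset_index(drop=True))
    print(out)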

How to normalize multiple columns of dicts in a pandas dataframe

我怕爱的太早我们不能终老 Submitted on 2021-02-10 18:15:45
Question: I am new to coding and I understand that this is a very basic question. I have a dataframe as: df Unnamed: 0 time home_team away_team full_time_result both_teams_to_score double_chance 0 0 2021-01-12 18:00:00 Sheff Utd Newcastle {'1': 2400, 'X': 3200, '2': 3100} {'yes': 2000, 'no': 1750} {'1X': 1360, '12': 1360, '2X': 1530} 1 1
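
Assuming the goal is to flatten the three dict-valued columns into ordinary columns, one approach is to json_normalize each dict column separately and concatenate the results back onto the frame. The dict_cols list and the dot-prefixed column names are choices made for this sketch, not a required layout.

    import pandas as pd

    df = pd.DataFrame({
        'time': ['2021-01-12 18:00:00'],
        'home_team': ['Sheff Utd'],
        'away_team': ['Newcastle'],
        'full_time_result': [{'1': 2400, 'X': 3200, '2': 3100}],
        'both_teams_to_score': [{'yes': 2000, 'no': 1750}],
        'double_chance': [{'1X': 1360, '12': 1360, '2X': 1530}],
    })

    dict_cols = ['full_time_result', 'both_teams_to_score', 'double_chance']

    # Expand each dict column into its own frame, prefix the new columns with
    # the source column name, and keep the original row index for alignment.
    expanded = [pd.json_normalize(df[col].tolist())
                  .add_prefix(col + '.')
                  .set_index(df.index)
                for col in dict_cols]

    flat = pd.concat([df.drop(columns=dict_cols)] + expanded, axis=1)
    print(flat.columns.tolist())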

Error opening Excel (XLSX) files from pandas xlsxwriter

六眼飞鱼酱① Submitted on 2021-02-10 18:12:00
Question: Upon opening an XLSX file in MS Excel, an error dialog is presented: "We found a problem with some content in filename.xlsx ..." Clicking "Yes" to attempt recovery yields the following XML error message: <?xml version="1.0" encoding="UTF-8" standalone="true"?> <recoveryLog xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"> <logFileName>error359720_09.xml</logFileName> <summary>Errors were detected in file 'C:\Users\username\Github\Project\Data\20200420b.xlsx'</summary>
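
The preview stops before showing how the file is written, so no definitive cause can be given here. One common source of the "We found a problem with some content" recovery dialog is a workbook that is never properly closed, for example when an exception interrupts the script before the writer finishes. A minimal sketch of the safe write pattern, with a hypothetical out.xlsx path and made-up data:

    import pandas as pd

    df = pd.DataFrame({'value': [1.0, 2.5, 3.0]})

    # The context manager guarantees the writer is closed, so the workbook's
    # XML parts are finalised even if an error occurs mid-script; an unclosed
    # or interrupted write leaves a file that triggers this recovery dialog.
    with pd.ExcelWriter('out.xlsx', engine='xlsxwriter') as writer:
        df.to_excel(writer, sheet_name='Sheet1', index=False)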

Using pandas.io.json.json_normalize() with empty list attributes

余生颓废 Submitted on 2021-02-10 18:10:49
Question: I'm using pandas.io.json.json_normalize() to convert some JSON into a dataframe, which is then pushed to an SQLite database via df.to_sql(). However, I'm getting sqlite3.InterfaceError: Error binding parameter 1 - probably unsupported type. when doing this, I think because one of my JSON fields is an empty array. I understand I can pass additional path arguments to json_normalize to have it pull out array values and augment the rows with the parent data: json_normalize(json_data
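
The question is cut off, but the to_sql failure itself comes from list objects (including empty ones) sitting in dataframe cells, which sqlite3 cannot bind. One workaround, sketched below with made-up sample data since the real JSON isn't shown, is to serialise any list-valued cells to JSON text before calling to_sql:

    import json
    import sqlite3
    import pandas as pd

    json_data = [
        {'id': 1, 'name': 'a', 'tags': ['x', 'y']},
        {'id': 2, 'name': 'b', 'tags': []},   # empty array ends up as a list cell
    ]

    df = pd.json_normalize(json_data)

    # sqlite3 cannot bind Python lists, so turn list cells (empty or not)
    # into JSON strings; scalar cells pass through unchanged.
    df['tags'] = df['tags'].apply(lambda v: json.dumps(v) if isinstance(v, list) else v)

    with sqlite3.connect(':memory:') as conn:
        df.to_sql('records', conn, index=False, if_exists='replace')
        print(pd.read_sql('SELECT * FROM records', conn))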

Add a column value depending on a date range (if-else)

送分小仙女□ Submitted on 2021-02-10 17:54:10
Question: I have a date column in my dataframe and want to add a column called location. The value of location in each row should depend on which date range it falls under. For example, the date 13th November falls between 12th November and 16th November, and therefore the location should be Seattle. The date 17th November falls between 17th November and 18th November and must be New York. Below is an example of the data frame I want to achieve: Dates | Location (column I want to add)
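
A common way to express this is np.select with one boolean condition per date range. The sketch below invents a year (2019) and a default label, since the preview only shows day and month values:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'Dates': pd.to_datetime(
        ['2019-11-13', '2019-11-15', '2019-11-17', '2019-11-18'])})

    # One condition per date range; between() is inclusive on both ends.
    conditions = [
        df['Dates'].between('2019-11-12', '2019-11-16'),
        df['Dates'].between('2019-11-17', '2019-11-18'),
    ]
    choices = ['Seattle', 'New York']

    df['Location'] = np.select(conditions, choices, default='Unknown')
    print(df)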

Pandas P&L rollup to the next business day

烂漫一生 Submitted on 2021-02-10 17:50:29
Question: I'm having a hard time trying to do this efficiently. I have some stocks and daily P&L info in a dataframe. In reality, I have millions of rows of data, so efficiency matters a lot! The DataFrame looks like: | Date | Security | P&L | | 2016-01-01 | AAPL | 100 | | 2016-01-02 | AAPL | 200 | | 2016-01-03 | AAPL | 300 | | 2016-01-04 | AAPL
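
The question is truncated, but the title suggests P&L booked on non-business days should roll forward and be summed into the next business day. Below is a sketch of one way to do that with a plain weekday calendar (BDay ignores exchange holidays; a CustomBusinessDay with a holiday calendar would be needed for those). The fourth P&L value is made up to complete the truncated sample.

    import pandas as pd
    from pandas.tseries.offsets import BDay

    df = pd.DataFrame({
        'Date': pd.to_datetime(['2016-01-01', '2016-01-02',
                                '2016-01-03', '2016-01-04']),
        'Security': ['AAPL'] * 4,
        'P&L': [100, 200, 300, 400],
    })

    # Adding zero business days leaves weekdays alone and rolls weekend
    # dates forward to the following Monday.
    df['RollDate'] = df['Date'] + BDay(0)

    rolled = df.groupby(['RollDate', 'Security'], as_index=False)['P&L'].sum()
    print(rolled)

For millions of rows, numpy.busday_offset with roll='forward' on a datetime64[D] array is a faster, fully vectorised alternative to adding a BDay offset.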