pandas

pandas:calculate jaccard similarity for every row based on the value in another column

≡放荡痞女 Submitted on 2021-02-10 18:17:43
Question: I have a dataframe as follows, only with more rows: import pandas as pd data = {'First': ['First value', 'Second value', 'Third value'], 'Second': [['old','new','gold','door'], ['old','view','bold','door'], ['new','view','world','window']]} df = pd.DataFrame(data, columns=['First','Second']) To calculate the Jaccard similarity I found this piece online (not my solution): def lexical_overlap(doc1, doc2): words_doc1 = set(doc1) words_doc2 = set(doc2) intersection = words_doc1.intersection(words
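
The preview cuts off mid-function and the exact goal isn't shown; the sketch below assumes the intent is a pairwise Jaccard score between the word lists in 'Second', labelled by the 'First' column. The jaccard helper and the result layout are assumptions, not the accepted answer.

    import pandas as pd
    from itertools import combinations

    data = {'First': ['First value', 'Second value', 'Third value'],
            'Second': [['old', 'new', 'gold', 'door'],
                       ['old', 'view', 'bold', 'door'],
                       ['new', 'view', 'world', 'window']]}
    df = pd.DataFrame(data, columns=['First', 'Second'])

    def jaccard(doc1, doc2):
        # Jaccard similarity: |intersection| / |union| of the two word sets
        w1, w2 = set(doc1), set(doc2)
        return len(w1 & w2) / len(w1 | w2)

    # Score every pair of rows and label each pair by its 'First' value
    pairs = [(a['First'], b['First'], jaccard(a['Second'], b['Second']))
             for (_, a), (_, b) in combinations(df.iterrows(), 2)]
    print(pd.DataFrame(pairs, columns=['row_a', 'row_b', 'jaccard']))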

Order string sequences within a cell

只愿长相守 Submitted on 2021-02-10 18:17:39
Question: I have the following data in a column of a Pandas dataframe: col_1 ,B91-10,B7A-00,B7B-00,B0A-01,B0A-00,B64-03,B63-00,B7B-01 ,B8A-01,B5H-02,B32-02,B57-00 ,B83-01,B83-00,B5H-00 ,B83-01,B83-00 ,B83-00,B83-01 ,B83-00,B92-00,B92-01,B0N-02 ,B91-16 FYI: each of these strings begins with a comma, so the above example has 7 rows. The order of these different codes within a row does not matter. Rows 3 and 4 (assuming the index starts at 0) are identical for my purpose. I need to order these different codes in
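
The question is truncated before the desired output, but one reading is that each cell's codes should be put into a canonical order so that rows like ',B83-01,B83-00' and ',B83-00,B83-01' compare equal. A minimal sketch under that assumption (the sort_codes helper and col_1_sorted name are mine):

    import pandas as pd

    df = pd.DataFrame({'col_1': [',B83-01,B83-00,B5H-00',
                                 ',B83-01,B83-00',
                                 ',B83-00,B83-01']})

    def sort_codes(cell):
        # Split on commas, drop the empty token created by the leading comma,
        # sort alphabetically, and rebuild the string with its leading comma.
        codes = [c for c in cell.split(',') if c]
        return ',' + ','.join(sorted(codes))

    df['col_1_sorted'] = df['col_1'].apply(sort_codes)
    print(df)  # rows 1 and 2 now hold the identical string ',B83-00,B83-01'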

Sorting rows in python pandas

房东的猫 Submitted on 2021-02-10 18:17:30
Question: I have a dataframe (the sample looks like this): Type SKU Description FullDescription Size Price Variable 2 Boots Shoes on sale XL,S,M Variation 2.5 Boots XL XL 330 Variation 2.6 Boots S S 330 Variation 2.7 Boots M M 330 Variable 3 Helmet Helmet Sizes E42,E41 Variation 3.8 Helmet E42 E42 89 Variation 3.2 Helmet E41 E41 89 What I want to do is sort the values based on Size, so the final data frame should look like this: Type SKU Description FullDescription Size Price Variable 2 Boots Shoes on
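
The desired output is cut off, so the sketch below guesses at the intent: keep each 'Variable' parent row first and sort its 'Variation' rows by Size using an explicit size ranking. The parent/is_variation/size_rank helper columns and the size_order mapping are assumptions for illustration only.

    import pandas as pd

    df = pd.DataFrame({
        'Type': ['Variable', 'Variation', 'Variation', 'Variation',
                 'Variable', 'Variation', 'Variation'],
        'SKU': [2, 2.5, 2.6, 2.7, 3, 3.8, 3.2],
        'Description': ['Boots'] * 4 + ['Helmet'] * 3,
        'Size': ['XL,S,M', 'XL', 'S', 'M', 'E42,E41', 'E42', 'E41'],
        'Price': [None, 330, 330, 330, None, 89, 89],
    })

    # Explicit size ranking; extend it for whatever size codes the real data uses
    size_order = {'S': 0, 'M': 1, 'L': 2, 'XL': 3, 'E41': 0, 'E42': 1}

    df['parent'] = df['SKU'].astype(int)               # group variations with their parent
    df['is_variation'] = (df['Type'] == 'Variation').astype(int)
    df['size_rank'] = df['Size'].map(size_order)       # NaN for the parent rows

    out = (df.sort_values(['parent', 'is_variation', 'size_rank'])
             .drop(columns=['parent', 'is_variation', 'size_rank'])
             .reset_index(drop=True))
    print(out)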

How to normalize multiple columns of dicts in a pandas dataframe

我怕爱的太早我们不能终老 Submitted on 2021-02-10 18:15:45
Question: I am new to coding and I understand that this is a very basic question. I have a dataframe as: df Unnamed: 0 time home_team away_team full_time_result both_teams_to_score double_chance 0 0 2021-01-12 18:00:00 Sheff Utd Newcastle {'1': 2400, 'X': 3200, '2': 3100} {'yes': 2000, 'no': 1750} {'1X': 1360, '12': 1360, '2X': 1530} 1 1
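
Assuming the goal is to flatten the three dict-valued columns into ordinary columns, one approach is to json_normalize each dict column separately and concatenate the results back onto the frame. The dict_cols list and the dot-prefixed column names are choices made for this sketch, not a required layout.

    import pandas as pd

    df = pd.DataFrame({
        'time': ['2021-01-12 18:00:00'],
        'home_team': ['Sheff Utd'],
        'away_team': ['Newcastle'],
        'full_time_result': [{'1': 2400, 'X': 3200, '2': 3100}],
        'both_teams_to_score': [{'yes': 2000, 'no': 1750}],
        'double_chance': [{'1X': 1360, '12': 1360, '2X': 1530}],
    })

    dict_cols = ['full_time_result', 'both_teams_to_score', 'double_chance']

    # Expand each dict column into its own frame, prefix the new columns with
    # the source column name, and keep the original row index for alignment.
    expanded = [pd.json_normalize(df[col].tolist())
                  .add_prefix(col + '.')
                  .set_index(df.index)
                for col in dict_cols]

    flat = pd.concat([df.drop(columns=dict_cols)] + expanded, axis=1)
    print(flat.columns.tolist())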

Error opening Excel (XLSX) files from pandas xlsxwriter

六眼飞鱼酱① Submitted on 2021-02-10 18:12:00
Question: Upon opening an XLSX file in MS Excel, an error dialog is presented: "We found a problem with some content in filename.xlsx ..." Clicking "Yes" to attempt recovery yields the following XML error message: <?xml version="1.0" encoding="UTF-8" standalone="true"?> <recoveryLog xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"> <logFileName>error359720_09.xml</logFileName> <summary>Errors were detected in file 'C:\Users\username\Github\Project\Data\20200420b.xlsx'</summary>
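
The preview stops before showing how the file is written, so no definitive cause can be given here. One common source of the "We found a problem with some content" recovery dialog is a workbook that is never properly closed, for example when an exception interrupts the script before the writer finishes. A minimal sketch of the safe write pattern, with a hypothetical out.xlsx path and made-up data:

    import pandas as pd

    df = pd.DataFrame({'value': [1.0, 2.5, 3.0]})

    # The context manager guarantees the writer is closed, so the workbook's
    # XML parts are finalised even if an error occurs mid-script; an unclosed
    # or interrupted write leaves a file that triggers this recovery dialog.
    with pd.ExcelWriter('out.xlsx', engine='xlsxwriter') as writer:
        df.to_excel(writer, sheet_name='Sheet1', index=False)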

Using pandas.io.json.json_normalize() with empty list attributes

余生颓废 Submitted on 2021-02-10 18:10:49
Question: I'm using pandas.io.json.json_normalize() to convert some JSON into a dataframe, which is then pushed to an SQLite database via df.to_sql(). However, I'm getting sqlite3.InterfaceError: Error binding parameter 1 - probably unsupported type. when doing this, I think because one of my JSON fields is an empty array. I understand I can pass additional path arguments to json_normalize to have it pull out array values and augment the rows with the parent data: json_normalize(json_data
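
The question is cut off, but the to_sql failure itself comes from list objects (including empty ones) sitting in dataframe cells, which sqlite3 cannot bind. One workaround, sketched below with made-up sample data since the real JSON isn't shown, is to serialise any list-valued cells to JSON text before calling to_sql:

    import json
    import sqlite3
    import pandas as pd

    json_data = [
        {'id': 1, 'name': 'a', 'tags': ['x', 'y']},
        {'id': 2, 'name': 'b', 'tags': []},   # empty array ends up as a list cell
    ]

    df = pd.json_normalize(json_data)

    # sqlite3 cannot bind Python lists, so turn list cells (empty or not)
    # into JSON strings; scalar cells pass through unchanged.
    df['tags'] = df['tags'].apply(lambda v: json.dumps(v) if isinstance(v, list) else v)

    with sqlite3.connect(':memory:') as conn:
        df.to_sql('records', conn, index=False, if_exists='replace')
        print(pd.read_sql('SELECT * FROM records', conn))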

Add a column value depending on a date range (if-else)

送分小仙女□ Submitted on 2021-02-10 17:54:10
Question: I have a date column in my dataframe and want to add a column called location. The value of location in each row should depend on which date range it falls under. For example, the date 13th November falls between 12th November and 16th November, and therefore the location should be Seattle. The date 17th November falls between 17th November and 18th November and must be New York. Below is an example of the data frame I want to achieve: Dates | Location (column I want to add)
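
A common way to express this is np.select with one boolean condition per date range. The sketch below invents a year (2019) and a default label, since the preview only shows day and month values:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'Dates': pd.to_datetime(
        ['2019-11-13', '2019-11-15', '2019-11-17', '2019-11-18'])})

    # One condition per date range; between() is inclusive on both ends.
    conditions = [
        df['Dates'].between('2019-11-12', '2019-11-16'),
        df['Dates'].between('2019-11-17', '2019-11-18'),
    ]
    choices = ['Seattle', 'New York']

    df['Location'] = np.select(conditions, choices, default='Unknown')
    print(df)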

Pandas P&L rollup to the next business day

烂漫一生 Submitted on 2021-02-10 17:50:29
Question: I'm having a hard time trying to do this efficiently. I have some stocks and daily P&L info in a dataframe. In reality, I have millions of rows of data, so efficiency matters a lot! The DataFrame looks like: | Date | Security | P&L | | 2016-01-01 | AAPL | 100 | | 2016-01-02 | AAPL | 200 | | 2016-01-03 | AAPL | 300 | | 2016-01-04 | AAPL
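
The question is truncated, but the title suggests P&L booked on non-business days should roll forward and be summed into the next business day. Below is a sketch of one way to do that with a plain weekday calendar (BDay ignores exchange holidays; a CustomBusinessDay with a holiday calendar would be needed for those). The fourth P&L value is made up to complete the truncated sample.

    import pandas as pd
    from pandas.tseries.offsets import BDay

    df = pd.DataFrame({
        'Date': pd.to_datetime(['2016-01-01', '2016-01-02',
                                '2016-01-03', '2016-01-04']),
        'Security': ['AAPL'] * 4,
        'P&L': [100, 200, 300, 400],
    })

    # Adding zero business days leaves weekdays alone and rolls weekend
    # dates forward to the following Monday.
    df['RollDate'] = df['Date'] + BDay(0)

    rolled = df.groupby(['RollDate', 'Security'], as_index=False)['P&L'].sum()
    print(rolled)

For millions of rows, numpy.busday_offset with roll='forward' on a datetime64[D] array is a faster, fully vectorised alternative to adding a BDay offset.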