Question
The CSV file I have contains several repeated supplier_name values, each with a different amount, for the years 2015-2017.
Here is my code.
df = pd.read_csv('government-procurement-via-gebiz.csv', parse_dates=['award_date'],
infer_datetime_format=True, usecols=['supplier_name', 'award_date', 'awarded_amt'],)
df = df[(df['supplier_name'] != 'na') & (df['award_date'].dt.year == 2015)].reset_index(drop=True)
d1 = df.set_index('supplier_name').to_dict()['awarded_amt']
top5D1 = dict(sorted(d1.iteritems(), key=operator.itemgetter(1), reverse=True)[:5])
print top5D1
The output is
{'KAJIMA OVERSEAS ASIA PTE LTD': 595800000.0, 'SAMSUNG C&T CORPORATION': 555322063.0, 'GS Engineering & Construction Corp.': 428301000.0, 'HYUNDAI ENGINEERING & CONSTRUCTION CO. LTD': 601726000.0, 'THE GO-AHEAD GROUP PLC': 497738104.0}
I checked the CSV file, and the correct result should be this:
supplier_name award_date awarded_amt
1 SANTARLI CONSTRUCTION PTE. LTD. 2015-01-07 1.030000e+09
2 HYUNDAI ENGINEERING & CONSTRUCTION CO. LTD 2015-08-04 6.017260e+08
3 KAJIMA OVERSEAS ASIA PTE LTD 2015-02-03 5.958000e+08
4 SAMSUNG C&T CORPORATION 2015-11-20 5.553221e+08
5 THE GO-AHEAD GROUP PLC 2015-11-23 4.977381e+08
From the CSV file I found that the supplier_name "SANTARLI CONSTRUCTION PTE. LTD." appears twice: one row has the lowest amount and the other has the highest.
How can I get the highest amount for "SANTARLI CONSTRUCTION PTE. LTD." into the output?
The CSV data looks something like this:
1/7/2015 SANTARLI CONSTRUCTION PTE. LTD. 1030000000
8/4/2015 HYUNDAI ENGINEERING & CONSTRUCTION CO. LTD 601726000
2/3/2015 KAJIMA OVERSEAS ASIA PTE LTD 595800000
11/20/2015 SAMSUNG C&T CORPORATION 555322063
11/23/2015 THE GO-AHEAD GROUP PLC 497738104
6/19/2015 GS Engineering & Construction Corp. 428301000
6/25/2015 TIONG SENG CONTRACTORS (PRIVATE) LIMITED 277265946
2/27/2015 CHIP ENG SENG CONTRACTORS (1988) PTE LTD 258000000
11/18/2015 TEAMBUILD ENGINEERING & CONSTRUCTION PTE. LTD. 236800000
2/23/2015 NCS PTE. LTD. 223028240
11/11/2015 HSL Constructor Pte Ltd 217354000
7/31/2015 HI-TEK CONSTRUCTION PTE LTD 215000000
6/22/2015 HWA SENG BUILDER PTE LTD 189339600
3/19/2015 EXPAND CONSTRUCTION PTE LTD 189000000
11/30/2015 CNQC ENGINEERING & CONSTRUCTION PTE. LTD. 163980000
9/7/2015 Master Contract Services Pte Ltd 163000000
3/5/2015 Yongnam Engineering & Construction Pte Ltd 159000000
5/19/2015 SANTARLI CONSTRUCTION PTE. LTD. 148800000
Answer 1:
The problem is that when you create the dictionary with to_dict, the first instance of "SANTARLI" becomes a key as desired, but as the conversion continues it reaches the second instance of "SANTARLI". That row has the same key, so its value overwrites the value stored for the first instance.
Dictionary keys must be unique, so you need to remove the redundant instances from your data first. See below:
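As a quick illustration of the overwrite, here is a minimal sketch. The two-row `sample` frame is hypothetical and only mirrors the two "SANTARLI" rows from the question; it is not the real CSV.

```python
import pandas as pd

# Hypothetical two-row sample: the same supplier appears twice.
sample = pd.DataFrame({
    "supplier_name": [
        "SANTARLI CONSTRUCTION PTE. LTD.",
        "SANTARLI CONSTRUCTION PTE. LTD.",
    ],
    "awarded_amt": [1030000000, 148800000],
})

# Same pattern as in the question: supplier_name becomes the dict key.
amounts = sample.set_index("supplier_name").to_dict()["awarded_amt"]

# Only one entry survives, and it holds the value of the later row.
print(amounts)
```

Because the later row wins, the larger amount from the first row is lost, which is why the duplicates need to be handled before the conversion.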
import pandas as pd
# The original read_csv call, kept for reference:
#df = pd.read_csv('government-procurement-via-gebiz.csv', parse_dates=['award_date'], infer_datetime_format=True, usecols=['supplier_name', 'award_date', 'awarded_amt'],)
# I created the df from the data supplied in the question;
# `data` is a list of [award_date, supplier_name, awarded_amt] rows
df = pd.DataFrame(data, columns=['award_date', 'supplier_name', 'awarded_amt'])
df['award_date'] = pd.to_datetime(df['award_date'])
print(df)
# Select by date (your original code)
df = df[(df['supplier_name'] != 'na') & (df['award_date'].dt.year == 2015)].reset_index(drop=True)
# Sort by column 'awarded_amt' in descending order.
# This will leave the duplicates like 'SANTARLI', but put the one with the highest
# value in 'awarded_amt' first (assuming 'awarded_amt' is numeric, as it is after read_csv)
df = df.sort_values('awarded_amt', ascending=False)
# Drop the duplicates. This has a parameter "keep" which defaults to "first"
# Thus, it will keep the first instance of 'SANTARLI',
# which will also be the greatest 'awarded_amt'
df = df.drop_duplicates(subset=['supplier_name'])
# Now create your dict
d1 = df.set_index('supplier_name').to_dict()['awarded_amt']
print(d1)
OUTPUT:
award_date supplier_name awarded_amt
0 2015-01-07 SANTARLI CONSTRUCTION PTE. LTD. 1030000000
1 2014-08-04 HYUNDAI ENGINEERING & CONSTRUCTION CO. LTD 601726000
2 2014-02-03 KAJIMA OVERSEAS ASIA PTE LTD 595800000
3 2015-11-20 SAMSUNG C&T CORPORATION 555322063
4 2015-11-23 THE GO-AHEAD GROUP PLC 497738104
5 2015-06-19 GS Engineering & Construction Corp. 428301000
6 2015-09-07 Master Contract Services Pte Ltd 163000000
7 2015-03-05 Yongnam Engineering & Construction Pte Ltd 159000000
8 2015-12-30 NANJING DADI CONSTRUCTION (GROUP) CO., LTD. SI... 152600000
9 2015-05-19 SANTARLI CONSTRUCTION PTE. LTD. 148800000
{'SANTARLI CONSTRUCTION PTE. LTD.': '1030000000', 'NANJING DADI CONSTRUCTION (GROUP) CO., LTD. SINGAPORE BRANCH': '152600000', 'Yongnam Engineering & Construction Pte Ltd': '159000000', 'Master Contract Services Pte Ltd': '163000000', 'GS Engineering & Construction Corp.': '428301000', 'THE GO-AHEAD GROUP PLC': '497738104', 'SAMSUNG C&T CORPORATION': '555322063'}
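An alternative to sorting and dropping duplicates is to group by supplier and take the maximum, then keep the five largest results. This is only a sketch under assumptions: the `rows` list is a hypothetical subset of the question's 2015 data, and `awarded_amt` is assumed to be numeric.

```python
import pandas as pd

# Hypothetical sample rows copied from the question's 2015 data.
rows = [
    ["1/7/2015", "SANTARLI CONSTRUCTION PTE. LTD.", 1030000000],
    ["8/4/2015", "HYUNDAI ENGINEERING & CONSTRUCTION CO. LTD", 601726000],
    ["2/3/2015", "KAJIMA OVERSEAS ASIA PTE LTD", 595800000],
    ["11/20/2015", "SAMSUNG C&T CORPORATION", 555322063],
    ["11/23/2015", "THE GO-AHEAD GROUP PLC", 497738104],
    ["6/19/2015", "GS Engineering & Construction Corp.", 428301000],
    ["5/19/2015", "SANTARLI CONSTRUCTION PTE. LTD.", 148800000],
]
sample = pd.DataFrame(rows, columns=["award_date", "supplier_name", "awarded_amt"])

# Largest award per supplier, then the five largest of those.
top5 = sample.groupby("supplier_name")["awarded_amt"].max().nlargest(5)

# The index is unique after the groupby, so nothing is overwritten here.
top5_dict = top5.to_dict()
print(top5_dict)
```

Since the grouped result has exactly one row per supplier, converting it to a dictionary is safe, and "SANTARLI" keeps its highest amount.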
EDIT: If you just want the top 5 rows by "awarded_amt" for each year (i.e. the top 5 "awarded_amt" values, regardless of whether they belong to 5 different companies or to the same company), then don't drop duplicates at all.
Just sort the entire DataFrame by "awarded_amt" and take the top 5 (for example with df.head(5)), but DON'T use to_dict() with the company names as keys, since a dictionary cannot hold two (or more) entries for the same company name.
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
data = [["1/7/2015", "SANTARLI CONSTRUCTION PTE. LTD.", 1030000000],
["8/4/2015", "HYUNDAI ENGINEERING & CONSTRUCTION CO. LTD", 601726000],
["2/3/2015", "KAJIMA OVERSEAS ASIA PTE LTD", 595800000],
["11/20/2015","SAMSUNG C&T CORPORATION", 555322063],
["11/23/2015" ,"THE GO-AHEAD GROUP PLC", 497738104],
["6/19/2015" ,"GS Engineering & Construction Corp.", 428301000],
["6/25/2015" ,"TIONG SENG CONTRACTORS (PRIVATE) LIMITED", 277265946],
["5/19/2015" ,"SANTARLI CONSTRUCTION PTE. LTD." , 649800000],
["5/19/2016" ,"SANTARLI CONSTRUCTION PTE. LTD." , 650800000],
["5/19/2016" ,"SANTARLI CONSTRUCTION PTE. LTD." , 651800000],
["11/20/2016","SAMSUNG C&T CORPORATION", 555322063],
["11/23/2016" ,"THE GO-AHEAD GROUP PLC", 497738104],
["6/19/2016" ,"GS Engineering & Construction Corp.", 428301000]
]
df = pd.DataFrame(data, columns = ['award_date', 'supplier_name', 'awarded_amt'])
df['award_date'] = pd.to_datetime(df['award_date'])
# Separate df by years
finaldf = pd.DataFrame()
years = [2015, 2016]
for year in years:
    temp_df = df[(df['supplier_name'] != 'na') & (df['award_date'].dt.year == year)].reset_index(drop=True)
    # Sort by column 'awarded_amt'.
    # This will leave the duplicates like 'SANTARLI', but put the one with the highest
    # value in 'awarded_amt' first
    temp_df = temp_df.sort_values('awarded_amt', ascending=False)
    print("-----------------------____")
    finaldf = pd.concat([finaldf, temp_df.iloc[:5]])
print(finaldf)
OUTPUT:
award_date supplier_name awarded_amt
0 2015-01-07 SANTARLI CONSTRUCTION PTE. LTD. 1030000000
7 2015-05-19 SANTARLI CONSTRUCTION PTE. LTD. 649800000
1 2015-08-04 HYUNDAI ENGINEERING & CONSTRUCTION CO. LTD 601726000
2 2015-02-03 KAJIMA OVERSEAS ASIA PTE LTD 595800000
3 2015-11-20 SAMSUNG C&T CORPORATION 555322063
1 2016-05-19 SANTARLI CONSTRUCTION PTE. LTD. 651800000
0 2016-05-19 SANTARLI CONSTRUCTION PTE. LTD. 650800000
2 2016-11-20 SAMSUNG C&T CORPORATION 555322063
3 2016-11-23 THE GO-AHEAD GROUP PLC 497738104
4 2016-06-19 GS Engineering & Construction Corp. 428301000
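The per-year selection above could also be written without an explicit loop, using a sort followed by groupby(...).head(n). The following is a sketch, not the answer's own method: the `rows` list is a hypothetical subset of the data, the `year` helper column is introduced only for illustration, and n is 2 to keep the sample short.

```python
import pandas as pd

# Hypothetical subset of the sample data, three rows per year.
rows = [
    ["1/7/2015", "SANTARLI CONSTRUCTION PTE. LTD.", 1030000000],
    ["8/4/2015", "HYUNDAI ENGINEERING & CONSTRUCTION CO. LTD", 601726000],
    ["5/19/2015", "SANTARLI CONSTRUCTION PTE. LTD.", 649800000],
    ["5/19/2016", "SANTARLI CONSTRUCTION PTE. LTD.", 651800000],
    ["11/20/2016", "SAMSUNG C&T CORPORATION", 555322063],
    ["11/23/2016", "THE GO-AHEAD GROUP PLC", 497738104],
]
sample = pd.DataFrame(rows, columns=["award_date", "supplier_name", "awarded_amt"])
sample["award_date"] = pd.to_datetime(sample["award_date"], format="%m/%d/%Y")

# Helper column (an assumption of this sketch) holding the award year.
sample["year"] = sample["award_date"].dt.year

top_per_year = (
    sample.sort_values("awarded_amt", ascending=False)  # largest amounts first
    .groupby("year")
    .head(2)  # first 2 rows of each year, i.e. the 2 largest
    .sort_values(["year", "awarded_amt"], ascending=[True, False])
)
print(top_per_year)
```

Duplicate supplier names are kept here on purpose, so the same company can appear more than once within a year.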
EDIT:
To transform finaldf to a dictionary, I would recommend this. It will create a nested dictionary, similar to JSON. You could also use Python's json module for this.
final_dict = {}
for _, row in finaldf.iterrows():
    # Access the values by column label rather than by position
    award_date = row['award_date']
    supplier_name = row['supplier_name']
    awarded_amt = row['awarded_amt']
    if supplier_name not in final_dict:
        final_dict[supplier_name] = {}
    final_dict[supplier_name][award_date] = awarded_amt
print(final_dict)
OUTPUT:
{
'SANTARLI CONSTRUCTION PTE. LTD.': {
Timestamp('2015-01-07 00:00:00'): 1030000000,
Timestamp('2015-05-19 00:00:00'): 649800000,
Timestamp('2016-05-19 00:00:00'): 650800000
},
'HYUNDAI ENGINEERING & CONSTRUCTION CO. LTD': {
Timestamp('2015-08-04 00:00:00'): 601726000
},
'KAJIMA OVERSEAS ASIA PTE LTD': {
Timestamp('2015-02-03 00:00:00'): 595800000
},
'SAMSUNG C&T CORPORATION': {
Timestamp('2015-11-20 00:00:00'): 555322063,
Timestamp('2016-11-20 00:00:00'): 555322063
},
'THE GO-AHEAD GROUP PLC': {
Timestamp('2016-11-23 00:00:00'): 497738104
},
'GS Engineering & Construction Corp.': {
Timestamp('2016-06-19 00:00:00'): 428301000
}
}
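Note that in the output above the two "SANTARLI" awards dated 2016-05-19 collapse into a single entry, because the inner dictionary is keyed by date. One way around that, sketched here under assumptions, is to store a list of records per supplier and serialize it with the standard json module. The `rows` list is a hypothetical subset of the data, and the record layout is a choice made for this illustration.

```python
import json

import pandas as pd

# Hypothetical subset: two awards to the same supplier on the same date.
rows = [
    ["5/19/2016", "SANTARLI CONSTRUCTION PTE. LTD.", 650800000],
    ["5/19/2016", "SANTARLI CONSTRUCTION PTE. LTD.", 651800000],
    ["11/20/2016", "SAMSUNG C&T CORPORATION", 555322063],
]
sample = pd.DataFrame(rows, columns=["award_date", "supplier_name", "awarded_amt"])
sample["award_date"] = pd.to_datetime(sample["award_date"], format="%m/%d/%Y")

# Map each supplier to a list of records, so equal dates cannot collide.
by_supplier = {}
for row in sample.itertuples(index=False):
    by_supplier.setdefault(row.supplier_name, []).append({
        # Timestamps are not JSON serializable, so store ISO date strings.
        "award_date": row.award_date.strftime("%Y-%m-%d"),
        "awarded_amt": int(row.awarded_amt),
    })

as_json = json.dumps(by_supplier, indent=2)
print(as_json)
```

With this layout both 2016-05-19 awards are preserved, at the cost of looking amounts up by position in a list rather than by date.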
Source: https://stackoverflow.com/questions/58471634/python-failed-in-retrieving-the-highest-amount-from-a-repeated-data-with-differ