Question
My code merges CSV files and removes duplicates with pandas. Is it possible to add an additional column (a header with values) to the single merged file?
The additional column should be called Host Alias and should correspond to Host Name.
For example, if Host Name is dpc01n1, the corresponding Host Alias should be dev_dom1;
if Host Name is dpc02n1, the corresponding Host Alias should be dev_dom2;
and so on.
Here is my code:
from glob import glob
import pandas as pd
class bcolors:
    HEADER = '\033[95m'
    OKBLUE = '\033[94m'
    OKGREEN = '\033[92m'
    WARNING = '\033[93m'
    FAIL = '\033[91m'
    ENDC = '\033[0m'
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'
input_path = r'C:\Users\urale\Desktop\logs'
output_path = r'C:\Users\urale\Desktop\logs' + '\\'
output_name = 'output.csv'
stock_files = sorted(glob(input_path + r'\pc_dblatmonstat_*_*.log'))
print(bcolors.OKBLUE + 'Getting .log files from', input_path)
final_headers = [
    'Start Time',
    'epoch',
    'Host Name',
    'Db Alias',
    'Database',
    'Db Host',
    'Db Host IP',
    'IP Port',
    'Latency (us)'
]
# read in files via list comprehension
content = [pd.read_csv(f, usecols=final_headers, sep='[;]', engine='python')
           for f in stock_files]
print(bcolors.OKBLUE + 'Reading files')
#combine files into one dataframe
combo = pd.concat(content,ignore_index = True)
print(bcolors.OKBLUE + 'Combining files')
#drop duplicates
combo = combo.drop_duplicates()
#combo = combo.drop_duplicates(final_headers, keep=False)
print(bcolors.OKBLUE + 'Dropping duplicates')
#write to csv:
combo.to_csv(output_path + output_name, index = False)
print(bcolors.OKGREEN + 'Merged file output to', output_path, 'as', output_name)
Answer 1:
def func(row):
    if row['Host Name'] == "dpc01n1":
        return 'dev_dom1'
    # Add the rest of your Host Alias logic here and return the alias
combo["Host Alias"] = combo.apply(func, axis=1)
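DataFrame.apply accepts a function and applies it along an axis of the DataFrame. With axis=1 the function is called once per row, and the returned values form a new Series.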
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
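When the aliases do not follow a regular pattern, an explicit lookup table may be simpler than a row-wise function. The following is a minimal sketch, assuming a small hypothetical `alias_map` dictionary built from the two pairs given in the question. `Series.map` leaves NaN for any host that is missing from the dictionary.

```python
import pandas as pd

# Stand-in for the merged DataFrame; only the relevant column is shown.
combo = pd.DataFrame({'Host Name': ['dpc01n1', 'dpc02n1', 'dpc01n1']})

# Hypothetical lookup table; the two pairs come from the question.
alias_map = {
    'dpc01n1': 'dev_dom1',
    'dpc02n1': 'dev_dom2',
}

# Series.map looks each Host Name up in the dictionary.
combo['Host Alias'] = combo['Host Name'].map(alias_map)
print(combo)
```

Compared with apply(func, axis=1), this works on a single column and keeps the mapping as plain data that can be extended without touching the code.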
Answer 2:
Something like this should work:
import pandas as pd
combo = pd.DataFrame({
    'Start Time': [1, 2, 3],
    'epoch': [1, 2, 3],
    'Host Name': ['dpc01n1', 'dpc02n1', 'dpc00103n1'],
    'Db Alias': [1, 2, 3],
    'Database': [1, 2, 3],
    'Db Host': [1, 2, 3],
    'Db Host IP': [1, 2, 3],
    'IP Port': [1, 2, 3],
    'Latency (us)': [1, 2, 3],
})
h_num = combo['Host Name'].str.lstrip('dpc0').str[:-2]
combo['Host Alias'] = 'dev_dom' + h_num
print(combo)
This assumes that every Host Name starts with 'dpc' and that the two trailing characters (such as 'n1') are not needed.
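The same assumption can be written down explicitly with a regular expression. This is only a sketch of an alternative, not the method used in the answer above: the pattern is an assumption about the host name format (the prefix 'dpc', optional leading zeros, a number, then 'n' followed by digits), and names that do not match it yield NaN.

```python
import pandas as pd

combo = pd.DataFrame({'Host Name': ['dpc01n1', 'dpc02n1', 'dpc00103n1']})

# Assumed format: 'dpc', optional leading zeros, the number, then 'n<digits>'.
pattern = r'^dpc0*(\d+)n\d+$'

# expand=False returns a Series holding the single capture group.
h_num = combo['Host Name'].str.extract(pattern, expand=False)
combo['Host Alias'] = 'dev_dom' + h_num
print(combo)
```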
Follow-up question asked in comments:
It assumes that my merged csv file already has Host Alias, but it doesn't, resulting in an error:
Exception has occurred: ValueError
Usecols do not match columns, columns expected but not found: ['Host Alias']
File "D:\OneDrive\python\merger.py", line 42, in
content = [pd.read_csv(f,usecols = combo_headers, sep='[;]',engine='python')
Other than dpc, I also have tpc. How can I add that too? – Trunks
str.lstrip treats its argument as a set of characters and strips any of them from the left end of the string, regardless of order. To handle tpc as well, just add a 't':
h_num = combo['Host Name'].str.lstrip('tdpc0').str[:-2]
For more detail, see the Python documentation for str.strip.
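The set-of-characters behaviour can be checked with plain Python strings, without pandas. A small sketch using host names in the style of the question:

```python
# str.lstrip removes any leading characters found in the argument,
# treating it as a set rather than as a literal prefix.
names = ['dpc01n1', 'tpc02n1', 'dpc00103n1']
stripped = [name.lstrip('tdpc0') for name in names]
print(stripped)

# Order does not matter: 'cpd0t' describes the same set of characters.
same = [name.lstrip('cpd0t') for name in names]
print(stripped == same)

# Dropping the last two characters (such as 'n1') leaves the number.
numbers = [s[:-2] for s in stripped]
print(numbers)
```

Stripping stops at the first character that is not in the set, which is why the zero inside '00103' survives while the leading zeros are removed.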
As for:
It assumes that my merged csv file already has Host Alias
I'm not sure what you mean by this. When you do
combo['Host Alias'] = 'dev_dom' + h_num
The 'Host Alias' column is created in the pandas.DataFrame if it does not already exist. If it does exist, the column is replaced with the new data returned by the operation. You can then use pandas.DataFrame.to_csv to save this DataFrame to a .csv file.
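The ValueError quoted in the comment is raised by read_csv, which suggests that 'Host Alias' had been added to the usecols list even though the source files do not contain that column. A minimal sketch of the order of operations, assuming made-up sample data held in memory and a shortened, hypothetical `source_headers` list: read only the columns that exist, merge, and only then create the new column.

```python
import io
import pandas as pd

# Two in-memory stand-ins for the ';'-separated log files (made-up sample data).
file_a = io.StringIO("Host Name;Latency (us)\ndpc01n1;120\ndpc02n1;95\n")
file_b = io.StringIO("Host Name;Latency (us)\ndpc02n1;95\ntpc03n1;210\n")

# usecols lists only the columns that exist in the source files.
source_headers = ['Host Name', 'Latency (us)']
content = [pd.read_csv(f, usecols=source_headers, sep=';') for f in (file_a, file_b)]
combo = pd.concat(content, ignore_index=True).drop_duplicates()

# The new column is created here, after reading and merging.
h_num = combo['Host Name'].str.lstrip('tdpc0').str[:-2]
combo['Host Alias'] = 'dev_dom' + h_num

# to_csv with no path returns the CSV text; pass a file path to write to disk.
csv_text = combo.to_csv(index=False)
print(csv_text)
```

In the original script the same idea would mean leaving final_headers unchanged and adding the 'Host Alias' assignment between drop_duplicates and to_csv.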
Source: https://stackoverflow.com/questions/61180155/add-additional-column-in-merged-csv-file