Add additional column in merged csv file

Submitted by 北城余情 on 2021-01-28 04:40:14

Question


My code merges csv files and removes duplicates with pandas. Is it possible to add an additional header with values to the single merged file?

The additional header should be called Host Alias and should correspond to Host Name

E.g. if Host Name is dpc01n1, the corresponding Host Alias should be dev_dom1; if Host Name is dpc02n1, the corresponding Host Alias should be dev_dom2, etc.

Here is my code

from glob import glob
import pandas as pd

class bcolors:
    HEADER = '\033[95m'
    OKBLUE = '\033[94m'
    OKGREEN = '\033[92m'
    WARNING = '\033[93m'
    FAIL = '\033[91m'
    ENDC = '\033[0m'
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'

input_path = r'C:\Users\urale\Desktop\logs'
output_path = r'C:\Users\urale\Desktop\logs' + '\\'
output_name = 'output.csv'

stock_files = sorted(glob(input_path + r'\pc_dblatmonstat_*_*.log'))
print(bcolors.OKBLUE + 'Getting .log files from', input_path)

final_headers = [
        'Start Time', 
        'epoch', 
        'Host Name', 
        'Db Alias', 
        'Database', 
        'Db Host', 
        'Db Host IP',
        'IP Port',
        'Latency (us)'
]

#read in files via list comprehension
content = [pd.read_csv(f,usecols = final_headers, sep='[;]',engine='python') 
           for f in stock_files]
print(bcolors.OKBLUE + 'Reading files')


#combine files into one dataframe
combo = pd.concat(content,ignore_index = True)
print(bcolors.OKBLUE + 'Combining files')

#drop duplicates
combo = combo.drop_duplicates()
#combo = combo.drop_duplicates(final_headers, keep=False)
print(bcolors.OKBLUE + 'Dropping duplicates')

#write to csv:
combo.to_csv(output_path + output_name, index = False)
print(bcolors.OKGREEN + 'Merged file output to', output_path, 'as', output_name)

Answer 1:


def func(row):
    if row['Host Name'] == "dpc01n1":
        return 'dev_dom1'
    # put the rest of your Host Alias logic here and return the alias

combo["Host Alias"]=combo.apply(func, axis=1)

DataFrame.apply accepts a function and, with axis=1, calls it once per row; the returned values form a new Series, which is assigned to the Host Alias column here.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
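
When the aliases follow no pattern, a plain lookup table may be easier to maintain than a chain of if statements. The following is a minimal sketch, not part of the original answer: the dictionary name alias_by_host and the host 'unknown01' are made up for illustration, and only the first two pairs come from the question.

```python
import pandas as pd

# Hypothetical lookup table; only these two pairs appear in the question.
alias_by_host = {
    "dpc01n1": "dev_dom1",
    "dpc02n1": "dev_dom2",
}

# Small stand-in for the merged DataFrame ('unknown01' is a made-up host).
combo = pd.DataFrame({"Host Name": ["dpc01n1", "dpc02n1", "unknown01"]})

# Series.map looks each host up in the dict; hosts without an entry become NaN.
combo["Host Alias"] = combo["Host Name"].map(alias_by_host)

print(combo)
```

Hosts missing from the table end up with NaN in Host Alias, which makes gaps in the mapping easy to spot.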




Answer 2:


Something like this should work:

import pandas as pd

combo = pd.DataFrame({
        'Start Time' : [1,2,3], 
        'epoch' : [1,2,3], 
        'Host Name': ['dpc01n1','dpc02n1','dpc00103n1'], 
        'Db Alias' : [1,2,3], 
        'Database' : [1,2,3], 
        'Db Host' : [1,2,3], 
        'Db Host IP' : [1,2,3],
        'IP Port' : [1,2,3],
        'Latency (us)' : [1,2,3],
})

h_num = combo['Host Name'].str.lstrip('dpc0').str[:-2]

combo['Host Alias'] = 'dev_dom' + h_num

print(combo)

This assumes that every 'Host Name' starts with 'dpc' and that the two trailing characters (such as 'n1') are not needed.
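
If those assumptions need to be spelled out, a regular expression can state them directly. This is a sketch of an alternative, not the answer's method; the pattern is an assumption about the naming scheme ('dpc', optional leading zeros, a number, then 'n' plus digits).

```python
import pandas as pd

hosts = pd.Series(["dpc01n1", "dpc02n1", "dpc00103n1"], name="Host Name")

# Capture the number between the 'dpc' prefix and the trailing 'n<digits>' suffix,
# skipping leading zeros. expand=False returns a Series instead of a DataFrame.
# Names that do not match the pattern would yield NaN.
h_num = hosts.str.extract(r"^dpc0*(\d+)n\d+$", expand=False)

host_alias = "dev_dom" + h_num
print(host_alias.tolist())
```

Unlike the lstrip version, this does not silently accept names that break the assumed pattern.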

Follow-up question asked in the comments:

It assumes that my merged csv file already has Host Alias but it doesn't, resulting in an error:

Exception has occurred: ValueError
Usecols do not match columns, columns expected but not found: ['Host Alias']
File "D:\OneDrive\python\merger.py", line 42, in content = [pd.read_csv(f,usecols = combo_headers, sep='[;]',engine='python')

Other than dpc, I also have tpc. How can I add that too? – Trunks

str.lstrip treats its argument as a set of characters and strips any of them from the left of the string, regardless of order. Just add a 't':

h_num = combo['Host Name'].str.lstrip('tdpc0').str[:-2]
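
To illustrate the character-set behaviour, here is a small sketch. The name 'cdp03n1' is made up; it starts with neither 'dpc' nor 'tpc', yet it is stripped in the same way.

```python
import pandas as pd

# 'cdp03n1' is a hypothetical name used only to show the set semantics.
hosts = pd.Series(["dpc01n1", "tpc02n1", "cdp03n1"])

# Any leading run made up of the characters t, d, p, c and 0 is removed,
# in whatever order those characters appear.
stripped = hosts.str.lstrip("tdpc0")

print(stripped.tolist())
```

This is convenient here, but it also means the approach cannot tell the two prefixes apart from other combinations of the same letters.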

As for:

It assumes that my merged csv file already has Host Alias

I'm not sure what you mean by this. When you do

combo['Host Alias'] = 'dev_dom' + h_num

The 'Host Alias' column will be created in the pandas.DataFrame should it not already exist. If it does exist then the column will be replaced with the new data returned by the operation. You can then use pandas.DataFrame.to_csv to save this DataFrame to a .csv file.
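
The ValueError quoted in the comment is what pandas.read_csv raises when usecols names a column that the input file does not contain, so it looks as if 'Host Alias' was added to the header list passed to read_csv. A minimal sketch of the order of operations follows, assuming the input files contain only the original columns; the in-memory files, the reduced column list and the name input_headers are made up for illustration.

```python
import io
import pandas as pd

# Two in-memory stand-ins for the ';'-separated .log files (made-up data).
file_a = io.StringIO("Host Name;Latency (us)\ndpc01n1;120\ndpc02n1;95\n")
file_b = io.StringIO("Host Name;Latency (us)\ndpc02n1;95\ntpc03n1;210\n")

# Only columns that exist in the input files belong in usecols,
# so 'Host Alias' is deliberately left out here.
input_headers = ["Host Name", "Latency (us)"]

content = [pd.read_csv(f, usecols=input_headers, sep=";") for f in (file_a, file_b)]
combo = pd.concat(content, ignore_index=True).drop_duplicates()

# Add the derived column after merging, then write everything out.
h_num = combo["Host Name"].str.lstrip("tdpc0").str[:-2]
combo["Host Alias"] = "dev_dom" + h_num

out = io.StringIO()
combo.to_csv(out, index=False)
print(out.getvalue())
```

In the original script, the same idea would mean keeping final_headers unchanged for read_csv and assigning combo['Host Alias'] between the drop_duplicates and to_csv steps.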



Source: https://stackoverflow.com/questions/61180155/add-additional-column-in-merged-csv-file
