Append dataframes with different column names - Pandas

借酒劲吻你 2020-12-21 16:17

I have 3 dataframes which can be generated from the code shown below

df1= pd.DataFrame({'person_id':[1,2,3],'gender': ['Male','Female','Not disclose


        
3 Answers
  • 2020-12-21 16:43

    As per the pandas documentation, you can do this by creating a mapping:

    df2.rename(columns={column1:'ethn', column2:'gen', column3:'pers_id'}, inplace=True)
    

    Now, you clearly stated that you have to do this at runtime. If you know that the number of columns and their respective positions won't change, you can collect the actual column names with list(df2.columns) (columns is an attribute, not a method), which should output something like this:

    ['ethnicity', 'gender', 'person_id']
    

    At this point, you can create the mapping as:

    final_columns = ['ethn', 'gen', 'pers_id']
    previous_columns = list(df2.columns)  # columns is an attribute, not a method
    mapping = dict(zip(previous_columns, final_columns))  # pairs old and new names by position
    

    And then just call

    df2.rename(columns=mapping, inplace=True)  # without columns=, rename targets the index labels
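
The positional mapping described in this answer can be sketched end to end as follows. This is only a minimal illustration: the frame `df` and its contents are hypothetical, and it assumes the column order is known and stable, as the answer requires.

```python
import pandas as pd

# Hypothetical frame whose columns arrive in a known, fixed order.
df = pd.DataFrame({
    "ethnicity": ["Chinese", "Indian", "European"],
    "gender": ["Male", "Female", "Not disclosed"],
    "person_id": [4, 5, 6],
})

final_columns = ["ethn", "gen", "pers_id"]

# Fail early if the frame does not have the expected number of columns.
if len(df.columns) != len(final_columns):
    raise ValueError("unexpected number of columns")

# Pair old and new names by position, then rename the columns (not the index).
mapping = dict(zip(df.columns, final_columns))
renamed = df.rename(columns=mapping)

print(list(renamed.columns))
```

Using `zip` avoids the hard-coded `range(3)`, and returning a new frame instead of `inplace=True` leaves the original untouched.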
    
  • 2020-12-21 17:02

    As described in the DataFrame.rename documentation (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html), you can pass a single mapping in which several different source column names point to the same final column name. The best approach is therefore to collect all the column names, map them to the common names you need (either manually or with some algorithm), and then run the rename command with that mapping.

    That algorithm can use either similarity in the names (for example TF-IDF) or similarity in the values of those columns.
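
A manually built version of that mapping might look like the sketch below. The `aliases` table is a hypothetical, hand-written example, and the three frames are the ones used elsewhere on this page. `rename` ignores keys that are not present in a given frame by default, so one shared mapping can be applied to all of them before appending.

```python
import pandas as pd

df1 = pd.DataFrame({"person_id": [1, 2, 3],
                    "gender": ["Male", "Female", "Not disclosed"],
                    "ethn": ["Chinese", "Indian", "European"]})
df2 = pd.DataFrame({"pers_id": [4, 5, 6],
                    "gen": ["Male", "Female", "Not disclosed"],
                    "ethnicity": ["Chinese", "Indian", "European"]})
df3 = pd.DataFrame({"son_id": [7, 8, 9],
                    "sex": ["Male", "Female", "Not disclosed"],
                    "ethnici": ["Chinese", "Indian", "European"]})

# Hypothetical alias table: every known variant points at one canonical name.
aliases = {
    "pers_id": "person_id", "son_id": "person_id",
    "gen": "gender", "sex": "gender",
    "ethn": "ethnicity", "ethnici": "ethnicity",
}

# Apply the same mapping to each frame, then append them row-wise.
frames = [df.rename(columns=aliases) for df in (df1, df2, df3)]
combined = pd.concat(frames, ignore_index=True)

print(combined)
```

Note that if a single frame contained two variants of the same name (say both `gen` and `sex`), this would produce duplicate column labels, so that case would need to be checked separately.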

  • 2020-12-21 17:05

    If you don't know the order of your columns, you could try a fuzzy matching approach. Fuzzy matching gives you a similarity score from 0 to 100 for each pair of strings, so you can choose a similarity threshold and then rename the columns that are similar enough to your desired column names. Here is my approach:

    import pandas as pd
    from fuzzywuzzy import process
    
    
    df1= pd.DataFrame({'person_id':[1,2,3],'gender': ['Male','Female','Not disclosed'],'ethn': ['Chinese','Indian','European']})
    df2= pd.DataFrame({'pers_id':[4,5,6],'gen': ['Male','Female','Not disclosed'],'ethnicity': ['Chinese','Indian','European']})
    df3= pd.DataFrame({'son_id':[7,8,9],'sex': ['Male','Female','Not disclosed'],'ethnici': ['Chinese','Indian','European']})
    
    dataFrames = [df1, df2, df3]
    
    # "sex" is not textually similar to "gender", so it is mapped explicitly
    # (rename ignores frames that have no "sex" column).
    for dataFrame in dataFrames:
      dataFrame.rename(columns={ "sex": "gender" }, inplace = True)
    
    colsToFix = ["person_id", "gender", "ethnicity"]
    replaceThreshold = 75
    
    
    ratiosPerDf = list()
    
    for i, dataFrame in enumerate(dataFrames):
      ratioDict = dict()
      for column in colsToFix:
        ratios = process.extract(column, list(dataFrame.columns))
        ratioDict[column] = ratios
      ratiosPerDf.append(ratioDict)
    
    for i, dfRatio in enumerate(ratiosPerDf):
      for column in colsToFix:
        # Keep the highest-scoring candidate that reaches the threshold.
        bestMatching = ("", 0)
        for item in dfRatio[column]:
          if item[1] >= replaceThreshold and item[1] > bestMatching[1]:
            bestMatching = item
        if bestMatching[1] >= replaceThreshold:
          print("Column : {} Best matching : {}".format(column, bestMatching[0]))
          dataFrames[i].rename(columns={ bestMatching[0] : column  }, inplace = True)
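
For readers who cannot install fuzzywuzzy, the same threshold idea can be sketched with the standard library's `difflib` instead. This is a swapped-in technique, not the one used in the answer: `difflib` scores run from 0 to 1 rather than 0 to 100, and the `build_mapping` helper, the `MANUAL` table, and the 0.6 cutoff are all assumptions made for illustration.

```python
from difflib import get_close_matches

TARGETS = ["person_id", "gender", "ethnicity"]

# Names too dissimilar for string matching are listed by hand (hypothetical table).
MANUAL = {"sex": "gender"}


def build_mapping(columns, targets=TARGETS, cutoff=0.6):
    """Return {existing_name: target_name} for one frame's column names."""
    # Start with the hand-written aliases that apply to this frame.
    mapping = {c: MANUAL[c] for c in columns if c in MANUAL}
    for target in targets:
        if target in columns or target in mapping.values():
            continue  # already present, or already resolved by hand
        # Only consider columns that have not been claimed by another target.
        candidates = [c for c in columns if c not in mapping]
        match = get_close_matches(target, candidates, n=1, cutoff=cutoff)
        if match:
            mapping[match[0]] = target
    return mapping


for cols in (["person_id", "gender", "ethn"],
             ["pers_id", "gen", "ethnicity"],
             ["son_id", "sex", "ethnici"]):
    print(cols, "->", build_mapping(cols))
```

Each resulting dictionary can be passed to `DataFrame.rename(columns=...)`, after which the frames share the same column names and can be appended with `pd.concat`.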
    
    