Google Sheets - Matching Company Names

Submitted by 依然范特西╮ on 2020-03-06 09:30:11

Question


I have two databases that both contain company names, but in different formats. I have been able to do exact matching using VLOOKUP. I also want to find companies whose names are written differently but refer to the same company, and extract their data.

Below is a small part of the databases I have:

Database 1

Column A
1-800-Flowers.com Inc
Abbott Laboratories (Abbott)
21st Century Fox America Inc (formerly News America Inc)

Column B
1234 (data I need to grab)
4567
8910

Database 2

Column C
1-800 CONTACTS INC
1-800-FLOWERS.COM
ABBOTT LABORATORIES
TWENTY-FIRST CENTURY FOX INC

Column D
ABCD (data I can ignore, as the company doesn't exist in Database 1)
EFGH (data I need, as it matches a company in Database 1)
IJK
LMNO

As you can see from the databases above, some entries in Database 1 correspond to entries in Database 2 that only share similar words, for example 21st Century Fox America Inc vs. Twenty-first Century Fox Inc.

Database 1 has 4,000+ values, while Database 2 has about 10,000. Is there code that can compare similar words between the two databases and extract the data I need from columns B and D?

I have tried QUERY, but it doesn't work the way I want it to. This is my shareable link.

So far, I have used REGEXEXTRACT to pull out the words that the names have in common, such as Century Fox in 21st Century Fox and Twenty-First Century Fox, and then tried to match the two data sets using QUERY. However, the query returns #N/A when I write it like this:

=query(E:E,"Select E where E contains '"&L2&"'",0 )

L2 is the cell that contains the string Century Fox.
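One possible cause of the #N/A, assuming column E holds the uppercase names shown in the Database 2 sample: string comparisons in QUERY, including contains, are case-sensitive, so Century Fox is not found inside TWENTY-FIRST CENTURY FOX INC. Lowercasing both sides inside the query (for instance where lower(E) contains the lowercased fragment) is a common workaround. The short Python sketch below only illustrates the difference between the two comparisons; the function names are hypothetical and it is not a replacement for the sheet formula.

```python
# Names taken from the Database 2 sample in the question.
DB2_NAMES = [
    "1-800 CONTACTS INC",
    "1-800-FLOWERS.COM",
    "ABBOTT LABORATORIES",
    "TWENTY-FIRST CENTURY FOX INC",
]


def contains_case_sensitive(names, fragment):
    """Keep names that contain the fragment exactly as typed."""
    return [name for name in names if fragment in name]


def contains_ignoring_case(names, fragment):
    """Keep names that contain the fragment once both sides are case-folded."""
    needle = fragment.casefold()
    return [name for name in names if needle in name.casefold()]


strict = contains_case_sensitive(DB2_NAMES, "Century Fox")
relaxed = contains_ignoring_case(DB2_NAMES, "Century Fox")
print("case-sensitive:", strict)
print("case-insensitive:", relaxed)
```

With the sample names, only the case-insensitive version finds the Twenty-First Century Fox row.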


Answer 1:


L2:

=ARRAYFORMULA(INDEX($E$2:$E$68,MATCH(MAX(ARRAY_CONSTRAIN(MMULT(LEN(IFERROR(VLOOKUP(SPLIT($E$2:$E$68," "),transpose(SPLIT(A2," ")),1,0))),ROW(A$1:A$7)^0),ROW(E68),7)),ARRAY_CONSTRAIN(MMULT(LEN(IFERROR(VLOOKUP(SPLIT($E$2:$E$68," "),transpose(SPLIT(A2," ")),1,0))),ROW(A$1:A$7)^0),ROW(E68),7),0)))

M2:

=ARRAYFORMULA(INDEX($E$2:$F$68,MATCH(MAX(ARRAY_CONSTRAIN(MMULT(LEN(IFERROR(VLOOKUP(SPLIT($E$2:$E$68," "),transpose(SPLIT(A2," ")),1,0))),ROW(A$1:A$7)^0),ROW(E68),7)),ARRAY_CONSTRAIN(MMULT(LEN(IFERROR(VLOOKUP(SPLIT($E$2:$E$68," "),transpose(SPLIT(A2," ")),1,0))),ROW(A$1:A$7)^0),ROW(E68),7),0),2))

N2:

=ARRAYFORMULA(TEXT(MAX(ARRAY_CONSTRAIN(MMULT(LEN(IFERROR(VLOOKUP(SPLIT($E$2:$E$68," "),transpose(SPLIT(A2," ")),1,0))),ROW(A$1:A$7)^0),ROW(E68),7))/LEN(A2),"0%"))

Drag fill down.

Notes:

  • The formula is resource intensive. Apps Script might be a better choice.

  • For the given sample, this formula works with a reasonable degree of precision.

  • 7 is the maximum number of words per cell found in all of Column E (Column C of Database 2). It is hardcoded in the formulas above; it should instead be found using a helper column. Z2: =COUNTA(SPLIT(E2," ")), drag fill down, and AA2: =MAX(Z2:Z)

  • Column N gives the degree of confidence in the VLOOKUP-produced result: the total length of the matched words divided by the length of the name in column A. Preferably, anything below 45% should be rechecked manually.

  • How it works: every entry in column E (db2) is split into words, and each word is looked up among the words of the column A (db1) entry. The lengths of the matched words are summed for each entry in column E, and the entry with the largest sum is returned as the likely match. A letter-based approach instead of a word-based one may give better precision, but seems unnecessary for the given sample.
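The word-matching idea in the notes above can be sketched outside the spreadsheet. The Python below is a minimal illustration, not a port of the formulas and not Apps Script: the function names and the DB2 list are hypothetical, the rows are the sample data from the question, and case is ignored when comparing words (as VLOOKUP does). It scores each Database 2 name by the total length of its words that also occur in the Database 1 name, picks the highest score, and reports that score divided by the length of the Database 1 name as the confidence.

```python
# Sample rows from Database 2 in the question: (company name, data).
DB2 = [
    ("1-800 CONTACTS INC", "ABCD"),
    ("1-800-FLOWERS.COM", "EFGH"),
    ("ABBOTT LABORATORIES", "IJK"),
    ("TWENTY-FIRST CENTURY FOX INC", "LMNO"),
]


def overlap_score(db1_name, db2_name):
    """Sum the lengths of db2 words that also occur as whole words in db1_name."""
    db1_words = {word.casefold() for word in db1_name.split()}
    return sum(len(word) for word in db2_name.split()
               if word.casefold() in db1_words)


def best_match(db1_name, db2_rows):
    """Return (db2_name, db2_data, confidence) for the highest-scoring row.

    Assumes db2_rows is not empty. On ties the first row wins, and a score
    of 0 still returns the first row, so low confidences need a manual check.
    """
    scores = [overlap_score(db1_name, name) for name, _ in db2_rows]
    best = max(range(len(scores)), key=scores.__getitem__)
    name, data = db2_rows[best]
    confidence = scores[best] / len(db1_name) if db1_name else 0.0
    return name, data, confidence


for company in (
    "1-800-Flowers.com Inc",
    "Abbott Laboratories (Abbott)",
    "21st Century Fox America Inc (formerly News America Inc)",
):
    name, data, confidence = best_match(company, DB2)
    print(f"{company!r} -> {name!r} ({data}), confidence {confidence:.0%}")
```

On the sample rows, all three Database 1 names select the expected Database 2 entry, but the long 21st Century Fox name scores well under the 45% threshold, which is the kind of result the notes suggest rechecking manually.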



Source: https://stackoverflow.com/questions/48482798/google-sheets-matching-company-names
