Regex to match company names from copyright statements under several conditions

刺人心 2020-12-20 03:32

I'm on a tight schedule to come up with a Python regex to match company names in many different copyright statements, for instance:

Copyright © 201         


        
1 answer
  •  挽巷 (original poster)
    2020-12-20 04:24

    You may consider a regex like

    (?i)(?:©(?:\s*Copyright)?|Copyright(?:\s*©)?)\s*\d+(?:\s*-\s*\d+)?\s*(.*?(?=\W*All\s+rights\s+reserved)|[^.]*(?=\.)|.*)
    

    See the regex demo. The pattern has to be case insensitive: keep the inline (?i) flag at the start of the pattern, or drop it and pass re.I instead.
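
As a quick check of the flag remark above, here is a minimal sketch comparing the inline `(?i)` form, the `re.I` argument form, and the pattern with no flag at all. The sample line comes from the demo input further down; the variable names are only illustrative.

```python
import re

# Core of the pattern from the answer, without any case-insensitivity flag.
CORE = (r"(?:©(?:\s*Copyright)?|Copyright(?:\s*©)?)\s*\d+(?:\s*-\s*\d+)?\s*"
        r"(.*?(?=\W*All\s+rights\s+reserved)|[^.]*(?=\.)|.*)")

line = "© 2019 Rediker Software, All Rights Reserved"

# Form 1: inline flag at the very start of the pattern.
inline = re.search("(?i)" + CORE, line)
# Form 2: flag passed as an argument.
flagged = re.search(CORE, line, re.I)
# No flag: "All Rights Reserved" no longer satisfies the lookahead,
# so the match falls through to the last alternative (rest of the line).
plain = re.search(CORE, line)

print(inline.group(1))   # Rediker Software
print(flagged.group(1))  # Rediker Software
print(plain.group(1))    # Rediker Software, All Rights Reserved
```

Either flag form works; without one, the capitalised "Rights Reserved" slips into the captured name.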

    Details

    • (?:©(?:\s*Copyright)?|Copyright(?:\s*©)?) - either
      • ©(?:\s*Copyright)? - © char followed with an optional substring of 0+ whitespaces and then Copyright
      • | - or
      • Copyright(?:\s*©)? - Copyright followed with an optional substring of 0+ whitespaces and © char
    • \s* - 0+ whitespaces
    • \d+ - 1+ digits (use \d{4} if the years always contain 4 digits)
    • (?:\s*-\s*\d+)? - an optional sequence of a - enclosed with 0+ whitespaces and then 1+ digits (use \d{4} if the years always contain 4 digits)
    • \s* - 0+ whitespaces
    • (.*?(?=\W*All\s+rights\s+reserved)|[^.]*(?=\.)|.*) - Capturing group 1: any of the alternatives:
      • .*?(?=\W*All\s+rights\s+reserved) - any 0+ chars other than line break chars, as few as possible, up to the 0+ non-word chars followed with the All rights reserved string
      • [^.]*(?=\.) - any 0+ chars other than ., as many as possible, up to the next ., not including it
      • .* - the rest of the line
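
The breakdown above can also be kept next to the pattern itself by compiling it in verbose mode. This is only a sketch: `COPYRIGHT_RX` and `company_name` are hypothetical names, and the pattern is the `[^.\n]` variant used in the Python demo.

```python
import re

# Same pattern, written with re.X so each part carries its explanation.
COPYRIGHT_RX = re.compile(
    r"""
    (?: © (?:\s*Copyright)?       # a copyright sign, optionally followed by "Copyright"
      | Copyright (?:\s*©)?       # or "Copyright", optionally followed by the sign
    )
    \s* \d+                       # year: 1+ digits
    (?: \s*-\s* \d+ )?            # optional end of a year range
    \s*
    (                             # group 1: the company name, one of:
        .*? (?= \W* All \s+ rights \s+ reserved )   # text before "All rights reserved"
      | [^.\n]* (?= \. )          # text before the last dot on the line
      | .*                        # the rest of the line
    )
    """,
    re.I | re.X,
)

def company_name(line):
    """Return the captured company name, or None if the line does not match."""
    m = COPYRIGHT_RX.search(line)
    return m.group(1) if m else None

print(company_name("Copyright © 1995 - 2018 Boeing. All Rights Reserved."))  # Boeing
print(company_name("© Copyright 1999 - 2019 CVS Health"))                    # CVS Health
print(company_name("Copyright © 1978-2019 Berkshire Hathaway Inc."))         # Berkshire Hathaway Inc
```

The three sample lines exercise the three alternatives of the capturing group in turn.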

    Python demo:

    import re
    s = "Copyright © 2019 Apple Inc. All rights reserved.\r\n© 2019 Quid, Inc. All Rights Reserved.\r\n© 2009 Database Designs \r\n© 2019 Rediker Software, All Rights Reserved\r\n©2019 EVOSUS, INC. ALL RIGHTS RESERVED\r\n© 2019 Walmart. All Rights Reserved.\r\n© Copyright 2003-2019 Exxon Mobil Corporation. All Rights Reserved.\r\nCopyright © 1978-2019 Berkshire Hathaway Inc.\r\n© 2019 McKesson Corporation\r\n© 2019 UnitedHealth Group. All rights reserved.\r\n© Copyright 1999 - 2019 CVS Health\r\nCopyright 2019 General Motors. All Rights Reserved.\r\n© 2019 Ford Motor Company\r\n©2019 AT&T Intellectual Property. All rights reserved.\r\n© 2019 GENERAL ELECTRIC\r\nCopyright ©2019 AmerisourceBergen Corporation. All Rights Reserved.\r\n© 2019 Verizon\r\n© 2019 Fannie Mae\r\nCopyright © 2018 Jonas Construction Software Inc. All rights reserved.\r\nAll Comments © Copyright 2017 Kroger | The Kroger Co. All Rights Reserved\r\n© 2019 Express Scripts Holding Company. All Rights Reserved. 1 Express Way, St. Louis, MO 63121\r\n© 2019 JPMorgan Chase & Co.\r\nCopyright © 1995 - 2018 Boeing. All Rights Reserved.\r\n© 2019 Bank of America Corporation. All rights reserved.\r\n© 1999 - 2019 Wells Fargo. All rights reserved. NMLSR ID 399801\r\n©2019 Cardinal Health. All rights reserved.\r\n© 2019 Quid, Inc All Rights Reserved."
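    # [^.\n] (rather than [^.]) keeps the second alternative from running past the end of a line.
    rx = r"(?:©(?:\s*Copyright)?|Copyright(?:\s*©)?)\s*\d+(?:\s*-\s*\d+)?\s*(.*?(?=\W*All\s+rights\s+reserved)|[^.\n]*(?=\.)|.*)"
    # With a single capturing group, re.findall returns the list of group 1 values.
    for m in re.findall(rx, s, re.I):
        print(m)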
    

    Output:

    Apple Inc
    Quid, Inc
    Database Designs 
    Rediker Software
    EVOSUS, INC
    Walmart
    Exxon Mobil Corporation
    Berkshire Hathaway Inc
    McKesson Corporation
    UnitedHealth Group
    CVS Health
    General Motors
    Ford Motor Company
    AT&T Intellectual Property
    GENERAL ELECTRIC
    AmerisourceBergen Corporation
    Verizon
    Fannie Mae
    Jonas Construction Software Inc
    Kroger | The Kroger Co
    Express Scripts Holding Company
    JPMorgan Chase & Co
    Boeing
    Bank of America Corporation
    Wells Fargo
    Cardinal Health
    Quid, Inc
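
One thing the printed output hides: `.` in Python matches `\r`, so with `\r\n`-terminated input the names captured by the last alternative can end in a carriage return, in addition to the trailing space visible after "Database Designs". A possible cleanup, sketched under the assumption that each statement sits on its own line (`extract_companies` is a hypothetical helper, not part of the answer):

```python
import re

RX = re.compile(
    r"(?:©(?:\s*Copyright)?|Copyright(?:\s*©)?)\s*\d+(?:\s*-\s*\d+)?\s*"
    r"(.*?(?=\W*All\s+rights\s+reserved)|[^.\n]*(?=\.)|.*)",
    re.I,
)

def extract_companies(text):
    """Return one cleaned company name per copyright statement found in text."""
    names = []
    for line in text.splitlines():  # splitlines() drops \r\n and \n line endings
        m = RX.search(line)
        if m:
            names.append(m.group(1).strip())  # remove leftover surrounding whitespace
    return names

sample = ("© 2009 Database Designs \r\n"
          "© 2019 McKesson Corporation\r\n"
          "Copyright 2019 General Motors. All Rights Reserved.")
print(extract_companies(sample))
# ['Database Designs', 'McKesson Corporation', 'General Motors']
```

Matching line by line also keeps the `\W*` in the lookahead from peeking into the following line.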
    
