I'm on a tight schedule to come up with a Python regex that matches company names in many different copyright statements, for instance:
Copyright © 201
You may consider a regex like
(?i)(?:©(?:\s*Copyright)?|Copyright(?:\s*©)?)\s*\d+(?:\s*-\s*\d+)?\s*(.*?(?=\W*All\s+rights\s+reserved)|[^.]*(?=\.)|.*)
The pattern must be matched case insensitively: keep the inline (?i) flag shown above, or drop it and pass re.I to the re function you call.
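To show why case insensitivity matters, here is a small sketch that runs the pattern three ways on one line from the demo input: with the inline (?i) flag, with re.I, and case sensitively. The variable names (core, inline, flagged, strict) are only illustrative.

```python
import re

# One line from the demo input; note "Rights Reserved" is capitalized.
line = "© 2019 Rediker Software, All Rights Reserved"

# The pattern without the leading (?i) flag.
core = r"(?:©(?:\s*Copyright)?|Copyright(?:\s*©)?)\s*\d+(?:\s*-\s*\d+)?\s*(.*?(?=\W*All\s+rights\s+reserved)|[^.]*(?=\.)|.*)"

inline = re.search("(?i)" + core, line)   # inline flag at the start of the pattern
flagged = re.search(core, line, re.I)     # flag passed as an argument
strict = re.search(core, line)            # case sensitive, for comparison

print(inline.group(1))   # Rediker Software
print(flagged.group(1))  # Rediker Software
print(strict.group(1))   # Rediker Software, All Rights Reserved
```

In the case-sensitive run the lookahead for "All rights reserved" never succeeds, so the last alternative (.*) takes the rest of the line.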
Details
- (?:©(?:\s*Copyright)?|Copyright(?:\s*©)?) - either of:
  - ©(?:\s*Copyright)? - a © char followed by an optional substring of 0+ whitespaces and then Copyright
  - Copyright(?:\s*©)? - Copyright followed by an optional substring of 0+ whitespaces and a © char
- \s* - 0+ whitespaces
- \d+ - 1+ digits (use \d{4} if the years always contain 4 digits)
- (?:\s*-\s*\d+)? - an optional sequence of a - enclosed with 0+ whitespaces and then 1+ digits (use \d{4} if the years always contain 4 digits)
- \s* - 0+ whitespaces
- (.*?(?=\W*All\s+rights\s+reserved)|[^.]*(?=\.)|.*) - Capturing group 1, any of the alternatives:
  - .*?(?=\W*All\s+rights\s+reserved) - any 0+ chars other than line break chars, as few as possible, up to the 0+ non-word chars followed by the All rights reserved string
  - [^.]*(?=\.) - any 0+ chars other than ., as many as possible, up to but not including a . (the Python demo uses [^.\n]* so that this alternative cannot run past the end of a line)
  - .* - the rest of the line

Python demo:
import re
s = "Copyright © 2019 Apple Inc. All rights reserved.\r\n© 2019 Quid, Inc. All Rights Reserved.\r\n© 2009 Database Designs \r\n© 2019 Rediker Software, All Rights Reserved\r\n©2019 EVOSUS, INC. ALL RIGHTS RESERVED\r\n© 2019 Walmart. All Rights Reserved.\r\n© Copyright 2003-2019 Exxon Mobil Corporation. All Rights Reserved.\r\nCopyright © 1978-2019 Berkshire Hathaway Inc.\r\n© 2019 McKesson Corporation\r\n© 2019 UnitedHealth Group. All rights reserved.\r\n© Copyright 1999 - 2019 CVS Health\r\nCopyright 2019 General Motors. All Rights Reserved.\r\n© 2019 Ford Motor Company\r\n©2019 AT&T Intellectual Property. All rights reserved.\r\n© 2019 GENERAL ELECTRIC\r\nCopyright ©2019 AmerisourceBergen Corporation. All Rights Reserved.\r\n© 2019 Verizon\r\n© 2019 Fannie Mae\r\nCopyright © 2018 Jonas Construction Software Inc. All rights reserved.\r\nAll Comments © Copyright 2017 Kroger | The Kroger Co. All Rights Reserved\r\n© 2019 Express Scripts Holding Company. All Rights Reserved. 1 Express Way, St. Louis, MO 63121\r\n© 2019 JPMorgan Chase & Co.\r\nCopyright © 1995 - 2018 Boeing. All Rights Reserved.\r\n© 2019 Bank of America Corporation. All rights reserved.\r\n© 1999 - 2019 Wells Fargo. All rights reserved. NMLSR ID 399801\r\n©2019 Cardinal Health. All rights reserved.\r\n© 2019 Quid, Inc All Rights Reserved."
rx = r"(?:©(?:\s*Copyright)?|Copyright(?:\s*©)?)\s*\d+(?:\s*-\s*\d+)?\s*(.*?(?=\W*All\s+rights\s+reserved)|[^.\n]*(?=\.)|.*)"
# With a single capturing group, re.findall returns the group 1 values.
for m in re.findall(rx, s, re.I):
    print(m)
Output:
Apple Inc
Quid, Inc
Database Designs
Rediker Software
EVOSUS, INC
Walmart
Exxon Mobil Corporation
Berkshire Hathaway Inc
McKesson Corporation
UnitedHealth Group
CVS Health
General Motors
Ford Motor Company
AT&T Intellectual Property
GENERAL ELECTRIC
AmerisourceBergen Corporation
Verizon
Fannie Mae
Jonas Construction Software Inc
Kroger | The Kroger Co
Express Scripts Holding Company
JPMorgan Chase & Co
Boeing
Bank of America Corporation
Wells Fargo
Cardinal Health
Quid, Inc
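If the pattern is hard to read on one line, the same regex can be laid out with re.VERBOSE so that each part carries its own comment. The sketch below also strips the captured names, since the last alternative (.*) can pick up trailing spaces or a stray \r when the input uses \r\n line endings. The names COPYRIGHT_RX, company_names and sample are hypothetical, chosen only for this example.

```python
import re

# Same pattern as in the demo, written in verbose mode.
COPYRIGHT_RX = re.compile(r"""
    (?:©(?:\s*Copyright)?|Copyright(?:\s*©)?)   # copyright sign and/or the word Copyright
    \s*\d+                                      # first year
    (?:\s*-\s*\d+)?                             # optional end of a year range
    \s*
    (                                           # group 1: the company name
        .*?(?=\W*All\s+rights\s+reserved)       #   text before All rights reserved
      | [^.\n]*(?=\.)                           #   or text before the first dot on the line
      | .*                                      #   or the rest of the line
    )
""", re.I | re.VERBOSE)


def company_names(text):
    # Strip surrounding whitespace, including a trailing \r from \r\n input.
    return [m.group(1).strip() for m in COPYRIGHT_RX.finditer(text)]


sample = (
    "© 2009 Database Designs \r\n"
    "© 2019 Rediker Software, All Rights Reserved\r\n"
    "Copyright © 1995 - 2018 Boeing. All Rights Reserved."
)
print(company_names(sample))
# ['Database Designs', 'Rediker Software', 'Boeing']
```

Without the strip() call, the first name would come back as "Database Designs \r", which looks fine when printed but breaks comparisons.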