Extract number between text and | with RegEx Python

Question

I want to extract the text between CVE and |, but only for the first occurrence of CVE in the txt file.

This is the code I have so far:

import re

with open('/Users/anna/PycharmProjects/extractData/DiarioOficial/aaa1381566.pdf.txt', 'r') as f:
    mensaje = f.read()
mensaje = mensaje.replace("\n", "")

print(re.findall(r'\sCVE\s+([^|]*)', mensaje))

Here is the txt file:

CVE 1381566     

|     

Director: Juan Jorge Lazo Rodríguez    

Sitio Web:   

 www.diarioficial.cl    

|     

Mesa Central:   

 +562 2486 3600    

    Email:    

consultas@diarioficial.cl   

Dirección:    

Dr. Torres Boonen N°511, Providencia, Santiago, Chile.       

Este documento ha sido firmado electrónicamente de acuerdo con la ley N°19.799 e incluye sellado de tiempo y firma electrónica  

avanzada. Para verificar la autenticidad de una representación impresa del mismo, ingrese este código en el sitio web www.diarioficial.cl                           

DIARIO OFICIAL    

DE LA REPUBLICA DE CHILE    

Ministerio del Interior y Seguridad Pública      

V    

SECCIÓN       

CONSTITUCIONES, MODIFICACIONES Y DISOLUCIONES DE SOCIEDADES Y COOPERATIVAS                      

Núm. 42.031    

|    

Viernes 13 de Abril de 2018    

|    

Página 1 de 1      

Empresas y Cooperativas    

CVE 1381566        

EXTRACTO     

     

MARÍA SOLEDAD LÁSCAR MERINO, Notario Público Titular de la Sexta Notaría de  

Antofagasta, Prat Nº 482, local 25, certifica: Escritura hoy ante mí: CARLOS ANDRES ROJAS  

ANGEL, calle Antilhue Nº 1613; CAROLINA ANDREA ROJAS VALERO, calle Catorce de  

Febrero Nº 2339; NADIA TATIANA LEON BELMAR, calle Azapa Nº 4831; MARIO  

ANTONIO LUQUE HERRERA, calle Huanchaca Nº 398; PEDRO EDUARDO BARRAZA  

ZAPATA, Avenida Andrés Sabella Nº 2766; JOSE ANTONIO REYES RASSE, calle Altos del  

Mar Nº 1147, casa 15; y PATRICIA ALICIA MARCHANT ROJAS, calle Ossa N° 2741; todos  

domicilios Antofagasta, rectificaron y complementaron sociedad "CENTRO DE  

ACONDICIONAMIENTO FISICO LEFTRARU LIMITADA, LEFTRARU LIMITADA  

nombre de fantasía "LEFTRARU BOX LTDA"., constituida escritura este oficio, fecha 20 de  

febrero de 2018, publicada en extracto Diario Oficial fecha 13 de marzo de 2018, edición Nº  

42006; sentido señalar que la razón social correcta de la sociedad es: CENTRO DE  

ACONDICIONAMIENTO FISICO LEFTRARU LIMITADA; y su nombre de fantasía es  

LEFTRARU BOX LTDA.; y no "CENTRO DE ACONDICIONAMIENTO FISICO  

LEFTRARU, y nombre fantasía "LEFTRARU LTDA"., como erróneamente allí se menciona.-  

Demás estipulaciones escritura.- ANTOFAGASTA, 27 de marzo de 2018.-

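Answer 1:

Instead of requiring a whitespace character \s before CVE, you could match zero or more whitespace characters with \s* (or assert the start of the string with ^), and use re.search, which returns only the first location where the pattern produces a match. The leading \s in your pattern is why the first CVE is skipped: it sits at the very start of the file, with no whitespace before it.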

Then get the value from the capturing group:

mensaje = mensaje.replace("\n", "")
regex = r"\s*CVE\s+([^|]*)"
matches = re.search(regex, mensaje)  # search stops at the first match
if matches:
    print(matches.group(1).strip())  # 1381566
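
For reference, here is a self-contained sketch of the same idea that runs without the original file. The `sample` string is a shortened, hypothetical stand-in for the file contents, and `first_cve` is a helper name introduced only for this illustration. Since `[^|]` also matches newlines, the sketch skips the `replace("\n", "")` step.

```python
import re

# Hypothetical, shortened stand-in for the contents of the .txt file.
sample = (
    "CVE 1381566     \n\n|     \n\n"
    "Director: Juan Jorge Lazo Rodríguez    \n\n|     \n\n"
    "Empresas y Cooperativas    \n\n"
    "CVE 1381566        \n\nEXTRACTO     \n"
)


def first_cve(text):
    """Return the text between the first 'CVE' and the next '|', or None."""
    # re.search returns only the first match, unlike re.findall.
    match = re.search(r"CVE\s+([^|]*)", text)
    return match.group(1).strip() if match else None


print(first_cve(sample))  # 1381566
```

If the value is always numeric, a stricter pattern such as `r"CVE\s+(\d+)"` would avoid depending on the `|` separator at all.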


Answer 2:

A solution with split:

number = mensaje.split('CVE')[1].split('|')[0].strip()
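
Note that the one-liner raises an IndexError when the text contains no "CVE". A possible variant, sketched here with `str.partition` and a hypothetical helper name `first_cve_by_split`, returns None in that case instead:

```python
def first_cve_by_split(text):
    """Return the text between the first 'CVE' and the next '|', or None."""
    # partition splits at the first occurrence only; `found` is empty
    # when the separator does not appear in the text.
    _, found, rest = text.partition("CVE")
    if not found:
        return None
    return rest.partition("|")[0].strip()


# Hypothetical, shortened stand-in for the file contents.
sample = "CVE 1381566     |     Director: ...    CVE 1381566        EXTRACTO"
print(first_cve_by_split(sample))  # 1381566
```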

Source: https://stackoverflow.com/questions/51477118/extract-number-between-text-and-with-regex-python

Tags

python

regex

extract