How to add information found with regular expressions in a CSV file

问题

I am trying to "append" new information to a CSV file. The problem is that that information is not in a dataframe structure, but is information extracted from a text using regular expressions. The sample text would be the next one:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam id diam posuere, eleifend diam at, condimentum justo. Pellentesque mollis a diam id consequat.

TITLE-SDFSD-DFDS-SFDS-01-01: This is the title 1 that

is split into two lines with a blank line in the middle

Conditions Pellentesque blandit scelerisque pellentesque. Sed nec quam purus. Quisque nec tellus sed neque accumsan lacinia sit amet sit amet tellus. Etiam venenatis nibh vel pellentesque elementum. Nullam eget tortor quam. Morbi sed leo et arcu aliquet luctus.

Opening date 15 Apr 2021

Deadline 26 Aug 2021

Indicative budget: The total indicative budget for the topic is EUR 20.00 million.

TITLE-SDFSD-DFDS-SFDS-01-02; This is the title2 in one single line

Conditions Cras egestas consectetur sapien at dignissim. Maecenas commodo purus nibh, a tempus augue vestibulum feugiat. Vestibulum dolor neque, sagittis ut tortor et, lobortis faucibus quam.

Opening date 15 March 2021

Deadline 17 Aug 2021

Indicative budget: The total indicative budget for the topic is EUR 15.00 million.

TITLE-SDFSD-DFDS-SFDS-01-03: This is the title3 that is too long and takes two lines

Conditions Cras egestas consectetur sapien at dignissim. Maecenas commodo purus nibh, a tempus augue vestibulum feugiat. Vestibulum dolor neque, sagittis ut tortor et, lobortis faucibus quam.

Opening date 15 May 2021

Deadline 26 Sep 2021

Indicative budget: The total indicative budget for the topic is EUR 5.00 million.

To extract all the information, I have to make several interactions to extract the information I need. I know it is possible to make just one iteration subdividing into several groups what I need, but it is very hard for me to find just a regular expression that works. Instead, I am using several of them:

import re
import csv
    
with open('doubt2.txt','r', encoding="utf-8") as f:
    f_contents = f.read()

regexHOR =r'\n(TITLE-\S+-\d{2}-\d{2})[:|;](.*?)^Conditions'
regexOD = r'^Opening date\s+(\d{1,2} \w+ \d{4})\s*?'
regexDL =r'^Deadline\s+(\d+ \w+ \d+)'

patternHOR = re.compile(regexHOR, re.MULTILINE | re.DOTALL)
patternOD = re.compile(regexOD, re.MULTILINE | re.DOTALL)
patternDL = re.compile(regexDL, re.MULTILINE | re.DOTALL)

matchesHOR = patternHOR.finditer(f_contents)
matchesOD = patternOD.finditer(f_contents)
matchesDL = patternDL.finditer(f_contents)

marchesHOR finds two groups, whereas the other matches are just one group. Once I have the matches I have to export it in a CSV file executing the next code:

with open("result.csv", "w",newline='') as outfile:
    csvfile = csv.writer(outfile)
    csvfile.writerow(['Topic ID', 'Title', 'Opening date', 'Deadline'])
    for match in matchesHOR:
        csvfile.writerow([match.group(1), match.group(2).replace('\n', ' '),'',''])
    for match in matchesOD:
        csvfile.writerow(['','',match.group(1),''])
    for match in matchesDL:
        csvfile.writerow(['','','',match.group(1)])

The problem is that when I write the new nows after the matchesHOR it put me below, as you can see in this table:

CODE	TITLE	Opening	Deadline
CODE 1	TITLE 1
CODE 2	TITLE 2
CODE 3	TITLE 3
		OPENING 1
		OPENING 2
		OPENING 3
			DEADLINE 1
			DEADLINE 2
			DEADLINE 3

Any additional comment to perform the four interaction to identify the several groups are welcome

回答1:

You need to rearrange things a bit so that all items are written for a row at the same time. The approach here is to use the match_hor to find each title start, and to then use this as a starting point for the match_od, which is in turn used as a starting point for the match_dl.

import re
import csv
    
with open('doubt2.txt','r', encoding="utf-8") as f:
    f_contents = f.read()

regexHOR = r'\n(TITLE-\S+-\d{2}-\d{2})[:|;](.*?)^Conditions'
regexOD = r'^Opening date\s+(\d{1,2} \w+ \d{4})\s*?'
regexDL =r'^Deadline\s+(\d+ \w+ \d+)'

patternHOR = re.compile(regexHOR, re.MULTILINE | re.DOTALL)
patternOD = re.compile(regexOD, re.MULTILINE | re.DOTALL)
patternDL = re.compile(regexDL, re.MULTILINE | re.DOTALL)

with open("result.csv", "w",newline='') as outfile:
    csvfile = csv.writer(outfile)
    csvfile.writerow(['Topic ID', 'Title', 'Opening date', 'Deadline'])
    
    for match_hor in patternHOR.finditer(f_contents):
        code, title = [match_hor.group(1), match_hor.group(2).replace('\n', ' ')]
        offset = match_hor.end()
        
        match_od = patternOD.search(f_contents[offset:])
        offset += match_od.end()
        opening = match_od.group(1)
        
        match_dl = patternDL.search(f_contents[offset:]) 
        offset += match_dl.end()
        deadline = match_dl.group(1)
        
        csvfile.writerow([code, title.strip(), opening, deadline])

This would give you result.csv containing:

Topic ID,Title,Opening date,Deadline
TITLE-SDFSD-DFDS-SFDS-01-01,This is the title 1 that  is split into two lines with a blank line in the middle,15 Apr 2021,26 Aug 2021
TITLE-SDFSD-DFDS-SFDS-01-02,This is the title2 in one single line,15 March 2021,17 Aug 2021
TITLE-SDFSD-DFDS-SFDS-01-03,This is the title3 that is too long and takes two lines,15 May 2021,26 Sep 2021

回答2:

I propose to you the following code, using positive lookahead, lookbehind and namedgroup, as below:

>>> regexHOR = r'(?P<TopicID>TITLE-\S+-\d{2}-\d{2})[:;]\s*(?P<Title>[\w\s]+(?=Conditions))'
>>>
>>> regexOD = r'(?P<OpeningDate>(?<=Opening date )\d{1,2} \w+ \d{4})'
>>>
>>> regexDL = r'(?P<DeadLine>(?<=Deadline )\d+ \w+ \d+)'
>>>
>>>regex_pattern = re.compile('.*?'.join([regexHOR, regexOD, regexDL]), re.MULTILINE | re.DOTALL)
>>>
>>> for match in re.finditer(regex_pattern, f_contents):
        csvfile.writerow([match.group('TopicID'), match.group('Title'), \
        match.group('OpeningDate'), match.group('DeadLine')])

Each time you call csvfile.writerow, a new row is written, that's why you don't have all items of each loop iteration written on the same row.

来源：https://stackoverflow.com/questions/66048123/how-to-add-information-found-with-regular-expressions-in-a-csv-file

标签

python

regex

csv

regex-group