问题
I am still on a basic beginner level with r. I am currently working on some natural language stuff and I use the ProQuest Newsstand database. Even though the database allows to download txt files, I don't need everything they provide. The files you can download there look like this:
###############################################################################
____________________________________________________________
Report Information from ProQuest 16 July 2016 09:58
____________________________________________________________
____________________________________________________________
Inhaltsverzeichnis
1. Savills cracks Granite deal to establish US presence ; COMMERCIAL PROPERTY
____________________________________________________________
Dokument 1 von 1
Savills cracks Granite deal to establish US presence ; COMMERCIAL PROPERTY
http:...
Kurzfassung: Savills said that as part of its plans to build...
Links: ...
Volltext: Property agency Savills yesterday snapped up US real estate banking firm Granite Partners...
Unternehmen/Organisation: Name: Granite Partners LP; NAICS: 525910
Titel: Savills cracks Granite deal to establish US presence; COMMERCIAL PROPERTY: [FIRST Edition]
Autor: Steve Pain Commercial Property Editor
Titel der Publikation: Birmingham Post
Seiten: 30
Seitenanzahl: 0
Erscheinungsjahr: 2007
Publikationsdatum: Aug 2, 2007
Jahr: 2007
Bereich: Business
Herausgeber: Mirror Regional Newspapers
Verlagsort: Birmingham (UK)
Publikationsland: United Kingdom
Publikationsthema: General Interest Periodicals--Great Britain
Quellentyp: Newspapers
Publikationssprache: English
Dokumententyp: NEWSPAPER
ProQuest-Dokument-ID: 324215031
Dokument-URL: ...
Copyright: (Copyright 2007 Birmingham Post and Mail Ltd.)
Zuletzt aktualisiert: 2010-06-19
Datenbank: UK Newsstand
____________________________________________________________
Kontaktieren Sie uns unter: http... Copyright © 2016 ProQuest LLC. Alle Rechte vorbehalten. Allgemeine Geschäftsbedingungen: ...
###############################################################################
What I need is a way to extract only the full text to a csv file. The reason is, when I download hundreds of articles within one file it is quite difficult to copy and paste them manually and I think the file is quite structured. However, the length of text varies. Nevertheless, one could use the next header after the full text as a stop sign (I guess).
Is there any way to do this?
I really would appreciate some help. Kind regards, Steffen
回答1:
Lets say you have all publication information in a single text file make a copy of your file for reset first. Using Notepad++ and RegEx you'd go through following steps:
- Ctrl+F
- Choose the Mark tab.
- Search mode: Regular expression
- Find what:
^Volltext:\s
- Alt+M to check
Bookmark line
(if unchecked only) - Click on Mark All
From the main menu go to: Search > Bookmark > Remove Unmarked Lines
In a third step go through following steps:
- Ctrl+H
- Search mode: Regular expression
- Find what:
^Volltext:\s
(choose from dropdown) - Replace with: NOTHING (clear text field)
- Click on Replace All
Done ...
回答2:
Try this out:
con <- file("./R/sample text.txt")
content <- paste(readLines(con),collapse="\n")
content <- gsub(pattern = "\\n\\n", replacement = "\n", x = content)
close(con)
content.filtered <- sub(pattern = "(.*)(Volltext:.*?)(_{10,}.*)",
replacement = "\\2", x=content)
Results:
> cat(content.filtered)
Volltext: Property agency Savills yesterday snapped up US real estate banking firm Granite Partners...
Unternehmen/Organisation: Name: Granite Partners LP; NAICS: 525910
Titel: Savills cracks Granite deal to establish US presence; COMMERCIAL PROPERTY: [FIRST Edition]
Autor: Steve Pain Commercial Property Editor
Titel der Publikation: Birmingham Post
Seiten: 30
Seitenanzahl: 0
Erscheinungsjahr: 2007
Publikationsdatum: Aug 2, 2007
Jahr: 2007
Bereich: Business
Herausgeber: Mirror Regional Newspapers
Verlagsort: Birmingham (UK)
Publikationsland: United Kingdom
Publikationsthema: General Interest Periodicals--Great Britain
Quellentyp: Newspapers
Publikationssprache: English
Dokumententyp: NEWSPAPER
ProQuest-Dokument-ID: 324215031
Dokument-URL: ...
Copyright: (Copyright 2007 Birmingham Post and Mail Ltd.)
Zuletzt aktualisiert: 2010-06-19
Datenbank: UK Newsstand
来源:https://stackoverflow.com/questions/38410186/extract-relevant-text-from-a-txt-file-in-r