Extract relevant text from a .txt file in R

Submitted by 北慕城南 on 2019-12-25 19:11:02

Question


I am still at a basic beginner level with R. I am currently working on some natural language stuff using the ProQuest Newsstand database. Even though the database lets me download txt files, I don't need everything they contain. The files you can download there look like this:

###############################################################################
____________________________________________________________

Report Information from ProQuest 16 July 2016 09:58
____________________________________________________________




____________________________________________________________

Inhaltsverzeichnis

1. Savills cracks Granite deal to establish US presence ; COMMERCIAL PROPERTY

____________________________________________________________

Dokument 1 von 1

Savills cracks Granite deal to establish US presence ; COMMERCIAL PROPERTY

http:...

Kurzfassung: Savills said that as part of its plans to build...

Links: ...

Volltext: Property agency Savills yesterday snapped up US real estate banking firm Granite Partners...

Unternehmen/Organisation: Name: Granite Partners LP; NAICS: 525910

Titel: Savills cracks Granite deal to establish US presence; COMMERCIAL PROPERTY:   [FIRST Edition]

Autor: Steve Pain Commercial Property Editor

Titel der Publikation: Birmingham Post

Seiten: 30

Seitenanzahl: 0

Erscheinungsjahr: 2007

Publikationsdatum: Aug 2, 2007

Jahr: 2007

Bereich: Business

Herausgeber: Mirror Regional Newspapers

Verlagsort: Birmingham (UK)

Publikationsland: United Kingdom

Publikationsthema: General Interest Periodicals--Great Britain

Quellentyp: Newspapers

Publikationssprache: English

Dokumententyp: NEWSPAPER

ProQuest-Dokument-ID: 324215031

Dokument-URL: ...

Copyright: (Copyright 2007 Birmingham Post and Mail Ltd.)

Zuletzt aktualisiert: 2010-06-19

Datenbank: UK Newsstand

____________________________________________________________

Kontaktieren Sie uns unter: http... Copyright © 2016 ProQuest LLC. Alle Rechte vorbehalten. Allgemeine Geschäftsbedingungen:  ...

###############################################################################

What I need is a way to extract only the full text to a csv file. The reason is that when I download hundreds of articles in one file, it is quite tedious to copy and paste them manually, and I think the file is fairly well structured. However, the length of the text varies. Still, one could use the next header after the full text as a stop sign (I guess).

Is there any way to do this?

I would really appreciate some help. Kind regards, Steffen


Answer 1:


Let's say you have all the publication information in a single text file. Make a backup copy of the file first so you can start over if needed. Using Notepad++ and regular expressions, go through the following steps:

  • Ctrl+F
  • Choose the Mark tab.
  • Search mode: Regular expression
  • Find what: ^Volltext:\s
  • Alt+M to check "Bookmark line" (only if it is not already checked)
  • Click on Mark All

From the main menu go to: Search > Bookmark > Remove Unmarked Lines

As a third step, go through the following steps:

  • Ctrl+H
  • Search mode: Regular expression
  • Find what: ^Volltext:\s (choose from dropdown)
  • Replace with: nothing (leave the field empty)
  • Click on Replace All

Done ...
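
Since the question asks for R, the same mark-and-strip idea can be sketched there too. This is only a minimal sketch and not part of the original answer: it assumes, like the Notepad++ steps, that each article's full text sits on a single line starting with "Volltext: ". The sample lines stand in for readLines() on the real file, and the output file name is made up for illustration.

```r
# Hypothetical sample standing in for readLines("your_file.txt")
lines <- c(
  "Kurzfassung: Savills said that as part of its plans to build...",
  "",
  "Volltext: Property agency Savills yesterday snapped up US real estate banking firm Granite Partners...",
  "",
  "Titel der Publikation: Birmingham Post"
)

# Equivalent of "Mark All" + "Remove Unmarked Lines":
# keep only the lines that start with the "Volltext:" label
marked <- grep("^Volltext:\\s", lines, value = TRUE)

# Equivalent of the "Replace All" step: strip the label itself
full_text <- sub("^Volltext:\\s", "", marked)

# One row per article; the output path is an assumption for this sketch
articles <- data.frame(full_text = full_text, stringsAsFactors = FALSE)
out_file <- file.path(tempdir(), "fulltext.csv")
write.csv(articles, out_file, row.names = FALSE)
```

If the full text of an article spans several lines in your export, this will only catch the first line of each article, so check a few articles by hand.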




Answer 2:


Try this out:

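# Read the whole file into a single string
con <- file("./R/sample text.txt")
content <- paste(readLines(con), collapse = "\n")
close(con)
# Collapse the blank lines between the fields
content <- gsub(pattern = "\\n\\n", replacement = "\n", x = content)
# Keep the text from the last "Volltext:" up to the following line of underscores
content.filtered <- sub(pattern = "(.*)(Volltext:.*?)(_{10,}.*)",
                        replacement = "\\2", x = content)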

Results:

> cat(content.filtered)
Volltext: Property agency Savills yesterday snapped up US real estate banking firm Granite Partners...
Unternehmen/Organisation: Name: Granite Partners LP; NAICS: 525910
Titel: Savills cracks Granite deal to establish US presence; COMMERCIAL PROPERTY:   [FIRST Edition]
Autor: Steve Pain Commercial Property Editor
Titel der Publikation: Birmingham Post
Seiten: 30
Seitenanzahl: 0
Erscheinungsjahr: 2007
Publikationsdatum: Aug 2, 2007
Jahr: 2007
Bereich: Business
Herausgeber: Mirror Regional Newspapers
Verlagsort: Birmingham (UK)
Publikationsland: United Kingdom
Publikationsthema: General Interest Periodicals--Great Britain
Quellentyp: Newspapers
Publikationssprache: English
Dokumententyp: NEWSPAPER
ProQuest-Dokument-ID: 324215031
Dokument-URL: ...
Copyright: (Copyright 2007 Birmingham Post and Mail Ltd.)
Zuletzt aktualisiert: 2010-06-19
Datenbank: UK Newsstand
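
The result above still contains the metadata fields, and the pattern only picks up one article per file. A possible variation, going by the "stop sign" idea from the question, is to work line by line and cut each article at the next field header. This is a rough sketch under stated assumptions, not the answerer's method: extract_fulltext and its stop_pattern are hypothetical names, the stop list only covers labels seen in the sample (Unternehmen/Organisation, Titel, Autor) plus the underscore separator, and the sample lines are invented for illustration. If your export has other fields directly after the full text, the stop list needs extending.

```r
# Hypothetical helper: returns one string per "Volltext:" block found in `lines`
extract_fulltext <- function(lines,
                             stop_pattern = "^(Unternehmen/Organisation|Titel|Autor):\\s|^_{10,}") {
  starts <- grep("^Volltext:\\s", lines)   # first line of each full text
  stops  <- grep(stop_pattern, lines)      # candidate "stop sign" lines
  vapply(starts, function(s) {
    nxt <- stops[stops > s]
    # End just before the next stop line, or at the end of the file
    e <- if (length(nxt) > 0) nxt[1] - 1 else length(lines)
    block <- lines[s:e]
    block <- block[nzchar(trimws(block))]  # drop blank lines
    sub("^Volltext:\\s", "", paste(block, collapse = " "))
  }, character(1))
}

# Invented sample with two articles, standing in for readLines() on a real export
sample_lines <- c(
  "Dokument 1 von 2",
  "Volltext: First paragraph of article one.",
  "",
  "Second paragraph of article one.",
  "",
  "Unternehmen/Organisation: Name: Granite Partners LP; NAICS: 525910",
  "____________________________________________________________",
  "Dokument 2 von 2",
  "Volltext: Only paragraph of article two.",
  "",
  "Titel: Some title",
  "____________________________________________________________"
)

texts <- extract_fulltext(sample_lines)

# One row per article; the output path is an assumption for this sketch
out_file <- file.path(tempdir(), "fulltext.csv")
write.csv(data.frame(full_text = texts, stringsAsFactors = FALSE),
          out_file, row.names = FALSE)
```

Working on lines instead of one big string keeps the regular expressions short and makes it easy to see where each article starts and stops.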


Source: https://stackoverflow.com/questions/38410186/extract-relevant-text-from-a-txt-file-in-r
