Parsing Interview Text

Submitted by 社会主义新天地 on 2021-02-20 04:14:07

Question


I have a text file of a presidential debate. Eventually, I want to parse the text into a dataframe where each row is a statement, with one column with the speaker's name and another column with the statement. For example:

"Bob Smith: Hi Steve. How are you doing? Steve Brown: Hi Bob. I'm doing well!"

Would become:

  name         text
1 Bob Smith    Hi Steve. How are you doing?
2 Steve Brown  Hi Bob. I'm doing well!

Question: How do I split the statements from the names? I tried splitting on the colon:

data <- strsplit(data, split=":")

But then I get this:

"Bob Smith" "Hi Steve. How are you doing? Steve Brown" "Hi Bob. I'm doing well!"

When what I want is this:

"Bob Smith" "Hi Steve. How are you doing?" "Steve Brown" "Hi Bob. I'm doing well!"

Answer 1:


I doubt this will cover all of your parsing needs, but your most immediate question can be solved with strsplit and lookaround assertions. Lookaround requires Perl-compatible regular expressions, so you need to set perl=TRUE.

Here you instruct strsplit to split either on a colon followed by a space, or on a space that has a punctuation character immediately before it and nothing but word characters or spaces between it and the next colon. \\pP matches punctuation characters and \\w matches word characters (letters, digits, and underscore).

data <- "Bob Smith: Hi Steve. How are you doing? Steve Brown: Hi Bob. I'm doing well!"
strsplit(data, split = "(: |(?<=\\pP) (?=[\\w ]+:))", perl = TRUE)
#> [[1]]
#> [1] "Bob Smith"                    "Hi Steve. How are you doing?" "Steve Brown"
#> [4] "Hi Bob. I'm doing well!"
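
To get from the split pieces to the data frame asked for in the question, one option is to take the odd-numbered pieces as names and the even-numbered pieces as statements. This is only a sketch: the variable names parts and debate are made up for illustration, and it assumes the pieces strictly alternate name, statement, name, statement, which will not hold if a statement itself contains a colon.

```r
data <- "Bob Smith: Hi Steve. How are you doing? Steve Brown: Hi Bob. I'm doing well!"

# Split into pieces using the same pattern as above; [[1]] pulls the
# character vector out of the list that strsplit() returns.
parts <- strsplit(data, split = "(: |(?<=\\pP) (?=[\\w ]+:))", perl = TRUE)[[1]]

# Assumption: pieces alternate name, text, name, text, ...
stopifnot(length(parts) %% 2 == 0)

# The logical index is recycled, so c(TRUE, FALSE) picks positions 1, 3, ...
# and c(FALSE, TRUE) picks positions 2, 4, ...
debate <- data.frame(
  name = parts[c(TRUE, FALSE)],
  text = parts[c(FALSE, TRUE)],
  stringsAsFactors = FALSE
)
debate
```

The stopifnot() check is there so that a transcript which breaks the alternation fails early instead of silently pairing names with the wrong statements.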



Answer 2:


We can extract these with a regex using the stringr package. str_match_all returns a matrix whose second and third columns are the capture groups, which gives you the speaker and quote columns you are looking for directly.

a <- "Bob: Hi Steve. Steve: Hi Bob."

library(stringr)

str_match_all(a, "([A-Za-z]*?): (.*?\\.)")
#> [[1]]
#>      [,1]             [,2]    [,3]       
#> [1,] "Bob: Hi Steve." "Bob"   "Hi Steve."
#> [2,] "Steve: Hi Bob." "Steve" "Hi Bob."
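
A minimal sketch of turning that match matrix into a data frame follows; the names m and quotes are hypothetical. Note that the pattern as written only captures single-word names and stops each quote at the first period, so it would need adjusting for the "Bob Smith" example in the question.

```r
library(stringr)

a <- "Bob: Hi Steve. Steve: Hi Bob."

# str_match_all() returns a list with one matrix per input string.
# Column 1 is the full match; columns 2 and 3 are the capture groups.
m <- str_match_all(a, "([A-Za-z]*?): (.*?\\.)")[[1]]

# Build the data frame from the two capture-group columns.
quotes <- data.frame(
  name = m[, 2],
  text = m[, 3],
  stringsAsFactors = FALSE
)
quotes
```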


Source: https://stackoverflow.com/questions/60778339/parsing-interview-text
