Question
I have a batch of data that includes a text variable full of free-form text. I am trying to extract certain information based on context within the string into new variables which I can then analyze.
I have been digging into qdap and tm. I have standardized the format with tolower and replace_abbreviation, but I cannot figure out how to actually extract the information I need.
So for example,
library(data.table)
data<-data.table(text=c("Person 1: $1000 fine, 31 months jail",
"Person 2: $500 fine, 45 days jail"))
text
1: Person 1: $1000 fine, 31 months jail
2: Person 2: $500 fine, 45 days jail
What I would like to do is extract each number based on the term that follows it, creating two additional variables, months and days, that hold the corresponding values:
data<-data.table(text=c("Person 1: $1000 fine, 31 months jail",
"Person 2: $500 fine, 45 days jail"),
months=c("31",""),
days=c("","45"))
text months days
1: Person 1: $1000 fine, 31 months jail 31
2: Person 2: $500 fine, 45 days jail 45
I have scoured Stack Overflow and have not found any answers to this so hopefully I didn't just miss one. But any help anyone could offer will be very much appreciated. Still pretty new at text analysis.
Thank you for your time!
Answer 1:
Using stringr::str_extract() with a positive lookahead, you can do something like this:
data <- dplyr::mutate(data,
months = stringr::str_extract(text, "\\d+(?=\\smonths)"),
days = stringr::str_extract(text, "\\d+(?=\\sdays)"))
## text months days
## 1 Person 1: $1000 fine, 31 months jail 31 <NA>
## 2 Person 2: $500 fine, 45 days jail <NA> 45
The above regex makes some assumptions about the text string: that exactly one whitespace character separates the number from the unit, and that the unit is always plural. Something more flexible would be:
data<-data.table(text=c("Person 1: $1000 fine, 31 months jail",
"Person 2: $500 fine, 45 days jail",
"Person 3: $1000 fine, 1 month 1 day jail"))
data <- dplyr::mutate(data,
months = stringr::str_extract(text, "\\d+(?=\\s*months?)"),
days = stringr::str_extract(text, "\\d+(?=\\s*days?)"))
## text months days
## 1 Person 1: $1000 fine, 31 months jail 31 <NA>
## 2 Person 2: $500 fine, 45 days jail <NA> 45
## 3 Person 3: $1000 fine, 1 month 1 day jail 1 1
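As a minimal sketch, the two lookahead patterns could be folded into one helper that also returns integers ready for analysis. This assumes stringr is installed; extract_count is a hypothetical name (not part of stringr), and the trailing \\b word boundary is an added assumption so that a longer word starting with the unit name does not count as a match.

```r
library(stringr)

# Hypothetical helper (not part of stringr): return the integer that
# directly precedes `unit` in each string, or NA when the unit is absent.
extract_count <- function(text, unit) {
  # \\s*  allows zero or more whitespace characters before the unit
  # s?    accepts both singular and plural forms
  # \\b   (assumption) requires the unit word to end there
  pattern <- paste0("\\d+(?=\\s*", unit, "s?\\b)")
  as.integer(str_extract(text, pattern))
}

text <- c("Person 1: $1000 fine, 31 months jail",
          "Person 2: $500 fine, 45 days jail",
          "Person 3: $1000 fine, 1 month 1 day jail")

months <- extract_count(text, "month")
days   <- extract_count(text, "day")
```

Converting with as.integer() keeps missing units as NA rather than empty strings, which tends to be easier to handle in later summaries.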
Answer 2:
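getMonths <- function(str) {
  # First "<digits> months" in the string; character(0) when there is no match
  res <- regmatches(str, regexpr("\\d+\\smonths", str))
  # regmatches() never returns NULL, so test the length instead of is.null()
  if (length(res) == 0) return(NA_character_)
  # Keep only the digits of the match
  regmatches(res, regexpr("\\d+", res))
}
getDays <- function(str) {
  # First "<digits> days" in the string; character(0) when there is no match
  res <- regmatches(str, regexpr("\\d+\\sdays", str))
  if (length(res) == 0) return(NA_character_)
  # Keep only the digits of the match
  regmatches(res, regexpr("\\d+", res))
}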
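d <- tibble::as_tibble(list(text = c("Person 1: $1000 fine, 31 months jail",
"Person 2: $500 fine, 45 days jail")))
# Call mutate() directly: %>% is unavailable unless dplyr or magrittr is attached
dplyr::mutate(d, days = sapply(text, getDays, USE.NAMES = FALSE), months = sapply(text, getMonths, USE.NAMES = FALSE))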
## A tibble: 2 x 3
## text days months
## <chr> <chr> <chr>
## 1 Person 1: $1000 fine, 31 months jail NA 31
## 2 Person 2: $500 fine, 45 days jail 45 NA
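The same idea can probably be written without sapply(), since regexpr() and regmatches() are already vectorised. The following base-R sketch uses a hypothetical helper, extract_before (not from any package), and assumes a single whitespace character between the number and the unit, as in the functions above.

```r
# Hypothetical helper: vectorised base-R extraction of the number that
# precedes `unit`; returns NA_character_ for strings without a match.
extract_before <- function(text, unit) {
  # perl = TRUE is required for the lookahead (?=...)
  m <- regexpr(paste0("\\d+(?=\\s", unit, ")"), text, perl = TRUE)
  out <- rep(NA_character_, length(text))
  # regexpr() yields -1 where nothing matched; regmatches() returns only the hits
  out[m != -1] <- regmatches(text, m)
  out
}

text <- c("Person 1: $1000 fine, 31 months jail",
          "Person 2: $500 fine, 45 days jail")

months <- extract_before(text, "months")
days   <- extract_before(text, "days")
```

Because the result is a plain character vector of the same length as the input, it can be assigned straight to a new column of a data.table, tibble, or data.frame.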
Source: https://stackoverflow.com/questions/56120014/extracting-numbers-based-on-the-following-term-in-a-string