Extracting Numbers Based On the Following Term in a String

Question

I have a batch of data that includes a text variable full of free-form text. I am trying to extract certain information based on context within the string into new variables which I can then analyze.

I have been digging into qdap and tm. I have standardized the format with tolower and replace_abbreviation, but I cannot figure out how to actually extract the information I need.

So for example,

library(data.table)
data<-data.table(text=c("Person 1: $1000 fine, 31 months jail", 
                     "Person 2: $500 fine, 45 days jail"))


                                   text
1: Person 1: $1000 fine, 31 months jail
2:    Person 2: $500 fine, 45 days jail

What I would like to do is extract each number based on the term that follows it, creating two additional variables, months and days, which hold the corresponding values:

data <- data.table(text = c("Person 1: $1000 fine, 31 months jail",
                            "Person 2: $500 fine, 45 days jail"),
                   months = c("31", ""),
                   days = c("", "45"))


                                   text months days
1: Person 1: $1000 fine, 31 months jail     31     
2:    Person 2: $500 fine, 45 days jail          45

I have scoured Stack Overflow and have not found any answers to this so hopefully I didn't just miss one. But any help anyone could offer will be very much appreciated. Still pretty new at text analysis.

Thank you for your time!

Answer 1:

Using stringr::str_extract() with positive lookahead you can do something like this:

data <- dplyr::mutate(data,
                      months = stringr::str_extract(text, "\\d+(?=\\smonths)"),
                      days = stringr::str_extract(text, "\\d+(?=\\sdays)"))

##                                   text months days
## 1 Person 1: $1000 fine, 31 months jail     31 <NA>
## 2    Person 2: $500 fine, 45 days jail   <NA>   45
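To see what the lookahead contributes, here is a minimal sketch comparing the same pattern with and without it. The variable names x, with_unit and number_only are made up for this illustration.

```r
library(stringr)

x <- "Person 1: $1000 fine, 31 months jail"

# Without a lookahead, the unit is part of the returned match.
with_unit <- str_extract(x, "\\d+\\smonths")

# With a lookahead, the unit must follow the digits but is not returned.
number_only <- str_extract(x, "\\d+(?=\\smonths)")

print(with_unit)    # "31 months"
print(number_only)  # "31"
```

The lookahead only asserts that " months" comes next, so the match itself stays limited to the digits and needs no second cleanup step.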

The above regex makes some assumptions about the text string, namely that there is exactly one whitespace character between the number and the unit, and that the units are always plural. Something more flexible would be:

data<-data.table(text=c("Person 1: $1000 fine, 31 months jail", 
                        "Person 2: $500 fine, 45 days jail",
                        "Person 3: $1000 fine, 1     month 1 day jail"))

data <- dplyr::mutate(data,
                      months = stringr::str_extract(text, "\\d+(?=\\s*months*)"),
                      days = stringr::str_extract(text, "\\d+(?=\\s*days*)"))

##                                           text months days
## 1         Person 1: $1000 fine, 31 months jail     31 <NA>
## 2            Person 2: $500 fine, 45 days jail   <NA>   45
## 3 Person 3: $1000 fine, 1     month 1 day jail      1    1
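If more units turn up later (years, weeks, and so on), the pattern could be built from the unit name instead of being written out each time. The following is only a sketch: extract_count is a hypothetical helper, not part of stringr, and it assumes the unit is a plain word with no regex metacharacters. It also converts the result to integer, since str_extract() returns character.

```r
library(stringr)

# Hypothetical helper: returns the integer that precedes `unit`
# (singular or plural), or NA when the unit does not appear.
extract_count <- function(text, unit) {
  # \\s* tolerates any amount of whitespace; s? makes the plural optional;
  # \\b stops "day" from matching inside a longer word such as "daylight".
  pattern <- paste0("\\d+(?=\\s*", unit, "s?\\b)")
  as.integer(str_extract(text, pattern))
}

text <- c("Person 1: $1000 fine, 31 months jail",
          "Person 2: $500 fine, 45 days jail",
          "Person 3: $1000 fine, 1     month 1 day jail")

months <- extract_count(text, "month")
days   <- extract_count(text, "day")
```

Like str_extract() itself, this keeps only the first occurrence of each unit per string; str_extract_all() would be the place to look if a string can mention the same unit more than once.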

Answer 2:

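library(dplyr)  # provides %>% and mutate()

# Return the number that precedes " months" in a single string, or NA.
getMonths <- function(str) {
  res <- regmatches(str, regexpr("\\d+\\smonths", str))
  # regmatches() returns character(0), not NULL, when there is no match
  if (length(res) == 0) return(NA_character_)
  regmatches(res, regexpr("\\d+", res))
}

# Return the number that precedes " days" in a single string, or NA.
getDays <- function(str) {
  res <- regmatches(str, regexpr("\\d+\\sdays", str))
  if (length(res) == 0) return(NA_character_)
  regmatches(res, regexpr("\\d+", res))
}

d <- tibble::tibble(text = c("Person 1: $1000 fine, 31 months jail",
                             "Person 2: $500 fine, 45 days jail"))

# Apply each helper to every row; USE.NAMES = FALSE keeps the columns unnamed
d %>% dplyr::mutate(days = sapply(text, getDays, USE.NAMES = FALSE),
                    months = sapply(text, getMonths, USE.NAMES = FALSE))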

##  A tibble: 2 x 3
##  text                                   days  months
##  <chr>                                  <chr> <chr> 
##  1 Person 1: $1000 fine, 31 months jail NA    31    
##  2 Person 2: $500 fine, 45 days jail    45    NA
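The same base R functions can also work on the whole column at once, which avoids the sapply() calls. This is a rough sketch rather than a drop-in replacement: extract_before is a hypothetical helper name, and it assumes the unit string contains no regex metacharacters.

```r
# Hypothetical helper: vectorised, base R only.
extract_before <- function(text, unit) {
  # perl = TRUE enables the lookahead syntax in base R regular expressions.
  m <- regexpr(paste0("\\d+(?=\\s", unit, ")"), text, perl = TRUE)
  out <- rep(NA_character_, length(text))
  # regmatches() returns only the successful matches, so assign them
  # to the positions where regexpr() found something.
  out[m != -1] <- regmatches(text, m)
  out
}

text <- c("Person 1: $1000 fine, 31 months jail",
          "Person 2: $500 fine, 45 days jail")

months <- extract_before(text, "months")
days   <- extract_before(text, "days")
```

Because regexpr() reports -1 for elements without a match, the NA placeholders stay in place for those rows and the result lines up with the input.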

Source: https://stackoverflow.com/questions/56120014/extracting-numbers-based-on-the-following-term-in-a-string

Tags

text

nlp