Hashtag Extract function in R Programming


I am trying to create a hashtag extraction function in R. The function should extract the hashtags from a post if there are any, and otherwise return a blank. My function is like <

3 Answers

陌清茗

2020-12-12 08:29
Thanks everyone for all the help. I got it working somehow, though it is almost the same as Shalini's answer.
1. Replacing all NAs in message
```
message[is.na(message)]='abc'
```
2. Function for extracting the hashtags
```
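library(stringr)

# Return the hashtags found in a single string, or '' if there are none
hashtag_extrac <- function(text) {
  # str_extract_all() returns a list; [[1]] is the vector of matches for this string
  match <- str_extract_all(text, "#\\S+")[[1]]
  if (length(match) > 0) {
    match
  } else {
    ''
  }
}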
```
3. Applying the function to the whole column
```
hashtags= sapply(message, hashtag_extrac)
```
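As a side note on the step above: `sapply()` returns a list whenever posts contain different numbers of hashtags. If one string per post is more convenient, a minimal sketch along the following lines may help. The sample `message` vector is hypothetical, and joining the tags with a space is an assumption about the desired output, not something stated in the question.

```r
library(stringr)

# Hypothetical sample data standing in for the `message` column
message <- c("Launch day #rstats #dataviz", "no tags here", NA)

# str_extract_all() is vectorised: it returns a list with one
# character vector of matches per input element
tags <- str_extract_all(message, "#\\S+")

# Collapse each element to a single string; posts with no match
# (or NA input) end up as ''
hashtags <- vapply(
  tags,
  function(m) paste(m[!is.na(m)], collapse = " "),
  character(1)
)
hashtags
```

Because the result is a plain character vector of the same length as `message`, it can be assigned straight back to a data frame column.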
南旧

2020-12-12 08:33
@manu sharma I would say you don't need the if/else inside the function. Let the non-matching rows take the value NA, and change them to blank after applying the function. Hope my code helps you:
```
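library(stringr)

aaa <- readLines("C:\\MY_FOLDER\\NOI\\file2sample.txt")

# Return the first hashtag in each line (NA where there is none).
# Note: the pattern requires whitespace after the tag, so a tag at
# the very end of a line is not matched.
ttt <- function(x) {
  sapply(x, function(s) str_match(s, "#\\w+\\s+"))
}

y <- ttt(aaa)
# Replace the NAs from non-matching lines with blanks
y[is.na(y)] <- ''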
```

走了就别回头了

2020-12-12 08:45

- Hashtag regexes aren't that simple
- I'm not sure you understand the commonly accepted "rules" for hashtags
- I do not believe str_extract_all() is returning what you think it is
- Just use stringi, which the stringr functions are built on top of
- Folks really need to stop analyzing tweets

This should handle most, if not all, cases:

```
library(magrittr)  # provides %>%

get_tags <- function(x) {
  # via http://stackoverflow.com/a/5768660/1457051
  twitter_hashtag_regex <- "(^|[^&\\p{L}\\p{M}\\p{Nd}_\u200c\u200d\ua67e\u05be\u05f3\u05f4\u309b\u309c\u30a0\u30fb\u3003\u0f0b\u0f0c\u00b7])(#|\uFF03)(?!\uFE0F|\u20E3)([\\p{L}\\p{M}\\p{Nd}_\u200c\u200d\ua67e\u05be\u05f3\u05f4\u309b\u309c\u30a0\u30fb\u3003\u0f0b\u0f0c\u00b7]*[\\p{L}\\p{M}][\\p{L}\\p{M}\\p{Nd}_\u200c\u200d\ua67e\u05be\u05f3\u05f4\u309b\u309c\u30a0\u30fb\u3003\u0f0b\u0f0c\u00b7]*)"
  # Column 4 of each match matrix is the third capture group: the tag without '#'
  stringi::stri_match_all_regex(x, twitter_hashtag_regex) %>%
    purrr::map(~ .[, 4]) %>%
    purrr::flatten_chr()
}
```

```
tests <- c("#teste_teste      //underscore accepted",
           "#teste-teste      //Hyphen not accepted",
           "#leof_gfg.sdfsd   //dot not accepted",
           "#f34234@45#6fgh6  // @ not accepted",
           "#leo#leo2#asd     //followed hastag without space ",
           "#6663             // only number accepted",
           "_#asd_            // hashtag can't start or finish with underscore",
           "-#sdfsdf-         // hashtag can't start or finish with hyphen",
           ".#sdfsdf.         // hashtag can't start or finish with dot",
           "#leo_leo__leo__leo____leo // decline followed underline")

get_tags(tests)
##  [1] "teste_teste"              "teste"                   
##  [3] "leof_gfg"                 "f34234"                  
##  [5] "leo"                      NA                        
##  [7] NA                         "sdfsdf"                  
##  [9] "sdfsdf"                   "leo_leo__leo__leo____leo"
```

```
your_string <- "#letsdoit #Tonewbeginnign world is on a new#route"

get_tags(your_string)
## [1] "letsdoit"       "Tonewbeginnign"
```

You'll need to tweak the function if you need each set of hashtags grouped with the input string it came from, but you didn't provide much detail on what you're trying to accomplish.
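One way such a tweak might look is to skip the flattening step and return a list with one element per input string. The sketch below is only an illustration: `get_tags_grouped` and `simple_hashtag_regex` are hypothetical names, and the pattern is a deliberately simplified stand-in for the full Twitter regex (with the full regex the tag would be in column 4 rather than column 2).

```r
library(stringi)

# Simplified pattern for illustration only: '#' followed by
# letters, digits or underscores (NOT the full Twitter rules)
simple_hashtag_regex <- "#([\\p{L}\\p{Nd}_]+)"

get_tags_grouped <- function(x) {
  matches <- stringi::stri_match_all_regex(x, simple_hashtag_regex)
  # Column 2 is the first capture group (the tag without '#').
  # stringi returns a single NA row when nothing matches, so drop NAs.
  lapply(matches, function(m) {
    tags <- m[, 2]
    tags[!is.na(tags)]
  })
}

posts <- c("#letsdoit #Tonewbeginnign world", "nothing to see")
grouped <- get_tags_grouped(posts)
grouped
```

Each list element lines up with the input at the same position, and inputs without hashtags give an empty character vector instead of `NA`.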
