Replacing Emoji Unicode Range from Arabic Tweets using Java

萌比男神i 2021-01-01 05:10

I am trying to replace emoji in Arabic tweets using Java.

I used this code:

String line = "اييه تقولي اجل الارسنال تعادل امس بعد ما كان فايز


      
      
        
2 answers

天涯浪人 2021-01-01 05:59

Java 5 and 6

If you are stuck running your program on a Java 5 or 6 JVM and want to match characters in the range U+1F601 to U+1F64F, use surrogate pairs in the character class:

Pattern emoticons = Pattern.compile("[\uD83D\uDE01-\uD83D\uDE4F]");


This method remains valid in Java 7 and above: in Sun/Oracle's implementation, Pattern.compile() converts the String containing the pattern into an array of code points before compilation, as you can see by decompiling the method.
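
As a minimal sketch of how the surrogate-pair pattern above could be applied to tweet text, the following removes every character in U+1F601 to U+1F64F. The class name EmoticonStripper, the strip helper, and the sample tweet are assumptions made for illustration; they are not part of the original answer.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the name is illustrative only.
final class EmoticonStripper {
    // Range U+1F601..U+1F64F, written as surrogate pairs in the Java source.
    private static final Pattern EMOTICONS =
            Pattern.compile("[\uD83D\uDE01-\uD83D\uDE4F]");

    // Returns the text with every character in the range removed.
    static String strip(String text) {
        return EMOTICONS.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Sample input: an Arabic greeting followed by U+1F602 as a surrogate pair.
        String tweet = "\u0645\u0631\u062D\u0628\u0627 \uD83D\uDE02";
        System.out.println(strip(tweet));
    }
}
```

Note that U+1F600 lies just outside this range, so it is left in place by this sketch.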

Java 7 and above


You can use the \x{...} construct from David Wallace's answer, which is available from Java 7 onward.
Alternatively, you can specify the whole Emoticons Unicode block, which spans code points U+1F600 (rather than U+1F601) to U+1F64F.

Pattern emoticons = Pattern.compile("\\p{InEmoticons}");


Because support for the Emoticons block was added in Java 7, this method is likewise valid only from Java 7 onward.
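
The difference between the two Java 7 options can be checked with a short sketch like the one below. It assumes a Java 7 or later runtime; the class name EmoticonBlockDemo and the matches helper are hypothetical, introduced only for this comparison.

```java
import java.util.regex.Pattern;

// Hypothetical demo class comparing the two Java 7+ constructs.
final class EmoticonBlockDemo {
    // Explicit code point range using the hex-escape construct.
    static final Pattern RANGE = Pattern.compile("[\\x{1F601}-\\x{1F64F}]");
    // The whole Emoticons block, U+1F600..U+1F64F.
    static final Pattern BLOCK = Pattern.compile("\\p{InEmoticons}");

    // True if the pattern matches the single given code point.
    static boolean matches(Pattern p, int codePoint) {
        return p.matcher(new String(Character.toChars(codePoint))).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches(RANGE, 0x1F600)); // false: below the range
        System.out.println(matches(BLOCK, 0x1F600)); // true: first code point of the block
        System.out.println(matches(RANGE, 0x1F64F)); // true: upper bound is inclusive
    }
}
```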
Although the other methods are preferred, you can also specify supplementary characters by writing the escape in the regex itself, so that the backslashes reach the regex engine. While there is no reason to do this in source code, this change in Java 7 corrects the behavior in applications where a regex is entered for searching and the character cannot be pasted directly.

Pattern emoticons = Pattern.compile("[\\uD83D\\uDE01-\\uD83D\\uDE4F]");
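
A rough sketch of the searching scenario, assuming a Java 7 or later runtime: the regex arrives as plain text with literal backslashes, as it would from a search box or a configuration file. The class name EscapedPatternDemo, the strip helper, and the sample input are assumptions for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo: the regex is supplied as text rather than written as characters.
final class EscapedPatternDemo {
    // Compiles the supplied regex text and removes every match from the input.
    static String strip(String regexText, String input) {
        return Pattern.compile(regexText).matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // Stand-in for text typed by a user: the backslashes are literal
        // characters in this String, so the regex engine resolves the escapes.
        String regexText = "[\\uD83D\\uDE01-\\uD83D\\uDE4F]";
        // Sample input ending in U+1F60A.
        String input = "ok " + new String(Character.toChars(0x1F60A));
        System.out.println(strip(regexText, input));
    }
}
```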


/!\ Warning

Never mix the two syntaxes when you specify a supplementary code point, as in:


"[\\uD83D\uDE01-\\uD83D\\uDE4F]"
"[\uD83D\\uDE01-\\uD83D\\uDE4F]"


In Oracle's implementation, these patterns match the code point U+D83D plus the range from code point U+DE01 to code point U+1F64F.
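
One way to observe the effect described above is to test a character that has nothing to do with emoji. The sketch below assumes an Oracle/OpenJDK runtime, version 7 or later, and uses U+FF21 (fullwidth A), which falls inside the accidental range U+DE01 to U+1F64F. The class name MixedSyntaxDemo and the matches helper are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo contrasting the mixed (wrong) pattern with the intended one.
final class MixedSyntaxDemo {
    // Deliberately wrong: the high surrogate is a regex-level escape,
    // while the low surrogate is a source-level escape.
    static final Pattern MIXED =
            Pattern.compile("[\\uD83D\uDE01-\\uD83D\\uDE4F]");
    // Intended pattern: both halves written consistently as source-level escapes.
    static final Pattern INTENDED =
            Pattern.compile("[\uD83D\uDE01-\uD83D\uDE4F]");

    // True if the pattern matches the single given code point.
    static boolean matches(Pattern p, int codePoint) {
        return p.matcher(new String(Character.toChars(codePoint))).matches();
    }

    public static void main(String[] args) {
        // U+FF21 is not an emoticon, so only the mixed pattern should accept it.
        System.out.println(matches(MIXED, 0xFF21));
        System.out.println(matches(INTENDED, 0xFF21));
    }
}
```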


Note

In Oracle's implementation of Java 5 and 6, Pattern.u() doesn't collapse a valid regex-escaped surrogate pair such as "\\uD83D\\uDE01" into a single code point. As a result, the pattern is interpreted as two lone surrogates, which will fail to match anything.
    
             
                                                        
            
            
              
                