Remove ✅,


离开以前

I have some strings with all kinds of different emojis/images/signs in them.

Not all the strings are in English -- some of them are in other non-Latin languages, for

7 Answers

醉酒成梦

2020-11-28 20:40

I gave some examples below, and thought that Latin is enough, but...

Is there a way to remove all these signs from the input string and keep only the letters and punctuation in the different languages?

After editing, I developed a new solution using the Character.getType method, which appears to be the best shot at this.
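The Character.getType idea can be sketched roughly as follows. This is not the answer's original code, which is cut off in this copy. The class name GetTypeFilterSketch and the choice of which general categories to drop are assumptions made for illustration.

```java
// Hypothetical sketch: drop code points by Unicode general category.
class GetTypeFilterSketch {

    // Assumption: these categories are the ones to drop. Adjust to taste.
    static boolean keep(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.OTHER_SYMBOL:  // So: most emoji and pictographs, e.g. U+2705
            case Character.SURROGATE:     // Cs: unpaired surrogates
            case Character.PRIVATE_USE:   // Co: private-use code points
            case Character.UNASSIGNED:    // Cn: includes emoji newer than the JDK's Unicode data
                return false;
            default:
                return true;
        }
    }

    static String strip(String input) {
        StringBuilder out = new StringBuilder(input.length());
        // codePoints() yields whole code points, so surrogate pairs are never split.
        input.codePoints()
             .filter(GetTypeFilterSketch::keep)
             .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("Remove \u2705, keep \u4e16\u754c"));
    }
}
```

Note that the So category also contains characters such as ©, ® and °, so this sketch removes those as well. If they matter, they would need an explicit exception.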

package zmarcos.emoji;

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class TestEmoji {

    public static void main(String[] args) {
        String[] arr = {"Remove ✅,


          	          
            
           
            
                              
                
              
              
                
                  终归单人心        
                
              
                            
                2020-11-28 20:41
              
            
            
                                                                       
Use a jQuery plugin called RM-Emoji. Here's how it works:

$('#text').remove('emoji').fast()


This is the fast mode, which may miss some emojis because it uses heuristic algorithms to find emojis in text. Use the .full() method to scan the entire string, which is guaranteed to remove all emojis.
                                                                        
                                                        
            
            
              
                
                  暗喜        
                
              
                            
                2020-11-28 20:43
              
            
            
                                                                       
Instead of blacklisting some elements, how about creating a whitelist of the characters you do wish to keep? This way you don't need to worry about every new emoji being added.

String characterFilter = "[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]";
String emotionless = aString.replaceAll(characterFilter,"");


So:


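[\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s] is a character class covering all numeric (\\p{N}), letter (\\p{L}), mark (\\p{M}), punctuation (\\p{P}), whitespace/separator (\\p{Z}), other formatting (\\p{Cf}), surrogate (\\p{Cs}) and whitespace such as newline (\\s) characters. Note that Java's regex engine matches by code point, so a well-formed emoji above U+FFFF is classified by its own category rather than as \\p{Cs}; \\p{Cs} only matches unpaired surrogates. \\p{L} specifically includes letters from all scripts, such as Cyrillic, Latin, Kanji, etc.
The ^ at the start of the regex character class negates the match, so everything outside these categories is removed.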
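As a runnable illustration of the whitelist above, here is a minimal sketch. The class name WhitelistFilterSketch and the sample strings are assumptions added for illustration; the pattern itself is the one from the answer.

```java
import java.util.regex.Pattern;

// Hypothetical wrapper around the whitelist regex from the answer.
class WhitelistFilterSketch {

    // Matches every code point that is NOT in one of the whitelisted categories.
    private static final Pattern NOT_ALLOWED =
            Pattern.compile("[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]");

    static String strip(String input) {
        return NOT_ALLOWED.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("Remove \u2705, keep \u4e16\u754c!"));
        System.out.println(strip("1+1=2")); // prints "112": + and = are math symbols (Sm)
    }
}
```

Two side effects are worth knowing. Symbols such as +, = and $ fall in the Sm and Sc categories and are removed along with emoji. Conversely, variation selectors and combining keycaps are marks (\\p{M}) and the zero-width joiner is a format character (\\p{Cf}), so those pieces of an emoji sequence survive the filter.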


Example:

String str = "hello world _# 皆さん、こんにちは！　私はジョンと申します。                                                                    
                                                        
            

            
              
                
                  野的像风        
                
              
                            
                2020-11-28 20:46
              
            
            
                                                                       
According to the Full Emoji List, v11.0, there are 1644 different Unicode code points to remove. For example, ✅ is on this list as U+2705.

Once you have the full list of emojis, you need to filter them out by code point. Iterating over single chars or bytes won't work, because a single code point can span multiple bytes. Because Java strings are UTF-16, emojis will usually take two chars (a surrogate pair).

String input = "ab✅cd";
for (int i = 0; i < input.length();) {
  int cp = input.codePointAt(i);
  // filter out if matches
  i += Character.charCount(cp); 
}


Mapping from Unicode code point U+2705 to Java int is straightforward:

int viSign = 0x2705;


or since Java supports Unicode Strings:

int viSign = "✅".codePointAt(0);
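Putting the loop and the code point lookup together might look like the sketch below. The class name CodePointFilterSketch and the two-entry BLOCKED set are hypothetical; a real blocklist would be populated from the full emoji list. Set.of assumes Java 9 or later.

```java
import java.util.Set;

// Hypothetical sketch: remove code points that appear in a blocklist.
class CodePointFilterSketch {

    // Deliberately tiny, illustrative blocklist: U+2705 (check mark) and U+1F600 (grinning face).
    private static final Set<Integer> BLOCKED = Set.of(0x2705, 0x1F600);

    static String removeBlocked(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            if (!BLOCKED.contains(cp)) {
                out.appendCodePoint(cp);
            }
            // Advance by 1 or 2 chars depending on whether cp is a surrogate pair.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(removeBlocked("ab\u2705cd\uD83D\uDE00")); // prints "abcd"
    }
}
```

A hash set keeps each lookup constant-time, which matters once the blocklist holds well over a thousand entries.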

                                                                        
                                                        
            
            
              
                
                  再見小時候        
                
              
                            
                2020-11-28 20:47
              
            
            
                                                                       
ICU4J is your friend.

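UCharacter.hasBinaryProperty(codePoint, UProperty.EMOJI);


Remember to keep your version of icu4j up to date, and note that this will only filter out official Unicode emoji, not other symbol characters. Be aware that the Emoji property is also set for the ASCII digits, # and *, so you will probably want to exempt those from removal. Combine with filtering out other character types as desired.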

More information:
http://icu-project.org/apiref/icu4j/com/ibm/icu/lang/UProperty.html#EMOJI
                                                                        
                                                        
            
            
              
                
                  无人共我        
                
              
                            
                2020-11-28 20:57
              
            
            
                                                                       
Try this project simple-emoji-4j

Compatible with Emoji 12.0 (2018.10.15)

Simple with:

EmojiUtils.removeEmoji(str)

                                                                        
                                                        
            
            
              
                