R tm package invalid input in 'utf8towcs'


逝去的感伤 2020-11-29 01:47

I'm trying to use the tm package in R to perform some text analysis. I tried the following:

require(tm)
dataSet <- Corpus(DirSource('tmp/'))
dataSet <


      
      
        
14 Answers

[愿得一人] (original poster) 2020-11-29 01:59
              

            
            
                        
I have just run afoul of this problem. Are you by chance using a machine running OS X? I am, and I seem to have traced the problem to the definition of the character set that R is compiled against on this operating system (see https://stat.ethz.ch/pipermail/r-sig-mac/2012-July/009374.html).

What I was seeing is that using the solution from the FAQ

tm_map(yourCorpus, function(x) iconv(enc2utf8(x), sub = "byte"))


was giving me this warning:

Warning message:
it is not known that wchar_t is Unicode on this platform 


I traced this to the enc2utf8 function. The bad news is that this is a problem with my underlying OS and not with R.

So here is what I did as a workaround:

tm_map(yourCorpus, function(x) iconv(x, to='UTF-8-MAC', sub='byte'))


This forces iconv to use the UTF-8 encoding on the Mac and works fine without the need to recompile.
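For anyone who wants to try the idea outside of tm first, here is a minimal base-R sketch of the same re-encoding step. The helper name sanitize_utf8 is hypothetical (it is not part of tm), and the sketch assumes the input is meant to be UTF-8, so it passes from = "UTF-8" explicitly, which the one-liner above does not. It also assumes that "UTF-8-MAC" is only offered by the iconv that ships with macOS, so it falls back to plain "UTF-8" on other systems.

```r
# Hypothetical helper (not part of tm): re-encode text so that bytes which
# are not valid in the source encoding are replaced by their hex code
# (for example "<e7>") instead of being passed on unchanged.
sanitize_utf8 <- function(x) {
  # Assumption: "UTF-8-MAC" is only available from the macOS iconv,
  # so use plain "UTF-8" everywhere else.
  on_mac <- identical(Sys.info()[["sysname"]], "Darwin")
  target <- if (on_mac) "UTF-8-MAC" else "UTF-8"
  iconv(x, from = "UTF-8", to = target, sub = "byte")
}

# Build a string that contains the lone byte 0xE7, which is not valid UTF-8.
bad  <- rawToChar(as.raw(c(0x66, 0x61, 0xe7, 0x69, 0x6c, 0x65)))
docs <- c("plain ascii text", bad)

validUTF8(docs)                 # shows which elements are valid UTF-8
cleaned <- sanitize_utf8(docs)
validUTF8(cleaned)              # check again after re-encoding
tolower(cleaned)                # the kind of call that can choke on bad input
```

To apply this to a corpus, something like tm_map(yourCorpus, content_transformer(sanitize_utf8)) should work with tm 0.6 or later, where custom functions need to be wrapped in content_transformer(); with older versions the function can be passed to tm_map directly, as in the answer above.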
    
             
                                                        
            
            
              
                