How to detect the right encoding for read.csv?

遥遥无期 2020-11-27 11:02

I have this file (http://b7hq6v.alterupload.com/en/) that I want to read in R with read.csv. But I am not able to detect the correct encoding. It seems to be a

6 Answers
  •  南方客 (original poster)
    2020-11-27 11:41

    First, you have to figure out the encoding of the file. As far as I know, base R has no built-in function for this, so you can use external tools instead, e.g. Perl, Python, or the file utility under Linux/UNIX.
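
    As a partial complement to those external tools, a byte order mark (BOM), when the file has one, can be inspected from R itself by reading the first raw bytes. The following is only a minimal sketch: `sniff_bom` is a hypothetical helper name, it recognises just three common BOMs, and it returns NA for the many files that have no BOM at all, so it is not a general encoding detector.

    ```r
    # Hypothetical helper: guess an encoding name from a leading BOM, if any.
    sniff_bom <- function(path) {
      b <- readBin(path, what = "raw", n = 3L)   # first (up to) 3 bytes
      starts_with <- function(sig) {
        length(b) >= length(sig) && all(b[seq_along(sig)] == as.raw(sig))
      }
      if (starts_with(c(0xEF, 0xBB, 0xBF))) return("UTF-8-BOM")  # name R accepts for reading
      if (starts_with(c(0xFF, 0xFE)))       return("UTF-16LE")
      if (starts_with(c(0xFE, 0xFF)))       return("UTF-16BE")
      NA_character_                              # no BOM found: encoding still unknown
    }

    # Demo on a throwaway file: a UTF-16LE BOM followed by the letter "A".
    tmp_bom <- tempfile()
    writeBin(as.raw(c(0xFF, 0xFE, 0x41, 0x00)), tmp_bom)
    sniff_bom(tmp_bom)
    ```

    A non-NA result can be passed on as the encoding when opening the file; an NA result means an external tool is still needed.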

    As @ssmit suggested, the file is encoded as UTF-16LE (what Windows calls "Unicode"), so open it with that encoding and use readLines to inspect the first 10 lines or so:

    > f <- file('encoding.asc', open="r", encoding="UTF-16LE")   # UTF-16LE, which is "called" Unicode in Windows
    > readLines(f,10)
     [1] "\tFe 2\tZn\tO\tC\tSi\tMn\tP\tS\tAl\tN\tCr\tNi\tMo\tCu\tV\tNb 2\tTi\tB\tZr\tCa\tH\tCo\tMg\tPb 2\tW\tCl\tNa 3\tAr"                                                                                                                          
     [2] ""                                                                                                                                                                                                                                         
     [3] "0\t0,003128\t3,82E-05\t0,0004196\t0\t0,001869\t0,005836\t0,004463\t0,002861\t0,02148\t0\t0,004768\t0,0003052\t0\t0,0037\t0,0391\t0,06409\t0,1157\t0,004654\t0\t0\t0\t0,00824\t7,63E-05\t0,003891\t0,004501\t0\t0,001335\t0,01175"         
     [4] "0,0005\t0,003265\t3,05E-05\t0,0003662\t0\t0,001709\t0,005798\t0,004395\t0,002808\t0,02155\t0\t0,004578\t0,0002441\t0\t0,003601\t0,03897\t0,06406\t0,1158\t0,0047\t0\t0\t0\t0,008026\t6,10E-05\t0,003876\t0,004425\t0\t0,001343\t0,01157"  
     [5] "0,001\t0,003332\t2,54E-05\t0,0003052\t0\t0,001704\t0,005671\t0,0044\t0,002823\t0,02164\t0\t0,004603\t0,0003306\t0\t0,003611\t0,03886\t0,06406\t0,1159\t0,004705\t0\t0\t0\t0,008036\t5,09E-05\t0,003815\t0,004501\t0\t0,001246\t0,01155"   
     [6] "0,0015\t0,003313\t2,18E-05\t0,0002616\t0\t0,001678\t0,005689\t0,004447\t0,002921\t0,02171\t0\t0,004621\t0,0003488\t0\t0,003597\t0,03889\t0,06404\t0,1158\t0,004752\t0\t0\t0\t0,008022\t4,36E-05\t0,003815\t0,004578\t0\t0,001264\t0,01144"
     [7] "0,002\t0,003313\t2,18E-05\t0,0002834\t0\t0,001591\t0,005646\t0,00436\t0,003008\t0,0218\t0\t0,004643\t0,0003488\t0\t0,003619\t0,03895\t0,06383\t0,1159\t0,004752\t0\t0\t0\t0,008\t4,36E-05\t0,003771\t0,004643\t0\t0,001351\t0,01142"      
     [8] "0,0025\t0,003488\t2,18E-05\t0,000218\t0\t0,001657\t0,00558\t0,004338\t0,002986\t0,02175\t0\t0,004469\t0,0002616\t0\t0,00351\t0,03889\t0,06374\t0,1159\t0,004621\t0\t0\t0\t0,008131\t4,36E-05\t0,003771\t0,004708\t0\t0,001243\t0,01125"   
     [9] "0,003\t0,003619\t0\t0,0001526\t0\t0,001591\t0,005668\t0,004207\t0,00303\t0,02169\t0\t0,00449\t0,0002834\t0\t0,00351\t0,03874\t0,06383\t0,116\t0,004665\t0\t0\t0\t0,007956\t0\t0,003749\t0,004796\t0\t0,001286\t0,01125"                   
    [10] "0,0035\t0,003422\t0\t4,36E-05\t0\t0,001482\t0,005711\t0,004185\t0,003292\t0,02156\t0\t0,004665\t0,0003488\t0\t0,003553\t0,03852\t0,06391\t0,1158\t0,004708\t0\t0\t0\t0,007717\t0\t0,003597\t0,004905\t0\t0,00133\t0,01136"                   
    

    This shows that the file has a header line followed by a blank line in the second row (which read.table skips by default), that the separator is \t, and that the decimal character is a comma (,).

    > f <- file('encoding.asc', open="r", encoding="UTF-16LE")
    > df <- read.table(f, sep='\t', dec=',', header=TRUE)
    

    And see what we have:

    > head(df)
           X     Fe.2       Zn         O C       Si       Mn        P        S
    1 0.0000 0.003128 3.82e-05 0.0004196 0 0.001869 0.005836 0.004463 0.002861
    2 0.0005 0.003265 3.05e-05 0.0003662 0 0.001709 0.005798 0.004395 0.002808
    3 0.0010 0.003332 2.54e-05 0.0003052 0 0.001704 0.005671 0.004400 0.002823
    4 0.0015 0.003313 2.18e-05 0.0002616 0 0.001678 0.005689 0.004447 0.002921
    5 0.0020 0.003313 2.18e-05 0.0002834 0 0.001591 0.005646 0.004360 0.003008
    6 0.0025 0.003488 2.18e-05 0.0002180 0 0.001657 0.005580 0.004338 0.002986
           Al N       Cr        Ni Mo       Cu       V    Nb.2     Ti        B Zr
    1 0.02148 0 0.004768 0.0003052  0 0.003700 0.03910 0.06409 0.1157 0.004654  0
    2 0.02155 0 0.004578 0.0002441  0 0.003601 0.03897 0.06406 0.1158 0.004700  0
    3 0.02164 0 0.004603 0.0003306  0 0.003611 0.03886 0.06406 0.1159 0.004705  0
    4 0.02171 0 0.004621 0.0003488  0 0.003597 0.03889 0.06404 0.1158 0.004752  0
    5 0.02180 0 0.004643 0.0003488  0 0.003619 0.03895 0.06383 0.1159 0.004752  0
    6 0.02175 0 0.004469 0.0002616  0 0.003510 0.03889 0.06374 0.1159 0.004621  0
      Ca H       Co       Mg     Pb.2        W Cl     Na.3      Ar
    1  0 0 0.008240 7.63e-05 0.003891 0.004501  0 0.001335 0.01175
    2  0 0 0.008026 6.10e-05 0.003876 0.004425  0 0.001343 0.01157
    3  0 0 0.008036 5.09e-05 0.003815 0.004501  0 0.001246 0.01155
    4  0 0 0.008022 4.36e-05 0.003815 0.004578  0 0.001264 0.01144
    5  0 0 0.008000 4.36e-05 0.003771 0.004643  0 0.001351 0.01142
    6  0 0 0.008131 4.36e-05 0.003771 0.004708  0 0.001243 0.01125
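
    Since the question asks about read.csv specifically: the same import can also be written without opening a connection by hand, because read.csv passes its fileEncoding argument on to read.table. The sketch below is self-contained only for illustration. It builds a tiny three-column UTF-16LE file (made-up path `tmp`, no BOM, values copied from the first data row above) and reads it back; for the real file you would pass 'encoding.asc' instead.

    ```r
    # Build a small tab-separated sample and encode it as UTF-16LE (no BOM).
    txt <- paste0(
      paste(c("\tFe 2\tZn", "", "0\t0,003128\t3,82E-05"), collapse = "\n"),
      "\n"
    )
    tmp <- tempfile(fileext = ".asc")
    writeBin(iconv(txt, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]], tmp)

    # read.csv defaults to sep = "," and dec = ".", so both are overridden here.
    df_small <- read.csv(tmp, fileEncoding = "UTF-16LE", sep = "\t", dec = ",")
    str(df_small)
    ```

    The blank second line is skipped just as with read.table, and the columns should come back as numeric rather than character if the decimal comma was handled correctly.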
    
