dirent not working with unicode

Submitted by 微笑、不失礼 on 2021-01-29 16:26:36

Question


I am trying to count the files in a folder, but the readdir function skips files whose names contain Unicode characters. I am using dirent in C.

int filecount(char* path)
{
    int file_Count=0;
    DIR* dirp;
    struct dirent * entry;
    dirp = opendir(path);
    while((entry=readdir(dirp)) !=NULL)
    {
        if(entry->d_type==DT_REG)
        {
            ++file_Count;
        }
    }
    closedir(dirp);
    return file_Count;
}

Answer 1:


Testing on Mac OS X 10.9.1 Mavericks, I adapted your code into the following complete program:

#include <dirent.h>
#include <stdio.h>

static
int filecount(char *path)
{
    int file_Count = 0;
    DIR *dirp;
    struct dirent *entry;
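    dirp = opendir(path);
    if (dirp == NULL)
    {
        /* Report the failure instead of passing NULL to readdir() */
        perror(path);
        return -1;
    }
    while ((entry = readdir(dirp)) != NULL)
    {
        /* The cast keeps %llu correct whatever the platform's ino_t width is */
        printf("Found (%llu)(%d): %s\n", (unsigned long long)entry->d_ino, entry->d_type, entry->d_name);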
        if (entry->d_type == DT_REG)
        {
            ++file_Count;
        }
    }
    closedir(dirp);
    return file_Count;
}

static void proc_dir(char *dir)
{
    printf("Processing %s:\n", dir);
    printf("File count = %d\n", filecount(dir));
}

int main(int argc, char **argv)
{
    if (argc > 1)
    {
        for (int i = 1; i < argc; i++)
            proc_dir(argv[i]);
    }
    else
        proc_dir(".");
    return 0;
}

Notably, it lists each entry as it is returned: inode, type and name. On Mac OS X, I was told that the inode type was __uint64_t, aka unsigned long long, hence the use of %llu for the format; YMMV on that.

I also created a folder utf8 and in the folder created files:

total 32
-rw-r--r--  1 jleffler  eng  6 Jan  7 12:14 ÿ-y-umlaut
-rw-r--r--  1 jleffler  eng  6 Jan  7 12:15 £
-rw-r--r--  1 jleffler  eng  6 Jan  7 12:14 €
-rw-r--r--  1 jleffler  eng  6 Jan  7 12:15 ™

Each file contained Hello plus a newline. When I run the command (I called it fc), it gives:

$ ./fc utf8
Processing utf8:
Found (8138036)(4): .
Found (377579)(4): ..
Found (8138046)(8): ÿ-y-umlaut
Found (8138067)(8): £
Found (8138054)(8): €
Found (8138078)(8): ™
File count = 4
$

The Euro symbol € is U+20AC EURO SIGN, which is way outside the range of ordinary single-byte code sets. The pound symbol £ is U+00A3 POUND SIGN, so that's in the range of the Latin 1 alphabet (ISO 8859-1, 8859-15). The trademark symbol ™ is U+2122 TRADE MARK SIGN, also outside the range of ordinary single-byte code sets.
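
To make the point about single-byte code sets concrete, here is a minimal sketch of how those code points turn into the multi-byte UTF-8 sequences that end up in d_name. The helper name utf8_encode is hypothetical (it is not a library function); it simply follows the standard UTF-8 bit layout.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Encode one Unicode code point as UTF-8 into out (room for 4 bytes).
 * Returns the number of bytes written, or 0 if cp is not a valid
 * Unicode scalar value (a surrogate, or anything above U+10FFFF).
 * utf8_encode is a hypothetical helper name used for illustration.
 */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80)
    {
        /* ASCII: one byte, unchanged */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800)
    {
        /* Two bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;   /* surrogates are not encodable in UTF-8 */
    if (cp < 0x10000)
    {
        /* Three bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF)
    {
        /* Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

With this sketch, U+20AC (€) encodes as the three bytes E2 82 AC, U+2122 (™) as E2 84 A2, and U+00A3 (£) as the two bytes C2 A3. So even the Latin 1 character needs more than one byte in a UTF-8 file name, and readdir() just hands those bytes back in d_name.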

This shows that on at least some platforms, readdir() works perfectly well with UTF-8 encoded file names using Unicode characters that are not in the Latin1 character set. It also demonstrates how I'd go about debugging the problem — and/or illustrates what I'd like you to run (the program above) and the sort of directory you should run it on to make your case that readdir() on your platform does not like Unicode file names.




Answer 2:


Try changing

if(entry->d_type==DT_REG)

to

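if((entry->d_type==DT_REG || entry->d_type==DT_UNKNOWN) 
    && strcmp(entry->d_name,".")!=0 && strcmp(entry->d_name,"..")!=0)

which should let you count these files by also counting entries of unknown type. (strcmp needs #include <string.h>.)

Note that strcmp(entry->d_name,".")!=0 and strcmp(entry->d_name,"..")!=0 are used to exclude the "." and ".." directory entries, which would otherwise be counted whenever their type is reported as DT_UNKNOWN. The comparisons must be != 0: with == 0 joined by &&, the condition could never be true, because a name cannot equal both "." and "..". Any other sub-directory reported as DT_UNKNOWN will still be counted.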
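
Counting every DT_UNKNOWN entry as a file can overcount, because a file system that does not fill in d_type reports directories as DT_UNKNOWN too. A more cautious sketch, assuming a POSIX system where lstat() is available, asks lstat() about the entries whose type is unknown. The function name count_regular_files and the 4096-byte path buffer are choices made for illustration, not part of the original code.

```c
#define _DEFAULT_SOURCE   /* glibc: expose d_type, DT_* and lstat in strict -std modes */
#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>

/*
 * Count the regular files directly inside path.
 * Returns -1 if the directory cannot be opened.
 * count_regular_files is a hypothetical name used for illustration.
 */
int count_regular_files(const char *path)
{
    DIR *dirp = opendir(path);
    if (dirp == NULL)
        return -1;

    int count = 0;
    struct dirent *entry;
    while ((entry = readdir(dirp)) != NULL)
    {
        if (entry->d_type == DT_REG)
        {
            ++count;
        }
        else if (entry->d_type == DT_UNKNOWN)
        {
            /* The file system did not report a type: ask lstat() instead. */
            char full[4096];   /* assumed large enough; longer paths are skipped */
            struct stat sb;
            int n = snprintf(full, sizeof full, "%s/%s", path, entry->d_name);
            if (n > 0 && (size_t)n < sizeof full
                && lstat(full, &sb) == 0 && S_ISREG(sb.st_mode))
            {
                ++count;
            }
        }
    }
    closedir(dirp);
    return count;
}
```

Because "." and ".." are directories, S_ISREG() rejects them, so no strcmp() checks are needed in this version. lstat() is used rather than stat() so that symbolic links are not followed, which matches what DT_REG means for entries whose type is reported.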



Source: https://stackoverflow.com/questions/20979727/dirent-not-working-with-unicode
