How to use os.walk to only list text files

Posted by 放荡痞女 on 2019-12-06 15:37:15

You can use Python's mimetypes library to check whether a file is a plaintext file.

import os
import mimetypes

for dirpath, dirnames, filenames in os.walk('/path/to/directory'):
    for filename in filenames:
        if mimetypes.guess_type(filename)[0] == 'text/plain':
            print(os.path.join(dirpath, filename))

UPDATE: Because the mimetypes library determines the file type from the filename extension alone, it is not very reliable, especially since you mentioned that some files are mislabeled or have no extension.
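To make the limitation concrete, here is a small sketch showing that mimetypes.guess_type only inspects the name and never opens the file. The helper name guessed_type and the filenames are made up for illustration; none of the files needs to exist.

```python
import mimetypes

def guessed_type(name):
    # guess_type() returns a (type, encoding) tuple; only the type matters here.
    return mimetypes.guess_type(name)[0]

# Hypothetical filenames: the guess depends on the extension only,
# so a name without an extension gives None.
for name in ['notes.txt', 'notes', 'picture.png']:
    print(name, '->', guessed_type(name))
```

A binary file that happens to be named something.txt would therefore still be reported as text/plain.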

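For those cases you can use the magic library (the python-magic package, a wrapper around libmagic, which unfortunately is not in the standard library). It inspects the file contents rather than the name.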

import os
import magic

mime = magic.Magic(mime=True)
for dirpath, dirnames, filenames in os.walk('/path/to/directory'):
    for filename in filenames:
        fullpath = os.path.join(dirpath, filename)
        if mime.from_file(fullpath) == 'text/plain':
            print(fullpath)

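UPDATE 2: The above solution won't catch files you would otherwise consider "plaintext" (e.g. XML files, source files, etc.), because their MIME type is not exactly text/plain. The following solution checks libmagic's human-readable description for the word "text" instead, and should work in those cases: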

import os
import magic

for dirpath, dirnames, filenames in os.walk('/path/to/directory'):
    for filename in filenames:
        fullpath = os.path.join(dirpath, filename)
        if 'text' in magic.from_file(fullpath):
            print(fullpath)

Let me know if any of these work for you.

A pretty good heuristic is to look for null bytes at the beginning of the file. Text files typically don't contain any, while binary files usually contain many. The code below checks that the first 1024 bytes contain no null bytes; you can adjust how much of the file to read. Note that text encoded as UTF-16 or UTF-32 does contain null bytes, so this check would reject it.

#!python3
import os

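def textfiles(root):
    """Yield paths under root whose first 1024 bytes contain no null byte."""
    for path, dirs, files in os.walk(root):
        for file in files:
            fullname = os.path.join(path, file)
            # Read in binary mode so the raw bytes can be inspected.
            with open(fullname, 'rb') as f:
                data = f.read(1024)
            # Iterating over bytes yields ints, so test for the integer 0.
            if 0 not in data:
                yield fullname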

for file in textfiles('.'):
    print(file)
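
If some files in the tree cannot be opened (broken symlinks, permission problems), the generator above will raise an exception partway through the walk. Here is a minimal sketch of a more defensive variant, assuming you simply want to skip such files. The names looks_like_text and blocksize are introduced here for illustration and are not part of the original answer.

```python
import os

def looks_like_text(path, blocksize=1024):
    """Return True if the first `blocksize` bytes of path contain no null byte."""
    try:
        with open(path, 'rb') as f:
            return b'\0' not in f.read(blocksize)
    except OSError:
        # Unreadable file (e.g. broken symlink or permission denied): skip it.
        return False

def textfiles(root):
    """Yield paths under root that pass the null-byte check."""
    for path, dirs, files in os.walk(root):
        for name in files:
            fullname = os.path.join(path, name)
            if looks_like_text(fullname):
                yield fullname
```

As in the original, an empty file contains no null bytes and is therefore reported as text.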