Directory file size calculation - how to make it faster?

没有蜡笔的小新 2020-12-08 11:10

Using C#, I am finding the total size of a directory. The logic is this: get the files inside the folder and sum up their sizes, find any subdirectories, and then repeat the same process for each subdirectory, adding its total to the running sum. This works, but it is slow on folders with tens of thousands of files. How can I make it faster?
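
(For reference, the straightforward recursive version of this logic looks roughly like the sketch below; the method name is only illustrative.)

    using System.IO;

    static long GetDirectorySize(string path)
    {
        long size = 0;

        // Sum the files directly inside this folder...
        foreach (string file in Directory.GetFiles(path))
            size += new FileInfo(file).Length;

        // ...then recurse into each subdirectory and add its total.
        foreach (string subDir in Directory.GetDirectories(path))
            size += GetDirectorySize(subDir);

        return size;
    }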

8 Answers
  • 2020-12-08 11:22

    With tens of thousands of files, you're not going to win with a head-on assault. You need to try to be a bit more creative with the solution. With that many files you could probably even find that in the time it takes you to calculate the size, the files have changed and your data is already wrong.

    So, you need to move the load to somewhere else. For me, the answer would be to use System.IO.FileSystemWatcher and write some code that monitors the directory and updates an index.

    It should take only a short time to write a Windows Service that can be configured to monitor a set of directories and write the results to a shared output file. You can have the service recalculate the file sizes on startup, but then just monitor for changes whenever a Create/Delete/Changed event is fired by the System.IO.FileSystemWatcher. The benefit of monitoring the directory is that you are only interested in small changes, which means that your figures have a higher chance of being correct (remember all data is stale!)

    The only thing to look out for is that you will have multiple processes trying to access the resulting output file at the same time, so make sure that you take that into account.
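
    A rough sketch of that idea (class and member names are made up; a real service would also handle the watcher's Error event, access-denied paths, and periodic rescans):

        using System;
        using System.Collections.Concurrent;
        using System.IO;
        using System.Threading;

        // Keeps a running total for one directory tree: one full scan at startup,
        // then only the files the watcher reports are re-examined.
        public class DirectorySizeIndex : IDisposable
        {
            private readonly ConcurrentDictionary<string, long> _sizes =
                new ConcurrentDictionary<string, long>(StringComparer.OrdinalIgnoreCase);
            private readonly FileSystemWatcher _watcher;
            private long _total;

            public long Total { get { return Interlocked.Read(ref _total); } }

            public DirectorySizeIndex(string root)
            {
                // Seed the index with a full (slow, one-time) scan.
                foreach (string file in Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories))
                    Update(file);

                _watcher = new FileSystemWatcher(root) { IncludeSubdirectories = true };
                _watcher.Created += (s, e) => Update(e.FullPath);
                _watcher.Changed += (s, e) => Update(e.FullPath);
                _watcher.Deleted += (s, e) => Remove(e.FullPath);
                _watcher.Renamed += (s, e) => { Remove(e.OldFullPath); Update(e.FullPath); };
                _watcher.EnableRaisingEvents = true;
            }

            private void Update(string path)
            {
                try
                {
                    long newSize = new FileInfo(path).Length;   // throws for directories; caught below
                    long oldSize;
                    _sizes.TryGetValue(path, out oldSize);      // 0 if not tracked yet
                    _sizes[path] = newSize;
                    Interlocked.Add(ref _total, newSize - oldSize);
                }
                catch (IOException) { }                         // vanished, locked, or a directory
                catch (UnauthorizedAccessException) { }
            }

            private void Remove(string path)
            {
                long oldSize;
                if (_sizes.TryRemove(path, out oldSize))
                    Interlocked.Add(ref _total, -oldSize);
            }

            public void Dispose() { _watcher.Dispose(); }
        }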

  • 2020-12-08 11:25

    I gave up on the .NET implementations (for performance reasons) and used the Native function GetFileAttributesEx(...)

    Try this:

    using System;
    using System.Runtime.InteropServices;

    [StructLayout(LayoutKind.Sequential)]
    public struct WIN32_FILE_ATTRIBUTE_DATA
    {
        public uint fileAttributes;
        public System.Runtime.InteropServices.ComTypes.FILETIME creationTime;
        public System.Runtime.InteropServices.ComTypes.FILETIME lastAccessTime;
        public System.Runtime.InteropServices.ComTypes.FILETIME lastWriteTime;
        public uint fileSizeHigh;   // upper 32 bits of the 64-bit size
        public uint fileSizeLow;    // lower 32 bits of the 64-bit size
    }

    public enum GET_FILEEX_INFO_LEVELS
    {
        GetFileExInfoStandard,
        GetFileExMaxInfoLevel
    }

    public class NativeMethods
    {
        // Retrieves attributes and size in a single call, without opening the file.
        [DllImport("KERNEL32.dll", CharSet = CharSet.Auto, SetLastError = true)]
        public static extern bool GetFileAttributesEx(string path, GET_FILEEX_INFO_LEVELS level, out WIN32_FILE_ATTRIBUTE_DATA data);
    }
    

    Now simply do the following:

    WIN32_FILE_ATTRIBUTE_DATA data;
    if (NativeMethods.GetFileAttributesEx("[your path]", GET_FILEEX_INFO_LEVELS.GetFileExInfoStandard, out data))
    {
        // Combine the high and low halves into a 64-bit length
        // (cast before shifting, and use | rather than &).
        long size = ((long)data.fileSizeHigh << 32) | data.fileSizeLow;
    }
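
    To apply this to the original question (the total size of a whole tree), one hedged option is to keep Directory.EnumerateFiles for the enumeration and only replace the per-file FileInfo lookup; DirectorySize below is a made-up helper name:

    static long DirectorySize(string root)
    {
        long total = 0;
        // EnumerateFiles streams results instead of building one big array first.
        foreach (string file in System.IO.Directory.EnumerateFiles(root, "*", System.IO.SearchOption.AllDirectories))
        {
            WIN32_FILE_ATTRIBUTE_DATA data;
            if (NativeMethods.GetFileAttributesEx(file, GET_FILEEX_INFO_LEVELS.GetFileExInfoStandard, out data))
                total += ((long)data.fileSizeHigh << 32) | data.fileSizeLow;
        }
        return total;
    }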
    
  • 2020-12-08 11:26

    Performance will suffer using any method when scanning a folder with tens of thousands of files.

    • Using the Windows API FindFirstFile... and FindNextFile... functions provides the fastest access.

    • Due to marshalling overhead, even if you use the Windows API functions, performance will not increase. The framework already wraps these API functions, so there is no sense doing it yourself.

    • How you handle the results for any file access method determines the performance of your application. For instance, even if you use the Windows API functions, updating a list-box is where performance will suffer.

    • You cannot compare the execution speed to Windows Explorer. From my experimentation, I believe Windows Explorer reads directly from the file-allocation-table in many cases.

    • I do know that the fastest access to the file system is the DIR command. You cannot compare performance to this command. It definitely reads directly from the file-allocation-table (probably using the BIOS).

    • Yes, the operating-system caches file access.

    Suggestions

    • I wonder if BackupRead would help in your case?

    • What if you shell out to DIR, capture its output, and then parse it? (You are not really parsing, because each DIR line is fixed-width, so it is just a matter of calling Substring.) A hedged sketch of this idea follows the list.

    • What if you shell out to DIR /B > NUL on a background thread and then run your program? While DIR is running, you will benefit from the cached file access.
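
    As a rough illustration of the DIR idea, the sketch below parses only the final summary line rather than every row. It assumes an English-locale DIR output, and /-c drops the thousands separators so the byte count is a plain number:

        using System.Diagnostics;
        using System.Text.RegularExpressions;

        static long DirectorySizeViaDir(string path)
        {
            var psi = new ProcessStartInfo("cmd.exe", "/c dir /s /a:-d /-c \"" + path + "\"")
            {
                UseShellExecute = false,        // required to redirect output
                RedirectStandardOutput = true,
                CreateNoWindow = true
            };

            using (Process proc = Process.Start(psi))
            {
                string output = proc.StandardOutput.ReadToEnd();
                proc.WaitForExit();

                // With /s, the last "n File(s) m bytes" line is the grand total.
                MatchCollection matches = Regex.Matches(output, @"File\(s\)\s+(\d+)\s+bytes");
                if (matches.Count == 0)
                    return -1;                  // empty tree, access denied, or non-English output
                return long.Parse(matches[matches.Count - 1].Groups[1].Value);
            }
        }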

  • 2020-12-08 11:28

    I fiddled with it for a while, trying to parallelize it, and surprisingly it sped up here on my machine (up to 3 times on a quad-core). I don't know if it is valid in all cases, but give it a try...

    .NET 4.0 code (or use 3.5 with the Task Parallel Library):

        private static long DirSize(string sourceDir, bool recurse)
        {
            long size = 0;
            string[] fileEntries = Directory.GetFiles(sourceDir);
    
            foreach (string fileName in fileEntries)
            {
                Interlocked.Add(ref size, (new FileInfo(fileName)).Length);
            }
    
            if (recurse)
            {
                string[] subdirEntries = Directory.GetDirectories(sourceDir);
    
                Parallel.For<long>(0, subdirEntries.Length, () => 0, (i, loop, subtotal) =>
                {
                    // Skip reparse points (junctions/symlinks) to avoid counting data twice,
                    // but keep the subtotal this thread has already accumulated.
                    if ((File.GetAttributes(subdirEntries[i]) & FileAttributes.ReparsePoint) != FileAttributes.ReparsePoint)
                    {
                        subtotal += DirSize(subdirEntries[i], true);
                    }
                    return subtotal;
                },
                    (x) => Interlocked.Add(ref size, x)
                );
            }
            return size;
        }
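
    For reference, calling it looks like this (the path is just a placeholder):

        long totalBytes = DirSize(@"C:\SomeLargeFolder", true);
        Console.WriteLine("Total: {0:N0} bytes", totalBytes);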
    
  • 2020-12-08 11:31

    I don't think it will change a lot, but it might go a little faster if you use the API functions FindFirstFile and FindNextFile to do it.

    I don't think there's any really quick way of doing it, however. For comparison purposes you could try doing dir /a /x /s > dirlist.txt and listing the directory in Windows Explorer to see how fast they are, but I think they will be similar to FindFirstFile.

    PInvoke has a sample of how to use the API.
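
    A hedged sketch of that approach (the declarations follow the usual P/Invoke pattern; the find data already carries the file size, so no extra per-file call is needed):

        using System;
        using System.IO;
        using System.Runtime.InteropServices;

        static class FastDirSize
        {
            private const uint FILE_ATTRIBUTE_DIRECTORY = 0x10;
            private const uint FILE_ATTRIBUTE_REPARSE_POINT = 0x400;
            private static readonly IntPtr INVALID_HANDLE_VALUE = new IntPtr(-1);

            [StructLayout(LayoutKind.Sequential, CharSet = CharSet.Unicode)]
            private struct WIN32_FIND_DATA
            {
                public uint dwFileAttributes;
                public System.Runtime.InteropServices.ComTypes.FILETIME ftCreationTime;
                public System.Runtime.InteropServices.ComTypes.FILETIME ftLastAccessTime;
                public System.Runtime.InteropServices.ComTypes.FILETIME ftLastWriteTime;
                public uint nFileSizeHigh;
                public uint nFileSizeLow;
                public uint dwReserved0;
                public uint dwReserved1;
                [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)]
                public string cFileName;
                [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 14)]
                public string cAlternateFileName;
            }

            [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
            private static extern IntPtr FindFirstFile(string lpFileName, out WIN32_FIND_DATA lpFindFileData);

            [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
            private static extern bool FindNextFile(IntPtr hFindFile, out WIN32_FIND_DATA lpFindFileData);

            [DllImport("kernel32.dll", SetLastError = true)]
            private static extern bool FindClose(IntPtr hFindFile);

            public static long DirectorySize(string dir)
            {
                WIN32_FIND_DATA data;
                IntPtr handle = FindFirstFile(Path.Combine(dir, "*"), out data);
                if (handle == INVALID_HANDLE_VALUE)
                    return 0;                   // empty, missing, or access denied

                long total = 0;
                try
                {
                    do
                    {
                        if ((data.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) != 0)
                        {
                            // Skip ".", ".." and reparse points (junctions/symlinks).
                            if (data.cFileName != "." && data.cFileName != ".." &&
                                (data.dwFileAttributes & FILE_ATTRIBUTE_REPARSE_POINT) == 0)
                            {
                                total += DirectorySize(Path.Combine(dir, data.cFileName));
                            }
                        }
                        else
                        {
                            // The size comes straight from the find data - no extra call per file.
                            total += ((long)data.nFileSizeHigh << 32) | data.nFileSizeLow;
                        }
                    } while (FindNextFile(handle, out data));
                }
                finally
                {
                    FindClose(handle);
                }
                return total;
            }
        }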

  • 2020-12-08 11:33

    Hard disks are an interesting beast - sequential access (reading a big contiguous file, for example) is super zippy, figure 80 megabytes/sec. However, random access is very slow. This is what you're bumping into - recursing into the folders won't read much data (in terms of quantity), but will require many random reads. The reason you're seeing zippy performance the second time around is because the MFT is still in RAM (you're correct on the caching thought).

    The best mechanism I've seen to achieve this is to scan the MFT yourself. The idea is you read and parse the MFT in one linear pass building the information you need as you go. The end result will be something much closer to 15 seconds on a HD that is very full.

    Some good reading:

    • NTFSInfo.exe - http://technet.microsoft.com/en-us/sysinternals/bb897424.aspx
    • Windows Internals - http://www.amazon.com/Windows%C2%AE-Internals-Including-Windows-PRO-Developer/dp/0735625301/ref=sr_1_1?ie=UTF8&s=books&qid=1277085832&sr=8-1

    FWIW: this method is very complicated as there really isn't a great way to do this in Windows (or any OS I'm aware of) - the problem is that the act of figuring out which folders/files are needed requires much head movement on the disk. It'd be very tough for Microsoft to build a general solution to the problem you describe.
