Insert bytes into the middle of a file (in a Windows filesystem) without reading the entire file (using the File Allocation Table)?

盖世英雄少女心 2020-12-13 19:04

I need a way to insert some file clusters into the middle of a file to insert some data.

Normally, I would just read the entire file and write it back out again with

8 answers
  • 2020-12-13 19:20

    You don't need to (and probably can't) modify the file allocation table. You can achieve the same effect with a filter driver or a stackable FS. Assume a cluster size of 4K. I am only writing out the design, for reasons I explain at the end.

    1. Creation of a new file will write a layout-map of the file in a header. The header holds the number of entries followed by the list of entries, and is the same size as a cluster. For simplicity let the header be of fixed size: one 4K cluster of DWORD entries. For example, for a 20KB file the header may read: [DWORD:5][DWORD:1][DWORD:2][DWORD:3][DWORD:4][DWORD:5]. This file has had no insertions so far.

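    2. Suppose someone inserts a cluster after cluster 3. You can add it to the end of the file and change the layout-map to: [6][1][2][3][5][6][4] (six entries now: the appended cluster 6 takes position 4, and the old clusters 4 and 5 move to positions 5 and 6).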

    3. Suppose someone needs to seek to cluster 4. You will need to access the layout-map, calculate the offset, and then seek to it. Its entry in the map is 5, so it comes after the first 4 clusters and starts at 16K.

    4. Suppose someone reads or writes serially to the file. The reads and writes will have to map the same way.

    5. Suppose the header has only one more entry left: we will need to extend it by having a pointer to a new cluster at the end of the file using the same format as the other pointers above. To know that we have more than one cluster all we need to do is to look at the number of items and calculate the number of clusters that are needed to store it.

    You can implement all of the above using a filter driver on Windows or a stackable file-system (LKM) on Linux. Implementing the basic level of functionality is on the level of a grad-school mini project in difficulty. Getting this to work as a commercial filesystem can be quite challenging especially since you don't want to affect IO speeds.

    Note that the above filter will not be affected by any change in disk layout / defragmentation etc. You can also defragment your own file if you think it will be helpful.
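    The layout-map idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and method names are hypothetical, the header is assumed to occupy cluster 0 of the backing file, and the sketch stores the inverse of the list shown in step 2 (logical position to physical cluster), since that is the direction a seek needs. `positions()` converts back to the notation used in the answer.

```python
CLUSTER = 4096  # cluster size assumed in the answer


class LayoutMap:
    """Hypothetical in-memory layout map: logical position -> physical cluster."""

    def __init__(self, cluster_count):
        # Assumption: data clusters are numbered from 1; the header is cluster 0.
        self.order = list(range(1, cluster_count + 1))

    def insert_after(self, logical_cluster):
        """Append a new physical cluster and splice it in after logical_cluster."""
        new_physical = len(self.order) + 1
        self.order.insert(logical_cluster, new_physical)
        return new_physical

    def positions(self):
        """Logical position of each physical cluster (the answer's notation)."""
        return [self.order.index(p) + 1 for p in range(1, len(self.order) + 1)]

    def physical_offset(self, logical_offset):
        """Translate a logical byte offset into an offset in the backing file."""
        index, within = divmod(logical_offset, CLUSTER)
        return self.order[index] * CLUSTER + within


layout = LayoutMap(5)                 # 20 KB file, no insertions yet
new_cluster = layout.insert_after(3)  # order is now [1, 2, 3, 6, 4, 5]
```

    A real filter driver would also have to persist the map in the header cluster and handle reads that span cluster boundaries; both are left out here.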

  • 2020-12-13 19:27

    Edited - another approach - how about switching to Mac for this task? They have superior editing capabilities, with automation capabilities!

    Edited - the original specs suggested the file was being modified a lot; instead it is modified once. As others have pointed out, I suggest doing the operation in the background: copy to a new file, delete the old file, rename the new file to the old name.

    I would abandon this approach. A database is what you're looking for./YR
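    The copy-and-rename suggestion above could look roughly like the following. It is a minimal sketch, not a tuned implementation: `insert_bytes` and its parameters are names made up for this example, and running it in the background (a thread or a separate process) is left to the caller.

```python
import os
import shutil
import tempfile


def insert_bytes(path, offset, payload, chunk_size=1 << 20):
    """Insert payload at offset by streaming into a temp file, then swapping it in."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the original so the final rename stays on one volume.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as dst, open(path, "rb") as src:
            remaining = offset
            while remaining > 0:  # copy the part before the insertion point
                chunk = src.read(min(chunk_size, remaining))
                if not chunk:
                    raise ValueError("offset is past the end of the file")
                dst.write(chunk)
                remaining -= len(chunk)
            dst.write(payload)                        # the inserted bytes
            shutil.copyfileobj(src, dst, chunk_size)  # the rest of the file
        os.replace(tmp_path, path)  # replaces the old file with the new one
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

    `os.replace` overwrites the destination, so the separate delete step is folded into the rename. The whole file is still read and written once, which is exactly the cost the question wants to avoid; the sketch only moves that cost out of the user's way.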

  • 2020-12-13 19:28

    It all really depends on what the original problem is, that is, what you're trying to achieve. Modification of a FAT / NTFS table is not the problem, it's a solution to your problem -- potentially elegant and efficient, but more likely highly dangerous and inappropriate. You mentioned that you have no control over the users' systems where it will be used, so presumably for at least some of them the administrator would object to hacking into the file system internals.

    Anyways, let's get back to the problem. Given the incomplete information, several use cases may be imagined, and the solution will be either easy or difficult depending on the use case.

    1. If you know that after the edit the file won't be needed for some time, then saving the edit in half a second is easy -- just close the window and let the application finish saving in the background, even if it takes half an hour. I know this sounds dumb, but this is a frequent use case -- once you finish editing your file, you save it, close the program, and you don't need that file anymore for a long time.

    2. Unless you do. Maybe the user decides to edit some more, or maybe another user comes along. In both cases your application can easily detect that the file is in the process of being saved to hard disk (for example, you may keep a hidden guard file around while the main file is being saved). In this case you would open the file as-is (partially saved), but present the user with a customized view of the file which makes it appear as if the file is in its final state. After all, you have all the information about which chunks of the file have to be moved where.

    3. Unless the user needs to open the file immediately in another editor (this is not a very common case, especially for a very specialized file format, but then who knows). If so, do you have access to the source code of that other editor? Or can you talk to the developers of that other editor and persuade them to treat the incompletely saved file as if it was in the final state (it's not that hard -- all it takes is to read the offset information from the guard file). I would imagine the developers of that other editor are equally frustrated with long save times and would gladly embrace your solution as it would help their product.

    4. What else could we have? Maybe the user wants to immediately copy or move the file somewhere else. Microsoft probably won't change Windows Explorer for your benefit. In that case you would either need to implement a UMDF driver, or simply prevent the user from doing so (for example rename the original file and hide it, leaving a blank placeholder in its place; when the user tries to copy the file, at least they'll know something went wrong).

    5. Another possibility, which doesn't fit nicely in the above hierarchy 1-4, comes up if you know beforehand which files will be edited. In that case you can "pre-sparse" the file by inserting random gaps uniformly along the length of the file. This works because of the special nature of your file format that you mentioned: there can be gaps of no data, provided that the links correctly point to the next data chunks. If you know which files will be edited (not an unreasonable assumption -- how many 10 GB files lie around your hard drive?) you "inflate" the file before the user starts editing it (say, the night before), and then just move around these smaller chunks of data when you need to insert new data. This of course also relies on the assumption that you don't have to insert TOO much.

    In any case, there's always more than one answer depending on what your users actually want. But my advice comes from a designer's perspective, not a programmer's.
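    The "customized view" from case 2 can be sketched as a read function that overlays the pending insertions on the old file contents. This is a sketch under assumptions: `read_logical` is a hypothetical helper, and the list of pending insertions is passed in directly rather than parsed from a guard file, since the answer does not specify any guard-file format.

```python
import io


def read_logical(base, inserts, start, size):
    """Read size bytes at logical offset start, as if the inserts were already applied.

    base    -- seekable binary file holding the old, unmodified contents
    inserts -- list of (offset_in_base, payload) pairs, sorted by offset
    """
    base_len = base.seek(0, io.SEEK_END)
    # Build pieces as (offset_in_base, length, payload); payload is None for old data.
    pieces, cursor = [], 0
    for offset, payload in inserts:
        pieces.append((cursor, offset - cursor, None))
        pieces.append((None, len(payload), payload))
        cursor = offset
    pieces.append((cursor, base_len - cursor, None))

    out, logical = bytearray(), 0
    for base_off, length, payload in pieces:
        lo = max(start, logical)
        hi = min(start + size, logical + length)
        if lo < hi:  # this piece overlaps the requested range
            skip = lo - logical
            if payload is None:
                base.seek(base_off + skip)
                out += base.read(hi - lo)
            else:
                out += payload[skip:skip + (hi - lo)]
        logical += length
    return bytes(out)


old = io.BytesIO(b"HelloWorld")
pending = [(5, b", ")]
view = read_logical(old, pending, 0, 12)  # b"Hello, World"
```

    Any other program that opens the file directly still sees the old bytes, which is the limitation cases 3 and 4 are about.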

  • 2020-12-13 19:31

    Abstract question, abstract answer:

    It is certainly possible to do this in FAT and probably in most other file systems; you would essentially be fragmenting the file, rather than the more common process of defragmenting.

    FAT is organized around cluster pointers which form a chain of the cluster numbers where the data is stored: the first link is stored with the file record, the second one is stored in the allocation table at index [the first link's number], and so on. It's possible to insert another link anywhere in the chain, as long as the data you're inserting ends at a cluster boundary.
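    As a toy model of the chain described above (not code that touches a real volume), the allocation table can be treated as a list where index = cluster number and value = next cluster. The marker values and function names here are stand-ins chosen for the example.

```python
END_OF_CHAIN = -1  # stand-in for the FAT end-of-chain marker
FREE = 0           # stand-in for a free table entry


def chain(fat, first):
    """Follow the cluster chain starting at first and return it as a list."""
    clusters = []
    current = first
    while current != END_OF_CHAIN:
        clusters.append(current)
        current = fat[current]
    return clusters


def insert_after(fat, cluster, new_cluster):
    """Link new_cluster into the chain right after cluster."""
    if fat[new_cluster] != FREE:
        raise ValueError("cluster is already allocated")
    fat[new_cluster] = fat[cluster]  # new link points at the old successor
    fat[cluster] = new_cluster       # predecessor now points at the new link


fat = [FREE] * 10
fat[2], fat[3], fat[4] = 3, 4, END_OF_CHAIN  # file occupies 2 -> 3 -> 4
insert_after(fat, 3, 7)                      # chain becomes 2 -> 3 -> 7 -> 4
```

    On a real volume the file size in the directory entry and any backup copy of the table would have to be updated as well, and the volume should not be mounted while this happens.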

    Chances are you'll have a much easier time doing this in C by finding an open source library. While it's probably possible to do that in C# with PInvoke, you won't find any good sample code floating around to get you started.

    I suspect you don't have any control over the file format (video files?); if you do, it would be much easier to design your data storage to avoid the problem in the first place.

  • 2020-12-13 19:34

    Robert, I don't think that what you want to achieve is really possible without actively manipulating file system data structures for a file system which, from the sounds of it, is mounted. I don't think I have to tell you how dangerous and unwise this sort of exercise is.

    But if you need to do it, I guess I can give you a "sketch on the back of a napkin" to get you started:

    You could leverage the "sparse file" support of NTFS to simply add "gaps" by tweaking the LCN/VCN mappings. Once you do, just open the file, seek to the new location and write your data. NTFS will transparently allocate the space and write the data in the middle of the file, where you created a hole.

    For more, look at this page about defragmentation support in NTFS for hints on how to manipulate things so that you can insert clusters in the middle of the file. By using the sanctioned API for this sort of thing, you are at least unlikely to corrupt the filesystem beyond repair, although you can still horribly hose your file, I guess.

    Get the retrieval pointers for the file that you want, split them where you need, to add as much extra space as you need, and move the file. There's an interesting chapter on this sort of thing in the Russinovich/Ionescu "Windows Internals" book (http://www.amazon.com/Windows%C2%AE-Internals-Including-Windows-Developer/dp/0735625301)
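    The "split them where you need" step can be illustrated on a plain list of extents. This sketch only models the bookkeeping: it assumes each extent is a (vcn, lcn, length) tuple in clusters, uses `None` as the LCN of a hole, and does not call any Windows API. `insert_hole` is a hypothetical helper.

```python
def insert_hole(extents, at_vcn, hole_clusters):
    """Return a new extent list with an unallocated gap opened at at_vcn.

    extents -- list of (vcn, lcn, length) tuples sorted by vcn; lcn is None for holes
    """
    result = []
    inserted = False
    for vcn, lcn, length in extents:
        if not inserted and vcn <= at_vcn < vcn + length:
            head = at_vcn - vcn
            if head:
                result.append((vcn, lcn, head))            # part before the gap
            result.append((at_vcn, None, hole_clusters))   # the gap itself
            tail_lcn = None if lcn is None else lcn + head
            result.append((at_vcn + hole_clusters, tail_lcn, length - head))
            inserted = True
        elif inserted:
            # Everything after the gap keeps its disk location but shifts in the file.
            result.append((vcn + hole_clusters, lcn, length))
        else:
            result.append((vcn, lcn, length))
    if not inserted:
        raise ValueError("at_vcn is outside the file")
    return result


extents = [(0, 100, 8), (8, 500, 4)]
with_gap = insert_hole(extents, 3, 2)
# -> [(0, 100, 3), (3, None, 2), (5, 103, 5), (10, 500, 4)]
```

    Note that only the virtual cluster numbers after the gap change; no data clusters move on disk, which is the whole point of the approach.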

  • 2020-12-13 19:43

    Do you understand that it's practically impossible to insert non-aligned data at non-aligned places? (Maybe some hack based on compression can be used.) I think that you do.

    The "easiest" solution is to create the sparse run records and then write over the sparse ranges.

    1. Do something with the NTFS cache. It's best to perform the operations on the offline/unmounted drive.
    2. Get the file record (@JerKimball's answer sounds helpful, but stops short of it). There may be problems if the file record has overflowed and some of its attributes are stored elsewhere.
    3. Get to the file's data run list. The data run concept and format is described here (http://inform.pucp.edu.pe/~inf232/Ntfs/ntfs_doc_v0.5/concepts/data_runs.html) and some other NTFS format data can be seen on the adjacent pages.
    4. Iterate through data runs, accumulating the file length, to find the correct insertion spot.
    5. You'll most probably find that your insertion point is in the middle of a run. You'll need to split the run, which is not hard. (Just store away the two resulting runs for now.)
    6. Creating a sparse run record is very easy. It's just the run length (in clusters) preceded by a header byte that holds the byte size of the length field in its lower 4 bits (the higher 4 bits should be zero to indicate a sparse run).
    7. Now you need to calculate how many additional bytes you have to insert in the data runs list, somehow make way for them and do the insertion/replacement.
    8. Then you need to fix the file size attribute to make it consistent with the runs.
    9. Finally you can mount the drive and write the inserted information over the sparse spots.
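    Steps 4 and 6 can be sketched on a raw byte string. The decoder below follows the data run layout described on the linked page (header byte with the length size in the low nibble and the offset size in the high nibble, offsets signed and relative to the previous run); the function names are made up for the example, and nothing here reads or writes a real volume.

```python
def decode_runs(data):
    """Decode a data run list into (length, lcn) pairs; lcn is None for sparse runs."""
    runs, pos, lcn = [], 0, 0
    while pos < len(data) and data[pos] != 0:  # a zero header byte ends the list
        header = data[pos]
        len_size, off_size = header & 0x0F, header >> 4
        pos += 1
        length = int.from_bytes(data[pos:pos + len_size], "little")
        pos += len_size
        if off_size == 0:
            runs.append((length, None))  # sparse run: no offset field
        else:
            # The offset is signed and relative to the previous non-sparse run.
            lcn += int.from_bytes(data[pos:pos + off_size], "little", signed=True)
            runs.append((length, lcn))
            pos += off_size
    return runs


def encode_sparse_run(length):
    """Encode a sparse run of length clusters: header byte, then the length field."""
    # Assumption: keep the top bit of the length field clear, in case a reader
    # treats the field as signed.
    len_size = (length.bit_length() + 8) // 8
    return bytes([len_size]) + length.to_bytes(len_size, "little")


runs = decode_runs(bytes.fromhex("11 30 20 01 60 11 10 30 00"))
# -> [(0x30, 0x20), (0x60, None), (0x10, 0x50)]
```

    Finding the insertion spot (step 4) is then a matter of summing the decoded lengths until the target cluster is reached.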