Indexing columns in a csv file

Question

I have a large CSV file in which each row has several columns, such as ID, username, email, job position, etc.

I want to search for a row by exact matches (username == David), or wildcard (jobPosition == %admin).

I want to index columns in this file to make searches faster, but I don't know which algorithm I should choose (especially for wildcards).

Answer 1:

You can index the file, but you need to read it as a binary file instead of a text file. Use a block size of 128 or 256 bytes. To build the index, scan the file looking for the beginning of each record, then create an index file like this:

  key, 0, 0
   ........
   ........
  key, block, offset

key is the key you are indexing on; it can be a composite key. block is the number of the block in which the record starts (be aware that a record can span more than one block), and offset is the position within that block: a number between 0 and 127, assuming a 128-byte block size. To retrieve a record, look up the key in the index file (using binary search, of course) and then use the block and offset to access the record directly.
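A minimal Python sketch of this idea, under some assumptions that are not in the answer: the index is kept as an in-memory sorted list rather than a separate index file, the file is UTF-8 with a header row, and no quoted field contains a line break. The names build_index and lookup are made up for illustration.

```python
import bisect
import csv

BLOCK_SIZE = 128  # bytes per block; 256 works the same way


def parse_line(raw):
    """Decode one raw record and split it into fields (assumes UTF-8)."""
    return next(csv.reader([raw.decode("utf-8")]))


def build_index(path, key_column):
    """Scan the file in binary mode and return sorted (key, block, offset) entries."""
    entries = []
    with open(path, "rb") as f:
        header = parse_line(f.readline())
        col = header.index(key_column)
        while True:
            pos = f.tell()  # byte position where this record starts
            raw = f.readline()
            if not raw:
                break
            if not raw.strip():
                continue  # skip blank lines
            block, offset = divmod(pos, BLOCK_SIZE)
            entries.append((parse_line(raw)[col], block, offset))
    entries.sort()
    return entries


def lookup(path, index, key):
    """Binary-search the index, then seek directly to each matching record."""
    rows = []
    # (key,) sorts before every (key, block, offset), so this finds the first match.
    i = bisect.bisect_left(index, (key,))
    with open(path, "rb") as f:
        while i < len(index) and index[i][0] == key:
            _, block, offset = index[i]
            f.seek(block * BLOCK_SIZE + offset)
            rows.append(parse_line(f.readline()))
            i += 1
    return rows
```

Because readline() reads up to the line terminator regardless of block boundaries, a record that spans two blocks is still read whole; only its starting position is stored.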

You can also create multiple index files at the same time if you need to search by different criteria.

Having a distinct end-of-line character would help, but CR-LF will do. If you use CR-LF, be aware that the CR can fall at the very end of one block while the LF falls at the very beginning of the next. Once you have created the index file (or files), sort it by key and you are good to go.

Alternatively, if your software allows fast moving of memory blocks (like memmove in C++), you can use insertion sort in combination with binary search. That way, when you finish building your index(es), they are already sorted. This is particularly efficient if the index entries are being added from a file that is being captured from a slow input device (e.g. a keyboard). If you are managing large numbers of records, consider using a B-Tree structure for your index(es).
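In Python, the standard bisect module gives a rough equivalent of "insertion sort plus binary search": bisect.insort finds the insertion point by binary search and the list shifts the later entries to make room. This is only a sketch; add_entry and the sample entries are invented for illustration.

```python
import bisect


def add_entry(index, key, block, offset):
    """Insert one (key, block, offset) entry at its sorted position."""
    bisect.insort(index, (key, block, offset))


index = []
# Entries arrive in arbitrary order, e.g. as records are captured one by one.
for key, block, offset in [("David", 0, 19), ("Alice", 0, 42), ("Carol", 1, 5)]:
    add_entry(index, key, block, offset)

# The index is sorted after every insertion, so it can be searched at any time.
sorted_keys = [entry[0] for entry in index]
```

Each insertion still moves, on average, half of the list, which is why a B-Tree becomes attractive once the index is large.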

This scheme allows your CSV database to accept record additions, deletions and updates. Additions are made at the end of the file. To delete a record, overwrite its first character with a unique marker such as 0x0 and, of course, delete its entry from the index file. An update is a deletion followed by adding the updated record at the end of the file.
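These three operations could look roughly like the following. As assumptions for the sketch, the index is a sorted in-memory list of (key, block, offset) tuples instead of an index file, records are passed in as newline-terminated bytes, and the caller supplies the key that matches the record. All function names are hypothetical.

```python
import bisect
import os

BLOCK_SIZE = 128
TOMBSTONE = b"\x00"  # marker byte written over the first character of a deleted record


def append_record(path, index, key, record):
    """Append a newline-terminated record and index its starting position."""
    with open(path, "r+b") as f:
        pos = f.seek(0, os.SEEK_END)  # additions always go at the end of the file
        f.write(record)
    block, offset = divmod(pos, BLOCK_SIZE)
    bisect.insort(index, (key, block, offset))


def delete_record(path, index, key):
    """Tombstone every record with this key and drop its index entries."""
    lo = bisect.bisect_left(index, (key,))
    hi = lo
    with open(path, "r+b") as f:
        while hi < len(index) and index[hi][0] == key:
            _, block, offset = index[hi]
            f.seek(block * BLOCK_SIZE + offset)
            f.write(TOMBSTONE)
            hi += 1
    del index[lo:hi]
    return hi - lo  # number of records deleted


def update_record(path, index, key, new_record):
    """An update is a delete followed by an append of the new version."""
    delete_record(path, index, key)
    append_record(path, index, key, new_record)
```

Anything that later scans the whole file, such as an index rebuild, has to skip records whose first byte is the marker.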

This will create some need for garbage collection in your database, but most DBMSs, if not all, do the same. Periodically rebuild your index and get rid of the deleted records.

It is not that complicated, is it? Agreed, you may not succeed on the first try. But who does? Programming is not for the faint of heart.

Hope this helps.

Answer 2:

Short version. Load the CSV into SQLite, and then query that. You can learn about SQLite at https://www.sqlite.org/, but I would suggest looking for a library in your language that already has it.

Long version.

Before you get done figuring out how to write your code, you can load the data into SQLite, index it, query it, and be done. This is even true if you do not currently know how to write SQL. (Trust me, I know the algorithms you need, and learning them is harder than learning SQL.)

Before you're done actually writing the code, your alternate self will have done several other projects.

After you write the code, then you get to debug it. I guarantee you won't successfully debug it. Meanwhile in the alternate universe you've continued building more projects.

Once you've debugged your code and put it into production (with unknown bugs still there), you have the win of skipping the initial loading step. Meanwhile your alternate universe self doesn't even have to think about the fact that SQLite was implemented in very efficient C, with an optimizer that may not match a "real" database, but is better than anything you can roll on your own.

Given this, you really should consider using SQLite.
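For a sense of how little code this takes, here is a sketch using Python's built-in sqlite3 and csv modules. The table name people, the column names, and the assumption that the CSV has exactly four columns with a header row are illustrative, not taken from the question's real file.

```python
import csv
import sqlite3


def load_csv(conn, csv_path):
    """Create a table, bulk-insert the CSV rows, and index the username column."""
    conn.execute(
        "CREATE TABLE people (id INTEGER, username TEXT, email TEXT, job_position TEXT)"
    )
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        conn.executemany("INSERT INTO people VALUES (?, ?, ?, ?)", reader)
    conn.execute("CREATE INDEX idx_people_username ON people (username)")
    conn.commit()


def find_by_username(conn, name):
    """Exact match; this query can use the index on username."""
    return conn.execute(
        "SELECT * FROM people WHERE username = ?", (name,)
    ).fetchall()


def find_by_job_suffix(conn, suffix):
    """Wildcard match in the style of jobPosition == %admin."""
    return conn.execute(
        "SELECT * FROM people WHERE job_position LIKE ?", ("%" + suffix,)
    ).fetchall()
```

Note that a LIKE pattern with a leading % cannot use an ordinary index, so find_by_job_suffix scans the whole table; the exact-match query is the one that benefits from the index.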

PS: https://www.sqlite.org/fts3.html explains how to do the wildcard match in SQLite.
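A small sketch of what a full-text query looks like from Python, assuming the SQLite build behind the sqlite3 module was compiled with FTS3 support (common, but not guaranteed). The table name jobs and the sample rows are invented. FTS prefix queries match words that start with the given text, so this covers admin* rather than the %admin form from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Requires an SQLite build with FTS3 enabled; raises OperationalError otherwise.
conn.execute("CREATE VIRTUAL TABLE jobs USING fts3(username, job_position)")
conn.executemany(
    "INSERT INTO jobs (username, job_position) VALUES (?, ?)",
    [
        ("David", "System Administrator"),
        ("Alice", "Developer"),
        ("Carol", "Admin assistant"),
    ],
)


def search_job_prefix(conn, prefix):
    """Return usernames whose job_position contains a word starting with prefix."""
    rows = conn.execute(
        "SELECT username FROM jobs WHERE job_position MATCH ?", (prefix + "*",)
    )
    return sorted(name for (name,) in rows)
```

With the default tokenizer the match is case-insensitive for ASCII text, so a search for "admin" finds both "Administrator" and "Admin".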

Source: https://stackoverflow.com/questions/36089456/indexing-columns-in-a-csv-file

Tags

algorithm

performance

csv

indexing