What is the fastest way to read several lines of data from a large file


Question


My application needs to read a few thousand lines from a large CSV file (around 300 GB, with billions of lines), where each line contains several numbers. The data look like this:

1, 34, 56, 67, 678, 23462, ...
2, 3, 6, 8, 34, 5
23,547, 648, 34657 ...
...
...

I tried reading the file line by line with fgets in C, but it took really, really long; even wc -l in Linux takes quite a while just to count the lines.

I also tried writing all the data to an sqlite3 database based on the application's logic. However, the data structure there is different from the CSV file above: it now has 100 billion lines, with only two numbers per line. I then created two indices on top of them, which resulted in a 2.5 TB database, whereas it was 1 TB without indices. Since the indices are larger than the data, and a query has to read the whole 1.5 TB of indices, I think the database approach doesn't make sense, right?

So I would like to ask: what is the quickest way to read several lines from a large CSV file with billions of lines, in C or Python? And by the way, is there a formula or rule of thumb to estimate the time it takes to read a file in relation to the amount of RAM?

Environment: Linux, 200 GB RAM, C, Python


Answer 1:


Requirements

  • huge CSV file, several hundred GB in size
  • each line contains several numbers
  • the program must extract several thousand lines per run
  • the program is run several times on the same file, extracting different lines each time

Since the lines in the CSV file have variable lengths, you would have to read the entire file to find the required lines. Sequential reading of the entire file would still be very slow, even if you optimized the file reading as much as possible. A good indicator is the runtime of wc -l, as already mentioned by the OP in the question.

Instead, one should optimize at the algorithmic level: a one-time preprocessing of the data then allows fast access to specific lines without reading the whole file.

There are several possible ways, for example:

  1. Using a database with an index
  2. Programmatic creation of an index file (mapping line numbers to file offsets)
  3. Conversion of the CSV file into a binary file with a fixed record format

The OP's test shows that approach 1 led to 1.5 TB of indices. Approach 2, a small program that maps each line number to a file offset, is certainly also a possibility; a minimal sketch of it follows below. Finally, approach 3 allows the file offset for a line number to be calculated without a separate index file. This approach is especially useful if the maximum number of values per line is known. Otherwise, approaches 2 and 3 are very similar.
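
The original answer does not include code for approach 2, so here is a minimal, hypothetical sketch of it (the function names build_offset_index and read_csv_line are made up for illustration, not taken from the answer). It scans the CSV once and stores one 64-bit byte offset per line in a side file; a lookup then reads the offset for line n from that file and seeks directly into the CSV:

/* Approach 2 (hypothetical sketch): build a side file holding the byte
 * offset of every CSV line as a 64-bit value. Looking up line n then
 * means reading one offset from the index and seeking into the CSV. */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>

/* One-time pass: scan the CSV and write one uint64_t offset per line. */
void build_offset_index(FILE *csv, FILE *idx) {
    char *line = NULL;
    size_t cap = 0;
    uint64_t offset = 0;
    ssize_t len;
    while ((len = getline(&line, &cap, csv)) > 0) {
        fwrite(&offset, sizeof(offset), 1, idx);
        offset += (uint64_t) len;
    }
    free(line);
}

/* Fetch the text of line line_nr via the index file. */
void read_csv_line(FILE *csv, FILE *idx, uint64_t line_nr,
                   char **line, size_t *cap) {
    uint64_t offset;
    fseeko(idx, (off_t) (line_nr * sizeof(offset)), SEEK_SET);
    if (fread(&offset, sizeof(offset), 1, idx) != 1) {
        fprintf(stderr, "index read error for line %llu\n",
                (unsigned long long) line_nr);
        exit(EXIT_FAILURE);
    }
    fseeko(csv, (off_t) offset, SEEK_SET);
    getline(line, cap, csv);  /* *line now holds the requested CSV line */
}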

Approach 3 is explained in more detail below. There may be additional requirements that require the approach to be slightly modified, but the following should get things started.

A one-time preprocessing step is necessary. The textual CSV lines are converted into int arrays, which are stored in binary form with a fixed record size in a separate file. To read a particular line n afterwards, you simply calculate its file offset, e.g. with line_nr * (sizeof(int) * MAX_ELEMENTS_PER_LINE). For example, with 10 ints of 4 bytes each, every record is 40 bytes, so line 1,000,000 starts at byte offset 40,000,000. Then jump to this offset with fseeko(fp, offset, SEEK_SET) and read MAX_ELEMENTS_PER_LINE ints. This way you only read the data that you actually want to process.

This has not only the advantage that the program runs much faster, it also requires very little main memory.

Test case

A test file with 3,000,000,000 lines was created. Each line contains up to 10 random int values, separated by commas.

In this case, this resulted in a CSV file of about 342 GB.
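
The answer does not show how this test file was generated. A minimal, hypothetical generator could look like the following (the program name, the fixed seed, and the value ranges are assumptions for illustration, not taken from the original answer); redirect its output to numbers.csv:

/* Hypothetical test-data generator: prints NUM_LINES CSV lines, each with
 * 1 to MAX_ELEMENTS_PER_LINE random ints separated by ", ".
 * Assumed usage: ./gen_csv > numbers.csv */
#include <stdio.h>
#include <stdlib.h>

#define NUM_LINES 3000000000LL
#define MAX_ELEMENTS_PER_LINE 10

int main(void) {
    srand(42);  /* fixed seed so the file can be reproduced */
    for (long long i = 0; i < NUM_LINES; i++) {
        int count = 1 + rand() % MAX_ELEMENTS_PER_LINE;
        for (int j = 0; j < count; j++) {
            printf(j == 0 ? "%d" : ", %d", rand());
        }
        putchar('\n');
    }
    return 0;
}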

A quick test with

time wc -l numbers.csv 

gives

187.14s user 74.55s system 96% cpu 4:31.48 total

This means that it would take a total of at least 4.5 minutes if a sequential file read approach were used.

For the one-time preprocessing, a converter program reads each line and stores 10 binary ints per line. The converted file is called 'numbers_bin'. A quick test that accesses the data of 10,000 randomly selected lines:

time demo numbers_bin

gives

0.03s user 0.20s system 5% cpu 4.105 total

So instead of 4.5 minutes, it takes 4.1 seconds for this specific example data. That is more than a factor of 65 faster.

Source Code

This approach may sound more complicated than it actually is.

Let's start with the converter program. It reads the csv file and creates a binary fixed format file.

The interesting part takes place in the function pre_process: in a loop, a line is read with getline, the numbers are extracted with strtok and strtol and stored in an int array initialized to 0. Finally, this array is written to the output file with fwrite.

Errors during the conversion result in a message on stderr and the program is terminated.

convert.c

#include "data.h"
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <limits.h>

static void pre_process(FILE *in, FILE *out) {
    int *block = get_buffer();
    char *line = NULL;
    size_t line_capp = 0;

    while (getline(&line, &line_capp, in) > 0) {
        line[strcspn(line, "\n")] = '\0';
        memset(block, 0, sizeof(int) * MAX_ELEMENTS_PER_LINE);

        char *token;
        char *ptr = line;
        int i = 0;
        while ((token = strtok(ptr, ", ")) != NULL) {
            if (i >= MAX_ELEMENTS_PER_LINE) {
                fprintf(stderr, "too many elements in line");
                exit(EXIT_FAILURE);
            }
            char *end_ptr;
            errno = 0;
            long val = strtol(token, &end_ptr, 10);
            if (val > INT_MAX || val < INT_MIN || errno || *end_ptr != '\0' || end_ptr == token) {
                fprintf(stderr, "value error with '%s'\n", token);
                exit(EXIT_FAILURE);
            }
            ptr = NULL;
            block[i] = (int) val;
            i++;
        }
        fwrite(block, sizeof(int), MAX_ELEMENTS_PER_LINE, out);
    }
    free(block);
    free(line);
}


static void one_off_pre_processing(const char *csv_in, const char *bin_out) {
    FILE *in = get_file(csv_in, "rb");
    FILE *out = get_file(bin_out, "wb");
    pre_process(in, out);
    fclose(in);
    fclose(out);
}

int main(int argc, char *argv[]) {
    if (argc != 3) {
        fprintf(stderr, "usage: convert <in> <out>\n");
        exit(EXIT_FAILURE);
    }
    one_off_pre_processing(argv[1], argv[2]);
    return EXIT_SUCCESS;
}

data.h

A few auxiliary functions are used. They are more or less self-explanatory.

#ifndef DATA_H
#define DATA_H

#include <stdio.h>
#include <stdint.h>

#define NUM_LINES 3000000000LL
#define MAX_ELEMENTS_PER_LINE 10

void read_data(FILE *fp, uint64_t line_nr, int *block);
FILE *get_file(const char *const file_name, char *mode);
int *get_buffer(void);

#endif //DATA_H

data.c

#include "data.h"
#include <stdlib.h>

void read_data(FILE *fp, uint64_t line_nr, int *block) {
    /* fixed record size, so the offset of any line can be computed directly */
    off_t offset = (off_t) (line_nr * (sizeof(int) * MAX_ELEMENTS_PER_LINE));
    if (fseeko(fp, offset, SEEK_SET) != 0) {
        perror("fseeko");
        exit(EXIT_FAILURE);
    }
    if (fread(block, sizeof(int), MAX_ELEMENTS_PER_LINE, fp) != MAX_ELEMENTS_PER_LINE) {
        fprintf(stderr, "data read error for line %llu\n", (unsigned long long) line_nr);
        exit(EXIT_FAILURE);
    }
}

FILE *get_file(const char *const file_name, char *mode) {
    FILE *fp;
    if ((fp = fopen(file_name, mode)) == NULL) {
        perror(file_name);
        exit(EXIT_FAILURE);
    }
    return fp;
}

int *get_buffer(void) {
    int *block = malloc(sizeof(int) * MAX_ELEMENTS_PER_LINE);
    if(block == NULL) {
        perror("malloc failed");
        exit(EXIT_FAILURE);
    }
    return block;
}

demo.c

And finally a demo program that reads the data for 10,000 randomly determined lines.

The function request_lines determines 10,000 random line numbers, which are then sorted with qsort. The data for these lines is read. Some lines of the code are commented out; if you uncomment them, the read data is printed to the console.

#include "data.h"
#include <stdlib.h>
#include <assert.h>
#include <sys/stat.h>


static int comp(const void *lhs, const void *rhs) {
    uint64_t l = *((uint64_t *) lhs);
    uint64_t r = *((uint64_t *) rhs);
    if (l > r) return 1;
    if (l < r) return -1;
    return 0;
}

static uint64_t *request_lines(uint64_t num_lines, int num_request_lines) {
    assert(num_lines < UINT32_MAX);
    uint64_t *request_lines = malloc(sizeof(*request_lines) * num_request_lines);

    for (int i = 0; i < num_request_lines; i++) {
        /* arc4random_uniform is available on BSD/macOS and in glibc >= 2.36;
           any other uniform random number source works just as well here */
        request_lines[i] = arc4random_uniform(num_lines);
    }
    qsort(request_lines, num_request_lines, sizeof(*request_lines), comp);

    return request_lines;
}


#define REQUEST_LINES 10000

int main(int argc, char *argv[]) {

    if (argc != 2) {
        fprintf(stderr, "usage: demo <file>\n");
        exit(EXIT_FAILURE);
    }

    struct stat stat_buf;
    if (stat(argv[1], &stat_buf) == -1) {
        perror(argv[1]);
        exit(EXIT_FAILURE);
    }

    uint64_t num_lines = stat_buf.st_size / (MAX_ELEMENTS_PER_LINE * sizeof(int));

    FILE *bin = get_file(argv[1], "rb");
    int *block = get_buffer();

    uint64_t *requests = request_lines(num_lines, REQUEST_LINES);
    for (int i = 0; i < REQUEST_LINES; i++) {
        read_data(bin, requests[i], block);
        //do sth with the data, 
        //uncomment the following lines to output the data to the console
//        printf("%llu: ", requests[i]);
//        for (int x = 0; x < MAX_ELEMENTS_PER_LINE; x++) {
//            printf("'%d' ", block[x]);
//        }
//        printf("\n");
    }

    free(requests);
    free(block);
    fclose(bin);

    return EXIT_SUCCESS;
}

Summary

This approach provides much faster results than reading through the entire file sequentially (4 seconds instead of 4.5 minutes per run for the sample data). It also requires very little main memory.

The prerequisite is the one-time pre-processing of the data into a binary format. This conversion is quite time-consuming, but the data for certain rows can be read very quickly afterwards using a query program.



Source: https://stackoverflow.com/questions/62409573/what-is-the-fastest-way-to-read-several-lines-of-data-from-a-large-file
