How to improve Python C Extensions file line reading?

Submitted by 跟風遠走 on 2019-12-01 14:44:47

Reading line by line is going to cause unavoidable slowdowns here. Python's built-in text-oriented, read-only file objects are actually three layers:

  1. io.FileIO - Raw, unbuffered access to the file
  2. io.BufferedReader - Buffers the underlying FileIO
  3. io.TextIOWrapper - Wraps the BufferedReader to implement buffered decode to str

While iostream does perform buffering, it's only doing the job of io.BufferedReader, not io.TextIOWrapper. io.TextIOWrapper adds an extra layer of buffering, reading 8 KB chunks out of the BufferedReader and decoding them in bulk to str (when a chunk ends in an incomplete character, it saves off the remaining bytes to prepend to the next chunk), then yielding individual lines from the decoded chunk on request until it's exhausted (when a decoded chunk ends in a partial line, the remainder is prepended to the next decoded chunk).
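This layering is visible from Python itself; a quick sketch (writing a throwaway temp file just for illustration):

```python
import io
import os
import tempfile

# Create a small file, then open it in text mode and peel back the layers:
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "w") as f:
    f.write("hello\n")

with open(path, "r") as f:
    assert isinstance(f, io.TextIOWrapper)          # layer 3: buffered decode to str
    assert isinstance(f.buffer, io.BufferedReader)  # layer 2: buffers the raw file
    assert isinstance(f.buffer.raw, io.FileIO)      # layer 1: raw, unbuffered access

os.remove(path)
```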

By contrast, you're consuming a line at a time with std::getline, then decoding a line at a time with PyUnicode_DecodeUTF8, then yielding back to the caller; by the time the caller requests the next line, odds are at least some of the code associated with your tp_iternext implementation has left the CPU cache (or at least, left the fastest levels of the cache). A tight loop decoding 8 KB of UTF-8 to str is going to go extremely fast; repeatedly leaving the loop and decoding only 100-300 bytes at a time is going to be slower.

The solution is to do roughly what io.TextIOWrapper does: Read in chunks, not lines, and decode them in bulk (preserving incomplete UTF-8 encoded characters for the next chunk), then search for newlines to fish out substrings from the decoded buffer until it's exhausted (don't trim the buffer each time, just track indices). When no more complete lines remain in the decoded buffer, trim the stuff you've already yielded, and read, decode, and append a new chunk.
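As a rough pure-Python sketch of that scheme (the `iter_lines` name and chunk size are just placeholders; a C extension would do the bulk decode with PyUnicode_DecodeUTF8 instead of an incremental codec):

```python
import codecs

def iter_lines(path, chunk_size=8192):
    # Read chunks, decode them in bulk (the incremental decoder carries
    # incomplete UTF-8 sequences over to the next chunk), then slice lines
    # out by index instead of trimming the buffer after every line.
    decoder = codecs.getincrementaldecoder("utf-8")("replace")
    buf = ""     # decoded text not yet fully yielded
    start = 0    # index of the first unyielded character in buf
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            # Trim only what was already yielded, then append the new chunk:
            buf = buf[start:] + decoder.decode(chunk, final=not chunk)
            start = 0
            while True:
                nl = buf.find("\n", start)
                if nl == -1:
                    break
                yield buf[start:nl + 1]
                start = nl + 1
            if not chunk:
                if start < len(buf):
                    yield buf[start:]  # trailing line with no newline
                return
```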

There is some room for improvement on Python's underlying implementation of io.TextIOWrapper.readline (e.g. they have to construct a Python level int each time they read a chunk and call indirectly since they can't guarantee they're wrapping a BufferedReader), but it's a solid basis for reimplementing your own scheme.

Update: On checking your full code (which is wildly different from what you've posted), you've got other issues. Your tp_iternext just repeatedly yields None, requiring you to call your object to retrieve the string. That's... unfortunate. That's more than doubling the Python interpreter overhead per item (tp_iternext is cheap to call, being quite specialized; tp_call is not nearly so cheap, going through convoluted general purpose code paths, requiring the interpreter to pass an empty tuple of args you never use, etc.; side-note, PyFastFile_tp_call should be accepting a third argument for the kwds, which you ignore, but must still be accepted; casting to ternaryfunc is silencing the error, but this will break on some platforms).

Final note (not really relevant to performance for all but the smallest files): The contract for tp_iternext does not require you to set an exception when the iterator is exhausted, just that you return NULL. You can remove your call to PyErr_SetNone( PyExc_StopIteration ); as long as no other exception is set, returning NULL alone indicates the end of iteration, so you can save some work by not setting it at all.
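For illustration, the same contrast expressed at the Python level with two hypothetical classes: `Awkward` mimics the current design (the iterator yields None and the caller must then call the object for the value), while `Proper` mimics a tp_iternext that returns the line directly and signals exhaustion the normal way:

```python
class Awkward:
    # Mimics the current extension: iteration only advances the cursor,
    # and the value must be fetched with a second, more expensive call.
    def __init__(self, lines):
        self._lines = iter(lines)
        self._current = None

    def __iter__(self):
        return self

    def __next__(self):
        self._current = next(self._lines)  # StopIteration propagates when done
        return None                        # caller must call the object for the value

    def __call__(self):
        return self._current


class Proper:
    # Mimics a tp_iternext that returns the line object directly.
    def __init__(self, lines):
        self._lines = iter(lines)

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._lines)


awkward = Awkward(["a\n", "b\n"])
assert [awkward() for _ in awkward] == ["a\n", "b\n"]  # two operations per line
assert list(Proper(["a\n", "b\n"])) == ["a\n", "b\n"]  # one operation per line
```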

These results are only for the Linux or Cygwin compilers. If you are using the Visual Studio compiler, the results for std::getline and std::ifstream.getline are 100% or more slower than the Python builtin for line in file iterator.

You will see linecache.push_back( emtpycacheobject ) used throughout the code because this way I am benchmarking only the time used to read the lines, excluding the time Python would spend converting the input string into a Python Unicode object. Therefore, I commented out all lines which call PyUnicode_DecodeUTF8.

These are the global definitions used in the examples:

const char* filepath = "./myfile.log";
size_t linecachesize = 131072;

PyObject* emtpycacheobject = PyUnicode_DecodeUTF8( "", 0, "replace" );

I managed to optimize my POSIX C getline usage (by caching the total buffer size instead of always passing 0), and now POSIX C getline beats the Python builtin for line in file by 5%. I guess if I removed all the Python and C++ code around the POSIX C getline, it would gain some more performance:

char* readline = (char*) malloc( linecachesize );
FILE* cfilestream = fopen( filepath, "r" );

if( cfilestream == NULL ) {
    std::cerr << "ERROR: Failed to open the file '" << filepath << "'!" << std::endl;
}

if( readline == NULL ) {
    std::cerr << "ERROR: Failed to allocate internal line buffer!" << std::endl;
}

bool getline() {
    ssize_t charsread;
    if( ( charsread = getline( &readline, &linecachesize, cfilestream ) ) != -1 ) {
        // PyObject* pythonobject = PyUnicode_DecodeUTF8( readline, charsread, "replace" );
        // linecache.push_back( pythonobject );
        // return true;

        Py_XINCREF( emtpycacheobject );
        linecache.push_back( emtpycacheobject );
        return true;
    }
    return false;
}

if( readline ) {
    free( readline );
    readline = NULL;
}

if( cfilestream != NULL) {
    fclose( cfilestream );
    cfilestream = NULL;
}

I also managed to improve the C++ performance to only 20% slower than the Python builtin for line in file by using std::ifstream.getline():

char* readline = (char*) malloc( linecachesize );
std::ifstream fileobj;
fileobj.open( filepath );

if( fileobj.fail() ) {
    std::cerr << "ERROR: Failed to open the file '" << filepath << "'!" << std::endl;
}

if( readline == NULL ) {
    std::cerr << "ERROR: Failed to allocate internal line buffer!" << std::endl;
}

bool getline() {

    // Check the stream state after the read; testing eof() beforehand misses failures
    if( fileobj.getline( readline, linecachesize ) ) {
        // PyObject* pyobj = PyUnicode_DecodeUTF8( readline, fileobj.gcount(), "replace" );
        // linecache.push_back( pyobj );
        // return true;

        Py_XINCREF( emtpycacheobject );
        linecache.push_back( emtpycacheobject );
        return true;
    }
    return false;
}

if( readline ) {
    free( readline );
    readline = NULL;
}

if( fileobj.is_open() ) {
    fileobj.close();
}

Finally, I also managed to get within 10% of the Python builtin for line in file with std::getline, by caching the std::string it uses as input:

std::string line;
std::ifstream fileobj;
fileobj.open( filepath );

if( fileobj.fail() ) {
    std::cerr << "ERROR: Failed to open the file '" << filepath << "'!" << std::endl;
}

try {
    line.reserve( linecachesize );
}
catch( const std::exception& error ) {
    std::cerr << "ERROR: Failed to allocate internal line buffer!" << std::endl;
}

bool getline() {

    if( std::getline( fileobj, line ) ) {
        // PyObject* pyobj = PyUnicode_DecodeUTF8( line.c_str(), line.size(), "replace" );
        // linecache.push_back( pyobj );
        // return true;

        Py_XINCREF( emtpycacheobject );
        linecache.push_back( emtpycacheobject );
        return true;
    }
    return false;
}

if( fileobj.is_open() ) {
    fileobj.close();
}

After removing all the boilerplate from C++, the performance for POSIX C getline was 10% slower than the Python builtin for line in file:

const char* filepath = "./myfile.log";
size_t linecachesize = 131072;

PyObject* emtpycacheobject = PyUnicode_DecodeUTF8( "", 0, "replace" );
char* readline = (char*) malloc( linecachesize );
FILE* cfilestream = fopen( filepath, "r" );

static PyObject* PyFastFile_tp_call(PyFastFile* self, PyObject* args, PyObject *kwargs) {
    Py_XINCREF( emtpycacheobject );
    return emtpycacheobject;
}

static PyObject* PyFastFile_iternext(PyFastFile* self, PyObject* args) {
    ssize_t charsread;
    if( ( charsread = getline( &readline, &linecachesize, cfilestream ) ) == -1 ) {
        return NULL;
    }
    Py_XINCREF( emtpycacheobject );
    return emtpycacheobject;
}

static PyObject* PyFastFile_getlines(PyFastFile* self, PyObject* args) {
    Py_XINCREF( emtpycacheobject );
    return emtpycacheobject;
}

static PyObject* PyFastFile_resetlines(PyFastFile* self, PyObject* args) {
    Py_INCREF( Py_None );
    return Py_None;
}

static PyObject* PyFastFile_close(PyFastFile* self, PyObject* args) {
    Py_INCREF( Py_None );
    return Py_None;
}

Values from the last test runs, where POSIX C getline was 10% slower than Python:

$ /bin/python3.6 fastfileperformance.py fastfile_time 1.15%, python_time 0.87%
Python   timedifference 0:00:00.695292
FastFile timedifference 0:00:00.796305

$ /bin/python3.6 fastfileperformance.py fastfile_time 1.13%, python_time 0.88%
Python   timedifference 0:00:00.708298
FastFile timedifference 0:00:00.803594

$ /bin/python3.6 fastfileperformance.py fastfile_time 1.14%, python_time 0.88%
Python   timedifference 0:00:00.699614
FastFile timedifference 0:00:00.795259

$ /bin/python3.6 fastfileperformance.py fastfile_time 1.15%, python_time 0.87%
Python   timedifference 0:00:00.699585
FastFile timedifference 0:00:00.802173

$ /bin/python3.6 fastfileperformance.py fastfile_time 1.15%, python_time 0.87%
Python   timedifference 0:00:00.703085
FastFile timedifference 0:00:00.807528

$ /bin/python3.6 fastfileperformance.py fastfile_time 1.17%, python_time 0.85%
Python   timedifference 0:00:00.677507
FastFile timedifference 0:00:00.794591

$ /bin/python3.6 fastfileperformance.py fastfile_time 1.20%, python_time 0.83%
Python   timedifference 0:00:00.670492
FastFile timedifference 0:00:00.804689