Optimizing Python Code Using SQLite3 + Mutagen

怎甘沉沦 提交于 2019-12-02 00:52:09

问题


I'm in the process of improving an open-source music database, which reads songs in from my collection and stores them to an SQLite database. In turn, I'm able to leverage the database to find duplicates, run queries on my collection, and (if I so desired) find duplicate songs in the collection. To read the metadata from the music files, I'm leveraging the Mutagen library, and to store the metadata, I'm using SQLite3.

I wanted to test the code I had authored on a sizable collection, so I contacted fellow students and family, and came across a total test size of about 90,000. This also consists of mixed information - songs are of either .mp3, .ogg, or .flac format.

My main issue is with speed - the code that I have works, but it is unsuitably slow. In its current state, it runs through the test size in about 35:45. My main question: what can I do to improve the performance of this block of code? I think that it is related directly to something within either Mutagen directly or SQLite3, although I'm open to suggestions on what would be an ideal approach to improve efficiency from here.


I had gone through two iterations of improving this critical part of code. The first improvement led to a reduced runtime of 21:30, but that's still horrible. I decided to refactor the code to reduce the number of function calls made, and an attempt to improve performance. The result, however, is a regression of performance, but a huge reduction in the number of function calls made - the second batch runs for close to 51:51, which is simply unacceptable.

What follows in terms of code is for both the "improved" runtime and refactored set. Also attached are separate profiles for each piece of code.

Block 1: "Improved" Runtime

def walk(self, d):
    '''Walk down the file structure iteratively, gathering file names to be read in.'''
    d = os.path.abspath(d)
    dirpath = os.walk(d)
    for folder in dirpath:
        for f in folder[2]: # for each file in the folder...
            supported = 'mp3', 'ogg', 'flac'
            if f.split('.')[-1] in supported:
                try:
                    self.parse(os.path.join(folder[0], f))
                    if self.filecount == 2000 or self.leftover:
                        self.filecount = 0
                        try:
                            self.db.execute_batch_insert_statement(u"INSERT INTO song VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)", self.buf)
                        except Exception, e:
                            print e.__unicode__()
                        finally:
                            del self.buf
                            self.buf = [] # wipe the buffers clean so we can repeat a batch parse again.
                except Exception, e:
                    print e.__unicode__()
    try:
        self.db.execute_batch_insert_statement(u"INSERT INTO song VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)", self.buf)
    except Exception, e:
        print e.__unicode__()
    finally:
        del self.buf
        self.buf = [] # wipe the buffers clean so we can repeat a batch parse again.

def parse(self, filename):
    '''Process and parse the music file to extract desired information.

    It may be the case that, in the future, we require more information from a song than is provided at this time.
    Examine all tags that can be retrieved from a mutagen.File object, and adjust the database's schema accordingly.'''

    if ".ogg" in filename:
        song = OggVorbis(filename)
    elif ".mp3" in filename:
        song = MP3(filename)
    elif ".flac" in filename:
        song = FLAC(filename)
    else:
        raise InvalidSongException(u"Song is not supported by K'atun at this time.")

    filename = u'filename'

    #song = mutagen.File(filename, easy=True)
    artist, title, genre, track, album, bitrate, year, month = '', '', '', '', '', '', '', ''
    try:
        artist = song['artist'][0]
        title = song['title'][0]
    except Exception:
        raise InvalidSongException(u"Cannot read " + filename + ": missing critical song information.")
    if 'genre' in song:
        genre = song['genre'][0]
    else:
        genre = u'Unknown'
    if 'tracknumber' in song:
        track = song['tracknumber'][0]
    else:
        track = 0
    if 'album' in song:
        album = song['album'][0]
    else:
        album = u'Unknown'
    if 'date' in song:
        year = song['date'][0]
    else:
        year = 'Unknown'
    try:
        bitrate = int(song.info.bitrate)
    except AttributeError: # Likely due to us messing with FLAC
        bitrate = 999999 # Set to a special flag value, to indicate that this is a lossless file.
    self.buf.append((filename, artist, filename.split('.')[-1], title, genre, track, album, bitrate, year, time.time()))
    self.filecount += 1


Sat Dec 24 21:24:23 2011    modified.dat

     70626027 function calls (70576436 primitive calls) in 1290.127 CPU seconds

Ordered by: cumulative time
List reduced from 666 to 28 due to restriction <28>

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    1    0.033    0.033 1290.127 1290.127 parser.py:6(<module>)
    1    0.000    0.000 1290.090 1290.090 parser.py:90(main)
    1    0.000    0.000 1286.493 1286.493 parser.py:24(__init__)
    1    1.826    1.826 1286.335 1286.335 parser.py:35(walk)
90744    2.376    0.000 1264.788    0.014 parser.py:55(parse)
90744   11.840    0.000 1250.401    0.014 lib/mutagen/__init__.py:158(File)
376019  613.881    0.002  613.881    0.002 {method 'seek' of 'file' objects}
90744    1.231    0.000  580.143    0.006 /usr/lib/pymodules/python2.7/mutagen/apev2.py:458(score)
671848  530.346    0.001  530.346    0.001 {method 'read' of 'file' objects}
90742    0.846    0.000  242.337    0.003 /usr/lib/pymodules/python2.7/mutagen/__init__.py:68(__init__)
63944    2.471    0.000  177.050    0.003 /usr/lib/pymodules/python2.7/mutagen/id3.py:1973(load)
63944    0.526    0.000  119.326    0.002 /usr/lib/pymodules/python2.7/mutagen/easyid3.py:161(__init__)
63944    4.649    0.000  118.077    0.002 /usr/lib/pymodules/python2.7/mutagen/id3.py:89(load)
26782    1.073    0.000   64.435    0.002 /usr/lib/pymodules/python2.7/mutagen/ogg.py:434(load)
127531    0.464    0.000   59.314    0.000 /usr/lib/pymodules/python2.7/mutagen/id3.py:76(__fullread)
63944    1.078    0.000   54.060    0.001 /usr/lib/pymodules/python2.7/mutagen/mp3.py:68(__init__)
26782    0.638    0.000   53.613    0.002 /usr/lib/pymodules/python2.7/mutagen/ogg.py:379(find_last)
66487    3.167    0.000   50.136    0.001 /usr/lib/pymodules/python2.7/mutagen/mp3.py:106(__try)
855079    6.415    0.000   33.237    0.000 /usr/lib/pymodules/python2.7/mutagen/id3.py:279(__read_frames)
816987    0.904    0.000   24.491    0.000 /usr/lib/pymodules/python2.7/mutagen/id3.py:321(__load_framedata)
816987    2.805    0.000   23.587    0.000 /usr/lib/pymodules/python2.7/mutagen/id3.py:1023(fromData)
60803/11257    0.370    0.000   19.036    0.002 /usr/lib/python2.7/os.py:209(walk)
11256   14.651    0.001   14.651    0.001 {posix.listdir}
816973    3.265    0.000   13.140    0.000 /usr/lib/pymodules/python2.7/mutagen/id3.py:996(_readData)
879103    4.936    0.000   11.473    0.000 /usr/lib/pymodules/python2.7/mutagen/id3.py:964(__init__)
872462    0.967    0.000   11.336    0.000 /usr/lib/pymodules/python2.7/mutagen/__init__.py:78(__getitem__)
63944    1.969    0.000   10.871    0.000 /usr/lib/pymodules/python2.7/mutagen/id3.py:443(update_to_v24)
619380    1.396    0.000    8.521    0.000 /usr/lib/pymodules/python2.7/mutagen/easyid3.py:175(__getitem__)

Block 2: Refactored Code

def walk(self, d):
    '''Walk down the file structure iteratively, gathering file names to be read in.'''
    d = os.path.abspath(d)
    dirpath = os.walk(d)
    parsecount = 0
    start = time.time()
    for folder in dirpath:
        for f in folder[2]: # for each file in the folder...
            filetype = f.split('.')[-1].lower()
            if filetype == 'mp3':
                try:
                    self.read_mp3(os.path.join(folder[0], f).decode('utf_8'))
                except Exception, e:
                    print e.__unicode__()
            elif filetype == 'ogg':
                try:
                    self.read_vorbis(os.path.join(folder[0], f).decode('utf_8'))
                except Exception, e:
                    print e.__unicode__()
            elif filetype == 'flac':
                try:
                    self.read_flac(os.path.join(folder[0], f).decode('utf_8'))
                except Exception, e:
                    print e.__unicode__()
            else:
                continue
            if self.filecount == 2000 or self.leftover:
                self.filecount = 0
                print "Time differential: %1.4f s" % (time.time() - start)
                self.batch_commit()
    try:
        print "Wrapping up"
        self.batch_commit()
    except Exception, e:
        print e.__unicode__()
    finally:
        print "Elapsed time: " + str(time.time()-start)

def batch_commit(self):
    '''Insert new values into the database in large quantities.'''
    self.db.execute_batch_insert_statement(u"INSERT INTO song VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)", self.buf)
    self.buf = []

def read_mp3(self, filename):
    '''Read and extract an MP3 file's tags.  This makes use of the ID3 standard, not the easy ID3 tag system.'''
    artist, title, genre, track, album, bitrate, year = '', '', '', '', '', 0, ''

    song = MP3(filename)
    keys = song.keys()
    try:
        artist = song['TPE1'].__unicode__()
        title = song['TIT2'].__unicode__()
    except KeyError, e:
        raise InvalidSongException(u"Cannot read " + filename + ": missing critical song information.")
    genre = song['TCON'].__unicode__() if "TCON" in keys else u'Unknown'
    track = song['TRCK'].__unicode__() if "TRCK" in keys else u'0'
    album = song['TALB'].__unicode__() if "TALB" in keys else u'Unknown'
    bitrate = int(song.info.bitrate)
    year = song['TDRC'].__unicode__() if "TDRC" in keys else u'Unknown'
    self.buf.append((filename, artist, filename.split('.')[-1], title, genre, track, album, bitrate, year, time.time()))
    self.filecount += 1

def read_vorbis(self, filename):
    '''Read and extract an Ogg Vorbis file's tags.'''
    song = OggVorbis(filename)
    artist, title, genre, track, album, bitrate, year = '', '', '', '', '', 0, ''
    try:
        artist = song['artist'][0]
        title = song['title'][0]
    except KeyError, e:
        raise InvalidSongException(u"Cannot read " + filename + ": missing critical song information.")
    genre = song['genre'][0] if genre else u'Unknown'
    track = song['tracknumber'][0] if 'tracknumber' in song else u'0'
    album = song['album'][0] if 'album' in song else u'Unknown'
    bitrate = int(song.info.bitrate)
    year = song['date'][0] if 'date' in song else 'Unknown'
    self.buf.append((filename, artist, filename.split('.')[-1], title, genre, track, album, bitrate, year, time.time()))
    self.filecount += 1


def read_flac(self, filename):
    '''Read and extract a FLAC file's tags.'''
    song = FLAC(filename)
    artist, title, genre, track, album, bitrate, year = '', '', '', '', '', 0, ''
    try:
        artist = song['artist'][0]
        title = song['title'][0]
    except KeyError, e:
        raise InvalidSongException(u"Cannot read " + filename + ": missing critical song information.")
    genre = song['genre'][0] if genre else u'Unknown'
    track = song['tracknumber'][0] if 'tracknumber' in song else u'0'
    album = song['album'][0] if 'album' in song else u'Unknown'
    bitrate = 999999 # Special flag for K'atun; will know that this is a lossless file
    year = song['date'][0] if 'date' in song else 'Unknown'
    self.buf.append((filename, artist, filename.split('.')[-1], title, genre, track, album, bitrate, year, time.time()))
    self.filecount += 1


Mon Dec 26 03:22:34 2011    refactored.dat

59939763 function calls (59890172 primitive calls) in 3111.490 CPU seconds

Ordered by: cumulative time
List reduced from 559 to 28 due to restriction <28>

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
1    0.001    0.001 3111.490 3111.490 parser.py:6(<module>)
1    0.000    0.000 3111.477 3111.477 parser.py:138(main)
1    0.000    0.000 3108.242 3108.242 parser.py:27(__init__)
1    1.760    1.760 3108.062 3108.062 parser.py:40(walk)
46    0.103    0.002 2220.618   48.274 parser.py:78(batch_commit)
46    0.002    0.000 2220.515   48.272 db_backend.py:127(execute_batch_insert_statement)
46 2184.900   47.498 2184.900   47.498 {method 'executemany' of 'sqlite3.Connection' objects}
90747    0.515    0.000  845.343    0.009 /usr/lib/pymodules/python2.7/mutagen/__init__.py:68(__init__)
426651  640.459    0.002  640.459    0.002 {method 'read' of 'file' objects}
63945    1.847    0.000  582.267    0.009 parser.py:83(read_mp3)
63945    2.372    0.000  577.245    0.009 /usr/lib/pymodules/python2.7/mutagen/id3.py:1973(load)
63945    0.307    0.000  514.927    0.008 /usr/lib/pymodules/python2.7/mutagen/id3.py:72(__init__)
63945    0.256    0.000  514.620    0.008 /usr/lib/pymodules/python2.7/mutagen/_util.py:103(__init__)
63945    0.225    0.000  514.363    0.008 /usr/lib/pymodules/python2.7/mutagen/__init__.py:35(__init__)
63945    4.188    0.000  514.139    0.008 /usr/lib/pymodules/python2.7/mutagen/id3.py:89(load)
127533    0.802    0.000  455.713    0.004 /usr/lib/pymodules/python2.7/mutagen/id3.py:76(__fullread)
63945    1.029    0.000  432.574    0.007 /usr/lib/pymodules/python2.7/mutagen/id3.py:202(__load_header)
26786    0.504    0.000  270.216    0.010 parser.py:102(read_vorbis)
26786    1.095    0.000  267.578    0.010 /usr/lib/pymodules/python2.7/mutagen/ogg.py:434(load)
26782    0.627    0.000  143.492    0.005 /usr/lib/pymodules/python2.7/mutagen/ogg.py:379(find_last)
221337  121.448    0.001  121.448    0.001 {method 'seek' of 'file' objects}
97603    1.797    0.000  118.799    0.001 /usr/lib/pymodules/python2.7/mutagen/ogg.py:66(__init__)
26786    0.342    0.000  114.656    0.004 lib/mutagen/oggvorbis.py:40(__init__)
63945    0.646    0.000   58.809    0.001 lib/mutagen/mp3.py:68(__init__)
66480    3.377    0.000   57.489    0.001 lib/mutagen/mp3.py:106(__try)
47   35.609    0.758   35.609    0.758 {method 'commit' of 'sqlite3.Connection' objects}
855108    6.184    0.000   32.181    0.000 /usr/lib/pymodules/python2.7/mutagen/id3.py:279(__read_frames)
60803/11257    0.385    0.000   31.885    0.003 /usr/lib/python2.7/os.py:209(walk)

回答1:


If you feel up to it, you could dive into threading :) - It is quite easy in python.

Create two Queue objects, one for IN and one for OUT and in between multiple threads scan the files. (Worker threads, and Queue's are threadsafe)

It should be only ~80loc more code, and you probably can keep the current functions, just wrap em in appropriate classes. (to make "Thread" take it)

But 20min for 90k songs does not seem to be completely out of line. It requires a lot of random access on the disk, and seeking is slow (10ms). so 90,000files, one seek per file is 15min already.



来源:https://stackoverflow.com/questions/8640701/optimizing-python-code-using-sqlite3-mutagen

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!