How to parse .ttl files with RDFLib?

一个人的身影 2021-02-07 22:14

I have a file in .ttl form. It has 4 attributes/columns containing quadruples of the following form:

  1. (id, student_name, student_address, student
3 answers
  • 2021-02-07 22:22

    You can do as Snakes and Coffee suggests, but wrap that function (or its code) in a loop with a yield statement. This creates a generator, which produces each line's dict on the fly as you iterate over it. For example, assuming you want to write the dicts to a CSV file using Snakes' parse_to_dict:

    import re
    import csv
    
    # Python 3: open in text mode with newline="" (Python 2 used "wb")
    writer = csv.DictWriter(open(outfile, "w", newline=""), fieldnames=["id", "name", "address", "phone"])
    # or whatever
    

    You can create a generator as a function or with an inline comprehension:

    def dict_generator(lines): 
        for line in lines: 
            yield parse_to_dict(line)
    

    --or--

    dict_generator = (parse_to_dict(line) for line in lines)
    

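    These are pretty much equivalent, except that the function form must be called first, as in dict_generator(lines), to get the generator object. You can then fetch one parsed dict at a time by calling next() on that object (the Python 2 spelling was .next()), or simply loop over it. Lines are parsed one at a time, so there is no additional RAM thrashing involved.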

    If you have 16 gigs of raw data, you might consider making a generator to pull the lines in, too. They're really useful.
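
    The chained-generator idea can be sketched as follows. This is only an illustration under stated assumptions: parse_to_dict here is a hypothetical stand-in (the real one is in Snakes' answer and is not shown on this page), the comma-separated input is made up, and io.StringIO objects stand in for the real input and output files.

```python
import csv
import io


def parse_to_dict(line):
    # Hypothetical stand-in for Snakes' parse_to_dict: assumes one
    # comma-separated record per line, which is NOT the real .ttl layout.
    keys = ["id", "name", "address", "phone"]
    return dict(zip(keys, (field.strip() for field in line.split(","))))


def read_lines(handle):
    # Yield one stripped, non-blank line at a time instead of loading the file.
    for line in handle:
        line = line.strip()
        if line:
            yield line


def write_csv(source, sink):
    # Chain the generators: only one record is held in memory at a time.
    writer = csv.DictWriter(sink, fieldnames=["id", "name", "address", "phone"])
    writer.writeheader()
    count = 0
    for row in (parse_to_dict(line) for line in read_lines(source)):
        writer.writerow(row)
        count += 1
    return count


# In-memory stand-ins for the real input and output files.
source = io.StringIO("id1, Alice, USA, 12345\n\nid2, Jane, France, 78900\n")
sink = io.StringIO()
rows_written = write_csv(source, sink)
```

    With real files, source and sink would be open file handles, and the output handle should be opened with newline="" as the csv module expects.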

    More info on generators, from SO and the docs: "What can you use Python generator functions for?" and http://wiki.python.org/moin/Generators

  • 2021-02-07 22:24

    There does not seem to be a library available that parses Turtle (Terse RDF Triple Language).

    Since you already know the grammar, your best bet is to use PyParsing to define a grammar first and then parse the file.

    I would also suggest adapting the following EBNF implementation to your needs

  • 2021-02-07 22:29

    Turtle is a subset of Notation 3 syntax, so rdflib should be able to parse it using format='n3' (rdflib also accepts format='turtle'). Check whether rdflib preserves comments, since the ids in your sample are specified in comments (#...). If it does not, and the input format is as simple as your example shows, you could parse it manually:

    import re
    from collections import namedtuple
    from itertools import takewhile
    
    Entry = namedtuple('Entry', 'id name address phone')
    
    def get_entries(path):
        with open(path) as file:
            # an entry starts with `#@` line and ends with a blank line
            for line in file:
                if line.startswith('#@'):
                    buf = [line]
                    buf.extend(takewhile(str.strip, file)) # read until blank line
                    yield Entry(*re.findall(r'<([^>]+)>', ''.join(buf)))
    
    print("\n".join(map(str, get_entries('example.ttl'))))
    

    Output:

    Entry(id='id1', name='Alice', address='USA', phone='12345')
    Entry(id='id1', name='Jane', address='France', phone='78900')
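
    One thing to watch for in the manual approach: Entry(*...) raises a TypeError if a block does not contain exactly four <...> tokens. A minimal sketch of a more defensive variant follows. The SAMPLE text is a hypothetical layout (the real file is not shown in full here), and get_entries_checked is a name introduced only for this illustration.

```python
import io
import re
from collections import namedtuple
from itertools import takewhile

Entry = namedtuple("Entry", "id name address phone")

# Hypothetical sample; the real file layout is only assumed here.
SAMPLE = """\
#@ <id1>
<Alice> <USA> <12345> .

#@ <id2>
<Jane> <France> .

#@ <id3>
<Bob> <UK> <55555> .
"""


def get_entries_checked(handle):
    # Same block-reading idea, but malformed blocks are skipped and reported.
    for line in handle:
        if line.startswith("#@"):
            # takewhile pulls from the same iterator, so the outer loop
            # resumes after the blank line that ends this block.
            block = [line, *takewhile(str.strip, handle)]
            fields = re.findall(r"<([^>]+)>", "".join(block))
            if len(fields) == len(Entry._fields):
                yield Entry(*fields)
            else:
                print("skipping malformed block:", fields)


entries = list(get_entries_checked(io.StringIO(SAMPLE)))
```

    Here the second block has only three tokens, so it is reported and skipped rather than aborting the whole run.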
    

    To save entries to a db:

    import sqlite3
    
    with sqlite3.connect('example.db') as conn:
        conn.execute('''CREATE TABLE IF NOT EXISTS entries
                 (id text, name text, address text, phone text)''')
        conn.executemany('INSERT INTO entries VALUES (?,?,?,?)',
                         get_entries('example.ttl'))
    

    To group by id if you need some postprocessing in Python:

    import sqlite3
    from itertools import groupby
    from operator import itemgetter
    
    with sqlite3.connect('example.db') as c:
        rows = c.execute('SELECT * FROM entries ORDER BY id LIMIT ?', (10,))
        for id, group in groupby(rows, key=itemgetter(0)):
            print("%s:\n\t%s" % (id, "\n\t".join(map(str, group))))
    

    Output:

    id1:
        ('id1', 'Alice', 'USA', '12345')
        ('id1', 'Jane', 'France', '78900')
    