Python: make a list generator JSON serializable

旧巷少年郎 2020-12-29 05:23

How can I concatenate a list of JSON files into one huge JSON array? I have 5000 files and 550 000 list items.

My first try was to use jq, but it looks like jq -s is not opt

4 answers
  • 2020-12-29 05:54

    A complete, simple and readable solution that can serialize a generator built from a normal or an empty iterable, and that works with both .encode() and .iterencode(). Tests are included. Tested with Python 2.7, 3.0, 3.3 and 3.6.

    import itertools
    
    class SerializableGenerator(list):
        """Generator that is serializable by JSON
    
        It is useful for serializing huge data by JSON
        >>> json.dumps(SerializableGenerator(iter([1, 2])))
        '[1, 2]'
        >>> json.dumps(SerializableGenerator(iter([])))
        '[]'
    
        It can be used in a generator of json chunks used e.g. for a stream
        >>> iter_json = json.JSONEncoder().iterencode(SerializableGenerator(iter([1])))
        >>> tuple(iter_json)
        ('[1', ']')
        # >>> for chunk in iter_json:
        # ...     stream.write(chunk)
        # >>> SerializableGenerator((x for x in range(3)))
        # [<generator object <genexpr> at 0x7f858b5180f8>]
        """
    
        def __init__(self, iterable):
            tmp_body = iter(iterable)
            try:
                self._head = iter([next(tmp_body)])
                self.append(tmp_body)
            except StopIteration:
                self._head = []
    
        def __iter__(self):
            return itertools.chain(self._head, *self[:1])
    
    
    # -- test --
    
    import unittest
    import json
    
    
    class Test(unittest.TestCase):
    
        def combined_dump_assert(self, iterable, expect):
            self.assertEqual(json.dumps(SerializableGenerator(iter(iterable))), expect)
    
        def combined_iterencode_assert(self, iterable, expect):
            encoder = json.JSONEncoder().iterencode
            self.assertEqual(tuple(encoder(SerializableGenerator(iter(iterable)))), expect)
    
        def test_dump_data(self):
            self.combined_dump_assert(iter([1, "a"]), '[1, "a"]')
    
        def test_dump_empty(self):
            self.combined_dump_assert(iter([]), '[]')
    
        def test_iterencode_data(self):
            self.combined_iterencode_assert(iter([1, "a"]), ('[1', ', "a"', ']'))
    
        def test_iterencode_empty(self):
            self.combined_iterencode_assert(iter([]), ('[]',))
    
        def test_that_all_data_are_consumed(self):
            gen = SerializableGenerator(iter([1, 2]))
            list(gen)
            self.assertEqual(list(gen), [])
    

    This builds on the solutions by Vadim Pushtaev (incomplete), user1158559 (unnecessarily complicated) and Claude (in another question, also complicated).

    Useful simplifications are:

    • It is not necessary to evaluate the first item lazily. It can be done in __init__, because we can expect SerializableGenerator to be created immediately before json.dumps is called. (Unlike user1158559's solution.)
    • It is not necessary to override many methods with NotImplementedError, because that still does not cover every method, e.g. __repr__. It is better to also store the generator in the list, so that it gives meaningful results like [<generator object ...>]. (Unlike Claude's solution.) The default methods __len__ and __bool__ now work correctly to tell an empty object from a non-empty one.

    An advantage of this solution is that a standard JSON serializer can be used without extra parameters. If nested generators should be supported, or if wrapping in SerializableGenerator(iterator) is undesirable, then I recommend the IterEncoder answer.
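
    Applied to the original question, the class could be combined with a generator over the input files roughly as follows. This is only a sketch: iter_items and concat_json_files are hypothetical helper names, and it assumes that every input file contains exactly one JSON list. json.dump writes the output chunk by chunk, so only one input file needs to be in memory at a time.

```python
import itertools
import json


class SerializableGenerator(list):
    """Same idea as the class above, repeated so the sketch is self-contained."""

    def __init__(self, iterable):
        body = iter(iterable)
        try:
            # Peek at the first item so that emptiness is known up front
            self._head = iter([next(body)])
            self.append(body)  # non-empty list => the encoder will iterate
        except StopIteration:
            self._head = []

    def __iter__(self):
        return itertools.chain(self._head, *self[:1])


def iter_items(paths):
    """Yield every item of every JSON list file, loading one file at a time."""
    for path in paths:
        with open(path) as f:
            for item in json.load(f):
                yield item


def concat_json_files(paths, out_path):
    """Write the items of all input files into one JSON array in out_path."""
    with open(out_path, "w") as out:
        json.dump(SerializableGenerator(iter_items(paths)), out)
```

    A call could look like concat_json_files(sorted(glob.glob("parts/*.json")), "all.json"), where the directory and file names are placeholders.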

  • 2020-12-29 05:56

    You should derive from list and override the __iter__ method.

    import json
    
    def gen():
        yield 20
        yield 30
        yield 40
    
    class StreamArray(list):
        def __iter__(self):
            return gen()
    
        # according to the comment below: a non-zero length keeps the
        # encoder from treating the (actually empty) list as empty
        def __len__(self):
            return 1
    
    a = [1,2,3]
    b = StreamArray()
    
    print(json.dumps([1,a,b]))
    

    Result is [1, [1, 2, 3], [20, 30, 40]].
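
    Why the __len__ override matters can be seen with the chunked encoder, which is the code path json.dump uses when writing to a file in CPython: it checks whether the list is empty before it iterates. A small sketch, where NoLen, WithLen and encode_streaming are names made up for this illustration:

```python
import json


def gen():
    yield 20
    yield 30
    yield 40


class NoLen(list):
    """Overrides only __iter__; len() stays 0 because the list itself is empty."""

    def __iter__(self):
        return gen()


class WithLen(NoLen):
    """Additionally reports a non-zero length."""

    def __len__(self):
        return 1


def encode_streaming(obj):
    # iterencode() yields the JSON text in chunks; join them for display
    return "".join(json.JSONEncoder().iterencode(obj))


print(encode_streaming(NoLen()))    # []
print(encode_streaming(WithLen()))  # [20, 30, 40]
```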

  • 2020-12-29 06:06

    Based on the accepted answer, here is the StreamArray I eventually went for. It contains two lies:

    1. The suggestion that self.__tail__ might be immutable
    2. len(StreamArray(some_gen)) is either 0 or 1

    class StreamArray(list):
    
        def __init__(self, gen):
            self.gen = gen
    
        def destructure(self):
            try:
                return self.__head__, self.__tail__, self.__len__
            except AttributeError:
                try:
                    self.__head__ = next(self.gen)  # works on Python 2 and 3
                    self.__tail__ = self.gen
                    self.__len__ = 1 # A lie
                except StopIteration:
                    self.__head__ = None
                    self.__tail__ = []
                    self.__len__ = 0
                return self.__head__, self.__tail__, self.__len__
    
        def rebuilt_gen(self):
            def rebuilt_gen_inner():
                head, tail, len_ = self.destructure()
                if len_ > 0:
                    yield head
                for elem in tail:
                    yield elem
            try:
                return self.__rebuilt_gen__
            except AttributeError:
                self.__rebuilt_gen__ = rebuilt_gen_inner()
                return self.__rebuilt_gen__
    
        def __iter__(self):
            return self.rebuilt_gen()
    
        def __next__(self):
            # return the next item, not the generator object itself
            return next(self.rebuilt_gen())
    
        def __len__(self):
            return self.destructure()[2]
    

    Single use only!
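
    What "single use" means in practice: once the wrapped generator has been consumed, iterating again yields nothing. The sketch below uses a hypothetical minimal wrapper, OneShotArray, in place of the class above, and a factory function, make_array, as one possible way to get a fresh wrapper for every serialization.

```python
class OneShotArray(list):
    """Hypothetical minimal wrapper; stands in for StreamArray above."""

    def __init__(self, gen):
        super(OneShotArray, self).__init__()
        self._gen = iter(gen)

    def __iter__(self):
        return self._gen

    def __len__(self):
        return 1  # the same "lie" as above


def make_array():
    # Factory: build a fresh generator and wrapper for every serialization
    return OneShotArray(x * x for x in range(4))


arr = make_array()
first_pass = list(arr)
second_pass = list(arr)  # the generator is already exhausted
fresh_pass = list(make_array())
print(first_pass, second_pass, fresh_pass)  # [0, 1, 4, 9] [] [0, 1, 4, 9]
```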

  • 2020-12-29 06:09

    As of simplejson 3.8.0, you can use the iterable_as_array option to make any iterable serializable as an array.

    # Since simplejson is backwards compatible, you should feel free to import
    # it as `json`
    import simplejson as json
    json.dumps((i*i for i in range(10)), iterable_as_array=True)
    

    The result is [0, 1, 4, 9, 16, 25, 36, 49, 64, 81].
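
    If adding simplejson is not an option, the default hook of the standard json module can produce the same output, with one important difference: it converts each iterable to a list first, so it does not help with data that is too large for memory. A minimal sketch, where dumps_iterable is a hypothetical helper name:

```python
import json


def dumps_iterable(obj):
    """Serialize obj, turning any otherwise unsupported iterable into a list.

    The `default` hook is only called for objects the encoder cannot
    handle itself, so strings, dicts and real lists are unaffected.
    Each iterable is fully materialized in memory before it is encoded.
    """
    return json.dumps(obj, default=list)


print(dumps_iterable(i * i for i in range(10)))
# [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```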
