How to represent a JSON object in CSV

Question

I'd like to export a JSON object to a CSV file. The object has sub-fields that themselves have sub-fields, some of which may hold arrays of objects, and I don't know how to represent this embedded data in CSV.

Answer 1:

This comes down to mapping semi-structured (tree-like) data to tabular data. This is not trivial at all because of the impedance mismatch between the two models: a tree can nest to arbitrary depth, whereas a table is flat.

There are several approaches commonly used (and taught) in practice, backed by extensive academic research. They were mostly established for XML but can in principle be applied to JSON as well. They more or less come down to:

Ad-hoc (schema-based) mapping
Edge shredding
Tree encoding

First, if your data follows regular patterns (like a schema), you can design an ad-hoc mapping that can, for example, map each leaf (value) to a column in CSV. You can preserve information on the structure using dots, assuming dots are not already used in fields.

For example:

{
  "foo" : {
    "bar" : 10
  },
  "foobar" : "foo"
}

can be mapped to:

| foo.bar | foobar |
|---------|--------|
|  10     |  foo   |
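
A minimal Python sketch of this ad-hoc mapping, assuming every value is either a nested object or a scalar (no arrays yet) and that no key contains a dot. The helper name `flatten` is hypothetical, not a library function:

```python
import csv
import io


def flatten(value, prefix=""):
    """Map each leaf of a nested dict to a dotted column name."""
    if isinstance(value, dict):
        row = {}
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            row.update(flatten(child, path))
        return row
    # Anything that is not an object is treated as a leaf value.
    return {prefix: value}


doc = {"foo": {"bar": 10}, "foobar": "foo"}
row = flatten(doc)

# Write a header line and a single data line to an in-memory CSV.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row), lineterminator="\n")
writer.writeheader()
writer.writerow(row)
csv_text = buffer.getvalue()
print(csv_text)
```

Note that an empty nested object produces no column at all in this sketch, so that piece of structure is lost on export.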

The trickier part is when arrays are involved. If you have a big array of similar objects, you can turn each object into a row of the output CSV:

{
  "objects" : [
    {
      "foo" : {
        "bar" : 10
      },
      "foobar" : "foo"
    },
    {
      "foo" : {
        "bar" : 40
      },
      "foobar" : "bar"
    },
    {
      "foo" : {
        "bar" : 50
      },
      "foobar" : "bar"
    }
  ]
}

could map to:

| objects.pos | objects.foo.bar | objects.foobar |
|-------------|-----------------|----------------|
|       1     |      10         |     foo        |
|       2     |      40         |     bar        |
|       3     |      50         |     bar        |

This approach is the easiest because the output CSV remains easy to understand, but the mapping has to be redesigned for each use case to fit your data, in particular for different arrangements of arrays.
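
The array case can be sketched in Python as follows, assuming a single top-level array whose elements are objects without further nested arrays. The helper names `flatten` and `array_to_rows` are hypothetical, and the `.pos` column simply records the 1-based position in the array:

```python
import csv
import io


def flatten(value, prefix):
    """Map each leaf of a nested dict to a dotted column name."""
    if isinstance(value, dict):
        row = {}
        for key, child in value.items():
            row.update(flatten(child, f"{prefix}.{key}"))
        return row
    return {prefix: value}


def array_to_rows(doc, array_key):
    """Turn each element of doc[array_key] into one flat row."""
    rows = []
    for pos, item in enumerate(doc[array_key], start=1):
        row = {f"{array_key}.pos": pos}
        row.update(flatten(item, array_key))
        rows.append(row)
    return rows


doc = {
    "objects": [
        {"foo": {"bar": 10}, "foobar": "foo"},
        {"foo": {"bar": 40}, "foobar": "bar"},
        {"foo": {"bar": 50}, "foobar": "bar"},
    ]
}
rows = array_to_rows(doc, "objects")

# Use the union of all column names, in order of first appearance,
# so that objects with missing fields still fit the same header.
fieldnames = []
for row in rows:
    for name in row:
        if name not in fieldnames:
            fieldnames.append(name)

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="", lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Objects that lack a field get an empty cell in that column. Arrays nested inside the array elements are not handled here and would need their own design decision, which is exactly the per-use-case tuning mentioned above.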

From a theoretical perspective, this first, ad-hoc approach is called normalizing the data, i.e., bringing it to first normal form or higher.

There are other approaches that are more generic, such as edge shredding and tree encoding. They may be overkill for your use case because decoding them takes considerable work; they are meant more for implementing complex XML queries on top of relational databases.

In short, with edge shredding, you create one table (CSV file) for each type (in JSON, that would be number, string, boolean, etc.) where you store the leaves, and one more table where you store the edges of the original JSON tree.
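
A rough Python sketch of the idea, under several assumptions of mine: nodes get sequential integer ids, the edge table has the columns `parent_id`, `label`, `child_id`, array positions are stored as labels, and each leaf table has the columns `node_id`, `value`. The published edge-shredding schemes differ in their exact column layout, and the function names here are hypothetical:

```python
import csv
import io
import itertools


def type_name(value):
    """Return the JSON type name of a leaf value."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # must come before the number check
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    raise TypeError(f"not a JSON leaf: {value!r}")


def shred(doc):
    """Split a JSON value into an edge table and one leaf table per type."""
    ids = itertools.count(1)
    edges = []   # (parent_id, label, child_id)
    leaves = {}  # type name -> list of (node_id, value)

    def visit(value, parent_id, label):
        node_id = next(ids)
        if parent_id is not None:
            edges.append((parent_id, label, node_id))
        if isinstance(value, dict):
            for key, child in value.items():
                visit(child, node_id, key)
        elif isinstance(value, list):
            for pos, child in enumerate(value, start=1):
                visit(child, node_id, str(pos))
        else:
            leaves.setdefault(type_name(value), []).append((node_id, value))

    visit(doc, None, None)
    return edges, leaves


def to_csv(header, rows):
    """Render one table as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


doc = {"foo": {"bar": 10}, "foobar": "foo"}
edges, leaves = shred(doc)
edges_csv = to_csv(["parent_id", "label", "child_id"], edges)
leaf_csvs = {name: to_csv(["node_id", "value"], rows) for name, rows in leaves.items()}
print(edges_csv)
```

This sketch does not record whether an inner node was an object or an array, so a faithful decoder would need an extra column for that.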

With tree encoding, you only use one single table (CSV file) that smartly stores all nodes and leaves of the tree. Again, it is tuned for XML but can probably be adapted.
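
As one possible illustration, the Python sketch below uses a pre-order rank, subtree size, and depth for each node. This particular scheme and the column names `pre`, `size`, `level`, `label`, `kind`, `value` are my assumptions, since several encodings exist; the function names are hypothetical:

```python
def kind_of(value):
    """Return the JSON kind of any value, inner node or leaf."""
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        return "array"
    if value is None:
        return "null"
    if isinstance(value, bool):  # must come before the number check
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    return "string"


def encode(doc):
    """Encode a JSON tree as one table with one row per node."""
    rows = []

    def visit(value, label, level):
        pre = len(rows)  # pre-order rank of this node
        row = {"pre": pre, "size": 0, "level": level,
               "label": label, "kind": kind_of(value), "value": ""}
        rows.append(row)
        if isinstance(value, dict):
            for key, child in value.items():
                visit(child, key, level + 1)
        elif isinstance(value, list):
            for pos, child in enumerate(value, start=1):
                visit(child, str(pos), level + 1)
        else:
            row["value"] = value
        # Number of descendants: every row added after this one.
        row["size"] = len(rows) - pre - 1

    visit(doc, "", 0)
    return rows


def descendants(rows, pre):
    """All rows in the subtree below the node with the given rank."""
    size = rows[pre]["size"]
    return rows[pre + 1 : pre + 1 + size]


doc = {"foo": {"bar": 10}, "foobar": "foo"}
rows = encode(doc)
for row in rows:
    print(row)
```

The point of such an encoding is that structural questions become range conditions on columns: the descendants of a node are the rows whose `pre` lies in the interval from `pre + 1` to `pre + size`, which a relational engine can evaluate without walking the tree.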

JSON is a bit younger than XML, so I am not sure how much research was already done on mapping to tables -- it is possible that there are also general mappings that specifically address JSON rather than XML, even though the general principles are similar.