Converting a comma-separated file to nested JSON objects in jq

Question

I have a CSV file that I would like to parse into nested JSON using jq. I started using jq recently and really like the tool. I understand the basic functionality, but parsing a CSV file seems difficult, especially when it comes to producing nested objects.

Sample Input

Gene, Exon,Total,Exon Bases, Total Bases, Fraction of Exon bases
PIK3CA,PIK3CA_Exon10;chr1;1000;1500,PIK3CA_Exon13;chr1;1000;1500,PIK3CA_Exon14;chr1;1000;1500,1927879,12993042,0.15
NRAS,NRAS_Exon4;chr1;1000;1500,NRAS_Amp_369;chr1;1000;1500,NRAS_Amp_371;chr1;1000;1500,NRAS_Amp_374;chr1;1000;1500,NRAS_Amp_379;chr1;1000;1500,884111,8062107,0.11

Header and Input data explanation

The first column always has exactly one value. It is followed by one or more exon columns: there are 3 in the 2nd row and 5 in the 3rd row. Exon bases is always the third-from-last column, Total bases is the second-from-last, and Fraction of exon bases is the last column.

Note

I have added the header for explanation purposes; it can be removed or modified for processing.

Expected output

{  
   "Exome regions":[  
      {  
         "metric":"PIK3CA",
         "value":[  
            {  
               "metric":"Exons",
               "value":[  
                  "PIK3CA_Exon10",
                  {
                   "chromosome":"chr1",
                   "start":1000,
                   "end":1500
                  },
                  "PIK3CA_Exon13",
                 {
                   "chromosome":"chr1",
                   "start":1000,
                   "end":1500
                  },
                  "PIK3CA_Exon14",
                  {
                   "chromosome":"chr1",
                   "start":1000,
                   "end":1500
                  }
               ],
               "type":"set"
            },
            {  
               "metric":"Fraction of bases",
               "value":0.15,
               "type":"simple"
            },
            {  
               "metric":"Total_bases",
               "value":1927879,
               "type":"simple"
            }
         ],
         "type":"set"
      },

      {  
         "metric":"NRAS",
         "value":[  
            {  
               "metric":"Exons",
               "value":[  
                  "NRAS_Exon4",
                  {
                   "chromosome":"chr1",
                   "start":1000,
                   "end":1500
                  },
                  "NRAS_Amp_369",
                 {
                   "chromosome":"chr1",
                   "start":1000,
                   "end":1500
                  },
                  "NRAS_Amp_371",
                 {
                   "chromosome":"chr1",
                   "start":1000,
                   "end":1500
                  },
                  "NRAS_Amp_374",
                 {
                   "chromosome":"chr1",
                   "start":1000,
                   "end":1500
                  },
                  "NRAS_Amp_379",
                 {
                   "chromosome":"chr1",
                   "start":1000,
                   "end":1500
                  }
               ],
               "type":"set"
            },
            {  
               "metric":"Fraction of bases",
               "value":0.11,
               "type":"simple"
            },
            {  
               "metric":"Total_bases",
               "value":884111,
               "type":"simple"
            }
         ],
         "type":"set"
      }
   ]
}

Thanks for your help in advance!!

PS: I need to add more information. I have to edit the Exon fields and add "Chromosome", "Start" and "End" to each Exon. Here I have given every exon the same start and end, but in the actual data they vary for each Exon. Can you please help me with this? Also, the parts of an Exon field could be separated by some other character; right now I separate them with ";".

Answer 1:

Here is a solution which uses functions for parsing and assembly of the output:

def parse:
  [
      inputs                     # read lines
    | split(",")                 # split into columns
    | select(length>0)           # eliminate blanks
    | .[:1] + [.[1:-3]] + .[-3:] # normalize columns
  ]
;
def simple(n;v): {metric:n, value:v|tonumber, type:"simple"};
def set(n;v):    {metric:n, value:v,          type:"set"};
def region:
  set(.[0]; [
      set("Exons"; .[1]),
      simple("Fraction of bases"; .[4]),
      simple("Total_bases"; .[2])
    ]
  )
;
{
   "Exome regions": parse | map(region)
}

Sample Run (assumes filter is in filter.jq and data in data.json)

$ jq -M -Rnr -f filter.jq data.json
{
  "Exome regions": [
    {
      "metric": "PIK3CA",
      "value": [
        {
          "metric": "Exons",
          "value": [
            "PIK3CA_Exon10",
            "PIK3CA_Exon13",
            "PIK3CA_Exon14"
          ],
          "type": "set"
        },
        {
          "metric": "Fraction of bases",
          "value": 0.15,
          "type": "simple"
        },
        {
          "metric": "Total_bases",
          "value": 1927879,
          "type": "simple"
        }
      ],
      "type": "set"
    },
    {
      "metric": "NRAS",
      "value": [
        {
          "metric": "Exons",
          "value": [
            "NRAS_Exon4",
            "NRAS_Amp_369",
            "NRAS_Amp_371",
            "NRAS_Amp_374",
            "NRAS_Amp_379"
          ],
          "type": "set"
        },
        {
          "metric": "Fraction of bases",
          "value": 0.11,
          "type": "simple"
        },
        {
          "metric": "Total_bases",
          "value": 884111,
          "type": "simple"
        }
      ],
      "type": "set"
    }
  ]
}
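
To see what the "normalize columns" step in parse does on its own, here is a minimal sketch that runs just that expression against a single, shortened sample row. The row contents and the shell variable names are illustrative assumptions, not part of the original answer.

```shell
# One sample row; the number of exon columns varies from row to row.
row='PIK3CA,PIK3CA_Exon10;chr1;1000;1500,PIK3CA_Exon13;chr1;1000;1500,1927879,12993042,0.15'

# -R reads the line as a raw string, -c prints compact JSON.
# .[:1] keeps the gene, [.[1:-3]] wraps all exon columns in one array,
# and .[-3:] keeps the three trailing numeric columns.
normalized=$(printf '%s\n' "$row" | jq -R -c 'split(",") | .[:1] + [.[1:-3]] + .[-3:]')
echo "$normalized"
# prints ["PIK3CA",["PIK3CA_Exon10;chr1;1000;1500","PIK3CA_Exon13;chr1;1000;1500"],"1927879","12993042","0.15"]
```

Because the variable number of exon columns collapses into one nested array at index 1, the three trailing columns always sit at indices 2, 3 and 4, however many exons a row has.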


Here is a solution to the revised problem:

def parse:
  [
      inputs                     # read lines
    | split(",")                 # split into columns
    | select(length>0)           # eliminate blanks
    | .[:1] + [.[1:-3]] + .[-3:] # normalize columns
  ]
;
def simple(n;v): {metric:n, value:v|tonumber, type:"simple"};
def set(n;v):    {metric:n, value:v,          type:"set"};
def exons(v):    [ v[] | split(";") | .[0], {"chromosome":.[1], "start":(.[2]|tonumber), "end":(.[3]|tonumber)} ];
def region:
  set(.[0]; [
      set("Exons"; exons(.[1])),
      simple("Fraction of bases"; .[4]),
      simple("Total_bases"; .[2])
    ]
  )
;

{ "Exome regions": parse | map(region) }
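
The question mentions that the character separating the parts of an exon field may change. One way to accommodate that, sketched here under the assumption that the separator is supplied on the command line with --arg (the variable name $sep and the shell variables are hypothetical), is to split on a variable instead of a literal ";":

```shell
# A single exon field as it appears in the sample input.
exon='NRAS_Amp_369;chr1;1000;1500'

# --arg binds the string ';' to the jq variable $sep (an assumed name).
# The exon name is emitted first, followed by an object with its coordinates;
# tonumber turns the start and end strings into JSON numbers.
parsed=$(printf '%s\n' "$exon" | jq -R -c --arg sep ';' '
  [ split($sep)
    | .[0], {"chromosome": .[1], "start": (.[2] | tonumber), "end": (.[3] | tonumber)} ]
')
echo "$parsed"
# prints ["NRAS_Amp_369",{"chromosome":"chr1","start":1000,"end":1500}]
```

Changing the separator then only requires a different --arg value, with no edit to the filter itself.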


Answer 2:

Here is a solution that (a) assumes there is no header row, in accordance with the comment about the headers (but see below); (b) does not "slurp" the file (i.e., does not read the entire file into memory); and (c) assumes a version of jq with inputs. (If your jq does not have inputs, it would be very easy to modify the following accordingly.)

def parse_row:
  split(",")
  | length as $length
  | .[1: $length - 3] as $exons
  | { metric: .[0],
      value: [
        { metric: "Exons",
          value: $exons,
          type: "set" },
        { metric: "Fraction of bases",
          value: (.[$length - 1] | tonumber),
          type: "simple" },
        { metric: "Total_bases",
          value: (.[$length - 3] | tonumber),
          type: "simple" }
      ],
      type: "set"
    } ;

[inputs | parse_row]
| { "Exome regions": . }

The appropriate invocation of jq would be along the following lines:

jq -n -R -f program.jq input.txt

This produces the required JSON.

(The -R is for "raw input".)
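
For a jq without inputs, one possible adaptation (a rough sketch, not the answer author's own variant) is to run the row filter once per line and let a second jq process assemble the array. The shortened rows, the reduced output object and the variable names below are assumptions made for illustration; note that the second stage does slurp the per-row objects.

```shell
# Two shortened sample rows without a header.
rows='PIK3CA,PIK3CA_Exon10,PIK3CA_Exon13,1927879,12993042,0.15
NRAS,NRAS_Exon4,884111,8062107,0.11'

# Stage 1: with -R and no -n, the filter runs once per input line and
#          emits one compact JSON object per row.
# Stage 2: -s collects those objects into one array, which is then wrapped.
result=$(printf '%s\n' "$rows" \
  | jq -R -c 'split(",") | length as $n | {metric: .[0], exons: .[1: $n - 3]}' \
  | jq -s -c '{"Exome regions": .}')
echo "$result"
```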

If the input file does have a header row, the above solution will still work provided only that you drop the "-n" command-line option.
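
The effect of dropping -n can be checked with a small experiment. The two-line input and the variable names here are made up for illustration; the filter only extracts the first column to keep the output short.

```shell
# A header line followed by one data row.
data='Gene,Exon,Exon Bases,Total Bases,Fraction of Exon bases
PIK3CA,PIK3CA_Exon10;chr1;1000;1500,1927879,12993042,0.15'

# Without -n, jq binds the first line (the header) to `.`, which this
# program never uses; `inputs` then yields only the remaining lines.
genes=$(printf '%s\n' "$data" | jq -R -c '[inputs | split(",") | .[0]]')
echo "$genes"
# prints ["PIK3CA"]
```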

Please note that although the input file has comma-separated values, it is not really a CSV file.

Source: https://stackoverflow.com/questions/46632107/converting-comma-separated-file-to-nested-objects-json-in-jq

Tags

json

bioinformatics

hierarchical-data