How to extract data from asn1 data file and load it into a dataframe?

橙三吉。 Submitted on 2020-01-24 22:11:06

Question


My ultimate goal is to load metadata received from PubMed into a PySpark DataFrame. So far, I have managed to download the data I want from the PubMed database using a shell script. The downloaded data is in ASN.1 format. Here is an example of a data entry:

Pubmed-entry ::= {
  pmid 31782536,
  medent {
    em std {
      year 2019,
      month 11,
      day 30,
      hour 6,
      minute 0
    },
    cit {
      title {
        name "Impact of CYP2C19 genotype and drug interactions on voriconazole
 plasma concentrations: a spain pharmacogenetic-pharmacokinetic prospective
 multicenter study."
      },
      authors {
        names std {
          {
            name ml "Blanco Dorado S",
            affil str "Pharmacy Department, University Clinical Hospital
 Santiago de Compostela (CHUS). Santiago de Compostela, Spain.; Clinical
 Pharmacology Group, University Clinical Hospital, Health Research Institute
 of Santiago de Compostela (IDIS). Santiago de Compostela, Spain.; Department
 of Pharmacology, Pharmacy and Pharmaceutical Technology, Faculty of Pharmacy,
 University of Santiago de Compostela (USC). Santiago de Compostela, Spain."
          },
          {
            name ml "Maronas O",
            affil str "Genomic Medicine Group, Centro Nacional de Genotipado
 (CEGEN-PRB3), CIBERER, CIMUS, University of Santiago de Compostela (USC),
 Santiago de Compostela, Spain."
          },
          {
            name ml "Latorre-Pellicer A",
            affil str "Genomic Medicine Group, Centro Nacional de Genotipado
 (CEGEN-PRB3), CIBERER, CIMUS, University of Santiago de Compostela (USC),
 Santiago de Compostela, Spain."
          },
          {
            name ml "Rodriguez Jato T",
            affil str "Pharmacy Department, University Clinical Hospital
 Santiago de Compostela (CHUS). Santiago de Compostela, Spain."
          },
          {
            name ml "Lopez-Vizcaino A",
            affil str "Pharmacy Department, University Hospital Lucus Augusti
 (HULA). Lugo, Spain."
          },
          {
            name ml "Gomez Marquez A",
            affil str "Pharmacy Department, University Hospital Ourense
 (CHUO). Ourense, Spain."
          },
          {
            name ml "Bardan Garcia B",
            affil str "Pharmacy Department, University Hospital Ferrol (CHUF).
 A Coruna, Spain."
          },
          {
            name ml "Belles Medall D",
            affil str "Pharmacy Department, General University Hospital
 Castellon (GVA). Castellon, Spain."
          },
          {
            name ml "Barbeito Castineiras G",
            affil str "Microbiology Department, University Clinical Hospital
 Santiago de Compostela (CHUS). Santiago de Compostela, Spain."
          },
          {
            name ml "Perez Del Molino Bernal ML",
            affil str "Microbiology Department, University Clinical Hospital
 Santiago de Compostela (CHUS). Santiago de Compostela, Spain."
          },
          {
            name ml "Campos-Toimil M",
            affil str "Department of Pharmacology, Pharmacy and Pharmaceutical
 Technology, Faculty of Pharmacy, University of Santiago de Compostela (USC).
 Santiago de Compostela, Spain."
          },
          {
            name ml "Otero Espinar F",
            affil str "Department of Pharmacology, Pharmacy and Pharmaceutical
 Technology, Faculty of Pharmacy, University of Santiago de Compostela (USC).
 Santiago de Compostela, Spain."
          },
          {
            name ml "Blanco Hortas A",
            affil str "Epidemiology Unit. Fundacion Instituto de Investigacion
 Sanitaria de Santiago de Compostela (FIDIS), University Hospital Lucus
 Augusti (HULA), Spain."
          },
          {
            name ml "Duran Pineiro G",
            affil str "Clinical Pharmacology Group, University Clinical
 Hospital, Health Research Institute of Santiago de Compostela (IDIS).
 Santiago de Compostela, Spain."
          },
          {
            name ml "Zarra Ferro I",
            affil str "Pharmacy Department, University Clinical Hospital
 Santiago de Compostela (CHUS). Santiago de Compostela, Spain.; Clinical
 Pharmacology Group, University Clinical Hospital, Health Research Institute
 of Santiago de Compostela (IDIS). Santiago de Compostela, Spain."
          },
          {
            name ml "Carracedo A",
            affil str "Genomic Medicine Group, Centro Nacional de Genotipado
 (CEGEN-PRB3), CIBERER, CIMUS, University of Santiago de Compostela (USC),
 Santiago de Compostela, Spain.; Galician Foundation of Genomic Medicine,
 Health Research Institute of Santiago de Compostela (IDIS), SERGAS, Santiago
 de Compostela, Spain."
          },
          {
            name ml "Lamas MJ",
            affil str "Clinical Pharmacology Group, University Clinical
 Hospital, Health Research Institute of Santiago de Compostela (IDIS).
 Santiago de Compostela, Spain."
          },
          {
            name ml "Fernandez-Ferreiro A",
            affil str "Pharmacy Department, University Clinical Hospital
 Santiago de Compostela (CHUS). Santiago de Compostela, Spain.; Clinical
 Pharmacology Group, University Clinical Hospital, Health Research Institute
 of Santiago de Compostela (IDIS). Santiago de Compostela, Spain.; Department
 of Pharmacology, Pharmacy and Pharmaceutical Technology, Faculty of Pharmacy,
 University of Santiago de Compostela (USC). Santiago de Compostela, Spain."
          }
        }
      },
      from journal {
        title {
          iso-jta "Pharmacotherapy",
          ml-jta "Pharmacotherapy",
          issn "1875-9114",
          name "Pharmacotherapy"
        },
        imp {
          date std {
            year 2019,
            month 11,
            day 29
          },
          language "eng",
          pubstatus aheadofprint,
          history {
            {
              pubstatus other,
              date std {
                year 2019,
                month 11,
                day 30,
                hour 6,
                minute 0
              }
            },
            {
              pubstatus pubmed,
              date std {
                year 2019,
                month 11,
                day 30,
                hour 6,
                minute 0
              }
            },
            {
              pubstatus medline,
              date std {
                year 2019,
                month 11,
                day 30,
                hour 6,
                minute 0
              }
            }
          }
        }
      },
      ids {
        pubmed 31782536,
        doi "10.1002/phar.2351",
        other {
          db "ELocationID doi",
          tag str "10.1002/phar.2351"
        }
      }
    },
    abstract "BACKGROUND: Voriconazole, a first-line agent for the treatment
 of invasive fungal infections, is mainly metabolized by cytochrome P450 (CYP)
 2C19. A significant portion of patients fail to achieve therapeutic
 voriconazole trough concentrations, with a consequently increased risk of
 therapeutic failure. OBJECTIVE: To show the association between
 subtherapeutic voriconazole concentrations and factors affecting voriconazole
 pharmacokinetics: CYP2C19 genotype and drug-drug interactions. METHODS:
 Adults receiving voriconazole for antifungal treatment or prophylaxis were
 included in a multicenter prospective study conducted in Spain. The
 prevalence of subtherapeutic voriconazole troughs were analyzed in the rapid
 metabolizer and ultra-rapid metabolizer patients (RMs and UMs, respectively),
 and compared with the rest of the patients. The relationship between
 voriconazole concentration, CYP2C19 phenotype, adverse events (AEs), and
 drug-drug interactions was also assessed. RESULTS: In this study 78 patients
 were included with a wide variability in voriconazole plasma levels with only
 44.8% of patients attaining trough concentrations within the therapeutic
 range of 1 and 5.5 microg/ml. The allele frequency of *17 variant was found
 to be 29.5%. Compared with patients with other phenotypes, RMs and UMs had a
 lower voriconazole plasma concentration (RM/UM: 1.85+/-0.24 microg/ml versus
 other phenotypes: 2.36+/-0.26 microg/ml, ). Adverse events were more common
 in patients with higher voriconazole concentrations (p<0.05). No association
 between voriconazole trough concentration and other factors (age, weight,
 route of administration, and concomitant administration of enzyme inducer,
 enzyme inhibitor, glucocorticoids, or proton pump inhibitors) was found.
 CONCLUSION: These results suggest the potential clinical utility of using
 CYP2C19 genotype-guided voriconazole dosing to achieve concentrations in the
 therapeutic range in the early course of therapy. Larger studies are needed
 to confirm the impact of pharmacogenetics on voriconazole pharmacokinetics.",
    pmid 31782536,
    pub-type {
      "Journal Article"
    },
    status publisher
  }
}

This is where I am stuck. I do not know how to extract the information from the ASN.1 data and get it into a PySpark DataFrame. Could anyone suggest a way of doing this?


Answer 1:


The above data is definitely in an "ASN.1 format". This format is called ASN.1 Value Notation and is used to represent ASN.1 values textually. (This format pre-dates the standardization of the JSON encoding rules. Today, one could use JSON for the same purpose, with some differences in the way the JSON would be processed compared to the ASN.1 value notation).

The ASN.1 schema that YaFred posted above contains a few errors, as YaFred himself noted. The notation you posted yourself also seems to contain a few errors. I have looked at the whole set of NCBI ASN.1 files and noticed that they contain several errors. Because of this, they cannot be handled by a standard-conforming ASN.1 tool (such as the ASN.1 playground) unless they are fixed. Some of those errors are easy to fix, but fixing others requires knowledge of the intent of the author of those files. This state of affairs is probably due to the fact that the NCBI project uses its own ASN.1 toolkit, which perhaps uses ASN.1 in some non-standard way.

I would imagine that in the NCBI toolkit there should be some means for you to decode the above value notation, so if I were you I would look into that toolkit. I am unable to give you a better suggestion because I don't know the NCBI toolkit.




Answer 2:


Your problem may not be simple but it's worth experimenting.

Method 1:

As you have the specification, you can try looking for an ASN.1 tool (aka ASN.1 compiler) that will create a data model. In your case, because you downloaded a textual ASN.1 value, you need this tool to provide ASN.1 value decoders.

If the tool generated Java code, using it would look something like this:

// decode a Pubmed-entry
// input is your data
Asn1ValueReader reader = new Asn1ValueReader(input);
PubmedEntry obj = PubmedEntry.readPdu(reader);
// access the data
obj.getPmid();
obj.getMedent();

A few caveats:

  • Tools that can do all that will not be free (if you find one at all). The problem here is that you have a textual ASN.1 value, while tools generally provide binary decoders (BER, DER, etc.).
  • You will have a lot of glue code to write to create the record that goes into your PySpark DataFrame.

I wrote this some time ago, but it does not have textual ASN.1 value decoders.
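To give an idea of the glue code mentioned in the caveats, here is a minimal Python sketch. It assumes the decoder (whichever one you end up with) hands back plain nested dicts and lists that mirror the value notation; that shape, the helper names `dig`, `to_row` and `to_spark`, and the choice of columns are all assumptions made for illustration, not part of any particular tool.

```python
def dig(obj, *path):
    """Follow keys/indexes through nested dicts and lists; None if a step is missing."""
    for key in path:
        try:
            obj = obj[key]
        except (KeyError, IndexError, TypeError):
            return None
    return obj


def to_row(entry):
    """Flatten one decoded Pubmed-entry (assumed nested dict) into a flat record."""
    cit = dig(entry, "medent", "cit") or {}
    authors = dig(cit, "authors", "names", "std") or []
    return {
        "pmid": dig(entry, "pmid"),
        "title": dig(cit, "title", "name"),
        "journal": dig(cit, "from", "journal", "title", "name"),
        "year": dig(cit, "from", "journal", "imp", "date", "std", "year"),
        "doi": dig(cit, "ids", "doi"),
        "authors": [dig(author, "name", "ml") for author in authors],
        "abstract": dig(entry, "medent", "abstract"),
    }


def to_spark(rows):
    """Build a DataFrame from flat records. Needs pyspark, so it is imported lazily."""
    from pyspark.sql import Row, SparkSession

    spark = SparkSession.builder.getOrCreate()
    return spark.createDataFrame([Row(**row) for row in rows])


# Hand-written stand-in for a decoded entry, abbreviated from the question's data.
decoded = {
    "pmid": 31782536,
    "medent": {
        "cit": {
            "title": {"name": "Impact of CYP2C19 genotype and drug interactions on "
                              "voriconazole plasma concentrations: a spain "
                              "pharmacogenetic-pharmacokinetic prospective "
                              "multicenter study."},
            "authors": {"names": {"std": [
                {"name": {"ml": "Blanco Dorado S"}},
                {"name": {"ml": "Maronas O"}},
            ]}},
            "from": {"journal": {
                "title": {"name": "Pharmacotherapy", "issn": "1875-9114"},
                "imp": {"date": {"std": {"year": 2019, "month": 11, "day": 29}}},
            }},
            "ids": {"pubmed": 31782536, "doi": "10.1002/phar.2351"},
        },
        "status": "publisher",
    },
}

row = to_row(decoded)
```

Calling `to_spark([to_row(e) for e in entries])` would then give one DataFrame row per entry. Missing fields come out as `None`, so if a column is empty in every row you would probably need to pass an explicit schema instead of relying on type inference.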

Method 2:

Since your data is textual, and if it is simple enough, you can try writing your own parser (using a tool like ANTLR). This method is not easy to evaluate if you are not familiar with parsers.
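For data shaped like the entry in the question, a hand-written recursive-descent parser may already be enough. The following Python sketch does not use the ASN.1 specification at all: it only follows the braces, commas, quoted strings, integers and identifiers of the text. It rests on several assumptions: a bare identifier is an enumerated value, an identifier followed by a value is a named field or a CHOICE, a line break inside a string is only a wrap point, and a repeated field name inside one block keeps only the last value. All names (`tokenize`, `Parser`, `parse_entries`) are made up for this example.

```python
import re

# One token: quoted string ("" escapes a quote), integer, identifier, or symbol.
TOKEN = re.compile(
    r'\s*(?:(?P<str>"(?:[^"]|"")*")|(?P<num>-?\d+)'
    r'|(?P<id>[A-Za-z][\w-]*)|(?P<sym>::=|[{},]))'
)
END = (("sym", ","), ("sym", "}"), ("eof", ""))  # tokens that end a value


def tokenize(text):
    text, pos, tokens = text.rstrip(), 0, []
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected input at offset {pos}")
        tokens.append((match.lastgroup, match.group(match.lastgroup)))
        pos = match.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens, self.i = tokens, 0

    def peek(self):
        return self.tokens[self.i] if self.i < len(self.tokens) else ("eof", "")

    def next(self):
        token = self.peek()
        self.i += 1
        return token

    def value(self):
        kind, text = self.next()
        if kind == "str":
            # Assumption: a line break inside a string is only a wrap point.
            return re.sub(r"\s*\n\s*", " ", text[1:-1]).replace('""', '"')
        if kind == "num":
            return int(text)
        if (kind, text) == ("sym", "{"):
            return self.block()
        if kind == "id":
            # Bare identifier: enumerated value. Identifier + value: CHOICE.
            return text if self.peek() in END else {text: self.value()}
        raise ValueError(f"unexpected token {text!r}")

    def block(self):
        items = []
        while self.peek() != ("sym", "}"):
            kind, text = self.peek()
            if kind == "id":
                self.next()
                named = self.peek() not in END
                items.append((text, self.value()) if named else (None, text))
            else:
                items.append((None, self.value()))
            if self.peek() == ("sym", ","):
                self.next()
        self.next()  # consume the closing brace
        if all(name is None for name, _ in items):
            return [value for _, value in items]  # unnamed elements: a list
        return dict(items)  # named fields: a dict (a repeated name keeps the last)


def parse_entries(text):
    """Parse every `TypeName ::= value` record found in text."""
    parser, entries = Parser(tokenize(text)), []
    while parser.peek() != ("eof", ""):
        (_, type_name), assign = parser.next(), parser.next()
        if assign != ("sym", "::="):
            raise ValueError("expected 'TypeName ::= value'")
        entries.append((type_name, parser.value()))
    return entries


SAMPLE = '''Pubmed-entry ::= {
  pmid 31782536,
  medent {
    em std { year 2019, month 11 },
    cit {
      title {
        name "Impact of CYP2C19 genotype and drug interactions on voriconazole
 plasma concentrations."
      },
      authors {
        names std { { name ml "Blanco Dorado S" }, { name ml "Maronas O" } }
      }
    },
    pub-type { "Journal Article" },
    status publisher
  }
}'''

type_name, entry = parse_entries(SAMPLE)[0]
```

With this, `entry["medent"]["cit"]["authors"]["names"]["std"]` is a list of author dicts and `entry["medent"]["status"]` is the string `"publisher"`. Because the parser ignores the schema, it cannot tell a SET OF CHOICE from a record with repeated field names, so check the output against a few real entries before relying on it.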

EDIT: Unfortunately, the specification is not valid.



Source: https://stackoverflow.com/questions/59219279/how-to-extract-data-from-asn1-data-file-and-load-it-into-a-dataframe
