Creating a Dict from CSV dataflow python

耗尽温柔 提交于 2020-12-12 12:32:51

问题


I am trying to make a dict from csv data in python, I do not want to use the traditional split(',') and then using renaming the rows to the heading I would like, as I will be recieving different csv files with different amounts of information, and I will not be able to consistently target the rows I want with that method.

THE HEADER NAMES WILL BE CONSISTENT, just their maybe more headers in one file compared to another

Instead, I have been trying to formulate a list from the CSV file, then zipping the first row into the rest of the rows to create a dictionary, then I can extract the exact contents I want.

I can create a list of lists, by either using the csv.reader or :

class Split(beam.DoFn):
    def process(self, element):
        rows = element.splitlines()
        data = []
        for row in rows:
            data.append([row])
        return data

This returns:

[u'FIRST_NAME,last_name,birthdate,voter_id,phone_number']
[u'hector,ABAD,6/15/1970,11*******,7*********']
[u'm,ABAL,6/16/1949,12********,']
[u'jorge,ABDALA,6/15/1962,21********,3********']
[u'karen,ABELLA,6/18/1988,33********,']

Although when I try to access the first row via:

rows = element.splitlines()
data = []
for row in rows:
    # f = pattern.findall(row)
    data.append([row])
return data[0]

It returns:

FIRST_NAME,last_name,birthdate,voter_id,phone_number
hector,ABAD,6/15/1970,11*******,7*********
m,ABAL,6/16/1949,109055849,
jorge,ABDALA,6/15/1962,21********,3********
karen,ABELLA,6/18/1988,33********,

I have also tried the beam_utils csv reader although this says that there is no module named 'sources' after I fix the fileio bug.

If someone knows a better way or can point me towards what I'm doing wrong that would be great, also this is my pipeline:

with beam.Pipeline(options=pipeline_options) as p:
    (p
     | 'Read' >> ReadFromText(known_args.input)
     | 'Split Values' >> beam.ParDo(Split())
     | 'WriteToText' >> beam.io.WriteToText(known_args.output)) 

I am only reading from my google-cloud storage bucket for now, but in the future it will be from pubsub.

I would like the content to look like:

{"FIRST_NAME": "hector", "last_name": "ABAD", "birthdate": "6/15/1970", "voter_id": 11*******, "phone_number": 7*********}
etc.
etc.
etc.

回答1:


Processing the header element of csv files doesn't seem to be well supported by the python beam SDK (other than discarding it). Fortunately someone has created this repo for dealing with this use case: https://github.com/pabloem/beam_utils

It contains a CSVFileSource class extending FileBasedSource (Beam's abstract class for creating custom file sources) to create your dict from the file with variable headers.

Install:

pip install beam_utils
from beam_utils.sources import CsvFileSource

It can be used like:

 p | 'ReadCsvFile' >> beam.io.Read(CsvFileSource(known_args.input))

Should produce the output you're looking for.

Edit: To make the package available to Dataflow workers create a tar and provide to the job with --extra_package flag as in https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#local-or-nonpypi




回答2:


Check out python library module csv.DictReader: https://docs.python.org/2/library/csv.html#csv.DictReader

Copying the example from documentation for quick reference

>>> import csv
>>> with open('names.csv') as csvfile:
...     reader = csv.DictReader(csvfile)
...     for row in reader:
...         print(row['first_name'], row['last_name'])


来源:https://stackoverflow.com/questions/51053092/creating-a-dict-from-csv-dataflow-python

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!