Read data from CSV file and transform from string to correct data-type, including a list-of-integer column

难免孤独 2020-11-28 09:44

When I read data back in from a CSV file, every cell is interpreted as a string.

  • How can I automatically convert the data I read in into the correct type?
7 answers
  • 2020-11-28 10:19

    An alternative to ast.literal_eval (although it seems a bit extreme) is the pyparsing module, available on PyPI. Check whether the code sample at http://pyparsing.wikispaces.com/file/view/parsePythonValue.py is appropriate for what you require, or can easily be adapted.
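
    For comparison, the ast.literal_eval approach that this answer treats as the default can be sketched roughly as follows. The helper name `convert` and the sample row are assumptions made for illustration; cells that are not valid Python literals are simply left as strings.

```python
import ast


def convert(text):
    """Return the Python value that `text` spells, or `text` itself."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        # Not a valid Python literal (for example, a bare word such as foo).
        return text


row = ["True", "foo", "1", "2.3", "[1, 2]"]
converted = [convert(cell) for cell in row]
print(converted)
```

    The fallback matters because a bare word such as foo is not a literal, so literal_eval raises instead of returning it.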

  • 2020-11-28 10:21

    You have to map your rows:

    import csv
    import io
    
    data = """True,foo,1,2.3,baz
    False,bar,7,9.8,qux"""
    
    reader = csv.reader(io.StringIO(data), delimiter=",")
    # 'True' maps to True; any other value in the first column maps to False.
    parsed = (({'True':True}.get(row[0], False),
               row[1],
               int(row[2]),
               float(row[3]),
               row[4])
              for row in reader)
    for row in parsed:
        print(row)
    

    results in

    (True, 'foo', 1, 2.3, 'baz')
    (False, 'bar', 7, 9.8, 'qux')
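
    The same row mapping can also be written with one converter per column, which may be easier to extend when columns are added. This is only a sketch of that variation; the `converters` tuple is a name introduced here, not part of the answer above.

```python
import csv
import io

data = """True,foo,1,2.3,baz
False,bar,7,9.8,qux"""

# One converter per column, in column order.
converters = (lambda s: s == 'True', str, int, float, str)

reader = csv.reader(io.StringIO(data))
parsed = [tuple(convert(cell) for convert, cell in zip(converters, row))
          for row in reader]
print(parsed)
```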
    
  • 2020-11-28 10:26

    Props to Jon Clements and cortopy for teaching me about ast.literal_eval! Here's what I ended up going with (Python 2; changes for 3 should be trivial):

    from ast import literal_eval
    from csv import DictReader
    import csv
    
    
    def csv_data(filepath, **col_conversions):
        """Yield rows from the CSV file as dicts, with column headers as the keys.
    
        Values in the CSV rows are converted to Python values when possible,
        and are kept as strings otherwise.
    
        Specific conversion functions for columns may be specified via
        `col_conversions`: if a column's header is a key in this dict, its
        value will be applied as a function to the CSV data. Specify
        `ColumnHeader=str` if all values in the column should be interpreted
        as unquoted strings, but might be valid Python literals (`True`,
        `None`, `1`, etc.).
    
        Example usage:
    
        >>> csv_data(filepath,
        ...          VariousWordsIncludingTrueAndFalse=str,
        ...          NumbersOfVaryingPrecision=float,
        ...          FloatsThatShouldBeRounded=lambda v: round(float(v)),
        ...          **{'Column Header With Spaces': arbitrary_function})
        """
    
        def parse_value(key, value):
            if key in col_conversions:
                return col_conversions[key](value)
            try:
                # Interpret the string as a Python literal
                return literal_eval(value)
            except Exception:
                # If that doesn't work, assume it's an unquoted string
                return value
    
        with open(filepath) as f:
            # QUOTE_NONE: don't process quote characters, to avoid the value
            # `"2"` becoming the int `2`, rather than the string `'2'`.
            for row in DictReader(f, quoting=csv.QUOTE_NONE):
                # .items() works on Python 2.7 and 3; iteritems() is Python 2 only
                yield {k: parse_value(k, v) for k, v in row.items()}
    

    (I'm a little wary that I might have missed some corner cases involving quoting. Please comment if you see any issues!)
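
    One quoting corner case can be checked directly: with QUOTE_NONE the reader no longer treats quote characters specially, so a quoted field that contains the delimiter (such as a list column) is split apart. The one-line sample below is made up for illustration.

```python
import csv
import io

# Hypothetical one-line sample: the list column is quoted because it
# contains the delimiter.
sample = 'True,Cellphone,34,"[1, 2]"\n'

default_row = next(csv.reader(io.StringIO(sample)))
unquoted_row = next(csv.reader(io.StringIO(sample), quoting=csv.QUOTE_NONE))

print(default_row)   # the quoted list stays in one field
print(unquoted_row)  # the list is split at its inner comma
```

    So QUOTE_NONE keeps `"2"` distinguishable from `2`, at the cost of breaking any field that relies on quoting to protect an embedded comma.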

  • 2020-11-28 10:27

    I know this is a fairly old question, tagged python-2.5, but here's an answer that works with Python 3.6+, which might be of interest to folks using more up-to-date versions of the language.

    It leverages the built-in typing.NamedTuple class, which was added in Python 3.5 (the class-based syntax with field annotations used here requires 3.6). What may not be evident from the documentation is that the "type" of each field can be a function.

    The example usage code also uses f-string literals, which weren't added until Python 3.6, but their use isn't required to do the core data-type transformations.

    #!/usr/bin/env python3.6
    import ast
    import csv
    from typing import NamedTuple
    
    
    class Record(NamedTuple):
        """ Define the fields and their types in a record. """
        IsActive: bool
        Type: str
        Price: float
        States: ast.literal_eval  # Handles string representation of literals.
    
        @classmethod
        def _transform(cls: 'Record', dict_: dict) -> dict:
            """ Convert string values in given dictionary to corresponding Record
                field type.
            """
            return {name: cls.__annotations__[name](value)
                        for name, value in dict_.items()}
    
    
    filename = 'test_transform.csv'
    
    with open(filename, newline='') as file:
        for i, row in enumerate(csv.DictReader(file)):
            row = Record._transform(row)
            print(f'row {i}: {row}')
    

    Output:

    row 0: {'IsActive': True, 'Type': 'Cellphone', 'Price': 34.0, 'States': [1, 2]}
    row 1: {'IsActive': False, 'Type': 'FlatTv', 'Price': 3.5, 'States': [2]}
    row 2: {'IsActive': True, 'Type': 'Screen', 'Price': 100.23, 'States': [5, 1]}
    row 3: {'IsActive': True, 'Type': 'Notebook', 'Price': 50.0, 'States': [1]}
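
    One caveat when using bool as a field type: applied to a string, it only tests whether the string is empty, so the text False converts to True. If the file spells booleans out, a small converter can be used as the field type instead. The `parse_bool` helper below is hypothetical, and the accepted spellings are an assumption.

```python
def parse_bool(text):
    """Hypothetical helper: map common spellings to a real bool."""
    normalized = text.strip().lower()
    if normalized in ("true", "1", "yes"):
        return True
    if normalized in ("false", "0", "no", ""):
        return False
    raise ValueError(f"not a boolean: {text!r}")


print(bool("False"))        # True, because the string is non-empty
print(bool(""))             # False
print(parse_bool("False"))  # False
```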
    

    Generalizing this by creating a base class with just the generic classmethod in it is not simple because of the way typing.NamedTuple is implemented.

    To avoid that issue, in Python 3.7+, a dataclasses.dataclass could be used instead because they do not have the inheritance issue — so creating a generic base class that can be reused is simple:

    #!/usr/bin/env python3.7
    import ast
    import csv
    from dataclasses import dataclass, fields
    from typing import Type, TypeVar
    
    T = TypeVar('T', bound='GenericRecord')
    
    class GenericRecord:
        """ Generic base class for transforming dataclasses. """
        @classmethod
        def _transform(cls: Type[T], dict_: dict) -> dict:
            """ Convert string values in given dictionary to corresponding type. """
            return {field.name: field.type(dict_[field.name])
                        for field in fields(cls)}
    
    
    @dataclass
    class CSV_Record(GenericRecord):
        """ Define the fields and their types in a record.
            Field names must match column names in CSV file header.
        """
        IsActive: bool
        Type: str
        Price: float
        States: ast.literal_eval  # Handles string representation of literals.
    
    
    filename = 'test_transform.csv'
    
    with open(filename, newline='') as file:
        for i, row in enumerate(csv.DictReader(file)):
            row = CSV_Record._transform(row)
            print(f'row {i}: {row}')
    

    In one sense it's not really very important which one you use, because an instance of the class is never created. Using one is just a clean way of specifying and holding a definition of the field names and their types in a record data-structure.
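
    If actual instances are wanted, the same field definitions can drive their construction. The sketch below is a variation, not the code above: the `Item` class, its `from_row` helper, and the in-memory sample are all assumptions made for illustration.

```python
import ast
import csv
import io
from dataclasses import dataclass, fields


@dataclass
class Item:
    # Each annotation doubles as the converter for that column.
    Type: str
    Price: float
    States: ast.literal_eval

    @classmethod
    def from_row(cls, row):
        """Build an instance from a dict of strings (hypothetical helper)."""
        return cls(**{f.name: f.type(row[f.name]) for f in fields(cls)})


sample = 'Type,Price,States\nCellphone,34,"[1, 2]"\nFlatTv,3.5,[2]\n'
items = [Item.from_row(row) for row in csv.DictReader(io.StringIO(sample))]
print(items)
```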

    A TypedDict was added to the typing module in Python 3.8 that can also be used to provide the typing information, but it must be used in a slightly different manner: it doesn't actually define a new type like NamedTuple and dataclasses do, so it requires a standalone transforming function:

    #!/usr/bin/env python3.8
    import ast
    import csv
    from typing import TypedDict
    
    
    def transform(dict_, typed_dict) -> dict:
        """ Convert values in given dictionary to corresponding types in TypedDict. """
        field_types = typed_dict.__annotations__
        return {name: field_types[name](value) for name, value in dict_.items()}
    
    
    class CSV_Record_Types(TypedDict):
        """ Define the fields and their types in a record.
            Field names must match column names in CSV file header.
        """
        IsActive: bool
        Type: str
        Price: float
        States: ast.literal_eval
    
    
    filename = 'test_transform.csv'
    
    with open(filename, newline='') as file:
        for i, row in enumerate(csv.DictReader(file), 1):
            row = transform(row, CSV_Record_Types)
            print(f'row {i}: {row}')
    
    
  • 2020-11-28 10:29

    I too really liked @martineau's approach and was especially intrigued by his comment that the essence of his code was a clean mapping between fields and types. That suggested to me that a dictionary would work also. Hence the variation on his theme shown below. It's worked nicely for me.

    Clearly the value field in the dictionary is really just a callable and thus could be used to provide a hook for data massaging as well as typecasting if one so chose.

    import ast
    import csv
    
    fix_type = {'IsActive': bool, 'Type': str, 'Price': float, 'States': ast.literal_eval}
    
    filename = 'test_transform.csv'
    
    with open(filename, newline='') as file:
        for i, row in enumerate(csv.DictReader(file)):
            row = {k: fix_type[k](v) for k, v in row.items()}
            print(f'row {i}: {row}')
    

    Output

    row 0: {'IsActive': True, 'Type': 'Cellphone', 'Price': 34.0, 'States': [1, 2]}
    row 1: {'IsActive': False, 'Type': 'FlatTv', 'Price': 3.5, 'States': [2]}
    row 2: {'IsActive': True, 'Type': 'Screen', 'Price': 100.23, 'States': [5, 1]}
    row 3: {'IsActive': True, 'Type': 'Notebook', 'Price': 50.0, 'States': [1]}
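
    To illustrate the point about using the callables as a hook for data massaging, here is a minimal sketch. The in-memory sample and the particular transformations (upper-casing, rounding, sorting) are assumptions chosen only to show the idea.

```python
import ast
import csv
import io

# Hypothetical sample data; note the stray space before 50.
sample = 'Type,Price,States\nScreen,100.23,"[5, 1]"\nNotebook, 50,[1]\n'

fix_type = {
    'Type': str.upper,                                # normalize case
    'Price': lambda s: round(float(s)),               # round to an int
    'States': lambda s: sorted(ast.literal_eval(s)),  # parse, then sort
}

rows = [{k: fix_type[k](v) for k, v in row.items()}
        for row in csv.DictReader(io.StringIO(sample))]
print(rows)
```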
    
  • 2020-11-28 10:31

    As the docs explain, the CSV reader doesn't perform automatic data conversion. You have the QUOTE_NONNUMERIC format option, but that would only convert all non-quoted fields into floats. This is a very similar behaviour to other csv readers.

    I don't believe Python's csv module would be of any help for this case at all. As others have already pointed out, literal_eval() is a far better choice.
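
    For reference, this is roughly what QUOTE_NONNUMERIC does on reading. The sample data is made up; note that every unquoted field becomes a float, including ones that look like integers.

```python
import csv
import io

# Hypothetical sample: strings are quoted, numbers are not.
sample = '"foo",1,2.5\n"bar",7,9.8\n'

reader = csv.reader(io.StringIO(sample), quoting=csv.QUOTE_NONNUMERIC)
rows = list(reader)
print(rows)
```

    An unquoted field that is not a number makes the reader raise ValueError, so this option only suits files where every string is quoted.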

    The following does work and converts:

    • strings (when quoted)
    • ints
    • floats
    • lists
    • dictionaries

    You may also use it for booleans and NoneType, although these have to be formatted accordingly for literal_eval() to accept them. LibreOffice Calc displays booleans in all capitals (TRUE, FALSE), whereas Python's booleans are capitalized (True, False). Also, you would have to replace empty strings with None (without quotes).
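
    A minimal sketch of that normalization step, assuming the file uses spreadsheet-style TRUE/FALSE and leaves missing values empty. The `REPLACEMENTS` table and the `to_python` helper are names introduced here for illustration.

```python
import ast

# Assumed spellings: spreadsheet-style TRUE/FALSE and empty cells.
REPLACEMENTS = {"TRUE": "True", "FALSE": "False", "": "None"}


def to_python(value):
    """Normalize a cell, then evaluate it as a Python literal."""
    value = value.strip()
    return ast.literal_eval(REPLACEMENTS.get(value, value))


cells = ["TRUE", "FALSE", "", "3", "[1, 2]", "'text'"]
print([to_python(c) for c in cells])
```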

    I'm writing an importer for mongodb that does all this. The following is part of the code I've written so far.

    [NOTE: My csv uses tab as field delimiter. You may want to add some exception handling too]

    import ast
    from itertools import islice
    
    
    def getFieldnames(csvFile):
        """
        Read the first row and store values in a tuple
        """
        with open(csvFile) as f:
            firstRow = f.readline()
            fieldnames = tuple(firstRow.strip('\n').split("\t"))
        return fieldnames
    
    def writeCursor(csvFile, fieldnames):
        """
        Convert csv rows into an array of dictionaries
        All data types are automatically checked and converted
        """
        cursor = []  # Placeholder for the dictionaries/documents
        with open(csvFile) as f:
            for row in islice(f, 1, None):  # Skip the header row
                values = row.strip('\n').split("\t")
                for i, value in enumerate(values):
                    # Raises ValueError or SyntaxError if the cell is not a
                    # valid Python literal (e.g. an unquoted string)
                    values[i] = ast.literal_eval(value)
                cursor.append(dict(zip(fieldnames, values)))
        return cursor
    