Data Type Recognition/Guessing of CSV data in python

你。 提交于 2019-12-03 05:04:37

问题


My problem is in the context of processing data from large CSV files.

I'm looking for the most efficient way to determine (that is, guess) the data type of a column based on the values found in that column. I'm potentially dealing with very messy data. Therefore, the algorithm should be error-tolerant to some extent.

Here's an example:

arr1 = ['0.83', '-0.26', '-', '0.23', '11.23']               # ==> recognize as float
arr2 = ['1', '11', '-1345.67', '0', '22']                    # ==> regognize as int
arr3 = ['2/7/1985', 'Jul 03 1985, 00:00:00', '', '4/3/2011'] # ==> recognize as date
arr4 = ['Dog', 'Cat', '0.13', 'Mouse']                       # ==> recognize as str

Bottom line: I'm looking for a python package or an algorithm that can detect either

  • the schema of a CSV file, or even better
  • the data type of an individual column as an array

Method for guessing type of data represented currently represented as strings goes in a similar direction. I'm worried about performance, though, since I'm possibly dealing with many large spreadsheets (where the data stems from)


回答1:


You may be interested in this python library which does exactly this kind of type guessing on CSVs and XLS files for you:

  • https://github.com/okfn/messytables
  • https://messytables.readthedocs.org/ - docs

It happily scales to very large files, to streaming data off the internet etc.

There is also an even simpler wrapper library that includes a command line tool named dataconverters: http://okfnlabs.org/dataconverters/ (and an online service: https://github.com/okfn/dataproxy!)

The core algorithm that does the type guessing is here: https://github.com/okfn/messytables/blob/7e4f12abef257a4d70a8020e0d024df6fbb02976/messytables/types.py#L164




回答2:


After putting some thought into it, this is how I would design the algorithm myself:

  • For performance reasons: take a sample for each column (say, 1%)
  • run a regex match for each cell in the sample, checking for the data type
  • Choose the appropriate data type for the column based on the frequency distribution

The two questions that arise:

  • What's a sufficient sample size? For small data sets? For large data sets?
  • What's a high enough threshold for selecting a data type based on the frequency distribution?



回答3:


You could try a pre parse using regex. For example:

import re
pattern = re.compile(r'^-?\d+.{1}\d+$')
data = '123.42'
print pattern.match(data) # ----> object
data2 = 'NOT123.42GONNA31.4HAPPEN'
print pattern.match(data2) # ----> None

This way you could do a dictionary of regex and try each of them until you find a match

myregex = {int: r'^-?\d+$', float: r'^\d+.{1}\d+$', ....}

for key, reg in myregex.items():
    to_del = []
    for index, data in enumerate(arr1):
        if re.match(reg,data):
            d = key(data) # You will need to insert data differently depending on function
            ....#---> do something 
            to_del.append(data) # ---> delete this when you can from arr1

Don't forget the '^' at the beggining and the '$' at the end, if not the regex could match part of the string and return an object.

Hope this helps :)




回答4:


I solved the same problem in c#. This is how I built the sample set:
For every column in the CSV, I selected the row with the longest value, as well as the row with the shortest value.
I then built an array with the 1st 50 non-empty rows.
So my samples had at least 0 and at most 50 rows which covered the whole range in a column.
After that, I would try to parse from widest definition to narrowest:

if (value is String) then thisType = String;

if (value is DateTime) then thisType is DateTime;

if (value is Decimal) then thisType is Decimal;

if (value is Integer) then thisType is Integer;

if (value is Boolean) then thisType is Boolean;

I use TryParse in C#, but I am sure other languages would have similar methods to use.




回答5:


Maybe csvsql could be useful here? No idea how efficient it is but definitely gets the job done for generating sql create table statements out of csvs.

$ csvsql so_many_columns.csv  >> sql_create_table_with_char_types.txt


来源:https://stackoverflow.com/questions/6824862/data-type-recognition-guessing-of-csv-data-in-python

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!