RuntimeError while reading tab-separated text file into Pandas dataframe

最后都变了 - submitted on 2020-01-05 03:05:51

Question


I am reading a tab-separated text file into a pandas dataframe and getting a RuntimeError. I have gone through the posts related to this error, and all of them point to the rule that one should not modify a dict while iterating over it. In my case, all I am doing is reading a file. How is this problem connected to iterating over and changing dicts?

>>> import pandas as pd
>>> df=pd.read_csv("dummy_data.txt",header=None,chunksize=10000,error_bad_lines=False,warn_bad_lines=True,engine='c',sep="\t",encoding="latin-1")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
    df=pd.read_csv("dummy_data.txt",header=None,chunksize=10000,error_bad_lines=False,warn_bad_lines=True,engine='c',sep="\t",encoding="latin-1")
  File "/home/avadhut/.virtualenvs/avadhut_virtual/lib/python3.5/site-packages/pandas/io/parsers.py", line 709, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/avadhut/.virtualenvs/avadhut_virtual/lib/python3.5/site-packages/pandas/io/parsers.py", line 431, in _read
    compression = _infer_compression(filepath_or_buffer, compression)
  File "/home/avadhut/.virtualenvs/avadhut_virtual/lib/python3.5/site-packages/pandas/io/common.py", line 270, in _infer_compression
    filepath_or_buffer = _stringify_path(filepath_or_buffer)
  File "/home/avadhut/.virtualenvs/avadhut_virtual/lib/python3.5/site-packages/pandas/io/common.py", line 157, in _stringify_path
    from py.path import local as LocalPath
  File "/home/avadhut/.virtualenvs/avadhut_virtual/lib/python3.5/site-packages/py/__init__.py", line 148, in <module>
    'Syslog'             : '._log.log:Syslog',
  File "/home/avadhut/.virtualenvs/avadhut_virtual/lib/python3.5/site-packages/py/_vendored_packages/apipkg.py", line 63, in initpkg
    for module in sys.modules.values():
RuntimeError: dictionary changed size during iteration

Edit 1: When reading the file in interactive mode, I get the same error on the first two attempts. On the third attempt, the same line runs without any error. What could be the reason for such unstable behavior?

>>> df=pd.read_csv("product_name.txt",header=None,chunksize=10000,error_bad_lines=False,warn_bad_lines=True,engine='c',sep="\t",encoding="latin-1")

Edit 2: To replicate the error, here is a link to a 1000-row dataset: S3 link to the dataset

Edit 3: I found a post with a similar issue, "Pandas CSV file with occasional extra column", but the flag mentioned in it (error_bad_lines) doesn't seem to work in my case.

>>> df = pd.read_csv("unclean.csv", error_bad_lines=False, header=None)

Edit 4: I have written a script that loads the dummy data (mentioned in Edit 2) into a pandas dataframe and then saves it to an HDF5 file. I ran this script 20 times and did not encounter a RuntimeError even once. Reading the file in interactive mode, on the other hand, raises a RuntimeError and behaves inconsistently. What could be the reason for the different behaviour between a Python script and interactive mode? I am using pandas==0.22.0, Python 3.5.2, and tables==3.4.4.

import pandas as pd
import tables

df=pd.read_csv("dummy.txt",header=None,error_bad_lines=False,warn_bad_lines=False,engine='c',sep="\t",encoding="latin-1",names=["product_name_id","current_product_name_id","product_n","active_f","create_d","create_user_n","change_d","change_user_n","ft_timestamp"])

df.to_hdf(path_or_buf="/home/avadhut/data_files/dummy_data.h5",key="dummy",mode="a",format="table")

df=pd.read_hdf("/home/avadhut/data_files/dummy_data.h5",key="dummy")
print(df.head(100))

Answer 1:


Run your code in the default Python interpreter and see if the error persists. It is probably a bug in bpython, as I am not able to reproduce the issue in the default Python interpreter.
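
One quick way to tell which situation you are in is to check whether bpython has been imported into the current process. This is the same membership test that appears in the apipkg.py code from the traceback. The sketch below is only an illustration; the function name running_under_bpython is hypothetical and not part of any library.

```python
import sys


def running_under_bpython(modules=None):
    """Return True if a module named 'bpython' is loaded.

    `modules` defaults to sys.modules; the parameter exists only so the
    check can be exercised with a plain dict (hypothetical helper).
    """
    modules = sys.modules if modules is None else modules
    return "bpython" in modules


if __name__ == "__main__":
    print("bpython loaded:", running_under_bpython())
```

If this prints True in the session where the error occurs and False when the script is run with the plain python executable, that is consistent with the shell, rather than pandas, being the trigger.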




Answer 2:


The issue is with your data: the file contains an inconsistent number of tabs per line. After cleaning the data I was able to load the file into Pandas. You need to clean the data and make sure every row has the same number of columns before loading.
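
To see whether a file really has rows with differing numbers of tabs, the fields per line can be counted before handing the file to pandas. The following is a minimal sketch using only the standard library; the helper name field_count_report and the inline sample data are made up for illustration, and the naive split does not account for quoted fields that contain tabs.

```python
import io
from collections import Counter


def field_count_report(lines, sep="\t"):
    """Count how many lines have each number of `sep`-separated fields.

    Returns (counts, first_seen): counts maps field count -> number of
    lines, first_seen maps field count -> first 1-based line number.
    """
    counts = Counter()
    first_seen = {}
    for lineno, line in enumerate(lines, start=1):
        n_fields = len(line.rstrip("\r\n").split(sep))
        counts[n_fields] += 1
        first_seen.setdefault(n_fields, lineno)
    return counts, first_seen


# Hypothetical three-line sample; the last row has one extra tab.
sample = io.StringIO("1\t2\tfoo\n3\t4\tbar\n5\t6\tbaz\textra\n")
counts, first_seen = field_count_report(sample)
print(dict(counts), first_seen)
```

For a real file, the same function can be passed an open file handle, for example open("dummy_data.txt", encoding="latin-1"). More than one key in counts means the rows are not uniform, and first_seen shows where to start looking.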




Answer 3:


I had the same issue. What worked for me was simply to comment out the offending lines in the file indicated by the error: "/home/avadhut/.virtualenvs/avadhut_virtual/lib/python3.5/site-packages/py/_vendored_packages/apipkg.py", line 63.

Comment out all of the following lines:

    # eagerload in bypthon to avoid their monkeypatching breaking packages
    if 'bpython' in sys.modules or eager:
        for module in sys.modules.values():
            if isinstance(module, ApiModule):
                module.__dict__

Unfortunately, I have no idea what these lines are supposed to achieve, so this dirty fix might cause other problems later. Does anyone know?
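
On what the lines do: going by their own comment, they eagerly load lazily defined modules when bpython is present. The loop walks the live sys.modules.values() view, so if anything adds an entry to sys.modules while the loop is running (an assumption here: for instance an import triggered during the eager load, or by another thread in the interactive shell), Python raises exactly the RuntimeError shown in the traceback. A less invasive change than commenting the block out may be to iterate over a snapshot, i.e. list(sys.modules.values()). The sketch below reproduces the mechanism with a plain dict; the helper names are hypothetical and this is not the library's code.

```python
def iterate_live(mapping, mutate):
    """Iterate over the live values() view while `mutate` adds a key."""
    try:
        for _ in mapping.values():
            mutate(mapping)
    except RuntimeError as exc:
        return str(exc)
    return None


def iterate_snapshot(mapping, mutate):
    """Iterate over a list() copy, so insertions cannot disturb the loop."""
    try:
        for _ in list(mapping.values()):
            mutate(mapping)
    except RuntimeError as exc:
        return str(exc)
    return None


def add_key(mapping):
    # Hypothetical stand-in for an import adding an entry to sys.modules.
    mapping[len(mapping)] = "new"


live_error = iterate_live({"a": 1, "b": 2}, add_key)
snapshot_error = iterate_snapshot({"a": 1, "b": 2}, add_key)
print("live view:", live_error)
print("snapshot:", snapshot_error)
```

The first call reports the error because the dict grows mid-iteration; the second completes because the loop only ever sees the copied list. This would also fit the intermittent behaviour described in the question, since the error only appears when something happens to be imported during the loop.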



Source: https://stackoverflow.com/questions/51635417/runtimeerror-while-reading-tab-separated-text-file-into-pandas-dataframe
