write.csv for large data.table

生来不讨喜 2020-11-30 21:23

I have a data.table that is not very big (2 GB) but for some reason write.csv takes an extremely long time to write it out (I've never actually fi

1 Answer
  • 2020-11-30 22:06

    UPDATE 2019.01.07:

    fwrite has been on CRAN since 2016-11-25.

    install.packages("data.table")
    

    UPDATE 08.04.2016:

    fwrite has been recently added to the data.table package's development version. It also runs in parallel (implicitly).

    # Install development version of data.table
    install.packages("data.table", 
                      repos = "https://Rdatatable.github.io/data.table", type = "source")
    
    # Load package
    library(data.table)
    
    # Load data        
    data(USArrests)
    
    # Write CSV
    fwrite(USArrests, "USArrests_fwrite.csv")
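
    The implicit parallelism can be inspected and capped if needed. A minimal sketch, assuming a recent CRAN release of data.table that provides getDTthreads(), setDTthreads() and the nThread argument of fwrite; the output path is made up for illustration:

    ```r
    library(data.table)

    # Number of threads data.table currently uses for its parallel routines
    getDTthreads()

    # Write with an explicit thread count instead of the session default
    out_file <- file.path(tempdir(), "USArrests_two_threads.csv")  # hypothetical path
    fwrite(USArrests, out_file, nThread = 2)

    # Alternatively, change the default for the whole session;
    # setDTthreads() invisibly returns the previous setting so it can be restored
    old_threads <- setDTthreads(2)
    fwrite(USArrests, out_file)
    setDTthreads(old_threads)
    ```

    Limiting the thread count is mostly useful on shared machines; on a workstation the default is usually fine.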
    

    According to the detailed benchmarks in the question "speeding up the performance of write.table", fwrite is ~17x faster than write.csv there (YMMV).
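
    A rough way to check the speed-up on your own machine is to time both writers on the same synthetic table. This is only a sketch: the table size, column names and file paths are arbitrary choices for illustration, and the ratio you get will depend on hardware, column types and thread count.

    ```r
    library(data.table)

    # Synthetic data: 1e6 rows with mixed column types (size chosen arbitrarily)
    set.seed(1)
    n  <- 1e6
    dt <- data.table(
      id    = seq_len(n),
      value = rnorm(n),
      group = sample(letters, n, replace = TRUE)
    )

    csv_base <- tempfile(fileext = ".csv")
    csv_dt   <- tempfile(fileext = ".csv")

    # Elapsed wall-clock seconds for each writer
    t_base <- system.time(write.csv(dt, csv_base, row.names = FALSE))[["elapsed"]]
    t_dt   <- system.time(fwrite(dt, csv_dt))[["elapsed"]]

    cat(sprintf("write.csv: %.2fs, fwrite: %.2fs\n", t_base, t_dt))

    # Both files should read back to the same number of rows
    stopifnot(nrow(fread(csv_base)) == nrow(fread(csv_dt)))
    ```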


    UPDATE 15.12.2015:

    In the future there might be an fwrite function in the data.table package, see: https://github.com/Rdatatable/data.table/issues/580. That thread links to a gist with a prototype of such a function, which speeds up writing by a factor of 2 according to its author (https://gist.github.com/oseiskar/15c4a3fd9b6ec5856c89).

    ORIGINAL ANSWER:

    I had the same problem (trying to write even larger CSV files) and finally decided against using CSV files.

    I would recommend using SQLite, as it is much faster than dealing with CSV files:

    library(DBI)
    # Set up database
    con <- dbConnect(RSQLite::SQLite(), dbname = "test.db")
    # Load example data
    data(USArrests)
    # Write data "USArrests" to table "arrests" in database "test.db";
    # row.names = TRUE keeps the state names as a "row_names" column
    dbWriteTable(con, "arrests", USArrests, row.names = TRUE)
    
    # Test if the data was correctly stored in the database, i.e. 
    # run an exemplary query on the newly created database 
    dbGetQuery(con, "SELECT * FROM arrests WHERE Murder > 10")       
    #         row_names Murder Assault UrbanPop Rape
    # 1         Alabama   13.2     236       58 21.2
    # 2         Florida   15.4     335       80 31.9
    # 3         Georgia   17.4     211       60 25.8
    # 4        Illinois   10.4     249       83 24.0
    # 5       Louisiana   15.4     249       66 22.2
    # 6        Maryland   11.3     300       67 27.8
    # 7        Michigan   12.1     255       74 35.1
    # 8     Mississippi   16.1     259       44 17.1
    # 9          Nevada   12.2     252       81 46.0
    # 10     New Mexico   11.4     285       70 32.1
    # 11       New York   11.1     254       86 26.1
    # 12 North Carolina   13.0     337       45 16.1
    # 13 South Carolina   14.4     279       48 22.5
    # 14      Tennessee   13.2     188       59 26.9
    # 15          Texas   12.7     201       80 25.5
    
    # Close the connection to the database
    dbDisconnect(con)
    

    For further information, see http://cran.r-project.org/web/packages/RSQLite/RSQLite.pdf

    You can also use a tool such as http://sqliteadmin.orbmu2k.de/ to access the database and export it to CSV etc.
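
    The export can also be done from R itself, without a separate GUI tool. A minimal sketch, assuming the DBI, RSQLite and data.table packages are installed; it rebuilds the arrests table in a temporary database file so that it runs on its own, and all file names are placeholders:

    ```r
    library(DBI)
    library(data.table)

    # Recreate a small database so the example is self-contained
    # (a temporary file stands in for "test.db")
    db_file <- tempfile(fileext = ".db")
    con <- dbConnect(RSQLite::SQLite(), dbname = db_file)
    dbWriteTable(con, "arrests", USArrests, row.names = TRUE)

    # Pull the table (or any query result) back into R ...
    arrests <- dbGetQuery(con, "SELECT * FROM arrests")
    dbDisconnect(con)

    # ... and write it out as CSV
    csv_file <- tempfile(fileext = ".csv")
    fwrite(arrests, csv_file)
    ```

    Replacing the SELECT with a narrower query exports only the rows and columns that are actually needed.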
