Import MySQL dump into R (without requiring MySQL server)

迷失自我 2020-12-11 23:35

Packages like RMySQL and sqldf allow one to interface with local or remote database servers. I'm creating a portable project which involves import

3 Answers
  •  生来不讨喜
    2020-12-11 23:44

    Depending on what you want to extract from the tables, you can start by inspecting the dump file directly:

    numLines <- R.utils::countLines("sportsdb_sample_mysql_20080303.sql")
    # [1] 81266
    
    linesInDB <- readLines("sportsdb_sample_mysql_20080303.sql",n=60)
    

    You can then use regular expressions to get the table names (after CREATE TABLE), the column names (between the first pair of brackets) and the VALUES (the lines after CREATE TABLE, between the second pair of brackets).
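
    As a rough illustration of that regex approach, here is a minimal base-R sketch. The miniature dump in `dumpLines` is made up for the example, and the patterns assume backquoted identifiers, a single table, and no parentheses or escaped quotes inside string values, so a real dump may need more careful parsing.

    ```r
    # Hypothetical miniature dump (assumption: a real mysqldump has this overall shape).
    dumpLines <- c(
      "CREATE TABLE `teams` (",
      "  `id` int(11) NOT NULL,",
      "  `name` varchar(100) DEFAULT NULL,",
      "  PRIMARY KEY (`id`)",
      ") ENGINE=InnoDB;",
      "INSERT INTO `teams` VALUES (1,'Red Sox'),(2,'Yankees');"
    )

    # Table names: the backquoted identifier that follows CREATE TABLE.
    createLines <- grep("^CREATE TABLE", dumpLines, value = TRUE)
    tableNames <- sub("^CREATE TABLE `([^`]+)`.*$", "\\1", createLines)

    # Column names: definition lines that start with a backquoted identifier.
    colLines <- grep("^[[:space:]]*`", dumpLines, value = TRUE)
    colNames <- sub("^[[:space:]]*`([^`]+)`.*$", "\\1", colLines)

    # Row tuples: each parenthesised group after VALUES.
    insertLine <- grep("^INSERT INTO", dumpLines, value = TRUE)[1]
    valuesPart <- sub("^INSERT INTO `[^`]+` VALUES ", "", insertLine)
    tuples <- regmatches(valuesPart, gregexpr("\\([^()]*\\)", valuesPart))[[1]]

    # Strip the outer brackets and let read.csv split the fields.
    inner <- gsub("^\\(|\\)$", "", tuples)
    teams <- read.csv(text = paste(inner, collapse = "\n"), header = FALSE,
                      quote = "'", col.names = colNames,
                      stringsAsFactors = FALSE)
    ```

    Here `teams` ends up as a two-row data frame with columns `id` and `name`. With several tables in one dump, the column-name step would have to be applied per CREATE TABLE block rather than to the whole file.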

    Reference: Reverse engineering a mysqldump output with MySQL Workbench gives "statement starting from pointed line contains non UTF8 characters" error


    EDIT: In response to the OP's answer: if I interpret the Python script correctly, it also reads the file line by line, filters for INSERT INTO lines, parses them as CSV, then writes them to a file. This is very similar to my original suggestion. My version in R is below. If the file is too large to read at once, it would be better to read it in chunks using some other R package.

    options(stringsAsFactors=FALSE)  # only needed on R < 4.0.0, where the default was TRUE
    library(stringi)
    library(plyr)
    
    mysqldumpfile <- "sportsdb_sample_mysql_20080303.sql"
    
    allLines <- readLines(mysqldumpfile)
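    # keep only the INSERT INTO statements
    insertLines <- allLines[stri_detect_fixed(allLines, "INSERT INTO")]
    # split each statement into words; simplify=NA pads shorter rows with NA
    allwords <- data.frame(stri_extract_all_words(insertLines, simplify=NA))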
    d_ply(allwords, .(X3), function(x) {
        #x <- split(allwords, allwords$X3)[["baseball_offensive_stats"]]
        print(x[1,3])
    
        #find where the header/data columns start and end
        valuesCol <- which(x[1,]=="VALUES")
        lastCols <- which(apply(x, 2, function(y) all(is.na(y))))
        datLastCol <- head(c(lastCols, ncol(x)+1), 1) - 1
    
        #format and prepare for write to file
        df <- data.frame(x[,(valuesCol+1):datLastCol, drop=FALSE])
        df <- setNames(df, unlist(x[1,4:(valuesCol-1)]))
        #type convert column by column before writing, otherwise it's all strings
        #(apply() would coerce the result back to a character matrix)
        df[] <- lapply(df, type.convert, as.is=TRUE)
        #write one CSV file per table
        write.csv(df, paste0(x[1,3],".csv"), row.names=FALSE)
    })
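
    For files too large to load with a single `readLines()` call, one option is to read from an open connection a fixed number of lines at a time. The sketch below uses only base R rather than another package; `readDumpInChunks`, `processInsertLines` and `chunkSize` are hypothetical names introduced for illustration, and the callback stands in for whatever per-chunk parsing you need.

    ```r
    # Sketch: scan a dump in fixed-size chunks instead of loading it whole.
    # `processInsertLines` is a hypothetical callback, not part of any package.
    readDumpInChunks <- function(path, processInsertLines, chunkSize = 10000L) {
      con <- file(path, open = "r")
      on.exit(close(con))          # always release the connection
      total <- 0L
      repeat {
        # readLines() on an open connection continues where the last call stopped
        chunk <- readLines(con, n = chunkSize, warn = FALSE)
        if (length(chunk) == 0L) break
        insertLines <- chunk[grepl("INSERT INTO", chunk, fixed = TRUE)]
        if (length(insertLines) > 0L) processInsertLines(insertLines)
        total <- total + length(insertLines)
      }
      invisible(total)             # number of INSERT INTO lines seen
    }
    ```

    It could be called as `readDumpInChunks("sportsdb_sample_mysql_20080303.sql", function(lines) cat(length(lines), "insert lines\n"))`. Since a chunk may contain rows for several tables, the callback would need to append to the per-table CSV files rather than overwrite them.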
    
