Generate data by using existing dataset as the base dataset

栀梦 2021-01-16 12:16

I have a dataset of 100k unique records. To benchmark my code, I need to test it on 5 million unique records, and I don't want to generate random data.

2 answers
  •  日久生厌
    2021-01-16 13:04

    You can easily generate normally distributed data in R, based on the mean and standard deviation of your existing data. Follow these steps:

    #Read the data into a dataframe
    library(data.table)
    data = fread("data.csv", sep=",", select = c("latitude", "longitude"))
    
    #Remove duplicate and missing (NA) values
    df = data.frame(Lat = data$latitude, Lon = data$longitude)
    df1 = unique(df)
    df2 = na.omit(df1)
    
    #Determine the mean and standard deviation of latitude and longitude values
    meanLat = mean(df2$Lat)
    meanLon = mean(df2$Lon)
    sdLat = sd(df2$Lat)
    sdLon = sd(df2$Lon)
    
    #Use a normal distribution to generate 1 million new records
    #(increase nNew if you need more, e.g. to reach 5 million in total)
    nNew = 1000000
    newData = data.frame(
      Lat = rnorm(nNew, mean = meanLat, sd = sdLat),
      Lon = rnorm(nNew, mean = meanLon, sd = sdLon)
    )
    
    #Combine the old and new records, dropping any accidental duplicates
    finalData = unique(rbind(df2, newData))
    
    finalData now contains both the original records and the newly generated ones.
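Since the question asks for a specific number of unique records, one possible way to guarantee both the count and the uniqueness is to keep sampling until the target is reached. The sketch below is only an illustration: `grow_unique` is a hypothetical helper that is not part of the answer, the sample coordinates are made up, and it assumes the base data has columns `Lat` and `Lon` with some variation in them.

```r
# Hypothetical helper (not from the answer): grow `base` to exactly `n_total`
# unique (Lat, Lon) rows by sampling from normal distributions fitted to it.
grow_unique <- function(base, n_total) {
  mean_lat <- mean(base$Lat); sd_lat <- sd(base$Lat)
  mean_lon <- mean(base$Lon); sd_lon <- sd(base$Lon)
  # Need at least two rows for sd(), some spread, and a reachable target
  stopifnot(nrow(base) >= 2, sd_lat > 0 || sd_lon > 0, n_total >= nrow(base))

  out <- unique(base)
  while (nrow(out) < n_total) {
    n_missing <- n_total - nrow(out)
    candidates <- data.frame(
      Lat = rnorm(n_missing, mean = mean_lat, sd = sd_lat),
      Lon = rnorm(n_missing, mean = mean_lon, sd = sd_lon)
    )
    # Drop rows that collide with existing ones; loop again if still short
    out <- unique(rbind(out, candidates))
  }
  rownames(out) <- NULL
  out
}

set.seed(42)  # only to make the example reproducible
# Made-up sample coordinates standing in for the real dataset
base <- data.frame(Lat = c(12.97, 13.01, 12.95, 13.05),
                   Lon = c(77.59, 77.61, 77.55, 77.65))
grown <- grow_unique(base, 1000)
nrow(grown)                # 1000
anyDuplicated(grown) == 0  # TRUE
```

With continuous values, collisions are rare, so the loop normally finishes in one pass. For several million rows, `unique()` on a plain data frame may be slow, and converting to a `data.table` first could help.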
    

    Write the finalData data frame to a CSV file, and you can then read it from Scala or Python.
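The export step might look like the following minimal sketch. The small `finalData` here is a made-up stand-in for the real result, and `out_path` is a hypothetical temporary path used only so the example runs on its own; in practice you would pass your own file name.

```r
# Small stand-in for finalData; the values are made up for illustration
finalData <- data.frame(Lat = c(12.97, 13.01, 12.95),
                        Lon = c(77.59, 77.61, 77.55))

# Hypothetical output path; a temporary file keeps the example self-contained
out_path <- tempfile(fileext = ".csv")

# row.names = FALSE keeps R's row index out of the file
write.csv(finalData, out_path, row.names = FALSE)

# Read the file back to confirm the header and row count survived
check <- read.csv(out_path)
stopifnot(nrow(check) == nrow(finalData),
          identical(names(check), names(finalData)))
```

The resulting file has a header row (`Lat`, `Lon`), so readers such as `pandas.read_csv` in Python should pick up the column names. For millions of rows, `fwrite` from the data.table package already loaded above is likely to be noticeably faster than `write.csv`.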
