large-data

Instant access to a line in a large file without loading the file

丶灬走出姿态 submitted on 2019-12-11 15:35:01
Question: In one of my recent projects I need to perform this simple task, but I'm not sure of the most efficient way to do so. I have several large text files (>5 GB) and I need to continuously extract random lines from them. The requirements are: I can't load the files into memory, I need to do this very efficiently (>>1000 lines a second), and preferably with as little pre-processing as possible. The files consist of many short lines (~20 million lines). The "raw" files have
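
One common approach to this kind of problem (an assumption here, not something stated in the excerpt) is a single pre-processing pass that records the byte offset of every line; after that, each random line costs one seek() and one readline(). A minimal Python sketch, with the file name made up:

    import random

    def build_line_index(path):
        # One pass over the file: remember where each line starts.
        offsets = []
        with open(path, "rb") as f:
            pos = 0
            for line in f:
                offsets.append(pos)
                pos += len(line)
        return offsets

    def random_lines(path, offsets, n):
        # Keep the handle open so repeated seeks stay fast (>>1000 lines/s).
        with open(path, "rb") as f:
            for _ in range(n):
                f.seek(random.choice(offsets))
                yield f.readline().decode("utf-8", errors="replace")

    # offsets = build_line_index("big.txt")      # ~20M integers fits easily in RAM
    # for line in random_lines("big.txt", offsets, 5):
    #     print(line, end="")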

String tokenization in Java (LARGE text)

删除回忆录丶 submitted on 2019-12-11 12:06:47
Question: I have this large text (read: LARGE). I need to tokenize every word, delimiting on every non-letter. I used StringTokenizer to read one word at a time. However, while researching how to write the delimiter string ("every non-letter") instead of doing something like new StringTokenizer(text, "\" ();,.'[]{}!?:”“…\n\r0123456789 [etc etc]"); I found that everyone basically hates StringTokenizer (why?). So, what can I use instead? Don't suggest String.split, as it will duplicate my large text. I
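
The question is about Java, but the idea most solutions to this kind of problem converge on is language-agnostic: iterate over matches of a "letters only" pattern instead of splitting, so the large text is never copied. A rough Python sketch of that streaming style (the sample text is made up):

    import re

    # Match runs of letters; everything else acts as a delimiter.
    WORD = re.compile(r"[^\W\d_]+")

    def tokens(text):
        # Yields one word at a time; no list of all tokens, no copy of `text`.
        for match in WORD.finditer(text):
            yield match.group(0)

    sample = "Hello, world… it's 2019: tokenize on every non-letter!"
    print(list(tokens(sample)))   # ['Hello', 'world', 'it', 's', 'tokenize', 'on', 'every', 'non', 'letter']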

Python 3.6: Deal with MemoryError

独自空忆成欢 submitted on 2019-12-11 08:13:58
Question: I have written a piece of software for a machine-learning task. To do this, I need to load a lot of data into the program's RAM (for the required 'fit' function). In practice, in the run in question, the 'load_Data' function should return 2 'ndarrays' (from the 'numpy' library) of approximately 12,000 by 110,000 elements of type float64. I get a MemoryError during the run. I tested the program on a smaller dataset (a 2,000 by 110,000 array) and it does work properly. There are 2 solutions I have
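
Two standard ways to get under a MemoryError with arrays of this shape (assumptions, since the actual load_Data code is not shown) are to drop to float32, which halves the footprint, and/or to back the array with a memory-mapped file so the OS pages it from disk. A minimal numpy sketch with a made-up file name:

    import numpy as np

    rows, cols = 12_000, 110_000          # shape mentioned in the question

    # float64 needs rows * cols * 8 bytes (~10.5 GB); float32 halves that.
    # Backing the array with a file keeps it out of RAM entirely:
    X = np.memmap("features.dat", dtype=np.float32, mode="w+", shape=(rows, cols))

    X[0, :] = 1.0      # reads and writes behave like a normal ndarray
    X.flush()          # push dirty pages to disk

    print(X.shape, X.dtype)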

Cannot load large image with OpenCV

大憨熊 submitted on 2019-12-11 08:08:06
Question: I am trying to load an image with OpenCV that is 100,000 × 15,000 pixels with a file size of 4,305 kB. If I load this image with the following code: cv::Mat reallyBigImage = cv::imread("reallyBigImage.png"); I get the following error: I have a PC with 32 GB of RAM and compile this program as 64-bit. This should be enough for an image of this size. Is it possible to load such a large image as a whole into OpenCV? Here is where my program breaks in the assembler view. The arrow indicates the specific
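
One possibility (an assumption, since the actual error text is not shown) is that this is not a RAM problem at all: 100,000 × 15,000 is about 1.5 billion pixels, which exceeds the pixel-count limit OpenCV's image codecs enforce by default. The question uses the C++ API; as an illustration with the Python bindings, that limit can be raised through an environment variable set before cv2 is loaded:

    import os

    # Raise OpenCV's imread pixel-count limit; set it before cv2 is imported to be safe.
    os.environ["OPENCV_IO_MAX_IMAGE_PIXELS"] = str(100_000 * 15_000)

    import cv2

    really_big_image = cv2.imread("reallyBigImage.png", cv2.IMREAD_UNCHANGED)
    if really_big_image is None:
        raise RuntimeError("imread still failed; that would point to a real memory problem")
    print(really_big_image.shape, really_big_image.dtype)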

Python large files, how to find specific lines with a particular string

让人想犯罪 __ submitted on 2019-12-11 04:30:45
Question: I am using Python to process data from very large text files (~52 GB, 800 million lines, each with 30 columns of data). I am trying to find an efficient way to locate specific lines. Luckily the string is always in the first column. The whole thing works, memory is not a problem (I'm not loading the file, just opening and closing it as needed), and I run it on a cluster anyway. It's more about speed: the script takes days to run! The data looks something like this: scaffold126 1 C 0:0:20:0:0:0 0:0
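
Since the lookup key is always the first column, one workable pattern (an assumption about the approach; the file name below is made up) is a one-time pass that stores key → byte offset in a small SQLite database, so every later lookup is an indexed query plus a seek instead of a full scan. A Python sketch:

    import sqlite3

    def build_index(data_path, index_path="offsets.sqlite"):
        # One-time pass: map the first column of each line to its byte offset.
        con = sqlite3.connect(index_path)
        con.execute("CREATE TABLE IF NOT EXISTS idx (key TEXT, offset INTEGER)")
        batch, pos = [], 0
        with open(data_path, "rb") as f:
            for line in f:
                key = line.split(None, 1)[0].decode()   # assumes whitespace-separated columns
                batch.append((key, pos))
                pos += len(line)
                if len(batch) >= 100_000:
                    con.executemany("INSERT INTO idx VALUES (?, ?)", batch)
                    batch.clear()
        con.executemany("INSERT INTO idx VALUES (?, ?)", batch)
        con.execute("CREATE INDEX IF NOT EXISTS by_key ON idx (key)")
        con.commit()
        con.close()

    def lines_for(key, data_path, index_path="offsets.sqlite"):
        # Jump straight to every line whose first column equals `key`.
        con = sqlite3.connect(index_path)
        offsets = [row[0] for row in con.execute("SELECT offset FROM idx WHERE key = ?", (key,))]
        con.close()
        with open(data_path, "rb") as f:
            for off in offsets:
                f.seek(off)
                yield f.readline().decode()

    # build_index("huge.txt")                      # slow once, fast for every later query
    # for line in lines_for("scaffold126", "huge.txt"):
    #     print(line, end="")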

Merging and appending ffdf dataframes

拟墨画扇 submitted on 2019-12-11 04:16:17
Question: I am trying to create an ffdf data frame by merging and appending two existing ffdf data frames. The ffdfs have different numbers of columns and different numbers of rows. I know that merge() performs only inner and left outer joins, while ffdfappend() will not allow appending if the columns are not identical. I am wondering if anyone has a workaround for this, either a function like smartbind() in the gtools package or any other approach. Of course converting back with as.data.frame() and

Error: protect(): protection stack overflow while feature extraction

别说谁变了你拦得住时间么 submitted on 2019-12-11 03:59:30
Question: I have a data frame with 4755 rows and 27199 columns. It is actually a document-term matrix, and I'm trying to perform feature selection using the "FSelector" package. Here is some of the code: library(FSelector) weights <- information.gain(Flag~., dtmmatdf) Each time I run this I get the error Error: protect(): protection stack overflow I have 24 GB of RAM and the data frame is about 500 MB in size. So I don't know what the problem is or how to fix it. Source: https://stackoverflow.com
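
The error itself is R-specific, but the underlying task, scoring ~27,000 term columns by information gain against a class label, can also be sketched in Python with scikit-learn's mutual-information estimator. This is offered as a swapped-in alternative technique rather than a fix for the protect() error, and the data below is random stand-in data:

    import numpy as np
    from scipy.sparse import random as sparse_random
    from sklearn.feature_selection import mutual_info_classif

    # Random stand-in for the 4755 x 27199 document-term matrix.
    X = sparse_random(4755, 27199, density=0.001, format="csr", random_state=0)
    y = np.random.default_rng(0).integers(0, 2, size=4755)   # stand-in for the Flag column

    # Mutual information between each term and the label; keeping X sparse
    # avoids materialising a dense 4755 x 27199 matrix in memory.
    scores = mutual_info_classif(X, y, discrete_features=True)
    top_terms = np.argsort(scores)[::-1][:20]                 # indices of the 20 strongest terms
    print(top_terms)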

Betweenness centrality for relatively large scale data

核能气质少年 submitted on 2019-12-11 02:27:14
Question: Using R, I am trying to calculate betweenness centrality for about 1 million nodes and more than 20 million edges. To do so I have a pretty decent machine with 128 GB of RAM, a 4 × 2.40 GHz CPU, and 64-bit Windows. Yet using betweenness() from igraph takes ages. I am wondering whether there is any quick solution. Would it be faster if I used Gephi? Source: https://stackoverflow.com/questions/21718078/betweenness-centrality-for-relatively-large-scale-data
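
The usual speed-up for betweenness at this scale is to approximate it, for example by only counting shortest paths up to a length cutoff. The question uses R's igraph; the sketch below uses the Python igraph bindings on a small random graph as a stand-in, and the cutoff value is an arbitrary assumption:

    import igraph as ig

    # Small random stand-in; the real graph (~1M nodes, ~20M edges) would be read from an edge list.
    g = ig.Graph.Erdos_Renyi(n=100_000, m=500_000)

    # Exact betweenness considers all shortest paths and scales roughly with V*E.
    # Restricting paths to length <= cutoff trades accuracy for a large speed-up.
    approx_btw = g.betweenness(cutoff=4)

    print(max(approx_btw))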

How to plot large time series (thousands of administration times/doses of a medication)?

♀尐吖头ヾ submitted on 2019-12-11 02:25:45
Question: I'm trying to plot how a single drug has been prescribed in the hospital. In this dummy database I have 1000 patient encounters after 2017/01/01. The goal of the plot is to see the pattern of administration of this drug: is it given more frequently / at a higher dose closer to the time of admission, closer to discharge, or in the middle of the patient's stay? #Get_random_dates that we will use multiple times gen_random_dates <- function(N, st, et) { st <- as.POSIXct(as.Date(st)) et <- as.POSIXct(as.Date(et)) dt <- as
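
The question's code is R; as a language-agnostic illustration of the underlying idea (express every administration time as hours since that encounter's admission, then look at the distribution), here is a small pandas/matplotlib sketch in which the frame and its column names are made up:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Hypothetical frame: one row per administration, with the encounter's
    # admission time already joined on.
    df = pd.DataFrame({
        "admin_time":     pd.to_datetime(["2017-01-02 08:00", "2017-01-02 20:00", "2017-01-05 09:30"]),
        "admission_time": pd.to_datetime(["2017-01-01 12:00", "2017-01-01 12:00", "2017-01-04 22:00"]),
        "dose":           [5, 10, 5],
    })

    # How long after admission was each dose given?
    df["hours_from_admission"] = (df["admin_time"] - df["admission_time"]).dt.total_seconds() / 3600

    # The shape of this histogram shows whether doses cluster early or late in the stay.
    df["hours_from_admission"].plot.hist(bins=48)
    plt.xlabel("Hours since admission")
    plt.ylabel("Number of administrations")
    plt.show()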

Speed up inserting large datasets from txt files into MySQL using Python

落花浮王杯 submitted on 2019-12-11 01:38:14
Question: Background: I have 500 formatted *.txt files that I need to insert into a MySQL database. Currently I have a Python script that reads the files line by line and inserts the rows into the MySQL database. Problem: the files are quite big (~100 MB per txt file); I tested the script and it takes too long to insert even one file into the database. How can I speed up the process by modifying the script? Code: for file in os.listdir(INPUTFILEPATH): ## index += 1 ## print "processing %s out of %s files " % (index,
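
The script above is cut off, but the two usual fixes for this pattern are to batch rows with executemany() instead of issuing one INSERT per line, or to hand each file to the server with LOAD DATA LOCAL INFILE. A sketch of both, where the connection details, table name, and tab-separated three-column layout are all assumptions:

    import mysql.connector   # assumes the mysql-connector-python driver

    # LOAD DATA LOCAL also needs local_infile enabled on the server side.
    con = mysql.connector.connect(host="localhost", user="user", password="secret",
                                  database="mydb", allow_local_infile=True)
    cur = con.cursor()

    # Option 1: send many rows per round trip instead of one INSERT per line.
    def insert_batched(path, batch_size=10_000):
        batch = []
        with open(path) as f:
            for line in f:
                batch.append(tuple(line.rstrip("\n").split("\t")))
                if len(batch) >= batch_size:
                    cur.executemany("INSERT INTO mytable VALUES (%s, %s, %s)", batch)
                    batch.clear()
            if batch:
                cur.executemany("INSERT INTO mytable VALUES (%s, %s, %s)", batch)
        con.commit()

    # Option 2: let the server bulk-load the whole file (usually much faster).
    # cur.execute("LOAD DATA LOCAL INFILE 'file001.txt' INTO TABLE mytable "
    #             "FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n'")
    # con.commit()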