large-files

Reading sections from a large text file in Python efficiently

半城伤御伤魂 submitted on 2019-12-08 04:50:19
Question: I have a large text file containing several million lines of data. The very first column contains position coordinates. I need to create another file from this original data, but one that contains only specified, non-contiguous intervals based on the position coordinates. I have another file containing the coordinates for each interval. For instance, my original file is in a format similar to this:

Position Data1 Data2 Data3 Data4
55 a b c d
63 a b c d
68 a b c d
73 a b c d
75 a b c d
82 a b c d
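The scraped question stops at the sample data, but the streaming approach the setup calls for is: load the interval file once, then make a single pass over the large file and keep only the rows whose position falls inside some interval. A minimal Python sketch, with hypothetical file names and assuming the intervals do not overlap:

import bisect

# Hypothetical inputs: "intervals.txt" holds one "start end" pair per line,
# "original.txt" is the large data file with a header row.
intervals = []
with open("intervals.txt") as f:
    for line in f:
        start, end = map(int, line.split())
        intervals.append((start, end))
intervals.sort()
starts = [s for s, _ in intervals]

def in_any_interval(pos):
    # Last interval whose start is <= pos; then check its end.
    i = bisect.bisect_right(starts, pos) - 1
    return i >= 0 and pos <= intervals[i][1]

with open("original.txt") as src, open("filtered.txt", "w") as dst:
    dst.write(next(src))                    # copy the header line
    for line in src:                        # streams one line at a time
        pos = int(line.split(None, 1)[0])
        if in_any_interval(pos):
            dst.write(line)

Because the large file is read line by line, memory use stays flat no matter how many million rows it has.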

Is it possible to store only a checksum of a large file in git?

微笑、不失礼 submitted on 2019-12-07 22:55:10
Question: I'm a bioinformatician currently extracting normal-sized sequences from genomic files. Some genomic files are large enough that I don't want to put them into the main git repository, whereas I am putting the extracted sequences into git. Is it possible to tell git, "Here's a large file - don't store the whole file, just take its checksum, and let me know if that file is missing or modified"? If that's not possible, I guess I'll have to either git-ignore the large files, or, as suggested in this…
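Git itself has no built-in "store only the checksum" mode; extensions such as git-annex and Git LFS were built to fill that gap. A do-it-yourself workaround, sketched below in Python with hypothetical paths, is to git-ignore the large file and commit a small digest file instead, so the repository can still tell you when the data is missing or has changed:

import hashlib
from pathlib import Path

def record_checksum(path, digest_dir="checksums"):
    # Hash the (git-ignored) large file in 1 MiB chunks and write a small
    # .sha256 file that is committed in its place.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    out = Path(digest_dir)
    out.mkdir(exist_ok=True)
    digest_file = out / (Path(path).name + ".sha256")
    digest_file.write_text(f"{h.hexdigest()}  {path}\n")
    return digest_file

# Usage with a hypothetical genome file:
# record_checksum("genomes/chr1.fa")

Re-running the function and diffing the committed digest file (or checking it with sha256sum -c) then shows whether the large file has gone missing or changed.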

Remote Linux server to remote Linux server large sparse files copy - How To?

心已入冬 submitted on 2019-12-07 18:18:43
Question: I have two twin CentOS 5.4 servers with VMware Server installed on each. What is the most reliable and fastest method for copying virtual machine files from one server to the other, assuming that I always use sparse files for my VMware virtual machines? The VM files are a pain to copy since they are very large (50 GB), but since they are sparse files I think something can be done to improve the speed of the copy. Answer 1: If you want to copy large data quickly, rsync over SSH is not for you. As…
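The answer excerpt is cut off, but the usual trick with sparse files is to use a sparse-aware transfer (for example tar --sparse piped over the network, or rsync's --sparse option), so that only the regions that actually contain data are read and sent. Purely as an illustration of what "sparse-aware" means, here is a rough Python sketch (Linux, Python 3.3+, hypothetical paths) that copies a file locally while recreating its holes instead of writing gigabytes of zeros:

import os

def copy_sparse(src_path, dst_path, chunk=1 << 20):
    src = os.open(src_path, os.O_RDONLY)
    dst = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        size = os.fstat(src).st_size
        offset = 0
        while offset < size:
            try:
                start = os.lseek(src, offset, os.SEEK_DATA)   # next region holding data
            except OSError:                                   # only a trailing hole left
                break
            end = os.lseek(src, start, os.SEEK_HOLE)          # where that region ends
            os.lseek(src, start, os.SEEK_SET)
            os.lseek(dst, start, os.SEEK_SET)                 # seeking past EOF leaves a hole
            remaining = end - start
            while remaining > 0:
                buf = os.read(src, min(chunk, remaining))
                os.write(dst, buf)
                remaining -= len(buf)
            offset = end
        os.ftruncate(dst, size)                               # restore the full logical size
    finally:
        os.close(src)
        os.close(dst)

A real server-to-server transfer would of course go through a network tool rather than this local sketch; the point is only that the holes never need to be read or transmitted.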

NegativeArraySizeException when creating a SequenceFile with large (>1GB) BytesWritable value size

亡梦爱人 submitted on 2019-12-07 15:44:44
Question: I have tried different ways to create a large Hadoop SequenceFile with just one short (<100 bytes) key but one large (>1 GB) value (BytesWritable). The following sample works out of the box: https://svn.apache.org/repos/asf/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapred/BigMapOutput.java It writes multiple random-length keys and values with a total size >3 GB. However, it is not what I am trying to do…

Parsing a large (~40GB) XML text file in python

落爺英雄遲暮 submitted on 2019-12-07 14:42:44
Question: I've got an XML file I want to parse with Python. What is the best way to do this? Loading the entire document into memory would be disastrous; I need to somehow read it a single node at a time. The existing XML solutions I know of are ElementTree and minixml, but I'm afraid they aren't quite going to work because of the problem I mentioned. Also, I can't open it in a text editor - any good tips in general for working with giant text files? Answer 1: First, have you tried ElementTree (either the built-in pure…
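The answer excerpt breaks off, but the standard-library way to read a huge XML file one node at a time is ElementTree's iterparse, clearing elements as soon as they have been handled. A minimal sketch with a hypothetical record tag:

import xml.etree.ElementTree as ET

def stream_records(path, tag):
    # Iterate "end" events so each element is complete when we see it,
    # and clear the root's children afterwards so memory stays flat.
    context = ET.iterparse(path, events=("start", "end"))
    _, root = next(context)                  # the root element arrives first
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield elem
            root.clear()

# Usage with a hypothetical 40 GB file and record tag:
# for rec in stream_records("huge.xml", "entry"):
#     print(rec.findtext("id"))

The clear() call is what keeps this workable at 40 GB; without it the parsed elements accumulate under the root until memory runs out.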

Comparing two large files

蓝咒 submitted on 2019-12-07 10:54:33
Question: I need to write a program that writes the difference between two files to another file. The program has to loop through a 600 MB file with over 13.464.448 lines, check whether a grep returns true on another file, and then write the result to yet another file. I wrote a quick test with about 1.000.000 records and it took over an hour, so I'm guessing this approach could take 9+ hours. Do you have any recommendations on how to make this faster? Any particular language I should use? I was planning on…
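Calling grep once per line rescans the second file for every record, which is what makes the naive approach take hours. A common fix is to read one file's lines into a hash set once and stream the other file against it, so the whole comparison becomes a single pass over each file. A minimal Python sketch with hypothetical file names, assuming the lines of the second file fit in memory:

def diff_files(big_path, other_path, out_path):
    with open(other_path) as f:
        seen = {line.rstrip("\n") for line in f}       # one pass, then O(1) lookups
    with open(big_path) as src, open(out_path, "w") as out:
        for line in src:
            if line.rstrip("\n") not in seen:          # keep lines absent from the other file
                out.write(line)

# diff_files("big_600mb.txt", "other.txt", "difference.txt")

On files of this size (600 MB, roughly 13 million lines) this kind of single-pass comparison typically finishes in minutes rather than hours, in any language.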

How do I safely disable/remove the largefiles directory from a mercurial repository?

雨燕双飞 submitted on 2019-12-07 09:38:53
Question: In the past, I used the largefiles extension in Mercurial to store data together with the code I was working on. I think this was a mistake, and I would like to remove the "largefiles" directory (8 GB). Our network user directories are limited to 10 GB, and I need the space. I have not used any large files for a long time now, and I will not miss them when they are gone forever. So my questions are: Can I remove the largefiles directory under .hg without damaging the repo? If I do…

Uploading a large file (over 1GB, up to 2GB) using jQuery File Upload - blueimp (Ajax based) with PHP / Yii shows an error in the Firefox browser

半腔热情 submitted on 2019-12-06 23:36:42
Question: I am trying to upload a large file (over 1 GB, up to 2 GB) using jQuery File Upload - blueimp (Ajax based) with PHP / Yii Framework 1.15. I have set these values to allow larger uploads:

memory_limit = 2048M
upload_max_filesize = 2048M
post_max_size = 2048M

Session time set: ini_set('session.gc_maxlifetime', 7200);

Files smaller than 1 GB upload successfully, but when I try to upload a file larger than 1 GB it shows a Forbidden error after about 50 minutes of upload time... Server specifications: it's a…

How can I write/create a file larger than 2GB by using C/C++

两盒软妹~` submitted on 2019-12-06 16:01:23
Question: I tried to use the write() function to write a large piece of memory (more than 2 GB) into a file, but never succeeded. Can somebody be nice and tell me what to do? Answer 1: Assuming Linux :) http://www.suse.de/~aj/linux_lfs.html

1/ define _FILE_OFFSET_BITS to 64
2/ define _LARGEFILE_SOURCE and _LARGEFILE_SOURCE64
4/ use the O_LARGEFILE flag with open to operate on large files

Also some information here: http://www.gnu.org/software/libc/manual/html_node/Opening-Streams.html#index-fopen64-931 These days the file systems you have on your system will support large files out of the box. It depends upon the…

Converting very large files from XML to CSV

我们两清 submitted on 2019-12-06 15:18:20
Question: Currently I'm using the following code snippet to convert a .txt file with XML data to .CSV format. My question is this: currently this works perfectly with files that are around 100-200 MB, and the conversion time is very low (1-2 minutes max). However, I now need this to work for much bigger files (1-2 GB each). Currently the program freezes the computer, and the conversion takes about 30-40 minutes with this function. I am not sure how I would go about changing this function. Any help will be appreciated!

string all_lines = File.ReadAllText(p);
all_lines = "<Root>" + all_lines + "</Root>";