large-data

Pandas, large data, HDF tables and memory usage when calling a function

若如初见. Submitted on 2019-12-07 05:54:16
Question: Short question: when Pandas works on an HDFStore (e.g. .mean() or .apply()), does it load the full data into memory as a DataFrame, or does it process it record-by-record as a Series? Long description: I have to work on large data files, and I can specify the output format of the data file. I intend to use Pandas to process the data, and I would like to set up the best format so that it maximizes performance. I have seen that pandas.read_table() has gone a long way, but it still at least takes at
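Operations like .mean() or .apply() run on whatever object they are handed, so if the whole store is first read into a DataFrame, the full data sits in memory; reading the store in chunks keeps only one chunk resident at a time. A minimal sketch, assuming a hypothetical file data.h5 holding a table-format key "df" with a numeric column "value":

```python
import pandas as pd

total, count = 0.0, 0

with pd.HDFStore("data.h5", mode="r") as store:
    # chunksize makes select() return an iterator of DataFrame chunks,
    # so only one chunk is held in memory at a time
    for chunk in store.select("df", chunksize=100_000):
        total += chunk["value"].sum()
        count += len(chunk)

print("mean:", total / count)
```

By contrast, store["df"].mean() or pd.read_hdf("data.h5", "df").mean() materialise the entire table as a DataFrame before computing anything.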

Replacing punctuation in a data frame based on punctuation list [duplicate]

杀马特。学长 韩版系。学妹 Submitted on 2019-12-07 05:23:43
Question: This question already has answers here: Fast punctuation removal with pandas (3 answers). Closed last year. Using Canopy and Pandas, I have a data frame a which is defined by: a=pd.read_csv('text.txt') df=pd.DataFrame(a) df.columns=["test"] test.txt is a single-column file that contains a list of strings containing text, numbers and punctuation. Assuming df looks like: test %hgh&12 abc123!!! porkyfries I want my results to be: test hgh12 abc123 porkyfries Effort so far: from string import
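One common way to do this (a sketch, not necessarily the approach from the linked duplicate) is a vectorised regex replace on the column, which avoids looping over the rows:

```python
import pandas as pd

# Hypothetical recreation of the frame from the question
df = pd.DataFrame({"test": ["%hgh&12", "abc123!!!", "porkyfries"]})

# Drop every character that is not a letter, digit, underscore or whitespace
df["test"] = df["test"].str.replace(r"[^\w\s]", "", regex=True)

print(df)  # test column is now: hgh12, abc123, porkyfries
```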

Constructing high resolution images in Python

◇◆丶佛笑我妖孽 Submitted on 2019-12-06 16:04:58
Say I have some huge amount of data stored in an HDF5 data file (size: 20k x 20k, if not more) and I want to create an image from all of this data using Python. Obviously, this much data cannot be opened and stored in memory without an error. Therefore, is there some other library or method that would not require all of the data to be dumped into memory and then processed into an image (like how the libraries Image, matplotlib, numpy, etc. handle it)? Thanks. This question comes from a similar question I asked: Generating pcolormesh images from very large data sets saved in H5 files
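One memory-friendly option (a rough sketch, assuming a hypothetical file data.h5 with a 2-D dataset named grid) is to let h5py read only a strided subsample from disk and render that, so the full 20k x 20k array never has to fit in memory:

```python
import h5py
import matplotlib.pyplot as plt

STRIDE = 10  # keep every 10th row and column

with h5py.File("data.h5", "r") as f:
    dset = f["grid"]
    # h5py slicing reads only the requested elements from disk,
    # so this pulls roughly 1% of the full array into memory
    small = dset[::STRIDE, ::STRIDE]

plt.imsave("preview.png", small, cmap="viridis")
```

If every data point must contribute to the image, reading the dataset in row blocks and accumulating a scaled-down array block by block is a similar low-memory alternative.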

Apache solr adding/editing/deleting records frequently

此生再无相见时 Submitted on 2019-12-06 11:32:22
I'm thinking about using Apache Solr. In my db I will have around 10,000,000 records. The worst case where I will use it has around 20 searchable/sortable fields. My problem is that these fields may change values frequently during the day. For example, in my db I might change some fields of 10,000 records at the same time, and this may happen 0, 1 or 1,000 times a day, etc. The point is that each time I update a value in the db, I want it to be updated in Solr too, so I can search with the updated data each time. For those of you that have used Solr, how fast can re-indexing at such volumes be? Will
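On the update side, Solr's atomic updates let you change individual fields of an existing document rather than re-posting the whole thing (subject to schema requirements, e.g. fields being stored or having docValues). A rough sketch using Python's requests, assuming a hypothetical core named products running locally:

```python
import requests

# Atomic update: only the "price" field of document "doc-12345" is changed
payload = [{"id": "doc-12345", "price": {"set": 19.99}}]

resp = requests.post(
    "http://localhost:8983/solr/products/update",
    params={"commitWithin": 10000},  # let Solr batch commits instead of committing per request
    json=payload,
    timeout=10,
)
resp.raise_for_status()
```

For frequent updates like the ones described, how fast re-indexing feels in practice tends to depend more on commit/soft-commit frequency and cache warming than on the raw document count.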

AngularJS performance, large dataset, ng-repeat, HTML table with filters and two-way binding

微笑、不失礼 Submitted on 2019-12-06 11:04:45
So I have a simple layout of a page which includes a panel of filters and an HTML table of records using ng-repeat. I am using MVC5 and an AngularJS controller. I may have to deal with up to 100,000 records. Filters will apply to most of the columns, including dates and text fields. The records need two-way binding (the user has to select records which will be returned to the server). I'd like to get opinions on the best design ideas for this, i.e.: Would you load all the data to the browser upfront? If not, when would more data be requested from the server? If all upfront, should two arrays

How do I design a table which will store very large data?

▼魔方 西西 Submitted on 2019-12-06 11:01:01
I need to design a table in Oracle which will store 2-5 TB of data a day. It can grow to 200 TB, and records will be purged when it crosses 200 TB. Is it a feasible choice to keep it in OLTP, or do I need to shift it to a data warehouse DB? Please advise on considerations I should keep in mind when designing the schema of this table, or the database. Also, please advise for SQL Server as well, as I can use either database. That size puts you in VLDB territory (very large databases). Things are fundamentally different at that altitude. Your question cannot be answered without the full

Crash on Core Data Migration

血红的双手。 Submitted on 2019-12-06 09:47:08
Question: Some of our users crash during Core Data migration. There are already several questions about "Core Data migration & crash", mainly about memory usage and UI response: "Migrating large Core Data database crash", "Out-Of-Memory while doing Core Data migration", "Core Data causing app to crash while migrating", "Core Data lightweight migration crash". For the high memory peak, Apple suggests a multiple-passes solution, and here is another large-datasets solution. When I try to reproduce the problem, like migrating

Methods in R for large complex survey data sets?

强颜欢笑 Submitted on 2019-12-06 06:16:18
Question: I am not a survey methodologist or demographer, but am an avid fan of Thomas Lumley's R survey package. I've been working with a relatively large complex survey data set, the Healthcare Cost and Utilization Project (HCUP) National Emergency Department Sample (NEDS). As described by the Agency for Healthcare Research and Quality, it is "Discharge data for ED visits from 947 hospitals located in 30 States, approximating a 20-percent stratified sample of U.S. hospital-based EDs". The full dataset

Method for copying large amounts of data in C#

混江龙づ霸主 Submitted on 2019-12-06 01:13:46
I am using the following method to copy the contents of a directory to a different directory.

public void DirCopy(string SourcePath, string DestinationPath)
{
    if (Directory.Exists(DestinationPath))
    {
        System.IO.DirectoryInfo downloadedMessageInfo = new DirectoryInfo(DestinationPath);
        foreach (FileInfo file in downloadedMessageInfo.GetFiles())
        {
            file.Delete();
        }
        foreach (DirectoryInfo dir in downloadedMessageInfo.GetDirectories())
        {
            dir.Delete(true);
        }
    }

    //=================================================================================

    string[] directories = System.IO.Directory.GetDirectories

R - Why does adding 1 column to a data table nearly double the peak memory used?

夙愿已清 Submitted on 2019-12-05 22:29:41
Question: After getting help from 2 kind gentlemen, I managed to switch over to data.table from data frame + plyr. The situation and my questions: As I worked on it, I noticed that peak memory usage nearly doubled from 3.5 GB to 6.8 GB (according to Windows Task Manager) when I added 1 new column using := to my data set containing ~200K rows by 2.5K columns. I then tried 200M rows by 25 columns; the increase was from 6 GB to 7.6 GB before dropping to 7.25 GB after a gc(). Specifically regarding adding of new columns