large-data

Parallel.ForEach can cause an "Out Of Memory" exception when working with an enumerable of large objects

落花浮王杯 submitted on 2019-12-17 03:27:43
Question: I am trying to migrate a database where images were stored in the database itself to records in the database that point at files on the hard drive. I was trying to use Parallel.ForEach to speed up the process, using this method to query out the data. However, I noticed that I was getting an OutOfMemoryException. I know Parallel.ForEach will query a batch of enumerables to mitigate the cost of overhead, if there is one, of spacing the queries out (so your source will more likely have the next record …
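
The .NET-specific fix usually involves a custom partitioner, but the underlying idea is simply to stop the framework from buffering many large items ahead of the workers. Below is a minimal sketch of that bounded-batching idea, written in Python rather than C#, with illustrative names (migrate_image, the chunk and worker sizes) that are not from the question:

    from concurrent.futures import ThreadPoolExecutor
    from itertools import islice

    def migrate_image(record):
        # Placeholder: write the image bytes to disk and return the new file path.
        ...

    def in_chunks(iterable, size):
        it = iter(iterable)
        while chunk := list(islice(it, size)):
            yield chunk

    def migrate_all(records, workers=4, chunk_size=16):
        # Hand the pool one small batch at a time, so at most chunk_size
        # large records are held in memory instead of a buffered-up backlog.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            for chunk in in_chunks(records, chunk_size):
                # Each new_path would be written back to the matching database row.
                for new_path in pool.map(migrate_image, chunk):
                    pass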

String split out of memory

雨燕双飞 submitted on 2019-12-14 03:59:00
Question: I have a large collection of tab-separated text data in the form DATE NAME MESSAGE. By large I mean a collection of 1.76 GB divided into 1075 actual files. I have to get the NAME data from all the files. So far I have this: File f = new File(directory); File files[] = f.listFiles(); // HashSet<String> all = new HashSet<String>(); ArrayList<String> userCount = new ArrayList<String>(); for (File file : files) { if (file.getName().endsWith(".txt")) { System.out.println(file.getName()); …
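
The usual remedy for this kind of out-of-memory failure is to stream each file line by line and keep only the NAME field, rather than reading whole files into memory before splitting. A rough sketch of that streaming approach in Python (the tab-separated DATE NAME MESSAGE layout is taken from the question; everything else is illustrative):

    import os

    def collect_names(directory):
        names = []                                 # or a set, if duplicates should collapse
        for filename in os.listdir(directory):
            if not filename.endswith(".txt"):
                continue
            with open(os.path.join(directory, filename), encoding="utf-8") as fh:
                for line in fh:                    # streams one line at a time
                    parts = line.rstrip("\n").split("\t")
                    if len(parts) >= 2:
                        names.append(parts[1])     # DATE NAME MESSAGE -> NAME
        return names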

Is there a memory-efficient way to replace a list of values in a pandas DataFrame?

青春壹個敷衍的年華 submitted on 2019-12-14 03:56:00
Question: I am trying to replace all of the unique strings in a large pandas DataFrame (1.5 million rows and about 15 columns) with an integer index. My problem is that my DataFrame is 2 GB and my list of unique strings ends up with around eighty thousand or more entries. To produce my list of unique strings I use: unique_string_list = pd.unique(df.values.ravel()).tolist() Then if I try to use df.replace(), either with a pair of lists or with a dictionary, the memory overhead is too much for my 8 GB …
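
One commonly suggested workaround is to keep the single global string-to-integer index but apply it column by column with Series.map, which avoids the overhead of one df.replace() call carrying an eighty-thousand-entry mapping. A minimal sketch, assuming df is the DataFrame from the question:

    import pandas as pd

    # One global string -> integer index, built as in the question.
    uniques = pd.unique(df.values.ravel())
    lookup = {value: idx for idx, value in enumerate(uniques)}

    # Replace column by column instead of in a single df.replace() call.
    for col in df.columns:
        df[col] = df[col].map(lookup)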

Populating a SELECT with a large JSON data set via ColdFusion (Lucee) is very slow

北慕城南 submitted on 2019-12-14 02:14:26
Question: Please forgive me if I have provided more information than required for this question. :D I am building an application that pulls large JSON data sets from a remote machine. However, I am working within a secure environment that separates application servers with firewalls, etc. Because of this I have had to do a bit of fudging (using SSH) to get the data I need. I requested that additional ports be opened so I could bypass SSH, but was denied. Here is the physical path to get my data …

POST array getting truncated; max_input_vars not working

久未见 submitted on 2019-12-13 19:20:03
Question: I'm developing an OpenCart solution with a cascading-option plugin in the admin backend. When saving the form, products with a large combination of options create large $_POST arrays. As far as I can see, the array (which is just over 1000 keys long for this product) is being truncated around the 1000 mark, which matches the default value of max_input_vars. I am on PHP 5.3.29, which should allow me to change the max_input_vars ini setting. I have added it to the local php.ini and …

Drawing massive networkx graph: Array too big

最后都变了- submitted on 2019-12-13 18:23:30
Question: I'm trying to draw a networkx graph with weighted edges, but right now I'm having some difficulty. As the title suggests, this graph is really huge: Number of Nodes: 103362, Number of Edges: 1419671. When I try to draw this graph with the following code: pos = nx.spring_layout(G) nx.draw(G, node_color='#A0CBE2', edge_color='#BB0000', width=2, edge_cmap=plt.cm.Blues, with_labels=False) plt.savefig("edge_colormap.png") # save as png plt.show() # display (this is just me testing functionality, not …
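
The "Array too big" error here most likely comes from spring_layout, which builds dense n-by-n arrays internally; at 103,362 nodes that is far beyond available memory. One common mitigation is to lay out and draw a sampled subgraph with a cheaper layout and thinner styling. A sketch of that idea, assuming G is the graph from the question, with the sample size and styling values chosen arbitrarily:

    import random
    import matplotlib
    matplotlib.use("Agg")                      # render off-screen
    import matplotlib.pyplot as plt
    import networkx as nx

    # Lay out and draw a sampled subgraph; the full 103k-node graph is
    # too large for spring_layout's dense internal arrays.
    sample = random.sample(list(G.nodes()), 5000)
    H = G.subgraph(sample)

    pos = nx.random_layout(H)                  # linear-time layout
    nx.draw_networkx_edges(H, pos, edge_color="#BB0000", width=0.1)
    nx.draw_networkx_nodes(H, pos, node_color="#A0CBE2", node_size=2)
    plt.savefig("edge_colormap.png", dpi=300)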

Short text clustering with a large dataset - user profiling

我与影子孤独终老i submitted on 2019-12-13 18:09:56
Question: Let me explain what I want to do. Input: a CSV file with millions of rows, each containing the id of a user and a string with the list of keywords used by that user, separated by spaces. The format of the second field, the string, is not so important; I can change it based on my needs, for example by adding the counts of those keywords. The data comes from the Twitter database: users are Twitter users and keywords are "meaningful" words taken from their tweets (how is not …
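
For millions of short keyword strings, a common starting point is a sparse TF-IDF representation fed into a mini-batch clustering algorithm, so the data never has to be densified. A minimal scikit-learn sketch under those assumptions (the file name, column layout, vocabulary cap, and cluster count are all placeholders, not values from the question):

    import csv
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import MiniBatchKMeans

    def keyword_strings(path):
        # Stream the second column (the space-separated keywords) out of the CSV.
        with open(path, newline="", encoding="utf-8") as fh:
            for row in csv.reader(fh):
                yield row[1]

    docs = list(keyword_strings("users.csv"))
    X = TfidfVectorizer(max_features=50000).fit_transform(docs)    # sparse matrix
    labels = MiniBatchKMeans(n_clusters=100, batch_size=10000).fit_predict(X)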

Ruby on Rails - Storing and accessing large data sets

北战南征 submitted on 2019-12-13 18:00:20
Question: I am having a hard time managing the storage and access of a large dataset within a Ruby on Rails application. Here is my application in a nutshell: I run Dijkstra's algorithm over a road network and then display the nodes it visits using the Google Maps API. I am using an open dataset of the US road network to construct the graph by iterating over two txt files given in the link, but I am having trouble storing this data in my app. I am under the impression …

Receiving data with spaces through sockets

ぐ巨炮叔叔 submitted on 2019-12-13 09:11:48
Question: I'm using C++ with Qt 4 for this. When I try to send large HTML files (in this case, 8 KB), sending and receiving work well, but the received file arrives with spaces between each character of the HTML. Here is an example; the file is sent like this: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN" "http://www.w3.org/TR/REC-html40/strict.dtd"> <html><head><meta name="qrichtext" content="1" /><style type="text/css"> p, li { white-space: pre-wrap; } </style></head><body style=" …

Creating user sessions with fast computation

二次信任 submitted on 2019-12-13 07:33:03
Question: I have a data frame with three columns: "uuid" (class factor), "created_at" (class POSIXct), and "trainer_item_id" (factor), and I created an additional column named "Sessions". The Sessions column represents time sessions for each uuid, ordered by time, such that the time difference between any consecutive pair of events is at most one hour (3600 seconds). I created the Sessions column using a for loop and iteration. The problem is that I have more than a million …
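
The loop can usually be replaced by a grouped, vectorized computation: sort by uuid and time, flag every gap larger than the cut-off, and take a cumulative sum of the flags to number the sessions. Sketched here in pandas rather than R, with the column names taken from the question and everything else assumed:

    import pandas as pd

    def add_sessions(df, gap_seconds=3600):
        # Start a new session wherever the gap to the previous event of the
        # same uuid exceeds the cut-off; the cumulative sum numbers the sessions.
        df = df.sort_values(["uuid", "created_at"])
        gaps = df.groupby("uuid")["created_at"].diff().dt.total_seconds()
        new_session = (gaps > gap_seconds).astype(int)   # NaN gaps (first event per uuid) compare as False
        df["Sessions"] = new_session.groupby(df["uuid"]).cumsum()
        return df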