large-data

Parallel.ForEach can cause an "Out Of Memory" exception when working with an enumerable of large objects

落花浮王杯 submitted on 2019-12-17 03:27:43
Question: I am trying to migrate a database where images were stored in the database itself to records in the database that point at files on the hard drive. I was trying to use Parallel.ForEach to speed up the process, using this method to query out the data. However, I noticed that I was getting an OutOfMemoryException. I know Parallel.ForEach will query a batch of enumerables to mitigate the cost of overhead, if there is one, of spacing the queries out (so your source will more likely have the next record …
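
The .NET-specific fix usually involves a custom partitioner, but the underlying idea is simply to stop the framework from buffering many large items ahead of the workers. Below is a minimal sketch of that bounded-batching idea, written in Python rather than C#, with illustrative names (migrate_image, the chunk and worker sizes) that are not from the question:

    from concurrent.futures import ThreadPoolExecutor
    from itertools import islice

    def migrate_image(record):
        # Placeholder: write the image bytes to disk and return the new file path.
        ...

    def in_chunks(iterable, size):
        it = iter(iterable)
        while chunk := list(islice(it, size)):
            yield chunk

    def migrate_all(records, workers=4, chunk_size=16):
        # Hand the pool one small batch at a time, so at most chunk_size
        # large records are held in memory instead of a buffered-up backlog.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            for chunk in in_chunks(records, chunk_size):
                # Each new_path would be written back to the matching database row.
                for new_path in pool.map(migrate_image, chunk):
                    pass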

String split out of memory

雨燕双飞 submitted on 2019-12-14 03:59:00
Question: I have a large collection of tab-separated text data in the form DATE NAME MESSAGE. By large I mean a collection of 1.76 GB divided into 1075 actual files. I have to get the NAME data from all the files. So far I have this: File f = new File(directory); File files[] = f.listFiles(); // HashSet<String> all = new HashSet<String>(); ArrayList<String> userCount = new ArrayList<String>(); for (File file : files) { if (file.getName().endsWith(".txt")) { System.out.println(file.getName()); …
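
The usual remedy for this kind of out-of-memory failure is to stream each file line by line and keep only the NAME field, rather than reading whole files into memory before splitting. A rough sketch of that streaming approach in Python (the tab-separated DATE NAME MESSAGE layout is taken from the question; everything else is illustrative):

    import os

    def collect_names(directory):
        names = []                                 # or a set, if duplicates should collapse
        for filename in os.listdir(directory):
            if not filename.endswith(".txt"):
                continue
            with open(os.path.join(directory, filename), encoding="utf-8") as fh:
                for line in fh:                    # streams one line at a time
                    parts = line.rstrip("\n").split("\t")
                    if len(parts) >= 2:
                        names.append(parts[1])     # DATE NAME MESSAGE -> NAME
        return names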

Is there a memory-efficient way to replace a list of values in a pandas DataFrame?

青春壹個敷衍的年華 submitted on 2019-12-14 03:56:00
Question: I am trying to replace all of the unique strings in a large pandas DataFrame (1.5 million rows and about 15 columns) with an integer index. My problem is that my DataFrame is 2 GB and my list of unique strings ends up with around eighty thousand or more entries. To produce my list of unique strings I use: unique_string_list = pd.unique(df.values.ravel()).tolist() Then if I try to use df.replace(), either with a pair of lists or with a dictionary, the memory overhead is too much for my 8 GB …
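
One commonly suggested workaround is to keep the single global string-to-integer index but apply it column by column with Series.map, which avoids the overhead of one df.replace() call carrying an eighty-thousand-entry mapping. A minimal sketch, assuming df is the DataFrame from the question:

    import pandas as pd

    # One global string -> integer index, built as in the question.
    uniques = pd.unique(df.values.ravel())
    lookup = {value: idx for idx, value in enumerate(uniques)}

    # Replace column by column instead of in a single df.replace() call.
    for col in df.columns:
        df[col] = df[col].map(lookup)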

Populating a SELECT with a large JSON data set via ColdFusion (Lucee) is very slow

北慕城南 submitted on 2019-12-14 02:14:26
Question: Please forgive me if I have provided more information than required for this question. :D I am building an application that pulls large JSON data sets from a remote machine. However, I am working within a secure environment that separates application servers with firewalls, etc. Because of this I have had to do a bit of fudging (using SSH) to get the data I need. I requested that additional ports be opened so I could bypass SSH, but was denied. Here is the physical path to get my data …

POST array getting truncated; max_input_vars not working

久未见 submitted on 2019-12-13 19:20:03
Question: I'm developing an OpenCart solution with a cascading-option plugin in the admin backend. When saving the form, products with a large combination of options create large $_POST arrays. As far as I can see, the array (which is just over 1000 keys long for this product) is being truncated around the 1000 mark, which matches the default value of max_input_vars. I am on PHP 5.3.29, which should allow me to change the max_input_vars ini setting. I have added it to the local php.ini and …

Drawing massive networkx graph: Array too big

最后都变了- submitted on 2019-12-13 18:23:30
Question: I'm trying to draw a networkx graph with weighted edges, but right now I'm having some difficulty. As the title suggests, this graph is really huge: Number of Nodes: 103362, Number of Edges: 1419671. When I try to draw this graph with the following code: pos = nx.spring_layout(G) nx.draw(G, node_color='#A0CBE2', edge_color='#BB0000', width=2, edge_cmap=plt.cm.Blues, with_labels=False) plt.savefig("edge_colormap.png") # save as png plt.show() # display (this is just me testing functionality, not …
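
The "Array too big" error here most likely comes from spring_layout, which builds dense n-by-n arrays internally; at 103,362 nodes that is far beyond available memory. One common mitigation is to lay out and draw a sampled subgraph with a cheaper layout and thinner styling. A sketch of that idea, assuming G is the graph from the question, with the sample size and styling values chosen arbitrarily:

    import random
    import matplotlib
    matplotlib.use("Agg")                      # render off-screen
    import matplotlib.pyplot as plt
    import networkx as nx

    # Lay out and draw a sampled subgraph; the full 103k-node graph is
    # too large for spring_layout's dense internal arrays.
    sample = random.sample(list(G.nodes()), 5000)
    H = G.subgraph(sample)

    pos = nx.random_layout(H)                  # linear-time layout
    nx.draw_networkx_edges(H, pos, edge_color="#BB0000", width=0.1)
    nx.draw_networkx_nodes(H, pos, node_color="#A0CBE2", node_size=2)
    plt.savefig("edge_colormap.png", dpi=300)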

Short text clustering with a large dataset - user profiling

我与影子孤独终老i submitted on 2019-12-13 18:09:56
Question: Let me explain what I want to do. Input: a CSV file with millions of rows, each containing the id of a user and a string with the list of keywords used by that user, separated by spaces. The format of the second field, the string, is not so important; I can change it based on my needs, for example by adding the counts of those keywords. The data comes from the Twitter database: users are Twitter users and keywords are "meaningful" words taken from their tweets (how is not …
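
For millions of short keyword strings, a common starting point is a sparse TF-IDF representation fed into a mini-batch clustering algorithm, so the data never has to be densified. A minimal scikit-learn sketch under those assumptions (the file name, column layout, vocabulary cap, and cluster count are all placeholders, not values from the question):

    import csv
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import MiniBatchKMeans

    def keyword_strings(path):
        # Stream the second column (the space-separated keywords) out of the CSV.
        with open(path, newline="", encoding="utf-8") as fh:
            for row in csv.reader(fh):
                yield row[1]

    docs = list(keyword_strings("users.csv"))
    X = TfidfVectorizer(max_features=50000).fit_transform(docs)    # sparse matrix
    labels = MiniBatchKMeans(n_clusters=100, batch_size=10000).fit_predict(X)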

Ruby on Rails - Storing and accessing large data sets

北战南征 submitted on 2019-12-13 18:00:20
Question: I am having a hard time managing the storage and access of a large dataset within a Ruby on Rails application. Here is my application in a nutshell: I run Dijkstra's algorithm over a road network and then display the nodes it visits using the Google Maps API. I am using an open dataset of the US road network to construct the graph by iterating over two txt files given in the link, but I am having trouble storing this data in my app. I am under the impression …

Receiving data with spaces through sockets

ぐ巨炮叔叔 submitted on 2019-12-13 09:11:48
Question: I'm using C++ with Qt 4 for this. When I try to send large HTML files (in this case, 8 KB), sending and receiving work well, but the received file arrives with spaces between each character of the HTML. Here is an example; the file is sent like this: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN" "http://www.w3.org/TR/REC-html40/strict.dtd"> <html><head><meta name="qrichtext" content="1" /><style type="text/css"> p, li { white-space: pre-wrap; } </style></head><body style=" …

Creating user sessions with fast computation

二次信任 submitted on 2019-12-13 07:33:03
Question: I have a data frame with three columns: "uuid" (class factor), "created_at" (class POSIXct), and "trainer_item_id" (factor), and I created an additional column named "Sessions". The Sessions column represents time sessions for each uuid, ordered by time, such that the time difference between any consecutive pair of events is at most one hour (3600 seconds). I created the Sessions column using a for loop and iteration. The problem is that I have more than a million …
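
The loop can usually be replaced by a grouped, vectorized computation: sort by uuid and time, flag every gap larger than the cut-off, and take a cumulative sum of the flags to number the sessions. Sketched here in pandas rather than R, with the column names taken from the question and everything else assumed:

    import pandas as pd

    def add_sessions(df, gap_seconds=3600):
        # Start a new session wherever the gap to the previous event of the
        # same uuid exceeds the cut-off; the cumulative sum numbers the sessions.
        df = df.sort_values(["uuid", "created_at"])
        gaps = df.groupby("uuid")["created_at"].diff().dt.total_seconds()
        new_session = (gaps > gap_seconds).astype(int)   # NaN gaps (first event per uuid) compare as False
        df["Sessions"] = new_session.groupby(df["uuid"]).cumsum()
        return df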