large-data

numpy.memmap for an array of strings?

我的梦境 submitted on 2019-12-04 09:29:45
Is it possible to use numpy.memmap to map a large disk-based array of strings into memory? I know it can be done for floats and the like, but this question is specifically about strings. I am interested in solutions for both fixed-length and variable-length strings. The solution is free to dictate any reasonable file format. If all the strings have the same length, as the term "array" suggests, this is easily possible: a = numpy.memmap("data", dtype="S10") would be an example for strings of length 10. Edit: since the strings apparently don't have the same length, you need to index the
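
A minimal, self-contained sketch of the fixed-length case described above; the file name, its contents, and the 10-byte record length are assumptions for illustration only.

import numpy as np

# Write a small file of fixed-length (10-byte) records so the example is
# self-contained; "data.bin" and its contents are placeholders.
records = np.array([b"alpha", b"beta", b"gamma"], dtype="S10")
records.tofile("data.bin")

# Map the file back without loading it all into RAM. mode="r" opens it
# read-only; dtype="S10" tells memmap each record is exactly 10 bytes,
# so the number of elements is inferred from the file size.
strings = np.memmap("data.bin", dtype="S10", mode="r")
print(strings[1])  # b'beta' (numpy strips the trailing null padding)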

import/export very large mysql database in phpmyadmin

折月煮酒 submitted on 2019-12-04 04:59:50
I have a database in phpMyAdmin with 3,000,000 records. I want to export it to another PC. When I export it, only 200,000 entries end up in the .sql file, and that file also fails to import on the other PC. Answering this for anyone else who lands here. If you can only use phpMyAdmin because you do not have SSH access to the MySQL service, or do not know how to use command-line tools, then this might help. However, as the comment above suggests, exporting a database of this size would be far easier with mysqldump. phpMyAdmin (I'm using v3.5.6) allows tables to be exported individually like so: Select
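
If you do have shell access, the mysqldump route mentioned above is usually the simplest. Below is a hedged sketch of driving it from Python (an analogue only; the host, user, and database name are placeholders, and -p makes mysqldump prompt for the password on the terminal).

import subprocess

# Stream the dump straight to a file; --single-transaction avoids locking
# InnoDB tables while the dump runs. Restore on the other machine with
# "mysql -u dbuser -p mydatabase < backup.sql".
with open("backup.sql", "wb") as out:
    subprocess.run(
        ["mysqldump", "--single-transaction",
         "-h", "localhost", "-u", "dbuser", "-p", "mydatabase"],
        stdout=out,
        check=True,
    )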

Generating large Excel files from MySQL data with PHP from corporate applications

青春壹個敷衍的年華 submitted on 2019-12-04 04:22:57
We're developing and maintaining a couple of systems which need to export reports in Excel format to the end user. The reports are gathered from a MySQL database with some trivial processing and usually result in ~40,000 rows of data with 10-15 columns, and we expect the amount of data to grow steadily. At the moment we're using PHPExcel for the Excel generation, but it's not working for us anymore. After we go above 5,000 rows, the memory consumption and loading times become intolerable and can't be solved by indefinitely increasing PHP's maximum limits for memory usage and script execution
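
The usual fix for reports that outgrow an in-memory workbook is to stream rows to disk as they come out of the database. The sketch below shows that idea in Python with openpyxl's write-only mode, purely as an analogue of the approach (the original system is PHP); the data generator and file name are made up.

from openpyxl import Workbook

# Write-only mode serializes rows incrementally instead of holding the
# whole workbook in memory.
def export_report(rows, path="report.xlsx"):
    wb = Workbook(write_only=True)
    ws = wb.create_sheet(title="Report")
    for row in rows:              # each row is a sequence of cell values
        ws.append(row)
    wb.save(path)

# Stand-in for the MySQL result set: 40,000 rows streamed from a generator.
export_report((i, "item-%d" % i, i * 1.5) for i in range(40000))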

fread protection stack overflow error

南笙酒味 submitted on 2019-12-04 03:51:14
I'm using fread in data.table (1.8.8, R 3.0.1) in an attempt to read very large files. The file in question has 313 rows and ~6.6 million columns of numeric data, and is around 12 GB. This is a CentOS 6.4 machine with 512 GB of RAM. When I attempt to read in the file:

g=fread('final.results', header=T, sep=' ')
'header' changed by user from 'auto' to TRUE
Error: protect(): protection stack overflow

I tried starting R with --max-ppsize 500000, which is the maximum, but got the same error. I also tried setting the stack size to unlimited via ulimit -s unlimited. Virtual memory was already set to

PHP and the million array baby

余生长醉 submitted on 2019-12-04 01:26:57
Imagine you have the following array of integers: array(1, 2, 1, 0, 0, 1, 2, 4, 3, 2, [...] ); The array goes on up to one million entries, only instead of being hardcoded the values have been pre-generated and stored in a JSON-formatted file (approximately 2 MB in size). The order of these integers matters; I can't randomly generate it every time, because it should be consistent and always have the same values at the same indexes. If this file is read back in PHP afterwards (e.g. using file_get_contents + json_decode) it takes 700 to 900 ms just to get the array back. "Okay," I thought, "it's
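
For scale, here is the same workflow sketched in Python rather than PHP (an analogue only, not the poster's code): generate the million-entry array once, persist it as JSON, and time how long decoding it back takes. The file name and values are placeholders.

import json, time

# One-time generation step: a deterministic million-entry array of small ints.
values = [i % 5 for i in range(1_000_000)]
with open("values.json", "w") as fh:
    json.dump(values, fh)

# Per-request step: read the file and decode it back into a list, timing it.
start = time.perf_counter()
with open("values.json") as fh:
    loaded = json.load(fh)
print(len(loaded), "entries decoded in %.3f s" % (time.perf_counter() - start))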

Components with large datasets run slowly on IE11/Edge only

前提是你 submitted on 2019-12-03 13:34:07
Consider the code below, <GridBody Rows={rows} />, and imagine that rows.length amounts to 2000 or more, with each row array having about 8 columns in this example. I use a more expanded version of this code to render part of a table that has been bottlenecking my web application.

var GridBody = React.createClass({
    render: function () {
        return <tbody>
            {this.props.Rows.map((row, rowKey) => {
                return this.renderRow(row, rowKey);
            })}
        </tbody>;
    },
    renderRow: function (row, rowKey) {
        return <tr key={rowKey}>
            {row.map((col, colKey) => {
                return this.renderColumn(col, colKey);
            })}
        </tr>;
    },

svd of very large matrix in R program

时间秒杀一切 submitted on 2019-12-03 07:50:55
Question: I have a 60,000 x 60,000 matrix in a txt file, and I need to compute the SVD of this matrix. I use R, but I don't know whether R can handle it.

Answer 1: I think it's possible to compute a (partial) SVD using the irlba package together with bigmemory and bigalgebra, without using a lot of memory. First let's create a 20000 x 20000 matrix and save it into a file:

require(bigmemory)
require(bigalgebra)
require(irlba)

con <- file("mat.txt", open = "a")
replicate(20, {
    x <- matrix(rnorm(1000 * 20000), nrow = 1000)
    write.table(x,
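
The same partial-SVD idea can be sketched outside R as well. Below is a hedged Python analogue (not the answer's irlba code): keep the matrix in a disk-backed numpy.memmap and ask an iterative solver for only the top k singular triplets. The 2000 x 2000 size and the file name are placeholders chosen so the example runs quickly.

import numpy as np
from scipy.sparse.linalg import svds

# Create a disk-backed matrix; for a real 60000 x 60000 problem the data
# would already live in the file rather than being generated here.
n = 2000
mat = np.memmap("mat.dat", dtype="float64", mode="w+", shape=(n, n))
mat[:] = np.random.standard_normal((n, n))
mat.flush()

# Compute only the 10 largest singular values/vectors instead of the full SVD.
u, s, vt = svds(mat, k=10)
print(s[::-1])   # svds returns singular values in ascending order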

How much data can R handle? [closed]

眉间皱痕 submitted on 2019-12-03 05:47:08
By "handle" I mean manipulate multi-columnar rows of data. How does R stack up against tools like Excel, SPSS, SAS, and others? Is R a viable tool for looking at "BIG DATA" (hundreds of millions to billions of rows)? If not, which statistical programming tools are best suited for analyzing large data sets? If you look at the High-Performance Computing Task View on CRAN, you will get a good idea of what R can do in a sense

With Haskell, how do I process large volumes of XML?

◇◆丶佛笑我妖孽 submitted on 2019-12-03 05:06:37
Question: I've been exploring the Stack Overflow data dumps and have thus far been taking advantage of the friendly XML and “parsing” with regular expressions. My attempts with various Haskell XML libraries to find the first post in document order by a particular user all ran into nasty thrashing.

TagSoup:

import Control.Monad
import Text.HTML.TagSoup

userid = "83805"

main = do
    posts <- liftM parseTags (readFile "posts.xml")
    print $ head $ map (fromAttrib "Id") $ filter (~== ("<row OwnerUserId=" ++ userid ++ ">")

Find Top 10 Most Frequent visited URl, data is stored across network

自闭症网瘾萝莉.ら submitted on 2019-12-03 04:36:43
Question: Source: Google interview question. Given a large network of computers, each keeping log files of visited URLs, find the top ten most-visited URLs. We have many large <string (url) -> int (visits)> maps. Calculate <string (url) -> int (sum of visits among all distributed maps)> and get the top ten in the combined map. Main constraint: the maps are too large to transmit over the network. We also can't use MapReduce directly. I have now come across quite a few questions of this type, where
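
As a baseline, here is the naive merge that the constraint rules out, sketched in Python with made-up sample data: sum the per-URL counts across every machine's map, then take the ten largest entries from the combined map.

import heapq
from collections import Counter

# Stand-ins for the per-machine <url -> visits> maps; in the real problem
# each of these would be far too large to ship over the network.
machine_logs = [
    {"example.com/a": 120, "example.com/b": 40},
    {"example.com/a": 15, "example.com/c": 300},
    {"example.com/b": 75, "example.com/d": 5},
]

# Combine the maps by summing visit counts per URL...
combined = Counter()
for counts in machine_logs:
    combined.update(counts)

# ...then pick the ten most-visited URLs from the merged counts.
top_ten = heapq.nlargest(10, combined.items(), key=lambda kv: kv[1])
print(top_ten)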