bigdata

Search in 300 million addresses with pg_trgm

Submitted by 家住魔仙堡 on 2019-12-21 04:59:26
Question: I have 300 million addresses in my PostgreSQL 9.3 database and I want to use pg_trgm to fuzzy-search the rows. The end goal is a search function much like Google Maps search. When I used pg_trgm to search these addresses, it took about 30 s to get the results. Many rows match the default similarity threshold of 0.3, but I only need about 5 or 10 results. I created a trigram GiST index: CREATE INDEX addresses_trgm_index ON addresses USING gist (address gist_trgm_ops);
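One common approach for this kind of top-N search (a sketch, not taken from the original post) is to let the GiST trigram index do a nearest-neighbour lookup with the <-> distance operator and a LIMIT, rather than filtering on the similarity threshold; the <-> operator is only indexable with GiST, not GIN, which is why the GiST index above matters. A minimal JDBC sketch, assuming the addresses(address text) table and index shown; the connection string and search term are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class TrigramKnnSearch {
    public static void main(String[] args) throws Exception {
        // Placeholder connection settings.
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/geodb", "user", "password");

        // ORDER BY address <-> ? lets the GiST trigram index return the
        // closest matches directly, so only the top 10 rows are fetched.
        PreparedStatement ps = conn.prepareStatement(
                "SELECT address, address <-> ? AS dist " +
                "FROM addresses ORDER BY address <-> ? LIMIT 10");
        ps.setString(1, "1600 Amphitheatre Parkway");
        ps.setString(2, "1600 Amphitheatre Parkway");

        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "  " + rs.getDouble(2));
            }
        }
        conn.close();
    }
}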

Use tm's Corpus function with big data in R

Submitted by 百般思念 on 2019-12-21 04:48:06
Question: I'm trying to do text mining on big data in R with tm. I run into memory issues frequently (such as "cannot allocate vector of size ...") and use the established methods of troubleshooting them, such as: using 64-bit R; trying different OSes (Windows, Linux, Solaris, etc.); setting memory.limit() to its maximum; making sure that sufficient RAM and compute are available on the server (which there are); making liberal use of gc(); profiling the code for bottlenecks; breaking up big operations

Why is Kafka consumer performance slow?

Submitted by 白昼怎懂夜的黑 on 2019-12-21 02:51:26
Question: I have one simple topic and one simple Kafka consumer and producer, both using the default configuration. The program is very simple: I have two threads. The producer keeps sending 16-byte messages, and the consumer keeps receiving them. I found that the producer throughput is roughly 10 MB/s, which is fine, but the consumer throughput is only 0.2 MB/s. I have disabled all the debug logging, but that does not make it any better. The test is running on a local machine. Anybody
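For reference, a minimal consumer sketch with a few fetch-related settings that are often tuned when throughput collapses on tiny messages; the broker address, topic name, and the specific values here are assumptions, not taken from the original post:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ThroughputConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "throughput-test");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        // Wait for bigger batches instead of returning a handful of 16-byte records per fetch.
        props.put("fetch.min.bytes", "65536");
        props.put("fetch.max.wait.ms", "100");
        props.put("max.poll.records", "10000");

        long bytes = 0;
        long start = System.currentTimeMillis();
        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("test-topic"));
            // Measure consumed bytes for roughly ten seconds.
            while (System.currentTimeMillis() - start < 10_000) {
                ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofMillis(100));
                for (ConsumerRecord<byte[], byte[]> record : records) {
                    bytes += record.value().length;
                }
            }
        }
        double seconds = (System.currentTimeMillis() - start) / 1000.0;
        System.out.printf("consumed %.2f MB/s%n", bytes / 1_000_000.0 / seconds);
    }
}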

Datastore for large astrophysics simulation data

Submitted by 杀马特。学长 韩版系。学妹 on 2019-12-20 15:34:05
Question: I'm a grad student in astrophysics. I run big simulations using codes mostly developed by others over a decade or so. For examples of these codes, you can check out gadget http://www.mpa-garching.mpg.de/gadget/ and enzo http://code.google.com/p/enzo/. Those are definitely the two most mature codes (they use different methods). The outputs from these simulations are huge. Depending on your code your data is a bit different, but it's always big data. You usually take billions of particles and

What does the NameNode store?

Submitted by 安稳与你 on 2019-12-20 15:30:43
Question: In the case of the NameNode, what gets stored in main memory and what gets stored in secondary memory (hard disk)? What do we mean by "file to block mapping"? What exactly are fsimage and the edit logs? Answer 1: In the case of the NameNode, what gets stored in main memory and what gets stored in secondary memory (hard disk)? The file-to-block mapping, the locations of blocks on data nodes, the set of active data nodes, and a bunch of other metadata are all stored in memory on the NameNode. When you check the NameNode status
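As an illustration of the file-to-block mapping the NameNode keeps in memory, here is a small sketch using the HDFS Java API; the path /user/data/sample.txt is a hypothetical example, not from the original post:

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file; every answer to this lookup is served from the
        // NameNode's in-memory metadata, not read from fsimage on disk.
        Path path = new Path("/user/data/sample.txt");
        FileStatus status = fs.getFileStatus(path);

        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + Arrays.toString(block.getHosts()));
        }
        fs.close();
    }
}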

How to fetch all data from an HBase table in Spark

Submitted by 生来就可爱ヽ(ⅴ<●) on 2019-12-20 10:47:27
Question: I have a big table in HBase named UserAction, and it has three column families (song, album, singer). I need to fetch all the data from the 'song' column family as a JavaRDD object. I tried this code, but it's not efficient. Is there a better solution?
static SparkConf sparkConf = new SparkConf().setAppName("test").setMaster("local[4]");
static JavaSparkContext jsc = new JavaSparkContext(sparkConf);
static void getRatings() {
    Configuration conf = HBaseConfiguration.create();
    conf
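The excerpt cuts off mid-statement. One commonly suggested approach (a sketch, not necessarily the accepted answer) is to restrict the scan to the 'song' family on the HBase side via TableInputFormat, so the other families are never shipped to Spark:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SongFamilyFetch {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setAppName("test").setMaster("local[4]");
        JavaSparkContext jsc = new JavaSparkContext(sparkConf);

        Configuration conf = HBaseConfiguration.create();
        conf.set(TableInputFormat.INPUT_TABLE, "UserAction");
        // Server-side restriction: only the 'song' column family is scanned and transferred.
        conf.set(TableInputFormat.SCAN_COLUMN_FAMILY, "song");

        JavaPairRDD<ImmutableBytesWritable, Result> rows = jsc.newAPIHadoopRDD(
                conf, TableInputFormat.class, ImmutableBytesWritable.class, Result.class);

        System.out.println("rows in 'song': " + rows.count());
        jsc.stop();
    }
}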

R vector size limit: “long vectors (argument 5) are not supported in .C”

Submitted by |▌冷眼眸甩不掉的悲伤 on 2019-12-20 10:23:57
Question: I have a very large matrix that I'm trying to run through glmnet on a server with plenty of memory. It works fine even on very large data sets up to a certain point, after which I get the following error: Error in elnet(x, ...) : long vectors (argument 5) are not supported in .C If I understand correctly, this is caused by a limitation of R's .C interface, which cannot accept any vector with length longer than INT_MAX. Is that correct? Are there any available solutions to this that don't require a complete rewrite

Fastest way to cross-tabulate two massive logical vectors in R

Submitted by 让人想犯罪 __ on 2019-12-20 09:58:33
Question: For two logical vectors, x and y, of length > 1E8, what is the fastest way to calculate the 2x2 cross tabulation? I suspect the answer is to write it in C/C++, but I wonder if there is something in R that is already quite smart about this problem, as it's not uncommon. Example code, for 300M entries (feel free to let N = 1E8 if 3E8 is too big); I chose a total size just under 2.5 GB (2.4 GB). I targeted a density of 0.02, just to make it more interesting (one could use a sparse vector, if that
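The question itself suspects that a compiled single pass is the answer. As an illustration of that idea only (written in Java rather than C/C++, and not taken from any posted answer), the whole 2x2 table can be filled in one loop over the two vectors:

import java.util.Random;

public class CrossTab {
    // counts[0]=FF, counts[1]=FT, counts[2]=TF, counts[3]=TT
    static long[] crossTab2x2(boolean[] x, boolean[] y) {
        long[] counts = new long[4];
        for (int i = 0; i < x.length; i++) {
            counts[(x[i] ? 2 : 0) + (y[i] ? 1 : 0)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int n = 10_000_000;        // scale up toward 3E8 given enough heap
        double density = 0.02;     // matches the density used in the question
        Random rng = new Random(1);
        boolean[] x = new boolean[n];
        boolean[] y = new boolean[n];
        for (int i = 0; i < n; i++) {
            x[i] = rng.nextDouble() < density;
            y[i] = rng.nextDouble() < density;
        }
        long[] c = crossTab2x2(x, y);
        System.out.printf("FF=%d FT=%d TF=%d TT=%d%n", c[0], c[1], c[2], c[3]);
    }
}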

PostgreSQL - performance of using arrays in a big database

Submitted by 孤者浪人 on 2019-12-20 09:17:56
Question: Let's say we have a table with 6 million records. There are 16 integer columns and a few text columns. It is a read-only table, so every integer column has an index. Every record is around 50-60 bytes. The table name is "Item". The server has 12 GB RAM, 1.5 TB SATA, and 4 cores, all dedicated to PostgreSQL. There are many more tables in this database, so RAM does not cover the whole database. I want to add to the table "Item" a column "a_elements" (an array of big integers). Every record would have not more than 50-60
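The excerpt cuts off before the intended queries, so purely as an illustration: such an array column is usually paired with a GIN index and looked up with the containment operator. A JDBC sketch; the id column, element value, and connection details are assumptions, not from the original post:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ArrayLookup {
    public static void main(String[] args) throws Exception {
        // Placeholder connection settings.
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/shop", "user", "password");

        // Assumes: ALTER TABLE item ADD COLUMN a_elements bigint[];
        //          CREATE INDEX item_a_elements_gin ON item USING gin (a_elements);
        PreparedStatement ps = conn.prepareStatement(
                "SELECT id FROM item WHERE a_elements @> ARRAY[CAST(? AS bigint)]");
        ps.setLong(1, 12345L);

        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                System.out.println(rs.getLong(1));
            }
        }
        conn.close();
    }
}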