bigdata

How to load large table into tableau for data visualization?

六眼飞鱼酱① Submitted on 2019-11-29 07:23:39
I am able to connect Tableau to my database, but the table here is really large. Every time I try to load the table into Tableau it crashes, and I cannot find a workaround. The table size varies from 10 million to 400 million rows. How should I approach this issue? Any suggestions?

I found a simple solution for getting Tableau to work with very large datasets (1 billion+ rows): Google BigQuery, which is essentially a managed data warehouse. Upload the data to BigQuery (you can append multiple files into a single table), then link that table to Tableau as an external data source.
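A minimal sketch of that workflow, assuming the google-cloud-bigquery Python client and an existing BigQuery dataset (the project, dataset, table, and file names below are hypothetical): load each file into one destination table with WRITE_APPEND, then point Tableau's built-in Google BigQuery connector at that table (as a live connection or an extract).

    # Sketch: append CSV files into one BigQuery table, then connect Tableau to it.
    # Assumes google-cloud-bigquery and application-default credentials; names are hypothetical.
    from google.cloud import bigquery

    client = bigquery.Client()
    table_id = "my-project.my_dataset.big_table"    # hypothetical destination table

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,                            # infer the schema from the file
        write_disposition="WRITE_APPEND",           # append multiple files into one table
    )

    with open("rows_part_01.csv", "rb") as f:       # repeat for each file/partition
        load_job = client.load_table_from_file(f, table_id, job_config=job_config)
        load_job.result()                           # wait for this load to finish

    print(client.get_table(table_id).num_rows, "rows loaded")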

What is the actual difference between Data Warehouse & Big Data?

a 夏天 Submitted on 2019-11-29 07:06:48
Question: I know what a Data Warehouse is and what Big Data is, but I am confused about Data Warehouse vs. Big Data. Are they the same thing under different names, or are they different (conceptually and physically)?

Answer 1: I know that this is an older thread, but there have been some developments in the last year or so. Comparing the data warehouse to Hadoop is like comparing apples to oranges. The data warehouse is a concept: clean, integrated data of high quality. I don't think the need for a data warehouse will go away

Spark dataframe: collect () vs select ()

早过忘川 Submitted on 2019-11-29 05:25:38
Question: Calling collect() on an RDD returns the entire dataset to the driver, which can cause an out-of-memory error, so we should avoid it. Will collect() behave the same way if called on a DataFrame? What about the select() method? Does it also work the same way as collect() when called on a DataFrame?

Answer 1: Actions vs. transformations. collect() (action): returns all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a
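A short PySpark sketch of the distinction (the app name and column names are arbitrary): select() only builds a new, lazily evaluated DataFrame, while collect() is an action that ships every resulting row to the driver, so on large data prefer take()/limit() or write the result out instead.

    # select() is a lazy transformation; collect() is an action that pulls rows to the driver.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("collect-vs-select").getOrCreate()
    df = spark.range(10_000_000).withColumnRenamed("id", "value")

    projected = df.select("value")            # transformation: nothing executes yet
    small = projected.filter(projected.value < 5)

    rows = small.collect()                    # action: safe only because the result is tiny
    sample = df.take(10)                      # prefer take()/limit() over collect() on big data
    print(rows[:3], sample[:3])

    spark.stop()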

Django + Postgres + Large Time Series

£可爱£侵袭症+ Submitted on 2019-11-29 03:57:29
I am scoping out a project with large, mostly-incompressible time series data, and wondering whether Django + Postgres with raw SQL is the right call. I have time series data arriving at ~2K objects/hour, every hour. This is about 2 million rows per year that I store, and I would like to 1) be able to slice off data for analysis through a connection, and 2) be able to do elementary overview work on the web, served by Django. I think the best idea is to use Django for the objects themselves, but drop down to raw SQL to deal with the large associated time series data. I see this as a hybrid approach; that might be a
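A rough sketch of that hybrid approach, with hypothetical model, table, and column names: keep the objects as ordinary Django models, and drop to raw SQL through Django's database connection for the heavy time-series aggregation so Postgres does the work.

    # Hybrid sketch: small Django model for the objects, raw SQL for bulk time-series reads.
    # Model, table and column names are hypothetical.
    from django.db import connection, models


    class Sensor(models.Model):
        name = models.CharField(max_length=100)


    def hourly_averages(sensor_id, start, end):
        """Aggregate directly in Postgres instead of materialising rows in Python."""
        with connection.cursor() as cursor:
            cursor.execute(
                """
                SELECT date_trunc('hour', ts) AS hour, avg(value)
                FROM timeseries_sample
                WHERE sensor_id = %s AND ts BETWEEN %s AND %s
                GROUP BY 1
                ORDER BY 1
                """,
                [sensor_id, start, end],
            )
            return cursor.fetchall()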

What is the status on Neo4j's horizontal scalability project Rassilon?

守給你的承諾、 Submitted on 2019-11-29 02:35:16
Question: Just wondering whether anyone has any information on the status of project Rassilon, Neo4j's side project focused on improving Neo4j's horizontal scalability? It was first announced in January 2013 here. I'm particularly interested in knowing when the graph-size limitation will be removed and when sharding across clusters will become available.

Answer 1: The node & relationship limits are going away in 2.1, which is the next release after 2.0 (which now has a release candidate).

How to view Apache Parquet file in Windows?

落花浮王杯 Submitted on 2019-11-29 02:29:45
Question: I couldn't find any plain-English explanation of Apache Parquet files, such as: What are they? Do I need Hadoop or HDFS to view/create/store them? How can I create Parquet files? How can I view Parquet files? Any help with these questions is appreciated.

Answer 1: What is Apache Parquet? Apache Parquet is a binary file format that stores data in a columnar fashion. Data inside a Parquet file is similar to an RDBMS-style table in that you have columns and rows. But instead of accessing
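As an illustration of the "no Hadoop needed" point, here is a small Python sketch using pandas with the pyarrow engine (a tooling choice of mine, not something the answer above prescribes) to create a Parquet file locally and read it back; the file name is arbitrary.

    # Create and inspect a Parquet file locally; no Hadoop or HDFS required.
    import pandas as pd
    import pyarrow.parquet as pq

    df = pd.DataFrame({"id": [1, 2, 3], "city": ["Oslo", "Lima", "Pune"]})
    df.to_parquet("example.parquet", engine="pyarrow", index=False)

    print(pd.read_parquet("example.parquet"))     # view the rows
    print(pq.read_schema("example.parquet"))      # view the column schema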

Generating a very large matrix of string combinations using combn() and bigmemory package

让人想犯罪 __ Submitted on 2019-11-29 02:26:37
I have a vector x of 1,344 unique strings. I want to generate a matrix that gives me all possible groups of three values, regardless of order, and export it to a CSV. I'm running R on EC2 on an m1.large instance with 64-bit Ubuntu. When using combn(x, 3) I get an out-of-memory error: Error: cannot allocate vector of size 9.0 Gb. The resulting matrix has C(1344, 3) = 403,716,544 rows and three columns, which is the transpose of the result of the combn() function. I thought of using the bigmemory package to create a file-backed big.matrix so I can then assign the results of the combn() function
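For comparison only, here is the same streaming idea sketched in Python rather than R's combn()/bigmemory: itertools.combinations yields one triple at a time, so rows can be written straight to CSV without ever allocating the 403,716,544-row matrix in memory.

    # Python analogue (not R): stream the C(n, 3) combinations directly to CSV.
    import csv
    from itertools import combinations

    strings = [f"s{i}" for i in range(1344)]   # placeholder for the real vector x

    with open("triples.csv", "w", newline="") as out:
        writer = csv.writer(out)
        for triple in combinations(strings, 3):
            writer.writerow(triple)            # one row at a time, constant memory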

Matrix multiplication using hdf5

China☆狼群 Submitted on 2019-11-29 01:55:42
Question: I'm trying to multiply two big matrices under a memory limit using HDF5 (PyTables), but numpy.dot gives me an error: ValueError: array is too big. Do I need to do the matrix multiplication myself, perhaps blockwise, or is there another Python function similar to numpy.dot?

    import numpy as np
    import time
    import tables
    import cProfile
    import numexpr as ne

    n_row = 10000
    n_col = 100
    n_batch = 10
    rows = n_row
    cols = n_col
    batches = n_batch
    atom = tables.UInt8Atom()  #?
    filters = tables.Filters
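One hedged sketch of the blockwise idea with NumPy and PyTables (shapes, file name, and block size are illustrative, not taken from the question): store the operands and the result as on-disk arrays and compute the product one row-block at a time, so only a block ever lives in RAM.

    # Blockwise matrix product with on-disk HDF5 arrays (PyTables).
    import numpy as np
    import tables

    n, k, m, block = 10000, 100, 10000, 1000

    with tables.open_file("matmul.h5", mode="w") as h5:
        A = h5.create_carray(h5.root, "A", tables.Float64Atom(), shape=(n, k))
        B = h5.create_carray(h5.root, "B", tables.Float64Atom(), shape=(k, m))
        C = h5.create_carray(h5.root, "C", tables.Float64Atom(), shape=(n, m))

        A[:] = np.random.rand(n, k)   # demo data
        B[:] = np.random.rand(k, m)

        B_mem = B[:]                  # B (100 x 10000) is small enough to keep in RAM
        for start in range(0, n, block):
            stop = min(start + block, n)
            C[start:stop, :] = A[start:stop, :] @ B_mem   # one row-block at a time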

What are the differences between Sort Comparator and Group Comparator in Hadoop?

别说谁变了你拦得住时间么 Submitted on 2019-11-29 01:28:24
What are the differences between Sort Comparator and Group Comparator in Hadoop?

Eswara Reddy Adapa: To understand GroupComparator, see my answer to this question: What is the use of grouping comparator in hadoop map reduce. SortComparator: used to define how map output keys are sorted. Excerpt from the book Hadoop: The Definitive Guide: Sort order for keys is found as follows: if the property mapred.output.key.comparator.class is set, either explicitly or by calling setSortComparatorClass() on Job, then an instance of that class is used. (In the old API the equivalent method is

What methods can we use to reshape VERY large data sets?

寵の児 Submitted on 2019-11-29 00:59:39
When calculations take a long time due to very large data and we therefore don't want them to crash, it is valuable to know beforehand which reshape method to use. Lately, methods for reshaping data have been further developed with regard to performance, e.g. data.table::dcast and tidyr::spread. dcast.data.table in particular seems to set the tone [1], [2], [3], [4]. This makes other methods, such as base R's reshape, look outdated and almost useless in benchmarks [5].

Theory

However, I've heard that reshape is still unbeatable when it comes to very large datasets (probably those exceeding