bigdata

HDFS as volume in cloudera quickstart docker

一曲冷凌霜 submitted on 2019-12-05 16:54:06
I am fairly new to both Hadoop and Docker. I have been working on extending the cloudera/quickstart Docker image Dockerfile and wanted to mount a directory from the host and map it to an HDFS location, so that performance is increased and data is persisted locally. When I mount a volume anywhere else with -v /localdir:/someDir, everything works fine, but that's not my goal. When I use -v /localdir:/var/lib/hadoop-hdfs, both the datanode and the namenode fail to start and I get: "cd /var/lib/hadoop-hdfs: Permission denied". And when I use -v /localdir:/var/lib/hadoop-hdfs/cache there is no permission-denied error, but the datanode…
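A minimal sketch of the usual workaround, assuming the HDFS daemons inside cloudera/quickstart run as the hdfs user in the hadoop group and are managed by the CDH init scripts (those user/group and service names are assumptions about the image, so verify them before relying on this): mount the host directory, then fix ownership inside the container and restart the daemons.

    # On the host: create the backing directory and start the container with
    # the volume mounted at the HDFS cache path (flags follow the image docs).
    mkdir -p /localdir
    docker run --hostname=quickstart.cloudera --privileged=true -t -i \
      -v /localdir:/var/lib/hadoop-hdfs/cache \
      cloudera/quickstart /usr/bin/docker-quickstart

    # Inside the container: give the HDFS user ownership of the mounted path,
    # then restart the HDFS daemons so they pick up the now-writable directory.
    chown -R hdfs:hadoop /var/lib/hadoop-hdfs/cache
    service hadoop-hdfs-namenode restart
    service hadoop-hdfs-datanode restart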

Bigtable performance influence column families

强颜欢笑 submitted on 2019-12-05 15:56:42
We are currently investigating the influence of using multiple column families on the performance of our Bigtable queries. We found that splitting the columns into multiple column families does not increase performance. Has anyone had similar experiences? Some more details about our benchmark setup: at the moment each row in our production table contains around 5 columns, each holding between 0.1 and 1 KB of data. All columns are stored in one column family. When performing a row key range filter (which returns on average 340 rows) and applying a column regex filter (which returns…
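For illustration only, a sketch of the kind of query described above using the Python client for Cloud Bigtable (google-cloud-bigtable); the project, instance, table, key range, and regex are placeholders, and the poster's actual benchmark code is not shown in the excerpt, so this is an approximation rather than their setup.

    from google.cloud import bigtable
    from google.cloud.bigtable import row_filters

    # Placeholder identifiers -- substitute your own project/instance/table.
    client = bigtable.Client(project="my-project")
    table = client.instance("my-instance").table("my-table")

    # Row key range scan combined with a column-qualifier regex filter,
    # roughly matching the benchmark described in the question.
    rows = table.read_rows(
        start_key=b"user#0001",
        end_key=b"user#0400",
        filter_=row_filters.ColumnQualifierRegexFilter(b"col_[a-c]"),
    )
    for row in rows:
        print(row.row_key)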

(R error) Error: cons memory exhausted (limit reached?)

主宰稳场 submitted on 2019-12-05 15:47:09
I am working with big data and I have a 70 GB JSON file. I am using the jsonlite library to load the file into memory. I have tried an AWS EC2 x1.16xlarge machine (976 GB RAM) to perform this load, but R breaks with the error Error: cons memory exhausted (limit reached?) after loading 1,116,500 records. Thinking that I did not have enough RAM, I tried to load the same JSON on a bigger EC2 machine with 1.95 TB of RAM. The process still broke after loading 1,116,500 records. I am using R version 3.1.1 and I am executing it using the --vanilla option. All other settings are default. Here is the code:
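A minimal sketch of one way around loading everything at once, assuming the 70 GB file is newline-delimited JSON (one record per line); jsonlite::stream_in processes the file in pages so only one chunk is in memory at a time. The handler, file names, and page size below are illustrative, not taken from the question.

    library(jsonlite)

    # Process the file in pages of 10,000 records instead of a single
    # fromJSON() call; each page arrives as a data frame.
    stream_in(file("big.json"), handler = function(df) {
      # Illustrative handler: append each chunk to a flat CSV on disk.
      write.table(df, "big_flat.csv", append = TRUE, sep = ",",
                  col.names = !file.exists("big_flat.csv"), row.names = FALSE)
    }, pagesize = 10000)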

Efficient operations of big non-sparse matrices in Matlab

守給你的承諾、 submitted on 2019-12-05 13:42:20
I need to operate on big 3-dimensional non-sparse matrices in Matlab. Using pure vectorization gives a high computation time, so I have tried to split the operations into 10 blocks and then combine the results. I was surprised to see that pure vectorization does not scale very well with the data size, as presented in the following figure. I include an example of the two approaches.

% Parameters:
M = 1e6; N = 50; L = 4; K = 10;

% Method 1: Pure vectorization
mat1 = randi(L,[M,N,L]);
mat2 = repmat(permute(1:L,[3 1 2]),M,N);
result1 = nnz(mat1>mat2)./(M+N+L);

% Method 2: Split computations
result2 =
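The excerpt cuts off before Method 2, so here is a minimal sketch of what a blocked version could look like under the stated parameters (splitting the first dimension into K chunks and accumulating the count); it reuses mat1, mat2, M, N, L and K from above and is an illustration, not the poster's original code.

% Method 2 (illustrative): split the M dimension into K blocks, accumulate
% nnz per block, then apply the same normalisation as Method 1.
blockSize = M / K;            % assumes M is divisible by K
count = 0;
for k = 1:K
    rows = (k-1)*blockSize + (1:blockSize);
    count = count + nnz(mat1(rows,:,:) > mat2(rows,:,:));
end
result2 = count ./ (M+N+L);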

key validation class type in cassandra UTF8 or LongType?

家住魔仙堡 submitted on 2019-12-05 13:37:05
Using Cassandra, I want to store 20 million+ row keys in a column family. My question is: is there a REAL performance difference between LongType and UTF8Type row keys? Is there any row key storage-size problem? My user keys look like this:

rowKey => 112512462152451
rowKey => 135431354354343
rowKey => 145646546546463
rowKey => 154354354354354
rowKey => 156454343435435
rowKey => 154435435435745

Cassandra stores all data on disk (including row key values) as a hex byte array. In terms of performance, the datatype of the row key really doesn't matter. The only place that it does matter is that the type validator/comparator…
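For illustration, a sketch of where the key validation class is declared in the old cassandra-cli (Thrift) schema, which is where the LongType-vs-UTF8Type choice actually shows up; the column family name is made up.

create column family Users
  with key_validation_class = LongType
  and comparator = UTF8Type
  and default_validation_class = UTF8Type;

With key_validation_class = LongType each key is stored as an 8-byte value, whereas the same digits as a UTF8Type string take one byte per character; for lookups both go through the partitioner's hash, so the practical difference is mostly storage and how the keys are displayed.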

Time-based drilldowns in Power BI powered by Azure Data Warehouse

社会主义新天地 submitted on 2019-12-05 13:24:28
I have designed a simple Azure Data Warehouse where I want to track the stock of my products on a periodic basis. I also want the ability to see that data grouped by month, week, day and hour, with the ability to drill down from top to bottom. I have defined 3 dimensions:

DimDate
DimTime
DimProduct

I have also defined a fact table to track product stocks:

FactStocks
- DateKey (20160510, 20160511, etc.)
- TimeKey (0..23)
- ProductKey (Product1, Product2)
- StockValue (number, 1..9999)

My fact sample data is below:

20160510  20  Product1  100
20160510  20  Product2   30
20160510  21  Product1  110
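A minimal T-SQL sketch of one level of the month -> week -> day -> hour drill-down this model supports; it assumes DimDate exposes Year, Month and Day columns and that FactStocks joins to DimDate on DateKey (any column name beyond those in the excerpt is an assumption). Because stock is a level rather than a flow, averaging across the hours of a day is usually more meaningful than summing.

-- Average daily stock per product; group on fewer or more columns to move
-- up or down the date/time hierarchy.
SELECT
    d.[Year],
    d.[Month],
    d.[Day],
    f.ProductKey,
    AVG(CAST(f.StockValue AS DECIMAL(10, 2))) AS AvgDailyStock
FROM FactStocks AS f
JOIN DimDate    AS d ON d.DateKey = f.DateKey
GROUP BY d.[Year], d.[Month], d.[Day], f.ProductKey
ORDER BY d.[Year], d.[Month], d.[Day], f.ProductKey;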

Is there maximum size of string data type in Hive?

ε祈祈猫儿з submitted on 2019-12-05 12:31:06
Question: I have Googled a ton but haven't found it anywhere. Or does that mean Hive can support an arbitrarily large string data type as long as the cluster allows it? If so, where can I find the largest size of the string data type that my cluster can support? Thanks in advance!

Answer 1: The current documentation for Hive lists STRING as a valid datatype, distinct from VARCHAR and CHAR. See the official Apache doc here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-Strings It…
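For illustration, a short HiveQL sketch contrasting STRING with the length-bounded types from the linked documentation, which allows VARCHAR lengths from 1 to 65535 and CHAR lengths up to 255; the table name is made up.

-- Illustrative table: STRING carries no declared length, the other two do.
CREATE TABLE string_types_demo (
  free_text STRING,          -- no length specifier allowed or required
  bounded   VARCHAR(65535),  -- maximum declarable VARCHAR length
  fixed     CHAR(255)        -- maximum declarable CHAR length
);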

Hadoop Nodemanager and Resourcemanager not starting

橙三吉。 submitted on 2019-12-05 11:40:20
I am trying to set up the latest Hadoop 2.2 single-node cluster on Ubuntu 13.10 64-bit. The OS is a fresh installation, and I have tried using both Java 6 64-bit and Java 7 64-bit. After following the steps from this link, and after failing, from this other link, I am not able to start the nodemanager with either of:

sbin/yarn-daemon.sh start nodemanager
sudo sbin/yarn-daemon.sh start nodemanager

or the resourcemanager with either of:

sbin/yarn-daemon.sh start resourcemanager
sudo sbin/yarn-daemon.sh start resourcemanager

Both fail with the error: starting nodemanager, logging to /home/hduser/yarn
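One common culprit on Hadoop 2.x single-node setups is a missing or mistyped auxiliary-service entry in yarn-site.xml; the real cause here is in the log file that the excerpt cuts off, so the snippet below is only a configuration worth double-checking, not a confirmed fix.

<!-- etc/hadoop/yarn-site.xml : on Hadoop 2.2+ the value must be
     mapreduce_shuffle (underscore), not mapreduce.shuffle. -->
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>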

How to get the first not null value from a column of values in Big Query?

我怕爱的太早我们不能终老 submitted on 2019-12-05 09:02:11
I am trying to extract the first non-null value from a column of values based on timestamp. Can somebody share your thoughts on this? Thank you.

What have I tried so far?

FIRST_VALUE( column ) OVER ( PARTITION BY id ORDER BY timestamp )

Input:
id,column,timestamp
1,NULL,10:30 am
1,NULL,10:31 am
1,'xyz',10:32 am
1,'def',10:33 am
2,NULL,11:30 am
2,'abc',11:31 am

Output (expected):
1,'xyz',10:30 am
1,'xyz',10:31 am
1,'xyz',10:32 am
1,'xyz',10:33 am
2,'abc',11:30 am
2,'abc',11:31 am

Try this old trick of string manipulation:

Select ID, Column, ttimestamp, LTRIM(Right(CColumn,20)) as CColumn,…
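For comparison, a sketch in BigQuery standard SQL of the usual way to get this result: FIRST_VALUE accepts IGNORE NULLS, and the window must span the whole partition so rows before the first non-null value also pick it up. The table and column names are placeholders (the column called "column" is renamed col_value here for clarity).

-- Placeholder table `project.dataset.events` with columns id, col_value, ts.
SELECT
  id,
  FIRST_VALUE(col_value IGNORE NULLS) OVER (
    PARTITION BY id
    ORDER BY ts
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS first_non_null,
  ts
FROM `project.dataset.events`
ORDER BY id, ts;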

Long lag time importing large .CSV's in R WITH header in second row

天涯浪子 submitted on 2019-12-05 08:45:39
I am working on developing an application that ingests data from .csv files and then does some calculations on it. The challenge is that the .csv files can be very large. I have reviewed a number of posts here discussing the import of large .csv files using various functions and libraries. Some examples are below:

### size of csv file: 689.4MB (7,009,728 rows * 29 columns) ###
system.time(read.csv('../data/2008.csv', header = T))
#   user  system elapsed
# 88.301   2.416  90.716

library(data.table)
system.time(fread('../data/2008.csv', header = T, sep = ','))
#  user  system elapsed
# 4.740   0.048   4.785
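Since the title mentions that the header sits in the second row, a minimal sketch of handling that with data.table::fread is shown below; the skip value assumes exactly one junk line precedes the header, which is an assumption about the files rather than something stated in the excerpt.

library(data.table)

# Skip the first (non-header) line so fread treats the second row as the header.
dt <- fread('../data/2008.csv', skip = 1, header = TRUE, sep = ',')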