bigdata

Matrix multiplication using hdf5

那年仲夏 submitted on 2019-11-30 04:33:28
I'm trying to multiply two big matrices under a memory limit using hdf5 (pytables), but numpy.dot gives me an error: ValueError: array is too big. I need to do the matrix multiplication myself, maybe blockwise, or is there another Python function similar to numpy.dot?

import numpy as np
import time
import tables
import cProfile
import numexpr as ne

n_row = 10000
n_col = 100
n_batch = 10
rows = n_row
cols = n_col
batches = n_batch
atom = tables.UInt8Atom()  #?
filters = tables.Filters(complevel=9, complib='blosc')  # tune parameters
fileName_a = 'C:\carray_a.h5'
shape_a = (rows*batches,
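
A minimal sketch of the blockwise approach (not from the original thread): never materialize the full operands, read and multiply one block of rows at a time so only small numpy arrays live in memory. The file name, array names, shapes, and block size below are illustrative:

import numpy as np
import tables

def blockwise_dot(a, b, out, bs=1000):
    # a: (m, n), b: (n, p), out: (m, p); all three are on-disk CArrays.
    m, n = a.shape
    for i in range(0, m, bs):
        acc = np.zeros((min(bs, m - i), b.shape[1]), dtype=np.float64)
        for k in range(0, n, bs):
            # Slicing a CArray pulls only that block into memory as a numpy array.
            acc += np.dot(a[i:i + bs, k:k + bs], b[k:k + bs, :])
        out[i:i + bs, :] = acc

h5 = tables.open_file('mm.h5', 'w')
a = h5.create_carray('/', 'a', tables.Float32Atom(), (10000, 2000))
b = h5.create_carray('/', 'b', tables.Float32Atom(), (2000, 500))
out = h5.create_carray('/', 'out', tables.Float32Atom(), (10000, 500))
blockwise_dot(a, b, out)
h5.close()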

How to handle large amounts of data in TensorFlow?

左心房为你撑大大i submitted on 2019-11-30 03:13:45
Question: For my project I have a large amount of data, about 60GB spread across npy files, each holding about 1GB and containing about 750k records plus labels. Each record is 345 float32 values and each label is 5 float32 values. I read the tensorflow dataset documentation and the queues/threads documentation as well, but I can't figure out how best to handle the input for training, and then how to save the model and weights for future predicting. My model is pretty straightforward, it looks like this: x = tf
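
One common pattern (a sketch, not the asker's code) is to stream the .npy files through the tf.data API with a Python generator, memory-mapping each file so only the current batch is read. The file names, the 345/5 column split, and the assumption that features and labels sit in one array are all illustrative:

import numpy as np
import tensorflow as tf

def npy_batches(paths, batch_size=256):
    # Walk the 1 GB .npy files one at a time; mmap avoids loading 60 GB.
    for path in paths:
        data = np.load(path, mmap_mode='r')
        for i in range(0, len(data), batch_size):
            chunk = np.asarray(data[i:i + batch_size])
            yield chunk[:, :345], chunk[:, 345:350]  # features, labels

paths = ['part_%03d.npy' % i for i in range(60)]  # hypothetical file names
dataset = tf.data.Dataset.from_generator(
    lambda: npy_batches(paths),
    output_types=(tf.float32, tf.float32),
    output_shapes=((None, 345), (None, 5)))

For saving the trained weights, tf.train.Saver (TF 1.x) or the SavedModel format are the standard mechanisms.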

How to sort word count by value in Hadoop? [duplicate]

北慕城南 submitted on 2019-11-30 01:50:33
Question: This question already has answers here: hadoop map reduce secondary sorting (5 answers). Closed 6 years ago. Hi, I wanted to learn how to sort the word count by value in Hadoop. I know Hadoop takes care of sorting keys, but not values. I know that to sort by value we need a partitioner, a grouping comparator and a sort comparator, but I am a bit confused about applying these concepts together to sort the word count by value. Do we need another map reduce job to achieve the same, or else a combiner to
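
The usual approach (sketched here, not taken from the linked answers) is a second, small MapReduce job over the word-count output that swaps each (word, count) pair, so the framework's shuffle sorts by count. With Hadoop Streaming the mapper and reducer can be a few lines of Python; zero-padding the count makes the default text sort behave numerically (Hadoop's numeric key comparator is the alternative to padding):

#!/usr/bin/env python
# swap_mapper.py: emit count<TAB>word so the shuffle sorts by count
import sys
for line in sys.stdin:
    word, count = line.rstrip('\n').split('\t')
    print('%010d\t%s' % (int(count), word))

#!/usr/bin/env python
# swap_reducer.py: swap back to word<TAB>count after the sort
import sys
for line in sys.stdin:
    count, word = line.rstrip('\n').split('\t')
    print('%s\t%d' % (word, int(count)))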

Can Azure Search be used as a primary database for some data?

房东的猫 submitted on 2019-11-30 01:24:16
Question: Microsoft promotes Azure Search as "cloud search", but doesn't necessarily say it's a "database" or "data storage". It stops short of saying it's big data. Can/should Azure Search be used as the primary database for some data? Or should there always be some "primary" datastore that is "duplicated" in Azure Search for search purposes? If so, what circumstances/scenarios make it sensible to use Azure Search as a primary database?

Answer 1: Although we generally don't recommend it, you might

Memory limits in data table: negative length vectors are not allowed

≯℡__Kan透↙ submitted on 2019-11-30 01:16:42
Question: I have a data table with several social media users and their followers. The original data table has the following format:

X.USERID FOLLOWERS
1081     4053807021,2476584389,4713715543, ...

So each row contains a user ID together with a comma-separated vector of followers. In total I have 24,000 unique user IDs and 160,000,000 unique followers. I wish to convert my original table into the following format:

   X.USERID  FOLLOWERS
1:     1081 4053807021
2:     1081 2476584389
3:
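
For illustration only, the same reshaping in Python/pandas rather than R's data.table (the question itself is about R; the column names come from the example above, and at 160M rows the result should be built in chunks or kept in a compact integer type):

import pandas as pd

df = pd.DataFrame({'X.USERID': [1081],
                   'FOLLOWERS': ['4053807021,2476584389,4713715543']})
long_df = (df.assign(FOLLOWERS=df['FOLLOWERS'].str.split(','))
             .explode('FOLLOWERS')
             .reset_index(drop=True))
print(long_df)  # one row per (user, follower) pair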

How to get array/bag of elements from Hive group by operator?

你说的曾经没有我的故事 submitted on 2019-11-30 01:00:11
Question: I want to group by a given field and get the output with the grouped fields. Below is an example of what I am trying to achieve. Imagine a table named 'sample_table' with two columns as below:

F1  F2
001 111
001 222
001 123
002 222
002 333
003 555

I want to write a Hive query that will give the below output:

001 [111, 222, 123]
002 [222, 333]
003 [555]

In Pig, this can be achieved very easily by something like this:

grouped_relation = GROUP sample_table BY F1;

Can somebody please suggest if
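
Hive's built-in collect_list / collect_set aggregates do exactly this (for example SELECT F1, collect_list(F2) FROM sample_table GROUP BY F1). The same aggregation sketched in PySpark against the Hive table, with the table and column names taken from the example above:

from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_list

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
result = (spark.table('sample_table')
               .groupBy('F1')
               .agg(collect_list('F2').alias('F2_list')))
result.show()  # one row per F1 with an array of F2 values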

What should be considered before choosing HBase?

為{幸葍}努か submitted on 2019-11-30 00:22:29
Question: I am very new to the big data space. We got a suggestion from our team that we should use HBase instead of an RDBMS for high performance. We have no idea what should/must be considered before switching from an RDBMS to HBase. Any ideas?

Answer 1: One of my favourite books describes it. Coming to @Whitefret's last point: there is something called the CAP theorem, based on which the decision can be taken.
Consistency (all nodes see the same data at the same time)
Availability (every request receives a response about whether it

How to set the data block size in Hadoop? Is it advantageous to change it?

*爱你&永不变心* submitted on 2019-11-30 00:15:01
Question: If we can change the data block size in Hadoop, please let me know how to do that. Is it advantageous to change the block size? If yes, then let me know why and how. If no, then let me know why.

Answer 1: There seems to be much confusion about this topic, and also wrong advice going around. To lift the confusion it helps to think about how HDFS is actually implemented: HDFS is an abstraction over distributed disk-based file systems. So the words "block" and "blocksize" have a different
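
For the "how" part: the block size is a per-file, client-side setting controlled by the dfs.blocksize property (Hadoop 2+). A sketch of the two usual places to set it; the 128 MB and 256 MB values are only examples:

<!-- hdfs-site.xml: default block size for files written by this client -->
<property>
  <name>dfs.blocksize</name>
  <value>134217728</value> <!-- 128 MB -->
</property>

# Or override it for a single upload from the command line:
hadoop fs -D dfs.blocksize=268435456 -put localfile /user/data/  # 256 MB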

How to restart a failed task on Airflow

流过昼夜 submitted on 2019-11-30 00:03:44
I am using a LocalExecutor and my dag has 3 tasks, where task C is dependent on task A. Task B and task A can run in parallel, something like below:

A --> C
B

Task A has failed, but task B ran fine. Task C is yet to run, as task A has failed. My question is: how do I re-run task A alone, so that task C runs once task A completes and the Airflow UI marks them as success?

In the UI:
Go to the dag, and the dag run of the run you want to change
Click on GraphView
Click on task A
Click "Clear"

This will let task A run again, and if it succeeds, task C should run. This works because when you clear a
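
The same thing can be done from the command line instead of the UI; a sketch for the Airflow 1.x CLI, with placeholder dag/task ids and dates:

# Clear task A (and, with -d, everything downstream of it) for a date range;
# the scheduler then re-runs the cleared task instances.
airflow clear -t task_a -d -s 2019-11-29 -e 2019-11-30 my_dag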

How to use apply or sapply or lapply with ffdf?

China☆狼群 submitted on 2019-11-29 23:55:27
Question: Is there a way to use an apply-type construct directly on the columns of an ffdf object? I am trying to count the NAs in each column without having to turn it into a standard data frame. I can get the NA count for an individual column using:

sum(is.na(ffdf$columnname))

But is there a way to do this for all the columns in the dataframe at once, something like:

lapply(ffdf, function(x){sum(is.na(x))})

When I run this I get:

$virtual
[1] 0
$physical
[1] 0
$row.names
[1] 0

I have not been able
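
The output above suggests lapply is walking the ffdf object's internal slots (virtual, physical, row.names) rather than its columns. The question is R-specific, but the underlying idea, a chunked column-wise reduction over out-of-core data, can be sketched in Python with a memory-mapped array; the file name and chunk size are made up:

import numpy as np

arr = np.lib.format.open_memmap('data.npy', mode='r')  # on-disk 2-D float array
na_per_col = np.zeros(arr.shape[1], dtype=np.int64)
for start in range(0, arr.shape[0], 100000):  # 100k rows per chunk
    chunk = np.asarray(arr[start:start + 100000])
    na_per_col += np.isnan(chunk).sum(axis=0)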