bigdata

BigQuery: is it possible to execute another query inside a UDF?

Submitted by 泄露秘密 on 2019-12-24 00:42:59

Question: I have a table that records a row for each unique user per day, with some aggregated stats for that user on that day, and I need to produce a report that tells me, for each day, the number of unique users in the last 30 days including that day. E.g. for Aug 31st it'll count the unique users from Aug 2nd to Aug 31st; for Aug 30th it'll count the unique users from Aug 1st to Aug 30th; and so on... I've looked at some related questions but they aren't quite what I need - if a user logs in on multiple
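The trailing-30-day distinct count the question describes can be sketched in plain Python (the function name and the `(day, user)` input shape are hypothetical, chosen for illustration):

```python
from collections import defaultdict
from datetime import date, timedelta

def rolling_unique_users(rows, window_days=30):
    """rows: iterable of (day, user_id) pairs, one per user per day.
    Returns {day: distinct users seen in the window_days-day window
    ending on that day, inclusive}."""
    users_by_day = defaultdict(set)
    for day, user in rows:
        users_by_day[day].add(user)
    counts = {}
    for day in users_by_day:
        start = day - timedelta(days=window_days - 1)
        # Union the user sets of every day that falls in the window.
        window_users = set()
        for d, users in users_by_day.items():
            if start <= d <= day:
                window_users |= users
        counts[day] = len(window_users)
    return counts
```

In BigQuery itself a rolling `COUNT(DISTINCT ...)` is typically expressed with a date self-join or an analytic query rather than from inside a UDF, since UDFs cannot issue their own queries.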

Flink: how does the parallelism set in the Jobmanager UI relate to task slots?

Submitted by 别说谁变了你拦得住时间么 on 2019-12-23 20:12:04

Question: Let's say I have 8 task managers with 16 task slots. If I submit a job using the Jobmanager UI and set the parallelism to 8, do I only utilise 8 task slots? What if I have 8 task managers with 8 slots, and submit the same job with a parallelism of 8? Is it exactly the same thing, or is there a difference in the way the data is processed? Thank you. Answer 1: The total number of task slots in a Flink cluster defines the maximum parallelism, but the number of slots used may exceed the actual

Cloudera Manager. Failed to detect Cloudera Manager Server

Submitted by 拥有回忆 on 2019-12-23 18:24:22

Question: I have two PCs with CentOS 6.5:
client86-101.aihs.net 80.94.86.101
client86-103.aihs.net 80.94.86.103
cloudera-manager-server is installed on client86-101.aihs.net. I have a problem detecting the Cloudera Manager Server (3rd step of the cluster installation). Issue trace:
BEGIN host -t PTR 80.94.86.101
101.86.94.80.in-addr.arpa domain name pointer client86-101.aihs.net.
END (0)
using client86-101.aihs.net as scm server hostname
BEGIN which python
END (0)
BEGIN python -c 'import socket; import sys; s

Fast way to “flatten” hierarchy table?

Submitted by 允我心安 on 2019-12-23 17:57:47

Question: I've got a huge table with a hierarchy that cannot be modified. Nodes in the table have an Id, a ParentId, a Level, and some data. The Level means that a node at level N can be a child not only of a node at level N-1 but also of one at N-2, N-3, etc. The good news is that the number of levels is limited: there are only 8 of them. Level 1 is the top of the hierarchy and level 8 is the bottom. Now I need to flatten that table with respect to the place of the levels. The result should be
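The flattening logic can be sketched in Python (the in-memory `{id: (parent_id, level)}` representation is hypothetical; the real table would be read from the database): walking parent pointers at most 8 steps per node fills one output column per level, and handles children that skip levels.

```python
def flatten_hierarchy(nodes, max_level=8):
    """nodes: {node_id: (parent_id, level)}; parent_id is None at the top.
    Returns {node_id: row of length max_level, where slot i holds the
    ancestor id at level i+1 (None if that level is skipped)}."""
    flat = {}
    for node_id in nodes:
        row = [None] * max_level
        cur = node_id
        # Walk up the parent chain; at most max_level iterations.
        while cur is not None:
            parent, level = nodes[cur]
            row[level - 1] = cur
            cur = parent
        flat[node_id] = row
    return flat
```

In SQL the same idea is usually an 8-way chain of self-joins (or a recursive CTE), one join per possible parent level.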

How can I reduce key-value pairs to a key and a list of values?

Submitted by 风格不统一 on 2019-12-23 17:18:01

Question: Let us assume I have key-value pairs in Spark, such as the following. [ (Key1, Value1), (Key1, Value2), (Key1, Value3), (Key2, Value4), (Key2, Value5) ] Now I want to reduce this to something like this. [ (Key1, [Value1, Value2, Value3]), (Key2, [Value4, Value5]) ] That is, from key-value to key-list of values. How can I do that using the map and reduce functions in Python or Scala? Answer 1: collections.defaultdict can be the solution: https://docs.python.org/2/library/collections.html
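The defaultdict approach the answer points at looks like this in plain Python (function name is illustrative):

```python
from collections import defaultdict

def group_values(pairs):
    """Collapse an iterable of (key, value) pairs into {key: [values]},
    preserving the order in which values appear."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)
```

In Spark itself the equivalent operation is `rdd.groupByKey().mapValues(list)`, though `reduceByKey` is usually preferred when the values can be combined incrementally, since it aggregates on each partition before shuffling.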

Is `ls -f | grep -c .` the fastest way to count files in directory, when using POSIX / Unix system (Big Data)?

Submitted by 江枫思渺然 on 2019-12-23 16:28:07

Question: I used to do `ls path-to-whatever | wc -l`, until I discovered that it actually consumes a huge amount of memory. Then I moved to `find path-to-whatever -name "*" | wc -l`, which seems to consume a much more graceful amount of memory, regardless of how many files you have. Then I learned that ls is mostly slow and less memory-efficient due to sorting the results. By using `ls -f | grep -c .`, one will get very fast results; the only problem is filenames which might have line breaks in them. However, that

How can I perform data lineage in GCP?

Submitted by 只谈情不闲聊 on 2019-12-23 15:15:08

Question: When we build the data lake with GCP Cloud Storage, and data processing with Cloud services such as Dataproc and Dataflow, how can we generate a data lineage report in GCP? Thanks. Answer 1: Google Cloud Platform doesn't have a serverless data lineage offering. Instead, you may want to install Apache Atlas on Google Cloud Dataproc and use it for data lineage. Answer 2: If data lineage is important for you, you will find yourself wanting an Enterprise Data Cloud. Cloudera is the main supplier in this space,

Count number of records in a column family in an HBase table

Submitted by 喜你入骨 on 2019-12-23 14:02:13

Question: I'm looking for an HBase shell command that will count the number of records in a specified column family. I know I can run: `echo "scan 'table_name'" | hbase shell | grep column_family_name | wc -l`, however this will run much slower than the standard counting command `count 'table_name', CACHE => 50000` (because of the use of CACHE => 50000), and worse, it doesn't return the real number of records but something like the total number of cells (if I'm not mistaken?) in the specified column
