bigdata

Could not deallocate container for task attemptId NNN

Submitted by ぃ、小莉子 on 2019-12-11 03:31:02
Question: I'm trying to understand how YARN allocates memory to containers and how they perform on different hardware configurations. The machine has 30 GB of RAM; I picked 24 GB for YARN and left 6 GB for the system: yarn.nodemanager.resource.memory-mb=24576. Then I followed http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.6.0/bk_installing_manually_book/content/rpm-chap1-11.html to come up with some values for the Map and Reduce task memory. I left these two at their default values:
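A minimal sketch of the container-sizing arithmetic from the linked HDP guide, as I recall it; the core count, disk count, and minimum container size below are assumptions for illustration, not values from the question:

    YARN_RAM_MB = 24576      # yarn.nodemanager.resource.memory-mb from the question
    CORES = 8                # assumed number of cores on the node
    DISKS = 2                # assumed number of data disks
    MIN_CONTAINER_MB = 2048  # the guide's suggested minimum container size for nodes in this RAM range

    # Number of containers and memory per container, per the guide's formula
    containers = int(min(2 * CORES, 1.8 * DISKS, YARN_RAM_MB / MIN_CONTAINER_MB))
    ram_per_container_mb = max(MIN_CONTAINER_MB, YARN_RAM_MB // containers)

    settings = {
        "yarn.scheduler.minimum-allocation-mb": ram_per_container_mb,
        "yarn.scheduler.maximum-allocation-mb": containers * ram_per_container_mb,
        "mapreduce.map.memory.mb": ram_per_container_mb,
        "mapreduce.reduce.memory.mb": 2 * ram_per_container_mb,
        "mapreduce.map.java.opts": "-Xmx%dm" % int(0.8 * ram_per_container_mb),
        "mapreduce.reduce.java.opts": "-Xmx%dm" % int(0.8 * 2 * ram_per_container_mb),
    }
    for key, value in sorted(settings.items()):
        print(key, "=", value)

The point of the 0.8 factor is that the JVM heap (-Xmx) stays at roughly 80% of the container size, so the process fits inside the container YARN allocates for it.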

Transform data in Google BigQuery - extract text, split it into multiple columns, and pivot the data

Submitted by 断了今生、忘了曾经 on 2019-12-11 03:08:56
Question: I have some weblog data in BigQuery which I need to transform to make it easier to use and query. The data looks like: I want to extract and transform the data within the curly brackets after Results{…..} (colored blue). The data is of the form ‘(\d+((PQ)|(KL))+\d+)’ and there can be 1-20+ entries in the results array. I am only interested in the first 16 entries. I have been able to extract the data within the curly brackets into a new column, using SUBSTR and REGEXP_EXTRACT. But I'm unable to
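A minimal sketch in Python of the extraction-and-split step, using the pattern from the question; the sample line is made up for illustration:

    import re

    # Hypothetical sample line; the real data comes from the BigQuery table.
    log_line = "GET /page?x=1 Results{123PQ456,789KL012,345PQ678} status=200"

    inner = re.search(r"Results\{(.*?)\}", log_line).group(1)   # text inside the curly brackets
    entries = re.findall(r"\d+(?:PQ|KL)+\d+", inner)[:16]       # keep only the first 16 entries
    columns = entries + [None] * (16 - len(entries))            # pad so there are always 16 columns

    print(columns)

Inside BigQuery itself the same idea can be expressed with REGEXP_EXTRACT_ALL to get the repeated entries as an array and then indexing into that array to produce one column per position.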

Google BigQuery queries are slow

Submitted by 本秂侑毒 on 2019-12-11 03:04:05
Question: I am using Google BigQuery and I am executing some simple queries from PHP (e.g. SELECT * from emails WHERE email='mail@test.com'). I am just checking whether the email exists in the table. The table "emails" is empty for now, yet the PHP script still takes around 4 minutes to check 175 emails against the empty table. In the future the table will be filled and will hold 500,000 emails, so I expect the request time to be even longer. Is that normal? Or are there any ideas/solutions to improve the
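Each BigQuery query runs as its own job with a fixed overhead of a second or more regardless of table size, so 175 separate lookups add up; batching the check into a single query avoids that. A minimal sketch in Python (the question uses PHP, but the idea is the same); the dataset and table names are assumptions:

    from google.cloud import bigquery

    client = bigquery.Client()  # assumes credentials are configured in the environment

    emails = ["mail@test.com", "other@test.com"]  # the addresses to check
    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ArrayQueryParameter("emails", "STRING", emails)]
    )
    sql = "SELECT email FROM `my_dataset.emails` WHERE email IN UNNEST(@emails)"

    found = {row.email for row in client.query(sql, job_config=job_config).result()}
    missing = [e for e in emails if e not in found]
    print(missing)

One round trip instead of 175 keeps the total time close to the latency of a single query.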

Sklearn-GMM on large datasets

Submitted by 老子叫甜甜 on 2019-12-11 02:48:33
Question: I have a large dataset (I can't fit the entire data in memory). I want to fit a GMM on this dataset. Can I call GMM.fit() (sklearn.mixture.GMM) repeatedly on mini-batches of the data? Answer 1: There is no reason to fit it repeatedly. Just randomly sample as many data points as you think your machine can compute in a reasonable time. If variation is not very high, the random sample will have approximately the same distribution as the full dataset. randomly_sampled = np.random.choice(full_dataset, size
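A minimal sketch of the sampling approach the answer describes; note that np.random.choice needs a 1-D input, so the usual trick is to sample row indices rather than the 2-D array itself. The sample size and component count below are assumptions:

    import numpy as np
    from sklearn.mixture import GaussianMixture  # GMM was renamed GaussianMixture in newer scikit-learn

    # Stand-in for the real data; in practice this would be a memory-mapped array
    # (e.g. np.load(..., mmap_mode="r")) so the full dataset never has to sit in RAM.
    full_dataset = np.random.randn(1_000_000, 4)

    # Sample 100k row indices without replacement, then take those rows.
    idx = np.random.choice(full_dataset.shape[0], size=100_000, replace=False)
    sample = full_dataset[idx]

    gmm = GaussianMixture(n_components=8).fit(sample)
    print(gmm.means_.shape)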

Logging all Presto queries

Submitted by 风流意气都作罢 on 2019-12-11 02:29:02
Question: How can I store all queries submitted to a Presto cluster in a file (an ORC file) or perhaps some other database? The purpose is to keep a record of all queries executed on the Presto workers. I am aware that I need to override the queryCompleted method, and I have tried to follow this and the other link mentioned there, but I am unable to create a correct jar using Maven. After deploying the jar file generated by Maven, my Presto stopped working. I am new to Presto as well as to Maven. It would be
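This is not the EventListener/queryCompleted plugin the question is about, but as a lighter-weight alternative (assuming the coordinator's /v1/query REST endpoint, the one that backs the web UI, is reachable) a small poller can dump query records to a file without building a jar:

    import json
    import time
    import urllib.request

    COORDINATOR = "http://presto-coordinator:8080"  # assumed coordinator address

    def poll_queries(out_path="presto_queries.jsonl", interval_s=10):
        """Append every query the coordinator reports to a JSON-lines file, once per query id."""
        seen = set()
        while True:
            with urllib.request.urlopen(COORDINATOR + "/v1/query") as resp:
                for q in json.load(resp):
                    qid = q.get("queryId")
                    if qid and qid not in seen:
                        seen.add(qid)
                        with open(out_path, "a") as f:
                            f.write(json.dumps(q) + "\n")
            time.sleep(interval_s)

    poll_queries()

The JSON-lines output can later be converted to ORC or loaded into a database; note the coordinator only keeps a bounded history of finished queries, so the poll interval has to be shorter than that retention window.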

When does the Hadoop framework create a checkpoint (expunge) of its “Current” directory in trash?

Submitted by 穿精又带淫゛_ on 2019-12-11 01:45:04
Question: For a long time I have observed that the Hadoop framework sets a checkpoint on the trash Current directory irrespective of the time interval, whereas it permanently deletes the file/directory within the specified deletion interval after creating the automatic checkpoint. Here is what I have tested: vi core-site.xml <property> <name>fs.trash.interval</name> <value>5</value> </property> hdfs dfs -put LICENSE.txt / hdfs dfs -rm /LICENSE.txt fs.TrashPolicyDefault: Namenode trash configuration: Deletion
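Checkpointing is driven by fs.trash.checkpoint.interval (which falls back to fs.trash.interval when it is 0 or unset), so the emptier thread's own schedule, not the moment you delete a file, decides when Current is rolled into a timestamped checkpoint. A small sketch of the same test driven from Python; the trash path assumes the hdfs user:

    import subprocess

    def hdfs(*args):
        """Run an 'hdfs dfs' command and return its output (assumes 'hdfs' is on the PATH)."""
        return subprocess.run(["hdfs", "dfs", *args], capture_output=True, text=True).stdout

    hdfs("-put", "LICENSE.txt", "/")
    hdfs("-rm", "/LICENSE.txt")                       # moves the file into .Trash/Current
    print(hdfs("-ls", "-R", "/user/hdfs/.Trash"))     # assumed trash location for the hdfs user

    # Force a checkpoint right away instead of waiting for the emptier thread:
    subprocess.run(["hdfs", "dfs", "-expunge"])
    print(hdfs("-ls", "-R", "/user/hdfs/.Trash"))     # Current has been rolled into a timestamped checkpoint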

How is Flume distributed?

Submitted by 白昼怎懂夜的黑 on 2019-12-11 01:05:34
Question: I am working with Flume to ingest a huge amount of data into HDFS (on the order of petabytes). I would like to know how Flume makes use of its distributed architecture. I have over 200 servers, and I have installed Flume on one of them, which is where I get the data from (i.e. the data source); the sink is HDFS (Hadoop is running over Serengeti on these servers). I am not sure whether Flume distributes itself over the cluster or whether I have installed it incorrectly. I followed Apache's user guide for

Is there a function equivalent to Hive's 'explode' function in Apache Impala?

Submitted by 萝らか妹 on 2019-12-11 01:01:40
Question: Hive's explode function is documented here. It is essentially a very practical function that generates many rows from a single one. Its basic version takes a column whose value is an array of values and produces a copy of the same row for each of those values. I wonder whether such a thing exists in Impala; I haven't been able to find it in the documentation. Answer 1: Impala does not have any function like Hive's EXPLODE for reading complex data types and generating multiple rows. Currently through
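Impala can still flatten an ARRAY column by joining the table with the column itself (complex types are supported for Parquet-backed tables). A minimal sketch using the impyla client, where the host, table, and column names are assumptions:

    from impala.dbapi import connect  # impyla package

    conn = connect(host="impala-daemon-host", port=21050)  # assumed Impala daemon address
    cur = conn.cursor()

    # For a table like complex_tbl(id INT, tags ARRAY<STRING>) stored as Parquet,
    # joining the table with its own array column yields one row per array element,
    # similar to Hive's explode(); item is the pseudo-column for scalar array elements.
    cur.execute("SELECT t.id, a.item FROM complex_tbl t, t.tags a")
    for row in cur.fetchall():
        print(row)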

Filter partial duplicates with mapWithState Spark Streaming

Submitted by 风格不统一 on 2019-12-11 00:27:50
Question: We have a DStream, such as: val ssc = new StreamingContext(sc, Seconds(1)) val kS = KafkaUtils.createDirectStream[String, TMapRecord]( ssc, PreferConsistent, Subscribe[String, TMapRecord](topicsSetT, kafkaParamsInT)). mapPartitions(part => { part.map(_.value()) }). mapPartitions(part1 => { part1.map(c => { TMsg(1, c.field1, c.field2, //And others c.startTimeSeconds ) }) }) So each RDD has a bunch of TMsg objects with some (technical) key fields I can use to deduplicate the DStream.
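The question's code is Scala, where mapWithState is available; as a hedged sketch of the same keep-only-the-first-record idea in PySpark (which only offers updateStateByKey), with the key extraction, sample data, and checkpoint path as assumptions:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="dedup-sketch")
    ssc = StreamingContext(sc, 1)
    ssc.checkpoint("/tmp/dedup-checkpoint")  # stateful operations need a checkpoint directory

    # Stand-in for the Kafka-backed stream: (dedup_key, message) pairs including a duplicate key.
    keyed = ssc.queueStream([sc.parallelize([("k1", "msg-a"), ("k1", "msg-a-dup"), ("k2", "msg-b")])])

    def keep_first(new_values, state):
        """State per key is (message, is_new_this_batch)."""
        if state is not None:
            return (state[0], False)          # key already seen: no longer new
        if new_values:
            return (new_values[0], True)      # first time this key appears
        return None

    deduped = (keyed.updateStateByKey(keep_first)
                    .filter(lambda kv: kv[1][1])            # keep only keys first seen this batch
                    .map(lambda kv: (kv[0], kv[1][0])))     # back to (key, message)
    deduped.pprint()

    ssc.start()
    ssc.awaitTerminationOrTimeout(10)
    ssc.stop()

Unlike mapWithState, updateStateByKey keeps every key forever, so for an unbounded key space the update function should eventually return None (for example based on an age field kept in the state) to let old keys expire.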

Smartest way to store huge amounts of data

Submitted by 試著忘記壹切 on 2019-12-10 21:39:54
Question: I want to access the Flickr API with REST requests and download the metadata of approximately 1 million photos (maybe more). I want to store it in a .csv file and then import it into a MySQL database for further processing. I am wondering what the smartest way to handle such a large amount of data is. What I am not sure about is how to store the data after fetching it in Python, write it to the .csv file, and get it from there into the database. That's one big question mark. What's happening now (to my understanding,
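A minimal sketch of the CSV step in Python, streaming batches to disk so the full million rows never sit in memory; the field names are hypothetical placeholders for whatever metadata the API calls return:

    import csv
    import os

    FIELDS = ["photo_id", "title", "owner", "date_taken", "tags"]  # hypothetical metadata columns

    def append_rows(rows, path="flickr_metadata.csv"):
        """Append one batch of metadata dicts to the CSV, writing the header only once."""
        write_header = not os.path.exists(path) or os.path.getsize(path) == 0
        with open(path, "a", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=FIELDS, extrasaction="ignore")
            if write_header:
                writer.writeheader()
            writer.writerows(rows)

    # Example batch, standing in for one page of API results:
    append_rows([{"photo_id": 1, "title": "sunset", "owner": "u1", "date_taken": "2019-01-01", "tags": "sky"}])

For the import step, MySQL's LOAD DATA INFILE on the finished file is usually much faster than inserting the rows one by one from Python.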