bigdata

count objects in array Big Data

折月煮酒 submitted on 2019-12-25 01:55:06
Question: Hi, how can I count how many objects are in this object of arrays? The children objects also contain arrays of objects, which is not really easy to follow... I looked on Stack Overflow but didn't find a helpful answer; for example, I looked at this question: Link. Maybe I can do it recursively. Does somebody have an idea? There can also be hundreds of arrays of objects. This is only an example: let DATA = { "name": "flare", "children": [{ "name": "analytics", "children": [{ "name": "cluster", "children": [
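
Since the example data is truncated, the following is only a minimal sketch of the recursive idea, written in Python over the JSON loaded into nested dicts/lists; it assumes each object may carry a "children" list, as in the example:

    import json

    def count_objects(node):
        # Count this object plus every object nested under its "children" key.
        total = 1
        for child in node.get("children", []):
            total += count_objects(child)
        return total

    # Hypothetical usage with data shaped like the truncated DATA example above.
    data = json.loads('{"name": "flare", "children": [{"name": "analytics", "children": []}]}')
    print(count_objects(data))  # -> 2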

does sqoop support dynamic partitioning with hive?

白昼怎懂夜的黑 submitted on 2019-12-25 01:36:13
Question: Does Sqoop support dynamic partitioning with Hive? I tried using the below-mentioned options: --hive-partition-key and --hive-partition-value are only for static partitioning, e.g.: sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/prac" --username root --password cloudera --hive-import --query "select id,name,ts from student where city='Mumbai' and \$CONDITIONS " --hive-partition-key city --hive-partition-value 'Mumbai' --hive-table prac.student --target-dir /user/mangesh/sqoop

spark select and add columns with alias

∥☆過路亽.° submitted on 2019-12-25 00:20:20
Question: I want to select a few columns, add a few columns or divide them, with some columns space-padded, and store them under new names as aliases. For example, in SQL it should be something like: select " " as col1, b as b1, c+d as e from table. How can I achieve this in Spark? Answer 1: You can also use the native DataFrame functions. For example, given: import org.apache.spark.sql.functions._ val df1 = Seq( ("A",1,5,3), ("B",3,4,2), ("C",4,6,3), ("D",5,9,1)).toDF("a","b","c","d") select the columns as: df1.select
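
For comparison, here is a minimal PySpark sketch of the same select-with-alias idea (the column names a, b, c, d are taken from the example above; the SparkSession setup is an assumption, and this is not the answer's original Scala code):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, lit

    spark = SparkSession.builder.appName("alias-example").getOrCreate()
    df1 = spark.createDataFrame(
        [("A", 1, 5, 3), ("B", 3, 4, 2), ("C", 4, 6, 3), ("D", 5, 9, 1)],
        ["a", "b", "c", "d"],
    )

    # Equivalent of: select " " as col1, b as b1, c+d as e from table
    result = df1.select(
        lit(" ").alias("col1"),            # constant space-padded column
        col("b").alias("b1"),              # rename an existing column
        (col("c") + col("d")).alias("e"),  # derived column stored under a new name
    )
    result.show()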

Hive JDBC connection setting or mapping with MySQL

早过忘川 submitted on 2019-12-24 22:49:47
Question: I am new to big data; technically I am a Java developer and decided to learn big data concepts. I have just managed to install Hadoop and Hive, and now I want to connect my Java program to Hive. I have configured MySQL as the back-end database. I tried to Google it and found a few Java program samples where they use a URL like jdbc:hive2://172.16.149.158:10000/default,"","". My question is that I didn't make any setting like this in hive-site.xml. Where should I make these settings, or if not
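
The question itself concerns a Java JDBC program, but as an illustration that the jdbc:hive2:// URL simply points at a running HiveServer2 instance (host plus port 10000), here is a rough Python sketch assuming the PyHive client package is installed:

    from pyhive import hive

    # Connect to the same HiveServer2 endpoint that the JDBC URL above refers to.
    conn = hive.Connection(host="172.16.149.158", port=10000, database="default")
    cursor = conn.cursor()
    cursor.execute("SHOW TABLES")
    print(cursor.fetchall())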

R {ff}: How to add a new column which depends on other elements in the same row in an ffdf object?

南笙酒味 submitted on 2019-12-24 21:06:14
Question: I have an ffdf object (23M x 4) and a character vector with the values "TUMOR" or "NORMAL"; each value is named with a unique icgc_specimen_id, which is how I indicate whether a certain specimen is a normal cell or a tumor cell. > head(expresion,4) ffdf (all open) dim=c(23939146,4), dimorder=c(1,2) row.names=NULL ffdf virtual mapping PhysicalName VirtualVmode PhysicalVmode AsIs VirtualIsMatrix PhysicalIsMatrix PhysicalElementNo icgc_donor_id icgc_donor_id integer integer FALSE FALSE FALSE 1 icgc

What are the most feasible options to do processing on google books n-gram dataset using modest resources?

假装没事ソ submitted on 2019-12-24 20:23:29
Question: I need to calculate word co-occurrence statistics for some 10,000 target words and a few hundred context words per target word, from the Google Books n-gram corpus. Below is the link to the full dataset: Google Ngram Viewer. As is evident, the database is approximately 2.2 TB and contains a few hundred billion rows. For computing word co-occurrence statistics I need to process the whole dataset for each possible pair of target and context word. I am currently considering using Hadoop with Hive for
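
As a rough single-machine sketch of the counting step (before scaling it out on Hadoop/Hive as the question considers), here is a Python illustration; it assumes the n-gram files are tab-separated lines of the form "ngram<TAB>year<TAB>match_count<TAB>volume_count", which is an assumption about the file layout:

    from collections import defaultdict

    def cooccurrence_counts(path, targets, contexts):
        # (target, context) -> summed match_count over all n-grams containing both words
        counts = defaultdict(int)
        with open(path, encoding="utf-8") as f:
            for line in f:
                ngram, _year, match_count, _volumes = line.rstrip("\n").split("\t")
                words = ngram.split()
                freq = int(match_count)
                for t in words:
                    if t in targets:
                        for c in words:
                            if c in contexts and c != t:
                                counts[(t, c)] += freq
        return counts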

Spark Program finding Popular HashTags from Twitter

ぐ巨炮叔叔 submitted on 2019-12-24 19:04:47
Question: I am trying to run this Spark program, which gets the hashtags currently popular on Twitter and shows only the top 10 hashtags. I have supplied the Twitter access token and secret and the consumer key and secret via a text file. import org.apache.spark.streaming.StreamingContext import org.apache.spark.streaming.Seconds import org.apache.spark.streaming.twitter.TwitterUtils object PopularHashtags { def setupLogging() = { import org.apache.log4j.{ Level, Logger } val rootLogger = Logger
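
The Scala program is cut off above; purely as a sketch of the core logic it describes (extract hashtags, count them, keep the top 10), here is a small Python illustration over an assumed iterable of tweet texts:

    from collections import Counter

    def top_hashtags(tweet_texts, n=10):
        # Count whitespace-delimited tokens starting with '#' and return the n most common.
        counts = Counter(
            word
            for text in tweet_texts
            for word in text.split()
            if word.startswith("#")
        )
        return counts.most_common(n)

    # Hypothetical usage:
    print(top_hashtags(["#spark streaming is fun", "learning #spark and #scala"]))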

Calculate running total from csv line by line

时光总嘲笑我的痴心妄想 submitted on 2019-12-24 18:46:36
Question: I'm loading a CSV file line by line because it has ~800 million lines, and there are many of these files which I need to analyse, so loading in parallel is paramount, and loading line by line is also required so as not to blow up the memory. I have been given an answer for how to calculate the number of entries in which unique IDs are present throughout the dataset using collections.Counter() (see Counting csv column occurrences on the fly in Python). But is there a way to calculate a
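
The question is truncated before it says which quantity should be totalled, so the following is only a sketch of keeping a running total per ID while streaming the file line by line; the column names "id" and "value" are hypothetical:

    import csv
    from collections import defaultdict

    def running_totals(path):
        totals = defaultdict(float)  # id -> running sum of the "value" column
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                totals[row["id"]] += float(row["value"])
        return totals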

How to store this collection of documents?

主宰稳场 submitted on 2019-12-24 17:43:25
Question: The dataset is like this: 39861 // number of documents 28102 // number of words in the vocabulary (another file) 3710420 // number of nonzero counts in the bag-of-words 1 118 1 // document_id index_in_vocabulary count 1 285 3 ... 2 46 1 ... 39861 27196 5 We are advised not to store that in a matrix (of size 39861 x 28102, I guess), since it won't fit in memory, and from here I can assume that every integer will need 24 bytes to be stored, thus 27 GB (= 39861*28102*24 bytes) with a dense matrix.
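
One way to avoid the dense 27 GB matrix is to keep only the 3,710,420 nonzero (document, word, count) triples in a sparse matrix. Here is a Python sketch with SciPy, assuming a hypothetical file "docword.txt" containing the three header counts followed by one "document_id index_in_vocabulary count" triple per line:

    from scipy.sparse import coo_matrix

    rows, cols, vals = [], [], []
    with open("docword.txt") as f:
        n_docs = int(f.readline())      # 39861
        n_words = int(f.readline())     # 28102
        n_nonzero = int(f.readline())   # 3710420
        for line in f:
            d, w, c = map(int, line.split())
            rows.append(d - 1)          # assuming 1-based document ids
            cols.append(w - 1)          # assuming 1-based vocabulary indices
            vals.append(c)

    bow = coo_matrix((vals, (rows, cols)), shape=(n_docs, n_words)).tocsr()
    print(bow.nnz, "nonzero counts stored")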

How Hive stores the data (loaded from HDFS)?

折月煮酒 submitted on 2019-12-24 17:24:49
Question: I am fairly new to Hadoop (HDFS and HBase) and the Hadoop ecosystem (Hive, Pig, Impala, etc.). I have a good understanding of Hadoop components such as the NameNode, DataNode, JobTracker, and TaskTracker and how they work in tandem to store data in an efficient manner. While trying to understand the fundamentals of a data access layer such as Hive, I need to understand where exactly a table's data (created in Hive) gets stored. We can create external and internal tables in Hive. As external tables can