bigdata

Name Node stores what?

柔情痞子 submitted on 2019-12-03 03:25:32
In case of "Name Node", what gets stored in main memory and what gets stored in secondary memory ( hard disk ). What we mean by "file to block mapping" ? What exactly is fsimage and edit logs ? In case of "Name Node", what gets stored in main memory and what gets stored in secondary memory ( hard disk ). The file to block mapping, locations of blocks on data nodes, active data nodes, a bunch of other metadata is all stored in memory on the NameNode. When you check the NameNode status website, pretty much all of that information is stored in memory somewhere. The only thing stored on disk is

How can I save an RDD into HDFS and later read it back?

℡╲_俬逩灬. submitted on 2019-12-03 02:31:54
I have an RDD whose elements are of type (Long, String). For some reason, I want to save the whole RDD to HDFS, and later read that RDD back in a Spark program. Is it possible to do that? And if so, how? It is possible. An RDD has saveAsObjectFile and saveAsTextFile functions. Tuples are stored as (value1, value2), so you can parse them back later. Reading can be done with the textFile function from SparkContext, followed by a .map to strip the parentheses. So: Version 1: rdd.saveAsTextFile("hdfs:///test1/"); // later, in another program val newRdds = sparkContext.textFile("hdfs:///test1/part-*").map(x…
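
A minimal sketch of both round-trips, assuming a local SparkContext and the hdfs:///test1/ path from the question (hdfs:///test2/ is a hypothetical second path); the parsing in the text version is illustrative and assumes the String part contains no comma:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("rdd-roundtrip").setMaster("local[*]"))
    val rdd = sc.parallelize(Seq((1L, "foo"), (2L, "bar")))

    // Version 1: text round-trip. Tuples are written as "(1,foo)", so strip the
    // parentheses and split once on the first comma to recover the pair.
    rdd.saveAsTextFile("hdfs:///test1/")
    val fromText = sc.textFile("hdfs:///test1/part-*")
      .map(_.stripPrefix("(").stripSuffix(")").split(",", 2))
      .map(a => (a(0).toLong, a(1)))

    // Version 2: object round-trip. saveAsObjectFile uses Java serialization,
    // and objectFile restores the original element type with no parsing.
    rdd.saveAsObjectFile("hdfs:///test2/")
    val fromObjects = sc.objectFile[(Long, String)]("hdfs:///test2/")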

Apache Drill vs Spark

蹲街弑〆低调 submitted on 2019-12-03 02:15:54
I have some experience with Apache Spark and Spark SQL. Recently I found the Apache Drill project. Could you describe the most significant advantages/differences between them? I have already read Fast Hadoop Analytics (Cloudera Impala vs Spark/Shark vs Apache Drill), but the topic is still unclear to me. Here's an article I came across that discusses some of the SQL technologies: http://www.zdnet.com/article/sql-and-hadoop-its-complicated/ Drill is fundamentally different in both the user's experience and the architecture. For example: Drill is a schema-free query engine. For…

Apache Spark vs Akka [closed]

帅比萌擦擦* submitted on 2019-12-03 01:48:56
Question closed as opinion-based; it is not currently accepting answers (closed 4 months ago). Could you please tell me the difference between Apache Spark and Akka? I know that both frameworks are meant for programming distributed and parallel computations, yet I don't see the link or the difference between them. Moreover, I would like to know the use cases suitable for each…

Books to start learning big data [closed]

僤鯓⒐⒋嵵緔 submitted on 2019-12-03 01:12:20
Question closed as not a good fit for the Q&A format, since it would likely solicit debate, arguments, polling, or extended discussion (closed 7 years ago). I would like to start learning about big data technologies. I want to work in this area in the future. Does anyone know good books…

doing PCA on very large data set in R

瘦欲@ submitted on 2019-12-03 00:36:13
This question was migrated from Cross Validated because it can be answered on Stack Overflow (migrated 7 years ago). I have a very large training set (~2 GB) in a CSV file. The file is too large to read directly into memory (read.csv() brings the computer to a halt), and I would like to reduce the size of the data file using PCA. The problem is that, as far as I can tell, I need to read the file into memory in order to run a PCA algorithm (e.g., princomp()). I have tried the bigmemory package to read the file in as a big.matrix, but princomp doesn't work on big.matrix objects…
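
The thread is about R, but for data this size one alternative (not from the thread) is to scale the PCA out rather than fit it in memory. A minimal Spark MLlib sketch in Scala, assuming a numeric, headerless CSV at a hypothetical HDFS path:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val sc = new SparkContext(new SparkConf().setAppName("pca-sketch").setMaster("local[*]"))

    // Parse each CSV line into a dense vector; rows stream through Spark
    // instead of being loaded into a single R session.
    val rows = sc.textFile("hdfs:///data/train.csv")
      .map(_.split(",").map(_.toDouble))
      .map(arr => Vectors.dense(arr))

    val mat = new RowMatrix(rows)
    val pcs = mat.computePrincipalComponents(10) // top 10 principal components
    val reduced = mat.multiply(pcs)              // the data projected into 10 dimensions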

how to fetch all of data from hbase table in spark

微笑、不失礼 submitted on 2019-12-03 00:27:51
I have a big table in HBase named UserAction, and it has three column families (song, album, singer). I need to fetch all of the data from the 'song' column family as a JavaRDD object. I tried this code, but it is not efficient. Is there a better solution? static SparkConf sparkConf = new SparkConf().setAppName("test").setMaster("local[4]"); static JavaSparkContext jsc = new JavaSparkContext(sparkConf); static void getRatings() { Configuration conf = HBaseConfiguration.create(); conf.set(TableInputFormat.INPUT_TABLE, "UserAction"); conf.set(TableInputFormat.SCAN_COLUMN_FAMILY, "song");…
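
The thread's snippet is Java; here is the same newAPIHadoopRDD approach as a Scala sketch, using the UserAction table and the 'song' family from the question. Setting SCAN_COLUMN_FAMILY keeps the filtering on the region servers rather than in Spark after the fetch:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("hbase-scan").setMaster("local[4]"))

    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "UserAction")
    // Restrict the scan to the 'song' family so only that data leaves the region servers.
    conf.set(TableInputFormat.SCAN_COLUMN_FAMILY, "song")

    // One RDD element per HBase row: (row key, Result holding the 'song' cells).
    val songs = sc.newAPIHadoopRDD(conf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])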

creating partition in external table in hive

删除回忆录丶 submitted on 2019-12-03 00:23:19
I have successfully created and added dynamic partitions to an internal table in Hive, i.e. using the following steps: 1. created a source table; 2. loaded data from local into the source table; 3. created another table with partitions (partition_table); 4. inserted the data into this table from the source table, resulting in the creation of all the partitions dynamically. My question is: how do I do this with an external table? I have read many articles on this, but I am confused: do I have to specify the path to the already existing partitions when creating partitions for an external table? Example: Step 1: create…
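
A minimal sketch of the external-table flow (in Scala, assuming Spark with Hive support and a hypothetical directory layout under /data/logs); unlike the internal-table case, each pre-existing directory is registered explicitly with its location:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("external-partitions")
      .enableHiveSupport()
      .getOrCreate()

    // The external table only points at a base location; dropping it later
    // leaves the files in place.
    spark.sql("""
      CREATE EXTERNAL TABLE IF NOT EXISTS logs (msg STRING)
      PARTITIONED BY (year STRING)
      LOCATION '/data/logs'
    """)

    // Hive does not discover existing directories on its own: either add each
    // partition with its path, or run MSCK REPAIR TABLE to pick up directories
    // that already follow the year=... naming convention.
    spark.sql("ALTER TABLE logs ADD PARTITION (year='2019') LOCATION '/data/logs/year=2019'")
    spark.sql("MSCK REPAIR TABLE logs")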

How to get started with Big Data Analysis [closed]

寵の児 submitted on 2019-12-03 00:12:06
Question closed as opinion-based; it is not currently accepting answers (closed 3 years ago). I've been a long-time user of R and have recently started working with Python. Using conventional RDBMS systems for data warehousing and R/Python for number-crunching, I now feel the need to get my hands dirty with big data analysis. I'd like to know how to get started with…

I need to compare two dataframes for type validation and send a nonzero value as output

烈酒焚心 submitted on 2019-12-02 22:43:12
I am comparing two dataframes (basically these are the schemas of two different data sources, one from Hive and the other from SAS 9.2). I need to validate the structure of both data sources, so I converted each schema into a dataframe; here they are. The SAS schema is in the format below:

scala> metadata.show
+----+----------------+----+---+-----------+-----------+
|S_No|        Variable|Type|Len|     Format|   Informat|
+----+----------------+----+---+-----------+-----------+
|   1|        DATETIME| Num|  8|DATETIME20.|DATETIME20.…
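
One way to get a single nonzero value out of the comparison (a sketch, not the thread's code) is to normalize both schema DataFrames to a common (name, type) vocabulary and count the rows that differ on either side; the type mapping and column names below are illustrative:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("schema-diff").master("local[*]").getOrCreate()
    import spark.implicits._

    val sas  = Seq(("DATETIME", "Num"), ("NAME", "Char")).toDF("Variable", "Type")
    val hive = Seq(("datetime", "double"), ("name", "string")).toDF("col_name", "data_type")

    // Map SAS types onto Hive types and lower-case the names before comparing.
    val typeMap  = Map("Num" -> "double", "Char" -> "string")
    val sasNorm  = sas.map(r => (r.getString(0).toLowerCase, typeMap(r.getString(1)))).toDF("name", "typ")
    val hiveNorm = hive.map(r => (r.getString(0).toLowerCase, r.getString(1))).toDF("name", "typ")

    // Rows present on one side but not the other; a nonzero total means the
    // structures disagree.
    val mismatches = sasNorm.except(hiveNorm).count() + hiveNorm.except(sasNorm).count()
    println(mismatches) // 0 here; a nonzero value flags a validation failure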