bigdata

External shuffle: shuffling a large amount of data out of memory

笑着哭i Submitted on 2019-12-04 05:43:31
I am looking for a way to shuffle a large amount of data which does not fit into memory (approx. 40 GB). I have around 30 million entries, of variable length, stored in one large file. I know the starting and ending positions of each entry in that file. I need to shuffle this data, which does not fit in RAM. The only solution I could think of is to shuffle an array containing the numbers from 1 to N, where N is the number of entries, with the Fisher-Yates algorithm, and then copy the entries into a new file according to this order. Unfortunately, this solution involves a lot of seek operations,
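A minimal Python sketch of the index-shuffling idea described in the question, assuming the (start, end) byte offsets of each entry are already known; the function and variable names are made up for illustration:

    import random

    def external_shuffle(in_path, out_path, offsets):
        """offsets: list of (start, end) byte positions, one per entry."""
        order = list(range(len(offsets)))
        random.shuffle(order)                  # random.shuffle is a Fisher-Yates shuffle
        with open(in_path, "rb") as src, open(out_path, "wb") as dst:
            for i in order:
                start, end = offsets[i]
                src.seek(start)                # one random seek per entry -- the costly part
                dst.write(src.read(end - start))

The shuffled index array itself is small (30 million integers fit comfortably in RAM); the random seeks into the 40 GB source file are what dominate the run time, which is exactly the concern raised above.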

XML parsing in Python for big data

柔情痞子 Submitted on 2019-12-04 05:28:52
Question: I am trying to parse an XML file using Python. The problem is that the XML file is around 30 GB, so it takes hours to execute: tree = ET.parse('Posts.xml') In my XML file there are millions of child elements of the root. Is there any way to make it faster? I don't need to parse all the children; even the first 100,000 would be fine. All I need is to set a limit on how much gets parsed. Answer 1: You'll want an XML parsing mechanism that doesn't load everything into memory. You can use
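The answer is cut off above; here is a minimal sketch of the streaming approach it points at, using xml.etree.ElementTree.iterparse and stopping after the first 100,000 children (the <row> tag name is an assumption about Posts.xml):

    import xml.etree.ElementTree as ET

    def parse_first_n(path, limit=100_000):
        rows = []
        # iterparse streams the file instead of building the whole 30 GB tree in memory
        for event, elem in ET.iterparse(path, events=("end",)):
            if elem.tag == "row":              # assumed child tag name
                rows.append(dict(elem.attrib))
                elem.clear()                   # drop the parsed element so memory stays flat
                if len(rows) >= limit:
                    break
        return rows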

Import complex JSON data into Hive

十年热恋 Submitted on 2019-12-04 05:12:48
Question: A little spoon-feeding required: how do I import complex JSON into Hive? The JSON file has the format {"some-headers":"", "dump":[{"item-id":"item-1"},{"item-id":"item-2"},...]} . Hive should expose the fields given under dump. The JSON file size does not exceed 200 MB for now, but since it is a dump, it will reach GBs very soon. Any other possible methods would be greatly appreciated. Answer 1: You can import JSON into Hive by implementing a Hive SerDe. This link serves as a sample implementation: https://github
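Since the question explicitly asks for other possible methods, one option is to flatten the file before loading, so each item under dump becomes one JSON record per line that a JSON SerDe can map to Hive columns. A rough Python sketch (the field names come from the question; everything else is an assumption):

    import json

    def flatten_dump(in_path, out_path):
        with open(in_path) as src, open(out_path, "w") as dst:
            doc = json.load(src)    # workable at ~200 MB; a streaming parser would be needed at GB scale
            for item in doc["dump"]:
                dst.write(json.dumps(item) + "\n")   # one {"item-id": ...} object per line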

Hadoop 2 IOException only when trying to open supposed cache files

淺唱寂寞╮ Submitted on 2019-12-04 05:01:09
Question: I recently updated to Hadoop 2.2 (using this tutorial here). My main job class looks like this and throws an IOException:
import java.io.*;
import java.net.*;
import java.util.*;
import java.util.regex.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.chain.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org

R ff package ffsave 'zip' not found

こ雲淡風輕ζ 提交于 2019-12-04 04:48:40
Reproducible example:
library("ff")
m <- matrix(1:12, 3, 4, dimnames=list(c("r1","r2","r3"), c("m1","m2","m3","m4")))
v <- 1:3
ffm <- as.ff(m)
ffv <- as.ff(v)
d <- data.frame(m, v)
ffd <- ffdf(ffm, v=ffv, row.names=row.names(ffm))
ffsave(ffd, file="C:\\Users\\R.wd\\ff\\ffd")
## Error in system(cmd, input = filelist, intern = TRUE) : 'zip' not found
System: Windows 7 64-bit, R 15.2 64-bit, Rtools installed. The zip 300xn-x64 and unzip 600xn folders are already on the Windows Path; on the command line, typing zip or unzip shows their usage info. Need help! Any suggestion is appreciated. It seems your path is not

A roundup of 81 open-source big data processing tools (part 2), covering log collection systems, cluster management, RPC, and more

流过昼夜 Submitted on 2019-12-04 04:13:58
Part 1: http://my.oschina.net/u/2391658/blog/711016
This second part mainly collects open-source big data tools for log collection systems, messaging systems, distributed services, cluster management, RPC, infrastructure, search engines, IaaS, and monitoring/management.
Log collection systems
1. Facebook Scribe
Contributor: Facebook
Overview: Scribe is Facebook's open-source log collection system and is already heavily used inside Facebook. It can collect logs from all kinds of log sources and store them in a central storage system (NFS, a distributed file system, etc.) for centralized statistical analysis and processing. It provides a scalable, highly fault-tolerant solution for "distributed collection, unified processing" of logs. If the central storage system's network or machines fail, Scribe dumps the logs to local disk or another location; once the central storage system recovers, Scribe re-transmits the dumped logs to it. It is usually used together with Hadoop: Scribe pushes logs into HDFS, which Hadoop then processes periodically with MapReduce jobs.
Scribe's system architecture (figure)
Code repository: https://github.com/facebook/scribe
2. Cloudera Flume
Contributor: Cloudera
Overview: Flume is a highly available, highly reliable, distributed system from Cloudera for collecting, aggregating, and transporting massive amounts of log data

Why does Spark's OneHotEncoder drop the last category by default?

天涯浪子 Submitted on 2019-12-04 00:39:57
Question: I would like to understand the rationale behind Spark's OneHotEncoder dropping the last category by default. For example:
>>> fd = spark.createDataFrame(
...     [(1.0, "a"), (1.5, "a"), (10.0, "b"), (3.2, "c")], ["x", "c"])
>>> ss = StringIndexer(inputCol="c", outputCol="c_idx")
>>> ff = ss.fit(fd).transform(fd)
>>> ff.show()
+----+---+-----+
|   x|  c|c_idx|
+----+---+-----+
| 1.0|  a|  0.0|
| 1.5|  a|  0.0|
|10.0|  b|  1.0|
| 3.2|  c|  2.0|
+----+---+-----+
By default, the OneHotEncoder will drop the last
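The answer is truncated above, but for reference the default can be overridden through the encoder's dropLast parameter (the usual rationale for dropping one category is to avoid perfectly collinear dummy variables in linear models). A sketch continuing the question's own session, with column names taken from the question:

    >>> from pyspark.ml.feature import OneHotEncoder
    >>> enc = OneHotEncoder(inputCol="c_idx", outputCol="c_vec", dropLast=False)
    >>> enc.fit(ff).transform(ff).show()   # in older Spark versions OneHotEncoder is a plain Transformer: call enc.transform(ff) directly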

bigdata hadoop interview questions (part 1)

柔情痞子 Submitted on 2019-12-03 20:14:57
Data structures
1. Stacks and queues are both linear data structures.
2. Stack (LIFO/FILO): it can only be operated on at one end, the top of the stack (push and pop). This single-ended nature means that elements in a stack come out last in, first out. (That same property is sometimes used to implement a double-ended stack.)
3. Queue (FIFO): a data structure operated at both ends; enqueue happens at one end and dequeue at the other, preserving first-in, first-out order. To make full use of a queue's space, it is often implemented as a circular queue. (A short code sketch of both structures follows below.)
1. Tell us about your company's Hadoop project.
2. How big is your project's cluster, how many nodes does it have, and what is the total data volume?
3. Roughly how much data arrives per day?
4. How does HDFS maintain data consistency?
5. How is multithreaded, concurrent code developed?
6. What are the core classes of NIO?
7. How did you solve data skew in Hive?
8. How does shuffle work in MapReduce?
There were also many Java fundamentals questions, for example about the JVM and garbage collection. For questions I did not fully understand at first, I would say I had not worked with that yet but knew a related technology, and then steer the discussion to an area I was familiar with, which put me in control. Whether an interview succeeds or fails, review it afterwards and study the questions you were asked but had not fully mastered; you will be better prepared for the next interview, and they get smoother over time.
I. Differences between internal (managed) and external tables:
1. When creating a table
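A minimal Python illustration of the stack/queue behaviour summarized above (a list and collections.deque are just convenient stand-ins):

    from collections import deque

    stack = []                     # stack: push and pop at the same end (LIFO)
    stack.append("a"); stack.append("b")
    assert stack.pop() == "b"      # last in, first out

    queue = deque()                # queue: enqueue at one end, dequeue at the other (FIFO)
    queue.append("a"); queue.append("b")
    assert queue.popleft() == "a"  # first in, first out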

NullPointerException in Spark RDD map when submitted as a spark job

拟墨画扇 Submitted on 2019-12-03 18:10:01
Question: We're trying to submit a Spark job (Spark 2.0, Hadoop 2.7.2), but for some reason we're receiving a rather cryptic NPE in EMR. Everything runs just fine as a Scala program, so we're not really sure what's causing the issue. Here's the stack trace:
18:02:55,271 ERROR Utils:91 - Aborting task
java.lang.NullPointerException
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions

Working with a big CSV file in MATLAB

爱⌒轻易说出口 Submitted on 2019-12-03 16:06:05
I have to work with a big CSV file, up to 2 GB. More specifically, I have to upload all of this data to a MySQL database, but first I have to do a few calculations on it, so I need to do everything in MATLAB (my supervisor also wants it done in MATLAB because that is what he is familiar with :( ). Any idea how I can handle these big files? You should probably use textscan to read the data in chunks and then process each chunk. This will probably be more efficient than reading a single line at a time. For example, if you have 3 columns of data, you could do:
filename = 'fname.csv';
[fh, errMsg] = fopen(