Big Data Tech and Analytics --- MapReduce and Frequent Itemsets
1. Standard architecture to solve the problem of big data computation
   - Cluster of commodity Linux nodes
   - Commodity network (Ethernet) to connect them
2. Issue and idea
   - Issue: copying data over a network takes time
   - Idea: bring computation close to the data (see the word-count sketch after this outline)
   - Store files multiple times for reliability
3. HDFS
   3.1 Function
   - Distributed file system
   - Provides a global file namespace
   - Replication to ensure data recovery (illustrated in the second sketch below)
   3.2 Data characteristics
   - Streaming data access
   - Large data sets and files: gigabytes to terabytes in size
   - High aggregate data bandwidth
   - Scales to hundreds of nodes in a cluster
   - Tens of millions of files in a single instance
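To make the "bring computation to the data" idea concrete, here is a minimal in-memory sketch of the MapReduce programming model as word count. This is not Hadoop's API; the names `map_fn`, `reduce_fn`, and `run_mapreduce` are illustrative, and the shuffle step stands in for what the framework does between the map and reduce phases on a real cluster.

```python
from collections import defaultdict

def map_fn(_, line):
    """Map phase: emit a (word, 1) pair for each word in a line."""
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    """Reduce phase: sum all partial counts for one word."""
    yield word, sum(counts)

def run_mapreduce(inputs, mapper, reducer):
    # Shuffle step: group every intermediate value by its key,
    # as the framework does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in inputs:
        for k, v in mapper(key, value):
            groups[k].append(v)
    results = []
    for k in sorted(groups):
        results.extend(reducer(k, groups[k]))
    return results

if __name__ == "__main__":
    lines = enumerate(["the cat sat", "the cat ran", "a dog ran"])
    print(run_mapreduce(lines, map_fn, reduce_fn))
    # [('a', 1), ('cat', 2), ('dog', 1), ('ran', 2), ('sat', 1), ('the', 2)]
```

On a real cluster the map tasks run on the nodes that already hold the input blocks, so only the much smaller intermediate (key, value) pairs cross the network.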
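The following toy sketch illustrates the HDFS replication idea from 3.1: a file is split into fixed-size blocks and each block is stored on several distinct nodes, so losing one node does not lose data. The node names, `BLOCK_SIZE`, and the round-robin placement are simplifying assumptions for illustration only; the default values noted in the comments (128 MB blocks, replication factor 3) are HDFS's documented defaults.

```python
import itertools

BLOCK_SIZE = 4    # bytes per block here; HDFS defaults to 128 MB
REPLICATION = 3   # HDFS's default replication factor

def place_blocks(data, nodes):
    """Split data into fixed-size blocks and assign each block
    to REPLICATION distinct nodes, round-robin style."""
    placement = {}
    node_cycle = itertools.cycle(range(len(nodes)))
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        # Pick the next REPLICATION nodes for this block's replicas.
        targets = [nodes[next(node_cycle)] for _ in range(REPLICATION)]
        placement[f"block-{i // BLOCK_SIZE}"] = (block, targets)
    return placement

nodes = ["node-a", "node-b", "node-c", "node-d"]
for name, (block, targets) in place_blocks(b"hello hdfs world", nodes).items():
    print(name, block, "->", targets)
```

With a replication factor of 3, any two node failures still leave at least one copy of every block, which is what makes the data-recovery guarantee in 3.1 possible.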