2015년 12월 23일 수요일

MapReduce & YARN

출처: Hortonworks Hadoop Tutorial



MapReduce is the key algorithm that the Hadoop data processing engine uses to distribute work around a cluster. A MapReduce job splits a large data set into independent chunks and organizes them into key, value pairs for parallel processing. This parallel processing improves the speed and reliability of the cluster, returning solutions more quickly and with greater reliability.
The Map function divides the input into ranges by the InputFormat and creates a map task for each range in the input. The JobTracker distributes those tasks to the worker nodes. The output of each map task is partitioned into a group of key-value pairs for each reduce.
  • map(key1,value) -> list<key2,value2>
The Reduce function then collects the various results and combines them to answer the larger problem that the master node needs to solve. Each reduce pulls the relevant partition from the machines where the maps executed, then writes its output back into HDFS. Thus, the reduce is able to collect the data from all of the maps for the keys and combine them to solve the problem.
  • reduce(key2, list<value2>) -> list<value3>
The current Apache Hadoop MapReduce System is composed of the JobTracker, which is the master, and the per-node slaves called TaskTrackers. The JobTracker is responsible for resource management(managing the worker nodes i.e. TaskTrackers), tracking resource consumption/availability and also job life-cycle management (scheduling individual tasks of the job, tracking progress, providing fault-tolerance for tasks etc).
The TaskTracker has simple responsibilities – launch/teardown tasks on orders from the JobTracker and provide task-status information to the JobTracker periodically.


The Apache Hadoop projects provide a series of tools designed to solve big data problems. The Hadoop cluster implements a parallel computing cluster using inexpensive commodity hardware. The cluster is partitioned across many servers to provide a near linear scalability. The philosophy of the cluster design is to bring the computing to the data. So each datanode will hold part of the overall data and be able to process the data that it holds. The overall framework for the processing software is called MapReduce. Here’s a short video introduction to MapReduce:

Apache YARN (Yet Another Resource Negotiator):

Hadoop HDFS is the data storage layer for Hadoop and MapReduce was the data-processing layer in Hadoop 1x. However, the MapReduce algorithm, by itself, isn’t sufficient for the very wide variety of use-cases we see Hadoop being employed to solve. Hadoop 2.0 presents YARN, as a generic resource-management and distributed application framework, whereby, one can implement multiple data processing applications customized for the task at hand. The fundamental idea of YARN is to split up the two major responsibilities of the JobTracker i.e. resource management and job scheduling/monitoring, into separate daemons: a global ResourceManager and per-application ApplicationMaster (AM).
The ResourceManager and per-node slave, the NodeManager (NM), form the new, and generic, systemfor managing applications in a distributed manner.
The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The per-application ApplicationMaster is, in effect, a framework specific entity and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the component tasks.
ResourceManager has a pluggable Scheduler, which is responsible for allocating resources to the various running applications subject to familiar constraints of capacities, queues etc. The Scheduler is apure scheduler in the sense that it performs no monitoring or tracking of status for the application, offering no guarantees on restarting failed tasks either due to application failure or hardware failures. The Scheduler performs its scheduling function based on the resource requirements of the applications; it does so based on the abstract notion of a *Resource Container *which incorporates resource elements such as memory, cpu, disk, network etc.
NodeManager is the per-machine slave, which is responsible for launching the applications’ containers, monitoring their resource usage (cpu, memory, disk, network) and reporting the same to the ResourceManager.
The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, tracking their status and monitoring for progress. From the system perspective, the ApplicationMaster itself runs as a normal container.
Here is an architectural view of YARN:

One of the crucial implementation details for MapReduce within the new YARN system that I’d like to point out is that we have reused the existing MapReduce framework without any major surgery. This was very important to ensure compatibility for existing MapReduce applications and users. Here is a short video introduction for YARN

HDFS

출처: Hortonworks Hadoop Tutorial



HDFS(Hadoop Distributed File System)
HDFS is a distributed file system that is designed for storing large data files. HDFS is a Java-based file system that provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers. HDFS has demonstrated production scalability of up to 200 PB of storage and a single cluster of 4500 servers, supporting close to a billion files and blocks. HDFS is a scalable, fault-tolerant, distributed storage system that works closely with a wide variety of concurrent data access applications, coordinated by YARN. HDFS will “just work” under a variety of physical and systemic circumstances. By distributing storage and computation across many servers, the combined storage resource can grow linearly with demand while remaining economical at every amount of storage.

An HDFS cluster is comprised of a NameNode, which manages the cluster metadata, and DataNodes that store the data. Files and directories are represented on the NameNode by inodes. Inodes record attributes like permissions, modification and access times, or namespace and disk space quotas.
The file content is split into large blocks (typically 128 megabytes), and each block of the file is independently replicated at multiple DataNodes. The blocks are stored on the local file system on the DataNodes.
The Namenode actively monitors the number of replicas of a block. When a replica of a block is lost due to a DataNode failure or disk failure, the NameNode creates another replica of the block. The NameNode maintains the namespace tree and the mapping of blocks to DataNodes, holding the entire namespace image in RAM.
The NameNode does not directly send requests to DataNodes. It sends instructions to the DataNodes by replying to heartbeats sent by those DataNodes. The instructions include commands to:
  • replicate blocks to other nodes,
  • remove local block replicas,
  • re-register and send an immediate block report, or
  • shut down the node.

With next generation HDFS data architecture that comes with HDP 2.0, HDFS has evolved to provideautomated failure with a hot standby, with full stack resiliency. Please spare some time to go through this video for more clarity on HDFS.

Apache Hadoop®

출처: Hortonworks Hadoop Tutorial


Apache Hadoop® is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly gain insight from massive amounts of structured and unstructured data. Numerous Apache Software Foundation projects make up the services required by an enterprise to deploy, integrate and work with Hadoop.
The base Apache Hadoop framework is composed of the following modules:
  • Hadoop Common – contains libraries and utilities needed by other Hadoop modules.
  • Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
  • Hadoop YARN – a resource-management platform responsible for managing computing resources in clusters and using them for scheduling of users’ applications.
  • Hadoop MapReduce – a programming model for large scale data processing.
Each project has been developed to deliver an explicit function and each has its own community of developers and individual release cycles. There are five pillars to Hadoop that make it enterprise ready:
  1. Data Management– Store and process vast quantities of data in a storage layer that scales linearly. Hadoop Distributed File System (HDFS) is the core technology for the efficient scale out storage layer, and is designed to run across low-cost commodity hardware. Apache Hadoop YARN is the pre-requisite for Enterprise Hadoop as it provides the resource management and pluggable architecture for enabling a wide variety of data access methods to operate on data stored in Hadoop with predictable performance and service levels.
    1. Apache Hadoop YARN– Part of the core Hadoop project, YARN is a next-generation framework for  Hadoop data processing extending MapReduce capabilities by supporting non-MapReduce workloads associated with other programming models.
    2. HDFS– Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable and reliable data storage that is designed to span large clusters of commodity servers.
  2. Data Access– Interact with your data in a wide variety of ways – from batch to real-time. Apache       Hive is the most widely adopted data access technology, though there are many specialized engines. For instance, Apache Pig provides scripting capabilities, Apache Storm offers real-time processing,     Apache HBase offers columnar NoSQL storage and Apache Accumulo offers cell-level access           control. All of these engines can work across one set of data and resources thanks to YARN and       intermediate engines such as Apache Tez for interactive access and Apache Slider for long-running   applications. YARN also provides flexibility for new and emerging data access methods, such as         Apache Solr for search and programming frameworks such as Cascading.
    1. Apache Hive– Built on the MapReduce framework, Hive is a data warehouse that enables easy data summarization and ad-hoc queries via an SQL-like interface for large datasets stored in HDFS.
    2. Apache Pig– A platform for processing and analyzing large data sets. Pig consists of a high-level language (Pig Latin) for expressing data analysis programs paired with the MapReduce framework for processing these programs.
    3. MapReduce– MapReduce is a framework for writing applications that process large amounts of structured and unstructured data in parallel across a cluster of thousands of machines, in a reliable and fault-tolerant manner.
    4. Apache Spark– Spark is ideal for in-memory data processing. It allows data scientists to implement fast, iterative algorithms for advanced analytics such as clustering and classification of datasets.
    5. Apache Storm– Storm is a distributed real-time computation system for processing fast, large streams of data adding reliable real-time data processing capabilities to Apache Hadoop® 2.x
    6. Apache HBase– A column-oriented NoSQL data storage system that provides random real-time read/write access to big data for user applications.
    7. Apache Tez– Tez generalizes the MapReduce paradigm to a more powerful framework for executing a complex DAG (directed acyclic graph) of tasks for near real-time big data processing.
    8. Apache Kafka– Kafka is a fast and scalable publish-subscribe messaging system that is often used in place of traditional message brokers because of its higher throughput, replication, and fault tolerance.
    9. Apache HCatalog– A table and metadata management service that provides a centralized way for data processing systems to understand the structure and location of the data stored within Apache Hadoop.
    10. Apache Slider– A framework for deployment of long-running data access applications in Hadoop. Slider leverages YARN’s resource management capabilities to deploy those applications, to manage their lifecycles and scale them up or down.
    11. Apache Solr– Solr is the open source platform for searches of data stored in Hadoop. Solr enables powerful full-text search and near real-time indexing on many of the world’s largest Internet sites.
    12. Apache Mahout– Mahout provides scalable machine learning algorithms for Hadoop which aids with data science for clustering, classification and batch based collaborative filtering.
    13. Apache Accumulo– Accumulo is a high performance data storage and retrieval system with cell-level access control. It is a scalable implementation of Google’s Big Table design that works on top of Apache Hadoop and Apache ZooKeeper.
  3. Data Governance and Integration– Quickly and easily load data, and manage   according to           policy.Apache Falcon provides policy-based workflows for data governance, while Apache Flume and Sqoop enable easy data ingestion, as do the NFS and WebHDFS interfaces to HDFS.
    1. Apache Falcon– Falcon is a data management framework for simplifying data lifecycle management and processing pipelines on Apache Hadoop®. It enables users to orchestrate data motion, pipeline processing,disaster recovery, and data retention workflows.
    2. Apache Flume– Flume allows you to efficiently aggregate and move large amounts of log data from many different sources to Hadoop.
    3. Apache Sqoop– Sqoop is a tool that speeds and eases movement of data in and out of Hadoop. It provides a reliable parallel load for various, popular enterprise data sources.
  4. Security– Address requirements of Authentication, Authorization, Accounting and Data Protection. Security is provided at every layer of the Hadoop stack from HDFS and YARN to Hive and the other Data Access components on up through the entire perimeter of the cluster via Apache Knox.
    1. Apache Knox– The Knox Gateway (“Knox”) provides a single point of authentication and access for Apache Hadoop services in a cluster. The goal of the project is to simplify Hadoop security for users who access the cluster data and execute jobs, and for operators who control access to the cluster.
    2. Apache Ranger– Apache Ranger delivers a comprehensive approach to security for a Hadoop cluster. It provides central security policy administration across the core enterprise security requirements of authorization, accounting and data protection.
  5. Operations–  Provision, manage, monitor and operate Hadoop clusters at scale.
    1. Apache Ambari– An open source installation lifecycle management, administration and monitoring system for Apache Hadoop clusters.
    2. Apache Oozie– Oozie Java Web application used to schedule Apache Hadoop jobs. Oozie combines multiple jobs sequentially into one logical unit of work.
    3. Apache ZooKeeper– A highly available system for coordinating distributed processes. Distributed applications use ZooKeeper to store and mediate updates to important configuration information.
Apache Hadoop can be useful across a range of use cases spanning virtually every vertical industry. It is becoming popular anywhere that you need to store, process, and analyze large volumes of data. Examples include digital marketing automation, fraud detection and prevention, social network and relationship analysis, predictive modeling for new drugs, retail in-store behavior analysis, and mobile device location-based marketing.

2015년 12월 4일 금요일

How to Build Chrome V8 Javascript Engine with GYP and MSVS 2013 on Windows 10

1. Download Chrome V8 Javascript Engine Source
> git clone https://chromium.googlesource.com/v8/v8.git

2. Download Chromium depot tools (or download directly)
> git clone https://chromium.googlesource.com/chromium/tools/depot_tools.git
> set DEPOT_TOOLS_WIN_TOOLCHAIN=0
> set PATH=%PATH%;C:\Git\depot_tools
> cd depot_tools
> gclient
> cd v8
> git config branch.autosetupmerge always
> git config branch.autosetuprebase always
> git pull

3. Download Google's GYP
> git clone https://chromium.googlesource.com/external/gyp.git build\gyp
> cd build\gyp
> gyp
> cd ..\..

4. Download Chromium Python 2.6
> git clone https://chromium.googlesource.com/chromium/deps/python_26.git third_party\python_26

5. Download Chromium Cygwin
> git clone https://chromium.googlesource.com/chromium/deps/cygwin.git third_party\cygwin
> set PATH=%PATH%;C:\Git\v8\third_party\cygwin\bin
> make dependencies

6. Download Chromium ICU
> git clone https://chromium.googlesource.com/chromium/deps/icu.git third_party\icu

7. Create VS Project
> copy tools\gyp\v8.gyp v8.gyp_env
> set GYP_MSVS_VERSION=2013
> third_party\python_26\python.exe build\gyp_v8

8. Build Solution
$ /cygdrive/c/Program\ Files\ \(x86\)/Microsoft\ Visual\ Studio\ 12.0/Common7/IDE/devenv.com /build Release build/all.sln

참조 사이트
https://www.chromium.org/developers/how-tos/build-instructions-windows
https://code.google.com/p/v8/issues/detail?id=2901
https://github.com/v8/v8/wiki/Using%20Git
http://stuff.stevenreid.uk/2015/04/12/build-google-v8-on-windows-8-x64/
http://gneu.org/2014/02/integrating-v8/
http://egloos.zum.com/haejung/v/1123470
http://namocom.tistory.com/218
http://funnylog.kr/354