Wednesday, December 23, 2015

MapReduce & YARN

Source: Hortonworks Hadoop Tutorial



MapReduce is the key algorithm that the Hadoop data processing engine uses to distribute work around a cluster. A MapReduce job splits a large data set into independent chunks and organizes them into key-value pairs for parallel processing. This parallel processing improves both the speed and the reliability of the cluster, returning results more quickly.
The Map function divides the input into ranges, as determined by the InputFormat, and creates a map task for each range. The JobTracker distributes those tasks to the worker nodes. The output of each map task is partitioned into a group of key-value pairs for each reducer.
  • map(key1, value1) -> list<key2, value2>
The Reduce function then collects the various results and combines them to answer the larger problem that the master node needs to solve. Each reducer pulls the relevant partition from the machines where the maps executed and writes its output back into HDFS. The reducer is thus able to collect the data from all of the maps for its keys and combine them to solve the problem.
  • reduce(key2, list<value2>) -> list<value3>
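To make the two signatures above concrete, here is a minimal word-count sketch written against the Hadoop 2.x org.apache.hadoop.mapreduce API (my own illustration, not part of the Hortonworks tutorial): the mapper emits a (word, 1) pair for every token, and the reducer sums the counts it receives for each word.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        // map(key1, value1) -> list<key2, value2>: (line offset, line text) -> (word, 1)
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer tokens = new StringTokenizer(value.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    context.write(word, ONE);   // emit one (word, 1) pair per token
                }
            }
        }

        // reduce(key2, list<value2>) -> list<value3>: (word, [1, 1, ...]) -> (word, count)
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));   // one line per distinct word
            }
        }
    }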
The current Apache Hadoop MapReduce system is composed of the JobTracker, which is the master, and the per-node slaves called TaskTrackers. The JobTracker is responsible for resource management (managing the worker nodes, i.e. the TaskTrackers), tracking resource consumption/availability, and job life-cycle management (scheduling individual tasks of the job, tracking progress, providing fault tolerance for tasks, etc.).
The TaskTracker has simple responsibilities – launch/teardown tasks on orders from the JobTracker and provide task-status information to the JobTracker periodically.
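Continuing the word-count sketch above, the client side that hands the job to the cluster is just a small driver; the input and output paths below are placeholders taken from the command line. On Hadoop 1.x the JobTracker schedules the resulting map and reduce tasks onto TaskTrackers; on Hadoop 2.x the same code is scheduled through YARN (described below).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            // Describe the job; the cluster decides where the tasks actually run.
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCount.TokenizerMapper.class);
            job.setCombinerClass(WordCount.IntSumReducer.class);   // local pre-aggregation
            job.setReducerClass(WordCount.IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory must not exist yet
            // Submit and block until the job completes, reporting progress to the client.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }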


The Apache Hadoop projects provide a series of tools designed to solve big data problems. A Hadoop cluster implements parallel computing on inexpensive commodity hardware. The data is partitioned across many servers to provide near-linear scalability. The philosophy of the cluster design is to bring the computing to the data: each DataNode holds part of the overall data and is able to process the data that it holds. The overall framework for the processing software is called MapReduce.

Apache YARN (Yet Another Resource Negotiator):

Hadoop HDFS is the data storage layer for Hadoop, and MapReduce was the data-processing layer in Hadoop 1.x. However, the MapReduce algorithm, by itself, is not sufficient for the very wide variety of use cases we see Hadoop being employed to solve. Hadoop 2.0 introduces YARN, a generic resource-management and distributed application framework, with which one can implement multiple data-processing applications customized for the task at hand. The fundamental idea of YARN is to split the two major responsibilities of the JobTracker, i.e. resource management and job scheduling/monitoring, into separate daemons: a global ResourceManager and a per-application ApplicationMaster (AM).
The ResourceManager and the per-node slave, the NodeManager (NM), form the new, generic system for managing applications in a distributed manner.
The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The per-application ApplicationMaster is, in effect, a framework specific entity and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the component tasks.
The ResourceManager has a pluggable Scheduler, which is responsible for allocating resources to the various running applications subject to the familiar constraints of capacities, queues, etc. The Scheduler is a pure scheduler in the sense that it performs no monitoring or tracking of application status, and it offers no guarantees about restarting tasks that fail due to application or hardware failures. The Scheduler performs its scheduling function based on the resource requirements of the applications; it does so based on the abstract notion of a Resource Container, which incorporates resource elements such as memory, CPU, disk, and network.
The NodeManager is the per-machine slave, which is responsible for launching the applications' containers, monitoring their resource usage (CPU, memory, disk, network) and reporting the same to the ResourceManager.
The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, tracking their status and monitoring for progress. From the system perspective, the ApplicationMaster itself runs as a normal container.
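As a small illustration of this division of labor (my own sketch, not from the tutorial), a client can ask the ResourceManager what it is currently running through the YarnClient API; the ResourceManager answers for every application, whatever framework its ApplicationMaster implements.

    import org.apache.hadoop.yarn.api.records.ApplicationReport;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class ListYarnApplications {
        public static void main(String[] args) throws Exception {
            // The ResourceManager address is read from yarn-site.xml on the classpath.
            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(new YarnConfiguration());
            yarnClient.start();

            // One report per application the ResourceManager is tracking,
            // regardless of which ApplicationMaster framework started it.
            for (ApplicationReport report : yarnClient.getApplications()) {
                System.out.printf("%s  %s  %s%n",
                        report.getApplicationId(),
                        report.getName(),
                        report.getYarnApplicationState());
            }
            yarnClient.stop();
        }
    }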

One crucial implementation detail of MapReduce on the new YARN system is that the existing MapReduce framework has been reused without any major surgery. This was very important for ensuring compatibility for existing MapReduce applications and users.

HDFS

Source: Hortonworks Hadoop Tutorial



HDFS(Hadoop Distributed File System)
HDFS is a distributed file system designed for storing large data files. HDFS is a Java-based file system that provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers. HDFS has demonstrated production scalability of up to 200 PB of storage and a single cluster of 4,500 servers, supporting close to a billion files and blocks. HDFS is a scalable, fault-tolerant, distributed storage system that works closely with a wide variety of concurrent data-access applications, coordinated by YARN. HDFS will “just work” under a variety of physical and systemic circumstances. By distributing storage and computation across many servers, the combined storage resource can grow linearly with demand while remaining economical at every scale.

An HDFS cluster consists of a NameNode, which manages the cluster metadata, and DataNodes, which store the data. Files and directories are represented on the NameNode by inodes. Inodes record attributes such as permissions, modification and access times, and namespace and disk-space quotas.
The file content is split into large blocks (typically 128 megabytes), and each block of the file is independently replicated at multiple DataNodes. The blocks are stored on the local file system on the DataNodes.
The NameNode actively monitors the number of replicas of each block. When a replica is lost due to a DataNode failure or disk failure, the NameNode creates another replica of the block. The NameNode maintains the namespace tree and the mapping of blocks to DataNodes, holding the entire namespace image in RAM.
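A short client-side sketch (mine, with a hypothetical /tmp/hello.txt path) shows how little of this machinery an application sees: the FileSystem API hides the block splitting and replication, while the NameNode's per-file metadata is still visible through FileStatus.

    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRoundTrip {
        public static void main(String[] args) throws Exception {
            // fs.defaultFS in core-site.xml decides which NameNode we talk to.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/tmp/hello.txt");   // hypothetical path

            // Write: the client streams bytes; HDFS splits them into blocks and
            // replicates each block to several DataNodes behind the scenes.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
            }

            // Read the file back, then show block-level metadata kept by the NameNode.
            try (FSDataInputStream in = fs.open(file)) {
                byte[] buf = new byte[32];
                int n = in.read(buf);
                System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
            }
            FileStatus status = fs.getFileStatus(file);
            System.out.println("block size = " + status.getBlockSize()
                    + ", replication = " + status.getReplication());
        }
    }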
The NameNode does not directly send requests to DataNodes. It sends instructions to the DataNodes by replying to heartbeats sent by those DataNodes. The instructions include commands to:
  • replicate blocks to other nodes,
  • remove local block replicas,
  • re-register and send an immediate block report, or
  • shut down the node.

With the next-generation HDFS data architecture that comes with HDP 2.0, HDFS has evolved to provide automated failover with a hot standby and full-stack resiliency.

Apache Hadoop®

Source: Hortonworks Hadoop Tutorial


Apache Hadoop® is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly gain insight from massive amounts of structured and unstructured data. Numerous Apache Software Foundation projects make up the services required by an enterprise to deploy, integrate and work with Hadoop.
The base Apache Hadoop framework is composed of the following modules:
  • Hadoop Common – contains libraries and utilities needed by other Hadoop modules.
  • Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
  • Hadoop YARN – a resource-management platform responsible for managing computing resources in clusters and using them for scheduling of users’ applications.
  • Hadoop MapReduce – a programming model for large scale data processing.
Each project has been developed to deliver an explicit function and each has its own community of developers and individual release cycles. There are five pillars to Hadoop that make it enterprise ready:
  1. Data Management– Store and process vast quantities of data in a storage layer that scales linearly. Hadoop Distributed File System (HDFS) is the core technology for the efficient scale out storage layer, and is designed to run across low-cost commodity hardware. Apache Hadoop YARN is the pre-requisite for Enterprise Hadoop as it provides the resource management and pluggable architecture for enabling a wide variety of data access methods to operate on data stored in Hadoop with predictable performance and service levels.
    1. Apache Hadoop YARN– Part of the core Hadoop project, YARN is a next-generation framework for  Hadoop data processing extending MapReduce capabilities by supporting non-MapReduce workloads associated with other programming models.
    2. HDFS– Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable and reliable data storage that is designed to span large clusters of commodity servers.
  2. Data Access– Interact with your data in a wide variety of ways – from batch to real-time. Apache Hive is the most widely adopted data access technology, though there are many specialized engines. For instance, Apache Pig provides scripting capabilities, Apache Storm offers real-time processing, Apache HBase offers columnar NoSQL storage and Apache Accumulo offers cell-level access control. All of these engines can work across one set of data and resources thanks to YARN and intermediate engines such as Apache Tez for interactive access and Apache Slider for long-running applications. YARN also provides flexibility for new and emerging data access methods, such as Apache Solr for search and programming frameworks such as Cascading.
    1. Apache Hive– Built on the MapReduce framework, Hive is a data warehouse that enables easy data summarization and ad-hoc queries via an SQL-like interface for large datasets stored in HDFS (a minimal JDBC sketch follows this list).
    2. Apache Pig– A platform for processing and analyzing large data sets. Pig consists of a high-level language (Pig Latin) for expressing data analysis programs paired with the MapReduce framework for processing these programs.
    3. MapReduce– MapReduce is a framework for writing applications that process large amounts of structured and unstructured data in parallel across a cluster of thousands of machines, in a reliable and fault-tolerant manner.
    4. Apache Spark– Spark is ideal for in-memory data processing. It allows data scientists to implement fast, iterative algorithms for advanced analytics such as clustering and classification of datasets.
    5. Apache Storm– Storm is a distributed real-time computation system for processing fast, large streams of data adding reliable real-time data processing capabilities to Apache Hadoop® 2.x
    6. Apache HBase– A column-oriented NoSQL data storage system that provides random real-time read/write access to big data for user applications.
    7. Apache Tez– Tez generalizes the MapReduce paradigm to a more powerful framework for executing a complex DAG (directed acyclic graph) of tasks for near real-time big data processing.
    8. Apache Kafka– Kafka is a fast and scalable publish-subscribe messaging system that is often used in place of traditional message brokers because of its higher throughput, replication, and fault tolerance.
    9. Apache HCatalog– A table and metadata management service that provides a centralized way for data processing systems to understand the structure and location of the data stored within Apache Hadoop.
    10. Apache Slider– A framework for deployment of long-running data access applications in Hadoop. Slider leverages YARN’s resource management capabilities to deploy those applications, to manage their lifecycles and scale them up or down.
    11. Apache Solr– Solr is the open source platform for searches of data stored in Hadoop. Solr enables powerful full-text search and near real-time indexing on many of the world’s largest Internet sites.
    12. Apache Mahout– Mahout provides scalable machine learning algorithms for Hadoop which aids with data science for clustering, classification and batch based collaborative filtering.
    13. Apache Accumulo– Accumulo is a high performance data storage and retrieval system with cell-level access control. It is a scalable implementation of Google’s Big Table design that works on top of Apache Hadoop and Apache ZooKeeper.
  3. Data Governance and Integration– Quickly and easily load data, and manage it according to policy. Apache Falcon provides policy-based workflows for data governance, while Apache Flume and Sqoop enable easy data ingestion, as do the NFS and WebHDFS interfaces to HDFS.
    1. Apache Falcon– Falcon is a data management framework for simplifying data lifecycle management and processing pipelines on Apache Hadoop®. It enables users to orchestrate data motion, pipeline processing, disaster recovery, and data retention workflows.
    2. Apache Flume– Flume allows you to efficiently aggregate and move large amounts of log data from many different sources to Hadoop.
    3. Apache Sqoop– Sqoop is a tool that speeds and eases movement of data in and out of Hadoop. It provides a reliable parallel load for various, popular enterprise data sources.
  4. Security– Address requirements of Authentication, Authorization, Accounting and Data Protection. Security is provided at every layer of the Hadoop stack from HDFS and YARN to Hive and the other Data Access components on up through the entire perimeter of the cluster via Apache Knox.
    1. Apache Knox– The Knox Gateway (“Knox”) provides a single point of authentication and access for Apache Hadoop services in a cluster. The goal of the project is to simplify Hadoop security for users who access the cluster data and execute jobs, and for operators who control access to the cluster.
    2. Apache Ranger– Apache Ranger delivers a comprehensive approach to security for a Hadoop cluster. It provides central security policy administration across the core enterprise security requirements of authorization, accounting and data protection.
  5. Operations–  Provision, manage, monitor and operate Hadoop clusters at scale.
    1. Apache Ambari– An open source installation lifecycle management, administration and monitoring system for Apache Hadoop clusters.
    2. Apache Oozie– Oozie is a Java Web application used to schedule Apache Hadoop jobs. Oozie combines multiple jobs sequentially into one logical unit of work.
    3. Apache ZooKeeper– A highly available system for coordinating distributed processes. Distributed applications use ZooKeeper to store and mediate updates to important configuration information.
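As promised under the Apache Hive item above, here is a minimal sketch of the data-access pillar in practice: a JDBC client running an ad-hoc aggregate against HiveServer2. The host, port, credentials, and the words table are placeholders for illustration, not part of the tutorial; Hive compiles the query into jobs that run on YARN.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuery {
        public static void main(String[] args) throws Exception {
            // HiveServer2 JDBC endpoint; host, port and database are assumptions.
            String url = "jdbc:hive2://localhost:10000/default";
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(url, "hive", "");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                         "SELECT word, COUNT(*) AS cnt FROM words GROUP BY word")) {
                while (rs.next()) {
                    System.out.println(rs.getString("word") + "\t" + rs.getLong("cnt"));
                }
            }
        }
    }

The same pattern works from any JDBC-capable tool, which is what makes Hive the common entry point among the data-access engines listed above.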
Apache Hadoop can be useful across a range of use cases spanning virtually every vertical industry. It is becoming popular anywhere that you need to store, process, and analyze large volumes of data. Examples include digital marketing automation, fraud detection and prevention, social network and relationship analysis, predictive modeling for new drugs, retail in-store behavior analysis, and mobile device location-based marketing.

Friday, December 4, 2015

How to Build the Chrome V8 JavaScript Engine with GYP and MSVS 2013 on Windows 10

1. Download Chrome V8 Javascript Engine Source
> git clone https://chromium.googlesource.com/v8/v8.git

2. Download Chromium depot tools (or download directly)
> git clone https://chromium.googlesource.com/chromium/tools/depot_tools.git
> set DEPOT_TOOLS_WIN_TOOLCHAIN=0
> set PATH=%PATH%;C:\Git\depot_tools
> cd depot_tools
> gclient
> cd v8
> git config branch.autosetupmerge always
> git config branch.autosetuprebase always
> git pull

3. Download Google's GYP
> git clone https://chromium.googlesource.com/external/gyp.git build\gyp
> cd build\gyp
> gyp
> cd ..\..

4. Download Chromium Python 2.6
> git clone https://chromium.googlesource.com/chromium/deps/python_26.git third_party\python_26

5. Download Chromium Cygwin
> git clone https://chromium.googlesource.com/chromium/deps/cygwin.git third_party\cygwin
> set PATH=%PATH%;C:\Git\v8\third_party\cygwin\bin
> make dependencies

6. Download Chromium ICU
> git clone https://chromium.googlesource.com/chromium/deps/icu.git third_party\icu

7. Create VS Project
> copy tools\gyp\v8.gyp v8.gyp_env
> set GYP_MSVS_VERSION=2013
> third_party\python_26\python.exe build\gyp_v8

8. Build Solution
$ /cygdrive/c/Program\ Files\ \(x86\)/Microsoft\ Visual\ Studio\ 12.0/Common7/IDE/devenv.com /build Release build/all.sln

References
https://www.chromium.org/developers/how-tos/build-instructions-windows
https://code.google.com/p/v8/issues/detail?id=2901
https://github.com/v8/v8/wiki/Using%20Git
http://stuff.stevenreid.uk/2015/04/12/build-google-v8-on-windows-8-x64/
http://gneu.org/2014/02/integrating-v8/
http://egloos.zum.com/haejung/v/1123470
http://namocom.tistory.com/218
http://funnylog.kr/354

Monday, November 30, 2015

NoSQL Systems and the CAP Theorem

Source: http://blog.nahurst.com/visual-guide-to-nosql-systems



  • Consistency means that each client always has the same view of the data.
  • Availability means that all clients can always read and write.
  • Partition tolerance means that the system works well across physical network partitions.
According to the CAP theorem, you can pick only two of these three guarantees.

In addition to CAP configurations, another significant way data management systems vary is by the data model they use: relational, key-value, column-oriented, or document-oriented (there are others, but these are the main ones).
  • Relational systems are the databases we've been using for a while now. RDBMSs and other systems that support ACID transactions and joins are considered relational.
  • Key-value systems basically support get, put, and delete operations based on a primary key (a minimal interface is sketched after this list).
  • Column-oriented systems still use tables but have no joins (joins must be handled within your application). Obviously, they store data by column as opposed to traditional row-oriented databases. This makes aggregations much easier.
  • Document-oriented systems store structured "documents" such as JSON or XML but have no joins (joins must be handled within your application). It's very easy to map data from object-oriented software to these systems.
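To make the key-value bullet above concrete, here is a minimal sketch (my own, not from the cited post) of the get/put/delete contract that key-value stores expose; real systems add persistence, replication, and key-based partitioning behind this same narrow interface.

    import java.util.Map;
    import java.util.Optional;
    import java.util.concurrent.ConcurrentHashMap;

    // The whole client-visible contract of a key-value store.
    interface KeyValueStore<K, V> {
        Optional<V> get(K key);
        void put(K key, V value);
        void delete(K key);
    }

    // In-memory stand-in; a real store persists, replicates and partitions by key.
    class InMemoryKeyValueStore<K, V> implements KeyValueStore<K, V> {
        private final Map<K, V> data = new ConcurrentHashMap<>();

        public Optional<V> get(K key) { return Optional.ofNullable(data.get(key)); }
        public void put(K key, V value) { data.put(key, value); }
        public void delete(K key) { data.remove(key); }
    }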
Now for the particulars of each CAP configuration and the systems that use each configuration:
Consistent, Available (CA) systems have trouble with partitions and typically deal with them via replication. Examples of CA systems include:
  • Traditional RDBMSs like Postgres, MySQL, etc (relational)
  • Vertica (column-oriented)
  • Aster Data (relational)
  • Greenplum (relational)
Consistent, Partition-Tolerant (CP) systems have trouble with availability while keeping data consistent across partitioned nodes. Examples of CP systems (per the cited guide) include:
  • BigTable, Hypertable, HBase (column-oriented)
  • MongoDB, Terrastore (document-oriented)
  • Redis, Scalaris, MemcacheDB, Berkeley DB (key-value)
Available, Partition-Tolerant (AP) systems achieve "eventual consistency" through replication and verification. Examples of AP systems (per the cited guide) include:
  • Dynamo, Voldemort, Tokyo Cabinet, KAI (key-value)
  • Cassandra (column-oriented)
  • SimpleDB, CouchDB, Riak (document-oriented)


Tuesday, September 1, 2015

Major TCP/IP and UDP Ports

 
▶ Port ranges used by the TCP/IP protocol suite
1) Well-known ports: 0 ~ 1023
2) Registered ports: 1024 ~ 49151
3) Dynamic/private ports: 49152 ~ 65535 (see the sketch after this list)
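A quick way to see the dynamic/private range in action (an illustration of mine, not part of the original list): binding a socket to port 0 asks the operating system to assign a free ephemeral port. The exact range is platform-dependent; Windows Vista and later default to 49152-65535, while many Linux distributions use 32768-60999.

    import java.net.ServerSocket;

    public class EphemeralPortDemo {
        public static void main(String[] args) throws Exception {
            // Port 0 means "let the OS pick a free dynamic/private port for me".
            try (ServerSocket socket = new ServerSocket(0)) {
                System.out.println("OS-assigned ephemeral port: " + socket.getLocalPort());
            }
        }
    }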

▶ Summary of well-known ports (0 ~ 1023)
Keyword | Port | Description
icmp | 8/tcp, 8/udp | -
- | 0/tcp, 0/udp | Unassigned
ftp-data | 20/tcp, 20/udp | File Transfer [Default Data]
ftp | 21/tcp, 21/udp | File Transfer [Control]
ssh | 22/tcp, 22/udp | SSH Remote Login Protocol
telnet | 23/tcp, 23/udp | Telnet
smtp | 25/tcp, 25/udp | Simple Mail Transfer
domain | 53/tcp, 53/udp | Domain Name Server
whois++ | 63/tcp, 63/udp | whois++
tftp | 69/tcp, 69/udp | Trivial File Transfer
gopher | 70/tcp, 70/udp | Gopher
finger | 79/tcp, 79/udp | Finger
www | 80/tcp, 80/udp | World Wide Web HTTP
pop3 | 110/tcp, 110/udp | Post Office Protocol - Version 3
ntp | 123/tcp, 123/udp | Network Time Protocol
epmap | 135/tcp, 135/udp | DCE endpoint resolution
profile | 136/tcp, 136/udp | PROFILE Naming System
netbios-ns | 137/tcp, 137/udp | NETBIOS Name Service
netbios-dgm | 138/tcp, 138/udp | NETBIOS Datagram Service
netbios-ssn | 139/tcp, 139/udp | NETBIOS Session Service
imap | 143/tcp, 143/udp | Internet Message Access Protocol
snmp | 161/tcp, 161/udp | SNMP
namp | 167/tcp, 167/udp | NAMP
imap3 | 220/tcp, 220/udp | Interactive Mail Access Protocol v3
ldap | 389/tcp, 389/udp | Lightweight Directory Access Protocol
https | 443/tcp, 443/udp | http protocol over TLS/SSL
shell | 514/tcp | cmd
syslog | 514/udp | syslog
printer | 515/tcp, 515/udp | spooler
ftps-data | 989/tcp, 989/udp | ftp protocol, data, over TLS/SSL
ftps | 990/tcp, 990/udp | ftp protocol, control, over TLS/SSL
telnets | 992/tcp, 992/udp | telnet protocol over TLS/SSL
imaps | 993/tcp, 993/udp | imap4 protocol over TLS/SSL
pop3s | 995/tcp, 995/udp | pop3 protocol over TLS/SSL (was spop3)
▶ Summary of registered ports (1024 ~ 49151)
Keyword | Port | Description
ms-sql-s | 1433/tcp, 1433/udp | Microsoft-SQL-Server
ms-sql-m | 1434/tcp, 1434/udp | Microsoft-SQL-Monitor
sybase-sqlany | 1498/tcp, 1498/udp | Sybase SQL Any
atm-zip-office | 1520/tcp, 1520/udp | atm zip office
ncube-lm | 1521/tcp, 1521/udp | nCube License Manager
ricardo-lm | 1522/tcp, 1522/udp | Ricardo North America License Manager
cichild-lm | 1523/tcp, 1523/udp | cichild
ingreslock | 1524/tcp, 1524/udp | ingres
orasrv | 1525/tcp, 1525/udp | oracle
sybasedbsynch | 2439/tcp, 2439/udp | SybaseDBSynch
sybaseanywhere | 2638/tcp, 2638/udp | Sybase Anywhere
ms-wbt-server | 3389/tcp, 3389/udp | MS WBT Server
http-alt | 8080/tcp, 8080/udp | HTTP Alternate (see port 80)

▶ Summary of worm/virus ports that should be blocked at the firewall
Port | Original port use | Keyword
69/udp | TFTP | Nachi worm, Blaster worm
80/udp | web server | Nachi worm
135/tcp, 135/udp | NETBios | Nachi worm, Blaster worm
137/udp | NETBios | Nachi worm, Blaster worm
138/udp | NETBios | Nachi worm, Blaster worm
139/tcp | NETBios | Nachi worm, Blaster worm
443/tcp, 443/udp | HTTPS | Slapper worm
445/tcp | NETBios | Nachi worm, Blaster worm
514/tcp | SHELL | RPC Backdoor
515/tcp, 515/udp | LPRng | Red worm
593/tcp | http-rpc-epmap, HTTP RPC Ep Map | Nachi worm, Blaster worm
1008/udp | - | LiOn worm
1243/tcp | - | SchoolBus Backdoor
1433/tcp, 1433/udp | ms-sql-s, Microsoft-SQL-Server | W32.Slammer worm
1434/tcp, 1434/udp | ms-sql-m, Microsoft-SQL-Monitor | W32.Slammer worm
3385/tcp | qnxnetman | Net-Worm.Win32.Mytob.dc
4444/tcp | krb524 | Blaster worm, Welchia worm
6667/tcp, 6667/udp | ircu (6665-6669/tcp IRCU) | Welchia worm
6668/tcp, 6668/udp | ircu (6665-6669/tcp IRCU) | Welchia worm
6669/tcp, 6669/udp | ircu (6665-6669/tcp IRCU) | Welchia worm
10008/tcp, 10008/udp | - | LiOn worm
54321/tcp | - | SchoolBus Backdoor
17300/tcp | - | Kuang2 virus
30999/tcp | - | Kuang2 virus
27374/tcp, 27374/udp | - | SubSeven Backdoor

▶ Messenger-related ports (grouped by service)

MSN
  Servers: 64.4.130.0/24, 207.46.104.0/24, 207.46.106.0/24, 207.46.107.0/24, 207.46.108.0/24, 207.46.110.0/24
  TCP 1863, 80 : tries 1863 first; if blocked, falls back to 80
  TCP 6891-6900 : file transfer
  UDP 6901 : voice chat
  UDP 1863, 5190 : Microsoft Network Messenger

Yahoo
  Servers: 216.136.233.152/32, 216.136.233.153/32, 216.136.175.144/32, 216.136.224.143/32, 66.163.173.203/32, 216.136.233.133/32, 216.136.233.148/32, 66.163.173.201/32, 216.136.224.213/32
  TCP 5050, 5101 : tries 5050 first; if blocked, keeps changing ports
  TCP 5000-5001 : voice chat
  TCP 5100 : video chat

Nate On
  Servers: 203.226.253.75/32, 203.226.253.135/32, 203.226.253.82/32
  TCP 5004-5010 : tries the default ports 5004-5010 first; if blocked, keeps changing ports
  TCP 80, 83, 7003 : web content and sending text messages

Daum
  Server: 211.233.29.78/32
  TCP 8062

SayClub
  Server: 211.233.47.20/32

AOL
  TCP 5190 : AOL Instant Messenger (also used by ICQ)
  UDP 4000 : ICQ_locator

Dreamwize
  Servers: 211.39.128.236/32, 211.39.128.184/32
  TCP 10000

버디버디 (BuddyBuddy)
  TCP 810, 940, 950

케이친구
  TCP 7979

천리안 (Chollian)
  TCP 1420
  TCP 4949, 8989 : file transfer

ICQ
  TCP 5190

UIN
  TCP 8080

Genile
  TCP 10000

▶ P2P-related ports
Service | TCP | UDP
소리바다 (Soribada) | 22322, 22323, 7675 | 22321, 7674
당나귀 (eDonkey) | 4661, 4662, 4665 | 8719, 4665, 4672
구루구루 (GuruGuru) | 9292, 9293, 8282, 31200 | -
Direct | 411-412 | 411-412
Gnutella | 6346, 6347 | -
GoBoogy | - | 5325
Hotline | 5497, 5498, 5500, 5501, 5503 | -
KaZaA | 1214 | -
Madster | 23172, 9922 | -
Maniac | 2000, 2222 | 2010
V-Share | 8401-8404 | 8401-8404
shareshare | 6399, 6777 | -
WINMX | 6699 | 6257
엔유 | 8185, 8184 | -
파일구리 (Fileguri) | 9493 | 9493
파일피아 | 8090-8091 | -
iMash | 5000 | -
BitTorrent | 6881, 6889 | -
Guntella-Morpheus | 6346-6347 | 6346-6347
GuRuGuRu | 9292, 8282, 31200 | -
Madster-Aimster | 23172, 9922 | -
MiRC | 6667, 6665-6670, 7000 | -
Bluster | - | 41170
GoToMyPc | 8200 | -
Napster | 6600-6699, 4444, 5555, 6666, 7777, 8888, 8875 | -

▶ Game-related ports
Service | TCP | UDP
스타크래프트 (StarCraft) | 6112, 1156-1158 | 6112, 1156-1158