
Launching and Running an Amazon EMR Cluster inside a VPC (Part 1)

By Amber Bradley, 2015-05-18 16:56

    Amazon VPC has become the default environment for launching Amazon EC2 instances, so understanding how an Amazon EMR cluster operates inside Amazon VPC is crucially important. In this post, we will look at why you need to run a Hadoop cluster inside a VPC. Then we will build a new VPC and launch an EMR cluster in it. This is the first part of a two-part series; in the second part we will cover customizing DNS in detail.

    Hadoop Communication Requirements (1)

    From the Hadoop wiki we know that "for Hadoop to work properly, all machines must be aware of one another and able to communicate with each other; therefore, they need an identification mechanism so that other hosts can find them quickly." In EMR, hosts locate and talk to each other through the VPC settings for DNS resolution and DNS hostnames. With these default settings enabled, your instances are automatically assigned hostnames derived from their public and private IP addresses (with "-" replacing ".", for example ip-10-128-8-1.ec2.internal). Instances can then communicate through the EMR-managed security groups.
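    As a quick way to confirm both of these settings on an existing VPC, you can query them from the CLI (a minimal sketch; the VPC ID shown is the example one that appears later in this post):

    aws ec2 describe-vpc-attribute --vpc-id vpc-a33690c6 --attribute enableDnsSupport

    aws ec2 describe-vpc-attribute --vpc-id vpc-a33690c6 --attribute enableDnsHostnames

    Each call returns a Value field, and both should be true for an EMR cluster to work as described here.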

    Hadoop Communication Requirements (2)

    In Hadoop 1, the communication requirements were simple and quite forgiving. For example, if the NameNode could not resolve an HDFS DataNode to a fully qualified domain name, it would fall back to the IP address and continue communicating. As a result, barring environmental problems, a properly configured Hadoop 1 environment rarely broke down.

    As the Hadoop project matured, demand for a more robust Hadoop security model grew. Kerberos authentication, network encryption, and other features were added to Hadoop to help prevent rogue nodes from joining a cluster, such as the following HDFS configuration parameter:

    dfs.namenode.datanode.registration.ip-hostname-check
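    If you want to see the effective value of this parameter on a running cluster, a minimal check (assuming you are logged in to a cluster node with the Hadoop client on the PATH) is:

    hdfs getconf -confKey dfs.namenode.datanode.registration.ip-hostname-check

    In Hadoop 2 this defaults to true, which is what makes the resolution failure below fatal.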

    In Hadoop 2, a DataNode that the NameNode cannot resolve is refused communication:

    org.apache.hadoop.hdfs.server.protocol.DisallowedDatanodeException: Datanode denied communication with namenode

    In essence, upgrading from Hadoop 1 to Hadoop 2 did not change the technical requirements: for normal operation, the nodes still need to be aware of, and communicate with, one another. What the Hadoop 2 upgrade does change is that an improperly configured environment now surfaces as HDFS failing to load files, with UnknownHost, ConnectionRefused, and NoRouteToHost exceptions.
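    When debugging these exceptions, a quick sanity check is to verify name resolution from the node itself. A sketch, assuming a single-NameNode cluster with the standard Hadoop client and nslookup available:

    # Does this node know its own fully qualified name?
    hostname -f

    # Which host does HDFS think the NameNode is, and does it resolve?
    hdfs getconf -namenodes
    nslookup $(hdfs getconf -namenodes)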

    Setting Up the VPC

    An EMR cluster inside a VPC also needs to be able to interact with all of its nodes at all times. The cluster instances need to locate (via DNS) and interact with the EMR and Amazon S3 service endpoints (using permissive security groups and network ACLs). This supports cluster creation, CloudWatch monitoring, logging, and step submission. To accomplish all of this, an EMR cluster also needs an Internet gateway.

    Default routes to VGWs and NAT instances are not supported. Trying to launch a cluster in either of these scenarios will cause the cluster to terminate with an error: "The subnet configuration was invalid: Cannot find route to InternetGateway in main RouteTable rtb-036bc666 for vpc vpc-a33690c6"

    In addition, using network ACLs on an EMR cluster's subnet may interrupt communication. This is mainly because ACL rules are stateless: inbound and outbound traffic must both be explicitly allowed, unlike the stateful rules used by the EMR-managed security groups. If an EMR cluster is launched in a subnet whose ACL does not allow sufficient access, the cluster will be terminated. If you must use ACLs, use the EMR-managed security groups as a guide to the IP address ranges and ports the instances need for communication, as sketched below.
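    For illustration only, the following pair of stateless entries would mirror an allow-all posture on a subnet ACL (a sketch: acl-11aa22bb is a placeholder ID, and in practice you would narrow the CIDR and ports to match the EMR-managed security groups):

    aws ec2 create-network-acl-entry --network-acl-id acl-11aa22bb --ingress --rule-number 100 --protocol=-1 --rule-action allow --cidr-block 0.0.0.0/0

    aws ec2 create-network-acl-entry --network-acl-id acl-11aa22bb --egress --rule-number 100 --protocol=-1 --rule-action allow --cidr-block 0.0.0.0/0

    Because ACL rules are stateless, the inbound and outbound entries must each be written out explicitly.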

    Understanding DNS Resolution and DNS Hostnames

    Inside a VPC, nodes must use DNS lookups to locate one another. Otherwise, crucial daemons such as the YARN ResourceManager and the HDFS NameNode will run into problems. Given how important name resolution is to the cluster, any overlooked DNS problem will cost a great deal of administration time.

    Whether your instances automatically receive fully qualified DNS names is controlled by the VPC's DNS hostnames setting. If you are using the default configuration with AmazonProvidedDNS as the DNS server, disabling this setting also prevents your instances from resolving any hostname within the VPC subnet's CIDR range.

    The output below shows an instance with DNS hostnames enabled looking up another instance in the VPC:

    [ec2-user@ip-192-168-128-5 ~]$ nslookup 192.168.128.6

    Server: 192.168.0.2

    Address: 192.168.0.2#53

    Non-authoritative answer:

    6.128.168.192.in-addr.arpa name = ip-192-168-128-6.us-west-2.compute.internal.

    Authoritative answers can be found from:

    [ec2-user@ip-192-168-128-5 ~]$ nslookup ip-192-168-128-6.us-west-2.compute.internal

    Server: 192.168.0.2

    Address: 192.168.0.2#53

    Non-authoritative answer:

    Name: ip-192-168-128-6.us-west-2.compute.internal

    Address: 192.168.128.6

    [ec2-user@ip-192-168-128-5 ~]$

    Now disable DNS hostname support and rerun the same lookups:

    aws ec2 modify-vpc-attribute --vpc-id vpc-a33690c6 --no-enable-dns-hostnames

    [ec2-user@ip-192-168-128-5 ~]$ nslookup 192.168.128.6

    Server: 192.168.0.2

    Address: 192.168.0.2#53

    ** server can't find 6.128.168.192.in-addr.arpa.: NXDOMAIN

    [ec2-user@ip-192-168-128-5 ~]$ nslookup ip-192-168-128-6.us-west-2.compute.internal

    Server: 192.168.0.2

    Address: 192.168.0.2#53

    ** server can't find ip-192-168-128-6.us-west-2.compute.internal: NXDOMAIN

    In this state, neither the forward nor the reverse lookup can be performed. Note that external lookups still work:

    [ec2-user@ip-192-168-128-5 ~]$

    [ec2-user@ip-192-168-128-5 ~]$ nslookup www.google.com

    Server: 192.168.0.2

    Address: 192.168.0.2#53

    Non-authoritative answer:

    Name: www.google.com

    Address: 216.58.216.132

    Disabling DNS support entirely prevents us from resolving any lookup at all:

    aws ec2 modify-vpc-attribute --vpc-id vpc-a33690c6 --no-enable-dns-support

    [ec2-user@ip-172-31-15-59 ~]$ nslookup google.com

    ;; connection timed out; trying next origin

    ;; connection timed out; no servers could be reached

    When the crucial daemons hit DNS problems, they fail to start. EMR does not allow a cluster to continue unless all key systems are fully operational. The following logs illustrate the problem:

    SHUTDOWN_MSG: Shutting down ResourceManager at java.net.UnknownHostException: ip-192-168-128-13.hadoop.local: ip-192-168-128-13.hadoop.local: Name or service not known

    SHUTDOWN_MSG: Shutting down NameNode at java.net.UnknownHostException: ip-192-168-128-13.hadoop.local: ip-192-168-128-13.hadoop.local

    A cluster terminated because of DNS errors reports errors of the following kind:

    "On the master instance (i-b3b1e3bf), after bootstrap actions were run Hadoop failed to launch"

    Another example:

    "On 2 slave instances (including i-92f8aa9e and i-8cf8aa80), after bootstrap actions were run Hadoop failed to launch."

    EMR assumes your EC2 instances will be assigned the default internal hostnames (ec2.internal or region.compute.internal) or an IP address (if you have enabled DNS hostnames in your VPC configuration). When a custom domain name is introduced, Hadoop 1 will not raise an error, because Hadoop 1 does not require working DNS before its nodes start.

    In Hadoop 2, using a custom domain name causes the cluster configuration values (which are populated with hostnames) to fail to resolve. The cluster's daemons then fail to start and the cluster is terminated. Conversely, this means a Hadoop 2 cluster can use a custom domain name as long as the custom values can be resolved, and by running a DNS server inside the VPC we can easily make that happen.
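    As a preview of the second part of this series, the usual way to hand a VPC a custom domain name is through a DHCP options set. A sketch, where hadoop.local comes from the logs above but the DNS server address 10.20.30.5 and the dopt-12ab34cd ID are hypothetical placeholders:

    aws ec2 create-dhcp-options --dhcp-configurations "Key=domain-name,Values=hadoop.local" "Key=domain-name-servers,Values=10.20.30.5"

    aws ec2 associate-dhcp-options --dhcp-options-id dopt-12ab34cd --vpc-id vpc-a33690c6

    The associate command takes the dopt ID returned by the create command; for the cluster to survive, that DNS server must resolve every hostname the cluster configuration uses.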

    Building a VPC for EMR

    Now that we have some understanding of the common requirements of Hadoop and EMR, we can build a VPC in which to launch a cluster. You can do this through the VPC wizard in the console, or through the CLI steps below.

    Create a /24 VPC with a /28 subnet inside it.

    aws ec2 create-vpc --cidr-block 10.20.30.0/24

    {
     "Vpc": {
     "InstanceTenancy": "default",
     "State": "pending",
     "VpcId": "vpc-055ef660",
     "CidrBlock": "10.20.30.0/24",
     "DhcpOptionsId": "dopt-a8c1c9ca"
     }
    }

    Create a subnet in the VPC, specifying the VPC ID and a CIDR range for the subnet. In this case, we will use 10.20.30.0/28.

    aws ec2 create-subnet --vpc-id vpc-055ef660 --cidr-block 10.20.30.0/28

    The subnet ID and the available IP address count will be returned; take note of them.

    {

     "Subnet": {

     "VpcId": "vpc-055ef660",

     "CidrBlock": "10.20.30.0/28",

     "State": "pending",

     "AvailabilityZone": "us-west-2a",

     "SubnetId": "subnet-907af9f5",

     "AvailableIpAddressCount": 11

     }

    }

    We also need a public Internet gateway and a routing table. First, we issue the create-route-table command with the VPC ID. Afterwards, we will set up the default route with create-route.

    aws ec2 create-route-table --vpc-id vpc-055ef660

    Note the route table ID that is returned.

{

     "RouteTable": {

     "Associations": [],

     "RouteTableId": "rtb-4640f623",

     "VpcId": "vpc-055ef660",

     "PropagatingVgws": [],

     "Tags": [],

     "Routes": [

     {

     "GatewayId": "local",

     "DestinationCidrBlock": "10.20.30.0/24",

     "State": "active",

     "Origin": "CreateRouteTable"

     }

]

     }

    }

    Because EMR requires one, we next create a new Internet gateway.

    aws ec2 create-internet-gateway

    {

     "InternetGateway": {

     "Tags": [],

     "InternetGatewayId": "igw-24469141",

     "Attachments": []

     }

    }

    Note the Internet gateway ID that is returned. Next, we attach the Internet gateway to the VPC:


    aws ec2 attach-internet-gateway --internet-gateway-id igw-24469141 --vpc-id vpc-055ef660

    Using the Internet gateway ID and the route table ID from before, we make the Internet gateway the default route.

    aws ec2 create-route --route-table-id rtb-4640f623 --destination-cidr-block 0.0.0.0/0 --gateway-id igw-24469141
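    One step that is easy to miss: the subnet must actually be associated with this route table, otherwise it keeps using the VPC's main route table, which has no Internet gateway route and triggers the "Cannot find route to InternetGateway" error shown earlier. A sketch using the IDs created above:

    aws ec2 associate-route-table --route-table-id rtb-4640f623 --subnet-id subnet-907af9f5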

    We also need to check whether DNS hostnames are enabled, and enable them if not.

    aws ec2 describe-vpc-attribute --vpc-id vpc-055ef660 --attribute enableDnsHostnames

    {

     "VpcId": "vpc-055ef660",

     "EnableDnsHostnames": {

     "Value": false

     }

    }

    aws ec2 modify-vpc-attribute --vpc-id vpc-055ef660 --enable-dns-hostnames
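    While we are at it, it is worth confirming that DNS resolution itself (the enableDnsSupport attribute, which is on by default) has not been disabled; the check follows the same pattern:

    aws ec2 describe-vpc-attribute --vpc-id vpc-055ef660 --attribute enableDnsSupport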

    Finally, all the prerequisites for launching a cluster in this VPC are in place. We can launch a test cluster with the following command and use it to run a word count. Reminder: when testing, change the output address in the code below to your own S3 bucket.

    aws emr create-cluster --steps Type=STREAMING,Name='Streaming Program',ActionOnFailure=CONTINUE,Args=[-files,s3://elasticmapreduce/samples/wordcount/wordSplitter.py,-mapper,wordSplitter.py,-reducer,aggregate,-input,s3://elasticmapreduce/samples/wordcount/input,-output,s3://<mybucket>/wordcount/output] --ec2-attributes SubnetId=subnet-907af9f5 --ami-version 3.3.2 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge --auto-terminate

    The cluster ID is returned.

    {

     ClusterId": "j-2TEFHMDR3LXWD"

    }

    After a few minutes, check whether the cluster has completed the word count.

    aws emr describe-cluster --cluster-id j-2TEFHMDR3LXWD --query Cluster.Status.StateChangeReason.Message

    "Steps completed
