Using Amazon EMR and Tableau for Data Analysis and Visualization

By Zachary Bell, 2015-05-10 05:40

    Given the variety of formats and the sheer size of today's data, the Hadoop ecosystem provides a rich set of tools for analyzing data and extracting value from it. Initially, the Hadoop ecosystem focused on analyzing large volumes of data, with components such as MapReduce, Pig, and Hive. Today, Hadoop also offers a number of tools for interactive data querying, such as Impala, Drill, and Presto. This article shows how to use Amazon Elastic MapReduce (Amazon EMR) to analyze data stored in Amazon Simple Storage Service (Amazon S3), and how to use Tableau together with Impala to visualize it.

    Introduction to the tools

    Amazon Elastic MapReduce: Amazon EMR is a web service that lets users process huge amounts of data conveniently and cost-effectively. Amazon EMR uses Apache Hadoop, an open source framework, to distribute data processing across a cluster of Amazon Elastic Compute Cloud (Amazon EC2) instances.

    Impala: an open source tool in the Hadoop ecosystem, now available on EMR for interactive, ad hoc SQL queries. Instead of using the MapReduce engine as Hive does, Impala uses a massively parallel processing (MPP) engine similar to those found in traditional relational database management systems (RDBMS), which lets it achieve faster query response times.

    Although Hive and Impala both provide SQL-like capabilities and can share the same Metastore (a repository for table and partition metadata), they do not conflict: each plays its own role in the Hadoop ecosystem. Compared with Impala, Hive's batch-oriented execution has a much longer response time, which makes it poorly suited to interactive data analysis with tools such as Tableau. Impala has limitations of its own: because it relies on in-memory processing, it consumes a lot of memory, and the amount of data a single query can process is limited by the memory available in the cluster. Hive has no such restriction. On the same hardware it can handle larger data sets, which makes it better suited to ETL workloads on large data sets.

    Tableau: a business intelligence solution that combines data analysis and reporting into a continuous visual analysis process that is easy for users to learn and use. Tableau delivers fast analytics, visualization, and business intelligence, and can connect directly to AWS and other data sources. With the latest version of Tableau Desktop, users can use the dedicated Amazon EMR ODBC driver to connect to Hive or Impala running on Amazon EMR. For details on how to enable Amazon EMR as a Tableau data source, contact Tableau.

    Example analysis

    In this article, we show how to enable Amazon EMR as a data source in Tableau and how to connect Tableau to Impala for interactive visual analysis.

    Use Amazon EMR to analyze the Google Books n-grams

    The Google Books n-grams data set is freely available in Amazon S3 as one of the AWS Public Data Sets. N-grams are fixed-size tuples of items. Here, the items are words from the Google Books corpus. The "n" refers to the number of elements in a tuple, so a 5-gram consists of five words or characters.
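
    To make the definition concrete, here is a minimal Python sketch that builds n-grams from a list of words. The function name ngrams and the sample sentence are illustrative assumptions and are not part of the Google Books data set or its tooling.

```python
def ngrams(items, n):
    """Return all contiguous n-item tuples from a sequence of items."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence shorter than n yields an empty list.
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


# Hypothetical sample sentence, split into words.
words = "the quick brown fox jumps".split()

print(ngrams(words, 1))  # five 1-grams, one per word
print(ngrams(words, 5))  # a single 5-gram containing all five words
```

    The 1-gram data set used later in this article corresponds to the n = 1 case: every row describes a single word.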

    Traditionally, Apache Hadoop runs on HDFS, but it also supports Amazon S3 as a file system. Whereas Hive can query data stored in Amazon S3 directly, Impala requires the data to be in HDFS.

    The Google Books n-grams are already available in a Hadoop-friendly file format, with sizes up to 2.2 TB. The data set is stored in SequenceFile format with block-level LZO compression. The SequenceFile key is the row number, stored as a LongWritable; the SequenceFile value is the raw data, stored as a TextWritable.

    Because of the block-level LZO compression, the SequenceFile data needs a further conversion: Impala cannot create tables in this format or insert data into them, and can only query LZO-compressed Text tables. Hive natively supports LZO-compressed SequenceFile and can query external data stored in Amazon S3. Hive is therefore a good choice for converting the data in S3 into an Impala-supported format stored on HDFS.

    Launch an Amazon EMR cluster

    First, we need to launch an Amazon EMR cluster with Impala and Hive installed.

    1. Use the AWS CLI to launch the Amazon EMR cluster. If you have not used the CLI before, AWS provides instructions for installing and configuring it. The following statement uses the CLI to launch an Amazon EMR cluster and returns the cluster's unique identifier.

    aws emr create-cluster --name ImpalaCluster --ami-version 3.1.0 [...] --ec2-attributes KeyName=keyPairName,AvailabilityZone=availabilityZone --applications Name=Hive,Name=Impala --no-auto-terminate

    Note: before running this statement, replace the keyPairName and availabilityZone strings with appropriate values. In the following steps, replace the j-XXXXXXXXXXXX string with the unique identifier returned by this statement.

    2. After 5 to 10 minutes, the cluster finishes setting up and its state shows as WAITING. To check the cluster's initialization state, run the following command:

    aws emr describe-cluster --cluster-id j-XXXXXXXXXXXX --query 'Cluster.Status.State' --output text

    3. Once your cluster enters the "WAITING" state, use the following statement to connect to the master node.

    aws emr ssh --cluster-id j-XXXXXXXXXXXX --key-pair-file keyFilePath

    Note: replace the keyFilePath string with the path to your private key file.
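
    Instead of re-running the status command from step 2 by hand, the check can be wrapped in a polling loop. The following Python sketch is one possible way to do that, assuming the AWS CLI is installed and configured. The helper names describe_state_command, cluster_state, and wait_for_state, as well as the interval and attempt limits, are hypothetical choices and not part of the AWS CLI.

```python
import subprocess
import time


def describe_state_command(cluster_id):
    """Build the describe-cluster command from step 2 as an argument list."""
    return [
        "aws", "emr", "describe-cluster",
        "--cluster-id", cluster_id,
        "--query", "Cluster.Status.State",
        "--output", "text",
    ]


def cluster_state(cluster_id):
    """Run the AWS CLI and return the cluster state as plain text."""
    result = subprocess.run(
        describe_state_command(cluster_id),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def wait_for_state(get_state, target="WAITING",
                   interval_seconds=30, max_attempts=40, sleep=time.sleep):
    """Call get_state() repeatedly until it returns target.

    Returns True when the target state is seen, or False if
    max_attempts checks pass without seeing it.
    """
    for _ in range(max_attempts):
        if get_state() == target:
            return True
        sleep(interval_seconds)
    return False


# Example usage (requires AWS credentials and a real cluster identifier):
# ready = wait_for_state(lambda: cluster_state("j-XXXXXXXXXXXX"))
```

    The loop only looks for the target state. If the cluster fails to start, it keeps polling until the attempts run out, so a production script would probably also stop on failure states.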

    Create an external table from the data on Amazon S3

    Creating an external table lets Amazon EMR reference data in an external data source. Creating such a reference is simple, and no data is moved.

    1. After logging in to the master node, open the Hive shell:

    $ hive

    2. Use the CREATE TABLE statement to define the data source. In this example, we use the English 1-grams data set.


    hive> CREATE EXTERNAL TABLE eng_1M_1gram(token STRING, year INT, frequency INT, pages INT, books INT)

    [...] STORED AS SEQUENCEFILE LOCATION 's3://datasets.elasticmapreduce/ngrams/


    Create a table in HDFS

    To use Impala, we need to create a copy of the table stored in HDFS. In the copy, we use Parquet instead of the SequenceFile format. Parquet is a columnar binary file format designed for efficient querying at large scale.
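
    To illustrate why a columnar layout helps analytical queries, the following Python sketch pivots a few rows into per-column lists. It is only a conceptual illustration, not Parquet's actual on-disk encoding; the sample rows and the helper name to_columns are made up for this example.

```python
COLUMN_NAMES = ("token", "year", "frequency", "pages", "books")

# Hypothetical sample rows in row-oriented form: one tuple per record.
rows = [
    ("apple", 1900, 12, 10, 8),
    ("apple", 1901, 15, 11, 9),
    ("pear", 1900, 4, 4, 3),
]


def to_columns(records, names):
    """Pivot row tuples into a dict mapping each column name to its values."""
    return {name: [record[i] for record in records]
            for i, name in enumerate(names)}


columns = to_columns(rows, COLUMN_NAMES)

# An aggregate over one column only needs that column's values;
# the other four columns are never touched.
total_frequency = sum(columns["frequency"])
print(total_frequency)  # 31
```

    A query that aggregates a single column, such as the trend lines built later in Tableau, can skip the data of every other column when the storage is organized this way.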

    1. Create the copy of the table in Hive:

    hive> CREATE TABLE eng_1M_1gram_paraquet(token STRING, year INT, frequency INT, pages INT, books INT) ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe' STORED AS inputformat 'parquet.hive.DeprecatedParquetInputFormat' outputformat 'parquet.hive.DeprecatedParquetOutputFormat';


    2. Adjust the mapred.min.split.size setting, because the data is stored in Amazon S3 as a single file.

    hive> set mapred.min.split.size=134217728;

    This setting tells Hive to split the file into pieces of at least 128 MB for processing. This prevents the data from being processed by a single mapper, which would fail to take advantage of the distributed nature of MapReduce.

    3. Use a SELECT query to insert data into this table. We read the data from the original table and insert it into the new table.

    hive> INSERT OVERWRITE TABLE eng_1M_1gram_paraquet SELECT lower(token), year, frequency, pages, books FROM eng_1M_1gram WHERE year >= 1890 AND token REGEXP "^[A-Za-z+'-]+$";

    This query also demonstrates a typical use case: transforming the data so that downstream tools (such as Tableau) can use it more easily. The query first filters out data from before 1890. It also uses a regular expression to filter out tokens that contain anything other than letters and a few common punctuation characters. Finally, so that later queries can work with a uniform format, it uses a built-in function to convert all tokens to lower case.

    4. When all of the above steps are complete, exit Hive.
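
    The split-size value from step 2 and the filtering logic from step 3 can be checked outside Hive. The following Python sketch is an approximation under stated assumptions: the helper names, the 1 GB file size, and the sample rows are hypothetical; the split estimate assumes the split size ends up equal to the configured minimum; and Python's regular expression engine stands in for the Java engine that Hive uses.

```python
import math
import re


def mb_to_bytes(megabytes):
    """Convert megabytes (1 MB = 1024 * 1024 bytes) to bytes."""
    return megabytes * 1024 * 1024


def estimated_splits(file_size_bytes, split_size_bytes):
    """Rough number of input splits, assuming every split has the given size."""
    return max(1, math.ceil(file_size_bytes / split_size_bytes))


print(mb_to_bytes(128))  # 134217728, the value set in step 2
print(estimated_splits(mb_to_bytes(1024), mb_to_bytes(128)))  # 8

# Same character class as the REGEXP in step 3; fullmatch() plays the
# role of the ^ and $ anchors.
TOKEN_PATTERN = re.compile(r"[A-Za-z+'-]+")


def clean_rows(rows, min_year=1890):
    """Keep rows from min_year onward whose token matches, lower-casing the token."""
    cleaned = []
    for token, year, frequency, pages, books in rows:
        if year >= min_year and TOKEN_PATTERN.fullmatch(token):
            cleaned.append((token.lower(), year, frequency, pages, books))
    return cleaned


# Hypothetical sample rows: (token, year, frequency, pages, books).
sample = [
    ("Computer", 1905, 3, 3, 2),   # kept and lower-cased
    ("Computer", 1850, 1, 1, 1),   # dropped: before 1890
    ("3.14", 1950, 5, 5, 4),       # dropped: digits and a period
    ("Don't", 1920, 7, 6, 5),      # kept: apostrophe is allowed
]
print(clean_rows(sample))
```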

    Make the table copy available in Impala

    Here, Hive and Impala use the same Metastore. Therefore, before the table can be queried in Impala, we need to update the metadata in Impala. The INVALIDATE METADATA statement marks the metadata as stale, so that Impala reloads it before the next query that needs it.

    1. Open the Impala shell.

    $ impala-shell

    2. Invalidate the metadata so that Impala picks up the table copy.

    impala> invalidate metadata;

    3. Quit the Impala shell, and close the SSH connection to the Amazon EMR cluster.

    Use Tableau to visualize data from Impala

    For the next step, you need Tableau Desktop installed on a Windows or Mac OS X host. If you have not installed Tableau, you can launch an Amazon EC2 instance and install Tableau on it for this example. You also need to take a few steps to make Amazon EMR available in Tableau. You can get help with this by contacting Tableau.

    1. On the host where Tableau Desktop is installed, install the ODBC driver that Tableau Desktop needs to connect to Impala on Amazon EMR.

    A. Download the driver.

    B. Unzip the downloaded file. This should create a folder called ImpalaODBC.

    C. Locate the package required for the installation:


    MacOSX: ImpalaODBC/

    D. Run the package above and follow the prompts to install the ODBC driver.

    2. Modify the Master security group of the Amazon EMR cluster so that Tableau can connect to the Impala server running on the cluster's master node.

    A. In the AWS Management Console, click the Amazon EC2 tab to open the Amazon EC2 console.

    B. In the navigation pane, select Security Groups under Network and Security.

    C. In the Security Groups list, select ElasticMapReduce-master.

    D. Near the bottom of the pane, click the Inbound tab.

    E. In the Port Range field, type 21050. Leave the Source field at its default value.

    F. Click Add Rule, and then click Apply Rule Changes.

    3. Follow Tableau's instructions to enable Amazon EMR as a data connection option in Tableau. After you select that option, you should see the page shown in the figure below.

    Figure 1

    4. Enter the DNS name of the master node in the Server field, and then click the Connect button. You can use the following statement to obtain the DNS name:

    aws emr describe-cluster --cluster-id j-XXXXXXXXXXXX --query 'Cluster.MasterPublicDnsName' --output text

    5. On the next page, select the default schema from the Schema drop-down box, drag the table named "eng_1m_1gram_paraquet" to the upper-left panel, and then click the Go to Worksheet button.

    Figure 2

    This opens a Tableau workbook with the Dimensions and Measures filled in automatically. We can now use Tableau with Impala running on Amazon EMR.
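
    If Tableau cannot connect in step 4, one thing worth checking is whether the Impala port opened in the security group in step 2 is reachable from the Tableau host. The following is a minimal Python sketch using only the standard socket module. The function name port_open is hypothetical, and the host name and port are values you supply yourself.

```python
import socket


def port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Example usage, with the master node's public DNS name and the port
# from the security group rule filled in by you:
# print(port_open("<master public DNS name>", <port>))
```

    A result of False only shows that the TCP connection failed. The cause may be the security group rule, a cluster that is not running, or a wrong host name.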

    Video demonstration

    Create an interactive visualization

    The video below demonstrates how to create some interactive visualizations with Tableau. First, we build a trend line of the number of books published each year.

    Video 1


    Create a filter

    This video demonstrates how to create a filter that lets the user choose a specific 1-gram for the trend line. The sudden increase in the 1-gram "computer" around 1905 is particularly interesting. If you have your own explanation for the increase, please leave a comment.

    Video 2


    Terminate the Amazon EMR cluster

    When you have finished all of the above steps, do not forget to terminate the Amazon EMR cluster in the AWS console. You can also do this with the CLI statement below:

    aws emr terminate-cluster --cluster-id j-XXXXXXXXXXXX

    If you launched the Amazon EMR cluster only to follow this demonstration, do not forget to shut it down.
