Deploying Hadoop 2.7.7 + Hive 2.3.4 on DiDi Cloud

1. The cluster architecture used in this example is as follows:

Here we use the private IPs of the DiDi Cloud instances. If Hadoop needs to be reachable from outside, bind a public IP (EIP) as well. For details on DiDi Cloud EIPs, see:
https://help.didiyun.com/hc/kb/section/1035272/

  • The master node holds the distributed file system metadata (for example the inode table) and the resource scheduler and its records. It runs two daemons:
    NameNode: manages the distributed file system and records where each data block is stored in the cluster.
    ResourceManager: schedules resources on the data nodes (node1 and node2 in this example); each data node runs a NodeManager that carries out the actual work.

  • node1 and node2 store the actual data and provide compute resources. Each runs two daemons:
    DataNode: manages the physical storage of the actual data.
    NodeManager: manages the execution of compute tasks on its node.

2. System configuration

The DiDi Cloud virtual machines used in this example have the following specs:
2-core CPU, 4 GB RAM, 40 GB HDD, 3 Mbps bandwidth, CentOS 7.4

  • For security reasons, DiDi Cloud instances do not allow direct root login by default: log in as dc2-user first, then switch to root with sudo su. In this example all commands are run as dc2-user, which is also used as the Hadoop user.

  • Write the IPs and hostnames of all three nodes into /etc/hosts on each node, and comment out the first three lines.

    sudo vi /etc/hosts
    #127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
    #::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
    #127.0.0.1 10-254-149-24
    10.254.149.24 master
    10.254.88.218 node1
    10.254.84.165 node2
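
To confirm that the new entries resolve as expected, a quick check such as the following can be run on each node (getent reads /etc/hosts; these commands are not part of the original steps):

    getent hosts master node1 node2
    ping -c 1 node1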

  • The master node needs passwordless SSH (key-pair) access to node1 and node2. Generate a key pair for dc2-user on the master node.

    ssh-keygen -b 4096
    Generating public/private rsa key pair.
    Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
    Created directory '/home/hadoop/.ssh'.
    Enter passphrase (empty for no passphrase):
    Enter same passphrase again:
    Your identification has been saved in /home/hadoop/.ssh/id_rsa.
    Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
    The key fingerprint is:
    SHA256:zRhhVpEfSIZydqV75775sZB0GBjZ/f7nnZ4mgfYrWa8 hadoop@10-254-149-24
    The key's randomart image is:
    +---[RSA 4096]----+
    | ++=*+ . |
    | .o+o+o+. . |
    | +...o o .|
    | = .. o .|
    | S + oo.o |
    | +.=o .|
    | . +o+..|
    | o +.+O|
    | .EXO=|
    +----[SHA256]-----+

Run the following commands to copy the generated public key to all three nodes:

ssh-copy-id -i $HOME/.ssh/id_rsa.pub dc2-user@master
ssh-copy-id -i $HOME/.ssh/id_rsa.pub dc2-user@node1
ssh-copy-id -i $HOME/.ssh/id_rsa.pub dc2-user@node2

From the master, run ssh dc2-user@node1 and ssh dc2-user@node2 to verify that you can connect without entering a password.
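
A non-interactive variant of that check might look like the following sketch (BatchMode makes ssh fail instead of prompting if key authentication is not working):

    for h in master node1 node2; do ssh -o BatchMode=yes dc2-user@$h hostname; done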

  • Configure the Java environment

Download the JDK on all three nodes.

mkdir /home/dc2-user/java
cd /home/dc2-user/java
wget --no-check-certificate --no-cookies --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/8u191-b12/2787e4a523244c269598db4e85c51e0c/jdk-8u191-linux-x64.tar.gz
tar -zxf jdk-8u191-linux-x64.tar.gz

Configure the Java environment variables on all three nodes.

sudo vi /etc/profile.d/jdk-1.8.sh
export JAVA_HOME=/home/dc2-user/java/jdk1.8.0_191
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH

Apply the environment variables.

source /etc/profile

Check the Java version.

java -version
java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)

This output confirms that the Java environment is configured correctly.

3. Install Hadoop

Download Hadoop 2.7.7 on the master node and extract it.

cd /home/dc2-user
wget http://mirrors.shu.edu.cn/apache/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz
tar zxf hadoop-2.7.7.tar.gz

Six files under /home/dc2-user/hadoop-2.7.7/etc/hadoop need to be configured: hadoop-env.sh, core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml, and slaves.

Add the following to hadoop-env.sh:

export JAVA_HOME=/home/dc2-user/java/jdk1.8.0_191
export HDFS_NAMENODE_USER="dc2-user"
export HDFS_DATANODE_USER="dc2-user"
export HDFS_SECONDARYNAMENODE_USER="dc2-user"
export YARN_RESOURCEMANAGER_USER="dc2-user"
export YARN_NODEMANAGER_USER="dc2-user"
  • core-site.xml

    <configuration>
        <property>
            <name>fs.default.name</name>
            <value>hdfs://master:9000</value>
        </property>
    </configuration>
  • hdfs-site.xml

    <configuration>
        <property>
            <name>dfs.namenode.name.dir</name>
            <value>/home/dc2-user/data/nameNode</value>
        </property>
        <property>
            <name>dfs.datanode.data.dir</name>
            <value>/home/dc2-user/data/dataNode</value>
        </property>
        <property>
            <name>dfs.replication</name>
            <value>1</value>
        </property>
        <property>
            <name>dfs.http.address</name>
            <value>master:50070</value>
        </property>
    </configuration>

  • yarn-site.xml


    <configuration>
        <property>
            <name>yarn.acl.enable</name>
            <value>0</value>
        </property>
        <property>
            <name>yarn.resourcemanager.hostname</name>
            <value>master</value>
        </property>
        <property>
            <name>yarn.resourcemanager.webapp.address</name>
            <value>master:8088</value>
        </property>
        <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
        </property>
        <property>
            <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
            <value>org.apache.hadoop.mapred.ShuffleHandler</value>
        </property>
    </configuration>

  • mapred-site.xml


    <configuration>
        <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
        </property>
        <property>
            <name>yarn.app.mapreduce.am.env</name>
            <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
        </property>
        <property>
            <name>mapreduce.map.env</name>
            <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
        </property>
        <property>
            <name>mapreduce.reduce.env</name>
            <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
        </property>
        <property>
            <name>mapreduce.map.memory.mb</name>
            <value>1536</value>
        </property>
        <property>
            <name>mapreduce.map.java.opts</name>
            <value>-Xmx1024M</value>
        </property>
        <property>
            <name>mapreduce.reduce.memory.mb</name>
            <value>3072</value>
        </property>
        <property>
            <name>mapreduce.reduce.java.opts</name>
            <value>-Xmx2560M</value>
        </property>
        <property>
            <name>mapreduce.jobhistory.address</name>
            <value>master:10020</value>
        </property>
        <property>
            <name>mapreduce.jobhistory.webapp.address</name>
            <value>master:19888</value>
        </property>
    </configuration>

  • Edit the slaves file and list the data nodes:

    node1
    node2
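
hdfs-site.xml above stores the NameNode metadata and DataNode blocks under /home/dc2-user/data. Hadoop normally creates these paths itself when HDFS is formatted and the daemons start, but they can also be created up front (an optional precaution, not part of the original steps):

    # on master (NameNode metadata directory)
    mkdir -p /home/dc2-user/data/nameNode
    # on node1 and node2 (DataNode block storage)
    mkdir -p /home/dc2-user/data/dataNode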

4. Start Hadoop

  • Copy the configured Hadoop directory to node1 and node2:

    scp -r /home/dc2-user/hadoop-2.7.7 dc2-user@node1:/home/dc2-user/
    scp -r /home/dc2-user/hadoop-2.7.7 dc2-user@node2:/home/dc2-user/

  • Configure the Hadoop environment variables (on all three nodes)

    sudo vi /etc/profile.d/hadoop-2.7.7.sh
    export HADOOP_HOME="/home/dc2-user/hadoop-2.7.7"
    export PATH="$HADOOP_HOME/bin:$PATH"
    export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
    export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop

Apply the environment variables.

source /etc/profile

Run hadoop version on all three nodes and check that it produces output, to verify that the environment variables took effect:

hadoop version
Hadoop 2.7.7
Subversion Unknown -r c1aad84bd27cd79c3d1a7dd58202a8c3ee1ed3ac
Compiled by stevel on 2018-07-18T22:47Z
Compiled with protoc 2.5.0
From source with checksum 792e15d20b12c74bd6f19a1fb886490
This command was run using /home/dc2-user/hadoop-2.7.7/share/hadoop/common/hadoop-common-2.7.7.jar
  • Format HDFS (on the master node only)

    /home/dc2-user/hadoop-2.7.7/bin/hdfs namenode -format testCluster

  • Start the services

    /home/dc2-user/hadoop-2.7.7/sbin/start-dfs.sh
    /home/dc2-user/hadoop-2.7.7/sbin/start-yarn.sh
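
mapred-site.xml above also points the JobHistory service at master:10020/19888. That service is not started by start-dfs.sh or start-yarn.sh; if its web UI is wanted, a separate (optional) step along these lines should work:

    /home/dc2-user/hadoop-2.7.7/sbin/mr-jobhistory-daemon.sh start historyserver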

  • Check that the daemons have started on all three nodes

master

jps
1654 Jps
31882 NameNode
32410 ResourceManager
32127 SecondaryNameNode

node1

jps
19827 NodeManager
19717 DataNode
20888 Jps

node2

jps
30707 Jps
27675 NodeManager
27551 DataNode

If you see output like the above, the services have started normally. The ResourceManager web UI can then be reached via the master's public IP; note that port 8088 must be opened in the security group. For details on DiDi Cloud security groups, see: https://help.didiyun.com/hc/kb/article/1091031/

Note: exposing port 8088 to the public internet can be abused (for example to plant malware), so it is recommended to restrict the allowed source IPs in the security group, or not to open the port at all.
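
Once the EIP is bound and your own source IP is allowed on port 8088, a quick reachability check from your workstation could be (replace <EIP> with the actual public IP):

    curl -sI http://<EIP>:8088/cluster | head -n 1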

5. Verify with an example job

Finally, use the WordCount program that ships with Hadoop to verify MapReduce. The following is done on the master node.
First, create two files, test1 and test2, in the current directory with the following contents:

vi test1
hello world
bye world
vi test2
hello hadoop
bye hadoop

Next, create a directory in HDFS and upload the two files into it.

hadoop fs -mkdir /input
hadoop fs -put test* /input
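
Optionally, confirm that both files landed in HDFS before submitting the job:

    hadoop fs -ls /input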

When the cluster starts up it first enters safe mode, so leave safe mode before running the job.

hdfs dfsadmin -safemode leave
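
A related (optional) check is to query the safe-mode state, so you can confirm it reports OFF before submitting the job:

    hdfs dfsadmin -safemode get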

Run the WordCount program to count how many times each word appears in the two files.

yarn jar /home/dc2-user/hadoop-2.7.7/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.7.jar wordcount /input /output

WARNING: YARN_CONF_DIR has been replaced by HADOOP_CONF_DIR. Using value of YARN_CONF_DIR.
2018-11-09 20:27:12,233 INFO client.RMProxy: Connecting to ResourceManager at master/10.254.149.24:8032
2018-11-09 20:27:12,953 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1541766351311_0001
2018-11-09 20:27:14,483 INFO input.FileInputFormat: Total input files to process : 2
2018-11-09 20:27:16,967 INFO mapreduce.JobSubmitter: number of splits:2
2018-11-09 20:27:17,014 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enab
2018-11-09 20:27:17,465 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1541766351311_0001
2018-11-09 20:27:17,466 INFO mapreduce.JobSubmitter: Executing with tokens: []
2018-11-09 20:27:17,702 INFO conf.Configuration: resource-types.xml not found
2018-11-09 20:27:17,703 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2018-11-09 20:27:18,256 INFO impl.YarnClientImpl: Submitted application application_1541766351311_0001
2018-11-09 20:27:18,296 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1541766351311_0001/
2018-11-09 20:27:18,297 INFO mapreduce.Job: Running job: job_1541766351311_0001
2018-11-09 20:28:24,929 INFO mapreduce.Job: Job job_1541766351311_0001 running in uber mode : false
2018-11-09 20:28:24,931 INFO mapreduce.Job:  map 0% reduce 0%
2018-11-09 20:28:58,590 INFO mapreduce.Job:  map 50% reduce 0%
2018-11-09 20:29:19,437 INFO mapreduce.Job:  map 100% reduce 0%
2018-11-09 20:29:33,038 INFO mapreduce.Job:  map 100% reduce 100%
2018-11-09 20:29:36,315 INFO mapreduce.Job: Job job_1541766351311_0001 completed successfully
2018-11-09 20:29:36,619 INFO mapreduce.Job: Counters: 54
    File System Counters
        FILE: Number of bytes read=75
        FILE: Number of bytes written=644561
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=237
        HDFS: Number of bytes written=31
        HDFS: Number of read operations=11
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters 
        Killed map tasks=1
        Launched map tasks=3
        Launched reduce tasks=1
        Data-local map tasks=3
        Total time spent by all maps in occupied slots (ms)=164368
        Total time spent by all reduces in occupied slots (ms)=95475
        Total time spent by all map tasks (ms)=82184
        Total time spent by all reduce tasks (ms)=31825
        Total vcore-milliseconds taken by all map tasks=82184
        Total vcore-milliseconds taken by all reduce tasks=31825
        Total megabyte-milliseconds taken by all map tasks=168312832
        Total megabyte-milliseconds taken by all reduce tasks=97766400
    Map-Reduce Framework
        Map input records=5
        Map output records=8
        Map output bytes=78
        Map output materialized bytes=81
        Input split bytes=190
        Combine input records=8
        Combine output records=6
        Reduce input groups=4
        Reduce shuffle bytes=81
        Reduce input records=6
        Reduce output records=4
        Spilled Records=12
        Shuffled Maps =2
        Failed Shuffles=0
        Merged Map outputs=2
        GC time elapsed (ms)=2230
        CPU time spent (ms)=2280
        Physical memory (bytes) snapshot=756064256
        Virtual memory (bytes) snapshot=10772656128
        Total committed heap usage (bytes)=541589504
        Peak Map Physical memory (bytes)=281268224
        Peak Map Virtual memory (bytes)=3033423872
        Peak Reduce Physical memory (bytes)=199213056
        Peak Reduce Virtual memory (bytes)=4708827136
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters 
        Bytes Read=47
    File Output Format Counters 
        Bytes Written=31

Output like the above means the job completed; the result is saved in the /output directory in HDFS.

hadoop fs -ls /output
Found 2 items
-rw-r--r--   1 root supergroup          0 2018-11-09 20:29 /output/_SUCCESS
-rw-r--r--   1 root supergroup         31 2018-11-09 20:29 /output/part-r-00000

View the result in part-r-00000:

hadoop fs -cat /output/part-r-00000
bye    2
hadoop    2
hello    2
world    2

6. Install and configure Hive 2.3.4

Hive is a data warehouse built on top of Hadoop. It maps structured data files to tables and provides SQL-like querying; under the hood, Hive translates SQL statements into MapReduce jobs.
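
The original does not show downloading Hive itself; a minimal sketch, assuming the Apache archive mirror for the 2.3.4 binary release, would be:

    cd /home/dc2-user
    wget https://archive.apache.org/dist/hive/hive-2.3.4/apache-hive-2.3.4-bin.tar.gz
    tar zxf apache-hive-2.3.4-bin.tar.gz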

Edit the /etc/profile file and add the following:

sudo vi /etc/profile
export HIVE_HOME=/home/dc2-user/apache-hive-2.3.4-bin
export PATH=$PATH:$HIVE_HOME/bin

Apply the environment variables:

source /etc/profile
  • Configure Hive

Create the configuration files from the provided templates:

cd apache-hive-2.3.4-bin/conf/
cp hive-env.sh.template hive-env.sh 
cp hive-default.xml.template hive-site.xml 
cp hive-log4j2.properties.template hive-log4j2.properties 
cp hive-exec-log4j2.properties.template hive-exec-log4j2.properties

Edit hive-env.sh:

export JAVA_HOME=/home/dc2-user/java/jdk1.8.0_191  ## Java path
export HADOOP_HOME=/home/dc2-user/hadoop-2.7.7   ## Hadoop installation path
export HIVE_HOME=/home/dc2-user/apache-hive-2.3.4-bin ## Hive installation path
export HIVE_CONF_DIR=$HIVE_HOME/conf    ## Hive configuration directory

Edit hive-site.xml and change the value of each of the following properties:

vi hive-site.xml
  <property>
    <name>hive.exec.scratchdir</name>
    <value>/tmp/hive-${user.name}</value>
    <description>HDFS root scratch dir for Hive jobs which gets 
    created with write all (733) permission. For each connecting user, 
    an HDFS scratch dir: ${hive.exec.scratchdir}/<username> is created, 
    with ${hive.scratch.dir.permission}.
    </description>
  </property>
  <property>
    <name>hive.exec.local.scratchdir</name>
    <value>/tmp/${user.name}</value>
    <description>Local scratch space for Hive jobs</description>
  </property>
  <property>
    <name>hive.downloaded.resources.dir</name>
    <value>/tmp/hive/resources</value>
    <description>Temporary local directory for added resources in the remote 
    file system.</description>
  </property>
  <property>
    <name>hive.querylog.location</name>
    <value>/tmp/${user.name}</value>
    <description>Location of Hive run time structured log file</description>
  </property>
  <property>
    <name>hive.server2.logging.operation.log.location</name>
    <value>/tmp/${user.name}/operation_logs</value>
    <description>Top level directory where operation logs are stored if logging functionality is enabled</description>
  </property>

Configure the Hive Metastore
The Hive Metastore holds the metadata for Hive tables and partitions; in this example MariaDB is used to store it.
Place mysql-connector-java-5.1.40-bin.jar into $HIVE_HOME/lib and configure the MySQL connection in hive-site.xml.

<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true&amp;characterEncoding=UTF-8&amp;useSSL=false</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive</value>
</property>
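
Fetching the connector jar is not shown above. One option is Maven Central, where the 5.1.40 artifact is published without the -bin suffix (treat the exact URL as an assumption to verify):

    wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/5.1.40/mysql-connector-java-5.1.40.jar \
         -P /home/dc2-user/apache-hive-2.3.4-bin/lib/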

Create HDFS directories for Hive.

start-dfs.sh   # can be skipped if HDFS is already running from the Hadoop setup above
hdfs dfs -mkdir /tmp
hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -chmod g+w /tmp
hdfs dfs -chmod g+w /user/hive/warehouse

Install MySQL; in this example MariaDB is used.

sudo yum install -y mariadb-server
sudo systemctl start mariadb
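
Optionally, MariaDB can be enabled at boot and hardened with the interactive script that ships with it (not part of the original steps):

    sudo systemctl enable mariadb
    sudo mysql_secure_installation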

Log in to MySQL (there is no password initially), create the hive user, and set its password.

mysql -uroot
MariaDB [(none)]> create user 'hive'@'localhost' identified by 'hive';
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> grant all privileges on *.* to hive@localhost identified by 'hive';
Query OK, 0 rows affected (0.00 sec)
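
As an optional sanity check, verify that the new account can log in before initializing the schema:

    mysql -uhive -phive -e "select current_user();"
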
  • Run Hive

HDFS must be running before Hive is started; it can be started with start-dfs.sh, and this step can be skipped if it is already running from the Hadoop setup above.
Starting with Hive 2.1, the schematool command must be run to initialize the metastore schema before Hive is started for the first time:

schematool -dbType mysql -initSchema

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/dc2-user/apache-hive-2.3.4-bin/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/dc2-user/hadoop-2.7.7/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Metastore connection URL:     jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true
Metastore Connection Driver :     com.mysql.jdbc.Driver
Metastore connection User:     hive
Starting metastore schema initialization to 2.3.0
Initialization script hive-schema-2.3.0.mysql.sql
Initialization script completed
schemaTool completed

Start Hive by entering the hive command:

hive

which: no hbase in (/home/dc2-user/java/jdk1.8.0_191/bin:/home/dc2-user/hadoop-2.7.7/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/usr/local/bin:/home/dc2-user/apache-hive-2.3.4-bin/bin:/home/dc2-user/.local/bin:/home/dc2-user/bin)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/dc2-user/apache-hive-2.3.4-bin/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/dc2-user/hadoop-2.7.7/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

Logging initialized using configuration in file:/home/dc2-user/apache-hive-2.3.4-bin/conf/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive> 
  • Test Hive

Create a table in Hive:

hive> create table test_hive(id int, name string)
    > row format delimited fields terminated by '\t' # fields are separated by a tab character
    > stored as textfile;  # storage format for the loaded data; the default is TEXTFILE. For plain-text data, use STORED AS TEXTFILE; the file can then be copied straight into HDFS and Hive can read it directly.
OK
Time taken: 10.857 seconds
hive> show tables;
OK
test_hive
Time taken: 0.396 seconds, Fetched: 1 row(s)

The table has been created successfully. Enter quit; to exit Hive, then create the data as a text file:

vi test_db.txt
101    aa
102    bb
103    cc

Start Hive again and load the data:

hive> load data local inpath '/home/dc2-user/test_db.txt' into table test_hive;
Loading data to table default.test_hive
OK
Time taken: 6.679 seconds

hive> select * from test_hive;
101    aa
102    bb
103    cc
Time taken: 2.814 seconds, Fetched: 3 row(s)

The data was loaded successfully and can be queried as expected.
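
As an optional further check that Hive really drives MapReduce (a plain select * is answered by a local fetch task without launching a job), an aggregate query can be submitted from the shell:

    hive -e "select count(*) from test_hive;"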