Here we use the DiDi Cloud instances' private IPs. If Hadoop needs to be reachable from outside, bind a public IP (EIP). For DiDi Cloud EIP usage, see the following link:
https://help.didiyun.com/hc/kb/section/1035272/
The master node holds the distributed file system's metadata, such as the inode table, along with the resource scheduler and its records. It runs two daemons:
NameNode: manages the distributed file system and records where each data block lives in the cluster.
ResourceManager: schedules resources on the data nodes (node1 and node2 in this example); each data node runs a NodeManager that carries out the actual work.
node1 and node2 store the actual data and provide compute resources. Each runs two daemons:
DataNode: manages the physical storage of the actual data.
NodeManager: manages the execution of compute tasks on its own node.
The DiDi Cloud VMs used in this example have the following configuration:
2-core CPU, 4 GB RAM, 40 GB HDD, 3 Mbps bandwidth, CentOS 7.4
For security reasons, DiDi Cloud instances do not allow direct root login by default: log in as dc2-user first, then switch to root with sudo su. In this example all commands are run as dc2-user, which is also the default Hadoop user.
Write the IPs and hostnames of the three nodes into /etc/hosts on each node, and comment out the first three lines.
sudo vi /etc/hosts
#127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
#::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
#127.0.0.1 10-254-149-24
10.254.149.24 master
10.254.88.218 node1
10.254.84.165 node2
The master node needs key-based SSH access to node1 and node2. Generate a key pair for dc2-user on the master node.
ssh-keygen -b 4096
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:zRhhVpEfSIZydqV75775sZB0GBjZ/f7nnZ4mgfYrWa8 hadoop@10-254-149-24
The key's randomart image is:
+---[RSA 4096]----+
| ++=*+ . |
| .o+o+o+. . |
| +...o o .|
| = .. o .|
| S + oo.o |
| +.=o .|
| . +o+..|
| o +.+O|
| .EXO=|
+----[SHA256]-----+
Run the following commands to copy the generated public key to all three nodes:
ssh-copy-id -i $HOME/.ssh/id_rsa.pub dc2-user@master
ssh-copy-id -i $HOME/.ssh/id_rsa.pub dc2-user@node1
ssh-copy-id -i $HOME/.ssh/id_rsa.pub dc2-user@node2
Then, on master, run ssh dc2-user@node1 and ssh dc2-user@node2 to verify that the connections succeed without a password prompt.
Download the JDK on all three nodes.
mkdir /home/dc2-user/java
cd /home/dc2-user/java
wget --no-check-certificate --no-cookies --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/8u191-b12/2787e4a523244c269598db4e85c51e0c/jdk-8u191-linux-x64.tar.gz
tar -zxf jdk-8u191-linux-x64.tar.gz
Configure the Java environment variables on all three nodes.
sudo vi /etc/profile.d/jdk-1.8.sh
export JAVA_HOME=/home/dc2-user/java/jdk1.8.0_191
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH
Apply the environment variables.
source /etc/profile
Check the Java version.
java -version
java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)
This shows the Java environment is configured correctly.
Download Hadoop 2.7.7 on the master node and unpack it.
cd /home/dc2-user
wget http://mirrors.shu.edu.cn/apache/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz
tar zxf hadoop-2.7.7.tar.gz
The 6 files to configure under /home/dc2-user/hadoop-2.7.7/etc/hadoop are hadoop-env.sh, core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml, and workers (note: workers is the Hadoop 3.x name; in Hadoop 2.x releases such as 2.7.7 this file is called slaves).
hadoop-env.sh — add the following (the *_USER variables are only honored by the Hadoop 3.x start scripts and are harmlessly ignored by 2.7.7):
export JAVA_HOME=/home/dc2-user/java/jdk1.8.0_191
export HDFS_NAMENODE_USER="dc2-user"
export HDFS_DATANODE_USER="dc2-user"
export HDFS_SECONDARYNAMENODE_USER="dc2-user"
export YARN_RESOURCEMANAGER_USER="dc2-user"
export YARN_NODEMANAGER_USER="dc2-user"
core-site.xml (fs.default.name is the legacy name of this property; current releases prefer fs.defaultFS, but both are accepted):
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://master:9000</value>
</property>
</configuration>
hdfs-site.xml
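The contents of hdfs-site.xml are not shown above. A minimal sketch, assuming a replication factor of 2 (matching the two data nodes) and storage directories under the Hadoop install tree; both values are assumptions you may want to adjust:

```xml
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///home/dc2-user/hadoop-2.7.7/hdfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///home/dc2-user/hadoop-2.7.7/hdfs/data</value>
</property>
</configuration>
```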
yarn-site.xml
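The contents of yarn-site.xml are also not shown. A minimal sketch, assuming the ResourceManager runs on master; yarn.nodemanager.aux-services must include mapreduce_shuffle for MapReduce jobs to shuffle their output:

```xml
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>master</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
```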
mapred-site.xml (if it does not exist, copy it from mapred-site.xml.template). In addition to the JobHistory addresses below, mapreduce.framework.name should be set to yarn so that MapReduce jobs are submitted to YARN:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>master:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>master:19888</value>
</property>
Edit the workers file and list the data nodes:
node1
node2
Copy the configured Hadoop directory to node1 and node2:
scp -r /home/dc2-user/hadoop-2.7.7 dc2-user@node1:/home/dc2-user/
scp -r /home/dc2-user/hadoop-2.7.7 dc2-user@node2:/home/dc2-user/
Configure the Hadoop environment variables (on all three nodes):
sudo vi /etc/profile.d/hadoop-2.7.7.sh
export HADOOP_HOME="/home/dc2-user/hadoop-2.7.7"
export PATH="$HADOOP_HOME/bin:$PATH"
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
Apply the environment variables:
source /etc/profile
Run hadoop version on all three nodes and check for output, to verify that the environment variables took effect:
hadoop version
Hadoop 2.7.7
Subversion Unknown -r c1aad84bd27cd79c3d1a7dd58202a8c3ee1ed3ac
Compiled by stevel on 2018-07-18T22:47Z
Compiled with protoc 2.5.0
From source with checksum 792e15d20b12c74bd6f19a1fb886490
This command was run using /home/dc2-user/hadoop-2.7.7/share/hadoop/common/hadoop-common-2.7.7.jar
Format HDFS (run on master only):
/home/dc2-user/hadoop-2.7.7/bin/hdfs namenode -format testCluster
Start the services:
/home/dc2-user/hadoop-2.7.7/sbin/start-dfs.sh
/home/dc2-user/hadoop-2.7.7/sbin/start-yarn.sh
Check that the daemons have started on all three nodes:
master
jps
1654 Jps
31882 NameNode
32410 ResourceManager
32127 SecondaryNameNode
node1
jps
19827 NodeManager
19717 DataNode
20888 Jps
node2
jps
30707 Jps
27675 NodeManager
27551 DataNode
If you see output like the above, the services have started normally and you can reach the ResourceManager web UI via the master's public IP. Note that port 8088 must be open in the security group; for DiDi Cloud security group usage, see: https://help.didiyun.com/hc/kb/article/1091031/
Note: exposing port 8088 to the public internet may be exploited by attackers to plant malware, so either restrict the allowed source IPs in the security group or leave the port closed.
Finally, verify MapReduce with the WordCount program that ships with Hadoop. The following steps are performed on the master node:
First create two files, test1 and test2, in the current directory with the following contents:
vi test1
hello world
bye world
vi test2
hello hadoop
bye hadoop
Next, create a directory in HDFS and upload the two files into it.
hadoop fs -mkdir /input
hadoop fs -put test* /input
When the cluster starts, the NameNode first enters safe mode (it normally leaves it automatically once enough blocks are reported); leave safe mode before writing.
hdfs dfsadmin -safemode leave
Run the WordCount program to count how many times each word appears in the two files.
yarn jar /home/dc2-user/hadoop-2.7.7/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.7.jar wordcount /input /output
WARNING: YARN_CONF_DIR has been replaced by HADOOP_CONF_DIR. Using value of YARN_CONF_DIR.
2018-11-09 20:27:12,233 INFO client.RMProxy: Connecting to ResourceManager at master/10.254.149.24:8032
2018-11-09 20:27:12,953 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1541766351311_0001
2018-11-09 20:27:14,483 INFO input.FileInputFormat: Total input files to process : 2
2018-11-09 20:27:16,967 INFO mapreduce.JobSubmitter: number of splits:2
2018-11-09 20:27:17,014 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enab
2018-11-09 20:27:17,465 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1541766351311_0001
2018-11-09 20:27:17,466 INFO mapreduce.JobSubmitter: Executing with tokens: []
2018-11-09 20:27:17,702 INFO conf.Configuration: resource-types.xml not found
2018-11-09 20:27:17,703 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2018-11-09 20:27:18,256 INFO impl.YarnClientImpl: Submitted application application_1541766351311_0001
2018-11-09 20:27:18,296 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1541766351311_0001/
2018-11-09 20:27:18,297 INFO mapreduce.Job: Running job: job_1541766351311_0001
2018-11-09 20:28:24,929 INFO mapreduce.Job: Job job_1541766351311_0001 running in uber mode : false
2018-11-09 20:28:24,931 INFO mapreduce.Job: map 0% reduce 0%
2018-11-09 20:28:58,590 INFO mapreduce.Job: map 50% reduce 0%
2018-11-09 20:29:19,437 INFO mapreduce.Job: map 100% reduce 0%
2018-11-09 20:29:33,038 INFO mapreduce.Job: map 100% reduce 100%
2018-11-09 20:29:36,315 INFO mapreduce.Job: Job job_1541766351311_0001 completed successfully
2018-11-09 20:29:36,619 INFO mapreduce.Job: Counters: 54
File System Counters
FILE: Number of bytes read=75
FILE: Number of bytes written=644561
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=237
HDFS: Number of bytes written=31
HDFS: Number of read operations=11
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Killed map tasks=1
Launched map tasks=3
Launched reduce tasks=1
Data-local map tasks=3
Total time spent by all maps in occupied slots (ms)=164368
Total time spent by all reduces in occupied slots (ms)=95475
Total time spent by all map tasks (ms)=82184
Total time spent by all reduce tasks (ms)=31825
Total vcore-milliseconds taken by all map tasks=82184
Total vcore-milliseconds taken by all reduce tasks=31825
Total megabyte-milliseconds taken by all map tasks=168312832
Total megabyte-milliseconds taken by all reduce tasks=97766400
Map-Reduce Framework
Map input records=5
Map output records=8
Map output bytes=78
Map output materialized bytes=81
Input split bytes=190
Combine input records=8
Combine output records=6
Reduce input groups=4
Reduce shuffle bytes=81
Reduce input records=6
Reduce output records=4
Spilled Records=12
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=2230
CPU time spent (ms)=2280
Physical memory (bytes) snapshot=756064256
Virtual memory (bytes) snapshot=10772656128
Total committed heap usage (bytes)=541589504
Peak Map Physical memory (bytes)=281268224
Peak Map Virtual memory (bytes)=3033423872
Peak Reduce Physical memory (bytes)=199213056
Peak Reduce Virtual memory (bytes)=4708827136
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=47
File Output Format Counters
Bytes Written=31
Output like the above means the computation finished; the result is saved in the /output directory in HDFS.
hadoop fs -ls /output
Found 2 items
-rw-r--r-- 1 root supergroup 0 2018-11-09 20:29 /output/_SUCCESS
-rw-r--r-- 1 root supergroup 31 2018-11-09 20:29 /output/part-r-00000
Open part-r-00000 to view the result:
hadoop fs -cat /output/part-r-00000
bye 2
hadoop 2
hello 2
world 2
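As a sanity check on the counts above, the map-combine-reduce flow of WordCount can be sketched in plain Python (a local simulation for illustration, not Hadoop code; the file contents are taken from test1 and test2 above):

```python
from collections import Counter

def wordcount(files):
    """Simulate WordCount: the map phase emits (word, 1) per token,
    and the combine/reduce phases sum the counts per word."""
    total = Counter()
    for content in files:
        total.update(content.split())  # map + combine for one input split
    # the reducer writes its output sorted by key, like part-r-00000
    return dict(sorted(total.items()))

test1 = "hello world\nbye world"
test2 = "hello hadoop\nbye hadoop"
print(wordcount([test1, test2]))
# {'bye': 2, 'hadoop': 2, 'hello': 2, 'world': 2}
```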
Hive is a data warehouse built on top of Hadoop. It maps structured data files onto tables and provides SQL-like querying; under the hood, Hive translates the SQL statements into MapReduce jobs.
Download Hive 2.3.4 into /home/dc2-user on master and unpack it.
wget http://mirror.bit.edu.cn/apache/hive/hive-2.3.4/apache-hive-2.3.4-bin.tar.gz
tar zxvf apache-hive-2.3.4-bin.tar.gz
Set the Hive environment variables.
Edit /etc/profile and add the following.
sudo vi /etc/profile
export HIVE_HOME=/home/dc2-user/apache-hive-2.3.4-bin
export PATH=$PATH:$HIVE_HOME/bin
Apply the environment variables:
source /etc/profile
Create the following configuration files by copying their templates:
cd apache-hive-2.3.4-bin/conf/
cp hive-env.sh.template hive-env.sh
cp hive-default.xml.template hive-site.xml
cp hive-log4j2.properties.template hive-log4j2.properties
cp hive-exec-log4j2.properties.template hive-exec-log4j2.properties
Edit hive-env.sh:
export JAVA_HOME=/home/dc2-user/java/jdk1.8.0_191 ## Java path
export HADOOP_HOME=/home/dc2-user/hadoop-2.7.7 ## Hadoop install path
export HIVE_HOME=/home/dc2-user/apache-hive-2.3.4-bin ## Hive install path
export HIVE_CONF_DIR=$HIVE_HOME/conf ## Hive config path
Edit hive-site.xml and change the value of each of the following properties:
vi hive-site.xml
<property>
<name>hive.exec.scratchdir</name>
<value>/tmp/hive-${user.name}</value>
<description>HDFS root scratch dir for Hive jobs which gets
created with write all (733) permission. For each connecting user,
an HDFS scratch dir: ${hive.exec.scratchdir}/<username> is created,
with ${hive.scratch.dir.permission}.
</description>
</property>
<property>
<name>hive.exec.local.scratchdir</name>
<value>/tmp/${user.name}</value>
<description>Local scratch space for Hive jobs</description>
</property>
<property>
<name>hive.downloaded.resources.dir</name>
<value>/tmp/hive/resources</value>
<description>Temporary local directory for added resources in the remote
file system.</description>
</property>
<property>
<name>hive.querylog.location</name>
<value>/tmp/${user.name}</value>
<description>Location of Hive run time structured log file</description>
</property>
<property>
<name>hive.server2.logging.operation.log.location</name>
<value>/tmp/${user.name}/operation_logs</value>
<description>Top level directory where operation logs are stored if logging functionality is enabled</description>
</property>
Configure the Hive Metastore
The Hive Metastore holds the metadata for Hive tables and partitions; this example uses MariaDB to store it.
Put mysql-connector-java-5.1.40-bin.jar into $HIVE_HOME/lib and configure the MySQL connection in hive-site.xml.
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true&amp;characterEncoding=UTF-8&amp;useSSL=false</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hive</value>
</property>
Create HDFS directories for Hive:
start-dfs.sh # can be skipped if HDFS was already started during the Hadoop setup
hdfs dfs -mkdir /tmp
hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -chmod g+w /tmp
hdfs dfs -chmod g+w /user/hive/warehouse
Install MySQL; this example uses MariaDB.
sudo yum install -y mariadb-server
sudo systemctl start mariadb
Log in to MySQL (no password initially), create the hive user, and set its password.
mysql -uroot
MariaDB [(none)]> create user 'hive'@'localhost' identified by 'hive';
Query OK, 0 rows affected (0.00 sec)
MariaDB [(none)]> grant all privileges on *.* to hive@localhost identified by 'hive';
Query OK, 0 rows affected (0.00 sec)
HDFS must be running before Hive is started; start it with start-dfs.sh if it was not already started during the Hadoop installation.
Since Hive 2.1, the schematool command must be run to initialize the metastore schema before starting Hive:
schematool -dbType mysql -initSchema
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/dc2-user/apache-hive-2.3.4-bin/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/dc2-user/hadoop-2.7.7/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Metastore connection URL: jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true
Metastore Connection Driver : com.mysql.jdbc.Driver
Metastore connection User: hive
Starting metastore schema initialization to 2.3.0
Initialization script hive-schema-2.3.0.mysql.sql
Initialization script completed
schemaTool completed
Start Hive by typing the command hive:
hive
which: no hbase in (/home/dc2-user/java/jdk1.8.0_191/bin:/home/dc2-user/hadoop-2.7.7/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/usr/local/bin:/home/dc2-user/apache-hive-2.3.4-bin/bin:/home/dc2-user/.local/bin:/home/dc2-user/bin)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/dc2-user/apache-hive-2.3.4-bin/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/dc2-user/hadoop-2.7.7/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Logging initialized using configuration in file:/home/dc2-user/apache-hive-2.3.4-bin/conf/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive>
Create a table in Hive:
hive> create table test_hive(id int, name string)
> row format delimited fields terminated by '\t' -- fields are separated by a tab character
> stored as textfile; -- storage format for the loaded data; TEXTFILE is the default, and plain-text files can simply be copied into HDFS and read by Hive directly
OK
Time taken: 10.857 seconds
hive> show tables;
OK
test_hive
Time taken: 0.396 seconds, Fetched: 1 row(s)
The table has been created. Type quit; to exit Hive, then create the data as a text file:
vi test_tb.txt
101 aa
102 bb
103 cc
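Note that the whitespace between the two columns must be a literal tab, since the table was declared with fields terminated by '\t'. The way Hive's delimited row format splits each line can be sketched in Python (an illustration only, not Hive's actual implementation):

```python
def parse_delimited(text, sep="\t"):
    """Split each line on the field terminator, as Hive's
    'row format delimited' does, yielding (id, name) rows."""
    rows = []
    for line in text.strip().split("\n"):
        id_str, name = line.split(sep)
        rows.append((int(id_str), name))
    return rows

data = "101\taa\n102\tbb\n103\tcc"  # contents of test_tb.txt
print(parse_delimited(data))
# [(101, 'aa'), (102, 'bb'), (103, 'cc')]
```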
Re-enter Hive and load the data:
hive> load data local inpath '/home/dc2-user/test_tb.txt' into table test_hive;
Loading data to table default.test_hive
OK
Time taken: 6.679 seconds
hive> select * from test_hive;
101 aa
102 bb
103 cc
Time taken: 2.814 seconds, Fetched: 3 row(s)
The data was loaded successfully and can be queried normally.