Pseudo-distributed installation of Hadoop 3.0.3 and JDK 1.8 on CentOS 7
Add a regular user named hadoop
useradd hadoop
passwd hadoop
Give the hadoop user sudo privileges
chmod u+w /etc/sudoers
vi /etc/sudoers
Add the line:
hadoop ALL=(ALL) ALL
or
hadoop ALL=(root) NOPASSWD:ALL
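After saving the file, the write bit added above can be removed again, and the grant can be checked before switching users (a quick sanity check, not part of the original steps; run as root):
chmod u-w /etc/sudoers
sudo -l -U hadoop    # lists the sudo rights now granted to the hadoop user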
Switch to the hadoop user
su - hadoop
Install Hadoop into /home/hadoop/hadoop3.03 (unpack the tarball in /home/hadoop and rename it; there is no need to pre-create the target directory, otherwise mv would nest the extracted tree inside it)
tar -zxvf hadoop-3.0.3.tar.gz
mv hadoop-3.0.3 hadoop3.03
Install the JDK into /home/hadoop/java/jdk1.8 (unpack in /home/hadoop/java and rename; the tarball extracts to a directory named jdk1.8.0_172)
tar -zxvf jdk-8u172-linux-x64.gz
mv jdk1.8.0_172 jdk1.8
Configure the environment variables
vi /etc/profile
##java
export JAVA_HOME=/home/hadoop/java/jdk1.8
export PATH=$PATH:$JAVA_HOME/bin
##hadoop
export HADOOP_HOME=/home/hadoop/hadoop3.03
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
Reload the profile and verify:
source /etc/profile
echo $JAVA_HOME
echo $HADOOP_HOME
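As an additional check that both are on the PATH (optional):
java -version
hadoop version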
Set the JAVA_HOME parameter in hadoop-env.sh, mapred-env.sh and yarn-env.sh
export JAVA_HOME=/home/hadoop/java/jdk1.8
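Instead of editing each file by hand, the same line can be appended to all three in one go (a small convenience sketch; run it from the Hadoop home directory /home/hadoop/hadoop3.03):
for f in hadoop-env.sh mapred-env.sh yarn-env.sh; do
  echo 'export JAVA_HOME=/home/hadoop/java/jdk1.8' >> etc/hadoop/$f
done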
Configure core-site.xml
hadoop-localhost is the hostname; the /opt/data/tmp directory must be created beforehand.
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Licensed under the Apache License, Version 2.0; see the accompanying LICENSE file. -->
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop-localhost:8020</value>
    <description>HDFS URI: filesystem://namenode-host:port</description>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/data/tmp</value>
    <description>Local Hadoop temporary directory on the NameNode</description>
  </property>
</configuration>
hadoop.tmp.dir is Hadoop's temporary directory; for example, the NameNode's HDFS metadata is stored under it by default. If you look through the *-default.xml configuration files you will find many settings that depend on ${hadoop.tmp.dir}.
The default hadoop.tmp.dir is /tmp/hadoop-${user.name}, which means the NameNode stores HDFS metadata under /tmp. If the operating system reboots, /tmp is wiped and the NameNode metadata is lost, which is a serious problem, so we should change this path.
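Once the configuration above is in place, one way to confirm the effective value is getconf (optional):
hdfs getconf -confKey hadoop.tmp.dir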
sudo mkdir -p /opt/data/tmp
Change the owner of the temporary directory to hadoop
sudo chown -R hadoop:hadoop /opt/data/tmp
Configure hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Licensed under the Apache License, Version 2.0; see the accompanying LICENSE file. -->
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/opt/data/tmp/dfs/name</value>
    <description>Where the NameNode stores the HDFS namespace metadata</description>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/opt/data/tmp/dfs/data</value>
    <description>Physical location of data blocks on the DataNode</description>
  </property>
  <!-- Number of HDFS replicas -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
Format HDFS
sudo chown -R hadoop:hadoop /opt/data
hdfs namenode -format
Inspect the NameNode directory after formatting
$ ll /opt/data/tmp/dfs/name/current
Start the NameNode
sbin/hadoop-daemon.sh start namenode
Start the DataNode
sbin/hadoop-daemon.sh start datanode
Start the SecondaryNameNode
sbin/hadoop-daemon.sh start secondarynamenode
Use the jps command to check whether the daemons started; if they appear in the output, they are running.
$ jps
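Typical output looks roughly like this (the PIDs are placeholders and will differ):
12305 NameNode
12412 DataNode
12518 SecondaryNameNode
12601 Jps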
Test creating a directory and uploading/downloading files on HDFS
[hadoop@hadoop-localhost hadoop3.03]$
Create a directory
bin/hdfs dfs -mkdir /demo1
Upload a file
bin/hdfs dfs -put etc/hadoop/core-site.xml /demo1
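Listing the directory confirms the upload (optional):
bin/hdfs dfs -ls /demo1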
Read the file contents from HDFS
bin/hdfs dfs -cat /demo1/core-site.xml
Download the file from HDFS to the local filesystem
bin/hdfs dfs -get /demo1/core-site.xml
View the HDFS web UI
In HDFS 2.x the web UI port is 50070:
http://192.168.145.129:50070
In HDFS 3.x the web UI port is 9870:
http://192.168.145.129:9870/dfshealth.html#tab-overview
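If no browser is handy on the host, the UI can also be probed from the shell (same address as above); it should return an HTTP response header when the NameNode is up:
curl -I http://192.168.145.129:9870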
Configure and start YARN
Configure mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Licensed under the Apache License, Version 2.0; see the accompanying LICENSE file. -->
<!-- Put site-specific property overrides in this file. -->
<!-- Run MapReduce on YARN -->
<!-- HADOOP_MAPRED_HOME below is the full path of your Hadoop distribution directory -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=/home/hadoop/hadoop3.03</value>
  </property>
  <property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=/home/hadoop/hadoop3.03</value>
  </property>
  <property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=/home/hadoop/hadoop3.03</value>
  </property>
</configuration>
Configure yarn-site.xml
yarn.nodemanager.aux-services configures YARN's shuffle service; here it is set to MapReduce's default shuffle implementation.
yarn.resourcemanager.hostname specifies which node the ResourceManager runs on.
<?xml version="1.0"?>
<!-- Licensed under the Apache License, Version 2.0; see the accompanying LICENSE file. -->
<!-- Address of YARN's master, the ResourceManager -->
<configuration>
  <!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <!-- How reducers fetch data -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hadoop-localhost</value>
  </property>
</configuration>
Start the ResourceManager
sbin/yarn-daemon.sh start resourcemanager
Start the NodeManager
sbin/yarn-daemon.sh start nodemanager
Alternatively, the services can be started with the batch scripts
Start HDFS and YARN
sbin/start-dfs.sh
sbin/start-yarn.sh
sbin/start-all.sh
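After everything is up, jps should list all five daemons, roughly like this (the PIDs are placeholders):
$ jps
12305 NameNode
12412 DataNode
12518 SecondaryNameNode
13007 ResourceManager
13119 NodeManager
13350 Jps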
The YARN web UI
The YARN web UI listens on port 8088 and can be viewed at http://192.168.145.129:8088/.
Run a MapReduce job
Create the test input directory and file
bin/hdfs dfs -mkdir -p /wordcountdemo/input
The contents of the wc.input file are:
hadoop mapreduce hive hbase spark storm sqoop hadoop hive spark hadoop
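If the local file does not exist yet, it can be created in one step (the path matches the upload command below):
cat > /opt/data/wc.input <<'EOF'
hadoop mapreduce hive hbase spark storm sqoop hadoop hive spark hadoop
EOF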
Upload wc.input to the /wordcountdemo/input directory on HDFS:
bin/hdfs dfs -put /opt/data/wc.input /wordcountdemo/input
Run the WordCount MapReduce job
[hadoop@hadoop-localhost hadoop3.03]$ bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.3.jar wordcount /wordcountdemo/input /wordcountdemo/output
2018-07-03 19:38:23,956 INFO client.RMProxy: Connecting to ResourceManager at hadoop-localhost/192.168.145.129:8032
2018-07-03 19:38:24,565 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/hadoop/.staging/job_1530615244194_0002
2018-07-03 19:38:24,879 INFO input.FileInputFormat: Total input files to process : 1
2018-07-03 19:38:25,784 INFO mapreduce.JobSubmitter: number of splits:1
2018-07-03 19:38:25,841 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
2018-07-03 19:38:26,314 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1530615244194_0002
2018-07-03 19:38:26,315 INFO mapreduce.JobSubmitter: Executing with tokens: []
2018-07-03 19:38:26,466 INFO conf.Configuration: resource-types.xml not found
2018-07-03 19:38:26,466 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2018-07-03 19:38:26,547 INFO impl.YarnClientImpl: Submitted application application_1530615244194_0002
2018-07-03 19:38:26,590 INFO mapreduce.Job: The url to track the job: http://hadoop-localhost:8088/proxy/application_1530615244194_0002/
2018-07-03 19:38:26,590 INFO mapreduce.Job: Running job: job_1530615244194_0002
2018-07-03 19:38:35,985 INFO mapreduce.Job: Job job_1530615244194_0002 running in uber mode : false
2018-07-03 19:38:35,988 INFO mapreduce.Job:  map 0% reduce 0%
2018-07-03 19:38:42,310 INFO mapreduce.Job:  map 100% reduce 0%
2018-07-03 19:38:47,402 INFO mapreduce.Job:  map 100% reduce 100%
2018-07-03 19:38:49,469 INFO mapreduce.Job: Job job_1530615244194_0002 completed successfully
2018-07-03 19:38:49,579 INFO mapreduce.Job: Counters: 53
    File System Counters
        FILE: Number of bytes read=94
        FILE: Number of bytes written=403931
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=195
        HDFS: Number of bytes written=60
        HDFS: Number of read operations=8
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=4573
        Total time spent by all reduces in occupied slots (ms)=2981
        Total time spent by all map tasks (ms)=4573
        Total time spent by all reduce tasks (ms)=2981
        Total vcore-milliseconds taken by all map tasks=4573
        Total vcore-milliseconds taken by all reduce tasks=2981
        Total megabyte-milliseconds taken by all map tasks=4682752
        Total megabyte-milliseconds taken by all reduce tasks=3052544
    Map-Reduce Framework
        Map input records=4
        Map output records=11
        Map output bytes=115
        Map output materialized bytes=94
        Input split bytes=122
        Combine input records=11
        Combine output records=7
        Reduce input groups=7
        Reduce shuffle bytes=94
        Reduce input records=7
        Reduce output records=7
        Spilled Records=14
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=171
        CPU time spent (ms)=1630
        Physical memory (bytes) snapshot=332750848
        Virtual memory (bytes) snapshot=5473169408
        Total committed heap usage (bytes)=165810176
        Peak Map Physical memory (bytes)=214093824
        Peak Map Virtual memory (bytes)=2733207552
        Peak Reduce Physical memory (bytes)=118657024
        Peak Reduce Virtual memory (bytes)=2739961856
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=73
    File Output Format Counters
        Bytes Written=60
[hadoop@hadoop-localhost hadoop3.03]$
The word-count output is:
[hadoop@hadoop-localhost hadoop3.03]$ bin/hdfs dfs -cat /wordcountdemo/output/part-r-00000
hadoop    3
hbase     1
hive      2
mapreduce 1
spark     2
sqoop     1
storm     1
[hadoop@hadoop-localhost hadoop3.03]$
The results are sorted by key.
Stop Hadoop
sbin/hadoop-daemon.sh stop namenode
sbin/hadoop-daemon.sh stop datanode
sbin/yarn-daemon.sh stop resourcemanager
sbin/yarn-daemon.sh stop nodemanager
Batch scripts to stop everything
sbin/stop-yarn.sh
sbin/stop-dfs.sh
sbin/stop-all.sh
A brief look at the HDFS module
HDFS is responsible for storing big data. By splitting large files into blocks and storing them in a distributed way, it breaks through the disk-size limit of a single server and solves the problem of a single machine being unable to hold a large file. HDFS is a relatively independent module: it can serve YARN, and it can also serve other components such as HBase.
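To see how an individual file is split into blocks and where the replicas live, fsck can report it (using the file uploaded earlier in this guide):
bin/hdfs fsck /demo1/core-site.xml -files -blocks -locations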
A brief look at the YARN module
YARN is a general-purpose resource-management and job-scheduling framework, created to relieve the overloaded JobTracker in Hadoop 1.x MapReduce and to address its other shortcomings.
Being a general framework, YARN can run not only MapReduce but also other computing engines such as Spark and Storm.
A brief look at the MapReduce module
MapReduce is a computing framework. It offers one way of processing data: a distributed pipeline consisting of a Map phase and a Reduce phase. It is suited only to offline processing of big data and is not appropriate for applications with strict real-time requirements.