Hadoop Introduction

Hadoop is the Adam and Eve of the open-source big data world. At its core are HDFS, the data storage system, and MapReduce, the distributed computing framework.
HDFS works by chopping large data into blocks, replicating each block three times, and spreading the replicas across three cheap machines, so that three live copies always back each other up. At read time, the block is simply served from one of the replicas.
The machines that store the data are called datanodes (the cubicles); the machine that manages the datanodes is called the namenode (the umbrella holder).
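Once the single-node setup described below is running, you can watch this in action with two stock HDFS commands (both exist in Hadoop 2.x; the path /user/root/input is just an example taken from the exercise later in this article):

bin/hdfs dfsadmin -report                                  # datanodes as the namenode sees them
bin/hdfs fsck /user/root/input -files -blocks -locations   # blocks of a path and where their replicas live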
MapReduce works by first processing a big task in pieces (Map), then aggregating the partial results (Reduce). The splitting and the aggregating both run in parallel across many servers; that is what unlocks the power of the cluster. The hard part is decomposing a task into a split-and-aggregate shape that fits the MapReduce model, and working out what the intermediate (k, v) inputs and outputs of each stage should be.
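As a rough single-machine analogy, a word count can be written as map, shuffle, and reduce stages with plain shell tools (words.txt is a hypothetical input file; this pipeline only mimics the model, it is not Hadoop):

tr -s ' ' '\n' < words.txt |   # map: emit one word (key) per line
sort |                         # shuffle: bring identical keys together
uniq -c                        # reduce: collapse each group into (count, word)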
Single-node Hadoop

For anyone learning Hadoop internals or doing Hadoop development, setting up a Hadoop system is a must. But configuring one is a serious headache; plenty of people give up halfway through the configuration, and you may not have spare servers anyway. Here is a configuration-free, single-node way to install and use Hadoop, good for quickly running the bundled examples to support learning, development, and testing. The only requirement is a Linux virtual machine on your laptop with docker installed.
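Before pulling anything, a ten-second sanity check that docker itself is ready (standard docker CLI commands):

docker --version   # is the docker client installed?
docker info        # is the daemon running and reachable?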
Installation and usage

Use docker to pull the sequenceiq/hadoop-docker:2.7.0 image and run it.
[root@bogon ~]# docker pull sequenceiq/hadoop-docker:2.7.0
2.7.0: Pulling from sequenceiq/hadoop-docker
860d0823bcab: Pulling fs layer
e592c61b2522: Pulling fs layer

When the download succeeds, it prints:
Digest: sha256:a40761746eca036fee6aafdf9fdbd6878ac3dd9a7cd83c0f3f5d8a0e6350c76a
Status: Downloaded newer image for sequenceiq/hadoop-docker:2.7.0

Start it:

[root@bogon ~]# docker run -it sequenceiq/hadoop-docker:2.7.0 /etc/bootstrap.sh -bash --privileged=true
Starting sshd: [ OK ]
Starting namenodes on [b7a42f79339c]
b7a42f79339c: starting namenode, logging to /usr/local/hadoop/logs/hadoop-root-namenode-b7a42f79339c.out
localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-root-datanode-b7a42f79339c.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-root-secondarynamenode-b7a42f79339c.out
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn--resourcemanager-b7a42f79339c.out
localhost: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-root-nodemanager-b7a42f79339c.out

After a successful start the shell drops straight into the Hadoop container environment; there is no need to run docker exec. Inside the container, go to /usr/local/hadoop/sbin and run ./start-all.sh and ./mr-jobhistory-daemon.sh start historyserver, as follows:
bash-4.1# cd /usr/local/hadoop/sbin
bash-4.1# ./start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
Starting namenodes on [b7a42f79339c]
b7a42f79339c: namenode running as process 128. Stop it first.
localhost: datanode running as process 219. Stop it first.
Starting secondary namenodes [0.0.0.0]
0.0.0.0: secondarynamenode running as process 402. Stop it first.
starting yarn daemons
resourcemanager running as process 547. Stop it first.
localhost: nodemanager running as process 641. Stop it first.
bash-4.1# ./mr-jobhistory-daemon.sh start historyserver
chown: missing operand after `/usr/local/hadoop/logs'
Try `chown --help' for more information.
starting historyserver, logging to /usr/local/hadoop/logs/mapred--historyserver-b7a42f79339c.out

Hadoop is now up. That is all it takes.
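To double-check that every daemon actually came up, jps (bundled with the JDK inside the image) should list them; the names below are what I would expect after the steps above, not guaranteed output:

bash-4.1# jps
# expect NameNode, DataNode, SecondaryNameNode,
# ResourceManager, NodeManager and JobHistoryServer (plus Jps itself)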
If you are wondering how much hassle a distributed deployment is, just count the configuration files! I once watched a Hadoop veteran burn an entire morning failing to bring an environment up, all because the hostname of a newly provisioned server contained a hyphen "-".
Running the bundled example

Go back to the Hadoop home directory and run the example program:
bash-4.1# cd /usr/local/hadoop
bash-4.1# bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar grep input output 'dfs[a-z.]+'
20/07/05 22:34:41 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
20/07/05 22:34:43 INFO input.FileInputFormat: Total input paths to process : 31
20/07/05 22:34:43 INFO mapreduce.JobSubmitter: number of splits:31
20/07/05 22:34:44 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1594002714328_0001
20/07/05 22:34:44 INFO impl.YarnClientImpl: Submitted application application_1594002714328_0001
20/07/05 22:34:45 INFO mapreduce.Job: The url to track the job: ...

When the computation finishes, it prints the following:
20/07/05 22:55:26 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=291
        FILE: Number of bytes written=230541
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=569
        HDFS: Number of bytes written=197
        HDFS: Number of read operations=7
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=5929
        Total time spent by all reduces in occupied slots (ms)=8545
        Total time spent by all map tasks (ms)=5929
        Total time spent by all reduce tasks (ms)=8545
        Total vcore-seconds taken by all map tasks=5929
        Total vcore-seconds taken by all reduce tasks=8545
        Total megabyte-seconds taken by all map tasks=6071296
        Total megabyte-seconds taken by all reduce tasks=8750080
    Map-Reduce Framework
        Map input records=11
        Map output records=11
        Map output bytes=263
        Map output materialized bytes=291
        Input split bytes=132
        Combine input records=0
        Combine output records=0
        Reduce input groups=5
        Reduce shuffle bytes=291
        Reduce input records=11
        Reduce output records=11
        Spilled Records=22
        Shuffled Maps=1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=159
        CPU time spent (ms)=1280
        Physical memory (bytes) snapshot=303452160
        Virtual memory (bytes) snapshot=1291390976
        Total committed heap usage (bytes)=136450048
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=437
    File Output Format Counters
        Bytes Written=197

View the result with the hdfs command:
bash-4.1# bin/hdfs dfs -cat output/*
6   dfs.audit.logger
4   dfs.class
3   dfs.server.namenode.
2   dfs.period
2   dfs.audit.log.maxfilesize
2   dfs.audit.log.maxbackupindex
1   dfsmetrics.log
1   dfsadmin
1   dfs.servers
1   dfs.replication
1   dfs.file

Explaining the example

grep is a MapReduce program that evaluates a regular expression over its input, picking out the matching strings together with their occurrence counts.
Unlike shell grep, which shows the entire matching line, this program shows only the matched substring itself.
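For intuition, ordinary shell grep behaves the same way once you add -o (the echoed sample text here is made up purely for illustration):

echo "dfs.class=x dfsadmin hdfs" | grep -oE 'dfs[a-z.]+'
# prints:
# dfs.class
# dfsadmin
# ("hdfs" does not match: the + needs at least one character after "dfs")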
grep input output 'dfs[a-z.]+'

The regular expression dfs[a-z.]+ means: the string starts with dfs, followed by one or more characters, each of which is a lowercase letter or a literal dot (inside a character class, "." loses its any-character meaning). The input is all the files in the input directory:
bash-4.1# ls -lrt
total 48
-rw-r--r--. 1 root root  690 May 16  2015 yarn-site.xml
-rw-r--r--. 1 root root 5511 May 16  2015 kms-site.xml
-rw-r--r--. 1 root root 3518 May 16  2015 kms-acls.xml
-rw-r--r--. 1 root root  620 May 16  2015 ...

The computation flow is as follows:
The one twist is that there are two reduce passes here: the second reduce simply sorts the results by occurrence count. Developers are free to combine map and reduce stages however they like, as long as each stage's output lines up with the next stage's input.
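As a rough sketch of those two passes with nothing but the shell (plain Unix tools standing in for Hadoop, assuming a local copy of the input files in ./input):

# pass 1 - map: emit every regex match; reduce: count occurrences per distinct match
grep -ohE 'dfs[a-z.]+' input/* | sort | uniq -c > pass1.txt
# pass 2 - re-key by count and sort descending, like the example's second job
sort -rn pass1.txt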
The management UI

Hadoop ships with a web-based management UI.

Port    Purpose
50070   Hadoop Namenode UI port
50075   Hadoop Datanode UI port
50090   Hadoop SecondaryNamenode port
50030   JobTracker monitoring port
50060   TaskTrackers port
8088    Yarn job monitoring port
60010   HBase HMaster monitoring UI port
60030   HBase HRegionServer port
8080    Spark monitoring UI port
4040    Spark job UI port
Adding command parameters

The docker run command needs extra port-mapping parameters before the UI management pages can be reached:

docker run -it -p 50070:50070 -p 8088:8088 -p 50075:50075 sequenceiq/hadoop-docker:2.7.0 /etc/bootstrap.sh -bash --privileged=true

After running this command you can view the system from a browser on the host machine; if your Linux VM has a browser, that works too. My Linux has no graphical desktop, so I view it from the host.
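A quick way to confirm the mapped ports respond, using plain curl from the host (the URLs are the stock landing pages of each UI):

curl -s http://localhost:50070/ | head -n 5         # Namenode UI
curl -s http://localhost:50075/ | head -n 5         # Datanode UI
curl -s http://localhost:8088/cluster | head -n 5   # YARN job monitor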
50070: Hadoop Namenode UI port
50075: Hadoop Datanode UI port
8088: Yarn job monitoring port
Completed and running MapReduce jobs can all be inspected on port 8088; the screenshot above shows two jobs, grep and wordcount.
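The same list is available from the command line inside the container (stock YARN CLI; the -appStates flag exists in Hadoop 2.7):

bash-4.1# bin/yarn application -list -appStates ALL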
A few gotchas

1. ./sbin/mr-jobhistory-daemon.sh start historyserver must be run; otherwise jobs fail mid-run with the following error:
20/06/29 21:18:49 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:10020. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
java.io.IOException: java.net.ConnectException: Call From 87a4217b9f8a/172.17.0.1 to 0.0.0.0:10020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: ...
2. The docker run command must include --privileged=true; otherwise jobs fail with java.io.IOException: Job status not available.
3. Note that Hadoop does not overwrite result files by default, so running the example above a second time will report an error. Delete ./output first, or point the job at a fresh directory such as output01, as sketched below.
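Either workaround, run from /usr/local/hadoop inside the container (output01 is just an example name):

bin/hdfs dfs -rm -r output   # clear the previous result directory, or...
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar grep input output01 'dfs[a-z.]+'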
Summary

The method in this article gets a Hadoop installation working at very low cost, which helps a lot with learning, understanding, development, and testing. If you develop your own Hadoop program, package it as a jar, upload it to the share/hadoop/mapreduce/ directory, and execute
bin/hadoop jar share/hadoop/mapreduce/yourtest.jar

to run the program and observe the result.
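One way to get your jar into the running container is docker cp from the host (the container id below is a placeholder; look yours up with docker ps):

docker ps   # note the container id, e.g. b7a42f79339c
docker cp yourtest.jar b7a42f79339c:/usr/local/hadoop/share/hadoop/mapreduce/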