Big Data Cluster Setup


#1

Virtual machine environment configuration

##Cluster planning

Metadata service storage

##Network interface configuration

vim /etc/sysconfig/network-scripts/ifcfg-eth0

Contents:

DEVICE=eth0
TYPE=Ethernet
#UUID=4e24d937-945e-4253-9334-7a6335a4cada
ONBOOT=yes
NM_CONTROLLED=yes
IPV6INIT=no
BOOTPROTO=static
IPADDR=10.20.8.164
GATEWAY=10.20.8.254
NETMASK=255.255.255.0
HWADDR=00:0C:29:2E:6B:C6
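
With a static address it is easy to mistype the gateway or netmask. The following is a minimal sketch, not part of the original setup, that checks whether IPADDR and GATEWAY fall in the same network under NETMASK. The helper names `ip_to_int` and `same_subnet` are made up for this illustration.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    # Split the address on dots into the positional parameters $1..$4.
    old_ifs=$IFS; IFS=.
    set -- $1
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Return 0 if both addresses are in the same network under the given netmask.
same_subnet() {
    a=$(ip_to_int "$1"); b=$(ip_to_int "$2"); m=$(ip_to_int "$3")
    [ $(( a & m )) -eq $(( b & m )) ]
}

# Values taken from the ifcfg-eth0 listing above.
if same_subnet 10.20.8.164 10.20.8.254 255.255.255.0; then
    echo "gateway is on the local subnet"
else
    echo "gateway is outside the local subnet" >&2
fi
```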

Change the hostname

vim /etc/sysconfig/network

Contents:

NETWORKING=yes
HOSTNAME=masterZH
NETWORKING_IPV6=no

Edit hosts

vim /etc/hosts

Contents:

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
10.20.8.164 masterZH
10.20.8.165 master2ZH
10.20.8.166 slave1ZH
10.20.8.167 slave2ZH
10.20.8.168 slave3ZH

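##Firewall and SELinux

Stop the firewall:

service iptables stop

Disable the firewall permanently:

chkconfig iptables off

Switch SELinux to permissive mode for the current boot:

setenforce 0

Disable SELinux permanently:

vim /etc/selinux/config

Contents: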

SELINUX=disabled

NTP service

NTP server configuration:

vi /etc/ntp.conf

Contents:

driftfile /var/lib/ntp/drift
restrict 127.0.0.1
restrict -6 ::1
restrict default nomodify notrap
server 127.127.1.0  #local clock
fudge 127.127.1.0 stratum 10
includefile /etc/ntp/crypto/pw
keys /etc/ntp/keys

Start the service:

service ntpd restart

NTP configuration on the remaining nodes:

vim /etc/ntp.conf

Contents:

driftfile /var/lib/ntp/drift
restrict default kod nomodify notrap nopeer noquery
restrict -6 default kod nomodify notrap nopeer noquery
restrict 127.0.0.1
restrict -6 ::1
server 10.20.8.164
includefile /etc/ntp/crypto/pw
keys /etc/ntp/keys

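Restart the service and sync the time:

service ntpd restart
ntpdate -u masterZH

Enable at boot:

chkconfig ntpd on

Check that the setting took effect:

chkconfig --list ntpd

Sync the time automatically (once an hour):

crontab -e

Add the following line:

0 * * * * /usr/sbin/ntpdate -u masterZH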

Passwordless SSH login

On every host in the cluster, open the configuration:

sudo vim /etc/ssh/sshd_config

Enable the following options:

RSAAuthentication yes
PubkeyAuthentication yes
AuthorizedKeysFile .ssh/authorized_keys

Generate an SSH key pair (press Enter at every prompt):

ssh-keygen -t rsa

This creates the public key file id_rsa.pub under ~/.ssh/.

Copy the root user's public key from every machine to the master:

ssh-copy-id root@masterZH

masterZH then ends up with a ~/.ssh/authorized_keys file like this:

ssh-rsa ……
ssh-rsa ……
ssh-rsa ……

Append masterZH's own public key to authorized_keys:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Then copy the master's authorized_keys to master2ZH, slave1ZH, slave2ZH and slave3ZH:

scp ~/.ssh/authorized_keys root@master2ZH:~/.ssh/
scp ~/.ssh/authorized_keys root@slave1ZH:~/.ssh/
...

Restart the SSH service:

sudo service sshd restart

Test the connection from masterZH:

ssh slave1ZH
ssh master2ZH

Test the connection from slave2ZH:

ssh masterZH
ssh slave3ZH
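
Testing a couple of pairs by hand can miss a host. A minimal sketch of a loop that tries a non-interactive login to every node is shown here; the function name `check_ssh` and the `SSH_BIN` override are assumptions made for this example, while `BatchMode` and `ConnectTimeout` are standard OpenSSH client options.

```shell
# SSH_BIN can be overridden so the loop can be dry-run without a real cluster.
SSH_BIN=${SSH_BIN:-ssh}

# Try a non-interactive login to every host passed as an argument.
# BatchMode=yes makes ssh fail instead of prompting for a password.
check_ssh() {
    failed=0
    for host in "$@"; do
        if "$SSH_BIN" -o BatchMode=yes -o ConnectTimeout=5 "$host" true; then
            echo "OK   $host"
        else
            echo "FAIL $host"
            failed=1
        fi
    done
    return "$failed"
}

# Usage on any node:
#   check_ssh masterZH master2ZH slave1ZH slave2ZH slave3ZH
```

Running it once from each node covers every pair; a non-zero exit status means at least one login still asks for a password or cannot connect.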

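#zookeeper-3.5.1 installation

Download zookeeper-3.5.1

Download location:

https://mirrors.tuna.tsinghua.edu.cn/apache/zookeeper/ (pick the version you need)

zookeeper configuration

Create the zookeeper installation directory:

mkdir /opt/zookeeper/

Extract into that directory: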

tar -xzvf ./zookeeper-3.5.1-alpha.tar.gz -C /opt/zookeeper/
mv /opt/zookeeper/zookeeper-3.5.1-alpha /opt/zookeeper/3.5.1

Edit the configuration file:

cd /opt/zookeeper/3.5.1/conf
cp zoo_sample.cfg zoo.cfg
vi zoo.cfg

Configuration file contents:

tickTime=2000
initLimit=5
syncLimit=2
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=slave1ZH:2888:3888
server.2=slave2ZH:2888:3888
server.3=slave3ZH:2888:3888

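Create the zookeeper data directory:

mkdir /var/lib/zookeeper

Create the id on each zk machine (run only the line that matches the host):

echo "1" > /var/lib/zookeeper/myid   # on slave1ZH
echo "2" > /var/lib/zookeeper/myid   # on slave2ZH
echo "3" > /var/lib/zookeeper/myid   # on slave3ZH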
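
The id in myid has to match the N of the `server.N=` line for that host in zoo.cfg. A small sketch that derives the id from zoo.cfg instead of typing it per machine follows; `myid_for_host` is a hypothetical helper, and it assumes short hostnames without dots, as used in this cluster.

```shell
# Print the id N from the "server.N=host:port:port" line that names the given host.
# Assumes hostnames contain no dots, because "." is used as a field separator.
myid_for_host() {
    cfg=$1; host=$2
    awk -F'[.=:]' -v h="$host" '$1 == "server" && $3 == h { print $2 }' "$cfg"
}

# Usage on each zk machine (paths as configured above):
#   myid_for_host /opt/zookeeper/3.5.1/conf/zoo.cfg "$(hostname)" > /var/lib/zookeeper/myid
```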

Set zk up as a service:

cd /etc/rc.d/init.d/
pwd
touch zookeeper
chmod +x zookeeper
vi zookeeper

Contents:

#!/bin/bash
# chkconfig:2345 20 90
# description:zookeeper
# processname:zookeeper
export JAVA_HOME=/usr/local/java/jdk1.8.0_151
export PATH=$JAVA_HOME/bin:$PATH
case $1 in
     start) su root /opt/zookeeper/3.5.1/bin/zkServer.sh start;;
     stop) su root /opt/zookeeper/3.5.1/bin/zkServer.sh stop;;
     status) su root /opt/zookeeper/3.5.1/bin/zkServer.sh status;;
     restart) su root /opt/zookeeper/3.5.1/bin/zkServer.sh restart;;
     *) echo "require start|stop|status|restart" ;;
esac

Check that the zookeeper service commands work:

service zookeeper start
service zookeeper status
service zookeeper stop
service zookeeper status

Enable at boot:

chkconfig --add zookeeper
chkconfig zookeeper on

The zookeeper configuration is now complete!

Checking zookeeper processes and status

Check the processes:

Check the zookeeper node status:

References

  1. Installing ZooKeeper: http://docs.electric-cloud.com/commander_doc/5_0_2/HTML5/Install/Content/Install%20Guide/horizontal_scalability/9InstallZookeeper.htm
  2. Setting zookeeper to start at boot: http://blog.csdn.net/u012453843/article/details/70162796

hadoop-2.7.2 HA cluster installation

Download hadoop-2.7.2

Download location:

http://archive.apache.org/dist/hadoop/common/hadoop-2.7.2/

Download the following three files:

hadoop-2.7.2.tar.gz
hadoop-2.7.2.tar.gz.asc
hadoop-2.7.2.tar.gz.mds
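
The .asc and .mds files exist so the tarball can be verified before it is unpacked. A minimal sketch of a SHA-256 comparison is shown here; `verify_sha256` is a made-up helper name, and it assumes you copy the expected SHA-256 value out of the .mds file by hand (the digests there are typically printed in upper case with spaces, which the function strips).

```shell
# Return 0 when the SHA-256 digest of FILE equals EXPECTED.
# EXPECTED may be upper case and contain spaces; both are normalized away.
verify_sha256() {
    file=$1
    expected=$(printf '%s' "$2" | tr 'A-F' 'a-f' | tr -d ' \n')
    actual=$(sha256sum "$file" | awk '{ print $1 }')
    [ "$actual" = "$expected" ]
}

# Usage (the digest string is a placeholder to be pasted from the .mds file):
#   verify_sha256 hadoop-2.7.2.tar.gz "<SHA256 from hadoop-2.7.2.tar.gz.mds>" && echo "checksum OK"
```

The signature can be checked separately with `gpg --verify hadoop-2.7.2.tar.gz.asc hadoop-2.7.2.tar.gz` once the Hadoop release KEYS file has been imported.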

hadoop configuration

Create the hadoop installation directory:

mkdir /opt/hadoop/

Extract into that directory:

tar -xzvf ./hadoop-2.7.2.tar.gz -C /opt/hadoop/
mv /opt/hadoop/hadoop-2.7.2 /opt/hadoop/2.7.2

Configure core-site.xml:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://ns1/</value>
  </property>
  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>file:/opt/hadoop/2.7.2/data/tmp</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hduser.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>ha.zookeeper.quorum</name>
    <value>slave1ZH:2181,slave2ZH:2181,slave3ZH:2181</value>
  </property>
</configuration>

Configure hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.nameservices</name>
	<value>ns1</value>
  </property>
  <property>
    <name>dfs.ha.namenodes.ns1</name>
	<value>nn1,nn2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.ns1.nn1</name>
	<value>masterZH:9000</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.ns1.nn2</name>
	<value>master2ZH:9000</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.ns1.nn1</name>
	<value>masterZH:50070</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.ns1.nn2</name>
	<value>master2ZH:50070</value>
  </property>
  <property>
    <name>dfs.namenode.shared.edits.dir</name>
	<value>qjournal://slave1ZH:8485;slave2ZH:8485;slave3ZH:8485/ns1</value>
  </property>
  <property>
    <name>dfs.client.failover.proxy.provider.ns1</name>
	<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>
  <property>
    <name>dfs.ha.fencing.methods</name>
	<value>sshfence</value>
  </property>
  <property>
    <name>dfs.ha.fencing.ssh.private-key-files</name>
	<value>/root/.ssh/id_rsa</value>
  </property>
  <property>
    <name>dfs.journalnode.edits.dir</name>
	<value>/opt/hadoop/2.7.2/data/tmp/journal</value>
  </property>
  <property>
    <name>dfs.ha.automatic-failover.enabled.ns1</name>
	<value>true</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
	<value>file:/opt/hadoop/2.7.2/dfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
	<value>file:/opt/hadoop/2.7.2/dfs/data</value>
  </property>
  <property>
    <name>dfs.replication</name>
	<value>3</value>
  </property>
  <property>
    <name>dfs.webhdfs.enabled</name>
	<value>true</value>
  </property>
  <property>
    <name>dfs.journalnode.http-address</name>
	<value>0.0.0.0:8480</value>
  </property>
  <property>
    <name>dfs.journalnode.rpc-address</name>
	<value>0.0.0.0:8485</value>
  </property>
  <property>
    <name>ha.zookeeper.quorum</name>
	<value>slave1ZH:2181,slave2ZH:2181,slave3ZH:2181</value>
  </property>
</configuration>

Configure yarn-site.xml:

<configuration>
  <property>
    <name>yarn.resourcemanager.connect.retry-interval.ms</name>
	<value>2000</value>
  </property>
   <property>
    <name>yarn.resourcemanager.ha.enabled</name>
	<value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.rm-ids</name>
	<value>rm1,rm2</value>
  </property>
  <property>
    <name>ha.zookeeper.quorum</name>
	<value>slave1ZH:2181,slave2ZH:2181,slave3ZH:2181</value>
  </property>
   <property>
    <name>yarn.resourcemanager.ha.automatic-failover.enabled</name>
	<value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname.rm1</name>
	<value>masterZH</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname.rm2</name>
	<value>master2ZH</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.id</name>
	<value>rm1</value>
  </property>
  <property>
    <name>yarn.resourcemanager.recovery.enabled</name>
	<value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.zk-state-store.address</name>
	<value>slave1ZH:2181,slave2ZH:2181,slave3ZH:2181</value>
  </property>
  <property>
    <name>yarn.resourcemanager.store.class</name>
	<value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
  </property>
  <property>
    <name>yarn.resourcemanager.zk-address</name>
	<value>slave1ZH:2181,slave2ZH:2181,slave3ZH:2181</value>
  </property>
  <property>
    <name>yarn.resourcemanager.cluster-id</name>
	<value>yrc</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.scheduler.connection.wait.interval-ms</name>
	<value>5000</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>masterZH:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>masterZH:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>masterZH:8031</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address.rm1</name>
	<value>masterZH:8132</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address.rm1</name>
	<value>masterZH:8130</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address.rm1</name>
	<value>masterZH:8188</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address.rm1</name>
	<value>masterZH:8131</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address.rm1</name>
	<value>masterZH:8033</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.admin.address.rm1</name>
	<value>masterZH:23142</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address.rm2</name>
	<value>master2ZH:8132</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address.rm2</name>
	<value>master2ZH:8130</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address.rm2</name>
	<value>master2ZH:8188</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address.rm2</name>
	<value>master2ZH:8131</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address.rm2</name>
	<value>master2ZH:8033</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.admin.address.rm2</name>
	<value>master2ZH:23142</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
	<value>mapreduce_shuffle</value>
  </property>
   <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
	<value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.nodemanager.local-dirs</name>
	<value>/opt/hadoop/2.7.2/yarn/local</value>
  </property>
  <property>
    <name>yarn.nodemanager.log-dirs</name>
	<value>/opt/hadoop/2.7.2/logs</value>
  </property>
  <property>
    <name>mapreduce.shuffle.port</name>
	<value>23080</value>
  </property>
  <property>
    <name>yarn.client.failover-proxy-provider</name>
	<value>org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.automatic-failover.zk-base-path</name>
	<value>/yarn-leader-election</value>
  </property>
  <property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
	<value>false</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
	<value>false</value>
  </property>
</configuration>

Configure mapred-site.xml:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
	<value>yarn</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
	<value>master2ZH:10020</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
	<value>master2ZH:19888</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.staging-dir</name>
	<value>/user</value>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
	<value>-Xmx2g</value>
  </property>
  <property>
    <name>io.sort.mb</name>
	<value>512</value>
  </property>
  <property>
    <name>io.sort.factor</name>
	<value>20</value>
  </property>
  <property>
    <name>mapred.job.reuse.jvm.num.tasks</name>
	<value>-1</value>
  </property>
  <property>
    <name>mapreduce.reduce.shuffle.parallelcopies</name>
	<value>20</value>
  </property>
</configuration>

Configure slaves

vi /opt/hadoop/2.7.2/etc/hadoop/slaves

Contents:

slave1ZH
slave2ZH
slave3ZH

Configure hadoop-env.sh

vi ./hadoop-env.sh

Contents:

export JAVA_HOME=/usr/local/java/jdk1.8.0_151
export HADOOP_SSH_OPTS="-p 22"
export HADOOP_LOG_DIR=/opt/hadoop/2.7.2/logs

Configure yarn-env.sh

vi ./yarn-env.sh

Contents:

export JAVA_HOME=/usr/local/java/jdk1.8.0_151
export YARN_LOG_DIR=/opt/hadoop/2.7.2/logs

Set environment variables

vi /etc/profile

Contents:

export HADOOP_HOME=/opt/hadoop/2.7.2
export PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

Apply the environment variables immediately:

source /etc/profile

Distribute to every node:

On master2ZH, change ha.id to rm2:

<property>
  <name>yarn.resourcemanager.ha.id</name>
  <value>rm2</value>
</property>
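
Since yarn-site.xml is otherwise identical on both masters, the edit on master2ZH can be scripted after the files are copied. This is a sketch only: `set_rm_id` is a hypothetical helper, and it assumes GNU sed and that the `<name>` and `<value>` elements sit on consecutive lines, as they do in the listing above.

```shell
# Rewrite the <value> on the line that follows the yarn.resourcemanager.ha.id <name> line.
# Assumes GNU sed (-i) and that <name> and <value> are on consecutive lines.
set_rm_id() {
    file=$1; id=$2
    sed -i "/<name>yarn\.resourcemanager\.ha\.id<\/name>/{n;s#<value>[^<]*</value>#<value>${id}</value>#;}" "$file"
}

# Usage on master2ZH:
#   set_rm_id /opt/hadoop/2.7.2/etc/hadoop/yarn-site.xml rm2
```

Other properties, including yarn.resourcemanager.ha.rm-ids, are left alone because the pattern matches the full `<name>` element.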

Start hadoop

With zookeeper running, do the following:

Start the journalnode on slave1ZH, slave2ZH and slave3ZH:

sbin/hadoop-daemon.sh start journalnode

Format zk (only on first install or reinstall):

Run on masterZH:

bin/hdfs zkfc -formatZK

Format and start masterZH:

bin/hdfs namenode -format # only on first install or reinstall
sbin/hadoop-daemon.sh start namenode

Bootstrap and start master2ZH:

bin/hdfs namenode -bootstrapStandby # only on first install or reinstall
sbin/hadoop-daemon.sh start namenode

Start the zkfc service on masterZH and master2ZH:

sbin/hadoop-daemon.sh start zkfc

Start the datanodes from masterZH:

sbin/hadoop-daemons.sh start datanode

Start yarn and the resourcemanager on masterZH and master2ZH:

sbin/start-yarn.sh
sbin/yarn-daemon.sh start resourcemanager

Start the historyserver on master2ZH:

sbin/mr-jobhistory-daemon.sh start historyserver

hadoop processes on the master and slave nodes:

masterZH: master2ZH: slave*ZH:

hadoop command test and port check

Create an hdfs directory and change its permissions:

hadoop fs -mkdir /tmp
hadoop fs -chmod -R 777 /tmp

View cluster status:

masterZH:

master2ZH: the standby state is shown:
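
Besides the web pages, the active/standby state can be read from the command line with `hdfs haadmin -getServiceState` and `yarn rmadmin -getServiceState`. The wrapper below is a sketch: `print_ha_states` and the `HDFS_BIN`/`YARN_BIN` overrides are names invented here, while the ids nn1, nn2, rm1 and rm2 come from the hdfs-site.xml and yarn-site.xml shown earlier.

```shell
# Both binaries can be overridden, which allows a dry run without a cluster.
HDFS_BIN=${HDFS_BIN:-hdfs}
YARN_BIN=${YARN_BIN:-yarn}

# Print "<id>: <state>" for each configured NameNode and ResourceManager id.
print_ha_states() {
    for nn_id in nn1 nn2; do
        echo "$nn_id: $("$HDFS_BIN" haadmin -getServiceState "$nn_id")"
    done
    for rm_id in rm1 rm2; do
        echo "$rm_id: $("$YARN_BIN" rmadmin -getServiceState "$rm_id")"
    done
}

# Usage on a node with the hadoop environment loaded:
#   print_ha_states
```

One id of each pair should report active and the other standby.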

View cluster job information:

mysql-5.7.19 installation

Uninstall any existing mysql and mariadb

List the installed mariadb and mysql packages:

rpm -qa | grep mariadb
rpm -qa | grep mysql

Uninstall mariadb/mysql:

rpm -e --nodeps filename
rm /etc/my.cnf

Configure and install mysql

Extract the package bundle and install:

tar -xvf mysql-5.7.19-1.el6.x86_64.rpm-bundle.tar
rpm -ivh mysql-community-common-5.7.19-1.el6.x86_64.rpm
rpm -ivh mysql-community-libs-5.7.19-1.el6.x86_64.rpm
rpm -ivh mysql-community-client-5.7.19-1.el6.x86_64.rpm
rpm -ivh mysql-community-server-5.7.19-1.el6.x86_64.rpm

Passwordless login:

vi /etc/my.cnf

Add one line:

skip-grant-tables

Start:

service mysqld restart

Log in:

mysql

Create the hive database, the hive user and its privileges:

Metastore 1: masterZH (10.20.8.164)

-- reload the grant tables so account statements work under skip-grant-tables
flush privileges;
set global validate_password_policy=0;
ALTER USER root@localhost IDENTIFIED BY '12345678';
create database hive_metadata DEFAULT CHARSET utf8 COLLATE utf8_general_ci;
use hive_metadata;
create user 'hive'@'10.20.8.164' IDENTIFIED by '12345678';
revoke all privileges on *.* from 'hive'@'10.20.8.164';
revoke grant option on *.* from 'hive'@'10.20.8.164';
grant all on hive_metadata.* to hive@hive_metadata IDENTIFIED by '12345678';
grant all on hive_metadata.* to hive@'%' IDENTIFIED by '12345678';
flush privileges;

Metastore 2: master2ZH (10.20.8.165)

create user 'hive'@'10.20.8.165' IDENTIFIED by '12345678';
revoke all privileges on *.* from 'hive'@'10.20.8.165';
revoke grant option on *.* from 'hive'@'10.20.8.165';
grant all on hive_metadata.* to hive@hive_metadata IDENTIFIED by '12345678';
grant all on hive_metadata.* to hive@'%' IDENTIFIED by '12345678';
flush privileges;

Change the character set of the hive database:

alter database hive_metadata character set latin1;
quit;

hive-1.2.2 installation

Download hive-1.2.2

Download location:

http://www-eu.apache.org/dist/hive/stable/

Download the following three files:

apache-hive-1.2.2-bin.tar.gz
apache-hive-1.2.2-bin.tar.gz.asc
apache-hive-1.2.2-bin.tar.gz.md5

hive configuration

Create the hive installation directory:

mkdir /opt/hive/

Extract into that directory:

tar -zxvf ./apache-hive-1.2.2-bin.tar.gz -C /opt/hive/
mv /opt/hive/apache-hive-1.2.2-bin /opt/hive/1.2.2

Configure environment variables:

cd /etc/profile.d
touch hive-1.2.2.sh
vi hive-1.2.2.sh

Contents:

# set hive environment
HIVE_HOME=/opt/hive/1.2.2
PATH=$HIVE_HOME/bin:$PATH
CLASSPATH=$CLASSPATH:$HIVE_HOME/lib
export HIVE_HOME
export PATH
export CLASSPATH

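Apply the environment variables immediately:

source /etc/profile

Configure hive-env.sh

Go to the configuration directory:

cd /opt/hive/1.2.2/conf

Copy the template to hive-env.sh and open it:

cp hive-env.sh.template hive-env.sh
vi hive-env.sh

Settings: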

# set HADOOP_HOME to point specific hadoop install directory
HADOOP_HOME=/opt/hadoop/2.7.2
# hive configure directory can be controlled by:
export HIVE_CONF_DIR=/opt/hive/1.2.2/conf

Configure hive log4j:

Copy the templates:

cp hive-exec-log4j.properties.template hive-exec-log4j.properties
cp hive-log4j.properties.template hive-log4j.properties

Change the following settings in both files:

hive.log.dir=/opt/hive/1.2.2/logs
log4j.appender.EventCounter=org.apache.hadoop.log.metrics.EventCounter

The hive configuration refers to several hdfs paths, which have to be created by hand:

hdfs dfs -mkdir -p /usr/hive/warehouse
hdfs dfs -mkdir -p /usr/hive/tmp
hdfs dfs -mkdir -p /usr/hive/log
hdfs dfs -chmod 777 /usr/hive/warehouse
hdfs dfs -chmod 777 /usr/hive/tmp

Server-side hive-site.xml:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://10.20.8.169:3306/hive_metadata?createDatabaseIfNotExist=true&amp;characterEncoding=UTF-8&amp;useSSL=false</value>
    <description>the URL of the MySql database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>12345678</value>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/usr/hive/warehouse</value>
  </property>
  <property>
    <name>hive.exec.scratchdir</name>
    <value>/usr/hive/tmp</value>
  </property>
  <property>
    <name>hive.querylog.location</name>
    <value>/usr/hive/log</value>
  </property>
</configuration>

Client-side hive-site.xml:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://masterZH:9083,thrift://master2ZH:9083</value>
    <description>IP address(or fully-qualified domain name) and port of the metastore host</description>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/usr/hive/warehouse</value>
  </property>
  <property>
    <name>hive.exec.scratchdir</name>
    <value>/usr/hive/tmp</value>
  </property>
  <property>
    <name>hive.querylog.location</name>
    <value>/usr/hive/log</value>
  </property>
</configuration>

Metastore database driver and its path

tar -zxvf ./mysql-connector-java-5.1.44.tar.gz -C /opt/hive/1.2.2/lib
cd /opt/hive/1.2.2/lib
mv mysql-connector-java-5.1.44/mysql-connector-java-5.1.44-bin.jar ./

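Distribute the configured hive

Distribute the hive directory and the environment variables to the other nodes (two servers and three clients, whose hive-site.xml uses the server-side and client-side configuration respectively).

Apply the environment variables on the other nodes:

source /etc/profile

Initialize the metastore schema:

/opt/hive/1.2.2/bin/schematool -dbType mysql -initSchema

Start the server-side process

Run on the metastore servers: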

hive --service metastore&
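
Because the metastore is started in the background, it helps to confirm that it is actually listening before pointing clients at it. The following sketch probes a TCP port through bash's /dev/tcp pseudo-device; `port_open` is a made-up helper name, and 9083 is the port used in hive.metastore.uris in the client configuration.

```shell
# Return 0 if a TCP connection to HOST:PORT succeeds within 2 seconds.
# /dev/tcp is a bash feature, so bash is invoked explicitly.
port_open() {
    timeout 2 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Usage from any client node:
#   port_open masterZH 9083 && echo "metastore on masterZH is reachable"
#   port_open master2ZH 9083 && echo "metastore on master2ZH is reachable"
```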

Installation is now complete!

Table creation test

hdfs dfs -mkdir -p /data/test/
hdfs dfs -copyFromLocal users.dat /data/test/

users.dat contains:

1::F::1::10::48067
2::M::56::16::70072
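
As far as I know, Hive's default SerDe honors only a single-character field delimiter, so a two-character separator such as `::` may leave empty columns after loading. If that happens, one possible workaround (an assumption on my part, not something the original setup does) is to convert the file to tab-separated form before uploading it. `to_tsv` is a hypothetical helper name.

```shell
# Replace every "::" separator on stdin with a single tab character.
to_tsv() {
    awk '{ gsub(/::/, "\t"); print }'
}

# Usage:
#   to_tsv < users.dat > users.tsv
```

A table loaded from the converted file would then need `fields terminated by '\t'` in its definition.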

Enter the hive client

hive

At the interactive prompt, enter:

show databases;
create database hivetest;
use hivetest;
create table users(UserID BigInt, Gender String, Age Int, Occuption String, Zipcode String) partitioned by (dt String) row format delimited fields terminated by '::';
load data inpath '/data/test/users.dat' into table users partition(dt=20171214);
select count(1) from users;

As shown in the figure below:

spark installation

Download spark-1.5.1

Download location:

http://archive.apache.org/dist/spark/spark-1.5.1/

Download the following three files:

spark-1.5.1-bin-without-hadoop.tgz
spark-1.5.1-bin-without-hadoop.tgz.asc
spark-1.5.1-bin-without-hadoop.tgz.md5.txt

spark configuration

Create the spark installation directory:

mkdir /opt/spark/

Extract into that directory:

tar -zxf spark-1.5.1-bin-without-hadoop.tgz -C /opt/spark/
mv /opt/spark/spark-1.5.1-bin-without-hadoop /opt/spark/1.5.1

Copy the templates to get the configuration files:

cd /opt/spark/1.5.1/conf
cp spark-env.sh.template spark-env.sh
cp spark-defaults.conf.template spark-defaults.conf
cp log4j.properties.template log4j.properties

Configure spark-env.sh:

vi spark-env.sh

File contents:

export HADOOP_HOME=/opt/hadoop/2.7.2
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_DIST_CLASSPATH=$(hadoop classpath)

Configure slaves

vi slaves

File contents:

slave1ZH
slave2ZH
slave3ZH

Configure environment variables:

cd /etc/profile.d
touch spark-1.5.1.sh
vi spark-1.5.1.sh

File contents:

# set spark environment
SPARK_HOME=/opt/spark/1.5.1
PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
export SPARK_HOME
export PATH

Apply the environment variables immediately:

source /etc/profile

Change the console log level:

vi log4j.properties

File contents:

log4j.rootCategory=WARN, console

Configure the history server:

vi spark-defaults.conf

Add the following:

# Turns on logging for applications submitted from this machine
# (an hdfs path also works here)
spark.eventLog.dir /opt/spark/1.5.1/events
spark.eventLog.enabled true
# Sets the logging directory for the history server
# (keep it identical to spark.eventLog.dir)
spark.history.fs.logDirectory /opt/spark/1.5.1/events

spark.history.fs.cleaner.enabled true
spark.history.fs.cleaner.interval 5d
spark.history.fs.cleaner.maxAge 15d

Create the event log directory on every node:

mkdir /opt/spark/1.5.1/events

Distribute to every node:

Run on every node:

mkdir -p /opt/spark/1.5.1

Run on the node that does the distribution:

scp -r ./* root@masterZH:/opt/spark/1.5.1
scp -r ./* root@slave1ZH:/opt/spark/1.5.1
...
scp /etc/profile.d/spark-1.5.1.sh root@masterZH:/etc/profile.d/
scp /etc/profile.d/spark-1.5.1.sh root@slave1ZH:/etc/profile.d/
...
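
The per-host scp lines follow one pattern, so they can be folded into a loop. This is a minimal sketch under the assumption that every target uses the same path and the root account; `distribute_dir` and the `COPY_BIN` override are names made up for this example.

```shell
# COPY_BIN can be overridden, for example with a stub for a dry run.
COPY_BIN=${COPY_BIN:-scp}

# Copy SRC recursively to DEST on every listed node, stopping at the first failure.
distribute_dir() {
    src=$1; dest=$2; shift 2
    for node in "$@"; do
        "$COPY_BIN" -r "$src" "root@${node}:${dest}" || return 1
    done
}

# Usage on the distributing node:
#   distribute_dir /opt/spark/1.5.1 /opt/spark/ masterZH slave1ZH slave2ZH slave3ZH
#   distribute_dir /etc/profile.d/spark-1.5.1.sh /etc/profile.d/ masterZH slave1ZH slave2ZH slave3ZH
```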

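Run on every node:

source /etc/profile

Start spark

Start the master:

start-master.sh

Start the slaves:

start-slaves.sh

Check the processes on the master and slave nodes:

masterZH: master2ZH: slave*ZH:

Test spark

Run the bundled example:

./run-example SparkPi 10

A successful run is enough.

Check the master node status:

masterZH:

master2ZH: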

hbase-1.0.2 installation

Download hbase-1.0.2

Download location:

http://archive.apache.org/dist/hbase/hbase-1.0.2/

Download the following three files:

hbase-1.0.2-bin.tar.gz
hbase-1.0.2-bin.tar.gz.asc
hbase-1.0.2-bin.tar.gz.mds

hbase configuration

Create the hbase installation directory:

mkdir /opt/hbase/

Extract into that directory:

tar -zvxf hbase-1.0.2-bin.tar.gz -C /opt/hbase/
mv /opt/hbase/hbase-1.0.2 /opt/hbase/1.0.2

Configure environment variables:

cd /etc/profile.d
touch hbase-1.0.2.sh
vi hbase-1.0.2.sh

Contents:

# set hbase environment
HBASE_HOME=/opt/hbase/1.0.2
PATH=$PATH:$HBASE_HOME/bin
export HBASE_HOME
export PATH

Apply the environment variables immediately:

source /etc/profile

Configure hbase-env.sh

cd /opt/hbase/1.0.2/conf
vi hbase-env.sh

The relevant settings are:

export JAVA_HOME=/usr/local/java/jdk1.8.0_151
export HBASE_MANAGES_ZK=false

Configure hbase-site.xml:

vi hbase-site.xml

Contents:

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://ns1/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.master</name>
    <value>60000</value>
  </property>
  <property>
    <name>hbase.tmp.dir</name>
    <value>file:///opt/hbase/1.0.2/tmp</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>slave1ZH,slave2ZH,slave3ZH</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/var/lib/zookeeper</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
  <property>
    <name>zookeeper.session.timeout</name>
    <value>120000</value>
  </property>
  <property>
    <name>hbase.regionserver.restart.on.zk.expire</name>
    <value>true</value>
  </property>
</configuration>

Edit regionservers:

vi regionservers

Contents:

slave1ZH
slave2ZH
slave3ZH

Copy hadoop's core-site.xml and hdfs-site.xml into hbase's conf directory:

cp ${HADOOP_HOME}/etc/hadoop/hdfs-site.xml  ${HBASE_HOME}/conf/
cp ${HADOOP_HOME}/etc/hadoop/core-site.xml  ${HBASE_HOME}/conf/

Replace the hadoop-common jar under lib:

cp ${HADOOP_HOME}/share/hadoop/common/hadoop-common-2.7.2.jar ${HBASE_HOME}/lib/

Distribute to every node:

Run on the other nodes:

mkdir -p /opt/hbase/1.0.2

Run on the node that does the distribution:

scp -r /opt/hbase/1.0.2/ root@masterZH:/opt/hbase/
scp -r /opt/hbase/1.0.2/ root@slave1ZH:/opt/hbase/
…
scp /etc/profile.d/hbase-1.0.2.sh root@masterZH:/etc/profile.d/
scp /etc/profile.d/hbase-1.0.2.sh root@slave1ZH:/etc/profile.d/
…
source /etc/profile

Verify hbase:

Run on one master:

start-hbase.sh

Run on the other master:

hbase-daemon.sh start master

On one of the zk machines, run zkCli.sh and enter:

ls /hbase/backup-masters

The backup master is listed:

[masterzh,16020,1513668117269]

More detailed information:

http://master2zh:16010/master-status

View the hbase directory structure on hdfs

hadoop fs -ls /hbase

Run the hbase shell

hbase shell

Enter:

create 'member','member_id','address','info'
describe 'member'
list
exit

Check the processes on the master and slave nodes:

masterZH: master2ZH: slave*ZH:

View the hbase web UI:

masterZH:

master2ZH:

solr-5.3.1 installation

solr configuration

Install solr with the bundled install script:

tar xvf solr-5.3.1.tgz solr-5.3.1/bin/install_solr_service.sh --strip-components=2 # extract

./install_solr_service.sh solr-5.3.1.tgz -i /opt -d /var/solr -u root -s solr -p 8983 # install

Set the zookeeper hosts in solr.in.sh:

vi /opt/solr-5.3.1/bin/solr.in.sh
vi /var/solr/solr.in.sh

Change the corresponding setting in both files to:

ZK_HOST="slave1ZH:2181,slave2ZH:2181,slave3ZH:2181"
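
The same zookeeper connection string appears in core-site.xml, hdfs-site.xml, yarn-site.xml and here, so a typo in one copy is easy to make. A tiny sketch that builds the string from a host list is shown below; `zk_connect_string` is a hypothetical helper, not part of solr or zookeeper.

```shell
# Join the HOST arguments into "host:port,host:port,..." using the given PORT.
zk_connect_string() {
    port=$1; shift
    out=""
    for host in "$@"; do
        # Prepend a comma only when out already holds an entry.
        out="${out:+$out,}${host}:${port}"
    done
    printf '%s\n' "$out"
}

# Prints the value used for ZK_HOST and the *.zk-address / quorum settings.
zk_connect_string 2181 slave1ZH slave2ZH slave3ZH
```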

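Restart the service:

service solr restart

Create a test collection:

pwd
cd /opt/solr
pwd
bin/solr create -c testcollection -d data_driven_schema_configs -s 3 -rf 2 -n myconf

View the solr log directory:

cd /opt/zookeeper/3.5.1/
bin/zkCli.sh
cd /var/solr/logs
ll

Check the port status:

netstat -nplt | grep 8983

View the solr admin page:

http://slave1zh:8983/solr/#/~cloud

Distribute to every node (omitted)

The kafka installation is omitted (over the length limit). Questions are welcome. This weekend I plan to write up, here in the community, a practical basic graph algorithm for big data.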


#2

An excellent document for everyone to study and refer to!


#3

Packed with practical content, great!!!:grin:


#4
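  • Thanks to fishexpert and DeepCoder, two hard-working veterans of the community, for the encouragement
  • The bold text did not render because it had no line breaks around it; I have re-edited it and fixed some spelling
  • The cluster described here is a test cluster I built at my company and it is currently in use; as we gain experience its configuration will be refined, and I will keep this document in sync
  • Haroopad is a good offline markdown editor, much cleaner than Word; give it a try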