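This series now runs to quite a few articles. Here is an index of the related posts, for reference:
- Notes on Building a Deep Learning Machine
- Deep Learning Machine Environment Setup: Ubuntu16.04+Nvidia GTX 1080+CUDA8.0
- Deep Learning Machine Environment Setup: Ubuntu16.04+GeForce GTX 1080+TensorFlow
- Deep Learning Server Environment Setup: Ubuntu17.04+Nvidia GTX 1080+CUDA 9.0+cuDNN 7.0+TensorFlow 1.3
- Building a Deep Learning Server from Scratch: Hardware Selection
- Building a Deep Learning Server from Scratch: Basic Environment Setup (Ubuntu + GTX 1080 TI + CUDA + cuDNN)
- Building a Deep Learning Server from Scratch: Installing Deep Learning Tools (TensorFlow + PyTorch + Torch)
- Building a Deep Learning Server from Scratch: Installing Deep Learning Tools (Theano + MXNet)
- Building a Deep Learning Server from Scratch: Four 1080TI Cards in Parallel (Ubuntu16.04+CUDA9.2+cuDNN7.1+TensorFlow+Keras)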
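Recently the company set up another 4-card 1080TI machine. Its configuration is basically the same as the previous one, except that the graphics cards are now GIGABYTE's pseudo-reference 1080TI: GIGABYTE GTX1080Ti turbo fan 108TTURBO-11GD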
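| Component | Model | Price | Link | Notes |
| --- | --- | --- | --- | --- |
| CPU | Intel Core i7-6850K six-core, boxed CPU | 4599 | http://item.jd.com/11814000696.html | |
| Cooler | Corsair H55 liquid cooler | 449 | https://item.jd.com/10850633518.html | |
| Motherboard | ASUS X99-E WS/USB 3.1 workstation motherboard | 4759 | | |
| Memory | Corsair (USCORSAIR) Vengeance LPX DDR4 3000 32GB (16G x 4) | 2799 * 2 | https://item.jd.com/1990572.html | |
| SSD | Samsung 960 EVO 250G M.2 NVMe SSD | 599 | https://item.jd.com/3739097.html | |
| Hard disk | Seagate BarraCuda 4TB 5900 RPM desktop hard disk * 2 | 629 * 2 | https://item.jd.com/4220257.html | |
| Power supply | Corsair AX1500i fully modular power supply, 80Plus Gold | 3699 | https://item.jd.com/10783917878.html | |
| Case | Corsair AIR540 USB3.0 | 949 | http://item.jd.com/12173900062.html | |
| Graphics card | GIGABYTE GTX1080Ti 11GB non-reference high-end gaming card, turbo fan, for deep learning * 4 | 7400 * 4 | https://item.jd.com/10583752777.html | |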
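This deep learning machine looks roughly like this:
After installing Ubuntu16.04, I set off once again on the journey of installing CUDA, cuDNN, and the other deep learning environments and tools. More than half a year has passed since the last time and a lot has changed; in particular, CUDA9.x and cuDNN7.x have become standard, so I am recording the process here.
Installing CUDA9.x
Note: if you also need to install TensorFlow1.8, I recommend installing CUDA9.0 at this step. I ran into a problem on another machine, and I suspect it is related to this machine having had CUDA9.0 installed first and CUDA9.2 installed afterwards.
As before, download the current CUDA release from NVIDIA's official site. I chose the latest, CUDA9.2:
After you select the CUDA9.2 deb package for Ubuntu16.04, the NVIDIA page shows the installation instructions: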
Installation Instructions:
`sudo dpkg -i cuda-repo-ubuntu1604-9-2-local_9.2.88-1_amd64.deb`
`sudo apt-key add /var/cuda-repo-/7fa2af80.pub`
`sudo apt-get update`
`sudo apt-get install cuda`
After downloading the roughly 1.2G CUDA deb package, the commands I actually ran were:
sudo dpkg -i cuda-repo-ubuntu1604-9-2-local_9.2.88-1_amd64.deb
sudo apt-key add /var/cuda-repo-9-2-local/7fa2af80.pub
sudo apt-get update
sudo apt-get install cuda
The official CUDA download page also offers a cuBLAS 9.2 Patch update, which NVIDIA strongly recommends installing:
This update includes fix to cublas GEMM APIs on V100 Tensor Core GPUs when used with default algorithm CUBLAS_GEMM_DEFAULT_TENSOR_OP. We strongly recommend installing this update as part of CUDA Toolkit 9.2 installation.
The patch update can be installed as follows:
sudo dpkg -i cuda-repo-ubuntu1604-9-2-local-cublas-update-1_1.0-1_amd64.deb
sudo apt-get update
sudo apt-get upgrade cuda
Once CUDA9.2 is installed, the graphics driver for the 1080TI has been installed along with it. Reboot the machine, then check with the nvidia-smi command:
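To confirm from a script that all four cards are visible after the reboot, a minimal Python sketch follows. It assumes `nvidia-smi` is on the PATH and relies on its `--query-gpu` and `--format=csv,noheader` options; the helper names `parse_gpu_list` and `list_gpus` are made up for illustration and are not part of any NVIDIA tool.

```python
import subprocess


def parse_gpu_list(csv_text):
    """Parse 'index, name, memory.total' rows (CSV, no header) into dicts."""
    gpus = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        # Split on the first two commas only; the memory field comes last.
        index, name, memory = [field.strip() for field in line.split(",", 2)]
        gpus.append({"index": int(index), "name": name, "memory": memory})
    return gpus


def list_gpus():
    """Ask nvidia-smi for the installed GPUs (requires the NVIDIA driver)."""
    output = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=index,name,memory.total",
        "--format=csv,noheader",
    ])
    return parse_gpu_list(output.decode("utf-8"))
```

On this machine, `len(list_gpus())` would be expected to equal 4; a smaller number suggests that a card is not seated properly or that the driver did not load.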
Finally, set the environment variables in ~/.bashrc:
export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export CUDA_HOME=/usr/local/cuda
Run source ~/.bashrc to make them take effect.
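A quick way to verify that these variables are really set in the current shell is sketched below in Python. It only inspects the three values used in this article (`/usr/local/cuda/bin`, `/usr/local/cuda/lib64`, `/usr/local/cuda`); the function name `missing_cuda_env` is hypothetical.

```python
import os

CUDA_HOME = "/usr/local/cuda"
CUDA_BIN = CUDA_HOME + "/bin"
CUDA_LIB = CUDA_HOME + "/lib64"


def missing_cuda_env(env):
    """Return the names of variables that lack the expected CUDA entries."""
    missing = []
    # PATH and LD_LIBRARY_PATH are colon-separated lists on Linux.
    if CUDA_BIN not in env.get("PATH", "").split(":"):
        missing.append("PATH")
    if CUDA_LIB not in env.get("LD_LIBRARY_PATH", "").split(":"):
        missing.append("LD_LIBRARY_PATH")
    if env.get("CUDA_HOME") != CUDA_HOME:
        missing.append("CUDA_HOME")
    return missing


if __name__ == "__main__":
    print(missing_cuda_env(os.environ) or "CUDA environment looks fine")
```

Passing the environment in as a dictionary keeps the check easy to try out without touching the real shell configuration.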
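Installing cuDNN7.x
Likewise, go to the cuDNN download page on NVIDIA's site: https://developer.nvidia.com/rdp/cudnn-download. The latest version is cuDNN7.1.4, and there are three builds to choose from, targeting CUDA8.0, CUDA9.0, and CUDA9.2 respectively:
After downloading the cuDNN7.1 archive, extract it and copy the relevant files into the CUDA system path: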
tar -zxvf cudnn-9.2-linux-x64-v7.1.tgz
sudo cp cuda/include/cudnn.h /usr/local/cuda/include/
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64/ -d
sudo chmod a+r /usr/local/cuda/include/cudnn.h
sudo chmod a+r /usr/local/cuda/lib64/libcudnn*
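To double-check which cuDNN version ended up in the CUDA path, one option is to read the version macros from the copied header. The sketch below assumes the cuDNN 7.x layout, where `cudnn.h` defines `CUDNN_MAJOR`, `CUDNN_MINOR`, and `CUDNN_PATCHLEVEL`; the function names are illustrative only.

```python
import re


def parse_cudnn_version(header_text):
    """Return (major, minor, patch) from the text of cudnn.h, or None."""
    parts = []
    for macro in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(r"^\s*#define\s+%s\s+(\d+)" % macro,
                          header_text, re.MULTILINE)
        if match is None:
            return None  # macro not found: not a header this sketch understands
        parts.append(int(match.group(1)))
    return tuple(parts)


def installed_cudnn_version(path="/usr/local/cuda/include/cudnn.h"):
    """Read the header copied in the step above and parse its version."""
    with open(path) as header:
        return parse_cudnn_version(header.read())
```

For the archive used in this article, the parsed tuple should start with 7 and 1.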
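Installing TensorFlow 1.8
Installing TensorFlow keeps getting easier, and the TensorFlow site now has installation documentation in Chinese: https://www.tensorflow.org/install/install_linux?hl=zh-cn . We follow that document and install TensorFlow with the Virtualenv method, but first the base environment needs a little setup.
First, install the libcupti-dev library on Ubuntu16.04:
This is the NVIDIA CUDA Profiling Tools Interface. The library provides advanced profiling support. To install it, issue the following command for CUDA Toolkit 8.0 or later: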
$ sudo apt-get install cuda-command-line-tools
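and add its path to your LD_LIBRARY_PATH environment variable: $ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/extras/CUPTI/lib64
For CUDA Toolkit 7.5 or earlier, issue the following command: $ sudo apt-get install libcupti-dev
However, after running "sudo apt-get install cuda-command-line-tools", what I got was:
E: Unable to locate package cuda-command-line-tools
After some Googling I found that this package had in fact already been installed together with CUDA9.2, and the library is already present under the CUDA path: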
/usr/local/cuda/extras/CUPTI/lib64$ ls
libcupti.so libcupti.so.9.2 libcupti.so.9.2.88
Now it only needs to be added to the environment variables: add the following line to ~/.bashrc and run source ~/.bashrc to make it take effect:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/extras/CUPTI/lib64
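The manual check above (list the directory, then extend LD_LIBRARY_PATH) can also be sketched in Python. The helper names `find_shared_libs` and `on_library_path` are hypothetical; the paths are the ones used in this article.

```python
import glob
import os


def find_shared_libs(directory, stem):
    """List file names matching '<stem>.so*' in a directory, sorted."""
    pattern = os.path.join(directory, stem + ".so*")
    return sorted(os.path.basename(path) for path in glob.glob(pattern))


def on_library_path(directory, env):
    """True if the directory is listed in LD_LIBRARY_PATH."""
    return directory in env.get("LD_LIBRARY_PATH", "").split(":")


if __name__ == "__main__":
    cupti_dir = "/usr/local/cuda/extras/CUPTI/lib64"
    print(find_shared_libs(cupti_dir, "libcupti"))
    print(on_library_path(cupti_dir, os.environ))
```

The same helper pointed at /usr/local/cuda/lib64 with the stem "libcublas" shows which cuBLAS versions are installed, which is useful when an import fails because a specific versioned library such as libcublas.so.9.0 cannot be found.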
The rest is even simpler:
sudo apt-get install python-pip python-dev python-virtualenv
virtualenv --system-site-packages tensorflow1.8
source tensorflow1.8/bin/activate
easy_install -U pip
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple --upgrade tensorflow-gpu
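I strongly recommend writing the Tsinghua pip mirror into the pip configuration file, which makes installs faster and more convenient.
Finally, test TensorFlow1.8: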
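As a rough illustration of what writing the mirror into the configuration file can look like: pip reads an `index-url` setting from the `[global]` section of a per-user file, and `~/.pip/pip.conf` is one of the locations it checks on Linux. The Python sketch below generates that file; the function names are made up, and the target path is an assumption you may want to change.

```python
import os

TUNA_INDEX = "https://pypi.tuna.tsinghua.edu.cn/simple"


def render_pip_conf(index_url=TUNA_INDEX):
    """Return pip.conf content that makes index_url the default index."""
    return "[global]\nindex-url = %s\n" % index_url


def write_pip_conf(path=os.path.expanduser("~/.pip/pip.conf")):
    """Write the configuration, creating the parent directory if needed."""
    directory = os.path.dirname(path)
    if not os.path.isdir(directory):
        os.makedirs(directory)
    with open(path, "w") as handle:
        handle.write(render_pip_conf())
```

With such a file in place, a plain `pip install --upgrade tensorflow-gpu` would go to the mirror without the `-i` option. Note that `write_pip_conf` overwrites an existing file at that path.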
Python 2.7.12 (default, Dec 4 2017, 14:50:18)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
2018-06-17 12:15:34.158680: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-06-17 12:15:34.381812: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:05:00.0
totalMemory: 10.91GiB freeMemory: 5.53GiB
2018-06-17 12:15:34.551451: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 1 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:06:00.0
totalMemory: 10.92GiB freeMemory: 5.80GiB
2018-06-17 12:15:34.780350: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 2 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:09:00.0
totalMemory: 10.92GiB freeMemory: 5.80GiB
2018-06-17 12:15:34.959199: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 3 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:0a:00.0
totalMemory: 10.92GiB freeMemory: 5.80GiB
2018-06-17 12:15:34.966403: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0, 1, 2, 3
2018-06-17 12:15:36.373745: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-06-17 12:15:36.373785: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0 1 2 3
2018-06-17 12:15:36.373798: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N Y Y Y
2018-06-17 12:15:36.373804: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 1: Y N Y Y
2018-06-17 12:15:36.373808: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 2: Y Y N Y
2018-06-17 12:15:36.373814: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 3: Y Y Y N
2018-06-17 12:15:36.374516: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5307 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:05:00.0, compute capability: 6.1)
2018-06-17 12:15:36.444426: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 5582 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:06:00.0, compute capability: 6.1)
2018-06-17 12:15:36.506340: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 5582 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:09:00.0, compute capability: 6.1)
2018-06-17 12:15:36.614736: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 5582 MB memory) -> physical GPU (device: 3, name: GeForce GTX 1080 Ti, pci bus id: 0000:0a:00.0, compute capability: 6.1)
Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:05:00.0, compute capability: 6.1
/job:localhost/replica:0/task:0/device:GPU:1 -> device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:06:00.0, compute capability: 6.1
/job:localhost/replica:0/task:0/device:GPU:2 -> device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:09:00.0, compute capability: 6.1
/job:localhost/replica:0/task:0/device:GPU:3 -> device: 3, name: GeForce GTX 1080 Ti, pci bus id: 0000:0a:00.0, compute capability: 6.1
2018-06-17 12:15:36.689345: I tensorflow/core/common_runtime/direct_session.cc:284] Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:05:00.0, compute capability: 6.1
/job:localhost/replica:0/task:0/device:GPU:1 -> device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:06:00.0, compute capability: 6.1
/job:localhost/replica:0/task:0/device:GPU:2 -> device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:09:00.0, compute capability: 6.1
/job:localhost/replica:0/task:0/device:GPU:3 -> device: 3, name: GeForce GTX 1080 Ti, pci bus id: 0000:0a:00.0, compute capability: 6.1
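Instead of reading the log by eye, the visible GPUs can also be counted from Python. The sketch below assumes the TensorFlow 1.x helper `device_lib.list_local_devices()` from `tensorflow.python.client`, whose entries carry `name` and `device_type` fields; the wrapper names are illustrative, and TensorFlow is imported lazily so the filtering logic can be tried without it.

```python
def gpu_device_names(devices):
    """Keep the names of entries whose device_type is 'GPU'."""
    return [device.name for device in devices if device.device_type == "GPU"]


def local_gpu_names():
    """Query TensorFlow 1.x for local devices and return the GPU names."""
    # Imported here so that the module loads even without TensorFlow.
    from tensorflow.python.client import device_lib
    return gpu_device_names(device_lib.list_local_devices())
```

On the machine described here, `len(local_gpu_names())` would be expected to be 4, matching the four "Found device" entries in the log.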
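Installing Keras2.1.x
Keras supports TensorFlow, Theano, and CNTK as backends. With the GPU build of TensorFlow installed, installing Keras is very simple: inside the TensorFlow virtual environment, just run "pip install keras". The version installed is Keras2.1.6: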
Installing collected packages: h5py, scipy, pyyaml, keras
Successfully installed h5py-2.7.1 keras-2.1.6 pyyaml-3.12 scipy-1.1.0
Test it:
Python 2.7.12 (default, Dec 4 2017, 14:50:18)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import keras
Using TensorFlow backend.
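Note: this is an original article. When reposting, please credit the source and keep the link to "我爱自然语言处理": https://www.52nlp.cn
Permalink: Building a Deep Learning Server from Scratch: Four 1080TI Cards in Parallel (Ubuntu16.04+CUDA9.2+cuDNN7.1+TensorFlow+Keras) https://www.52nlp.cn/?p=10334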
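tensorflow 1.8.0 doesn't seem to be compatible with CUDA9.2... tensorflow threw an error
52nlp replied:
June 18, 2018 at 22:08
What error does it report? It works fine on my side. Also, my apologies: I did see your comment this afternoon, and the cuDNN version I used is cudnn-9.2-linux-x64-v7.1.tgz. The site had a problem around that time and I had to restore it, so that comment of yours was lost.
huajin replied:
June 19, 2018 at 18:28
Then it's probably that my earlier CUDA9.0 wasn't removed cleanly.
huajin replied:
June 20, 2018 at 14:10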
>>> import tensorflow as tf
Traceback (most recent call last):
File "", line 1, in
File "/home/cvnlp/tensorflow1.8/local/lib/python2.7/site-packages/tensorflow/__init__.py", line 24, in
from tensorflow.python import pywrap_tensorflow # pylint: disable=unused-import
File "/home/cvnlp/tensorflow1.8/local/lib/python2.7/site-packages/tensorflow/python/__init__.py", line 49, in
from tensorflow.python import pywrap_tensorflow
File "/home/cvnlp/tensorflow1.8/local/lib/python2.7/site-packages/tensorflow/python/pywrap_tensorflow.py", line 74, in
raise ImportError(msg)
ImportError: Traceback (most recent call last):
File "/home/cvnlp/tensorflow1.8/local/lib/python2.7/site-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in
from tensorflow.python.pywrap_tensorflow_internal import *
File "/home/cvnlp/tensorflow1.8/local/lib/python2.7/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in
_pywrap_tensorflow_internal = swig_import_helper()
File "/home/cvnlp/tensorflow1.8/local/lib/python2.7/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
_mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
ImportError: libcublas.so.9.0: cannot open shared object file: No such file or directory
52nlp replied:
June 20, 2018 at 14:27
First, check whether there is a libcublas.so file under /usr/local/cuda/lib64. Mine looks like the listing below; the question is why yours is still looking for 9.0.
libcublas.so
libcublas.so.9.2
Second, check whether the environment variables have been set in ~/.bashrc:
export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export CUDA_HOME=/usr/local/cuda
Run source ~/.bashrc to make them take effect.
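Hello, may I ask whether 1080Ti cards from different brands can be combined into a multi-GPU parallel server? I have a reference card, a Zotac, and an AORUS (大雕) 1080Ti on hand...
52nlp replied:
June 29, 2018 at 18:02
It should work. I have mixed 1080TI and titan xp cards without any problems.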
The i7's bus can't really drive four 1080Ti cards, can it?
52nlp replied:
August 15, 2018 at 23:07
No problems in use so far.
Thanks to the author, the installation succeeded!
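Hello, my problem is also ImportError: libcublas.so.9.0: cannot open shared object file: No such file or directory. The directory does contain libcublas.so and libcublas.so.9.2, and I followed every previous step in this article. The tensorflow I installed is version 1.8. Should I have installed cuda9.0 rather than 9.2?
One more question: Ubuntu's default python3 version is 3.5. I use 3.6 and changed the default python version; will that have any effect?
Thanks!
52nlp replied:
September 2, 2018 at 22:25
On the first question, the article also points this out: you probably do need to install CUDA9.0. On the second question, 3.6 should not be much of a problem.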
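Hi boss, don't your four cards run hot packed together? I also plugged four cards in right next to each other like you did; the idle temperature is already over fifty degrees, and running a small model sent it straight past 90, so I didn't dare keep going..
Also, my motherboard has 7 PCI-E slots too, and if I put a graphics card in the bottom one there is no way to plug in the motherboard power connector T T. How did you solve that?
Thanks in advance~
52nlp replied:
October 17, 2018 at 10:47
Uh, I'm hardly a boss. What graphics cards are you using? Reference or pseudo-reference? When I train machine translation models here the temperature generally reaches around 90 degrees too, but it isn't much of a problem; training continuously for a week has been fine. Do pay attention to cooling in summer.
"Also, my motherboard has 7 PCI-E slots too, and if I put a graphics card in the bottom one there is no way to plug in the motherboard power connector T T. How did you solve that?" I'm not sure about the specifics of this one, so I can't answer it.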
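I installed the graphics driver separately, and nvidia-smi shows only one 2080ti; it doesn't detect both cards. Also, didn't you disable Ubuntu's built-in graphics driver before installing cuda and cudnn?
52nlp replied:
April 26, 2019 at 14:04
I don't quite remember, but I didn't disable it. You could check again whether the other card is seated properly on the motherboard.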
Boss, are you selling your graphics cards second-hand?
52nlp replied:
August 28, 2019 at 17:50
Uh, no.