GPU and CUDA Installation

GPU

https://developer.nvidia.com/cuda-gpus

Version

Download: https://www.nvidia.cn/geforce/drivers/

$  uname -srm
Linux 5.10.0-21-amd64 x86_64

# dpkg-query -s linux-headers-$(uname -r)

The driver version should be 51x (CUDA 11.7, used below, requires at least 515).

$ lspci | grep -i vga
01:00.0 VGA compatible controller: NVIDIA Corporation GP106M [GeForce GTX 1060 Mobile] (rev a1)

If the NVIDIA card entry ends in rev a1, the discrete GPU is enabled.

If it ends in rev ff, the discrete GPU is powered off.
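
The rule above can be turned into a small check. This is only a sketch: the function name `gpu_power_state` is made up for illustration, and it assumes that any revision other than ff means the card is enabled.

```shell
# Classify one lspci line by its revision suffix.
# Assumption: "rev ff" means powered off; any other revision counts as on.
gpu_power_state() {
    case "$1" in
        *"rev ff"*) echo "off" ;;
        *"rev "*)   echo "on" ;;
        *)          echo "unknown" ;;
    esac
}

# Apply it to every NVIDIA line reported by lspci, if lspci is available.
if command -v lspci >/dev/null 2>&1; then
    lspci | grep -i nvidia | while IFS= read -r line; do
        printf '%s -> %s\n' "$line" "$(gpu_power_state "$line")"
    done || true
fi
```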

Supported GPU list: https://developer.nvidia.com/cuda-gpus

X server

# Switch to text mode
/sbin/init 3

# Switch to graphical mode
/sbin/init 5

Using: nvidia-installer ncurses v6 user interface
-> Detected 8 CPUs online; setting concurrency level to 8.
-> The file ‘/tmp/.X0-lock’ exists and appears to contain the process ID ‘773’ of a running X server.
ERROR: You appear to be running an X server; please exit X before installing. For further details, please see the section INSTALLING THE NVIDIA DRIVER in the README available on the Linux driver download page at www.nvidia.com.
ERROR: Installation has failed. Please see the file ‘/var/log/nvidia-installer.log’ for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

driver

cat <<EOF | sudo tee /usr/lib/modprobe.d/dist-blacklist.conf
blacklist nouveau
options nouveau modeset=0
EOF


sudo update-initramfs -u

# After rebooting, confirm nouveau is no longer loaded (no output expected)
lsmod | grep nouveau

ERROR: The Nouveau kernel driver is currently in use by your system. This driver is incompatible with the NVIDIA driver, and must be disabled before proceeding. Please consult the NVIDIA driver README and your Linux distribution’s documentation for details on how to correctly disable the Nouveau kernel driver.
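
Before launching the installer, it may help to verify that Nouveau is really gone. The sketch below parses `lsmod` output; the helper name `nouveau_loaded` is hypothetical, and the check only looks at the module-name column.

```shell
# Return 0 if the given lsmod output lists the nouveau module, 1 otherwise.
# Only the first column (module name) is inspected.
nouveau_loaded() {
    printf '%s\n' "$1" | awk '$1 == "nouveau" { found = 1 } END { exit !found }'
}

if nouveau_loaded "$(lsmod 2>/dev/null)"; then
    echo "nouveau is still loaded: reboot or recheck the blacklist file"
else
    echo "nouveau is not loaded"
fi
```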

kernel-source

The installer may prompt you to install missing packages.

# Install whichever packages /var/log/nvidia-installer.log reports as missing
sudo apt install dkms

For --kernel-source-path errors, install the matching linux-headers package (linux-headers-$(uname -r)).

# Run after stopping the X server (graphical session)
sh NVIDIA-Linux-x86_64-*.run

Driver version

cat /proc/driver/nvidia/version

GPU device numbers

ls -l /dev/nvidia*
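
To count the GPUs rather than read the listing by eye, something like the following could work. It is a sketch that assumes per-GPU nodes are named nvidia0, nvidia1, and so on; `count_gpu_nodes` is an illustrative name.

```shell
# Count per-GPU device nodes (nvidia0, nvidia1, ...) under a directory.
# Control nodes such as nvidiactl or nvidia-uvm are not counted.
count_gpu_nodes() {
    dir=${1:-/dev}
    n=0
    for node in "$dir"/nvidia[0-9]*; do
        if [ -e "$node" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

echo "GPU device nodes: $(count_gpu_nodes)"
```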

Multiple GPUs

2

sudo nvidia-smi

Temperature

sudo nvidia-smi -q -d TEMPERATURE

Refresh every 10 seconds

watch -n 10 nvidia-smi
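
For scripted monitoring, `nvidia-smi` can print the temperature as a bare number through `--query-gpu`. The sketch below compares it with a threshold; the 80 degree default and the name `check_temp` are arbitrary choices for illustration, not recommended values.

```shell
# Print a warning when a temperature (degrees C) exceeds a threshold.
# The default limit of 80 is an arbitrary illustration value.
check_temp() {
    temp=$1
    limit=${2:-80}
    if [ "$temp" -gt "$limit" ]; then
        echo "WARN: GPU at ${temp}C exceeds ${limit}C"
    else
        echo "OK: GPU at ${temp}C"
    fi
}

# Query each GPU once; skip any line that is not a plain integer.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits 2>/dev/null |
        while IFS= read -r t; do
            case "$t" in
                ''|*[!0-9]*) continue ;;
            esac
            check_temp "$t"
        done || true
fi
```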

Uninstall

sh  NVIDIA*.run  --uninstall  

A newer installer can uninstall an older driver; for example, the 525 installer can uninstall 515.

CUDA

https://developer.nvidia.com/cuda-toolkit-archive

sudo sh cuda_*.run

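# At the accept/decline/quit prompt, type: accept
# Install NVIDIA Accelerated Graphics Driver: no (leave it unselected)
# nvidia-fs: kernel module for GPUDirect Storage (GDS); select it only if you need GDS. Using GPUs in containers does not require it.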

===========
= Summary =
===========

Driver: Not Selected
Toolkit: Installed in /usr/local/cuda-11.7/

Please make sure that

  • PATH includes /usr/local/cuda-11.7/bin
  • LD_LIBRARY_PATH includes /usr/local/cuda-11.7/lib64, or, add /usr/local/cuda-11.7/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-11.7/bin
***WARNING: Incomplete installation! This installation did not install the CUDA Driver. A driver of version at least 515.00 is required for CUDA 11.7 functionality to work.
To install the driver using this installer, run the following command, replacing <CudaInstaller> with the name of this run file:
sudo <CudaInstaller>.run --silent --driver

Logfile is /var/log/cuda-installer.log
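
The warning above says CUDA 11.7 needs a driver of at least 515.00. A rough way to check the installed driver against that minimum is sketched below; `version_at_least` is a hypothetical helper and it relies on GNU `sort -V`.

```shell
# Return 0 if dotted version $1 is at least version $2.
# Relies on GNU sort's -V (version sort) option.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

required="515.00"   # minimum quoted in the installer warning for CUDA 11.7
if command -v nvidia-smi >/dev/null 2>&1; then
    installed=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null | head -n 1)
    if version_at_least "$installed" "$required"; then
        echo "driver $installed satisfies >= $required"
    else
        echo "driver '$installed' is older than $required or could not be read"
    fi
fi
```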

nvidia-fs

https://github.com/NVIDIA/gds-nvidia-fs

[INFO]: previous version of nvidia-fs is not installed, nvidia-fs version: 2.14.14 will be installed.
[INFO]: getting mofed Status
[INFO]: installation status shows that mofed is not installed,please install mofed before continuing nvidia_fs install.
[ERROR]: Install of nvidia-fs failed, quitting

Download MLNX_OFED: https://network.nvidia.com/products/infiniband-drivers/linux/mlnx_ofed/

nvcc

$ /usr/local/cuda-11.7/bin/nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib64"
export CUDA_HOME="/usr/local/cuda"
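
If these exports live in a shell profile that is sourced more than once, the same directory gets appended repeatedly, and an initially empty LD_LIBRARY_PATH ends up with a leading colon. A possible variant that avoids both is sketched here; `path_append` is an illustrative helper, and adding the bin directory to PATH follows the installer summary above.

```shell
# Append a directory to a colon-separated list unless it is already present.
# Prints the resulting list; an empty list yields just the directory.
path_append() {
    case ":$1:" in
        *":$2:"*) printf '%s\n' "$1" ;;
        *)        printf '%s\n' "${1:+$1:}$2" ;;
    esac
}

CUDA_HOME="/usr/local/cuda"
PATH=$(path_append "$PATH" "$CUDA_HOME/bin")
LD_LIBRARY_PATH=$(path_append "${LD_LIBRARY_PATH:-}" "$CUDA_HOME/lib64")
export CUDA_HOME PATH LD_LIBRARY_PATH
```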

Check failed: s.ok() could not find cudnnCreate in cudnn DSO xxxxx undefined symbol: cudnnCreate

CUDNN

Download: https://developer.nvidia.com/cudnn

tar -zxvf cudnn-8.0-linux-*.tgz
sudo cp cuda/include/cudnn.h /usr/local/cuda/include/
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64/
sudo chmod a+r /usr/local/cuda/include/cudnn.h
sudo chmod a+r /usr/local/cuda/lib64/libcudnn*

a+r grants read permission to everyone: the owner, the group, and others.
o+x grants execute permission to others only.
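
The difference between the two modes is easy to see on a scratch file. This sketch assumes GNU `stat` (for the `-c` option) and touches only a temporary file.

```shell
# Demonstrate a+r versus o+x on a scratch file (GNU stat assumed for -c).
tmp=$(mktemp)
chmod 600 "$tmp"                      # start from rw-------
chmod a+r "$tmp"                      # add read for owner, group, and others
after_ar=$(stat -c '%A' "$tmp")
chmod o+x "$tmp"                      # add execute for others only
after_ox=$(stat -c '%A' "$tmp")
echo "after a+r: $after_ar"           # -rw-r--r--
echo "after o+x: $after_ox"           # -rw-r--r-x
rm -f "$tmp"
```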

Removing a kernel

dpkg --get-selections |grep linux-image

linux-image-5.10.0-21-amd64 install
linux-image-6.1.0-0.deb11.6-amd64 install
linux-image-amd64 install

dpkg -l 'linux-image-*' | grep '^ii'

ii linux-image-5.10.0-21-amd64 5.10.162-1 amd64 Linux 5.10 for 64-bit PCs (signed)
ii linux-image-6.1.0-0.deb11.6-amd64 6.1.15-1~bpo11+1 amd64 Linux 6.1 for 64-bit PCs (signed)
ii linux-image-amd64 6.1.15-1~bpo11+1 amd64 Linux for 64-bit PCs (meta-package)

sudo apt-get purge  linux-image-5.10.0-21-amd64

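Purging configuration files for linux-image-5.10.0-21-amd64 (5.10.162-1) ...
rmdir: failed to remove '/lib/modules/5.10.0-21-amd64': Directory not empty

The leftover directory /lib/modules/5.10.0-21-amd64/ is not removed automatically because it is not empty.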

docker

https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

❯gpg --dearmor -o-   /home/cs/oss/k8s-1.26/nvidia/gpgkey  >/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
nvidia-container-toolkit.list
/etc/apt/sources.list.d/nvidia-container-toolkit.list
deb https://nvidia.github.io/libnvidia-container/stable/deb/$(ARCH) /
#deb https://nvidia.github.io/libnvidia-container/experimental/deb/$(ARCH) /  
❯ sudo apt-get install -y nvidia-container-toolkit

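Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
libnvidia-container-tools nvidia-container-toolkit-base
The following NEW packages will be installed:
libnvidia-container-tools nvidia-container-toolkit nvidia-container-toolkit-base
0 upgraded, 3 newly installed, 0 to remove and 197 not upgraded.
Need to get 0 B/4,662 kB of archives.
After this operation, 23.4 MB of additional disk space will be used.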

nvidia-docker.list
/etc/apt/sources.list.d/nvidia-docker.list
deb https://nvidia.github.io/libnvidia-container/stable/debian10/$(ARCH) /
#deb https://nvidia.github.io/libnvidia-container/experimental/debian10/$(ARCH) /
deb https://nvidia.github.io/nvidia-container-runtime/stable/debian10/$(ARCH) /
#deb https://nvidia.github.io/nvidia-container-runtime/experimental/debian10/$(ARCH) /
deb https://nvidia.github.io/nvidia-docker/debian10/$(ARCH) /  
❯ sudo apt-get install -y nvidia-docker2

curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey

https://nvidia.github.io/nvidia-docker/debian11/nvidia-docker.list does not yet support Debian 11 or 12 (as of 2024-06).

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
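
After `nvidia-ctk runtime configure`, the Docker daemon configuration should mention an nvidia runtime. A quick text-level check might look like this; it assumes the default config path /etc/docker/daemon.json, and `has_nvidia_runtime` is a made-up helper that greps rather than parses JSON.

```shell
# Return 0 if the given Docker daemon config mentions an "nvidia" runtime.
# This is a plain text match, not a JSON parse.
has_nvidia_runtime() {
    [ -r "$1" ] && grep -q '"nvidia"' "$1"
}

config="/etc/docker/daemon.json"   # assumed default location
if has_nvidia_runtime "$config"; then
    echo "nvidia runtime is configured in $config"
else
    echo "no nvidia runtime found in $config"
fi
```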
❯ docker run -it --gpus=all --name tomcat  tomcat:9.0.90-jre8  bash
root@29b552d5f72b:/usr/local/tomcat# nvidia-smi
Tue Jul 23 01:03:53 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| N/A   52C    P8     6W /  88W |    403MiB /  6144MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
root@29b552d5f72b:/usr/local/tomcat#
