GPU CUDA Installation

GPU

https://developer.nvidia.com/cuda-gpus

Versions

Download the driver: https://www.nvidia.cn/geforce/drivers/

$  uname -srm
Linux 5.10.0-21-amd64 x86_64

# dpkg-query -s linux-headers-$(uname -r)

The driver version should be 51x; the CUDA 11.7 installer output quoted later in this post requires at least 515.

$ lspci | grep -i vga
01:00.0 VGA compatible controller: NVIDIA Corporation GP106M [GeForce GTX 1060 Mobile] (rev a1)

If the NVIDIA entry ends in rev a1, the discrete GPU is enabled.

If it ends in rev ff, the discrete GPU is switched off.
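The revision check can be scripted. The following is a minimal sketch, assuming the `(rev a1)` / `(rev ff)` suffix format shown in the `lspci` output; the function name `gpu_state` is made up for this example.

```shell
# Classify one `lspci` line by its revision suffix.
# Assumption: "rev ff" means the discrete GPU is switched off, as described
# in this post; any other revision is treated as "on".
gpu_state() {
    case "$1" in
        *"(rev ff)"*) echo "off" ;;
        *"(rev "*)    echo "on" ;;
        *)            echo "unknown" ;;
    esac
}

# Usage: label every NVIDIA VGA entry on this machine.
# Nothing is printed when lspci is missing or no NVIDIA card is found.
if command -v lspci >/dev/null 2>&1; then
    lspci | grep -i 'vga.*nvidia' | while IFS= read -r line; do
        printf '%s -> %s\n' "$line" "$(gpu_state "$line")"
    done
fi
```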

Supported GPUs: https://developer.nvidia.com/cuda-gpus

X server

# Switch to text mode (runlevel 3)
/sbin/init 3

# Switch back to the graphical interface (runlevel 5)
/sbin/init 5

Using: nvidia-installer ncurses v6 user interface
-> Detected 8 CPUs online; setting concurrency level to 8.
-> The file ‘/tmp/.X0-lock’ exists and appears to contain the process ID ‘773’ of a running X server.
ERROR: You appear to be running an X server; please exit X before installing. For further details, please see the section INSTALLING THE NVIDIA DRIVER in the README available on the Linux driver download page at www.nvidia.com.
ERROR: Installation has failed. Please see the file ‘/var/log/nvidia-installer.log’ for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
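Before launching the installer it may help to check the lock file named in that error. This is a rough sketch, assuming a Linux `/proc` filesystem and the usual lock file layout (a PID padded with spaces); `x_server_running` is a hypothetical helper, not part of any NVIDIA tool.

```shell
# Report whether an X lock file points at a live process.
# $1 is the lock file path; /tmp/.X0-lock is the one named in the error.
x_server_running() {
    lock="$1"
    [ -f "$lock" ] || return 1
    # The lock file holds the PID padded with spaces, so strip whitespace.
    pid=$(tr -d ' \n' < "$lock")
    # /proc/<pid> exists for any live process, regardless of who owns it.
    [ -n "$pid" ] && [ -d "/proc/$pid" ]
}

if x_server_running /tmp/.X0-lock; then
    echo "X is still running; switch to text mode before starting the installer"
else
    echo "no live X server found for display :0"
fi
```

A stale lock file left by a crashed X server is reported as "not running" here, because the PID inside it no longer exists.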

driver

cat <<EOF | sudo tee /usr/lib/modprobe.d/dist-blacklist.conf
blacklist nouveau
options nouveau modeset=0
EOF


sudo update-initramfs -u

# Reboot, then confirm nouveau is no longer loaded (expect no output)
lsmod | grep nouveau

ERROR: The Nouveau kernel driver is currently in use by your system. This driver is incompatible with the NVIDIA driver, and must be disabled before proceeding. Please consult the NVIDIA driver README and your Linux distribution’s documentation for details on how to correctly disable the Nouveau kernel driver.
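To avoid hitting that error halfway through an install, the nouveau check can be turned into a yes/no test. A small sketch, assuming standard `lsmod` output with the module name in the first column; `nouveau_loaded` is a name invented here.

```shell
# Succeeds when the module list on stdin contains nouveau.
# Only the first column is compared, so a module that merely lists
# nouveau under "Used by" does not count.
nouveau_loaded() {
    awk '$1 == "nouveau" { found = 1 } END { exit !found }'
}

# Usage on a real machine (after update-initramfs and a reboot):
if command -v lsmod >/dev/null 2>&1; then
    if lsmod | nouveau_loaded; then
        echo "nouveau is still loaded; the NVIDIA installer will refuse to run"
    else
        echo "nouveau is not loaded"
    fi
fi
```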

kernel-source

The installer reports what is missing and prompts you to install it.

# Install the missing packages reported in /var/log/nvidia-installer.log
sudo apt install dkms

If the installer reports a --kernel-source-path problem, install linux-headers.
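The headers package has to match the running kernel. A minimal sketch, assuming the Debian naming `linux-headers-$(uname -r)` that the `dpkg-query` command earlier in this post relies on; `headers_pkg` is a hypothetical helper.

```shell
# Build the Debian headers package name for a kernel release string.
headers_pkg() {
    printf 'linux-headers-%s\n' "$1"
}

pkg=$(headers_pkg "$(uname -r)")
# dpkg-query -s exits non-zero when the package is not installed.
if command -v dpkg-query >/dev/null 2>&1 && dpkg-query -s "$pkg" >/dev/null 2>&1; then
    echo "$pkg is installed"
else
    echo "$pkg is missing; install it with: sudo apt install $pkg"
fi
```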

# Run this after stopping the X server (graphical interface)
sh NVIDIA-Linux-x86_64-*.run

Driver version

cat /proc/driver/nvidia/version

GPU device numbers

ls -l /dev/nvidia*

Multiple cards

2

sudo nvidia-smi

Temperature

sudo nvidia-smi -q -d TEMPERATURE

Refresh every 10 seconds

watch -n 10 nvidia-smi
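For unattended monitoring, the temperature can be read in a machine-friendly form instead of watching the full table. This sketch uses the `--query-gpu` and `--format` options of `nvidia-smi`, which do not appear elsewhere in this post; the 80 degree limit is an arbitrary example value, and `hot_gpus` is a name made up for the example.

```shell
# Print a warning for every GPU at or above a temperature limit.
# stdin: lines of "index, temperature" as produced by
#   nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader,nounits
# $1: limit in degrees Celsius (assumption: pick a value suited to your card).
hot_gpus() {
    awk -F', *' -v limit="$1" \
        '$2 + 0 >= limit + 0 { printf "GPU %s: %s C\n", $1, $2 }'
}

# Usage: prints nothing when every GPU is below the limit.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader,nounits | hot_gpus 80
fi
```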

Uninstall

sh NVIDIA*.run --uninstall

A newer installer can remove an older driver; for example, the 525 installer can uninstall 515.

CUDA

https://developer.nvidia.com/cuda-toolkit-archive

sudo sh cuda_*.run

# accept/decline/quit: accept
# Install NVIDIA Accelerated Graphics Driver: no (leave it unselected)
# nvidia-fs: the GPUDirect Storage kernel module; select it only if you need GPUDirect Storage

===========
= Summary =
===========

Driver: Not Selected
Toolkit: Installed in /usr/local/cuda-11.7/

Please make sure that

  • PATH includes /usr/local/cuda-11.7/bin
  • LD_LIBRARY_PATH includes /usr/local/cuda-11.7/lib64, or, add /usr/local/cuda-11.7/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-11.7/bin
***WARNING: Incomplete installation! This installation did not install the CUDA Driver. A driver of version at least 515.00 is required for CUDA 11.7 functionality to work.
To install the driver using this installer, run the following command, replacing <CudaInstaller> with the name of this run file:
sudo <CudaInstaller>.run --silent --driver

Logfile is /var/log/cuda-installer.log
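The warning in that summary boils down to a version comparison. A minimal sketch, assuming the driver version is read with `nvidia-smi --query-gpu=driver_version` (an option not used elsewhere in this post) and that comparing the major number is enough; `driver_at_least` is a hypothetical helper.

```shell
# Succeeds when a driver version string (e.g. "515.65.01") has a major
# number greater than or equal to $2. Empty or non-numeric input fails.
driver_at_least() {
    major=${1%%.*}
    [ "$major" -ge "$2" ] 2>/dev/null
}

# 515 is the minimum quoted in the installer warning for CUDA 11.7.
if command -v nvidia-smi >/dev/null 2>&1; then
    ver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
    if driver_at_least "$ver" 515; then
        echo "driver $ver is new enough for CUDA 11.7"
    else
        echo "driver $ver is too old (or could not be read)"
    fi
fi
```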

nvidia-fs

https://github.com/NVIDIA/gds-nvidia-fs

[INFO]: previous version of nvidia-fs is not installed, nvidia-fs version: 2.14.14 will be installed.
[INFO]: getting mofed Status
[INFO]: installation status shows that mofed is not installed,please install mofed before continuing nvidia_fs install.
[ERROR]: Install of nvidia-fs failed, quitting

Download MLNX_OFED: https://network.nvidia.com/products/infiniband-drivers/linux/mlnx_ofed/

nvcc

$ /usr/local/cuda-11.7/bin/nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0

Environment variables:
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib64"
export CUDA_HOME="/usr/local/cuda"
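The installer summary also asks for the toolkit's bin directory on PATH. Here is one possible way to set all three variables without adding duplicates each time the shell profile is sourced; `prepend_once` is a helper invented for this sketch, and `/usr/local/cuda` is assumed to point at the installed toolkit.

```shell
# Prepend a directory to a colon-separated list unless it is already there.
# $1: current list (may be empty), $2: directory to add.
prepend_once() {
    case ":$1:" in
        *":$2:"*) printf '%s\n' "$1" ;;      # already present: unchanged
        "::")     printf '%s\n' "$2" ;;      # empty list: no stray colon
        *)        printf '%s\n' "$2:$1" ;;
    esac
}

CUDA_HOME="/usr/local/cuda"
PATH=$(prepend_once "$PATH" "$CUDA_HOME/bin")
LD_LIBRARY_PATH=$(prepend_once "${LD_LIBRARY_PATH:-}" "$CUDA_HOME/lib64")
export CUDA_HOME PATH LD_LIBRARY_PATH
```

Handling the empty case separately avoids a leading colon in LD_LIBRARY_PATH, which the dynamic loader would treat as the current directory.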

Check failed: s.ok() could not find cudnnCreate in cudnn DSO xxxxx undefined symbol: cudnnCreate

CUDNN

Download: https://developer.nvidia.com/cudnn

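tar -zxvf cudnn-8.0-linux-*.tgz
# cudnn*.h also matches cudnn.h; cuDNN 8 and later ship several headers
sudo cp cuda/include/cudnn*.h /usr/local/cuda/include/
# -P keeps the libcudnn symlinks as symlinks
sudo cp -P cuda/lib64/libcudnn* /usr/local/cuda/lib64/
sudo chmod a+r /usr/local/cuda/include/cudnn*.h
sudo chmod a+r /usr/local/cuda/lib64/libcudnn*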

a+r adds read permission for everyone: owner, group, and others.
o+x adds execute permission for others only.
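The effect of the two modes is easy to see on a scratch file. A small demonstration, assuming GNU coreutils `stat` (the `-c '%a'` form); it does not touch the real cuDNN files.

```shell
# Show what a+r and o+x change, using a temporary file.
# stat -c is the GNU coreutils form; BSD/macOS stat uses different flags.
f=$(mktemp)
chmod 600 "$f"
before=$(stat -c '%a' "$f")

chmod a+r "$f"    # read for owner, group, and others
after_read=$(stat -c '%a' "$f")

chmod o+x "$f"    # execute for others only
after_exec=$(stat -c '%a' "$f")

echo "$before -> $after_read -> $after_exec"
rm -f "$f"
```

This prints `600 -> 644 -> 645`: a+r sets the read bit in all three positions, and o+x changes only the last digit.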

Removing a kernel

dpkg --get-selections | grep linux-image

linux-image-5.10.0-21-amd64 install
linux-image-6.1.0-0.deb11.6-amd64 install
linux-image-amd64 install

dpkg -l 'linux-image-*' | grep '^ii'

ii linux-image-5.10.0-21-amd64 5.10.162-1 amd64 Linux 5.10 for 64-bit PCs (signed)
ii linux-image-6.1.0-0.deb11.6-amd64 6.1.15-1~bpo11+1 amd64 Linux 6.1 for 64-bit PCs (signed)
ii linux-image-amd64 6.1.15-1~bpo11+1 amd64 Linux for 64-bit PCs (meta-package)

sudo apt-get purge linux-image-5.10.0-21-amd64

Purging configuration files for linux-image-5.10.0-21-amd64 (5.10.162-1) ...
rmdir: failed to remove '/lib/modules/5.10.0-21-amd64': Directory not empty

Leftover directory: /lib/modules/5.10.0-21-amd64/
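When several kernels are installed, it helps to list the ones that are safe candidates for purging, meaning everything except the running kernel and the meta-package. A sketch under the assumption that `dpkg -l` prints the status in column one and the package name in column two, as in the output shown in this section; `removable_kernels` is a made-up name.

```shell
# List installed kernel image packages other than the running kernel's.
# stdin: `dpkg -l 'linux-image-*'` output; $1: running release from `uname -r`.
# Meta-packages such as linux-image-amd64 are skipped because their names
# do not continue with a digit after the "linux-image-" prefix.
removable_kernels() {
    awk -v running="linux-image-$1" \
        '$1 == "ii" && $2 ~ /^linux-image-[0-9]/ && $2 != running { print $2 }'
}

# Usage: review the list by hand before passing anything to apt-get purge.
if command -v dpkg >/dev/null 2>&1; then
    dpkg -l 'linux-image-*' 2>/dev/null | removable_kernels "$(uname -r)"
fi
```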
