GPU
https://developer.nvidia.com/cuda-gpus
Version
Download: https://www.nvidia.cn/geforce/drivers/
$ uname -srm
The Driver Version should be 51x.
$ lspci | grep -i vga
If the NVIDIA card's line ends in rev a1, the card is enabled.
If it ends in rev ff, the discrete GPU is switched off.
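The revision check can be scripted. The following is a minimal sketch that assumes the `(rev a1)` / `(rev ff)` suffixes described above; `classify_nvidia_rev` is a hypothetical helper name, not part of any NVIDIA tool. It matches on "nvidia" rather than "vga" because some discrete cards are listed by lspci as "3D controller".

```shell
# Hypothetical helper: reads `lspci` output on stdin and labels each
# NVIDIA line by its trailing revision field.
classify_nvidia_rev() {
  grep -i 'nvidia' | while IFS= read -r line; do
    case "$line" in
      *"(rev a1)") printf 'on: %s\n' "$line" ;;
      *"(rev ff)") printf 'off: %s\n' "$line" ;;
      *)           printf 'unknown: %s\n' "$line" ;;
    esac
  done
}
```

Usage: `lspci | classify_nvidia_rev`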
Supported GPU list: https://developer.nvidia.com/cuda-gpus
X server
# Switch to the text console
Using: nvidia-installer ncurses v6 user interface
-> Detected 8 CPUs online; setting concurrency level to 8.
-> The file ‘/tmp/.X0-lock’ exists and appears to contain the process ID ‘773’ of a running X server.
ERROR: You appear to be running an X server; please exit X before installing. For further details, please see the section INSTALLING THE NVIDIA DRIVER in the README available on the Linux driver download page at www.nvidia.com.
ERROR: Installation has failed. Please see the file ‘/var/log/nvidia-installer.log’ for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
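The installer error above is triggered by the X lock file. Here is a minimal sketch for checking that condition before launching the installer, assuming Linux (where every live process has a `/proc/<pid>` directory); `x_server_running` is a hypothetical helper name.

```shell
# Hypothetical helper: succeeds if the given X lock file names a live process.
# The lock file holds the PID of the X server, padded with spaces.
x_server_running() {
  lock=${1:-/tmp/.X0-lock}
  [ -r "$lock" ] || return 1
  pid=$(tr -d ' \n' < "$lock")
  [ -n "$pid" ] && [ -d "/proc/$pid" ]
}
```

Usage: `x_server_running && echo "X is still running; exit X before installing"`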
driver
cat <<EOF | sudo tee /usr/lib/modprobe.d/dist-blacklist.conf
ERROR: The Nouveau kernel driver is currently in use by your system. This driver is incompatible with the NVIDIA driver, and must be disabled before proceeding. Please consult the NVIDIA driver README and your Linux distribution’s documentation for details on how to correctly disable the Nouveau kernel driver.
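The heredoc body for the blacklist file is not shown above. The two lines below are a commonly used way to disable nouveau and are an assumption, not necessarily the author's exact file; `nouveau_blacklist_conf` and `nouveau_loaded` are hypothetical helper names.

```shell
# Print a modprobe.d fragment that disables nouveau (assumed contents).
nouveau_blacklist_conf() {
  cat <<'EOF'
blacklist nouveau
options nouveau modeset=0
EOF
}

# Succeeds if a module list in `lsmod` format (read from stdin) contains nouveau.
nouveau_loaded() {
  awk '$1 == "nouveau" { found = 1 } END { exit !found }'
}
```

Usage: `nouveau_blacklist_conf | sudo tee /usr/lib/modprobe.d/dist-blacklist.conf`, then rebuild the initramfs (on Debian, `sudo update-initramfs -u`) and reboot. Afterwards `lsmod | nouveau_loaded` should fail.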
kernel-source
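The installer will prompt for the kernel source to be installed.
# Install the missing packages according to the errors in /var/log/nvidia-installer.log
For the --kernel-source-path error, install linux-headers.
# Run after shutting down the X server (graphical session)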
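On Debian the headers package is named after the kernel release, so the package to install can be derived from `uname -r`. This is a small sketch; `headers_pkg` is a hypothetical helper name.

```shell
# Hypothetical helper: name of the Debian headers package for a kernel release.
# Defaults to the running kernel when no release is given.
headers_pkg() {
  printf 'linux-headers-%s\n' "${1:-$(uname -r)}"
}
```

Usage: `sudo apt-get install "$(headers_pkg)"`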
Driver version
cat /proc/driver/nvidia/version
GPU device numbers
ls -l /dev/nvidia*
Multiple GPUs
sudo nvidia-smi
Temperature
sudo nvidia-smi -q -d TEMPERATURE
Refresh every 10 s
watch -n 10 nvidia-smi
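For unattended monitoring, the temperature can also be read in CSV form and filtered. The sketch below assumes input lines of the form `index, temperature`; `hot_gpus` is a hypothetical helper name and the threshold of 80 in the usage line is an arbitrary example value.

```shell
# Hypothetical helper: reads "index, temperature" CSV lines on stdin and
# prints the index of every GPU at or above the threshold (degrees C).
hot_gpus() {
  awk -F', *' -v limit="$1" '$2 + 0 >= limit + 0 { print $1 }'
}
```

Usage: `nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader,nounits | hot_gpus 80`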
Uninstall
sh NVIDIA*.run --uninstall
A newer installer can uninstall an older driver, e.g. the 525 installer can uninstall 515.
CUDA
https://developer.nvidia.com/cuda-toolkit-archive
sudo sh cuda_*.run
===========
= Summary =
===========
Driver: Not Selected
Toolkit: Installed in /usr/local/cuda-11.7/
Please make sure that
- PATH includes /usr/local/cuda-11.7/bin
- LD_LIBRARY_PATH includes /usr/local/cuda-11.7/lib64, or, add /usr/local/cuda-11.7/lib64 to /etc/ld.so.conf and run ldconfig as root
To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-11.7/bin
***WARNING: Incomplete installation! This installation did not install the CUDA Driver. A driver of version at least 515.00 is required for CUDA 11.7 functionality to work.
To install the driver using this installer, run the following command, replacing <CudaInstaller> with the name of this run file:
sudo <CudaInstaller>.run --silent --driver
Logfile is /var/log/cuda-installer.log
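The installer's PATH and LD_LIBRARY_PATH advice can be applied without creating duplicate or empty entries. This is a minimal sketch; `append_once` is a hypothetical helper name, and the two export lines would typically go in a shell profile.

```shell
# Hypothetical helper: print a colon-separated list with one directory
# appended, unless that directory is already present.
append_once() {
  case ":$1:" in
    *":$2:"*) printf '%s\n' "$1" ;;      # already listed
    "::")     printf '%s\n' "$2" ;;      # list was empty
    *)        printf '%s\n' "$1:$2" ;;
  esac
}

export PATH="$(append_once "$PATH" /usr/local/cuda-11.7/bin)"
export LD_LIBRARY_PATH="$(append_once "${LD_LIBRARY_PATH:-}" /usr/local/cuda-11.7/lib64)"
```

Handling the empty case matters because an empty entry in LD_LIBRARY_PATH (a leading or trailing colon) is treated as the current directory.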
nvidia-fs
https://github.com/NVIDIA/gds-nvidia-fs
[INFO]: previous version of nvidia-fs is not installed, nvidia-fs version: 2.14.14 will be installed.
[INFO]: getting mofed Status
[INFO]: installation status shows that mofed is not installed,please install mofed before continuing nvidia_fs install.
[ERROR]: Install of nvidia-fs failed, quitting
Download MLNX_OFED: https://network.nvidia.com/products/infiniband-drivers/linux/mlnx_ofed/
nvcc
$ /usr/local/cuda-11.7/bin/nvcc -V
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}/usr/local/cuda/lib64"
Check failed: s.ok() could not find cudnnCreate in cudnn DSO xxxxx undefined symbol: cudnnCreate
CUDNN
Download: https://developer.nvidia.com/cudnn
tar -zxvf cudnn-8.0-linux-*.tgz
a+r adds read permission for everyone: the owner, the group, and others.
o+x adds execute permission for others only.
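The effect of the two modes can be seen on a scratch file. This sketch assumes GNU `stat` (as shipped on Debian) and does not touch any cuDNN files.

```shell
# Demonstrate the two chmod modes on a scratch file (GNU stat assumed).
f=$(mktemp)
chmod 600 "$f"                    # start from rw-------
chmod a+r "$f"                    # add read for owner, group, and others
after_read=$(stat -c '%a' "$f")   # 644
chmod o+x "$f"                    # add execute for others only
after_exec=$(stat -c '%a' "$f")   # 645
printf '%s %s\n' "$after_read" "$after_exec"
rm -f "$f"
```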
Remove old kernels
dpkg --get-selections | grep linux-image
linux-image-5.10.0-21-amd64 install
linux-image-6.1.0-0.deb11.6-amd64 install
linux-image-amd64 install
dpkg -l 'linux-image-*' | grep '^ii'
ii linux-image-5.10.0-21-amd64 5.10.162-1 amd64 Linux 5.10 for 64-bit PCs (signed)
ii linux-image-6.1.0-0.deb11.6-amd64 6.1.15-1~bpo11+1 amd64 Linux 6.1 for 64-bit PCs (signed)
ii linux-image-amd64 6.1.15-1~bpo11+1 amd64 Linux for 64-bit PCs (meta-package)
sudo apt-get purge linux-image-5.10.0-21-amd64
Purging configuration files for linux-image-5.10.0-21-amd64 (5.10.162-1) ...
rmdir: failed to remove '/lib/modules/5.10.0-21-amd64': Directory not empty
/lib/modules/5.10.0-21-amd64/
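Before purging, it helps to list only the versioned kernel images that are not currently running. The sketch below works on `dpkg --get-selections` output; `removable_kernels` is a hypothetical helper name. It skips the `linux-image-amd64` meta-package because that name has no version number.

```shell
# Hypothetical helper: reads `dpkg --get-selections` output on stdin and
# prints installed, versioned kernel images other than the given release.
removable_kernels() {
  awk -v keep="linux-image-$1" \
    '$2 == "install" && $1 ~ /^linux-image-[0-9]/ && $1 != keep { print $1 }'
}
```

Usage: `dpkg --get-selections | removable_kernels "$(uname -r)"`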
docker
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
❯ gpg --dearmor -o- /home/cs/oss/k8s-1.26/nvidia/gpgkey >/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
nvidia-container-toolkit.list
/etc/apt/sources.list.d/nvidia-container-toolkit.list
deb https://nvidia.github.io/libnvidia-container/stable/deb/$(ARCH) /
#deb https://nvidia.github.io/libnvidia-container/experimental/deb/$(ARCH) /
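❯ sudo apt-get install -y nvidia-container-toolkit
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
libnvidia-container-tools nvidia-container-toolkit-base
The following NEW packages will be installed:
libnvidia-container-tools nvidia-container-toolkit nvidia-container-toolkit-base
0 upgraded, 3 newly installed, 0 to remove and 197 not upgraded.
Need to get 0 B/4,662 kB of archives.
After this operation, 23.4 MB of additional disk space will be used.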
nvidia-docker.list
/etc/apt/sources.list.d/nvidia-docker.list
deb https://nvidia.github.io/libnvidia-container/stable/debian10/$(ARCH) /
#deb https://nvidia.github.io/libnvidia-container/experimental/debian10/$(ARCH) /
deb https://nvidia.github.io/nvidia-container-runtime/stable/debian10/$(ARCH) /
#deb https://nvidia.github.io/nvidia-container-runtime/experimental/debian10/$(ARCH) /
deb https://nvidia.github.io/nvidia-docker/debian10/$(ARCH) /
❯ sudo apt-get install -y nvidia-docker2
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey
https://nvidia.github.io/nvidia-docker/debian11/nvidia-docker.list does not support Debian 11 or 12 yet (as of 2024-06).
sudo nvidia-ctk runtime configure --runtime=docker
❯ docker run -it --gpus=all --name tomcat tomcat:9.0.90-jre8 bash
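`nvidia-ctk runtime configure --runtime=docker` edits the Docker daemon configuration, so a quick check of that file can confirm the step took effect. This is a rough sketch that assumes the default path `/etc/docker/daemon.json`; `has_nvidia_runtime` is a hypothetical helper name and the check is a plain text match, not a JSON parse.

```shell
# Hypothetical helper: succeeds if the given Docker daemon config mentions
# the NVIDIA runtime binary.
has_nvidia_runtime() {
  grep -q 'nvidia-container-runtime' "${1:-/etc/docker/daemon.json}" 2>/dev/null
}
```

Usage: restart Docker after configuring (`sudo systemctl restart docker`), then run `has_nvidia_runtime && echo "nvidia runtime configured"`.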