GPU
https://developer.nvidia.com/cuda-gpus
Version
Download: https://www.nvidia.cn/geforce/drivers/
$ uname -srm
The Driver Version should be 51x.
$ lspci | grep -i vga
If the NVIDIA card's line ends with rev a1, the discrete GPU is enabled.
If it ends with rev ff, the discrete GPU is turned off.
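The rev a1 / rev ff check can be scripted. This is only a sketch: `gpu_power_state` is a hypothetical helper name, and it assumes that any revision other than ff means the card is powered on (the notes only mention a1).

```sh
# Read lspci lines on stdin and print "on" or "off" for each NVIDIA device.
# Assumption: "(rev ff)" means powered off; any other revision means on.
gpu_power_state() {
  while IFS= read -r line; do
    case "$line" in
      *[Nn][Vv][Ii][Dd][Ii][Aa]*"(rev ff)"*) echo "off" ;;
      *[Nn][Vv][Ii][Dd][Ii][Aa]*"(rev "*")"*) echo "on" ;;
    esac
  done
}

# Usage on a real machine:
#   lspci | grep -i vga | gpu_power_state
```

Lines for non-NVIDIA devices produce no output, so an empty result means no NVIDIA VGA device was listed.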
Supported GPUs: https://developer.nvidia.com/cuda-gpus
X server
# Switch to the text console
Using: nvidia-installer ncurses v6 user interface
-> Detected 8 CPUs online; setting concurrency level to 8.
-> The file ‘/tmp/.X0-lock’ exists and appears to contain the process ID ‘773’ of a running X server.
ERROR: You appear to be running an X server; please exit X before installing. For further details, please see the section INSTALLING THE NVIDIA DRIVER in the README available on the Linux driver download page at www.nvidia.com.
ERROR: Installation has failed. Please see the file ‘/var/log/nvidia-installer.log’ for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
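The installer reaches this conclusion from the lock file, and the same check can be done before launching it. A rough sketch, assuming a Linux `/proc` filesystem; `x_lock_pid` and `x_server_running` are hypothetical helper names.

```sh
# Print the PID recorded in an X lock file (the file holds a space-padded PID).
x_lock_pid() {
  tr -d ' \n' < "$1"
}

# Succeed if the lock file exists and its PID is a live process.
# /proc is used instead of kill -0 because X often runs as another user.
x_server_running() {
  [ -f "$1" ] || return 1
  pid=$(x_lock_pid "$1")
  [ -n "$pid" ] && [ -d "/proc/$pid" ]
}

# Usage:
#   x_server_running /tmp/.X0-lock && echo "X is still running; stop it first"
# The original command for leaving X is not shown in these notes. On a systemd
# machine one common way (an assumption here) is:
#   sudo systemctl isolate multi-user.target
```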
driver
cat <<EOF | sudo tee /usr/lib/modprobe.d/dist-blacklist.conf
ERROR: The Nouveau kernel driver is currently in use by your system. This driver is incompatible with the NVIDIA driver, and must be disabled before proceeding. Please consult the NVIDIA driver README and your Linux distribution’s documentation for details on how to correctly disable the Nouveau kernel driver.
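The body of the heredoc above is not preserved in these notes. A minimal sketch of what such a blacklist file usually contains, assuming the two lines commonly recommended for disabling Nouveau; `nouveau_blacklist_conf` is a hypothetical helper name.

```sh
# Print a modprobe config that keeps the nouveau module from loading.
# Assumption: these two lines are the usual ones; the original file body is unknown.
nouveau_blacklist_conf() {
  cat <<'EOF'
blacklist nouveau
options nouveau modeset=0
EOF
}

# Usage (needs root; tee overwrites the file, use `tee -a` to keep existing entries):
#   nouveau_blacklist_conf | sudo tee /usr/lib/modprobe.d/dist-blacklist.conf
#   sudo update-initramfs -u    # Debian/Ubuntu: rebuild the initramfs, then reboot
#   lsmod | grep nouveau        # no output means nouveau is no longer loaded
```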
kernel-source
The installer will prompt you to install the kernel source.
# Install the missing packages reported in /var/log/nvidia-installer.log
For a --kernel-source-path error, install linux-headers.
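The headers package has to match the running kernel. A small sketch, assuming the Debian/Ubuntu naming scheme `linux-headers-<release>`; `headers_package` is a hypothetical helper name.

```sh
# Print the headers package name for a kernel release (default: the running kernel).
# Assumption: Debian/Ubuntu naming, linux-headers-<uname -r>.
headers_package() {
  printf 'linux-headers-%s\n' "${1:-$(uname -r)}"
}

# Usage:
#   sudo apt-get install "$(headers_package)"
```

On Debian the headers typically land under /usr/src/linux-headers-<release>, which is the kind of path --kernel-source-path expects if the installer still cannot find them.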
# Run after shutting down the X server (graphical interface)
Driver version
cat /proc/driver/nvidia/version
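To check the 51x requirement from a script, the version number can be pulled out of that file. This is a sketch; it assumes the file has a line starting with `NVRM version` that contains a dotted version such as 515.65.01, and both function names are hypothetical.

```sh
# Extract the first dotted version number from the "NVRM version" line on stdin.
driver_version() {
  grep 'NVRM version' | grep -oE '[0-9]+(\.[0-9]+)+' | head -n 1
}

# Succeed if the major part of version $1 is at least $2.
driver_at_least() {
  major=${1%%.*}
  [ "$major" -ge "$2" ]
}

# Usage:
#   v=$(driver_version < /proc/driver/nvidia/version)
#   driver_at_least "$v" 515 && echo "driver $v is new enough"
```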
GPU device numbers
ls -l /dev/nvidia*
Multiple GPUs
sudo nvidia-smi
Temperature
sudo nvidia-smi -q -d TEMPERATURE
Refresh every 10 seconds
watch -n 10 nvidia-smi
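For unattended monitoring, the CSV query mode of nvidia-smi is easier to parse than the full report. A sketch: `hot_gpus` is a hypothetical helper and the 85 degree limit is an arbitrary example value, not a recommendation.

```sh
# Read "index, temperature" CSV lines on stdin and print the index of every
# GPU whose temperature is at or above the limit given as $1.
hot_gpus() {
  awk -F', *' -v limit="$1" '$2 + 0 >= limit + 0 { print $1 }'
}

# Usage:
#   nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader,nounits | hot_gpus 85
```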
Uninstall
sh NVIDIA*.run --uninstall
A newer installer can uninstall an older driver: the 525 installer can uninstall 515.
CUDA
https://developer.nvidia.com/cuda-toolkit-archive
sudo sh cuda_*.run
===========
= Summary =
===========
Driver: Not Selected
Toolkit: Installed in /usr/local/cuda-11.7/
Please make sure that
- PATH includes /usr/local/cuda-11.7/bin
- LD_LIBRARY_PATH includes /usr/local/cuda-11.7/lib64, or, add /usr/local/cuda-11.7/lib64 to /etc/ld.so.conf and run ldconfig as root
To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-11.7/bin
***WARNING: Incomplete installation! This installation did not install the CUDA Driver. A driver of version at least 515.00 is required for CUDA 11.7 functionality to work.
To install the driver using this installer, run the following command, replacing <CudaInstaller> with the name of this run file:
sudo <CudaInstaller>.run --silent --driver
Logfile is /var/log/cuda-installer.log
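The PATH and LD_LIBRARY_PATH hints in the summary can go into a shell profile. A sketch, assuming the install prefix /usr/local/cuda-11.7 from the summary; `append_once` is a hypothetical helper. It avoids duplicate entries and avoids the empty leading entry that a plain `VAR="$VAR:/dir"` creates when the variable was unset (the dynamic loader treats an empty entry as the current directory).

```sh
# Append a directory to a colon-separated list only if it is not already there.
append_once() {
  list=$1
  dir=$2
  case ":$list:" in
    *":$dir:"*) printf '%s\n' "$list" ;;
    *) printf '%s\n' "${list:+$list:}$dir" ;;
  esac
}

CUDA_HOME=/usr/local/cuda-11.7   # prefix reported in the installer summary
PATH=$(append_once "$PATH" "$CUDA_HOME/bin")
LD_LIBRARY_PATH=$(append_once "${LD_LIBRARY_PATH:-}" "$CUDA_HOME/lib64")
export PATH LD_LIBRARY_PATH
```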
nvidia-fs
https://github.com/NVIDIA/gds-nvidia-fs
[INFO]: previous version of nvidia-fs is not installed, nvidia-fs version: 2.14.14 will be installed.
[INFO]: getting mofed Status
[INFO]: installation status shows that mofed is not installed,please install mofed before continuing nvidia_fs install.
[ERROR]: Install of nvidia-fs failed, quitting
Download MLNX_OFED (MOFED): https://network.nvidia.com/products/infiniband-drivers/linux/mlnx_ofed/
nvcc
$ /usr/local/cuda-11.7/bin/nvcc -V
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib64"
Check failed: s.ok() could not find cudnnCreate in cudnn DSO xxxxx undefined symbol: cudnnCreate
CUDNN
Download: https://developer.nvidia.com/cudnn
tar -zxvf cudnn-8.0-linux-*.tgz
a+r adds read permission for everyone: owner, group, and others.
o+x adds execute permission for others only.
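The difference between the two chmod forms is easy to see on a scratch file. A sketch, assuming GNU `stat` (as on Debian); `mode_after` is a hypothetical helper that never touches the cuDNN files themselves.

```sh
# Create a temp file with mode $1, apply each following chmod change,
# print the resulting octal mode, then clean up.
mode_after() {
  f=$(mktemp)
  chmod "$1" "$f"
  shift
  for change in "$@"; do
    chmod "$change" "$f"
  done
  stat -c '%a' "$f"
  rm -f "$f"
}

mode_after 600 a+r        # prints 644: read added for owner, group, others
mode_after 600 a+r o+x    # prints 645: execute added for others only
```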
Uninstall a kernel
dpkg --get-selections | grep linux-image
linux-image-5.10.0-21-amd64 install
linux-image-6.1.0-0.deb11.6-amd64 install
linux-image-amd64 install
dpkg -l 'linux-image-*' | grep '^ii'
ii linux-image-5.10.0-21-amd64 5.10.162-1 amd64 Linux 5.10 for 64-bit PCs (signed)
ii linux-image-6.1.0-0.deb11.6-amd64 6.1.15-1~bpo11+1 amd64 Linux 6.1 for 64-bit PCs (signed)
ii linux-image-amd64 6.1.15-1~bpo11+1 amd64 Linux for 64-bit PCs (meta-package)
sudo apt-get purge linux-image-5.10.0-21-amd64
Purging configuration files for linux-image-5.10.0-21-amd64 (5.10.162-1) ...
rmdir: failed to remove '/lib/modules/5.10.0-21-amd64': Directory not empty
/lib/modules/5.10.0-21-amd64/
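Before purging, it helps to list which kernel images are safe to remove, so the running kernel is never picked by mistake. A sketch working on `dpkg --get-selections` output; `removable_kernels` is a hypothetical helper, and it deliberately skips the `linux-image-amd64` meta-package.

```sh
# Read `dpkg --get-selections` lines on stdin and print every installed
# versioned linux-image package except the one for kernel release $1.
removable_kernels() {
  awk -v keep="linux-image-$1" '
    $2 == "install" && $1 ~ /^linux-image-[0-9]/ && $1 != keep { print $1 }'
}

# Usage:
#   dpkg --get-selections | removable_kernels "$(uname -r)"
# A leftover /lib/modules/<release> directory after the purge usually holds
# files dpkg did not install (for example out-of-tree modules); this is an
# assumption about the cause, so inspect it before deleting it by hand.
```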