SkunkMyrddyn n00b
Joined: 25 Dec 2024 Posts: 7
Posted: Wed Dec 25, 2024 4:04 am Post subject: Nvidia Datacenter Driver
I'm adding an Nvidia Tesla A2 card to my server to support CUDA / TensorFlow / other AI and compute acceleration. (The card does not have video out connections.)
I am having a difficult time installing the correct driver for the system. The general "nvidia-drivers" package 1) requires X (this is a headless server), and 2) does not list this card as supported (if I'm reading the documentation correctly).
Does anyone know how to get the correct driver(s) installed so that PyTorch can recognize the card for acceleration?

tiffany n00b
Joined: 04 May 2008 Posts: 11
Posted: Wed Dec 25, 2024 9:57 am Post subject:
NVIDIA's site has a separate section for datacenter drivers. Have you seen them?
I see that they support RHEL, Debian and others.

SkunkMyrddyn n00b
Joined: 25 Dec 2024 Posts: 7
Posted: Wed Dec 25, 2024 4:38 pm Post subject:
I checked those out and wasn't sure how to convince Gentoo to handle one of the other packaging formats. So I grabbed the tarballs they provide, which include an nvidia-installer binary, but I can't get that to run either.
I found that it has a --no-x-check option that skips the check for whether X (of some kind) is installed.
However, the installer errors out saying it cannot figure out my initramfs. That makes sense, as I am not using an initramfs at all on this system, but I don't see an option to tell the installer to bypass that step.
I feel like I'm missing something basic.

Banana Moderator
Joined: 21 May 2004 Posts: 1811 Location: Germany

SkunkMyrddyn n00b
Joined: 25 Dec 2024 Posts: 7
Posted: Thu Dec 26, 2024 10:11 am Post subject:
The nvidia-cuda-toolkit doesn't install a driver, so PyTorch does not find any CUDA devices.
Setting -X as a USE flag blocks x11-drivers/nvidia-drivers from installing.
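Whether a driver is loaded at all can be checked independently of PyTorch. The following is only a minimal sketch, assuming the usual Linux locations /proc/modules and /proc/driver/nvidia/version (the latter is normally created by the NVIDIA kernel module). The function name and the proc_root parameter are made up for illustration.

```python
from pathlib import Path


def nvidia_driver_status(proc_root="/proc"):
    """Report whether the nvidia kernel module appears to be loaded.

    proc_root is a hypothetical parameter so the check can be pointed
    at a test directory; on a real system leave it as "/proc".
    """
    root = Path(proc_root)
    modules_file = root / "modules"
    version_file = root / "driver" / "nvidia" / "version"

    loaded = False
    if modules_file.is_file():
        # Each line of /proc/modules starts with the module name.
        for line in modules_file.read_text().splitlines():
            fields = line.split()
            if fields and fields[0] == "nvidia":
                loaded = True
                break

    version = None
    if version_file.is_file():
        # Keep only the first line of the version file, if there is one.
        lines = version_file.read_text().splitlines()
        version = lines[0].strip() if lines else None

    return {"module_loaded": loaded, "version_banner": version}


if __name__ == "__main__":
    print(nvidia_driver_status())
```

If module_loaded comes back False, no amount of CUDA toolkit installation will make a device appear; the kernel module has to load first.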

Hu Administrator
Joined: 06 Mar 2007 Posts: 22887
Posted: Thu Dec 26, 2024 12:02 pm Post subject:
SkunkMyrddyn wrote: | Setting -X as a USE flag blocks x11-drivers/nvidia-drivers from installing. |
Please show the output that led to this statement. I do not see that result here:
Code:
# USE=-X emerge -pv nvidia-drivers
These are the packages that would be merged, in order:
Calculating dependencies... done!
Dependency resolution took 2.59 s (backtrack: 0/20).
...
[ebuild N ] x11-drivers/nvidia-drivers-550.135:0/550::gentoo USE="modules strip tools -X -dist-kernel -kernel-open -modules-compress -modules-sign -persistenced -powerd -static-libs -wayland" ABI_X86="(64) -32" 314787 KiB

Ionen Developer
Joined: 06 Dec 2018 Posts: 2887
Posted: Thu Dec 26, 2024 5:00 pm Post subject:
For nvidia-drivers on a headless setup, you'll usually want USE="persistenced -X -static-libs -wayland -tools" on it (and enable persistenced with systemd or OpenRC; this prevents the card from getting uninitialized when there isn't a display constantly using it).
wrt USE=-tools, that's for nvidia-settings, which is a GUI application, so you likely don't want that either. It does have some command-line usage, but that is very limited without X, given that it uses X to talk to the card (I imagine NVIDIA plans to migrate its features to rely on NVML in the future).
As for USE=-static-libs, that's for libXNVCtrl.a, which requires xorg headers at build time. The library is not useful if you're not using X. If another package depends on nvidia-drivers having static-libs enabled, you may want to try USE=-video_cards_nvidia on that package; the feature won't be useful headless.
This should let you avoid just about all X/Wayland stuff, although I wouldn't overly stress about these even if unused; they're pretty small dependencies as long as you don't start pulling in the bigger GUI toolkits.

SkunkMyrddyn n00b
Joined: 25 Dec 2024 Posts: 7
Posted: Thu Dec 26, 2024 6:19 pm Post subject:
Hu wrote: | Please show the output that led to this statement. |
USE=-X emerge -pv nvidia-drivers
These are the packages that would be merged, in order:
Calculating dependencies... done!
Dependency resolution took 8.09 s (backtrack: 0/20).
[ebuild N ] x11-themes/hicolor-icon-theme-0.17::gentoo 0 KiB
[ebuild N ] x11-libs/libXv-1.0.13::gentoo USE="-doc" ABI_X86="(64) -32 (-x32)" 275 KiB
[ebuild N ] x11-libs/libXcomposite-0.4.6::gentoo USE="-doc" ABI_X86="(64) -32 (-x32)" 0 KiB
[ebuild N ] x11-libs/libXcursor-1.2.3::gentoo USE="-doc" ABI_X86="(64) -32 (-x32)" 286 KiB
[ebuild N ] x11-libs/libXdamage-1.1.6::gentoo ABI_X86="(64) -32 (-x32)" 0 KiB
[ebuild N ] dev-libs/jansson-2.14-r2:0/4::gentoo USE="-doc -static-libs" 0 KiB
[ebuild N ] dev-util/gdbus-codegen-2.82.4::gentoo PYTHON_SINGLE_TARGET="python3_12 -python3_10 -python3_11 -python3_13" 0 KiB
[ebuild N ] dev-lang/vala-0.56.17:0.56::gentoo USE="-test -valadoc" 0 KiB
[ebuild N ] virtual/linux-sources-3-r8::gentoo USE="-firmware" 0 KiB
[ebuild N ] x11-libs/gdk-pixbuf-2.42.12:2::gentoo USE="gif introspection jpeg -gtk-doc -test -tiff" ABI_X86="(64) -32 (-x32)" 0 KiB
[ebuild N ] sys-apps/dbus-1.15.8::gentoo USE="-X -debug -doc -elogind (-selinux) -static-libs -systemd -test -valgrind" ABI_X86="(64) -32 (-x32)" 0 KiB
[ebuild N ] dev-libs/fribidi-1.0.13::gentoo USE="-doc -test" ABI_X86="(64) -32 (-x32)" 0 KiB
[ebuild N ] x11-libs/libvdpau-1.5::gentoo USE="-doc -dri -test" ABI_X86="(64) -32 (-x32)" 0 KiB
[ebuild N ] media-libs/libepoxy-1.5.10-r3::gentoo USE="X -test" ABI_X86="(64) -32 (-x32)" 0 KiB
[ebuild R ] x11-libs/cairo-1.18.2-r1::gentoo USE="X* glib (-aqua) (-debug) -gtk-doc -test" ABI_X86="(64) -32 (-x32)" 0 KiB
[ebuild N ] x11-libs/pango-1.52.2::gentoo USE="introspection -X -debug -sysprof -test" ABI_X86="(64) -32 (-x32)" 0 KiB
[ebuild N ] app-accessibility/at-spi2-core-2.52.0:2::gentoo USE="introspection -X -dbus-broker -gtk-doc -systemd -test" ABI_X86="(64) -32 (-x32)" 0 KiB
[ebuild N ] dev-util/gtk-update-icon-cache-3.24.42::gentoo 0 KiB
[ebuild N ] gnome-base/librsvg-2.58.5:2::gentoo USE="introspection vala -debug -gtk-doc" ABI_X86="(64) -32 (-x32)" 6246 KiB
[ebuild N ] x11-libs/gtk+-3.24.42-r1:3::gentoo USE="X introspection (-aqua) -broadway -cloudproviders -colord -cups -examples -gtk-doc -sysprof -test -vim-syntax -wayland -xinerama" ABI_X86="(64) -32 (-x32)" 0 KiB
[ebuild N ] x11-themes/adwaita-icon-theme-legacy-46.2::gentoo 0 KiB
[ebuild N ] x11-themes/adwaita-icon-theme-46.2::gentoo USE="-branding" 0 KiB
[ebuild N ] dev-util/vulkan-headers-1.3.296.0::gentoo 0 KiB
[ebuild N ] dev-util/pahole-1.27-r1::gentoo USE="-debug -verify-sig" PYTHON_SINGLE_TARGET="python3_12 -python3_10 -python3_11 -python3_13" 0 KiB
[ebuild N ] x11-drivers/nvidia-drivers-565.77:0/565::gentoo USE="modules static-libs strip tools -X -dist-kernel -kernel-open -modules-compress -modules-sign -persistenced -powerd -wayland" ABI_X86="(64) -32" 347766 KiB
Total: 25 packages (24 new, 1 reinstall), Size of downloads: 354572 KiB
The following USE changes are necessary to proceed:
(see "package.use" in the portage(5) man page for more details)
# required by x11-drivers/nvidia-drivers-565.77::gentoo[tools]
# required by nvidia-drivers (argument)
>=x11-libs/gtk+-3.24.42-r1 X
# required by x11-libs/gtk+-3.24.42-r1::gentoo
# required by x11-themes/adwaita-icon-theme-legacy-46.2::gentoo
# required by x11-themes/adwaita-icon-theme-46.2::gentoo
>=media-libs/libepoxy-1.5.10-r3 X
# required by x11-libs/gtk+-3.24.42-r1::gentoo
# required by x11-themes/adwaita-icon-theme-legacy-46.2::gentoo
# required by x11-themes/adwaita-icon-theme-46.2::gentoo
>=x11-libs/cairo-1.18.2-r1 X
emerge: there are no ebuilds built with USE flags to satisfy "x11-libs/gtk+:3[X]".
!!! One of the following packages is required to complete your request:
- x11-libs/gtk+-3.24.41-r1::gentoo (Change USE: +X)
(dependency required by "x11-drivers/nvidia-drivers-565.77::gentoo[tools]" [ebuild])
(dependency required by "nvidia-drivers" [argument])
[Administrator edit: unchecked Disable BBCode in this post so that OP's quote tags work. -Hu]

SkunkMyrddyn n00b
Joined: 25 Dec 2024 Posts: 7
Posted: Thu Dec 26, 2024 6:21 pm Post subject:
Ionen wrote: | For nvidia-drivers on a headless setup, usually you'll want USE="persistenced -X -static-libs -wayland -tools" on it [...] |
USE="persistenced -X -static-libs -wayland -tools" emerge -pv nvidia-drivers
These are the packages that would be merged, in order:
Calculating dependencies... done!
Dependency resolution took 2.86 s (backtrack: 0/20).
[ebuild N ] acct-user/nvpd-0-r2::gentoo 0 KiB
[ebuild N ] dev-util/pahole-1.27-r1::gentoo USE="-debug -verify-sig" PYTHON_SINGLE_TARGET="python3_12 -python3_10 -python3_11 -python3_13" 0 KiB
[ebuild N ] virtual/linux-sources-3-r8::gentoo USE="-firmware" 0 KiB
[ebuild N ] x11-drivers/nvidia-drivers-565.77:0/565::gentoo USE="modules persistenced strip -X -dist-kernel -kernel-open -modules-compress -modules-sign -powerd -static-libs -tools -wayland" ABI_X86="(64) -32" 347766 KiB
Total: 4 packages (4 new), Size of downloads: 347766 KiB
It looks like that set of flags allows it to build. I'm running the install now and will see whether PyTorch detects the card.
[Administrator edit: unchecked Disable BBCode in this post so that OP's quote tags work. -Hu]
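For the "will PyTorch see the card" step, a short check along these lines may help. It is a sketch only: the function name cuda_report is made up for illustration, while torch.cuda.is_available, torch.cuda.device_count and torch.cuda.get_device_name are regular PyTorch calls. It deliberately tolerates PyTorch not being installed, so it can be run before and after the driver merge.

```python
def cuda_report():
    """Summarize what PyTorch can see; tolerate PyTorch being absent."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False, "cuda_available": False, "devices": []}

    available = torch.cuda.is_available()
    devices = []
    if available:
        # Collect a human-readable name for every visible CUDA device.
        for index in range(torch.cuda.device_count()):
            devices.append(torch.cuda.get_device_name(index))

    return {
        "torch_installed": True,
        "cuda_available": available,
        "devices": devices,
    }


if __name__ == "__main__":
    print(cuda_report())
```

An empty devices list with torch_installed set to True points at the driver or kernel module rather than at PyTorch itself.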

SkunkMyrddyn n00b
Joined: 25 Dec 2024 Posts: 7
Posted: Sat Dec 28, 2024 7:53 pm Post subject:
Thanks for the assistance.
I wound up adding abi_x86_32 to the USE flags so that any 32-bit app would have access (just in case):
USE="abi_x86_32 persistenced -X -static-libs -wayland -tools" emerge -v nvidia-drivers
This installed the drivers, and it looks like my system is attempting to load them, but loading the module throws its own error. It doesn't look like a Gentoo error but a limitation of my particular system: the NVIDIA kernel driver seems to require Resizable BAR to be enabled, and I can't enable that on my system.
My understanding is that Resizable BAR is just a performance feature, not a blocker for loading the driver.
You wouldn't happen to know how to configure the driver to not require it?
I can post the module's logs once I get my system booting again. Attempting to enable Resizable BAR causes the system to fail to POST with an invalid opcode error, and I've already updated the BIOS to see if that would fix the error, which took a day to get going.
If it helps, my system specs are:
HP ProLiant DL360p G8 (P71), dual Xeon E5-2697 v2 CPUs.
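Separately from the Resizable BAR question, the kernel exposes the memory regions it actually assigned to a PCI device through sysfs, which can help when describing the problem. This is a rough sketch, assuming the standard /sys/bus/pci/devices/<address>/resource file, where each line holds start, end and flags in hex and the first six lines are the standard BAR slots. The function names are illustrative, and the PCI address has to be substituted with the card's own (for example from lspci).

```python
from pathlib import Path


def parse_pci_resources(text):
    """Parse the contents of a sysfs PCI 'resource' file.

    Each line holds three hex numbers: start, end, flags.
    Returns a list of (index, start, size_in_bytes) tuples; a region
    that was never assigned shows up with start 0 and size 0.
    """
    regions = []
    for index, line in enumerate(text.splitlines()):
        parts = line.split()
        if len(parts) != 3:
            continue
        start, end, _flags = (int(part, 16) for part in parts)
        size = 0 if start == 0 and end == 0 else end - start + 1
        regions.append((index, start, size))
    return regions


def read_pci_resources(address, sysfs_root="/sys/bus/pci/devices"):
    """Read and parse the resource file for one PCI address."""
    path = Path(sysfs_root) / address / "resource"
    return parse_pci_resources(path.read_text())
```

Calling read_pci_resources with the card's address and looking for entries with size 0 among the first six shows which regions the firmware or kernel left unassigned.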

SkunkMyrddyn n00b
Joined: 25 Dec 2024 Posts: 7
Posted: Sat Dec 28, 2024 8:17 pm Post subject:
May 23 19:18:26 starbase_one kernel: nvidia: loading out-of-tree module taints kernel.
May 23 19:18:26 starbase_one kernel: nvidia: module license 'NVIDIA' taints kernel.
May 23 19:18:26 starbase_one kernel: Disabling lock debugging due to kernel taint
May 23 19:18:26 starbase_one kernel: nvidia: module license taints kernel.
May 23 19:18:26 starbase_one kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 243
May 23 19:18:26 starbase_one kernel:
May 23 19:18:26 starbase_one kernel: nvidia 0000:07:00.0: enabling device (0040 -> 0042)
May 23 19:18:26 starbase_one kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:\x0aNVRM: BAR1 is 0M @ 0x0 (PCI:0000:07:00.0)
May 23 19:18:26 starbase_one kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:\x0aNVRM: BAR2 is 0M @ 0x0 (PCI:0000:07:00.0)
May 23 19:18:26 starbase_one kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:\x0aNVRM: BAR5 is 0M @ 0x0 (PCI:0000:07:00.0)
May 23 19:18:26 starbase_one kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module 565.77 Wed Nov 27 23:33:08 UTC 2024
May 23 19:18:26 starbase_one kernel: nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 565.77 Wed Nov 27 22:53:48 UTC 2024
May 23 19:18:26 starbase_one kernel: [drm] [nvidia-drm] [GPU ID 0x00000700] Loading driver
May 23 19:18:26 starbase_one kernel: [drm] Initialized nvidia-drm 0.0.0 for 0000:07:00.0 on minor 2
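The NVRM lines in the log above can be pulled out mechanically, which is handy when comparing boots after changing firmware settings. This is a small sketch that assumes the message format shown in this log ("BAR<n> is <size>M @ 0x<addr> (PCI:<address>)"); the pattern and the function name find_bar_reports are made up here and are not part of any NVIDIA tool.

```python
import re

# Matches the "BARn is xM @ 0x... (PCI:...)" part of the NVRM messages.
BAR_PATTERN = re.compile(
    r"NVRM: BAR(?P<bar>\d+) is (?P<size>\d+)M @ 0x(?P<addr>[0-9a-fA-F]+) "
    r"\(PCI:(?P<pci>[0-9a-fA-F:.]+)\)"
)


def find_bar_reports(log_text):
    """Return one dict per BAR message found in the given log text."""
    reports = []
    for match in BAR_PATTERN.finditer(log_text):
        reports.append(
            {
                "bar": int(match.group("bar")),
                "size_mib": int(match.group("size")),
                "address": int(match.group("addr"), 16),
                "pci": match.group("pci"),
            }
        )
    return reports
```

In the log from this post, every reported region has size 0 at address 0x0, meaning the driver sees BAR1, BAR2 and BAR5 as never having been given an address range.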

Ionen Developer
Joined: 06 Dec 2018 Posts: 2887