Gentoo Forums
Nvidia Datacenter Driver
SkunkMyrddyn
n00b
Joined: 25 Dec 2024
Posts: 8

Posted: Wed Dec 25, 2024 4:04 am    Post subject: Nvidia Datacenter Driver

I'm adding an Nvidia Tesla A2 card to my server for CUDA, TensorFlow, and other AI/compute acceleration. (The card has no video outputs.)

I am having a difficult time installing the correct driver for the system. The general "nvidia-drivers" package 1) requires X (this is a headless server), and 2) does not list this card as supported (if I'm reading the documentation correctly).

Does anyone know how to get the correct driver(s) installed so that pytorch can recognize the nvidia compute nodes for acceleration?

tiffany
n00b
Joined: 04 May 2008
Posts: 11

Posted: Wed Dec 25, 2024 9:57 am

NVidia's site has a separate section for datacenter drivers. Have you seen them?

I see that they support RHEL, Debian and others.

SkunkMyrddyn
n00b
Joined: 25 Dec 2024
Posts: 8

Posted: Wed Dec 25, 2024 4:38 pm

I checked those out and wasn't sure how to convince Gentoo to handle one of the other packaging formats. So I grabbed the tarballs they offer, which contain an nvidia-installer binary, but I can't get that to run either.

I found that it has a --no-x-check option that skips checking whether X (of some kind) is installed.
However, the installer errors out saying it cannot figure out my initramfs, which makes sense, as I am not using an initramfs at all on this system. Nor do I see an option to tell the installer to bypass that check.

I feel like I'm missing something basic.

Banana
Moderator
Joined: 21 May 2004
Posts: 1848
Location: Germany

Posted: Thu Dec 26, 2024 9:12 am

I'm not an expert in this, but there is an nvidia-cuda-toolkit package available: https://packages.gentoo.org/packages/dev-util/nvidia-cuda-toolkit
Maybe this can help.

Also, what happens if you install https://packages.gentoo.org/packages/x11-drivers/nvidia-drivers with -X set as a USE flag?

Last edited by Banana on Thu Dec 26, 2024 9:39 pm; edited 1 time in total

SkunkMyrddyn
n00b
Joined: 25 Dec 2024
Posts: 8

Posted: Thu Dec 26, 2024 10:11 am

The nvidia-cuda-toolkit doesn't install a driver, so PyTorch does not find any CUDA devices.

Setting -X as a USE flag blocks x11-drivers/nvidia-drivers from installing.

Hu
Administrator
Joined: 06 Mar 2007
Posts: 23037

Posted: Thu Dec 26, 2024 12:02 pm

SkunkMyrddyn wrote:
Setting -X as a USE flag blocks x11-drivers/nvidia-drivers from installing.
Please show the output that led to this statement. I do not see that result here:
Code:
# USE=-X emerge -pv nvidia-drivers

These are the packages that would be merged, in order:

Calculating dependencies... done!
Dependency resolution took 2.59 s (backtrack: 0/20).

...
[ebuild  N     ] x11-drivers/nvidia-drivers-550.135:0/550::gentoo  USE="modules strip tools -X -dist-kernel -kernel-open -modules-compress -modules-sign -persistenced -powerd -static-libs -wayland" ABI_X86="(64) -32" 314787 KiB

Ionen
Developer
Joined: 06 Dec 2018
Posts: 2891

Posted: Thu Dec 26, 2024 5:00 pm

For nvidia-drivers on a headless setup, you'll usually want USE="persistenced -X -static-libs -wayland -tools" on it (and enable persistenced with systemd or OpenRC; this prevents the card from getting uninitialized when there isn't a display constantly using it).

wrt USE=-tools, that's for nvidia-settings, which is a GUI application, so you likely don't want that either. It does have some command-line usage, but that is very limited without X given that it uses X to talk to the card (I imagine NVIDIA plans to migrate its features to rely on NVML in the future).

As for USE=-static-libs, that's for libXNVCtrl.a, which requires xorg headers at build time. The library is not useful if you're not using X. If another package depends on nvidia-drivers having static-libs enabled, you may want to try USE=-video_cards_nvidia on that package; the feature won't be useful headless.

That should let you avoid just about all the X/wayland stuff, although I wouldn't overly stress about these even if unused; they're pretty small dependencies as long as you don't start pulling in the bigger GUI toolkits.
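To make those flags stick across updates rather than passing USE= on the command line each time, something along these lines should work. This is a sketch under assumptions: /etc/portage/package.use is a directory on the system, the file name nvidia-drivers is arbitrary, and the service is called nvidia-persistenced as installed by the package (check /etc/init.d or the systemd unit list if it is not found).

```shell
# Run as root. Record the headless flag set for the driver package.
mkdir -p /etc/portage/package.use
echo 'x11-drivers/nvidia-drivers persistenced -X -static-libs -wayland -tools' \
    > /etc/portage/package.use/nvidia-drivers

# Preview the result first, then drop -p to build for real.
emerge -pv x11-drivers/nvidia-drivers

# Start the persistence daemon now and at every boot (OpenRC shown).
rc-update add nvidia-persistenced default
rc-service nvidia-persistenced start

# systemd equivalent (assumed unit name):
# systemctl enable --now nvidia-persistenced.service
```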

SkunkMyrddyn
n00b
Joined: 25 Dec 2024
Posts: 8

Posted: Thu Dec 26, 2024 6:19 pm

Hu wrote:
SkunkMyrddyn wrote:
Setting -X as a USE flag blocks x11-drivers/nvidia-drivers from installing.
Please show the output that led to this statement. I do not see that result here:
Code:
# USE=-X emerge -pv nvidia-drivers

These are the packages that would be merged, in order:

Calculating dependencies... done!
Dependency resolution took 2.59 s (backtrack: 0/20).

...
[ebuild  N     ] x11-drivers/nvidia-drivers-550.135:0/550::gentoo  USE="modules strip tools -X -dist-kernel -kernel-open -modules-compress -modules-sign -persistenced -powerd -static-libs -wayland" ABI_X86="(64) -32" 314787 KiB


USE=-X emerge -pv nvidia-drivers

These are the packages that would be merged, in order:

Calculating dependencies... done!
Dependency resolution took 8.09 s (backtrack: 0/20).

[ebuild N ] x11-themes/hicolor-icon-theme-0.17::gentoo 0 KiB
[ebuild N ] x11-libs/libXv-1.0.13::gentoo USE="-doc" ABI_X86="(64) -32 (-x32)" 275 KiB
[ebuild N ] x11-libs/libXcomposite-0.4.6::gentoo USE="-doc" ABI_X86="(64) -32 (-x32)" 0 KiB
[ebuild N ] x11-libs/libXcursor-1.2.3::gentoo USE="-doc" ABI_X86="(64) -32 (-x32)" 286 KiB
[ebuild N ] x11-libs/libXdamage-1.1.6::gentoo ABI_X86="(64) -32 (-x32)" 0 KiB
[ebuild N ] dev-libs/jansson-2.14-r2:0/4::gentoo USE="-doc -static-libs" 0 KiB
[ebuild N ] dev-util/gdbus-codegen-2.82.4::gentoo PYTHON_SINGLE_TARGET="python3_12 -python3_10 -python3_11 -python3_13" 0 KiB
[ebuild N ] dev-lang/vala-0.56.17:0.56::gentoo USE="-test -valadoc" 0 KiB
[ebuild N ] virtual/linux-sources-3-r8::gentoo USE="-firmware" 0 KiB
[ebuild N ] x11-libs/gdk-pixbuf-2.42.12:2::gentoo USE="gif introspection jpeg -gtk-doc -test -tiff" ABI_X86="(64) -32 (-x32)" 0 KiB
[ebuild N ] sys-apps/dbus-1.15.8::gentoo USE="-X -debug -doc -elogind (-selinux) -static-libs -systemd -test -valgrind" ABI_X86="(64) -32 (-x32)" 0 KiB
[ebuild N ] dev-libs/fribidi-1.0.13::gentoo USE="-doc -test" ABI_X86="(64) -32 (-x32)" 0 KiB
[ebuild N ] x11-libs/libvdpau-1.5::gentoo USE="-doc -dri -test" ABI_X86="(64) -32 (-x32)" 0 KiB
[ebuild N ] media-libs/libepoxy-1.5.10-r3::gentoo USE="X -test" ABI_X86="(64) -32 (-x32)" 0 KiB
[ebuild R ] x11-libs/cairo-1.18.2-r1::gentoo USE="X* glib (-aqua) (-debug) -gtk-doc -test" ABI_X86="(64) -32 (-x32)" 0 KiB
[ebuild N ] x11-libs/pango-1.52.2::gentoo USE="introspection -X -debug -sysprof -test" ABI_X86="(64) -32 (-x32)" 0 KiB
[ebuild N ] app-accessibility/at-spi2-core-2.52.0:2::gentoo USE="introspection -X -dbus-broker -gtk-doc -systemd -test" ABI_X86="(64) -32 (-x32)" 0 KiB
[ebuild N ] dev-util/gtk-update-icon-cache-3.24.42::gentoo 0 KiB
[ebuild N ] gnome-base/librsvg-2.58.5:2::gentoo USE="introspection vala -debug -gtk-doc" ABI_X86="(64) -32 (-x32)" 6246 KiB
[ebuild N ] x11-libs/gtk+-3.24.42-r1:3::gentoo USE="X introspection (-aqua) -broadway -cloudproviders -colord -cups -examples -gtk-doc -sysprof -test -vim-syntax -wayland -xinerama" ABI_X86="(64) -32 (-x32)" 0 KiB
[ebuild N ] x11-themes/adwaita-icon-theme-legacy-46.2::gentoo 0 KiB
[ebuild N ] x11-themes/adwaita-icon-theme-46.2::gentoo USE="-branding" 0 KiB
[ebuild N ] dev-util/vulkan-headers-1.3.296.0::gentoo 0 KiB
[ebuild N ] dev-util/pahole-1.27-r1::gentoo USE="-debug -verify-sig" PYTHON_SINGLE_TARGET="python3_12 -python3_10 -python3_11 -python3_13" 0 KiB
[ebuild N ] x11-drivers/nvidia-drivers-565.77:0/565::gentoo USE="modules static-libs strip tools -X -dist-kernel -kernel-open -modules-compress -modules-sign -persistenced -powerd -wayland" ABI_X86="(64) -32" 347766 KiB

Total: 25 packages (24 new, 1 reinstall), Size of downloads: 354572 KiB

The following USE changes are necessary to proceed:
(see "package.use" in the portage(5) man page for more details)
# required by x11-drivers/nvidia-drivers-565.77::gentoo[tools]
# required by nvidia-drivers (argument)
>=x11-libs/gtk+-3.24.42-r1 X
# required by x11-libs/gtk+-3.24.42-r1::gentoo
# required by x11-themes/adwaita-icon-theme-legacy-46.2::gentoo
# required by x11-themes/adwaita-icon-theme-46.2::gentoo
>=media-libs/libepoxy-1.5.10-r3 X
# required by x11-libs/gtk+-3.24.42-r1::gentoo
# required by x11-themes/adwaita-icon-theme-legacy-46.2::gentoo
# required by x11-themes/adwaita-icon-theme-46.2::gentoo
>=x11-libs/cairo-1.18.2-r1 X

emerge: there are no ebuilds built with USE flags to satisfy "x11-libs/gtk+:3[X]".
!!! One of the following packages is required to complete your request:
- x11-libs/gtk+-3.24.41-r1::gentoo (Change USE: +X)
(dependency required by "x11-drivers/nvidia-drivers-565.77::gentoo[tools]" [ebuild])
(dependency required by "nvidia-drivers" [argument])

[Administrator edit: unchecked Disable BBCode in this post so that OP's quote tags work. -Hu]

SkunkMyrddyn
n00b
Joined: 25 Dec 2024
Posts: 8

Posted: Thu Dec 26, 2024 6:21 pm

Ionen wrote:
For nvidia-drivers on a headless setup, you'll usually want USE="persistenced -X -static-libs -wayland -tools" on it (and enable persistenced with systemd or OpenRC; this prevents the card from getting uninitialized when there isn't a display constantly using it).

wrt USE=-tools, that's for nvidia-settings, which is a GUI application, so you likely don't want that either. It does have some command-line usage, but that is very limited without X given that it uses X to talk to the card (I imagine NVIDIA plans to migrate its features to rely on NVML in the future).

As for USE=-static-libs, that's for libXNVCtrl.a, which requires xorg headers at build time. The library is not useful if you're not using X. If another package depends on nvidia-drivers having static-libs enabled, you may want to try USE=-video_cards_nvidia on that package; the feature won't be useful headless.

That should let you avoid just about all the X/wayland stuff, although I wouldn't overly stress about these even if unused; they're pretty small dependencies as long as you don't start pulling in the bigger GUI toolkits.


USE="persistenced -X -static-libs -wayland -tools" emerge -pv nvidia-drivers

These are the packages that would be merged, in order:

Calculating dependencies... done!
Dependency resolution took 2.86 s (backtrack: 0/20).

[ebuild N ] acct-user/nvpd-0-r2::gentoo 0 KiB
[ebuild N ] dev-util/pahole-1.27-r1::gentoo USE="-debug -verify-sig" PYTHON_SINGLE_TARGET="python3_12 -python3_10 -python3_11 -python3_13" 0 KiB
[ebuild N ] virtual/linux-sources-3-r8::gentoo USE="-firmware" 0 KiB
[ebuild N ] x11-drivers/nvidia-drivers-565.77:0/565::gentoo USE="modules persistenced strip -X -dist-kernel -kernel-open -modules-compress -modules-sign -powerd -static-libs -tools -wayland" ABI_X86="(64) -32" 347766 KiB

Total: 4 packages (4 new), Size of downloads: 347766 KiB

Looks like that set is allowing it to build. Running it and will see if pytorch will see the card.
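Once the build finishes and the kernel module loads, a quick check along these lines should show whether both the driver and PyTorch see the card. This is only a sketch: it assumes nvidia-smi from the driver package is on PATH and that the installed PyTorch was built with CUDA support.

```shell
# List the GPUs the driver itself can see.
nvidia-smi -L

# Ask PyTorch whether CUDA is usable and how many devices it found.
python -c 'import torch; print(torch.cuda.is_available(), torch.cuda.device_count())'
```

If nvidia-smi already fails, the problem is at the driver level and PyTorch is not the place to look.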

[Administrator edit: unchecked Disable BBCode in this post so that OP's quote tags work. -Hu]

SkunkMyrddyn
n00b
Joined: 25 Dec 2024
Posts: 8

Posted: Sat Dec 28, 2024 7:53 pm

Thanks for the assistance.
I wound up adding abi_x86_32 to the USE flags so that any 32-bit app would have access (just in case):

USE="abi_x86_32 persistenced -X -static-libs -wayland -tools" emerge -v nvidia-drivers

This installed the drivers, and it looks like my system is attempting to load them, but loading the module throws its own error. It doesn't look like a Gentoo error; rather, the nvidia kernel driver seems to require Resizable BAR to be enabled, and I can't enable that on my system.

My understanding is that Resizable BAR is just a performance feature, not a blocker for loading the driver.

You wouldn't happen to know how to configure the driver to not require it?

I can post the module's logs once I get my system booting again. Attempting to enable Resizable BAR causes it to fail to POST with an invalid opcode error, and I've already updated the BIOS to see if that would fix the error, which took a day to get going.

If it helps, my system specs are:
HP ProLiant DL360p G8 (P71), dual Xeon E5-2697 v2 CPUs.

SkunkMyrddyn
n00b
Joined: 25 Dec 2024
Posts: 8

Posted: Sat Dec 28, 2024 8:17 pm

May 23 19:18:26 starbase_one kernel: nvidia: loading out-of-tree module taints kernel.
May 23 19:18:26 starbase_one kernel: nvidia: module license 'NVIDIA' taints kernel.
May 23 19:18:26 starbase_one kernel: Disabling lock debugging due to kernel taint
May 23 19:18:26 starbase_one kernel: nvidia: module license taints kernel.
May 23 19:18:26 starbase_one kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 243
May 23 19:18:26 starbase_one kernel:
May 23 19:18:26 starbase_one kernel: nvidia 0000:07:00.0: enabling device (0040 -> 0042)
May 23 19:18:26 starbase_one kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:\x0aNVRM: BAR1 is 0M @ 0x0 (PCI:0000:07:00.0)
May 23 19:18:26 starbase_one kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:\x0aNVRM: BAR2 is 0M @ 0x0 (PCI:0000:07:00.0)
May 23 19:18:26 starbase_one kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:\x0aNVRM: BAR5 is 0M @ 0x0 (PCI:0000:07:00.0)
May 23 19:18:26 starbase_one kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module 565.77 Wed Nov 27 23:33:08 UTC 2024
May 23 19:18:26 starbase_one kernel: nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 565.77 Wed Nov 27 22:53:48 UTC 2024
May 23 19:18:26 starbase_one kernel: [drm] [nvidia-drm] [GPU ID 0x00000700] Loading driver
May 23 19:18:26 starbase_one kernel: [drm] Initialized nvidia-drm 0.0.0 for 0000:07:00.0 on minor 2
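The NVRM lines above report BAR1, BAR2 and BAR5 as "0M @ 0x0", which suggests that no address space was assigned to those regions. A small sketch for pulling those reports out of a kernel log follows; the helper name invalid_bars is made up for illustration, and the lspci comparison in the comments assumes pciutils is installed.

```shell
# invalid_bars: read kernel log text on stdin and print each distinct
# "BARn is <size>M @ <address>" report found in it.
invalid_bars() {
    grep -o 'BAR[0-9] is [0-9]*M @ 0x[0-9a-fA-F]*' | sort -u
}

# Usage (as root):
#   dmesg | invalid_bars
# Compare with what the PCI layer assigned to the device named in the log:
#   lspci -vv -s 07:00.0 | grep -i region
```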

Ionen
Developer
Joined: 06 Dec 2018
Posts: 2891

Posted: Sat Dec 28, 2024 10:27 pm

Not a problem I'm familiar with, but fwiw a quick search gave me:
https://forums.developer.nvidia.com/t/nvrm-this-pci-i-o-region-assigned-to-your-nvidia-device-is-invalid/229899/1

Maybe the pci=realloc option that it mentions helps? But I really have no idea; I don't know that option.

If that doesn't help, you may want to try searching yourself; NVIDIA's forums would likely have better info than here when it's not a Gentoo-specific problem.
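For anyone wanting to try pci=realloc, it is a kernel command-line option, so it goes in the bootloader configuration. The sketch below assumes GRUB with its config at /boot/grub/grub.cfg; the thread does not say which bootloader this system uses, so adapt accordingly.

```shell
# Assumption: GRUB is the bootloader. As root, edit /etc/default/grub and add
# the option inside the existing quotes, for example:
#   GRUB_CMDLINE_LINUX="pci=realloc"

# Regenerate the GRUB configuration, then reboot.
grub-mkconfig -o /boot/grub/grub.cfg

# After the reboot, confirm that the running kernel received the option.
grep -o 'pci=realloc' /proc/cmdline
```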

SkunkMyrddyn
n00b
Joined: 25 Dec 2024
Posts: 8

Posted: Thu Jan 09, 2025 6:28 pm

I've been discussing the driver load issue over on NVIDIA's forums. It's been suggested that I play with the kernel module's options, but I haven't found the best way to handle these within Gentoo's configuration scheme. Can you point me in the right direction on how to tell the nvidia kernel module to load with the option
NVreg_EnableResizableBar=0


Thanks.

pingtoo
Veteran
Joined: 10 Sep 2021
Posts: 1433
Location: Richmond Hill, Canada

Posted: Thu Jan 09, 2025 6:50 pm

SkunkMyrddyn wrote:
I've been discussing the driver load issue over on NVIDIA's forums. It's been suggested that I play with the kernel module's options, but I haven't found the best way to handle these within Gentoo's configuration scheme. Can you point me in the right direction on how to tell the nvidia kernel module to load with the option
NVreg_EnableResizableBar=0
Thanks.


You can check /sys/module/nvidia/parameters/ for a file named NVreg_EnableResizableBar. If it exists, the nvidia module accepts that parameter at load time. Then go to /etc/modprobe.d/ and create a file (for example nvidia-options.conf; the name must end in .conf) containing: options nvidia NVreg_EnableResizableBar=<whatever value that parameter needs>
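Put together, the steps could look like the sketch below. The file name nvidia-options.conf is only an example, the value 0 is the one the OP was given on NVIDIA's forums, and the last check assumes the driver exposes its settings in /proc/driver/nvidia/params once loaded. modinfo is used for the first check because it also works when the module is not currently loaded.

```shell
# Does the installed nvidia module know this parameter at all?
modinfo -p nvidia | grep -i ResizableBar

# As root: set the option for every future load of the module.
echo 'options nvidia NVreg_EnableResizableBar=0' > /etc/modprobe.d/nvidia-options.conf

# After a reboot (or unloading and reloading the module), check what the driver picked up.
grep -i ResizableBar /proc/driver/nvidia/params
```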

All times are GMT