Gentoo Forums
[solved by wonder] foldingathome nvidia gpu trouble
Vrenn
Guru

Joined: 15 Dec 2004
Posts: 327

Posted: Sun Apr 26, 2020 12:33 pm    Post subject: [solved by wonder] foldingathome nvidia gpu trouble

Dear Gentoo folders,
Times are special, so I set my eyes on foldingathome.
My system is an old ASUS ROG laptop with an Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz and a dedicated GeForce GTX 980M (the only GPU; 16GB RAM, 4GB VRAM).

Installation was smooth. I started with the standard config:
Code:
<config>
  <!-- Folding Slot Configuration -->
  <cause v='COVID_19'/>

  <!-- Slot Control -->
  <power v='FULL'/>

  <!-- User Information -->
  <passkey v='private'/>
  <team v='private'/>
  <user v='Vrenn'/>

  <!-- Folding Slots -->
  <slot id='0' type='CPU'/>
  <slot id='1' type='GPU'/>
</config>

The CPU worked fine according to the web client at http://client.foldingathome.org/: green icon, one to three hours per job.
Only the GeForce, listed as "GM204 [GeForce GTX 980M]...", stayed yellow. It got a job once in a while but never got past 0,00%; the client always pointed out that it would need to run 24h a day, and dismissed the job after a while.
The folding log:
Code:
23:18:55:       GPUs: 1
23:18:55:      GPU 0: Bus:1 Slot:0 Func:0 NVIDIA:5 GM204 [GeForce GTX 980M] 3189
23:18:55:       CUDA: Not detected: cuInit() returned 100
23:18:55:     OpenCL: Not detected: clGetDeviceIDs() returned -1
...
23:18:55:Enabled folding slot 00: READY cpu:7
23:18:55:Enabled folding slot 01: READY gpu:0:GM204 [GeForce GTX 980M] 3189
23:18:55:ERROR:No compute devices matched GPU #0 {
23:18:55:ERROR:  "vendor": 4318,
23:18:55:ERROR:  "device": 5079,
23:18:55:ERROR:  "type": 2,
23:18:55:ERROR:  "species": 5,
23:18:55:ERROR:  "description": "GM204 [GeForce GTX 980M] 3189"
23:18:55:ERROR:}.  You may need to update your graphics drivers.
23:18:55:WU00:FS01:Starting
23:18:55:ERROR:WU00:FS01:Failed to start core: OpenCL device matching slot 1 not found, make sure the OpenCL driver is installed or try setting 'opencl-index' manually
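
The two "Not detected" lines are the ones worth watching in these logs. As a minimal Python sketch (not part of the original posts), the following pulls them out of a log and attaches the symbolic names of the return codes that appear in this thread. The function name `detection_failures` and the table `KNOWN_CODES` are made up for illustration, and the table deliberately covers only the four codes seen here.

```python
import re

# Return codes that appear in this thread, with their names from the
# CUDA driver API (CUresult) and the OpenCL headers. Not a complete table.
KNOWN_CODES = {
    ("cuInit", 100): "CUDA_ERROR_NO_DEVICE",
    ("cuInit", 999): "CUDA_ERROR_UNKNOWN",
    ("clGetDeviceIDs", -1): "CL_DEVICE_NOT_FOUND",
    ("clGetPlatformIDs", -1001): "CL_PLATFORM_NOT_FOUND_KHR",
}

# Matches e.g. "23:18:55:       CUDA: Not detected: cuInit() returned 100"
NOT_DETECTED = re.compile(
    r"(CUDA|OpenCL): Not detected: (\w+)\(\) returned (-?\d+)"
)


def detection_failures(log_text):
    """Return (api, function, code, name) for each 'Not detected' line."""
    failures = []
    for line in log_text.splitlines():
        match = NOT_DETECTED.search(line)
        if match:
            api, func = match.group(1), match.group(2)
            code = int(match.group(3))
            # Codes outside the small table above are reported as "unknown".
            name = KNOWN_CODES.get((func, code), "unknown")
            failures.append((api, func, code, name))
    return failures


sample = """\
23:18:55:       CUDA: Not detected: cuInit() returned 100
23:18:55:     OpenCL: Not detected: clGetDeviceIDs() returned -1
"""
for failure in detection_failures(sample):
    print(failure)
```

An empty result would mean the client found both CUDA and OpenCL at startup.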


It is hard to find online examples for config.xml, and since the FAHControl GUI is missing on Gentoo, I tried a config from https://www.reddit.com/r/Folding/comments/fp1pjh/need_help_triple_gpus_none_being_used/

The merged config, which somehow works, is the following:
Code:
<config>
  <!-- Folding Slot Configuration -->
  <cause v='COVID_19'/>

  <!-- Slot Control -->
  <power v='FULL'/>

  <!-- User Information -->
  <passkey v='private'/>
  <team v='private'/>
  <user v='Vrenn'/>

  <!-- Folding Slots -->
  <slot id='0' type='CPU'/>
  <slot id='1' type='GPU'>
   <cuda-index v='0'/>
   <gpu-index v='1'/>
   <opencl-index v='0'/>
   <paused v='true'/>
  </slot>
</config>

On the upside, the GPU icon is now green (once it got a job), currently at 8,82% with about 16h to go. It seems to be working.
On the downside, my GPU is now listed as "GPU:1:{ "VENDOR":0, "DEVICE": 0, "TYPE": 0, "SPECIE..."
The log is still bad!
Code:
09:55:46:       GPUs: 1
09:55:46:      GPU 0: Bus:1 Slot:0 Func:0 NVIDIA:5 GM204 [GeForce GTX 980M] 3189
09:55:46:       CUDA: Not detected: cuInit() returned 100
09:55:46:     OpenCL: Not detected: clGetPlatformIDs() returned -1001
...
09:55:46:Enabled folding slot 00: READY cpu:7
09:55:46:ERROR:Exception: GPU 1 not found
09:55:46:ERROR:No compute devices matched GPU #1 {
09:55:46:ERROR:  "vendor": 0,
09:55:46:ERROR:  "device": 0,
09:55:46:ERROR:  "type": 0,
09:55:46:ERROR:  "species": 0,
09:55:46:ERROR:  "description": ""
09:55:46:ERROR:}.  You may need to update your graphics drivers.
09:55:46:WARNING:WU00:Slot ID 18446744073709551615 no longer exists, migrating to FS00
09:55:46:ERROR:Exception: Unit not found
09:55:46:WU01:FS00:Starting
09:55:46:WU01:FS00:Running FahCore: /opt/foldingathome/FAHCoreWrapper /opt/foldingathome/cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7 -dir 01 -suffix 01 -version 706 -lifeline 37298 -checkpoint 15 -np 7
09:55:46:WU01:FS00:Started FahCore on PID 37307
09:55:46:WU01:FS00:Core PID:37311
09:55:46:WU01:FS00:FahCore 0xa7 started

but later it also tells me:
Code:
09:55:58:WU02:FS01:Connecting to 18.218.241.186:80
09:55:59:WU02:FS01:Assigned to work server 155.247.164.213
09:55:59:WU02:FS01:Requesting new work unit for slot 01: READY gpu:1:{
09:55:59:WU02:FS01:  "vendor": 0,
09:55:59:WU02:FS01:  "device": 0,
09:55:59:WU02:FS01:  "type": 0,
09:55:59:WU02:FS01:  "species": 0,
09:55:59:WU02:FS01:  "description": ""
09:55:59:WU02:FS01:} from 155.247.164.213
09:55:59:WU02:FS01:Connecting to 155.247.164.213:8080
09:55:59:WU02:FS01:Downloading 2.82MiB
09:56:00:WU00:FS01:Upload complete
09:56:00:WU00:FS01:Server responded WORK_QUIT (404)
09:56:00:WARNING:WU00:FS01:Server did not like results, dumping

Changing gpu-index to 0, or the cuda/opencl indexes to -1, makes the GPU stick at yellow again.
nvidia-drivers is at 440.82-r1, foldingathome at 7.6.9.
Any hint as to what I am doing wrong, or is there a full manual for config.xml?
Is the GPU info in the web control even real? It seems too slow.
Can I do the right thing with a wrong config?

cuda device query (/opt/cuda/extras/demo_suite/deviceQuery)
Code:
CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 980M"
  CUDA Driver Version / Runtime Version          10.2 / 10.2
  CUDA Capability Major/Minor version number:    5.2
  Total amount of global memory:                 4035 MBytes (4231331840 bytes)
  (12) Multiprocessors, (128) CUDA Cores/MP:     1536 CUDA Cores
  GPU Max Clock rate:                            1126 MHz (1.13 GHz)
  Memory Clock rate:                             2505 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 2097152 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            No
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.2, CUDA Runtime Version = 10.2, NumDevs = 1, Device0 = GeForce GTX 980M
Result = PASS

_________________
With nice greetings
Vrenn


Last edited by Vrenn on Wed May 06, 2020 5:39 pm; edited 1 time in total
Vrenn
Guru

Joined: 15 Dec 2004
Posts: 327

Posted: Fri May 01, 2020 7:32 pm

I believe I learned something.
Current config.
Code:
<!-- Folding Slots -->
  <slot id='0' type='CPU'/>
  <slot id='1' type='GPU'>
   <cuda-index v='0'/>
   <gpu-index v='0'/>
   <opencl-index v='0'/>
  </slot>
Now my GPU is named correctly as gpu:0:GM204 [GeForce GTX 980M] 3189.
It's still slow, but it has now got a job.
I drew two conclusions:
First, providing a wrong config (<gpu-index v='1'/>) hides the GPU name, which gets my 5-year-old hardware more jobs (slower ones?).
Second, <opencl-index v='0'/> solves the opencl-index error.
I also added the foldingathome user to the video group, as suggested at https://bugs.gentoo.org/715646
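
To make the gpu-index point concrete, here is a small Python sketch (not from the original posts) that reads the GPU slots out of a config.xml and flags any gpu-index that points past the GPUs the client detected. It assumes, based on the "GPU 0" and "GPU 1 not found" lines in the logs above, that gpu-index counts from zero into the detected GPU list. The helper names `gpu_slots` and `out_of_range` are hypothetical.

```python
import xml.etree.ElementTree as ET


def gpu_slots(config_xml):
    """Return one dict per GPU slot: its id plus any *-index options."""
    root = ET.fromstring(config_xml)
    slots = []
    for slot in root.findall("slot"):
        if slot.get("type") != "GPU":
            continue
        info = {"id": slot.get("id")}
        for option in slot:
            # cuda-index, gpu-index and opencl-index all carry their value in 'v'
            if option.tag.endswith("-index"):
                info[option.tag] = int(option.get("v"))
        slots.append(info)
    return slots


def out_of_range(slots, detected_gpus):
    """Ids of slots whose gpu-index does not point at a detected GPU."""
    return [
        s["id"]
        for s in slots
        if "gpu-index" in s and not 0 <= s["gpu-index"] < detected_gpus
    ]


# The GPU slot from the earlier "merged" config, with one GPU detected.
config = """
<config>
  <slot id='0' type='CPU'/>
  <slot id='1' type='GPU'>
    <cuda-index v='0'/>
    <gpu-index v='1'/>
    <opencl-index v='0'/>
  </slot>
</config>
"""
slots = gpu_slots(config)
print(slots)
print(out_of_range(slots, detected_gpus=1))
```

With a single detected GPU, slot 1 is flagged, which matches the "GPU 1 not found" exception in the earlier log.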

Anyway, the following log lines still remain:
Code:
14:06:30:       CUDA: Not detected: cuInit() returned 999
14:06:30:     OpenCL: Not detected: clGetDeviceIDs() returned -1
Still a lot to learn.
_________________
With nice greetings
Vrenn
axl
Veteran

Joined: 11 Oct 2002
Posts: 1146
Location: Romania

Posted: Fri May 01, 2020 7:58 pm

long story short, just remove opencl from mesa: in /etc/portage/package.use/mesa put media-libs/mesa -opencl.

you should be fine after that.
Vrenn
Guru

Joined: 15 Dec 2004
Posts: 327

Posted: Sat May 02, 2020 2:17 pm

I believe I tested that first, but I gave it another try today anyway.
Same log errors again:
14:04:46: CUDA: Not detected: cuInit() returned 999
14:04:46: OpenCL: Not detected: clGetPlatformIDs() returned -1001
Since mesa is not used by nvidia-drivers, I thought a discovery function in foldingathome might cause the errors. Switching eselect opencl between nvidia and ocl-icd shows no direct effect.
I'm grateful for any hint.
On the other hand, my GeForce got a 50 000 job yesterday, even with the CUDA/OpenCL "not detected" errors in Log.txt...
How should I interpret this?
_________________
With nice greetings
Vrenn
axl
Veteran

Joined: 11 Oct 2002
Posts: 1146
Location: Romania

Posted: Sat May 02, 2020 8:30 pm

A simple hint is to install clinfo and make sure you get only one opencl implementation. Not sure eselect is still doing anything. I think I've read some news recently about it becoming obsolete by a single icd loader. Not sure if that only applies to my ~unstable, or it got into stable as well. But clinfo is the way to find out.
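
Checking clinfo by eye works, but the same check can be scripted. The Python sketch below (not from the original posts) maps each OpenCL platform name to its device count by reading the plain-text clinfo report. It assumes the "Platform Name" and "Number of devices" line layout of the clinfo report from this thread, abridged in the sample; other clinfo versions may print things differently. The function name `devices_per_platform` is made up for illustration.

```python
import re


def devices_per_platform(clinfo_text):
    """Map each OpenCL platform name to its device count, from clinfo text."""
    counts = {}
    current = None
    for line in clinfo_text.splitlines():
        name = re.match(r"\s*Platform Name\s{2,}(.+?)\s*$", line)
        if name:
            # Remember the most recent platform header.
            current = name.group(1)
            continue
        devices = re.match(r"\s*Number of devices\s+(\d+)\s*$", line)
        if devices and current is not None:
            counts[current] = int(devices.group(1))
    return counts


# Abridged sample in the layout clinfo prints on the system in this thread.
sample = """\
Number of platforms                               2
  Platform Name                                   NVIDIA CUDA
  Platform Vendor                                 NVIDIA Corporation
  Platform Name                                   Clover
  Platform Vendor                                 Mesa

  Platform Name                                   NVIDIA CUDA
Number of devices                                 1
  Device Name                                     GeForce GTX 980M

  Platform Name                                   Clover
Number of devices                                 0
"""
counts = devices_per_platform(sample)
print(counts)
print([name for name, n in counts.items() if n > 0])
```

More than one entry in `counts` means more than one OpenCL implementation is installed, which is the situation described above.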
Vrenn
Guru

Joined: 15 Dec 2004
Posts: 327

Posted: Sat May 02, 2020 11:31 pm

Both:
Code:
Number of platforms                               2
  Platform Name                                   NVIDIA CUDA
  Platform Vendor                                 NVIDIA Corporation
  Platform Version                                OpenCL 1.2 CUDA 10.2.159
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer cl_khr_int64_base_atomics cl_khr_int64_extended_atomics
  Platform Extensions function suffix             NV

  Platform Name                                   Clover
  Platform Vendor                                 Mesa
  Platform Version                                OpenCL 1.1 Mesa 19.3.5
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_icd
  Platform Extensions function suffix             MESA

  Platform Name                                   NVIDIA CUDA
Number of devices                                 1
  Device Name                                     GeForce GTX 980M
  Device Vendor                                   NVIDIA Corporation
  Device Vendor ID                                0x10de
  Device Version                                  OpenCL 1.2 CUDA
  Driver Version                                  440.82
  Device OpenCL C Version                         OpenCL C 1.2
  Device Type                                     GPU
  Device Topology (NV)                            PCI-E, 01:00.0
  Device Profile                                  FULL_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Max compute units                               12
  Max clock frequency                             1126MHz
  Compute Capability (NV)                         5.2
  Device Partition                                (core)
    Max number of sub-devices                     1
    Supported partition types                     None
    Supported affinity domains                    (n/a)
  Max work item dimensions                        3
  Max work item sizes                             1024x1024x64
  Max work group size                             1024
  Preferred work group size multiple              32
  Warp size (NV)                                  32
  Preferred / native vector sizes                 
    char                                                 1 / 1       
    short                                                1 / 1       
    int                                                  1 / 1       
    long                                                 1 / 1       
    half                                                 0 / 0        (n/a)
    float                                                1 / 1       
    double                                               1 / 1        (cl_khr_fp64)
  Half-precision Floating-point support           (n/a)
  Single-precision Floating-point support         (core)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  Yes
  Double-precision Floating-point support         (cl_khr_fp64)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
  Address bits                                    64, Little-Endian
  Global memory size                              4231331840 (3.941GiB)
  Error Correction support                        No
  Max memory allocation                           1057832960 (1009MiB)
  Unified memory for Host and Device              No
  Integrated memory (NV)                          No
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       4096 bits (512 bytes)
  Global Memory cache type                        Read/Write
  Global Memory cache size                        589824 (576KiB)
  Global Memory cache line size                   128 bytes
  Image support                                   Yes
    Max number of samplers per kernel             32
    Max size for 1D images from buffer            268435456 pixels
    Max 1D or 2D image array size                 2048 images
    Max 2D image size                             16384x16384 pixels
    Max 3D image size                             4096x4096x4096 pixels
    Max number of read image args                 256
    Max number of write image args                16
  Local memory type                               Local
  Local memory size                               49152 (48KiB)
  Registers per block (NV)                        65536
  Max number of constant args                     9
  Max constant buffer size                        65536 (64KiB)
  Max size of kernel argument                     4352 (4.25KiB)
  Queue properties                               
    Out-of-order execution                        Yes
    Profiling                                     Yes
  Prefer user sync for interop                    No
  Profiling timer resolution                      1000ns
  Execution capabilities                         
    Run OpenCL kernels                            Yes
    Run native kernels                            No
    Kernel execution timeout (NV)                 Yes
  Concurrent copy and kernel execution (NV)       Yes
    Number of async copy engines                  2
  printf() buffer size                            1048576 (1024KiB)
  Built-in kernels                                (n/a)
  Device Extensions                               cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer cl_khr_int64_base_atomics cl_khr_int64_extended_atomics

  Platform Name                                   Clover
Number of devices                                 0

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  NVIDIA CUDA
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   Success [NV]
  clCreateContext(NULL, ...) [default]            Success [NV]
  clCreateContext(NULL, ...) [other]              P [ŠU
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  No platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  No platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  Invalid device type for platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  No platform
The clinfo output differs in just 5 lines between the two.
With the ICD loader there is additionally:
Code:
ICD loader properties
  ICD loader Name                                 OpenCL ICD Loader
  ICD loader Vendor                               OCL Icd free software
  ICD loader Version                              2.2.12
  ICD loader Profile                              OpenCL 2.2
Spooky, but the following works:
First: install clinfo
Second: execute clinfo
Third: systemctl start foldingathome

Now...
Code:
cat /opt/foldingathome/log.txt | grep CUDA
23:13:23:CUDA Device 0: Platform:0 Device:0 Bus:1 Slot:0 Compute:5.2 Driver:10.2
It doesn't matter what eselect opencl is set to, but you must execute clinfo before starting foldingathome...

I tested this more than 3 times with full restarts of the laptop, both with the ICD loader (now running) and with pure nvidia. The behavior is reproducible on my system.
GPU work units now arrive fairly fast. Might it be a race condition?
_________________
With nice greetings
Vrenn
Vrenn
Guru

Joined: 15 Dec 2004
Posts: 327

Posted: Sun May 03, 2020 12:12 am

(Replying to a deleted post?)

As I wrote before, I tried your way: I emerged mesa with -opencl, and emerge --depclean unmerged libclc.
I even tested it again just now.
Still, log.txt is only error-free when I execute clinfo first.
mesa with or without opencl, with or without libclc, eselect ocl-icd or nvidia: none of it seems to matter on my system.
The ugly workaround is now a script in /usr/local/sbin... but it works, thanks to your tip.
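
The script itself was not posted. As a rough Python sketch of what such a wrapper could look like, assuming it only needs to repeat the sequence reported earlier in the thread (run clinfo, then `systemctl start foldingathome`), the following runs the two commands in order. The function name `start_after_probe` and the injectable `run` parameter are inventions for illustration; the parameter exists only so the sequence can be dry-run without touching the system.

```python
import subprocess


def start_after_probe(run=subprocess.run):
    """Run clinfo once, then start the foldingathome service."""
    # Probe OpenCL first. The output is discarded; only the side effect
    # observed in this thread matters, so a failure here is not fatal.
    run(["clinfo"], stdout=subprocess.DEVNULL, check=False)
    # Then start the service; raise if systemctl reports an error.
    run(["systemctl", "start", "foldingathome"], check=True)


# Dry run: record the commands instead of executing them.
calls = []
start_after_probe(run=lambda cmd, **kwargs: calls.append(cmd))
print(calls)
```

Calling `start_after_probe()` with no argument would execute both commands for real, which normally requires root.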

Now I have to go to sleep...
_________________
With nice greetings
Vrenn
axl
Veteran

Joined: 11 Oct 2002
Posts: 1146
Location: Romania

Posted: Sun May 03, 2020 9:08 pm

I'm sorry. I deleted my post because I realized we're talking about a laptop, which probably has a hybrid video card. Those nvidia optimus things... and I thought I was wasting your time. I really don't know my way around those things. When it comes to pure nvidia drivers, it's usually mesa with opencl that creates issues, but on optimus... I just don't know. I don't have one of those things and I have never had any experience with them.

On the other hand I have a lot of experience with putting my foot in my mouth, just because I missed that one thing: optimus. So again, sorry to have wasted your time, and if you managed to make it work in any way... just don't fix it anymore.
Vrenn
Guru

Joined: 15 Dec 2004
Posts: 327

Posted: Mon May 04, 2020 5:48 pm

You are not that wrong...
It is a ROG with Nvidia from 2015. Back then, powerful gaming laptops used nvidia chips as the single GPU. That's my case. There is no optimus. But I can't say whether there might be optimus leftovers in the firmware.
The opencl-index has to be set, no question, no big deal.
But having to run clinfo first makes me wonder.
The "hack" works for me, but I find it ugly.
It seems foldingathome fails to detect the opencl/cuda capabilities at startup. It might get a job later on, but really late. The log.txt entries are written only at startup, so the errors remain. Does it discover the opencl/cuda capabilities later at runtime?
That would fit the second oddity: running clinfo first, which also has the job of detecting opencl/cuda, makes foldingathome detect them right at startup and write an error-free log.txt.

It isn't perfect, but it is "working for me".

In time I'll solve this puzzle too, or get an all-AMD top gaming PC :-)
Anyway, you pointed me to clinfo, which sped up GPU jobs a lot.
PS: I don't like optimus & co. Are there all-AMD laptops with dedicated GPUs?
_________________
With nice greetings
Vrenn
axl
Veteran

Joined: 11 Oct 2002
Posts: 1146
Location: Romania

Posted: Tue May 05, 2020 9:37 pm


my dad used to say that the worst enemy of "good" is "better".

if it's working... don't fix it. I mentioned that.

i don't know what's going on there, but at one time it worked. so stop fixing it.

the world such as it is, is held on by spit, hope and duct tape that holds it all together.

I've been waiting for a 3d print for 8 hours. terribly complicated model. not gonna fight against the current.

whatever works man. whatever works.
Vrenn
Guru

Joined: 15 Dec 2004
Posts: 327

Posted: Wed May 06, 2020 5:37 pm

You are right. I'll take it as a working miracle.
I chose Gentoo for several reasons; one was to learn.
But for now, on this issue, I have reached my destination.
_________________
With nice greetings
Vrenn
All times are GMT