Gentoo Forums
nvidia-driver crashes, how to debug?
Iesos
n00b


Joined: 12 Jan 2007
Posts: 56

Posted: Sat Apr 21, 2012 4:39 pm    Post subject: nvidia-driver crashes, how to debug?

Hi,

I'm here to look for some advice and help.

I have a Dell L502X (XPS 15) with an NVIDIA Optimus card, and the nvidia driver can crash when running some 3D programs.

I have been trying to debug this for some time now, and I'm _very_ sure it is not caused by the optirun/bumblebee implementation, wine, or the kernel configuration, since the same crash also occurs in Windows. What I need to determine is whether the crash is due to a hardware failure or to a "feature" in both the Linux and Windows drivers.
What I want help with is determining which it is, and how to convince Dell or nVidia that this is the case. Here follows a description of my debugging efforts.

There are a bunch of things that seem to be happening during the crash. The first thing is that the nvidia driver spits out an error:

Quote:
Apr 21 16:51:39 localhost kernel: NVRM: Xid (0000:01:00): 13, 0003 00000000 00009197 00002480 0054a001 00000000
Apr 21 16:51:39 localhost kernel: NVRM: Xid (0000:01:00): 39, CCMDs 00000004 000090b5
Apr 21 16:51:41 localhost kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Apr 21 16:51:45 localhost kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context


The Xid errors are internal debugging messages from nvidia, and nvidia does not seem to answer when asked what these numbers mean.
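Even without knowing what the numbers mean, the Xid lines can be pulled out of the log so that the codes from different crashes can be compared. A minimal sketch in Python, assuming the message format matches the two lines quoted above (the regular expression and the `parse_xid` name are only illustrative, not anything provided by nvidia):

```python
import re

# Pattern derived only from the two Xid lines quoted above; other driver
# versions may format the message differently (assumption).
XID_RE = re.compile(
    r"NVRM: Xid \((?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+), (?P<rest>.*)"
)


def parse_xid(line):
    """Return (pci_address, xid_number, payload), or None if the line has no Xid."""
    match = XID_RE.search(line)
    if match is None:
        return None
    return match.group("pci"), int(match.group("xid")), match.group("rest").strip()


sample = [
    "Apr 21 16:51:39 localhost kernel: NVRM: Xid (0000:01:00): 13, 0003 00000000 00009197 00002480 0054a001 00000000",
    "Apr 21 16:51:39 localhost kernel: NVRM: Xid (0000:01:00): 39, CCMDs 00000004 000090b5",
    "Apr 21 16:51:41 localhost kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context",
]

for line in sample:
    hit = parse_xid(line)
    if hit is not None:
        print(hit)
```

Collecting the Xid number and payload from several crashes makes it easier to see whether the same code always comes first.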

The next thing that happens differs between kernels, but that is because of a bug in the i915 driver, since this driver seems to crash:

Quote:
Apr 21 16:52:03 localhost kernel: ------------[ cut here ]------------
Apr 21 16:52:03 localhost kernel: WARNING: at drivers/gpu/drm/i915/i915_irq.c:652 ironlake_irq_handler+0x4f2/0x500()
Apr 21 16:52:03 localhost kernel: Hardware name: Dell System XPS L502X
Apr 21 16:52:03 localhost kernel: Missed a PM interrupt
Apr 21 16:52:03 localhost kernel: Modules linked in: nvidia(P) coretemp bbswitch(O) rtc cdc_ether usbnet cdc_acm snd_hda_codec_hdmi snd_hda_codec_realtek dell_wmi sparse_keymap sg snd_hda_intel dcdbas snd_hda_codec xhci_hcd ehci_hcd thermal wmi [last unloaded: nvidia]
Apr 21 16:52:03 localhost kernel: Pid: 0, comm: swapper/0 Tainted: P O 3.2.12-gentoo #4


This does not seem to be a problem, and I can get rid of this message using kernel versions >=3.3.

Then I get

Quote:
Apr 21 16:52:07 localhost kernel: ACPI Exception: AE_TIME, Returned by Handler for [EmbeddedControl] (20110623/evregion-478)
Apr 21 16:52:07 localhost kernel: ACPI Error:


and another i915 warning. Then

Quote:
Apr 21 16:52:15 localhost kernel: Clocksource tsc unstable (delta = -1955188294 ns)
Apr 21 16:52:15 localhost kernel: Switching to clocksource hpet


I googled around for the "hpet" line and found that I should switch to hpet at boot to get rid of this, but that only makes the message go away; it does not solve the nvidia crash.
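To confirm which clocksource the kernel is actually using after such a change (presumably made with the `clocksource=` kernel parameter), the sysfs entries can be read directly. A minimal sketch, assuming the standard sysfs location; the `read_clocksources` helper is a hypothetical name:

```python
from pathlib import Path

# Standard sysfs location on Linux; treated here as an assumption about the system.
SYSFS_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksources(base=SYSFS_DIR):
    """Return (current, available) clocksources, or (None, []) if unreadable."""
    base = Path(base)
    try:
        current = (base / "current_clocksource").read_text().strip()
        available = (base / "available_clocksource").read_text().split()
    except OSError:
        return None, []
    return current, available


if __name__ == "__main__":
    current, available = read_clocksources()
    print("current:", current)
    print("available:", " ".join(available))
```

This only reports the timer in use; it says nothing about why the TSC was marked unstable.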

Then there are some more i915 warnings and

Quote:
Apr 21 16:52:45 localhost kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Apr 21 16:52:49 localhost kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Apr 21 16:52:51 localhost kernel: NVRM: GPU at 0000:01:00.0 has fallen off the bus.


is the last thing I hear from the nvidia-driver. And then I have:

Quote:
Apr 21 16:53:51 localhost kernel: INFO: rcu_sched detected stall on CPU 3 (t=6000 jiffies)
Apr 21 16:53:51 localhost kernel: Pid: 23864, comm: SC2.exe Tainted: P W O 3.2.12-gentoo-jesus19 #4
Apr 21 16:53:51 localhost kernel: Call Trace:
Apr 21 16:53:51 localhost kernel: <IRQ> [<ffffffff810b0e0d>] ? __rcu_pending+0x1ed/0x400
Apr 21 16:53:51 localhost kernel: [<ffffffff810b12a7>] ? rcu_check_callbacks+0x57/0x110
Apr 21 16:53:51 localhost kernel: [<ffffffff81066fbf>] ? update_process_times+0x3f/0x80
Apr 21 16:53:51 localhost kernel: [<ffffffff81084deb>] ? tick_sched_timer+0x5b/0xb0
Apr 21 16:53:51 localhost kernel: [<ffffffff81079d90>] ? __run_hrtimer.clone.30+0x60/0x140
Apr 21 16:53:51 localhost kernel: [<ffffffff8107a660>] ? hrtimer_interrupt+0xd0/0x200
Apr 21 16:53:51 localhost kernel: [<ffffffff81024323>] ? smp_apic_timer_interrupt+0x63/0xa0
Apr 21 16:53:51 localhost kernel: [<ffffffff8161635e>] ? apic_timer_interrupt+0x6e/0x80
Apr 21 16:53:51 localhost kernel: <EOI> [<ffffffffa04ff3d0>] ? _nv009228rm+0xb5b/0xf50 [nvidia]
Apr 21 16:53:51 localhost kernel: [<ffffffffa04ff3a8>] ? _nv009228rm+0xb33/0xf50 [nvidia]
Apr 21 16:53:51 localhost kernel: [<ffffffffa0139e1e>] ? _nv002305rm+0x4a4/0x4d0 [nvidia]
Apr 21 16:53:51 localhost kernel: [<ffffffffa013a041>] ? _nv002010rm+0x1f7/0x20d [nvidia]
Apr 21 16:53:51 localhost kernel: [<ffffffffa039c522>] ? _nv005978rm+0x1415/0x1446 [nvidia]
Apr 21 16:53:51 localhost kernel: [<ffffffffa0397d4f>] ? _nv005739rm+0xf1a/0xfaa [nvidia]
Apr 21 16:53:51 localhost kernel: [<ffffffffa00adbc3>] ? _nv001104rm+0x9f2/0x1024 [nvidia]
Apr 21 16:53:51 localhost kernel: [<ffffffffa00adb62>] ? _nv001104rm+0x991/0x1024 [nvidia]
Apr 21 16:53:51 localhost kernel: [<ffffffffa00be230>] ? _nv000956rm+0xa4/0xff [nvidia]
Apr 21 16:53:51 localhost kernel: [<ffffffffa0781cb9>] ? _nv001100rm+0x579/0x731 [nvidia]
Apr 21 16:53:51 localhost kernel: [<ffffffffa078fb14>] ? rm_ioctl+0x6d/0x169 [nvidia]
Apr 21 16:53:51 localhost kernel: [<ffffffff810dad15>] ? __get_locked_pte+0x165/0x1d0
Apr 21 16:53:51 localhost kernel: [<ffffffffa07b0c12>] ? nv_kern_ioctl+0x152/0x450 [nvidia]
Apr 21 16:53:51 localhost kernel: [<ffffffffa07b0f2c>] ? nv_kern_compat_ioctl+0x1c/0x30 [nvidia]
Apr 21 16:53:51 localhost kernel: [<ffffffff81136677>] ? compat_sys_ioctl+0x87/0xf20
Apr 21 16:53:51 localhost kernel: [<ffffffff810df3c2>] ? sys_mmap_pgoff+0xf2/0x1a0
Apr 21 16:53:51 localhost kernel: [<ffffffff81617130>] ? sysenter_dispatch+0x7/0x2e


These rcu messages seem to say that SC2.exe has occupied CPU #3 and, according to the trace, it is nvidia's fault.
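One rough way to back up that reading is to count how many frames of the call trace belong to each module. A minimal sketch, assuming the frame format shown in the trace above; the `frames_by_module` helper and its regular expression are illustrative only, and frames without a module tag are counted as "kernel":

```python
import re
from collections import Counter

# Matches frames such as "[<ffffffffa078fb14>] ? rm_ioctl+0x6d/0x169 [nvidia]".
# The optional trailing "[module]" tag is absent for built-in kernel symbols.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?:\?\s+)?(?P<sym>[^\s+]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<mod>[^\]]+)\])?"
)


def frames_by_module(lines):
    """Count call-trace frames per module; untagged frames count as 'kernel'."""
    counts = Counter()
    for line in lines:
        match = FRAME_RE.search(line)
        if match is not None:
            counts[match.group("mod") or "kernel"] += 1
    return counts


trace = [
    "Apr 21 16:53:51 localhost kernel: Call Trace:",
    "Apr 21 16:53:51 localhost kernel: <IRQ> [<ffffffff810b0e0d>] ? __rcu_pending+0x1ed/0x400",
    "Apr 21 16:53:51 localhost kernel: <EOI> [<ffffffffa04ff3d0>] ? _nv009228rm+0xb5b/0xf50 [nvidia]",
    "Apr 21 16:53:51 localhost kernel: [<ffffffffa078fb14>] ? rm_ioctl+0x6d/0x169 [nvidia]",
]

for module, count in frames_by_module(trace).most_common():
    print(module, count)
```

A trace dominated by frames below the ioctl entry into the nvidia module points at where the CPU was spinning, though not at whether the root cause is the driver or the hardware.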

The full crash log can be found here: http://pastebin.com/wYcFtGCY

In windows, the same crash exists. However, the only error I can get is "The nvidia driver stopped responding and has now been reloaded".

I have so far tried wine versions from 1.2.something to 1.5.something and several versions of the nvidia-driver (in Linux and Windows); I have also upgraded the BIOS and reinstalled bumblebee.

So, more direct questions I have:
- What does the hpet message tell me? Does it matter?
- What is the "GPU has fallen off the bus"?
- What do the rcu messages mean? Can they be circumvented?
- What else can I do to get more information about this crash?
- Is the crash due to the nvidia-driver or because of a hardware problem?
eccerr0r
Watchman


Joined: 01 Jul 2004
Posts: 9891
Location: almost Mile High in the USA

Posted: Sun Apr 22, 2012 12:34 am

If it also crashes in Windows, it's very likely it's a hardware issue.

HPET is the High Precision Event Timer; it's generally something good unless your hardware doesn't work.

The Time Stamp Counter (TSC) is a timer counter internal to the CPU, and when it's "unstable" it means software has detected that it is not monotonically increasing and thus cannot be used to measure time.

The "falling off the bus" error was meant to be a serious but humorous error that the GPU seemed to have "disappeared" from the bus, and will no longer communicate. It could be due to it "disconnecting" for normal reasons but likely it's due to hardware issues.

Likely the stall was due to some bad software-hardware interaction. When the GPU "falls off the bus", bad software sometimes does not handle this condition properly and hangs the CPU, and the kernel detected this situation.

I'd probably first check the fan of the GPU and see if it's clean. Make sure it's not overheating.

If it's clean and not overheating I'd look into an RMA. And there's a reason why I like desktops over laptops: I can toss out the video card and get another one to test my assumptions... The closed source proprietary driver makes it that much harder to debug too. If you can use the OSS Nouveau driver, it could give more hints or even work properly because it uses a different method of accessing the hardware... but I suspect it won't work properly either.

I've found the Nvidia closed source driver very stable, and any issues with it tend to be hardware problems, unless it was force-built against a kernel it was not meant to be linked with. I had two nvidia cards (a GeForce4 MX420 and a GeForce 8400GS), and both worked perfectly until they were removed from their systems (one died due to fan failure; the other I no longer needed after getting the onboard video working).
_________________
Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching?
MRJonnyH
n00b


Joined: 15 Nov 2014
Posts: 1
Location: United Kingdom

Posted: Sat Nov 15, 2014 4:10 pm    Post subject: NVIDIA Debug? Tools?

Hi, I just registered because I'm also looking for safe config settings for XPS l502x.

I have an AOC LCD upstairs, connecting via HDMI>DVI adaptor.

This exhibits problems
(screen goes blank except for a large, pixelated square....)
Sometimes it causes the OS to freeze (hard reset required),
sometimes it handles the exception more gracefully (system notification message ambiguously reports 'nvidia display driver service stopped/restarted' {sic}).

We also have a HUGE Samsung HDMI tv. This one also has a bug, similar to above, except sometimes the OS freezes, then a hard reset is required, but there are no critical errors/warnings in event viewer.

So how to debug?

I have cpu-z (just downloaded gpu-z also).

Not really a gamer (except backgammon), just really want a bulletproof build.

Any suggestions welcome (I don't know how the above poster even got that level of detail... hence my registering and waffling....)

THANKS !
Hu
Administrator


Joined: 06 Mar 2007
Posts: 23089

Posted: Sat Nov 15, 2014 4:58 pm    Post subject: Re: NVIDIA Debug? Tools?

MRJonnyH wrote:
Sometimes it causes the OS to freeze (hard reset required),
sometimes it handles the exception more gracefully (system notification message ambiguously reports 'nvidia display driver service stopped/restarted' {sic}).

I have cpu-z (just downloaded gpu-z also).
These statements make me wonder whether you are in the right place. The first hit I found for gpu-z is a Windows-only program, so unless you are dual-booting Windows with Gentoo Linux, I do not see how you could use gpu-z and also be running the right environment for us to help you. Similarly, the "system notification message" makes me suspect you are using Windows. Although some Linux desktops offer something like that, I am not aware of any which would use that terminology and resolve a problem in that way.

This forum is dedicated to Gentoo Linux in particular, though we can handle more general Linux questions in some cases. You might find someone here who can help you with a Windows problem, but that would be by luck. I do not mean to run you off, but if you need help with a Windows-specific problem, there are other places that are more likely to give you a timely and detailed answer.

If I am wrong and you are using some form of Linux, then I apologize and suggest you start with posting the output of dmesg | tail -n200 right after a hang recovery occurs.
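If 200 lines is too much to read through, the saved output can be narrowed to the kinds of messages quoted earlier in this thread. A minimal sketch in Python, assuming the output was saved as text first; the keyword list and the `interesting_lines` name are assumptions chosen for illustration and should be extended for other failure modes:

```python
# Substrings taken from the messages quoted earlier in this thread (assumption:
# a different failure may log entirely different text).
KEYWORDS = ("NVRM", "fallen off the bus", "rcu_sched", "Clocksource", "ACPI E")


def interesting_lines(log_text, keywords=KEYWORDS):
    """Return the log lines that contain any of the given substrings."""
    return [
        line
        for line in log_text.splitlines()
        if any(keyword in line for keyword in keywords)
    ]


sample_log = """\
kernel: usb 1-1: new high-speed USB device number 4 using ehci_hcd
kernel: NVRM: GPU at 0000:01:00.0 has fallen off the bus.
kernel: Switching to clocksource hpet
kernel: INFO: rcu_sched detected stall on CPU 3 (t=6000 jiffies)
"""

for line in interesting_lines(sample_log):
    print(line)
```

The matching is case-sensitive, so "Clocksource tsc unstable" would be kept while "Switching to clocksource hpet" is not; add lowercase variants to the list if those lines matter.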