Pages

04 April 2013

375. Bisecting the kernel -- looking for the commit behind the kworker/i915 issue.

This is an attempt to trace this issue: http://verahill.blogspot.com.au/2013/03/368-slow-mouse-and-keyboard-triggered.html.

See also
http://forums.gentoo.org/viewtopic-p-7278760.html
https://bbs.archlinux.org/viewtopic.php?pid=1248190

Update: I'm revising this post. Not quite done yet.

Another Update: Someone else has just bisected and ended up with a different commit:
https://bbs.archlinux.org/viewtopic.php?pid=1254285#p1254285
The patch they found makes a lot more sense than any of the ones I ended up with.

However, I do still see issues with the kernel I end up with, so I'm either starting from the wrong 'good', or there are several bad commits.

A third update: I just experienced slow-down with 1k interrupts per second on 3.7.10, which may indicate that I picked a bad starting point for bisecting. I'll leave the post up since it can serve as a guide for how to bisect in general.

A fourth update:
A fix is on the way to kernel 3.9 or 3.10.
https://patchwork.kernel.org/patch/2400621/
https://patchwork.kernel.org/patch/2402211/


Original post:

I didn't really want to do this, for several reasons:

* the bug only manifests itself on one of my computers, with intel graphics -- and that's a laptop (I'm not a huge fan of laptops, especially not with budding carpal tunnel syndrome, so testing the bisected version is a bit of a pain -- literally)

* the bug isn't consistently triggered, so you need to test the new kernel for an hour or more, and only by the absence of a specific behaviour can you see whether all is good i.e. testing is a bit of hit and miss.

* compiling the kernel takes a long time, and I don't currently have a suitable computer for it at home where my laptop (the one with the i915) is, so I'll build the kernel on a work computer, then install it at home, then repeat each day.

On the other hand, it's a learning experience, so here we go.

I'm looking at https://www.kernel.org/pub/software/scm/git/docs/git-bisect.html

I'm compiling the kernel debian style since it generates .deb files which are easy to install on other machines. Again, there are posts on how to compile the kernel using a more generic approach, or specifically for Arch, elsewhere on this blog.

NOTE: there's no sure-fire way of triggering this issue, but it seems to occur more frequently when the fan is on but approaching the thermal cut-off point where it will turn off. This makes troubleshooting somewhat more difficult.


1. Download/checkout
First get the sources -- it pulls everything from 2.6.12, so it's bigger than a normal kernel source download. I'll skip explicitly telling you what packages you need to have in order to compile the kernel or use git, since if you don't know, you're probably not really ready for this anyway. Besides, that information is available in other posts on this blog.

mkdir -p ~/tmp/kernel_bisect
cd ~/tmp/kernel_bisect/
git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git linux-git
Cloning into 'linux-git'... remote: Counting objects: 2942099, done. remote: Compressing objects: 100% (443946/443946), done. remote: Total 2942099 (delta 2471580), reused 2940624 (delta 2470325) Receiving objects: 100% (2942099/2942099), 608.66 MiB | 4.58 MiB/s, done. Resolving deltas: 100% (2471580/2471580), done. Checking out files: 100% (42425/42425), done.
cd linux-git/


2A. Bisect
I know that 3.8.5 is bad and that 3.7.10 may be good (seem to remember having issues with the Arch version of 3.7.10, but no issues with my own build on debian. Not sure I can trust my memory).

I know for a fact that 3.2 is good. I don't want to bisect everything from 3.8.5 to 3.2 though, so I'll take a leap of faith and presume that 3.7.10 is good (which I'll regret if that's not true) and 3.8.0 is bad.

I know that the issue is present on amd64.

I think it's an issue to do with i915.

git bisect start -- arch/amd64 drivers/gpu/drm/i915
git bisect bad v3.8
git bisect good v3.7
Bisecting: 160 revisions left to test after this (roughly 7 steps) [28d491df4c6b00f9148a9885dba1f36a078535dc] drm/i915: Bad pixel formats can't reach the sprite code
2B. Compile
I'm being a bit lazy here as well, possibly at the cost of not being able to identify my issue: I'll use silentoldconfig to accept all the default settings.
make oldconfig
time fakeroot make-kpkg -j3 --initrd kernel_image kernel_headers

where 3 in -j3 is the number of cores (see here and here and here for more about the proper value for -j) -- I have a triple core AMD. It took around 43 minutes to build.

This will generate two .deb files in the parent folder -- one kernel image and one with headers.
Install these on your testing system.


sudo dpkg -i ../linux-image-3.7.0-rc2+_3.7.0-rc2+-10.00.Custom_amd64.deb ../linux-headers-3.7.0-rc2+_3.7.0-rc2+-10.00.Custom_amd64.deb

Since you installed .deb packages it's pretty easy to roll back any changes later.

3. Testing
What happens now depends on whether the kernel is plagued by the issue you're trying to bisect for:

3A. The problem persists
You've now narrowed it down to half as many commits.
git bisect bad
make-kpkg clean
make oldconfig
time fakeroot make-kpkg -j3 --initrd kernel_image kernel_headers
sudo dpkg -i ../linux-image-3.7.0-rc2+_3.7.0-rc2+-10.00.Custom_amd64.deb ../linux-headers-3.7.0-rc2+_3.7.0-rc2+-10.00.Custom_amd64.deb


3B. The problem is not present
You've now narrowed it down to half as many commits.
git bisect good
make-kpkg clean
make oldconfig
time fakeroot make-kpkg -j3 --initrd kernel_image kernel_headers
sudo dpkg -i ../linux-image-3.7.0-rc4+_3.7.0-rc4+-10.00.Custom_amd64.deb ../linux-headers-3.7.0-rc4+_3.7.0-rc4+-10.00.Custom_amd64.deb

Repeat step 3 until you've isolated the commit that caused the issue.

For me
* the first bisect didn't cause kworker slow downs (bisect good)
* the second bisect immediately led to slowdown (bisect bad)
* the third one was fine (bisect good)
* the fourth one had no issues (bisect good)
* the fifth one was good (bisect good)
* the sixth one was definitely bad (bisect bad)

powertop gives the following when it's slowing down
 0 mW    423.5 ms/s       4.5        kWork          i915_hotplug_work_func

and watch cat /proc/interrupts:
The number of interrupts increases very, very rapidly for '49:  PCI-MSI-edge i915' during mouse slowdown. Normal rate is around 70 per two seconds.
This is where it gets less clear -- from what I can see:
* the seventh one was a little bit bad (bisect bad).
* the eight one had minor issues when the fan went off (bisect bad).

But the first time I did the bisect I came to the exact opposite conclusion for the last two bisects. I'm suspecting that it's really down to two or more commits that together cause bad behavior, but on their own are merely annoying.

Anyway, I've ended up with this commit as the current culprit (again, the first time around I ended up with a different commit):
607a6f7a6621f65706ff536b2615ee65b5c2f575 is the first bad commit
commit 607a6f7a6621f65706ff536b2615ee65b5c2f575
Author: Daniel Vetter 
Date:   Wed Nov 14 17:47:39 2012 +0100

    drm/i915: drop buggy write to FDI_RX_CHICKEN register
    
    Jani Nikula noticed that the parentheses are wrong and we & the bit
    with the register address instead of the read-back value. He sent a
    patch to correct that.
    
    On second look, we write the same register in the previous line, and
    the w/a seems to be to set FDI_RX_PHASE_SYNC_POINTER_OVR to enable the
    logic, then keep always set FDI_RX_PHASE_SYNC_POINTER_OVR and toggle
    FDI_RX_PHASE_SYNC_POINTER_EN before/after enabling the pc transcoder.
    
    So the right things seems to be to simply kill the 2nd write.
    
    Cc: Jani Nikula 
    Reviewed-by: Chris Wilson 
    [danvet: Dropped a bogus ~ from the commit message that somehow crept
    in.]
    Signed-off-by: Daniel Vetter 

:040000 040000 f789c6c199c9db5c9d0d7961760574b5f0b1ede9 9e0cd2a09cab610b437164b1a74f620e43686ef1 M      drivers

It's just really difficult to reproduce the issue consistently with the last couple of kernels. I am in no way confident that the above commit is what's causing all this.

The last confirmed troublesome bisect (#6):

Here's the log:
git bisect start '--' 'arch/amd64' 'drivers/gpu/drm/i915'
# bad: [19f949f52599ba7c3f67a5897ac6be14bfcb1200] Linux 3.8
git bisect bad 19f949f52599ba7c3f67a5897ac6be14bfcb1200
# good: [29594404d7fe73cd80eaa4ee8c43dcc53970c60e] Linux 3.7
git bisect good 29594404d7fe73cd80eaa4ee8c43dcc53970c60e
# good: [28d491df4c6b00f9148a9885dba1f36a078535dc] drm/i915: Bad pixel formats can't reach the sprite code
git bisect good 28d491df4c6b00f9148a9885dba1f36a078535dc
# bad: [b4a98e57fc27854b5938fc8b08b68e5e68b91e1f] drm/i915: Flush outstanding unpin tasks before pageflipping
git bisect bad b4a98e57fc27854b5938fc8b08b68e5e68b91e1f
# good: [12f3382bc0262e981a2e58aca900cbbdbbe66825] drm/i915: implement WaDisablePSDDualDispatchEnable on IVB & VLV
git bisect good 12f3382bc0262e981a2e58aca900cbbdbbe66825
# good: [b9e0bda3cd325b55f336efb751736163f62abded] drm/i915: Always calculate 8xx WM values based on a 32-bpp framebuffer
git bisect good b9e0bda3cd325b55f336efb751736163f62abded
# good: [1c8b46fc8c865189f562c9ab163d63863759712f] drm/i915: Use LRI to update the semaphore registers
git bisect good 1c8b46fc8c865189f562c9ab163d63863759712f

and here are the remaining commits:


commit b4a98e57fc27854b5938fc8b08b68e5e68b91e1f
Author: Chris Wilson 
Date:   Thu Nov 1 09:26:26 2012 +0000

    drm/i915: Flush outstanding unpin tasks before pageflipping
    
    If we accumulate unpin tasks because we are pageflipping faster than the
    system can schedule its workers, we can effectively create a
    pin-leak. The solution taken here is to limit the number of unpin tasks
    we have per-crtc and to flush those outstanding tasks if we accumulate
    too many. This should prevent any jitter in the normal case, and also
    prevent the hang if we should run too fast.
    
    Note: It is important that we switch from the system workqueue to our
    own dev_priv->wq since all work items on that queue are guaranteed to
    only need the dev->struct_mutex and not any modeset resources. For
    otherwise if we have a work item ahead in the queue which needs the
    modeset lock (like the output detect work used by both polling or
    hpd), this work and so the unpin work will never execute since the
    pageflip code already holds that lock. Unfortunately there's no
    lockdep support for this scenario in the workqueue code.
    
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=46991
    Reported-and-tested-by: Tvrtko Ursulin 
    Signed-off-by: Chris Wilson 
    [danvet: Added note about workqueu deadlock.]
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=56337
    Signed-off-by: Daniel Vetter 

 drivers/gpu/drm/i915/intel_display.c |   22 ++++++++++++++++------
 drivers/gpu/drm/i915/intel_drv.h     |    4 +++-
 2 files changed, 19 insertions(+), 7 deletions(-)

commit a726915cef1daab57aad4c5b5e4773822f0a4bf8
Author: Daniel Vetter 
Date:   Tue Nov 20 14:50:08 2012 +0100

    drm/i915: resurrect panel lid handling
    
    But disabled by default. This essentially reverts
    
    commit bcd5023c961a44c7149936553b6929b2b233dd27
    Author: Dave Airlie 
    Date:   Mon Mar 14 14:17:55 2011 +1000
    
        drm/i915: disable opregion lid detection for now
    
    but leaves the autodetect mode disabled. There's also the explicit lid
    status option added in
    
    commit fca874092597ef946b8f07031d8c31c58b212144
    Author: Chris Wilson 
    Date:   Thu Feb 17 13:44:48 2011 +0000
    
        drm/i915: Add a module parameter to ignore lid status
    
    Which overloaded the meaning for the panel_ignore_lid parameter even
    more. To fix up this mess, give the non-negative numbers 0,1 the
    original meaning back and use negative numbers to force a given state.
    So now we have
    
    1  - disable autodetect, return unknown
    0  - enable autodetect
    -1 - force to disconnected/lid closed
    -2 - force to connected/lid open
    
    v2: My C programmer license has been revoked ...
    
    v3: Beautify the code a bit, as suggested by Chris Wilson.
    
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=27622
    Tested-by: Andreas Sturmlechner 
    Reviewed-by: Chris Wilson 
    Signed-off-by: Daniel Vetter 

 drivers/gpu/drm/i915/i915_drv.c    |    6 +++---
 drivers/gpu/drm/i915/intel_panel.c |   25 +++++++++++--------------
 2 files changed, 14 insertions(+), 17 deletions(-)

commit 8fed6193736bf22e0e44c03ee783761e9cc37238
Author: Takashi Iwai 
Date:   Mon Nov 19 18:06:51 2012 +0100

    drm/i915: Enable DP audio for Haswell
    
    This patch adds the missing code to send ELD for Haswell DisplayPort,
    based on Xingchao's original patch.
    
    A test was performed with HSW-D machine and NEC EA232Wmi DP monitor.
    
    Cc: Xingchao Wang 
    Signed-off-by: Takashi Iwai 
    Reviewed-by: Paulo Zanoni 
    Signed-off-by: Daniel Vetter 

 drivers/gpu/drm/i915/intel_ddi.c |    9 +++++++++
 1 file changed, 9 insertions(+)

commit c9839303d186d6270f570ff3c5f56c2327958086
Author: Chris Wilson 
Date:   Tue Nov 20 10:45:17 2012 +0000

    drm/i915: Pin the object whilst faulting it in
    
    In order to prevent reaping of the object whilst setting it up to
    handle the pagefault, we need to mark it as pinned. This has the nice
    side-effect of eliminating some special cases from the pagefault handler
    as well!
    
    Signed-off-by: Chris Wilson 
    Signed-off-by: Daniel Vetter 

 drivers/gpu/drm/i915/i915_gem.c |   29 +++++++++--------------------
 1 file changed, 9 insertions(+), 20 deletions(-)

commit fbdda6fb5ee5da401af42226878880069a6b8615
Author: Chris Wilson 
Date:   Tue Nov 20 10:45:16 2012 +0000

    drm/i915: Guard pages being reaped by OOM whilst binding-to-GTT
    
    In the circumstances that the shrinker is allowed to steal the mutex
    in order to reap pages, we need to be careful to prevent it operating on
    the current object and shooting ourselves in the foot.
    
    Signed-off-by: Chris Wilson 
    Signed-off-by: Daniel Vetter 

 drivers/gpu/drm/i915/i915_gem.c |    9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

commit be7cb6347e0c3aa1956748a860a2465a7ea128c4
Author: Chris Wilson 
Date:   Mon Nov 19 15:30:42 2012 +0000

    drm/i915: Remove bogus test for a present execbuffer
    
    The intention of checking obj->gtt_offset!=0 is to verify that the
    target object was listed in the execbuffer and had been bound into the
    GTT. This is guarranteed by the earlier rearrangement to split the
    execbuffer operation into reserve and relocation phases and then
    verified by the check that the target handle had been processed during
    the reservation phase.
    
    However, the actual checking of obj->gtt_offset==0 is bogus as we can
    indeed reference an object at offset 0. For instance, the framebuffer
    installed by the BIOS often resides at offset 0 - causing EINVAL as we
    legimately try to render using the stolen fb.
    
    Signed-off-by: Chris Wilson 
    Reviewed-by: Eric Anholt 
    Signed-off-by: Daniel Vetter 

 drivers/gpu/drm/i915/i915_gem_execbuffer.c |    9 ---------
 1 file changed, 9 deletions(-)

commit b92fa839015f27ba0f5c7ef9812eba9ecff538c2
Author: Chris Wilson 
Date:   Fri Nov 16 11:43:21 2012 +0000

    drm/i915: Remove save/restore of physical HWS_PGA register
    
    Now that we always restore the HWS registers (both physical and GTT
    virtual addresses) when re-initialising the rings, we can eliminate the
    superfluous save/restore of the register across suspend and resume.
    
    Signed-off-by: Chris Wilson 
    Signed-off-by: Daniel Vetter 

 drivers/gpu/drm/i915/i915_drv.h     |    1 -
 drivers/gpu/drm/i915/i915_suspend.c |    8 --------
 2 files changed, 9 deletions(-)

commit d09105c66eb813ab3f57ba5e738f477f6ff92dec
Author: Ben Widawsky 
Date:   Thu Nov 15 12:06:09 2012 -0800

    drm/i915: Fix warning in i915_gem_chipset_flush
    
    drivers/gpu/drm/i915/i915_drv.h:1545:2: warning: '______f' is static but
    declared in inline function 'i915_gem_chipset_flush' which is not static
    
    Reported-by: kbuild test robot 
    dri-devel-Reference: <50a4d41c data-blogger-escaped-.586vhmwghpukzbkb="" data-blogger-escaped-fengguang.wu="" data-blogger-escaped-intel.com="">
    Signed-off-by: Ben Widawsky 
    Signed-off-by: Daniel Vetter 

 drivers/gpu/drm/i915/i915_drv.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

commit 42d42e7e4220753bab3eb7b857721f203a4cd821
Author: Damien Lespiau 
Date:   Wed Oct 31 19:23:16 2012 +0000

    drm/i915: Only check for valid PP_{ON, OFF}_DELAYS on pre ILK hardware
    
    ILK+ have this register on the PCH. This check was triggering unclaimed
    writes.
    
    Signed-off-by: Damien Lespiau 
    Reviewed-by: Paulo Zanoni 
    Signed-off-by: Daniel Vetter 

 drivers/gpu/drm/i915/intel_bios.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

commit 607a6f7a6621f65706ff536b2615ee65b5c2f575
Author: Daniel Vetter 
Date:   Wed Nov 14 17:47:39 2012 +0100

    drm/i915: drop buggy write to FDI_RX_CHICKEN register
    
    Jani Nikula noticed that the parentheses are wrong and we & the bit
    with the register address instead of the read-back value. He sent a
    patch to correct that.
    
    On second look, we write the same register in the previous line, and
    the w/a seems to be to set FDI_RX_PHASE_SYNC_POINTER_OVR to enable the
    logic, then keep always set FDI_RX_PHASE_SYNC_POINTER_OVR and toggle
    FDI_RX_PHASE_SYNC_POINTER_EN before/after enabling the pc transcoder.
    
    So the right things seems to be to simply kill the 2nd write.
    
    Cc: Jani Nikula 
    Reviewed-by: Chris Wilson 
    [danvet: Dropped a bogus ~ from the commit message that somehow crept
    in.]
    Signed-off-by: Daniel Vetter 

 drivers/gpu/drm/i915/intel_display.c |    3 ---
 1 file changed, 3 deletions(-)

No comments:

Post a Comment