See also
http://forums.gentoo.org/viewtopic-p-7278760.html
https://bbs.archlinux.org/viewtopic.php?pid=1248190
Update: I'm revising this post. Not quite done yet.
Another Update: Someone else has just bisected and ended up with a different commit:
https://bbs.archlinux.org/viewtopic.php?pid=1254285#p1254285
The patch they found makes a lot more sense than any of the ones I ended up with.
However, I do still see issues with the kernel I end up with, so I'm either starting from the wrong 'good', or there are several bad commits.
A third update: I just experienced slow-down with 1k interrupts per second on 3.7.10, which may indicate that I picked a bad starting point for bisecting. I'll leave the post up since it can serve as a guide for how to bisect in general.
A fourth update:
A fix is on the way to kernel 3.9 or 3.10.
https://patchwork.kernel.org/patch/2400621/
https://patchwork.kernel.org/patch/2402211/
Original post:
I didn't really want to do this, for several reasons:
* the bug only manifests itself on one of my computers, with intel graphics -- and that's a laptop (I'm not a huge fan of laptops, especially not with budding carpal tunnel syndrome, so testing the bisected version is a bit of a pain -- literally)
* the bug isn't consistently triggered, so you need to test the new kernel for an hour or more, and only by the absence of a specific behaviour can you see whether all is good i.e. testing is a bit of hit and miss.
* compiling the kernel takes a long time, and I don't currently have a suitable computer for it at home where my laptop (the one with the i915) is, so I'll build the kernel on a work computer, then install it at home, then repeat each day.
On the other hand, it's a learning experience, so here we go.
I'm looking at https://www.kernel.org/pub/software/scm/git/docs/git-bisect.html
I'm compiling the kernel debian style since it generates .deb files which are easy to install on other machines. Again, there are posts on how to compile the kernel using a more generic approach, or specifically for Arch, elsewhere on this blog.
NOTE: there's no sure-fire way of triggering this issue, but it seems to occur more frequently when the fan is on but approaching the thermal cut-off point where it will turn off. This makes troubleshooting somewhat more difficult.
1. Download/checkout
First get the sources -- it pulls everything from 2.6.12, so it's bigger than a normal kernel source download. I'll skip explicitly telling you what packages you need to have in order to compile the kernel or use git, since if you don't know, you're probably not really ready for this anyway. Besides, that information is available in other posts on this blog.
mkdir -p ~/tmp/kernel_bisect
cd ~/tmp/kernel_bisect/
git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git linux-git
Cloning into 'linux-git'...
remote: Counting objects: 2942099, done.
remote: Compressing objects: 100% (443946/443946), done.
remote: Total 2942099 (delta 2471580), reused 2940624 (delta 2470325)
Receiving objects: 100% (2942099/2942099), 608.66 MiB | 4.58 MiB/s, done.
Resolving deltas: 100% (2471580/2471580), done.
Checking out files: 100% (42425/42425), done.
cd linux-git/
2A. Bisect
I know that 3.8.5 is bad and that 3.7.10 may be good (I seem to remember having issues with the Arch version of 3.7.10, but none with my own build on debian. I'm not sure I can trust my memory).
I know for a fact that 3.2 is good. I don't want to bisect everything from 3.8.5 to 3.2 though, so I'll take a leap of faith and presume that 3.7.10 is good (which I'll regret if that's not true) and 3.8.0 is bad.
I know that the issue is present on amd64.
I think it's an issue to do with i915.
git bisect start -- arch/amd64 drivers/gpu/drm/i915
git bisect bad v3.8
git bisect good v3.7
Bisecting: 160 revisions left to test after this (roughly 7 steps)
[28d491df4c6b00f9148a9885dba1f36a078535dc] drm/i915: Bad pixel formats can't reach the sprite code
2B. Compile
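Each of those steps is a full build-and-test cycle, so it helps to know roughly how many to expect. Here is a back-of-envelope sketch, not git's exact formula: every round halves the candidates, so the number of rounds is about how many times the count can be halved before one commit is left. The function name bisect_rounds is made up for this illustration.

```shell
# Rough number of bisect rounds for N candidate commits:
# count how many times N can be halved before a single commit remains.
# Back-of-envelope only; git's own estimate may differ by one.
bisect_rounds() (
    n=$1
    rounds=0
    while [ "$n" -gt 1 ]; do
        n=$((n / 2))
        rounds=$((rounds + 1))
    done
    echo "$rounds"
)

bisect_rounds 160
```

For the 160 revisions reported above this prints 7, which matches git's "roughly 7 steps".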
I'm being a bit lazy here as well, possibly at the cost of not being able to identify my issue: I'll run make oldconfig and accept the default for every new option.
make oldconfig
time fakeroot make-kpkg -j3 --initrd kernel_image kernel_headers
where 3 in -j3 is the number of cores (see here and here and here for more about the proper value for -j) -- I have a triple core AMD. It took around 43 minutes to build.
This will generate two .deb files in the parent folder -- one kernel image and one with headers.
Install these on your testing system.
sudo dpkg -i ../linux-image-3.7.0-rc2+_3.7.0-rc2+-10.00.Custom_amd64.deb ../linux-headers-3.7.0-rc2+_3.7.0-rc2+-10.00.Custom_amd64.deb
Since you installed .deb packages it's pretty easy to roll back any changes later.
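Rolling back means removing the packages by name rather than by file name. A small sketch, assuming the usual Debian name_version_arch.deb file naming: the helper deb_pkg_name (a made-up name for this illustration) recovers the package name from the path of a .deb file.

```shell
# Derive the package name from a .deb path.
# Debian package files are named name_version_arch.deb.
deb_pkg_name() (
    base=${1##*/}        # strip the directory part
    echo "${base%%_*}"   # keep everything before the first underscore
)

deb_pkg_name ../linux-image-3.7.0-rc2+_3.7.0-rc2+-10.00.Custom_amd64.deb

# To roll back (not run here), boot a different kernel first, then:
#   sudo dpkg -r linux-image-3.7.0-rc2+ linux-headers-3.7.0-rc2+
```

The call above prints linux-image-3.7.0-rc2+, which is the name to pass to dpkg -r.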
3. Testing
What happens now depends on whether the kernel is plagued by the issue you're trying to bisect for:
3A. The problem persists
You've now narrowed it down to half as many commits.
git bisect bad
make-kpkg clean
make oldconfig
time fakeroot make-kpkg -j3 --initrd kernel_image kernel_headers
sudo dpkg -i ../linux-image-3.7.0-rc2+_3.7.0-rc2+-10.00.Custom_amd64.deb ../linux-headers-3.7.0-rc2+_3.7.0-rc2+-10.00.Custom_amd64.deb
3B. The problem is not present
You've now narrowed it down to half as many commits.
git bisect good
make-kpkg clean
make oldconfig
time fakeroot make-kpkg -j3 --initrd kernel_image kernel_headers
sudo dpkg -i ../linux-image-3.7.0-rc4+_3.7.0-rc4+-10.00.Custom_amd64.deb ../linux-headers-3.7.0-rc4+_3.7.0-rc4+-10.00.Custom_amd64.deb
Repeat step 3 until you've isolated the commit that caused the issue.
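Since 3A and 3B differ only in the verdict, the repeated part can be wrapped in a function. This is a sketch only: the names run and bisect_step and the DRY_RUN switch are made up for this illustration, and the commands are the ones used above (minus time, which is a shell keyword and doesn't pass through a wrapper cleanly). The install step is left out because the .deb file name changes with every round.

```shell
# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# One bisect round: record the verdict, then rebuild whatever commit
# git checks out next. Stops at the first command that fails.
bisect_step() {
    case "$1" in
        good|bad) ;;
        *) echo "usage: bisect_step good|bad" >&2; return 1 ;;
    esac
    run git bisect "$1" &&
    run make-kpkg clean &&
    run make oldconfig &&
    run fakeroot make-kpkg -j3 --initrd kernel_image kernel_headers
}

# Example (prints the four commands without running anything):
#   DRY_RUN=1 bisect_step bad
```

Run it from the linux-git directory after testing a kernel, e.g. bisect_step good, then install the new .deb files as before.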
For me
* the first bisect didn't cause kworker slow downs (bisect good)
* the second bisect immediately led to slowdown (bisect bad)
* the third one was fine (bisect good)
* the fourth one had no issues (bisect good)
* the fifth one was good (bisect good)
* the sixth one was definitely bad (bisect bad)
powertop gives the following when it's slowing down
0 mW 423.5 ms/s 4.5 kWork i915_hotplug_work_func
and watch cat /proc/interrupts:
The number of interrupts increases very, very rapidly for '49: PCI-MSI-edge i915' during mouse slowdown. Normal rate is around 70 per two seconds.
* the seventh one was a little bit bad (bisect bad).
* the eighth one had minor issues when the fan went off (bisect bad).
But the first time I did the bisect I came to the exact opposite conclusion for the last two bisects. I'm suspecting that it's really down to two or more commits that together cause bad behavior, but on their own are merely annoying.
Anyway, I've ended up with this commit as the current culprit (again, the first time around I ended up with a different commit):
607a6f7a6621f65706ff536b2615ee65b5c2f575 is the first bad commit
commit 607a6f7a6621f65706ff536b2615ee65b5c2f575
Author: Daniel Vetter
Date:   Wed Nov 14 17:47:39 2012 +0100

    drm/i915: drop buggy write to FDI_RX_CHICKEN register

    Jani Nikula noticed that the parentheses are wrong and we & the bit with the register address instead of the read-back value. He sent a patch to correct that.

    On second look, we write the same register in the previous line, and the w/a seems to be to set FDI_RX_PHASE_SYNC_POINTER_OVR to enable the logic, then keep always set FDI_RX_PHASE_SYNC_POINTER_OVR and toggle FDI_RX_PHASE_SYNC_POINTER_EN before/after enabling the pc transcoder.

    So the right things seems to be to simply kill the 2nd write.

    Cc: Jani Nikula
    Reviewed-by: Chris Wilson
    [danvet: Dropped a bogus ~ from the commit message that somehow crept in.]
    Signed-off-by: Daniel Vetter

:040000 040000 f789c6c199c9db5c9d0d7961760574b5f0b1ede9 9e0cd2a09cab610b437164b1a74f620e43686ef1 M drivers
It's just really difficult to reproduce the issue consistently with the last couple of kernels. I am in no way confident that the above commit is what's causing all this.
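Because "a little bit bad" is hard to judge by feel, it may help to turn the symptom into a number. A rough sketch, assuming /proc/interrupts has a line ending in i915 with one counter column per CPU, as in the '49: PCI-MSI-edge i915' line above. The function names i915_irq_count and i915_irq_rate are made up for this illustration.

```shell
# Sum the per-CPU counters on the first i915 line of an interrupts table.
# The file argument defaults to /proc/interrupts; pass another path for testing.
i915_irq_count() {
    awk '/i915/ && !seen {
             for (i = 2; i <= NF; i++)
                 if ($i ~ /^[0-9]+$/) total += $i   # purely numeric columns are the per-CPU counts
             seen = 1
         }
         END { printf "%.0f\n", total }' "${1:-/proc/interrupts}"
}

# Interrupts per second, averaged over a sampling window in whole seconds
# (default 2, the same interval watch uses).
i915_irq_rate() (
    window=${1:-2}
    before=$(i915_irq_count)
    sleep "$window"
    after=$(i915_irq_count)
    echo $(( (after - before) / window ))
)
```

On my numbers, i915_irq_rate should print something around 35 when all is well (70 per two seconds) and on the order of 1000 during a slowdown.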
The last confirmed troublesome bisect (#6):
Here's the log:
git bisect start '--' 'arch/amd64' 'drivers/gpu/drm/i915'
# bad: [19f949f52599ba7c3f67a5897ac6be14bfcb1200] Linux 3.8
git bisect bad 19f949f52599ba7c3f67a5897ac6be14bfcb1200
# good: [29594404d7fe73cd80eaa4ee8c43dcc53970c60e] Linux 3.7
git bisect good 29594404d7fe73cd80eaa4ee8c43dcc53970c60e
# good: [28d491df4c6b00f9148a9885dba1f36a078535dc] drm/i915: Bad pixel formats can't reach the sprite code
git bisect good 28d491df4c6b00f9148a9885dba1f36a078535dc
# bad: [b4a98e57fc27854b5938fc8b08b68e5e68b91e1f] drm/i915: Flush outstanding unpin tasks before pageflipping
git bisect bad b4a98e57fc27854b5938fc8b08b68e5e68b91e1f
# good: [12f3382bc0262e981a2e58aca900cbbdbbe66825] drm/i915: implement WaDisablePSDDualDispatchEnable on IVB & VLV
git bisect good 12f3382bc0262e981a2e58aca900cbbdbbe66825
# good: [b9e0bda3cd325b55f336efb751736163f62abded] drm/i915: Always calculate 8xx WM values based on a 32-bpp framebuffer
git bisect good b9e0bda3cd325b55f336efb751736163f62abded
# good: [1c8b46fc8c865189f562c9ab163d63863759712f] drm/i915: Use LRI to update the semaphore registers
git bisect good 1c8b46fc8c865189f562c9ab163d63863759712f
and here are the remaining commits:
commit b4a98e57fc27854b5938fc8b08b68e5e68b91e1f
Author: Chris Wilson
Date:   Thu Nov 1 09:26:26 2012 +0000

    drm/i915: Flush outstanding unpin tasks before pageflipping

    If we accumulate unpin tasks because we are pageflipping faster than the system can schedule its workers, we can effectively create a pin-leak. The solution taken here is to limit the number of unpin tasks we have per-crtc and to flush those outstanding tasks if we accumulate too many. This should prevent any jitter in the normal case, and also prevent the hang if we should run too fast.

    Note: It is important that we switch from the system workqueue to our own dev_priv->wq since all work items on that queue are guaranteed to only need the dev->struct_mutex and not any modeset resources. For otherwise if we have a work item ahead in the queue which needs the modeset lock (like the output detect work used by both polling or hpd), this work and so the unpin work will never execute since the pageflip code already holds that lock. Unfortunately there's no lockdep support for this scenario in the workqueue code.

    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=46991
    Reported-and-tested-by: Tvrtko Ursulin
    Signed-off-by: Chris Wilson
    [danvet: Added note about workqueu deadlock.]
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=56337
    Signed-off-by: Daniel Vetter

 drivers/gpu/drm/i915/intel_display.c | 22 ++++++++++++++++------
 drivers/gpu/drm/i915/intel_drv.h | 4 +++-
 2 files changed, 19 insertions(+), 7 deletions(-)

commit a726915cef1daab57aad4c5b5e4773822f0a4bf8
Author: Daniel Vetter
Date:   Tue Nov 20 14:50:08 2012 +0100

    drm/i915: resurrect panel lid handling

    But disabled by default. This essentially reverts

    commit bcd5023c961a44c7149936553b6929b2b233dd27
    Author: Dave Airlie
    Date:   Mon Mar 14 14:17:55 2011 +1000

        drm/i915: disable opregion lid detection for now

    but leaves the autodetect mode disabled. There's also the explicit lid status option added in

    commit fca874092597ef946b8f07031d8c31c58b212144
    Author: Chris Wilson
    Date:   Thu Feb 17 13:44:48 2011 +0000

        drm/i915: Add a module parameter to ignore lid status

    Which overloaded the meaning for the panel_ignore_lid parameter even more. To fix up this mess, give the non-negative numbers 0,1 the original meaning back and use negative numbers to force a given state. So now we have

     1 - disable autodetect, return unknown
     0 - enable autodetect
    -1 - force to disconnected/lid closed
    -2 - force to connected/lid open

    v2: My C programmer license has been revoked ...

    v3: Beautify the code a bit, as suggested by Chris Wilson.

    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=27622
    Tested-by: Andreas Sturmlechner
    Reviewed-by: Chris Wilson
    Signed-off-by: Daniel Vetter

 drivers/gpu/drm/i915/i915_drv.c | 6 +++---
 drivers/gpu/drm/i915/intel_panel.c | 25 +++++++++++--------------
 2 files changed, 14 insertions(+), 17 deletions(-)

commit 8fed6193736bf22e0e44c03ee783761e9cc37238
Author: Takashi Iwai
Date:   Mon Nov 19 18:06:51 2012 +0100

    drm/i915: Enable DP audio for Haswell

    This patch adds the missing code to send ELD for Haswell DisplayPort, based on Xingchao's original patch.

    A test was performed with HSW-D machine and NEC EA232Wmi DP monitor.

    Cc: Xingchao Wang
    Signed-off-by: Takashi Iwai
    Reviewed-by: Paulo Zanoni
    Signed-off-by: Daniel Vetter

 drivers/gpu/drm/i915/intel_ddi.c | 9 +++++++++
 1 file changed, 9 insertions(+)

commit c9839303d186d6270f570ff3c5f56c2327958086
Author: Chris Wilson
Date:   Tue Nov 20 10:45:17 2012 +0000

    drm/i915: Pin the object whilst faulting it in

    In order to prevent reaping of the object whilst setting it up to handle the pagefault, we need to mark it as pinned. This has the nice side-effect of eliminating some special cases from the pagefault handler as well!

    Signed-off-by: Chris Wilson
    Signed-off-by: Daniel Vetter

 drivers/gpu/drm/i915/i915_gem.c | 29 +++++++++--------------------
 1 file changed, 9 insertions(+), 20 deletions(-)

commit fbdda6fb5ee5da401af42226878880069a6b8615
Author: Chris Wilson
Date:   Tue Nov 20 10:45:16 2012 +0000

    drm/i915: Guard pages being reaped by OOM whilst binding-to-GTT

    In the circumstances that the shrinker is allowed to steal the mutex in order to reap pages, we need to be careful to prevent it operating on the current object and shooting ourselves in the foot.

    Signed-off-by: Chris Wilson
    Signed-off-by: Daniel Vetter

 drivers/gpu/drm/i915/i915_gem.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

commit be7cb6347e0c3aa1956748a860a2465a7ea128c4
Author: Chris Wilson
Date:   Mon Nov 19 15:30:42 2012 +0000

    drm/i915: Remove bogus test for a present execbuffer

    The intention of checking obj->gtt_offset!=0 is to verify that the target object was listed in the execbuffer and had been bound into the GTT. This is guarranteed by the earlier rearrangement to split the execbuffer operation into reserve and relocation phases and then verified by the check that the target handle had been processed during the reservation phase.

    However, the actual checking of obj->gtt_offset==0 is bogus as we can indeed reference an object at offset 0. For instance, the framebuffer installed by the BIOS often resides at offset 0 - causing EINVAL as we legimately try to render using the stolen fb.

    Signed-off-by: Chris Wilson
    Reviewed-by: Eric Anholt
    Signed-off-by: Daniel Vetter

 drivers/gpu/drm/i915/i915_gem_execbuffer.c | 9 ---------
 1 file changed, 9 deletions(-)

commit b92fa839015f27ba0f5c7ef9812eba9ecff538c2
Author: Chris Wilson
Date:   Fri Nov 16 11:43:21 2012 +0000

    drm/i915: Remove save/restore of physical HWS_PGA register

    Now that we always restore the HWS registers (both physical and GTT virtual addresses) when re-initialising the rings, we can eliminate the superfluous save/restore of the register across suspend and resume.

    Signed-off-by: Chris Wilson
    Signed-off-by: Daniel Vetter

 drivers/gpu/drm/i915/i915_drv.h | 1 -
 drivers/gpu/drm/i915/i915_suspend.c | 8 --------
 2 files changed, 9 deletions(-)

commit d09105c66eb813ab3f57ba5e738f477f6ff92dec
Author: Ben Widawsky
Date:   Thu Nov 15 12:06:09 2012 -0800

    drm/i915: Fix warning in i915_gem_chipset_flush

    drivers/gpu/drm/i915/i915_drv.h:1545:2: warning: '______f' is static but declared in inline function 'i915_gem_chipset_flush' which is not static

    Reported-by: kbuild test robot
    dri-devel-Reference: <50a4d41c.586vhmwghpukzbkb fengguang.wu intel.com>
    Signed-off-by: Ben Widawsky
    Signed-off-by: Daniel Vetter

 drivers/gpu/drm/i915/i915_drv.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

commit 42d42e7e4220753bab3eb7b857721f203a4cd821
Author: Damien Lespiau
Date:   Wed Oct 31 19:23:16 2012 +0000

    drm/i915: Only check for valid PP_{ON, OFF}_DELAYS on pre ILK hardware

    ILK+ have this register on the PCH. This check was triggering unclaimed writes.

    Signed-off-by: Damien Lespiau
    Reviewed-by: Paulo Zanoni
    Signed-off-by: Daniel Vetter

 drivers/gpu/drm/i915/intel_bios.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

commit 607a6f7a6621f65706ff536b2615ee65b5c2f575
Author: Daniel Vetter
Date:   Wed Nov 14 17:47:39 2012 +0100

    drm/i915: drop buggy write to FDI_RX_CHICKEN register

    Jani Nikula noticed that the parentheses are wrong and we & the bit with the register address instead of the read-back value. He sent a patch to correct that.

    On second look, we write the same register in the previous line, and the w/a seems to be to set FDI_RX_PHASE_SYNC_POINTER_OVR to enable the logic, then keep always set FDI_RX_PHASE_SYNC_POINTER_OVR and toggle FDI_RX_PHASE_SYNC_POINTER_EN before/after enabling the pc transcoder.

    So the right things seems to be to simply kill the 2nd write.

    Cc: Jani Nikula
    Reviewed-by: Chris Wilson
    [danvet: Dropped a bogus ~ from the commit message that somehow crept in.]
    Signed-off-by: Daniel Vetter

 drivers/gpu/drm/i915/intel_display.c | 3 ---
 1 file changed, 3 deletions(-)
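A list of remaining commits like the one above can be regenerated from a saved bisect log (git bisect log > bisect.log). This is a sketch only, with a made-up function name bisect_suspects: it takes the most recent bad commit and excludes everything reachable from any commit marked good, which is how I understand the remaining candidates to be defined.

```shell
# Print arguments for git log that select the commits still under suspicion:
# the most recent bad commit, minus everything reachable from a good commit.
# Reads a file produced by: git bisect log > bisect.log
bisect_suspects() {
    awk '$1 == "git" && $2 == "bisect" && $3 == "bad"  { bad = $4 }
         $1 == "git" && $2 == "bisect" && $3 == "good" { good = good " " $4 }
         END { if (bad != "") print bad " --not" good }' "$1"
}

# Example (not run here):
#   git log --stat $(bisect_suspects bisect.log) -- drivers/gpu/drm/i915
```

While a bisect is still in progress, git bisect view gives a similar listing without going through a saved log, and git bisect replay bisect.log restores the session on another checkout.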