All the mail mirrored from lore.kernel.org
 help / color / mirror / Atom feed
From: Patrick Palka <patrick@parcs.ath.cx>
To: Jeff King <peff@peff.net>
Cc: Patrick Palka <patrick@parcs.ath.cx>,
	Junio C Hamano <gitster@pobox.com>,
	git@vger.kernel.org
Subject: Re: [PATCH] Improve contrib/diff-highlight to highlight unevenly-sized hunks
Date: Thu, 18 Jun 2015 19:06:02 -0400 (EDT)	[thread overview]
Message-ID: <alpine.DEB.2.20.8.1506181845310.4322@idea> (raw)
In-Reply-To: <20150618204505.GD14550@peff.net>

On Thu, 18 Jun 2015, Jeff King wrote:

> On Thu, Jun 18, 2015 at 04:14:19PM -0400, Patrick Palka wrote:
>
>>> in a test script becomes more clear. But some of the output is not so
>>> great. For instance, the very commit under discussion has a
>>> confusing and useless highlight. Or take a documentation patch like
>>> 5c31acfb, where I find the highlights actively distracting. We are saved
>>> a little by the "if the whole line is different, do not highlight at
>>> all" behavior of 097128d1bc.
>>
>> To fix the useless highlights for both evenly and unevenly sized hunks
>> (like when all but a semicolon on a line changes), one can loosen the
>> criterion for not highlighting from "do not highlight if 0% of the
>> before and after lines are common between them" to, say, "do not
>> highlight if less than 10% of the before and after lines are common
>> between them".  Then most of these useless highlights are gone for both
>> evenly and unevenly sized hunks.
>
> Yeah, this is an idea I had considered but never actually experimented
> with. It does make some things better, but it also makes some a little
> worse. For example, in 8dbf3eb, the hunk:
>
> -               const char *plain = diff_get_color(ecb->color_diff,
> -                                                  DIFF_PLAIN);
> +               const char *context = diff_get_color(ecb->color_diff,
> +                                                    DIFF_CONTEXT);
>
> currently gets the plain/context change in the first line highlighted,
> as well as the DIFF_PLAIN/DIFF_CONTEXT in the second line. With a 10%
> limit, the second line isn't highlighted. That's correct by the
> heuristic, but it's a bit harder to read, because the highlight draws
> your eye to the first change, and it is easy to miss the second.

Good example, this actually exposes a "bug" in the heuristic.  Each line
is around 15 characters long and the common affix ");" is 2 characters
long, which is about 15% of 15.  So the DIFF_PLAIN/DIFF_CONTEXT pair
ought to be highlighted.

The patch was unintentionally comparing the lengths of the common
affixes against the length of the entire line, whitespace and color
codes and all.  In effect this meant that the 10% threshold was much
higher.  We should compare against the length of the non-boring parts of
the whole line when determining the percentage in common.  Attached is a
revised patch.

>
> Still, I think this is probably a minority case, and it may be
> outweighed by the improvements. The "real" solution is to consider the
> hunk as a whole and do an LCS diff on it, which would show that yes,
> it's worth highlighting both of those spots, as they are a small
> percentage of the total hunk.
>
>> Here is a patch that changes the criterion as mentioned.  Testing this
>> change on the documentation patch 5c31acfb, only two pairs of lines are
>> highlighted instead of six.  On my original patch, the useless highlight
>> is gone.  The useless semicolon-related highlights on e.g. commit
>> 99a2cfb are gone.
>
> Nice, the ones like 99a2cfb are definitely wrong (I had though to fix
> them eventually by treating some punctuation as uninteresting, but I
> suspect the percentage heuristic covers that reasonably well in
> practice).
>
>> Of course, these patches are both hacks but they seem to be surprisingly
>> effective hacks especially when paired together.
>
> The whole script is a (surprisingly effective) hack. ;)
>
>>> So I dunno. IMHO this does more harm than good, and I would not want to
>>> use it myself. But it is somewhat a matter of taste; I am not opposed to
>>> making it a configurable option.
>>
>> That is something I can do :)
>
> Coupled with the 10%-threshold patch, I think it would be OK to include
> it unconditionally. So far we've just been diffing the two outputs and
> micro-analyzing them. The real test to me will be using it in practice
> and seeing if it's helpful or annoying.

-- >8 --

Subject: [PATCH] diff-highlight: don't highlight lines that have little in
  common

---
  contrib/diff-highlight/diff-highlight | 21 +++++++++++++++++----
  1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/contrib/diff-highlight/diff-highlight b/contrib/diff-highlight/diff-highlight
index 85d2eb0..0cc2b60 100755
--- a/contrib/diff-highlight/diff-highlight
+++ b/contrib/diff-highlight/diff-highlight
@@ -218,8 +218,21 @@ sub is_pair_interesting {
  	my $suffix_a = join('', @$a[($sa+1)..$#$a]);
  	my $suffix_b = join('', @$b[($sb+1)..$#$b]);

-	return $prefix_a !~ /^$COLOR*-$BORING*$/ ||
-	       $prefix_b !~ /^$COLOR*\+$BORING*$/ ||
-	       $suffix_a !~ /^$BORING*$/ ||
-	       $suffix_b !~ /^$BORING*$/;
+	$prefix_a =~ s/^$COLOR*-$BORING*//;
+	$prefix_b =~ s/^$COLOR*\+$BORING*//;
+	$suffix_a =~ s/$BORING*$//;
+	$suffix_b =~ s/$BORING*$//;
+
+	my $whole_a = join ('', @$a);
+	$whole_a =~ s/^$COLOR*-$BORING*//;
+	$whole_a =~ s/$BORING*$//;
+
+	my $whole_b = join ('', @$b);
+	$whole_b =~ s/^$COLOR*\+$BORING*//;
+	$whole_b =~ s/$BORING*$//;
+
+	# Only bother highlighting if at least 10% of each line is common among
+	# the lines.
+	return ((length($prefix_a)+length($suffix_a))*100 >= length($whole_a)*10) &&
+	       ((length($prefix_b)+length($suffix_b))*100 >= length($whole_b)*10);
  }
-- 
2.4.4.410.g43ed522.dirty

  parent reply	other threads:[~2015-06-18 23:06 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-06-15 17:20 [PATCH] Improve contrib/diff-highlight to highlight unevenly-sized hunks Patrick Palka
2015-06-18 15:50 ` Junio C Hamano
2015-06-18 16:28   ` Patrick Palka
2015-06-18 18:08     ` Junio C Hamano
2015-06-18 19:04       ` Jeff King
2015-06-18 20:14         ` Patrick Palka
2015-06-18 20:45           ` Jeff King
2015-06-18 21:23             ` Jeff King
2015-06-18 21:39               ` Junio C Hamano
2015-06-18 22:25               ` Patrick Palka
2015-06-19  3:54               ` Jeff King
2015-06-19  4:49                 ` Junio C Hamano
2015-06-19  5:32                   ` Jeff King
2015-06-19  7:34                     ` Jeff King
2015-06-19 11:38                       ` Jeff King
2015-06-19 17:20                     ` Junio C Hamano
2015-06-18 23:06             ` Patrick Palka [this message]
2015-06-18 20:23       ` Patrick Palka
2015-06-18 19:08     ` Jeff King
2015-06-18 20:27       ` Patrick Palka

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=alpine.DEB.2.20.8.1506181845310.4322@idea \
    --to=patrick@parcs.ath.cx \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.