Applying Git's pickaxe option across multiple lines of YAML using textconv

For this post, I wanted to talk about a trick I came up with a few years ago to apply Git's pickaxe option in a situation where the change I was interested in wasn't contained in a single line. I'm hoping that readers of this post take away not only the trick itself, but also a perspective on how we can leverage Git's flexibility to create even more powerful tools for searching history!

For those of you unfamiliar with it, git log -S, commonly known as the "pickaxe option", is a handy tool for searching Git history - it allows you to detect when a change was introduced (as opposed to git blame, which tells you the last commit to touch a line). It's very powerful, but it's fundamentally line-based, and doesn't have a notion of the structure of the files within a repository, which is important with formats like YAML.

If you're curious why I wanted to do this, read on through the "Background" section; otherwise, feel free to skip ahead to my exploration of Git's diff machinery in Getting Git to run a custom diff for YAML files, or to the solution I found in the Summary.

Background - Why Would You Want To Do This?

A few years ago, we were in the process of upgrading our Kubernetes clusters at work to Kubernetes 1.20, and while looking through the release notes, we found an interesting bug fix:

A bug was fixed in kubelet where exec probe timeouts were not respected. This may result in unexpected behavior since the default timeout (if not specified) is 1s which may be too small for some exec probes.

We didn't have a ton of workloads using exec probes, so as part of the upgrade process, we checked the handful of pods with containers using an exec probe without an explicit timeout. But I was also interested in exec probes that did have an explicit timeout, and discovering the reason they were introduced so I could judge whether or not the timeout might be problematic after we deployed the fixed version of Kubernetes.

Now, finding that information out for resources currently in the cluster is easy - it's just a kubectl away - but that might miss things like pod templates in deployments with replicas: 0, which would come as a nasty shock if developers re-enabled that deployment (not to mention it could happen far into the future, leading to serious potential confusion as the knowledge of the bug fix fades from memory). But more importantly, it doesn't really tell us the why behind the choice of timeout - so I wanted to do this as a search through Git history.

Now, you might be thinking that I could have used git blame - but that can run into issues where a commit is just changing formatting, or making a non-semantic change. However, more importantly, git blame tells us the commit that most recently changed a line - my goal is to find the commit that first introduced the line with timeoutSeconds!

As you might expect from the title of this post, the next Git feature we could look at for this task is git log -S (AKA the pickaxe). This does get us closer - but consider what an exec probe looks like in YAML:

livenessProbe:
  exec:
    command:
    - "true"
  timeoutSeconds: 5

Because it's spread out across several lines, just doing a git log -S timeoutSeconds might give us some false positives, such as probes not of type exec (which we have a lot more of), and it would be a lot of work to separate the signal from the noise there.
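
To make the problem concrete, here's a small sketch in a throwaway repository (the file names and commit messages are made up for illustration, not taken from our real repository). It commits one httpGet probe and one exec probe, both with a timeout, and then asks the pickaxe about timeoutSeconds:

```shell
#!/bin/sh
# Sketch: plain `git log -S` cannot tell probe types apart.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
# Throwaway identity so commits work in a clean environment.
git config user.name "Example"
git config user.email "example@example.com"

# Commit 1: an httpGet probe with a timeout (not what we're looking for).
cat > http.yaml <<'EOF'
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  timeoutSeconds: 5
EOF
git add http.yaml
git commit -q -m "Add httpGet probe"

# Commit 2: an exec probe with a timeout (what we're looking for).
cat > exec.yaml <<'EOF'
livenessProbe:
  exec:
    command:
    - "true"
  timeoutSeconds: 5
EOF
git add exec.yaml
git commit -q -m "Add exec probe"

# Both commits are reported, because the search only sees the string itself.
matches=$(git log -S timeoutSeconds --format=%s)
printf '%s\n' "$matches"
```

Both commit subjects come back, so on a repository with many non-exec probes the interesting commits would be buried.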

We could probably get away with writing some code to comb through the git log -S timeoutSeconds results and discard any matches that aren't liveness probes of type exec - but can we instead convince the pickaxe feature to be "multi-line" aware, or otherwise aware of YAML's structure, to achieve our goal?

Getting Git to run a custom diff for YAML files

Git's diff machinery provides a number of ways to configure it to account for all of the various types of files you can store in a version control system - since pickaxe uses Git's diff machinery at its core, I wondered if I could somehow leverage one of these features to accomplish my goal. The first feature I investigated was Git's custom diff drivers, which allow you to define a custom program to run to generate diffs for certain files. I didn't know exactly how such a program would interact with git log -S, but first I needed to verify that git log -S would even use a custom diff driver. So I used one of my favorite techniques in programming: I told Git to use a program that would deliberately blow up by specifying false as a diff driver via .git/config:

[diff "execprobes"]
  command = false

...and .gitattributes:

*.yaml diff=execprobes

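Next, I ran a few commands to see if things failed. Unfortunately, git log -S kept working as if the driver weren't there, which means the pickaxe doesn't go through a custom diff command.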
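
A sketch of how that experiment might be reproduced in a scratch repository (the repository, file name, and commit message here are hypothetical). The expectation is that git diff dies because it runs the external command, while git log -S succeeds because it never calls it:

```shell
#!/bin/sh
# Sketch: does `git log -S` run a driver's external diff command?
set -u
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.name "Example"
git config user.email "example@example.com"

printf 'timeoutSeconds: 5\n' > probe.yaml
git add probe.yaml
git commit -q -m "Add probe"

# Route *.yaml through a driver whose external diff command always fails.
git config diff.execprobes.command false
echo '*.yaml diff=execprobes' > .gitattributes

# Make a working-tree change so `git diff` has something to show.
printf 'timeoutSeconds: 6\n' > probe.yaml

# `git diff` runs the external command, so it should fail...
if git diff >/dev/null 2>&1; then diff_status=0; else diff_status=$?; fi

# ...while `git log -S` should still succeed if it never runs `false`.
if pickaxe_out=$(git log -S timeoutSeconds --format=%s 2>&1); then
  log_status=0
else
  log_status=$?
fi

echo "git diff exit status:   $diff_status"
echo "git log -S exit status: $log_status"
echo "git log -S output:      $pickaxe_out"
```

A non-zero status for git diff alongside a zero status for git log -S is the signal that the pickaxe ignores the command setting.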

So, back to the drawing board! I took a look around gitattributes(5) some more, and came across textconv. Now, I'd encountered this option before - a few years ago, feeling unsatisfied with the performance of git log -S on our monorepo at work, I wrote a tool called git-ftl (for "faster than log"). I've since stopped using it, since performance improvements in Git have made it unnecessary, but I remember at the time that one of the drawbacks of git-ftl was that it did not respect textconv. Well, if that's a drawback in comparison to git log -S, that implies that git log -S respects textconv - I tested this out, and sure enough, it does! The only difference was that instead of putting this into .git/config:

[diff "execprobes"]
  command = false

...I put this in:

[diff "execprobes"]
  textconv = false
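
The same blow-up test can be sketched for textconv in a scratch repository (again, the repository and file names are hypothetical). If the pickaxe honors textconv, it has to run false on the file's contents, so git log -S should exit with an error:

```shell
#!/bin/sh
# Sketch: does `git log -S` run a driver's textconv program?
set -u
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.name "Example"
git config user.email "example@example.com"

printf 'timeoutSeconds: 5\n' > probe.yaml
git add probe.yaml
git commit -q -m "Add probe"

# Same attribute as before, but the driver now sets textconv, not command.
git config diff.execprobes.textconv false
echo '*.yaml diff=execprobes' > .gitattributes

# If the pickaxe honors textconv, running `false` makes the search fail.
if git log -S timeoutSeconds >/dev/null 2>&1; then
  log_status=0
else
  log_status=$?
fi
echo "git log -S exit status: $log_status"
```

A non-zero exit status here means the textconv program was invoked, which is exactly the hook we need.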

So now, instead of writing a program that produces diffs in a form that lets git log -S find our special occurrences of timeoutSeconds within YAML files, I needed to write a program that converts YAML into some representation from which Git's native diffing algorithms could pick those occurrences out.

Generating text for flattened probes

Ok, now that we know how to generate custom text for Git diff to work with, we need to figure out what text to generate. It doesn't really matter, as long as we're able to pick out instances of timeoutSeconds for exec-type probes, so what I did was just look through YAML files for all of the different probe types that have an exec field, and print a flattened JSON representation for those probes. An example of that in action is changing this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: starts-with-probes
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: start-with-probes
  template:
    metadata:
      labels:
        app.kubernetes.io/name: start-with-probes
    spec:
      containers:
      - name: main
        image: 'busybox:latest'
        command:
        - 'sleep'
        - '999m'
        livenessProbe:
          exec:
            command:
            - 'true'
          timeoutSeconds: 2

...into this:

{"exec":{"command":["true"]},"timeoutSeconds":2}

I wrote a little script using yq to convert the YAML to JSON, and jq to extract the probes (yq can do a lot of what jq can, but in this case its dialect of jq's language wasn't up to the task 😞). Now all we need to do is change the diff driver's textconv option from false to sh flatten-exec-probes.sh, and we're in business! I made a little example repository; here are the results:

$ git log -S timeoutSeconds
commit a134ad2
Author: Rob Hoelz <rob@hoelz.ro>
Date:   Sat May 4 19:44:54 2024 -0500

    Add example that always has probes

commit 0acbfda
Author: Rob Hoelz <rob@hoelz.ro>
Date:   Sat May 4 19:44:20 2024 -0500

    Add exec probe to example

(If you try this yourself, don't forget to git config diff.execprobes.textconv 'sh flatten-exec-probes.sh' to tell Git about our execprobes driver).

If you look at the commits in that repository, you'll see that other files that have non-exec probes using timeoutSeconds aren't picked up, just like we'd hoped!
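
If you'd like to see the whole setup working without installing yq and jq, here's an end-to-end sketch in a scratch repository. Note the big assumption: the flatten-exec-probes.sh below is NOT my real script, just a crude awk stand-in that prints one line per probe block containing an exec key, and it only copes with simple, space-indented YAML. The file names and commit messages are made up too.

```shell
#!/bin/sh
# Sketch: textconv-driven pickaxe that only sees exec probes.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

# Hypothetical stand-in for the real yq/jq script: flattens each *Probe
# block onto one line, and prints it only if the block has an `exec:` key.
cat > flatten-exec-probes.sh <<'EOF'
awk '
function emit_probe() { if (in_probe && is_exec) print buf; in_probe = 0 }
{
  match($0, /[^ ]/); ind = RSTART      # column of first non-space, 0 if blank
  if (ind == 0) next
  if (in_probe && ind <= probe_indent) emit_probe()   # dedent ends the block
  if (in_probe) {
    line = $0; sub(/^ +/, "", line)
    if (line == "exec:") is_exec = 1
    buf = buf " " line
    next
  }
  if ($0 ~ /^ *[A-Za-z]+Probe: *$/) {
    in_probe = 1; is_exec = 0; probe_indent = ind; buf = $1
  }
}
END { emit_probe() }
' "$1"
EOF

# Wire the script up as the textconv for *.yaml files.
git config diff.execprobes.textconv 'sh flatten-exec-probes.sh'
echo '*.yaml diff=execprobes' > .gitattributes
git add flatten-exec-probes.sh .gitattributes
git commit -q -m "Add textconv setup"

printf 'livenessProbe:\n  httpGet:\n    port: 8080\n  timeoutSeconds: 5\n' > http.yaml
git add http.yaml
git commit -q -m "Add httpGet probe"

printf 'livenessProbe:\n  exec:\n    command:\n    - "true"\n  timeoutSeconds: 5\n' > exec.yaml
git add exec.yaml
git commit -q -m "Add exec probe"

# Only the exec probe survives textconv, so only its commit should match.
matches=$(git log -S timeoutSeconds --format=%s)
printf '%s\n' "$matches"
```

The httpGet commit drops out because its file converts to empty text, so as far as the pickaxe is concerned timeoutSeconds never appeared there.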

What's more, you can use -G to look for changes to exec probes:

$ git log -G timeoutSeconds
commit 9ec0fcf
Author: Rob Hoelz <rob@hoelz.ro>
Date:   Mon May 6 09:13:42 2024 -0500

    Bump exec probe timeout to 3 seconds

commit a134ad2
Author: Rob Hoelz <rob@hoelz.ro>
Date:   Sat May 4 19:44:54 2024 -0500

    Add example that always has probes

commit 0acbfda
Author: Rob Hoelz <rob@hoelz.ro>
Date:   Sat May 4 19:44:20 2024 -0500

    Add exec probe to example
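
The reason -G picks up the extra commit is that the two options ask different questions: -S reports commits where the number of occurrences of the string changes, while -G reports commits whose diff contains an added or removed line matching the regex. A small sketch of the difference in a scratch repository (no textconv involved; the file name and commit messages are made up):

```shell
#!/bin/sh
# Sketch: -S vs -G when a value changes but the key stays put.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

printf 'timeoutSeconds: 2\n' > probe.yaml
git add probe.yaml
git commit -q -m "Add probe"

# Change only the value: the string still occurs exactly once.
printf 'timeoutSeconds: 3\n' > probe.yaml
git commit -q -am "Bump timeout"

# -S: occurrence count went 0 -> 1 in the first commit, then stayed at 1.
s_matches=$(git log -S timeoutSeconds --format=%s)
# -G: both diffs have added/removed lines that match the pattern.
g_matches=$(git log -G timeoutSeconds --format=%s)

echo "-S matches:"; printf '%s\n' "$s_matches"
echo "-G matches:"; printf '%s\n' "$g_matches"
```

So -S answers "when did this first show up (or disappear)?", and -G answers "when was a line mentioning this touched?".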

Summary

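To get git log -S to work in a multi-line fashion:

1. Write a program that converts each file into text where the thing you're searching for fits on a single line (here, flatten-exec-probes.sh).
2. Register that program as the textconv for a diff driver in .git/config.
3. Assign that driver to the relevant files in .gitattributes.
4. Run git log -S (or -G) as usual.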

All of the .git/config and .gitattributes management makes this technique a bit unwieldy, but it is pretty powerful - I was able to answer my question from our Git repository's history in a few minutes.

This feels like a pretty niche use case, but I think this technique could be adapted for other things as well - to compensate for things like formatting, or other non-semantic changes. I could also imagine using it for things like discovering when links to certain pages were added to my personal wiki.

Hopefully this gives you ideas for different ways to make use of your own Git histories; if you find an interesting use case, please let me know!

Published on 2024-05-06