Semgrep: AutoFixes using LLMs

Sun, Mar 24, 2024

Semgrep:

Semgrep is an incredible tool that allows you to search code by matching against the Abstract Syntax Tree (AST). For instance, if you want to find all method calls named get_foo, you can write a pattern like this:

$A.get_foo(...)
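To make the pattern concrete, here is a small Python file annotated with what I would expect `$A.get_foo(...)` to match. `Repo` and `Service` are made-up names for illustration only. `$A` is a metavariable that binds to any expression, and `...` stands for any argument list.

```python
class Repo:
    def get_foo(self, *args, **kwargs):
        return "foo"


class Service:
    def __init__(self):
        self.repo = Repo()


repo = Repo()
svc = Service()

a = repo.get_foo()                # matched: $A binds to `repo`
b = svc.repo.get_foo(1, key="x")  # matched: $A binds to `svc.repo`, `...` covers the arguments
c = repo.get_foo                  # not matched: attribute access without a call
```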

Test your own patterns using the playground: https://semgrep.dev/playground/new

While there are other tools like this, Semgrep is currently the most capable.

AutoFix:

Semgrep not only searches using patterns but also supports rewriting the matches. Here’s a simple rule definition from their documentation:

rules:
- id: use-sys-exit
  languages:
  - python
  message: |
    Use `sys.exit` over the python shell `exit` built-in. `exit` is a helper
    for the interactive shell and may not be available on all Python implementations.
    https://stackoverflow.com/a/6501134
  pattern: exit($X)
  fix: sys.exit($X)
  severity: WARNING

This can be invoked by running:

semgrep --config ./rule.yml --autofix
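As a rough illustration of the result, this is what a file might look like after the rule has run. The `main` function is a hypothetical example, not taken from the Semgrep docs. Note that the `fix` only rewrites the matched call, so the `import sys` line has to be there already or be added by hand.

```python
import sys  # not added by the autofix; the rule only rewrites the matched expression


def main(argv):
    if len(argv) < 2:
        # before --autofix this line read: exit(2)
        sys.exit(2)
    return argv[1]
```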

LLMs:

Although the built-in autofix feature is powerful, it’s limited to simple AST transforms. I’m currently exploring the idea of fixing semgrep matches using a Large Language Model (LLM). More specifically, each match is individually fed into the LLM and replaced with the response. To make this possible, I’ve created a tool called semgrepx, which can be thought of as xargs for semgrep. I then use semgrepx to rewrite the matches using the fantastic llm tool. Here’s how it works:

semgrep -l go --pattern 'log.$A(...)' --json > matches.json
semgrepx llm 'update this go to use log.Printf' < matches.json
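The core idea can be sketched in a few lines of Python. This is not how semgrepx is actually implemented, just a minimal model of the approach: it assumes each entry under `results` in the JSON carries `start.offset` and `end.offset`, and that those offsets can be used to index the source text directly (true for ASCII input). `fake_llm` is a stand-in for the real model call.

```python
from typing import Callable


def rewrite_matches(source: str, results: list, transform: Callable[[str], str]) -> str:
    """Replace each matched span in `source` with transform(matched_text)."""
    # Work from the last match to the first so that earlier offsets stay
    # valid after a replacement changes the length of the text.
    ordered = sorted(results, key=lambda r: r["start"]["offset"], reverse=True)
    for r in ordered:
        start, end = r["start"]["offset"], r["end"]["offset"]
        source = source[:start] + transform(source[start:end]) + source[end:]
    return source


def fake_llm(snippet: str) -> str:
    # Stand-in for the model: a real run would send the snippet plus the prompt to an LLM.
    return snippet.replace("log.Println(", "log.Printf(")


go_source = 'log.Println("a")\nx := 1\nlog.Println("b")\n'
# Trimmed-down match records; real semgrep output has many more fields.
results = [
    {"start": {"offset": 0}, "end": {"offset": 16}},
    {"start": {"offset": 24}, "end": {"offset": 40}},
]
rewritten = rewrite_matches(go_source, results, fake_llm)
```

Applying the replacements back to front is the one design choice that matters here: the model's answer is rarely the same length as the match, so rewriting front to back would shift every later offset.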

Notes:

In my experience, Anthropic’s Claude 3 Opus model performs much better at this task than GPT-4.
I tend to match a larger expression than necessary to provide the LLM with additional context.
I make heavy use of llm’s template feature.