Some preliminary findings from using AI to review my pull requests
This is not a post about my overall take on AI; I have neither the energy nor the desire to organize those thoughts. The piece I’ve encountered so far that most closely represents my opinions and feelings is AI Ambivalence.
Instead this is about how I’ve been using AI in personal and work projects.
I’ve been taking notes on my own experiences with AI for several months, in a manner that veers between rigorous and desultory.
My notes on using AI in personal projects will probably never be compiled and published; they are not at all systematic, and are mostly about my sense of what it is like to use AI when coding “for fun”.
I’ve stopped taking notes on using AI for general purposes. I was doing this somewhat systematically (as in, for every single use, I would write a brief note) for a period of a few weeks, largely to get a sense of whether it would actually provide value, how I could use it more effectively, and so on. I stopped doing this when I got a good enough sense that yes, despite my reservations and its limitations, it did prove useful in various scenarios.
Lately, after a coworker used AI while reviewing one of my pull requests and discovered a bug in my code (albeit a minor one in some error-handling logging), I’ve developed an interest in seeing whether I can get any value from a workflow that uses AI to review my code before I open a PR. One of the main benefits people ascribe to PR review is socializing knowledge of the code throughout the team. Obviously AI-based code review does not contribute to that, and as a trend almost certainly inhibits it, so the benefits I’m interested in here are: a) can it catch bugs, and b) can it suggest improvements (refactorings, optimizations, idiomatic syntax, etc.)?
I am aware that there are a number of SaaS tools that will integrate with GitHub to review your PRs. This is not what I’m talking about; we used one of those tools at work for a while and it almost immediately became noise that I learned to just ignore. Rather, the workflow I’m experimenting with now is basically:
- Write some code, mostly or exclusively by hand, since I still neither enjoy nor seem to get much of a speed boost from using AI coding assistants, except in cases like “refactor this TypeScript/React code I’ve already written that has some obvious improvements”, which is <5% of the work I do.
- Ask Claude Code and/or Augment (via VS Code plugin) to review my commits before I open a PR, looking for bugs and improvements.
- Make any changes that seem legit, and write a brief note on the experience.
I also have a simple scoring system that classifies each experiment on a 0-10 scale and roughly captures how much time was wasted. A better system might also try to capture how much time was “saved” by making changes the AI suggested, but that seems much more fraught and involved. My goal with this approach is not to achieve a perfect understanding of precisely how useful AI could be to me as a code review tool, but to provide a rough guide as to whether I should use it at all, and how much credence to give its suggestions.
The 0-10 scale started as 1-10, but then some results were so bad that I thought a 0 value was needed.
- 0/10: actively dangerous suggestions that would introduce serious bugs or security problems
- 1/10: all suggestions are incorrect or misleading
- 5/10: suggestions are somewhat reasonable, occasionally helpful
- Arguably, a deficiency or limitation of this scoring system is that a review that would otherwise be a 7 or 8 will be dragged down to a 4 or 5 by the inclusion of bad or irrelevant suggestions from the AI. But since I still have to deal with all the hay while looking for that needle, I think this approach is legitimate.
- 10/10: top-tier suggestions; catches critical bugs, refactorings, or improvements I would only expect to get from very experienced and skilled human colleagues; little to no noise (or the amount of noise is dramatically outweighed by the value of the other suggestions and findings)
Occasionally in my work I will also encounter an actual bug (either an existing one, or one I have introduced in the course of developing a set of commits), or something that would constitute a bug if I hadn’t handled it, and I will sometimes then use the same process to see whether the AI is able to identify the bug.
Other limitations of this approach:
- I have not recorded which model or model version I’m using at the time. I’m trying to keep the barrier to maintaining this system low so that I will actually stick to it, and this seems like an extra annoyance that would impede that, but probably I should just bite the bullet and do it.
- I am not recording my prompts yet, since they are very basic, just “Review the last N commits for bugs or suggestions”. After gathering initial data for a week or so, I will experiment with improved prompts. I’m slightly optimistic that will improve the results at least a little bit.
Findings so far
I will update this section over time. Last updated June 28, 2025.
misc improvements
- AI is good at catching typos in my comments. 10/10
- AI is okay at suggesting additional comments and documentation. About half the time it is not worth it, and its own actual wording (via tab completion with Augment, which is just as unusably bad as Copilot was a year ago) is almost always wrong. 6/10 for the suggestions to add comments at all, 1/10 for the actual comment text suggested.
- AI is bad at suggesting the need for additional test coverage. It often suggests it when it is unnecessary. OTOH this sort of suggestion is fairly easily ignorable and not very enraging. 5/10
- It occasionally suggests good refactorings, but sometimes misidentifies the reasons for doing so, e.g. it will conflate applying “DRY” with an “optimization”. Most of its suggested improvements are wrong. 3/10.
- It seems to not understand deprecation, and suggests I delete things that I am deprecating. This is actually not surprising, but still disappointing, since many human developers also seem to not understand the difference between deleting something and marking it deprecated (a minimal sketch of the distinction follows this list). This seems like a good concrete example of how AI-generated code is basically “median developer code”. 2/10.
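To spell out the distinction the AI keeps missing, here is a minimal, entirely invented Ruby sketch (the class and field names are not from my codebase): deprecating a field means the old accessor keeps working and warns, and only gets deleted later, once no clients are using it.

```ruby
# Invented example: `start_date` is deprecated in favour of `hired_on`.
# It still works and emits a warning; the actual deletion happens in a
# later release, once no callers remain.
class Employee
  attr_accessor :hired_on # the new field

  def start_date
    warn "[DEPRECATION] `start_date` is deprecated; use `hired_on` instead."
    hired_on
  end
end
```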
bugs
- AI is pretty good at finding bugs related to out-of-scope variables and typos. In a dynamic language like Ruby, this is pretty valuable (a minimal invented example of this kind of bug follows this list). 7/10
- AI is extremely bad at finding bugs that are even slightly more complicated than this. It, so far, has ALWAYS missed them. 1/10.
- AI is very bad about falsely identifying issues as bugs. It frequently flags things as bugs that are not bugs, or identifies things as bugs when they are actually just a bit confusing or unusual. 3/10
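Here is the kind of bug I mean in that first bullet, as a minimal invented Ruby example (not from any real codebase): a typo’d variable name only blows up at runtime, and only on the branch that actually references it, so catching it by reading the diff is genuinely useful.

```ruby
# Invented example: the typo `recipent` raises NameError only when the
# gift branch actually executes; nothing complains before then.
def shipping_label(order)
  recipient = order.fetch(:recipient)
  if order[:gift]
    "Gift for #{recipent}" # typo: should be `recipient`
  else
    "Ship to #{recipient}"
  end
end
```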
subjective
- I really truly hate having to read through its bad suggestions, and it is about 90% bad suggestions. It feels like having a terrible but overconfident coworker who you dread having to interact with. Maybe this will get easier with time, like I have adapted to my various chronic pains. It is not wasting much of my time, when I sum up the time wasted estimates, but it is fairly demoralizing, which may have subtle negative effects on my productivity throughout the day.
- I contemplate adding a rage factor to my review notes, maybe just a little rage face emoji. Probably not necessary, safe to assume that every 0 or 1 out of 10 review induces mild feelings of rage. Thankfully I have a lifetime of practice controlling my rage and funneling it into humour and cynicism and my refusal to relinquish Canadian spelling and pronunciation.
Notes
Each dated entry represents a separate “experiment”. Sometimes I will use two separate AIs on the same commits, sometimes just one (usually Claude Code since it is handier via the CLI).
- June 26 claude code identified 4 issues with a PR. 1/10, 4 minutes wasted.
- 2 were false positives, not actually bugs, AI misunderstanding the behaviour
- 2 were suggestions that I did not end up following, either irrelevant or would have complicated the code with little benefit.
- Failed to catch a significant bug that I planted for it, which would have caused data corruption.
- June 26 claude identified 0 issues. 1/10, 0 minutes wasted.
- Failed to catch a bug with `merge!(...hash_params).merge(...other_hash_params)` that would have resulted in `other_hash_params` being ignored (a sketch of the failure mode follows). I caught this through testing in my own pre-PR submission review.
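For anyone unfamiliar with this failure mode, here is a minimal sketch with invented variable names (not the actual code): `merge!` mutates its receiver and returns it, but the chained non-bang `merge` builds a new hash, so if that return value is never used, the second set of params silently goes nowhere.

```ruby
# Invented names. merge! updates `params` in place, but the chained merge
# (no bang) returns a new hash that is never assigned, so the keys from
# other_hash_params are silently dropped.
params            = { a: 1 }
hash_params       = { b: 2 }
other_hash_params = { c: 3 }

params.merge!(hash_params).merge(other_hash_params)

params # => {a: 1, b: 2} (no :c)
```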
- June 27 asked claude code to look for some bugs; it identified about 10, none of which were the actual bug, which was that an env var was not set. Took about 10-15 mins to review these, all of which was a waste of time other than the comment typos. 1/10, 15 mins wasted
- Two false positives, confused by use of Hashie::Mash (see the brief note after this entry)
- Two false positives, not bugs, possible (but actually bad) refactorings
- Two false positives, actually just comment typos
- False positive, not a bug but code could be improved with a comment
- False positive, code raises an error in some cases, that is not a bug
- False positive, thinks query string needs to be encoded but it does not
- False positive, thinks phone number needs validation but it does not
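For context on why Hashie::Mash keeps coming up (it is a real gem; the snippet below is just a generic invented illustration, not code from the PR): a Mash exposes hash keys as method-style accessors and quietly returns nil for missing keys, which is presumably what trips the reviewers up.

```ruby
require "hashie"

# Invented illustration of what Hashie::Mash does: keys become method-style
# accessors, and missing keys return nil instead of raising NoMethodError.
record = Hashie::Mash.new(phone: "555-0100")

record.phone    # => "555-0100" (no attr_accessor defined anywhere)
record.address  # => nil (missing key, no error)
record.phone?   # => true
```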
- June 27 For the same commits as above, Augment found just one “bug” but it was also a false positive, confused by `Hashie::Mash`. At least this was much less noisy than Claude. 1/10, 2 minutes wasted
- June 27 Asked claude code to review another PR for bugs and improvements. 3/10, 4 minutes wasted trying to find the code it was complaining about that did not exist.
- Identified an unused variable
- False positive bug: was somehow looking at the wrong git diff and complained about an orphaned variable in a method that no longer existed, recognized its mistake when I pointed that out.
- Did not find any other bugs
- Suggested some improvements that were not worth it.
- June 27 asked claude code to review another PR, found 0 bugs, claimed one thing was a bug when it wasn’t. 4/10, 2 minutes wasted.
- False positive bug, it identified some slightly unusual test code as a bug.
- Suggested a couple pointless improvements.
- Suggested one good comment improvement.
- June 27 asked claude code to review another commit, found no bugs, suggested a completely bogus refactoring in which, via the “DRY principle”, we could cache the results of a method because the underlying data doesn’t change often. This has nothing to do with DRY, is a dangerously bad idea, and is completely opposite to the spirit of the commit, which is to make the dates editable. 0/10, 2 minutes wasted.
- June 28 asked claude to review a short PR around some date logic changes. 0/10, 1 minute wasted, some truly idiotic improvements suggested. If I were a junior developer and followed any of this advice without knowing better, I would cause errors in production.
- Claimed there was an inconsistency with a change that deprecated a field; it seems not to understand that sometimes the goal of deprecating a field is just to rename it to another field.
- Suggests we log deprecation warnings when those already happen automatically. I am not surprised it missed this, but it is still bad advice.
- Complains that the hard-coded date logic is “removing the complex payroll-based calculation without considering edge cases”, when that removal is the entire point of, and the improvement made by, this change; nor is this even really a classic case of “hard coding”.
- Another illegitimate complaint that I didn’t delete the deprecated field, but that is pretty much the entire point of deprecation: you mark it deprecated, then LATER delete it when no clients are using it.
This post will be updated with more results as time passes. I suspect that the notes will become terser over time, with fewer examples of the good/bad feedback, but at least for now, I’m interested in collecting actual concrete examples.