Static Code Analysis Using Google Code Search
by Dug Song
Today’s guest-blogger post is from Aaron Campbell, long-time Arbor hacker and one of Canada’s finest:
Lint first appeared (outside of Bell Labs) in the seventh version (V7) of the UNIX operating system in 1979. 27 years later, you’d think static code analysis would be dead. But nothing could be further from the truth. This much I’ve proven to myself today after toying with Google’s newest gift to the world, Google Labs Code Search.
Now, this isn’t exactly a new concept. Koders launched last year, and claims its database contains 225,816,744 lines of searchable open source code. Not to be outdone, The Goog has seriously one-upped the competition by providing regular expression matching. And not a hacked-up, watered-down subset of regexp, but full POSIX extended regular expression syntax, as well as select Perl extensions. Kid -> candy store.
Ok, I admit it. Recalling a previous debate over profanity in the Linux kernel source, my inaugural search term for Google’s Code Search was a naughty word. Much to my amusement, the first page of results contained colourful language not only in code comments, but also variable and function names. Potty mouths, the whole lot of us.
Anyhow, as an OpenBSD developer (on extended hiatus though, it would seem, ahem), and having worked in the industry as a vulnerability researcher, I’ve come to know a thing or two about code correctness. The smallest of errors, I’ve learned, can bite you in the biggest of ways. For example,
if ((foo == bar) && (baz == qux));
        party_on();
Looks innocuous enough, until you notice the superfluous semicolon left dangling at the end of the first line, prematurely terminating the if () statement. The C compiler, being blissfully unconcerned with the extra whitespace on the following line, is no help to me. Quite the idiot I am, letting the party go on with foo not equalling bar and/or baz not equalling qux. Thank you Python for not allowing this to happen.
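To make the effect concrete, here is a minimal sketch of the buggy and intended forms side by side. Only party_on() and the condition come from the snippet above; the `parties` counter and the `check_buggy`/`check_fixed` wrappers are hypothetical names added so the difference can be observed.

```c
/* Hypothetical counter so the effect of each call can be observed. */
static int parties = 0;

static void party_on(void)
{
    parties++;
}

/* Buggy form: the stray ';' is the entire body of the if statement,
 * so party_on() runs no matter what the condition evaluates to. */
static void check_buggy(int foo, int bar, int baz, int qux)
{
    if ((foo == bar) && (baz == qux));
        party_on();
}

/* Intended form: party_on() runs only when both comparisons hold. */
static void check_fixed(int foo, int bar, int baz, int qux)
{
    if ((foo == bar) && (baz == qux))
        party_on();
}
```

Calling `check_buggy(1, 2, 3, 4)` still bumps the counter, while `check_fixed(1, 2, 3, 4)` leaves it alone. Current GCC and Clang can flag the first form when warnings such as `-Wempty-body` are enabled, which is worth turning on.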
By now you know where I’m headed with this. We can use Google’s new tool to expose bogus, yet syntactically correct lines of source code. At the risk of boring all you fashion-forward programmers out there, I’m going to leave Ruby at the door and stick with C code analysis, for now. But let’s keep it interesting and start this exercise with a class of bugs that many of our readers have likely never encountered.
Go to http://www.google.com/codesearch and click Advanced Code Search. Select “Case-sensitive search”, and enter the following regular expression, or just click the provided link:
flags\ *&&\ *[A-Z_]+
See what is happening here? The query is not perfect, but some human filtering will quickly weed out the false positives. For the buggy lines of code shown in the results, the author intended to do a bitwise AND to test his flags variable for a bit constant, but has actually fat-fingered the keyboard and put a logical AND in its place. Bad. After combing through the results, I came across one of these bugs in OpenSSL. The following diff was sent to the OpenSSL team, and has since been committed to the 0.9.8 source tree:
--- crypto/x509v3/pcy_tree.c.orig Thu Oct 5 12:20:10 2006
+++ crypto/x509v3/pcy_tree.c Thu Oct 5 12:20:22 2006
@@ -197,7 +197,7 @@
             /* Any matching allowed if certificate is self
              * issued and not the last in the chain.
              */
-            if (!(x->ex_flags && EXFLAG_SS) || (i == 0))
+            if (!(x->ex_flags & EXFLAG_SS) || (i == 0))
                 level->flags |= X509_V_FLAG_INHIBIT_ANY;
             }
         else
Of course, the same style of bug may manifest as a bitwise OR vs logical OR botch-up. As well, the string “flags” as a variable name was hard-coded into this example query only for clarity of demonstration; the same mistake could be made with a variable of any name, obviously.
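A minimal sketch of why the logical form goes wrong, using made-up bit constants (`FLAG_SELF_SIGNED` and `FLAG_CA` are hypothetical and not taken from any OpenSSL header):

```c
/* Hypothetical bit constants, invented for this illustration. */
#define FLAG_CA          0x10u
#define FLAG_SELF_SIGNED 0x20u

/* Buggy: logical AND. The constant is non-zero, so the result is true
 * whenever flags has ANY bit set, not just FLAG_SELF_SIGNED. */
static int is_self_signed_buggy(unsigned int flags)
{
    return flags && FLAG_SELF_SIGNED;
}

/* Intended: bitwise AND isolates the one bit being tested. */
static int is_self_signed_fixed(unsigned int flags)
{
    return (flags & FLAG_SELF_SIGNED) != 0;
}
```

With only `FLAG_CA` set, the buggy test reports true and the fixed one reports false. The OR variant fails in the mirror-image way: `a || B` collapses to 0 or 1 instead of combining bits.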
As I write this, I’m sure hundreds of bored teenagers are plugging away with queries like “strcpy”, “sprintf”, or other unsafe string functions. Not to say such a method will uncover 0 bugs, but it would have been more useful back in 1995. Finding flaws in the most popular software will require a little more creativity. Take the following regexp, for example:
\[sizeof\(.*\)\]\ *=\ *'?\\?0'?;$
This will reveal stuff like:
buf[sizeof(buf)] = '\0';
This is almost certainly wrong. Variable buf will be declared as something like “char buf[1024]”, therefore sizeof(buf) is 1024 and the valid indices run from 0 to 1023. But buf[1024] = '\0' will overwrite one byte beyond buf with a null byte. Off-by-one heaven.
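For contrast, a sketch of what the author of such a line most likely meant. The helper name `bounded_copy_len` and the tiny `BUF_SIZE` are assumptions made for illustration, not code from any project in the search results.

```c
#include <stddef.h>
#include <string.h>

enum { BUF_SIZE = 8 };  /* deliberately small, hypothetical size */

/* Copies src into a fixed-size local buffer, truncating if needed, and
 * returns the length of the terminated result. */
static size_t bounded_copy_len(const char *src)
{
    char buf[BUF_SIZE];

    /* Valid indices are 0 .. sizeof(buf) - 1. */
    strncpy(buf, src, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';   /* last valid slot */
    /* buf[sizeof(buf)] = '\0'; would write one byte past the array. */

    return strlen(buf);
}
```

A ten-character input comes back truncated to seven characters plus the terminator, all of it inside the array.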
Format string bugs anyone?
^[\ \t]*printf\(getenv
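The problem with a hit on that query is that an attacker-controlled environment variable ends up as the format string itself. A hedged sketch of the usual fix follows; it uses snprintf rather than printf so the output can be inspected, and the helper name `format_untrusted` is hypothetical.

```c
#include <stdio.h>

/* Writes an untrusted string into out without interpreting it.
 * Risky form (shown only as a comment): snprintf(out, out_size, untrusted);
 * there, any %x or %n inside the input would be parsed as a directive. */
static int format_untrusted(char *out, size_t out_size, const char *untrusted)
{
    /* Safe form: the input is passed as data to a constant format. */
    return snprintf(out, out_size, "%s", untrusted);
}
```

Fed the string `%x%x%n`, this copies those six characters through verbatim instead of reading (or writing) memory the caller never supplied.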
Bad errno checking (assignment operator instead of equality operator):
"if (errno = E"
Back to a simpler, almost laughable example:
"<= 65553"
No, I didn’t typo USHRT_MAX, which is 65535 on platforms with a 16-bit unsigned short. But someone else has. Try some more of your favourite power-of-2-minus-1 values and you’ll have yourself some juicy 0-day BUGTRAQ fodder in no time. I feel I have to say it again. The tiniest of errors can have the most unintended of effects.
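Here is a small sketch of what the transposed digits let through. It assumes a 16-bit destination and uses `UINT16_MAX` from `<stdint.h>` (exactly 65535) instead of `USHRT_MAX`; the function names are hypothetical.

```c
#include <stdint.h>

/* Buggy range check: 65553 has two digits swapped, so the values
 * 65536 .. 65553 slip through. */
static int fits_u16_buggy(long value)
{
    return value >= 0 && value <= 65553;
}

/* Intended check: let the header supply the limit. */
static int fits_u16_fixed(long value)
{
    return value >= 0 && value <= UINT16_MAX;
}

/* What happens to an accepted-but-too-large value once it is stored:
 * conversion to uint16_t keeps only the low 16 bits. */
static uint16_t narrow_to_u16(long value)
{
    return (uint16_t)value;
}
```

A value of 65540 passes the buggy check and then silently becomes 4 when narrowed, which is exactly the kind of wraparound a bounds check is supposed to prevent.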
But why stick to decimal? How about:
0xfffffff[^0-9a-f]
Get it? Note the count of ‘f’ characters, just 7, not 8. It’s hard to visually distinguish 0xffffffff from 0xfffffff. This is far from an exact science; most of the hits from this query, I suspect, will not identify a bug. But someone has messed this up, somewhere in the 200 search results, guaranteed.
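To see what the missing digit costs, consider a mask meant to keep the low 32 bits of a wider value. This is only an illustrative sketch; the function names are made up.

```c
#include <stdint.h>

/* Buggy: seven f's is a 28-bit mask (2^28 - 1), so the top four bits
 * of the intended 32-bit result are silently cleared. */
static uint32_t low32_buggy(uint64_t v)
{
    return (uint32_t)(v & 0xfffffff);
}

/* Intended: eight f's, i.e. the same value as UINT32_MAX. */
static uint32_t low32_fixed(uint64_t v)
{
    return (uint32_t)(v & 0xffffffff);
}
```

Passing 0xdeadbeef through the buggy mask yields 0x0eadbeef. Spelling the constant as `UINT32_MAX` sidesteps the counting problem entirely.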
Check for nonsensical misuse of an API. For example,
getopt\ *\(argc,\ *argv,\ *\"[^\"]*;
According to the man page, a getopt(3) optstring may contain the following elements: individual characters, characters followed by a colon, and characters followed by two colons. In this sample query, we are looking for cases where a colon was mistyped as a semicolon. Four results show up as of this writing. getopt(3) is supposed to make command line parsing easy, but clearly some command-line options go completely untested.
Based on this research, I’ve filed a few bug reports with various open source projects about flaws I’ve found over the past 24 hours using these techniques. However, as it turns out, these search queries have been turning up far too many bugs for this to be a one-man effort. Some of these bugs are harmless. Some of them are bound to be security holes. My hope is that I’ve provided enough ideas and examples that our readership can join in.
To wrap it up, what makes the Google tool so powerful is the instant search response; I’m still scratching my head about how they managed to pull it off. Any one of us could download thousands of open source packages, untar them, and run find/grep with the regular expressions I’ve shown here, but with far from the immediate gratification that Google Code Search supplies. Now if only they’d add multi-line matching…
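For the do-it-yourself route, here is a minimal sketch of checking one line of source against one of these queries with the POSIX regex API. The helper `line_matches` is a hypothetical name, and the pattern in the usage note drops the backslash-escaped spaces from the query shown earlier, since an escaped space is not defined in POSIX extended syntax; that adaptation is mine.

```c
#include <regex.h>
#include <stddef.h>

/* Returns 1 if line matches the POSIX extended regular expression in
 * pattern, 0 if it does not, and -1 if the pattern fails to compile. */
static int line_matches(const char *pattern, const char *line)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, line, 0, NULL, 0);
    regfree(&re);

    return rc == 0 ? 1 : 0;
}
```

Run with the pattern `flags *&& *[A-Z_]+`, this flags the pre-patch OpenSSL line and stays quiet on the patched one. Wrapping it in a loop over files and lines gets you a slow, private approximation of the search, one line at a time.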
Happy bug hunting.
