Anonymous programmers can be identified by analyzing coding style

  • Interesting to see this formalized. When I was in grad school and graded undergrad homework/exams, I could most definitely recognize the students by their coding styles after just a few assignments. Every student ends up developing their own habits, and they're quite easy to spot in something as repetitive as code.

    I remember teaching a Matlab class for engineers and scientists that was about 50/50 male/female, and the women tended to have much neater code. Code written by males often had comments all jumbled up, inconsistent number of spaces between braces/operators/etc, incoherent variable names, worse names for functions, and so on.

  • My Code Jam solutions always shared a lot of code -- all the boilerplate for reading inputs, parsing integers, iterating over test cases, and writing out results.

    Because of that, it seems like Code Jam is an artificially easy test case for this sort of identification -- I'm pretty sure a human could look at my solutions and conclude they were obviously all written by the same person.
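
    As a rough illustration of how much shared boilerplate gives away, here is a minimal Python sketch that measures line-level overlap between two solutions with the standard library's `difflib`. The helper name `shared_fraction` and the sample solutions are hypothetical, and this is plain text similarity, not the method from the paper.

```python
from difflib import SequenceMatcher


def shared_fraction(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1], comparing two sources line by line."""
    return SequenceMatcher(None, a.splitlines(), b.splitlines()).ratio()


# Two hypothetical contest solutions that reuse the same input/output scaffolding.
solution_a = """import sys
def solve(n):
    return n * 2
for case in range(int(sys.stdin.readline())):
    print(solve(int(sys.stdin.readline())))
"""

solution_b = """import sys
def solve(n):
    return n + 1
for case in range(int(sys.stdin.readline())):
    print(solve(int(sys.stdin.readline())))
"""

# An unrelated snippet with no shared scaffolding.
unrelated = "print('hello')\n"

print(shared_fraction(solution_a, solution_b))
print(shared_fraction(solution_a, unrelated))
```

    Only the body of `solve` differs between the two solutions, so the overlap is high even though the actual problem-solving code has nothing in common.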

  • Not adhering to style guides is now a privacy issue.

  • Reminds me of how telegraph receivers used to be able to identify transmitters of the telegraph by their "fist" (cadence or rhythm with which they signaled).

    Here's a Schneier post about a Concordia University study about identifying e-mail authors. https://www.schneier.com/blog/archives/2011/08/identifying_p...

  • But can they tell us if TJ Holowaychuk is really only one person?

  • It's interesting that they only analyze the abstract syntax tree and ignore formatting. I would suspect that brace placement, tabs vs spaces, etc. would provide a useful fingerprint as well.
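
    A minimal sketch of what such a layout fingerprint could look like for C-like source, written in Python. The function name `layout_features` and the particular counters are made up for illustration; they are not the feature set from the paper.

```python
import re
from collections import Counter


def layout_features(source: str) -> dict:
    """Count a few formatting habits in C-like source text."""
    counts = Counter()
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        # Indentation character: leading tab vs leading space.
        if line.startswith("\t"):
            counts["indent_tab"] += 1
        elif line.startswith(" "):
            counts["indent_space"] += 1
        # Brace placement: '{' alone on its line vs at the end of a line.
        if stripped == "{":
            counts["brace_own_line"] += 1
        elif stripped.endswith("{"):
            counts["brace_same_line"] += 1
        # Spacing around '=' in simple assignments (skips ==, <=, >=, !=).
        counts["assign_spaced"] += len(re.findall(r"(?<![=!<>]) = (?!=)", line))
        counts["assign_tight"] += len(re.findall(r"\w=(?!=)\w", line))
    return dict(counts)


# Two hypothetical authors writing the same function with different habits.
author_a = "int main()\n{\n\tint x = 1;\n\treturn x;\n}\n"
author_b = "int main() {\n    int x=1;\n    return x;\n}\n"

print(layout_features(author_a))
print(layout_features(author_b))
```

    Both snippets would parse to the same syntax tree, so everything this function counts is information an AST-only analysis never sees.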

  • They gave a presentation on this at 31C3 a few days ago, covering their findings in more detail and followed by a Q&A (http://media.ccc.de/browse/congress/2014/31c3_-_6173_-_en_-_... ).

    What I understood from that: it worked quite well with code bases like the ones from Google Code Jam (large LoC, no style guides, etc.), but not that well with smaller amounts of code. I'm looking forward to some additional results, e.g. with a codebase from a corporate development environment.

  • I think that kind of analysis wouldn't be possible with Go (http://golang.org/), since it is very strict/limited and uniform.

  • This doesn't surprise me. When I was in college we had a daily paper, and I worked in the photo dept. We had a box of "feature photos". They were kind of filler (campus life, people playing hacky sack, feeding ducks, setting up for events, etc.). I figured out one day that I could look at the photos and tell who took them by the style (the photographer's name is on the back).

    At one job I had we called uncommented, poorly formatted code "curtis code" for some reason.....

  • Of these 3, at least 2 would depend on the language in use:

    "We used a combination of lexical features (e.g., variable name choices), layout features (e.g., spacing), and syntactic features (i.e., grammatical structure of source code)"

    In particular, "layout features" is a huge issue in some languages, and not at all in others. For instance, languages like Javascript or PHP give great flexibility about layout, so in those languages I can see each developer having a unique style (and I have been involved in style debates regarding those languages). A language like Python, however, has a fairly fixed layout, since the whitespace is significant. And in Clojure, I think most programmers use Emacs and accept the Emacs clojure-mode indenting as the default.

    Variable name choice is another area where some environments encourage similarity, and others allow for unpredictability and unique styles. Within the Ruby On Rails framework, for instance, there are norms about the creation of variable names.

    I would guess that syntactic features are perhaps the one characteristic that shows a great deal of uniqueness in every language. I am often surprised at the choices my co-workers make when it comes to how to solve a problem.
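
    To make the lexical and syntactic categories concrete, here is a rough Python sketch built on the standard `ast` module. The function names and the naming-convention buckets are my own assumptions for illustration, not the features the paper actually used, and layout is left out because the parser discards it.

```python
import ast
import re
from collections import Counter


def lexical_features(tree: ast.AST) -> Counter:
    """Classify identifier names by naming convention."""
    counts = Counter()
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            name = node.id
        elif isinstance(node, ast.arg):
            name = node.arg
        elif isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            name = node.name
        else:
            continue
        if len(name) == 1:
            counts["single_letter"] += 1
        elif "_" in name:
            counts["snake_case"] += 1
        elif re.search(r"[a-z][A-Z]", name):
            counts["camelCase"] += 1
        else:
            counts["other"] += 1
    return counts


def syntactic_features(tree: ast.AST) -> Counter:
    """Histogram of AST node types, a crude proxy for grammatical structure."""
    return Counter(type(node).__name__ for node in ast.walk(tree))


# Two ways of solving the same problem: an explicit loop vs a builtin call.
loop_style = ast.parse("total = 0\nfor i in items:\n    total += i\n")
builtin_style = ast.parse("total = sum(items)\n")

print(syntactic_features(loop_style)["For"], syntactic_features(builtin_style)["For"])
print(lexical_features(loop_style))
```

    The two snippets compute the same thing, but the node histograms differ (a `For` and an `AugAssign` in one, a single `Call` in the other), which is the kind of problem-solving choice that survives any formatter.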

  • Could be identified in some cases of amateur code, like PHP or Javascript or Clojure.

    Good practice is to follow a very explicit coding style which makes code written by different developers indistinguishable - the more the better.

    Go ahead, identify which developer wrote which part of Linux kernel or, god forbid, jdk/src/

  • Something that would be interesting is to follow code styles across people who've pair-programmed. Kinda like the apprenticeship model, I wonder if you could detect specific styles that get adopted and evolve over time.

  • This is fascinating. My mind is immediately drawn to simple obfuscation programs that would turn tabs into spaces, change the formatting and so on, while still leaving the code syntactically correct. Not obfuscation that hides the purpose of the code, just the identity of the author.

    Does anyone know of any such projects, or what they might be described as? None of the queries I've tried produce the intended results.

    You could even take it one step further: if you can identify the author of source code, can you not then forge that signature to make it look like they wrote something they didn't?
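
    For Python source at least, a layout-only normalizer can be sketched as a parse and re-emit round trip using the standard `ast` module (`ast.unparse` needs Python 3.9 or later). The name `normalize_layout` is hypothetical, and this only removes comments and formatting; identifier names and code structure pass through unchanged, so it would not defeat the lexical or syntactic features.

```python
import ast


def normalize_layout(source: str) -> str:
    """Re-emit Python source from its AST, discarding comments and formatting."""
    return ast.unparse(ast.parse(source))


# The same program as two hypothetical authors might format it.
messy = "x=1   # my counter\nif x==1 :\n\tprint( x )\n"
tidy = "x = 1\nif x == 1:\n    print(x)\n"

# After the round trip both versions are byte-for-byte identical.
print(normalize_layout(messy) == normalize_layout(tidy))
print(normalize_layout(messy))
```

    Tools like gofmt do the same kind of thing for their own languages, though most formatters keep comments, which this round trip throws away.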

  • I have a question. I've never decompiled anything, but I was under the impression it would come out as machine code, so you wouldn't get the programmer's notes, tabs, etc. Can someone explain how they're doing this? I know I should know this, haha, be gentle :-p

  • I'm pretty sure something similar could be done with shell history logs.

  • This stylometry analysis is 95% of a stylometry obfuscator/homogenizer.

  • The same way authors of text posts online can be identified.

  • gofmt ftw ;)

  • Can it help to find the real author of bitcoin?

  • "Prose authorship attribution that utilizes parse trees have been able to identify an anonymous text from 100,000 candidate authors 20% of the time."

    Color me unimpressed