The social side of science seen in the research on programming language quality

  • Programming and programming languages are in large part a subject for the social sciences anyway. They are made for the needs of humans, used by humans in groups, with practices and methods addressing the particularities of humans and human groups. Not ordinary humans, but humans nonetheless. Programming is probably more a social science than mathematics or other human-independent matters: a social science that draws on other disciplines, much like economics, but at the very least an interdisciplinary subject.

  • Wow, what a bad idea. Do Haskell and PHP get used for the same sort of projects? Or Erlang and Ruby? No, they get used for very different sorts of projects. You would need to compare only similar kinds of projects (e.g. web frameworks, or statistical analysis, or device drivers) to get anything useful out of this data source. But that's probably not possible, because in most cases there aren't enough projects in very many languages. This also assumes that bugs are all equally likely to get FOUND in all languages. What if bugs are easier to find in language X than in language Y? Then you will get MORE, not fewer, commits that mention fixing them in language X, which is the opposite of what you would want the signal to indicate. I could go on, but there's no need, as even these points are enough to totally invalidate the premise of this research. Getting into the weeds of the analysis of the original study or the rebuttal is not going to give you anything useful if the original premise, that you are measuring how language choice relates to bugs, is wrong, and it almost certainly is.

  • At the very end of this grand tour:

    > Science has its problems, but it’s still the best we got.

    I would agree that if you're doing particle physics and looking for p < 0.0000003, then absolutely: science is a great tool for those kinds of investigations of the natural world. If you're doing social science by scraping some data from the internet and looking for p < 0.05, that's a very different story.

    I'm not actually sure what aspect of this qualifies as "science", besides using p-values to analyze the data. Even if all the categorization and sampling and analyses were perfect, it's using commit messages and known bugs with known fixes, all of which are essentially self-reported.
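    The p < 0.05 worry can be made concrete with a quick simulation. As a sketch (the 17-language count matches the original study; the setup is otherwise invented for illustration), if you test that many languages against a 0.05 threshold, at least one spurious "significant" result is the expected outcome even when no real effect exists:

    ```python
    import random

    random.seed(0)

    ALPHA = 0.05
    N_LANGUAGES = 17   # the FSE study compared 17 languages
    N_RUNS = 10_000

    # Under the null hypothesis, a p-value is uniform on [0, 1].
    # Simulate testing every language independently and count how often
    # at least one "significant" result shows up purely by chance.
    runs_with_false_positive = 0
    for _ in range(N_RUNS):
        p_values = [random.random() for _ in range(N_LANGUAGES)]
        if any(p < ALPHA for p in p_values):
            runs_with_false_positive += 1

    rate = runs_with_false_positive / N_RUNS
    # Analytically: 1 - (1 - 0.05)^17, roughly 0.58
    print(f"chance of at least one spurious 'effect': {rate:.2f}")
    ```

    This is why multiple-comparison corrections (which the rebuttal authors applied) matter so much at the 0.05 level, and why they matter far less at particle-physics thresholds.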

  • I'm confused. So the original paper was shown to be flawed, but then the rebuttal was also shown to be flawed. And this article argues that the flaws in the rebuttal were not as glaring as the flaws in the original paper?

    But the approach of counting commits on GitHub just seems fundamentally flawed. The first step should be to show that the GitHub dataset can be used to say anything about the quality and productivity effects of things like language choice. Arguing about how to classify commit messages seems pointless without this foundation.

  • Number of commits as a metric is a poor idea, because it will immediately be gamed. I constantly advocate with D that people break up larger PRs into smaller self-contained ones, for several reasons. Undermining this with a metric that draws unfounded conclusions isn't helpful.

  • Interesting rundown, thanks!

    I'm a bit puzzled with the "not a dox" tweet, what is going on here? I know that this is probably the least interesting part of the article, but I'm confused and curious.

  • It's amazing the paper got published in the first place – the idea that comparing open source codebases could tell you all that much about the underlying language is sort of bizarre when comparing such different projects to begin with.

  • Related from a few months ago: https://news.ycombinator.com/item?id=21637411

  • Every project grows until it becomes too complex and hence unmanageably buggy. Some languages/static analyzers/linters/architectures/frameworks let a project become bigger and more complex before becoming unmanageably buggy.

    So if you look at two projects, and both have a bunch of bugs, that doesn't tell you much. You have to know bugs-per-unit-complexity, which is pretty damn close to unmeasurable.

    It's telling, though, that the companies managing the world's most complex systems run them in C++, Java, and TypeScript. And more recently, Go and Rust.

    With much finance research, if the results were real, the researchers would be making billions on Wall Street rather than be academics. So, too, with software research. If the results were real, the researchers could become gazillionaires advising FAANG instead of being academics.

  • It's astounding how discussing programming languages can devolve into flamewars, even when the discussion starts as actual academic research and the participants are dedicated scientific researchers.

  • This is a truly terrific post, but it says:

    > They’re underselling the effect here. While the dominant factor is number of commits, language still matters a lot, too. If you choose C++ over TypeScript, you’re going to have twice as many DFCs! That doesn’t necessarily mean twice as many bugs, but it is suggestive. Further, while they say the effect is “overwhelmingly dominated by […] project size, team size, and commit size”, that doesn’t actually bear out. Only the number of commits is a bigger factor in language choice.

    This is inaccurate. "Effect" is not the expected difference (i.e. the difference between the means) but, roughly, the expected difference divided by a measure of spread (the standard deviation). Just looking at the expected difference is insufficient to determine whether the effect is large (and undersold) or small.

    Expected difference is not an interesting statistical property, just as the mean isn't (by itself). If you're looking at ten Clojure projects and ten C++ projects, and all ten Clojure projects have 10 DFCs, while eight C++ projects have 8 DFCs and two have 500, then the expected difference is huge, but the effect is small. Indeed, when looking at the variance, Clojure (the "best"-performing language in this dataset) and C++ (the "worst"-performing) were not very distinguishable, supporting everyone's finding of a very small effect.
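    A sketch of the distinction, using the hypothetical Clojure/C++ numbers above (Cohen's d is one common standardized effect-size measure; the study's actual regression model is different):

    ```python
    import statistics

    # Hypothetical numbers from the comment above: ten Clojure projects
    # with 10 DFCs each; eight C++ projects with 8 DFCs and two with 500.
    clojure = [10] * 10
    cpp = [8] * 8 + [500, 500]

    mean_diff = statistics.mean(cpp) - statistics.mean(clojure)  # 96.4

    def pooled_sd(a, b):
        # Pooled sample standard deviation of two groups.
        va, vb = statistics.variance(a), statistics.variance(b)
        return (((len(a) - 1) * va + (len(b) - 1) * vb)
                / (len(a) + len(b) - 2)) ** 0.5

    # Cohen's d: difference of means divided by the pooled SD.
    d = mean_diff / pooled_sd(clojure, cpp)

    print(f"difference in means: {mean_diff:.1f} DFCs")  # huge in absolute terms
    print(f"effect size (Cohen's d): {d:.2f}")           # modest once spread is accounted for
    ```

    The raw gap of ~96 DFCs looks dramatic, but the enormous variance in the C++ group shrinks the standardized effect, which is the quantity "effect size" actually refers to.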

  • > the entire idea of measuring a language’s defect rate via GitHub commits just doesn’t make any sense. How the heck do you go from “these people have more DFCs” to “this language is more buggy”?

    If GitHub is used by ordinary programmers (what are the reasons to assume otherwise?), i.e., if the sample is unbiased, then what is the issue with going from "these [random] people" to "this language"?

  • I'm surprised someone takes my github saved games seriously. It certainly wasn't me.

    I think this science project would be more fun if assumptions were discarded. Just gather the data, do it without a hypothesis. Something interesting might come up worthy of one.

    I liked the bit of crankology in the article. If you sound like a crank, no one will take you seriously? I once pondered: what if the cranks are right? How would we know? It follows that we can't know that sounding like a crank equals being wrong, and that our idea of "sounding serious" is based on a flawed data set, if it isn't noise entirely. The topic of the study here certainly doesn't inspire my confidence.

  • > But it still seems intuitive that language should matter.

    Am I the only one who thinks this is a naive assumption? They aren't really languages in the way English is a language; so much for intuition. Don't definitions matter to science? Does science carry a selfie stick? I'm suddenly keen on the antithesis of science, or a less social science.

  • If all you cared about was fibonacci, or some other "quality", then fine, use that as benchmark. But others might care more about productivity, cost, etc. Then it's also a matter of taste and what is currently fashionable.

  • There's talk and comments about the measure ("DFCs"). What would be better? And is it possible to use this one to eke out a signal even though the measurement has some known problems?

  • Two comments:

    "FSE was a preprint: a paper that hadn’t necessarily gone through peer-review, editing, and publication in a journal. Academics share preprints because 1) the process of getting a paper journal-ready can take years, so this gets it out faster, and 2) academics can share their research with people who aren’t able to afford the ludicrous journal fees."

    To my knowledge, all ACM conferences, including the "ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering", are peer-reviewed. In CS, in my experience, peer review for conferences is more thorough than journal review, and conferences are more important. The idea that conference papers are preprints is something from linguistics or some other field.

    "How did FSE take the rebuttal rebuttal? Not well.

    "So that’s technically not a dox, because he didn’t publish Berger’s private information, but still. That’s a really asshole thing to do!"

    I'm not sure I would call linking to someone's Facebook page from Twitter "doxxing" in any sense. Calling a fellow researcher a "donkey" is a bit of an asshole thing to do, too. (I have no link to said comment, but I know Emery Berger. He's very positive that he's right and the smartest in the room, and he's very abrasive.)

    Tl;dr: Software engineering research is a garbage fire. But we knew that.