Hey CS Paper writers, please show your code
Unfortunately, Jacques needs to answer a pretty basic question in order to get his wish: why? Why should CS paper writers show Jacques (or anyone else) their code?
We have discussed this topic on HN a number of times, for example:
http://news.ycombinator.com/item?id=2735537
http://news.ycombinator.com/item?id=2006749
Many of the comments in those threads do a better job summing up than I ever could. However, briefly, literally all of the incentives are aligned against publishing code and data.
If a writer's code is wrong, they are embarrassed (and there is no culture of being embarrassed by not publishing code).
If a writer publishes their code and it is actually good, someone else can scoop their follow-on results.
If a writer does not publish their code, and it is actually any good, they can potentially commercialize it thanks to the Bayh-Dole Act.
If a writer publishes their code and people intend to use it, the writer needs to clean it up, check it for correctness, and handle support requests. These activities are probably more time consuming than writing the code in the first place.
If the writer publishes their code, and other people in the writer's field do not, the writer is usually at a disadvantage. Others will appear to have more publications, the basic currency of academia. (Many people have great reasons for not publishing their code or data, especially researchers embedded at large companies making changes to large proprietary systems.)
So overall, yes, it would be great if CS paper writers gave out their code. What they are doing is not reproducible science in the philosophy of science sense.
But what is Jacques (or anyone else) doing to fix this system of incentives, and what could anyone do?
I really don't think that it's easier to understand an algorithm by reading its implementation than by reading its description. It's actually the opposite -- frequently the idea behind an algorithm is really hard to decipher if you only have access to code and not words. If you don't believe me, take an algorithm you don't know, read a sample implementation (there are quite a lot on the internet these days) and try to understand how it works. I suggest trying KMP and Boyer-Moore for text matching, or Miller-Rabin and AKS for primality testing, or even RSA -- RSA is terribly simple in theory, but if you don't have a clue how it's supposed to achieve its goal, and you don't have a background in number theory, then your chances of understanding it just by reading the code are infinitesimal.
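To make that concrete, here is a compact Miller-Rabin sketch (my own illustration, not code from any paper; the witness count k=20 is an arbitrary choice). The code is short and runnable, but notice that nothing in it explains *why* it works -- the roles of Fermat's little theorem and of nontrivial square roots of 1 mod n are invisible unless the surrounding text supplies them:

```python
import random

def is_probable_prime(n, k=20):
    """Miller-Rabin probabilistic primality test with k random witnesses."""
    if n < 2:
        return False
    if n in (2, 3):
        return True
    if n % 2 == 0:
        return False
    # Write n - 1 as d * 2^r with d odd.
    d, r = n - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    for _ in range(k):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)          # a^d mod n
        if x in (1, n - 1):
            continue              # this witness tells us nothing
        for _ in range(r - 1):
            x = pow(x, 2, n)      # repeated squaring toward a^(n-1)
            if x == n - 1:
                break
        else:
            return False          # a proves n composite
    return True
```

Reading this cold, without knowing the number theory, you can verify *what* each line computes but not *why* the loop structure certifies compositeness -- which is exactly the point.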
"Generally, academic software is stapled together on a tight deadline; an expert user has to coerce it into running; and it's not pretty code. Academic code is about "proof of concept." These rough edges make academics reluctant to release their software. But, that doesn't mean they shouldn't. "
From:
Who says that we actually have code? I think it took most of a decade before anyone implemented the algorithm from my first paper.
I agree with the writer, in many ways. But it's going to be a huge uphill battle.
I'm on the other side of this, too, as a researcher in algorithms. Earlier this year, I submitted the first paper I've ever written in which I would consider some related code to be an essential part of my research. Now, consider what this means.
- I had to spend a lot of time getting the code into shape. Documentation, general clean-up, etc. This was time I could have spent writing other papers. Publication output is a huge determining factor in academic tenure & promotion decisions. Seriously, the way things are now, you're asking people to risk not getting tenure in order that you might have their code.
- The point of a published paper is that it's a permanent record of the research. Repositories for code that are of a similarly permanent nature are few & far between. Some journals are starting to allow arbitrary attachments to papers, which can then be obtained online. I submitted my paper -- along with the code -- to such a journal. But this greatly restricted the list of journals I could submit to. And it pretty much left out all of the truly prestigious ones. Again, a problem.
Not to be harsh, but the code I've seen taken from "scientific" papers was the worst code I've ever seen. One function of a couple of thousand lines in C++ (but using a mix of C and C++ functions), with variables named with 1 or 2 letters... without comments, gasp... I still have nightmares about that (I had to refactor that code).
But I do agree with Jacques that I much prefer a graph with a summary to a couple of pages of unreadable mathematical expressions. High-level pseudocode of the general idea might also be great. Sometimes, I feel like the paper wasn't written for people to understand, but as a way to make the work seem difficult. It's not rare that I talk with the author of a scientific paper and can easily summarize an entire paragraph in a couple of simple sentences...
I would be interested to see more CS research happen in a more open-source-like way, like how we see projects work on GitHub, where it is easy for anyone with an interest in an area to get involved, run the code and contribute.
Obviously this wouldn't be possible in all branches of CS research, and there is the danger that these kinds of projects would get stuck in local maxima without a few minds driving the research and the ability to understand it well enough to make large-scale changes.
Still, a lot of good could probably come out of it; there are plenty of smart CS minds who, while they don't have the time to commit to the reading and work required to drive their own research projects, have the expertise to contribute to small parts of other people's.
This won't fly in the real world. If an academic writer includes a link to their code in their paper, then when it goes through peer review before being published, it will no longer be a blind review; the reviewer will know who wrote the code by virtue of the GitHub account or domain it's uploaded to. Blind reviews are important, because if you know you're reviewing a paper by someone who rejected your paper, you're more inclined to reject it. That really happens; people are that petty.
Additionally there's the risk that the reviewer will reject the paper, and download the code and publish a paper about it quickly in some other journal/conference.
I would like more code simply to prove the things described in the papers. The exploration just feels incomplete to me if I can't run some code and hack on it to test how it changes behavior.
I read a lot of CS papers and am not particularly strong on formal math (that is, I have a lot of gaps), but I'd rather not have actual code to demonstrate concepts. Instead, I prefer it when they use pseudocode, because it forces them to boil their implementation down to the simplest parts possible rather than cut and paste code into a paper.
I'd be happy with just pseudo-code.
It seems to me like most CS research projects wouldn't necessarily have much code behind them. Obviously, data structures and algorithms could, but implementations are generally trivial once you've got the academic paper.
Sorry. A peer-reviewed research paper in computer science is supposed to be an original contribution to knowledge meeting the usual criteria of new, correct, and significant.
Okay, how do we present knowledge? Text? Yes. Text with math as in, say, a book on math or mathematical physics? Yes. Code? Nope! Sorry 'bout that!
Yes, coming from the programming side, many programmers have accepted that code, especially with mnemonic identifiers, actually is meaningful and a substitute for good text, possibly with good math. Sorry, guys: in principle, it just ain't. In simple terms, the code is meaningless; code with mnemonic identifier names and careful indenting is still meaningless, and at best a puzzle problem to solve to identify the intended meaning. Sorry 'bout that. Yes, often in practice, with enough context from outside the code, you can get by with just mnemonic names and pretty printing. Still, it just ain't knowledge.
Really, for it to be knowledge, the code should be like the math in, say, a physics text, that is, surrounded by text. The math does not, Not, NOT stand alone, and by itself it is just meaningless. E.g., F = ma is no more meaningful than a = bc. For either equation, the meaning has to come from surrounding text, where the equation becomes just a nice abbreviation for what is in the text. So, for code, the comments play the role of the text in a physics book and the code plays the role of the math in a physics book. As in a physics book, the text (comments) is MORE important than the math (code). Again, code just AIN'T knowledge. Sorry guys.
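A toy example of my own (not from any paper) makes the point: the two functions below compute the identical expression, and only the second one, with its surrounding "text" in comment form, carries any knowledge:

```python
# Version 1: syntactically complete, semantically opaque --
# a puzzle problem, not knowledge.
def f(a, b, c):
    return a * b - 0.5 * c * b * b

# Version 2: identical math. The comments, like the prose around an
# equation in a physics book, supply the meaning the code itself lacks.
def height_reached(v0, t, g):
    # Height of a body thrown straight up at initial speed v0,
    # after time t, under constant gravitational acceleration g:
    # the familiar kinematics formula s = v0*t - (1/2)*g*t^2.
    return v0 * t - 0.5 * g * t * t
```

Strip the names and comments from version 2 and you are back to version 1; the computation survives, the knowledge does not.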
Most often they don't have any code. Most of the action takes place inside the head and not inside the computer. What they have are proofs of correctness, invariant enforcement, complexity bounds etc. I doubt that reading the code will improve understanding of the algorithm anymore than a thorough description of the algorithm.
But... computer scientists don't write code! That's not what they're studying, it's not their job!
Writing code is what women do.
"Computer Science is no more about computers than astronomy is about telescopes."