How to make fake data look meaningful
A great method for spotting (or at least suspecting) fake data is to see if it follows Benford's law. (So remember to fit your fake data to conform!)
http://en.wikipedia.org/wiki/Benford's_law
> Benford's Law, also called the First-Digit Law, refers to the frequency distribution of digits in many (but not all) real-life sources of data. In this distribution, the number 1 occurs as the leading digit about 30% of the time, while larger numbers occur in that position less frequently: 9 as the first digit less than 5% of the time. Benford's Law also concerns the expected distribution for digits beyond the first, which approach a uniform distribution.
As one might imagine:
> A test of regression coefficients in published papers showed agreement with Benford's law. As a comparison group subjects were asked to fabricate statistical estimates. The fabricated results failed to obey Benford's law.
So keep that in mind, data fabricators!
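For anyone who wants to actually run the check (or, ahem, pre-check their fabrications), here's a minimal sketch in Python. The function name and output format are my own; it just compares observed leading-digit frequencies against Benford's expected log10(1 + 1/d):

```python
import math
from collections import Counter

def benford_check(values):
    """Compare observed leading-digit frequencies with Benford's expected ones."""
    leading = [int(str(abs(v)).lstrip("0.")[0]) for v in values if v != 0]
    counts = Counter(leading)
    n = len(leading)
    for d in range(1, 10):
        expected = math.log10(1 + 1 / d)   # Benford: P(d) = log10(1 + 1/d)
        observed = counts.get(d, 0) / n
        print(f"digit {d}: observed {observed:.3f}  expected {expected:.3f}")

# e.g. benford_check(list_of_reported_figures)
```

If the observed column looks nothing like the expected one, that's not proof of fraud, but it's a reason to look closer.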
I recently made a blog post on which universities have the most successful startup founders: http://minimaxir.com/2013/07/alma-mater-data/
In the post, I made it clear where the data came from and how I processed it, and I provided a raw copy of the data so that others could check against it.
It turns out the latter decision was an especially good idea, because I received numerous comments that my data analysis was flawed, and I agree with them. Since then, I've taken additional steps to improve data accuracy, and I always provide the raw data, license permitting.
I worked in retail when I was younger. This would be excellent advice for a sales person working with uneducated customers.
I was the go-to guy, which gave me a lot of freedom when dealing with clients. That let me do a lot of social experimentation, especially when selling custom services (like fixing "my PC is slow").
Explaining the ins and outs of the different options (goals, short- and long-term consequences, risks, upsides and downsides...) and then saying it would cost $50 would usually make the customer suspicious: he'd want guarantees I couldn't give and complain that it was expensive.
Say "Oh, you'll need our Performance Rejuvenation(tm) package, that'll be $400" and he happily pulls out his credit card!
> If you are trying to trick somebody with your data, never share the raw data; only share the conclusions. Bonus points if you can't share the data because it is somehow privileged or a trade secret. The last thing you want is to allow other people to evaluate your analysis and possibly challenge your conclusions.
Of course, I'm not against sharing data. However, the satire here is slightly too optimistic that people, when given raw data, will actually attempt to verify it for themselves. Even when people are given plain narrative text, they can still be profoundly influenced by a skewed headline -- something no one here could ever be familiar with, of course :)
I guess I'm being curmudgeonly about this... We should all share the data behind our conclusions, but don't assume that doing so -- and hearing no outside objections -- means you were correct. Most people just don't have time to read beyond the summary, never mind work with the data.
There is of course a famous book on this topic:
I was hoping (by the title) that this was something about generating real-looking test data for your app, something very useful for UI devs before the API or backend is in place.
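For that use case, the usual trick is a fake-data library rather than anything statistical. A minimal sketch using the third-party `Faker` package (the record fields here are just examples I picked):

```python
from faker import Faker

fake = Faker()
Faker.seed(42)  # reproducible fixtures across runs

# Generate a few realistic-looking user records to populate the UI
# before the real API or backend exists.
users = [
    {"name": fake.name(), "email": fake.email(), "city": fake.city()}
    for _ in range(5)
]
for u in users:
    print(u)
```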
Raw data, and the source code which you used to arrive at your conclusion, or it didn't happen.
In today's world of GitHub gists, Python's PyPI, and R's CRAN, there's no excuse not to document the entire process, in addition to the raw data.
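In that spirit, even a tiny self-contained script published next to the raw data goes a long way. A hypothetical sketch (the file name and column name are invented for illustration):

```python
# analyze.py -- regenerate the headline number from the published raw data
import csv
import statistics

with open("raw_data.csv", newline="") as f:   # hypothetical raw data file shipped with the post
    rows = list(csv.DictReader(f))

rates = [float(r["conversion_rate"]) for r in rows]  # hypothetical column
print(f"n = {len(rates)}")
print(f"median conversion rate = {statistics.median(rates):.3f}")
```

If a reader can run that one file against your published CSV and get your number, the "it didn't happen" objection goes away.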
Some more sophisticated (but very readable and sometimes laugh-out-loud funny) articles about detecting fake data can be found at the faculty website[1] of Uri Simonsohn, who is known as the "data detective"[2] for his ability to detect research fraud just from reading published, peer-reviewed papers and thinking about the reported statistics in the papers. Some of the techniques for cheating on statistics that he has discovered and named include "p-hacking,"[3] and he has published a checklist of procedures to follow to ensure more honest and replicable results.[4]
From Jelte Wicherts writing in Frontiers of Computational Neuroscience (an open-access journal) comes a set of general suggestions[5] on how to make the peer-review process in scientific publishing more reliable. Wicherts does a lot of research on this issue to try to reduce the number of dubious publications in his main discipline, the psychology of human intelligence.
"With the emergence of online publishing, opportunities to maximize transparency of scientific research have grown considerably. However, these possibilities are still only marginally used. We argue for the implementation of (1) peer-reviewed peer review, (2) transparent editorial hierarchies, and (3) online data publication. First, peer-reviewed peer review entails a community-wide review system in which reviews are published online and rated by peers. This ensures accountability of reviewers, thereby increasing academic quality of reviews. Second, reviewers who write many highly regarded reviews may move to higher editorial positions. Third, online publication of data ensures the possibility of independent verification of inferential claims in published papers. This counters statistical errors and overly positive reporting of statistical results. We illustrate the benefits of these strategies by discussing an example in which the classical publication system has gone awry, namely controversial IQ research. We argue that this case would have likely been avoided using more transparent publication practices. We argue that the proposed system leads to better reviews, meritocratic editorial hierarchies, and a higher degree of replicability of statistical analyses."
[1] http://opim.wharton.upenn.edu/~uws/
[2] http://www.nature.com/news/the-data-detective-1.10937
[3] http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2160588
(this abstract link leads to a free download of the article)
[4] http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1850704
(this abstract link also leads to a free download of the full paper)
[5] http://www.frontiersin.org/Computational_Neuroscience/10.338...
Jelte M. Wicherts, Rogier A. Kievit, Marjan Bakker and Denny Borsboom. Letting the daylight in: reviewing the reviewers and other ways to maximize transparency in science. Front. Comput. Neurosci., 03 April 2012 doi: 10.3389/fncom.2012.00020
Tangentially, the article gets confidence intervals wrong:
> Simple change increased conversion between -12% and 96%
That's not what a confidence interval is. A confidence interval is merely the set of null hypotheses you can't reject.
http://www.bayesianwitch.com/blog/2013/confidence_intervals....
A credible interval (which you only get from Bayesian statistics) is the interval that represents how much you increased the conversion by.
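To illustrate the difference, here's a minimal Beta-Binomial sketch in Python. The counts are made up; the point is that the percentile interval of the posterior lift is the thing that actually answers "how much did conversion increase?":

```python
import numpy as np

# Made-up A/B test counts, purely for illustration
conv_a, n_a = 48, 1000   # control: conversions, visitors
conv_b, n_b = 63, 1000   # variant

rng = np.random.default_rng(0)
# Uniform Beta(1, 1) priors; posterior for each rate is Beta(conversions + 1, non-conversions + 1)
post_a = rng.beta(conv_a + 1, n_a - conv_a + 1, size=100_000)
post_b = rng.beta(conv_b + 1, n_b - conv_b + 1, size=100_000)

relative_lift = (post_b - post_a) / post_a
lo, hi = np.percentile(relative_lift, [2.5, 97.5])
print(f"95% credible interval for relative lift: {lo:+.0%} to {hi:+.0%}")
```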
I see this all the time, especially on social media blogs. They come to strange, convoluted conclusions about things like "the best time to tweet" without taking into account that 40% of tweets come from bots and auto-tweets, or that most of your followers are not human.
Standard methods of calculating confidence intervals only apply to data that fits a parametric model. Most data I deal with on the Web doesn't, so I wouldn't take a lack of confidence intervals to necessarily imply non-meaningful data.
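That said, if you don't trust a parametric model, a percentile bootstrap is one distribution-free way to still report an interval. A sketch with made-up data standing in for a skewed web metric:

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-in for something like per-user session lengths -- purely illustrative
sample = rng.lognormal(mean=1.0, sigma=1.2, size=500)

# Resample with replacement many times and take percentiles of the resampled means
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(10_000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {sample.mean():.2f}, 95% bootstrap CI = ({lo:.2f}, {hi:.2f})")
```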
Honestly, anyone who isn't showing a 95% confidence interval is just being lazy. If you're not willing to run an experiment twenty or thirty times, you might as well not even look at the results and just publish it regardless of what it is. Why even research?
On the other hand, if you'd like to showcase actual insight and meaningful, surprising conclusions, it only makes sense to take the time to find the best dataset that supports it. Real researchers leave nothing to chance.
Reminds me of every single presentation I've ever seen at a games developer conference. "Look at this awesome thing we did! The graph goes up and to the right! No, I can't tell you scale, methodology, other contributing factors or talk to statistical significance. But it's AMAZING, right?!"
Another great checklist to consider when reading articles: Warning Signs in Experimental Design and Interpretation : http://norvig.com/experiment-design.html (found the link on HN a while back)
Are there any tips that would help when seeding two-sided markets with fake data, a la reddit getting started?
This isn't really on how to make fake data look meaningful, but rather how to make useless data look meaningful. If you can fake the data, then there's no need for these misleading analyses.
Meh. I mean... it's funny, but it just reads like yet another person who saw bogus numbers, got fed up, and ranted.
I did notice a lot of misspellings on the graph legends :)
Nice one! Haha... But you need to be extra careful, though, because some people are really keen on details.