Correlation is usually not causation. But why not?

  • All I know about causation and correlation I learnt hunting bugs in large legacy software systems. In that environment I got the impression that correlation almost never equalled causation, but that's only because the hardest bugs, the ones I remembered, were hard because the obvious correlations did not help identify the root cause. A similar argument might be made for scientific studies: most of the easy causes that can be identified from correlation have already been found, leaving the majority of new studies with correlations that don't easily establish causation.

  • This question is bound to be a can of worms. There has been a great deal written about the matter, particularly in regard to observational studies.

    Drawing a causal inference is overwhelmingly likely to be wrong when there is a good chance that unknown variables are influencing the correlates observed. In the health-related sciences, that is more often than not the case.

    A few years ago there was a study correlating hours of TV watched and ADHD symptoms in children. The news media picked up on these "findings" and of course the causal influence of TV on ADHD was reported.

    It was obvious that saying watching TV caused ADHD was absurd, and that other variables weren't taken into account; e.g., perhaps some other characteristic of kids with ADHD prompted them to watch more TV than other kids.

    There was a great article published in PLOS several years ago (I don't have the link at the moment) showing mathematically that the odds were about 1 in a million that an observational study like the above would turn out to reflect a "true" causal relationship; the author concluded most published studies were junk.

    In experimental studies, variables are limited and controlled as well as possible. With fewer and known variables, correlations would have a greater chance of revealing a reproducible causal relationship among events.

    The discussion gets tripped up when it comes to defining "cause" or "causal relationship". However carefully controlled an experiment may be, there's a possibility that unknown variables were present and affected the phenomena occurring in the experiment. Conclusions can't be absolute, but only true to some probability.

    I think the history of science over the last 100 years or so has something to say about the nature of "causal relationships".

  • Because you can find things that are correlated, but have no practical relationship - many (frequently humorous) examples here:

    http://tylervigen.com

  • Bad winter weather can cause auto accidents, and we expect a positive correlation between bad winter storms and winter auto accidents. Okay, but in the northern hemisphere, living in more northern latitudes also correlates with winter auto accidents but does not cause them.

    For heart disease, we know that the main causes have to do with aging. Since the audience for TV news is now comparatively old, we can expect watching TV news to have a positive correlation with heart disease. Still, watching TV news does not cause heart disease.

    In the US NE, hurricanes are positively correlated with pretty leaves on the trees, but the leaves do not cause the hurricanes, and the hurricanes do not cause the colors in the leaves. Instead, hurricanes are caused by the surface waters of the Atlantic Ocean, hot from the summer sun, with cooler air on top, and that situation is caused by the fall weather, which also causes the colored leaves. So, colored leaves have a spurious correlation with hurricanes but do not cause them.

    Correlation is much more common and much easier to establish than causality. Usually the convincing evidence of causality is some basic physical connection.

  • Normally, there are a couple of confounders that can produce a correlation without direct causation, such as:

    - inverse causation
    - common cause
    - random correlations

    The first two should be relatively stable and reproducible, and we could then proceed to find out about causation with an intervention study (e.g. if "good education" directly causes "good job prospects", will it help if we give a good education to people who wouldn't normally get one? Or does improving people's access to jobs improve their educational achievement?)

    The third isn't reproducible, but should normally be relatively rare. The reason we're seeing more and more non-reproducible results is usually:

    - People fishing around in data sets for correlations, or tweaking experiments until they find a correlation

    Because of this, we end up with a great deal of correlations that have nothing to do with causation.

    For a related but different perspective, see this article ("Language is never, ever random", Adam Kilgarriff): http://www.kilgarriff.co.uk/Publications/2005-K-lineer.pdf
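    As a sketch of the data-fishing problem (illustrative numbers, not from any real study): generate a few dozen completely independent random series, check every pair, and strong correlations appear by pure chance.

    ```python
    import random
    import statistics as st

    random.seed(0)

    def corr(a, b):
        """Pearson correlation of two equal-length sequences."""
        ma, mb = st.fmean(a), st.fmean(b)
        cov = st.fmean((x - ma) * (y - mb) for x, y in zip(a, b))
        return cov / (st.pstdev(a) * st.pstdev(b))

    # 50 independent random "variables", 20 observations each.
    # That gives 50*49/2 = 1225 pairs to fish in.
    series = [[random.gauss(0, 1) for _ in range(20)] for _ in range(50)]

    best = max(abs(corr(a, b))
               for i, a in enumerate(series)
               for b in series[i + 1:])
    # `best` is typically well above 0.5, despite zero real relationships.
    ```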

  • The article is a little dense for me, and presumes knowledge about Probabilistic Graphical Models (PGMs) and directed acyclic graphs / causal Bayesian networks (DAGs) without introducing the background knowledge - so, I expect my reading of it missed a lot of the details.

    But the one thing I came away wondering about is this: suppose you do a proper randomized, double-blinded study with a large sample, say, taking 1000 people with a particular illness, completely randomizing them, and then giving 500 of them a particular treatment and the other 500 a placebo that is indistinguishable from the treatment. If an unbiased third party then observes the groups (still without knowledge of which group had which treatment), and one group shows markedly different results (say, 490/500 of the treatment group recover, and only 10/500 of the placebo group recover), is it fair to say that in this particular scenario the correlation of recovery with the treatment implies that the treatment caused the recovery?
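    A quick sanity check on those (hypothetical) numbers: under the null hypothesis that the treatment does nothing, the 500 recoveries would be split between the two groups essentially at random, and the chance that 490 of them land in the treatment group can be computed exactly from the hypergeometric distribution. A minimal stdlib sketch:

    ```python
    from math import comb

    def tail_prob(recovered, n_treat, n_total, observed):
        """P(at least `observed` of the recoveries fall in the treatment
        group, if group assignment were purely random)."""
        denom = comb(n_total, n_treat)
        return sum(
            comb(recovered, k) * comb(n_total - recovered, n_treat - k)
            for k in range(observed, min(recovered, n_treat) + 1)
        ) / denom

    p = tail_prob(recovered=500, n_treat=500, n_total=1000, observed=490)
    # p is astronomically small, which is why the causal reading seems fair
    # (up to the usual caveats that blinding and randomization actually held).
    ```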

  • Correlation vs causation is not actually a complex mathematical problem. The issue is more philosophical.

    Bayesian networks are a highly sophisticated and flexible framework for thinking about causation. And yet the essence of Bayesian networks can be captured in much simpler methods like Instrumental Variables or simply regressions with controls.

    In all cases, the true distribution of observables (which we can estimate from samples) combined with some assumptions about the possible nature of causality (either very sophisticated in the case of Bayesian networks, or very simple in the case of instrumental variables) lead to the actual causal relationships.
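    A minimal instrumental-variables sketch on simulated data (all coefficients here are made up for illustration): a hidden confounder biases the naive regression, but an instrument that moves X without touching Y directly recovers the true effect.

    ```python
    import random
    import statistics as st

    random.seed(0)
    n = 200_000
    z, x, y = [], [], []
    for _ in range(n):
        u = random.gauss(0, 1)     # unobserved confounder
        zi = random.gauss(0, 1)    # instrument: affects x, not y directly
        xi = zi + u + random.gauss(0, 1)
        yi = 2.0 * xi + 3.0 * u + random.gauss(0, 1)  # true effect of x: 2.0
        z.append(zi); x.append(xi); y.append(yi)

    def cov(a, b):
        ma, mb = st.fmean(a), st.fmean(b)
        return st.fmean((ai - ma) * (bi - mb) for ai, bi in zip(a, b))

    ols = cov(x, y) / cov(x, x)  # biased by the confounder (comes out near 3.0)
    iv = cov(z, y) / cov(z, x)   # Wald/IV estimate (comes out near 2.0)
    ```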

    Where do these assumptions come from? Consider a randomized trial. Physically, it is possible that some hidden cause influenced both the random number generator which selected patients into the trial and whether the patients would get better. And yet people universally believe that there is no such mechanism, and so they believe that randomized trials prove the causal relationship between taking a drug and getting better.

    No physicist ever wrote down an equation proving this. It is simply something we deduced from our human understanding of how nature works.

    Put simply, causality is not a physical notion, neither is it a statistical notion. It is a part of our intuitive understanding of the physical world, in which a higher level notion of causality exists, beyond that described by special relativity.

  • The difference between correlation and causation is merely conventional. Does the striking of a match cause it to ignite? There is no way to prove that it does, only the correlation between the striking and the ignition makes us say they're causal. Correlation is the only way to determine what's causal and what's not.

    On the other hand, if you want to look at it philosophically, then the only sensible definition of "causation" is "anything necessary for a thing to exist." So what's necessary for a thing to exist? Every other thing in the universe that is not that thing! Pluto causes us to exist right now because it hasn't turned into a giant space goat and swallowed the Earth. Yes, the fact that something DIDN'T prevent the existence of a thing is also ultimately a cause of that thing's existence. Causation in its purest form can help us understand the nature of reality, but it can't help us predict anything. It's only when we draw a line between causes that we can control and causes that we can't that it becomes a practical tool (not that understanding reality isn't practical).

    So you can see, this whole correlation != causation discussion is pure nonsense. Get some philosophical skill before you try to discuss philosophical issues and stop embarrassing yourselves. Sheesh, it's ridiculous.

  • In reality, causal mechanisms for the phenomena we most care about (economic and social phenomena particularly) are so fantastically complicated that you never really "figure it out". Nevertheless, if you can exploit the hypothesized causation to optimize some function, you've won, regardless of the metaphysics.

    Even in the context of physical processes, like the biological processes Gwern mentions, the notion of "establishing causality" is much more of a regulatory artifact than anything else.

    You might call this the machine learning approach (optimize some objective function, regardless of mechanism) as opposed to the statistics approach (generate some measured claim about a process itself).

  • Statistics serves as a tool to overcome our cognitive biases. But what if these biases are at the center of learning?

    Take for example the Gambler's Fallacy where a player believes she can predict the outcome of a coin toss with greater certainty than is possible. Obviously she cannot. But if I had to design a Machine Learning algorithm, I would certainly want it to always assume that a pattern existed. That way, if the data were predictable, the algorithm would be able to take advantage of it.
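    A toy version of that design choice (hypothetical coin biases, purely illustrative): a predictor using Laplace's rule of succession always entertains the possibility of a pattern, which pays off when a bias exists and costs almost nothing when it doesn't.

    ```python
    import random

    def predict_heads(history):
        """Rule-of-succession estimate of P(next flip = heads)."""
        return (sum(history) + 1) / (len(history) + 2)

    random.seed(1)
    biased = [int(random.random() < 0.7) for _ in range(10_000)]
    fair = [int(random.random() < 0.5) for _ in range(10_000)]

    p_biased = predict_heads(biased)  # near 0.7: assuming a pattern paid off
    p_fair = predict_heads(fair)      # near 0.5: the assumption cost little
    ```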

  • Correlation is due to one of the 3 C's: coincidence, confounding, or causality. I just like the 3 C's.

  • Another cracking article from gwern.

    For those wondering, that snazzy-looking `choose` combinatorial function is from Clojure, and I assume other Lisps:

    http://clojuredocs.org/incanter/incanter.core/choose
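    (For anyone following along in Python rather than Clojure, the stdlib equivalent is `math.comb`:)

    ```python
    from math import comb

    c = comb(5, 2)  # number of ways to choose 2 items from 5 -> 10
    ```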

  • The problem is that causation is an oxymoron.

    The measurement problem is the same as the problem of induction.

    Max Planck understood this, read his quotes on matter.

    No amount of correlation increases the probability of one event followed by another. Therefore, all scientific analysis is unverifiable. Knowledge of the world is completely unjustified. Not to mention our immense presumption of consistency in world phenomena when we really have no basis for asserting uniformity of nature. Read Hume on causality.

    "Belief in the Causal Nexus is superstition" - Wittgenstein

    If knowledge requires certainty, and our method of determining certainty is an infinite regress of reduction, it is like a recursive function which never returns a value.

    Logic is inherently a comparison operator, and if logic is the only mechanism available to us, we are trapped in a purely relational analysis and it is impossible for the mind to conceive of certainty beyond a non-reasonable emotional preordained state of knowing. A feeling so powerful it is personally indubitable.

    The premise of a singularity is invalid since it is inconceivable and the idea that causation is at any time inferred via comparison is fundamentally incomprehensible. If the premise is unclear then the argument is invalid.

    If I say that an airplane functions on fairy dust and you make an argument about propulsion and lift, and you claim that my assertion is wrong because I have not properly performed a rational reduction and that a detail I do not understand could invalidate my entire theory, well guess what neither of us is technically correct to any degree since the test for invalidity applies equally to both of our assumptions. The concept of proof is also an oxymoron.

    Hence, the very concept of phenomenal causation is an oxymoron. The notion of knowledge about the world is a disguise for a coping mechanism based on hope and desire.

  • Hey Amarok, Image Occlusion addon author here. I'd like to meet you in person (can't reply to the comment in which you mentioned me). My email is in my profile.

  • Perceived correlation is usually not actual causation. That should help explain why not.

  • Disagree with the title. Correlation does imply causation a lot of the time (especially for simple systems), but not always. Therefore, the caution is not to assume it a priori, but to pursue further investigation to confirm or reject it. Even when it is rejected, a lot of those cases result in a third variable being the cause behind the correlated "effect" variables.

  • This article starts off with a common mistake.

    It is ok to create the 3 categories:

    > If I suspect that A→B, and I collect data and establish beyond doubt that A&B correlates r=0.7, how much evidence do I have that A→B?

    > you can divvy up the possibilities as:
    > 1. A causes B
    > 2. B causes A
    > 3. both A and B are caused by a C

    So far so good, but here is the problem:

    > Even if we were guessing at random, you’d expect us to be right at least 33% of the time...

    No. It is not valid to assume each possibility is equally likely. If you do so, you are bringing your own assumptions to the problem.

    If you ever find yourself assuming a distribution, pause and consider testing your assumption.
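    To illustrate both sides (the prior below is hypothetical, purely for illustration): guessing uniformly at random really is right about 1/3 of the time no matter how the three cases are actually distributed, but a fixed guess of "A causes B" is right only as often as the true distribution says.

    ```python
    import random

    random.seed(0)
    outcomes = ["A->B", "B->A", "C->both"]
    prior = [0.05, 0.05, 0.90]  # hypothetical skewed distribution of true cases
    n = 100_000

    truths = random.choices(outcomes, weights=prior, k=n)

    fixed_hits = sum(t == "A->B" for t in truths) / n
    random_hits = sum(t == random.choice(outcomes) for t in truths) / n
    # fixed_hits tracks the prior (~0.05); random_hits stays near 1/3.
    ```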
