The Risky Eclipse of Statisticians

  • In my experience, Data Science isn't replacing statistics; it's replacing Business Intelligence. And a lot of that is driven by PR and marketing departments that saw that business intelligence was boring and trapped them in the IT department, so they needed new buzzwords. You can see the relationship on Google Trends as well, and it's stronger than the one with 'statistician', which looks flat rather than declining:

    https://www.google.com/trends/explore#q=data%20science%2C%20...

  • Initially I was reluctant to associate myself with "data science" because it might be just a passing fad. It still might be, but for now I have started using that label.

    I started in industry as "an applied or pragmatic statistician", that is, someone trained in social science research with a strong quantitative methodology bias. As I went along I added focus group moderation, in-depth interviewing, competitive analysis, ROI analysis and strategy consulting... so I started calling what I do "Research-based Consulting."

    But that label doesn't seem to quite capture building taxonomies and text indexing systems or doing latent semantic analysis. Nor does "Research-based Consulting" capture teaching myself web development in order to create data-focused web applications. And what about all the database work that I do in operational systems? Or how do I fit in things like managing and validating data collection and aggregation systems that track prices for ~10K SKUs across multiple retail websites, combine them in a weighted algorithm that reflects my client's business priorities, and drive thousands of automated transactions every day?

    So even though I came from a background with a lot of grad-level statistical training, and even at one point somewhat identified as a statistician, it feels like current definitions of "data scientist" capture more of what I actually do. So I have come to be at peace with the term.

    I totally agree with the points in the article about a multi-disciplinary team. I would love to recruit people who are better than me at each sub-discipline and figure out how to help them work together.

  • The joke that I think I saw here first:

    A data scientist is a statistician who lives in San Francisco.

  • It's just a name. "Data science" still requires statistics, and anyone who calls themselves a data scientist was (hopefully) trained in statistics at one point or another, whether formally or not. For those who don't use statistics properly, well, they aren't doing data science properly either. It's all statistics.

  • Rob Tibshirani at Stanford had a funny slide about this phenomenon when all the ML researchers started getting more grant funding than statisticians:

    http://statweb.stanford.edu/~tibs/stat315a/glossary.pdf

  • What does the job title actually matter where the t-test hits the SQL query? Nobody is morally opposed to well-educated professionals who can actually code to save their lives.

  • The guy quoted as saying "massive datasets ... has not yet become part of mainstream statistical science" has been teaching a statistical computing class[1] for PhDs for about 10 years now.

    It's worth reading the president of the American Statistical Association's take[2] on the stats - data science divide as well.

    [1]: http://stat.wharton.upenn.edu/~buja/STAT-541/
    [2]: http://magazine.amstat.org/blog/2013/07/01/datascience/

  • >___> I'm going for a master's in statistics and thought the premise at the beginning was weak at best.

    And then toward the middle of it, it basically said data science is a marriage between stats and comp sci.

    Congrats, I did comp sci as an undergraduate.

    I think it's weak at best. Data science is a jack of all trades and master of none.

    I disagree that it's just CS and stats. I know it might be pedantic, but it's also ML, which involves math that is a bit more than stats. Math is either right or wrong. With stats you can kind of bend things so that it's close enough.

    Some may ask what the difference is. This article doesn't address the difference, but from what I've gathered, a neural network will categorize stuff for you, likewise with KNN, but you won't gain insight into WHY it is categorized that way, and this is where statistics can tell you why. From lurking in the subreddit /r/statistic, Bayesian methods will tell you why but NN will not.

    You still need statistics. It's just that this is a new field and many people don't have the depth to grasp what's important.

    It's like hyping up a NoSQL database, promising many things, and getting people to adopt it. Eventually they'll realize that it's just broken promises and they're stuck with it. In this case, the industry can just get smarter and have a better idea of what it really needs.

  • > Why Didn’t Statisticians Own Big Data?

    Because they could not translate their sense of entitlement to actual results.

    > pure statisticians often scoff at the hype surrounding the rise of data scientists in the industry

    > some statisticians simply have no interest in carrying out scientific methods for business-oriented data science

    Statisticians are often too careful. They let tests decide if they should continue on a certain path. Machine learning researchers run blindfolded and trust cross-validation. The latter, though reckless, gets more impressive results.

    You can perfectly well be a data scientist coming from a statistics or physics background. Adapt to it and use your knowledge to your advantage. You can't keep calling yourself a statistician and own data science at the same time. Start automating yourselves, like the rest of us are.

  • Reminds me of this talk at SciPy 2015: https://www.youtube.com/watch?v=TGGGDpb04Yc

  • There's a mistake in the infographic-like table named "Big Data Quantified".

    It says "72 hours of new video uploaded to YouTube every day".

    Actually, "300 hours of video are uploaded to YouTube every minute".[1]

    The source of the infographic got the "minute" part right. (And probably the matter of 72-->300 is growth since the original infographic was produced... Web citations are hard!)

      1. http://www.youtube.com/yt/press/statistics.html

  • Data science needs statisticians. Big data and machine learning courses do not teach about such things as endogeneity, confounding variables, sampling bias, etc., and how to deal with them. Data scientists who do not understand these things could end up making big mistakes: https://youtu.be/0cizsKDn3TI
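
    To make the confounding point concrete, here is a minimal sketch (my own toy simulation, not taken from the article or the linked video). It assumes a hidden variable z that drives both x and y, while x has no effect on y at all. The names simulate, slope and residuals are made up for illustration. A naive regression of y on x should still report a slope near 1, and the slope should drop to roughly 0 once z is adjusted for.

```python
import random


def slope(x, y):
    """Ordinary least squares slope of y on x (model with an intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def residuals(x, y):
    """What is left of y after regressing it on x (with an intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = slope(x, y)
    return [(yi - my) - b * (xi - mx) for xi, yi in zip(x, y)]


def simulate(n=20000, seed=0):
    """Toy data (assumed setup): z confounds x and y; x does not cause y."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]      # hidden confounder
    x = [zi + rng.gauss(0, 1) for zi in z]       # x depends only on z
    y = [2 * zi + rng.gauss(0, 1) for zi in z]   # y depends only on z

    # Naive analysis: regress y on x and ignore z entirely.
    naive = slope(x, y)

    # Adjusted analysis: remove z from both x and y, then regress the
    # leftovers. This equals the coefficient on x in the regression
    # y ~ x + z (the Frisch-Waugh-Lovell result).
    adjusted = slope(residuals(z, x), residuals(z, y))
    return naive, adjusted


if __name__ == "__main__":
    naive, adjusted = simulate()
    print(f"naive slope of y on x:   {naive:.3f}")
    print(f"slope after adjusting z: {adjusted:.3f}")
```

    In this setup cov(x, y) = 2 and var(x) = 2, so the naive slope is expected to sit around 1 even though changing x would do nothing to y. That gap is the kind of mistake the comment above is warning about.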