Online tracking: A 1-million-site measurement and analysis

  • Coauthor here. I lead the research team at Princeton working to uncover online tracking. Happy to answer questions.

    The tool we built to do this research, OpenWPM, is open source (https://github.com/citp/OpenWPM/). We'd love to work with outside developers to improve it and do new things with it. We've also released the raw data from our study.

  • As soon as I saw these APIs being added I immediately dropped into about:config and disabled them. How the hell do these people think this is a good idea to do without asking any permissions?

    Put these in the user.js (or prefs.js) file in your Firefox profile directory:

    user_pref("dom.battery.enabled", false);

    user_pref("device.sensors.enabled", false);

    user_pref("dom.vibrator.enabled", false);

    user_pref("dom.enable_performance", false);

    user_pref("dom.network.enabled", false);

    user_pref("toolkit.metrics.ping.enabled", false);

    user_pref("dom.gamepad.enabled", false);
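    As a minimal sketch, the user_pref lines above could be generated from a table of settings instead of being typed by hand. The helper name render_user_prefs and the PREFS table are hypothetical and exist only for illustration; the preference names are copied from the comment and are not checked against any particular Firefox version.

```python
import json

# Preference names copied from the comment above; values are the ones it suggests.
PREFS = {
    "dom.battery.enabled": False,
    "device.sensors.enabled": False,
    "dom.vibrator.enabled": False,
    "dom.enable_performance": False,
    "dom.network.enabled": False,
    "toolkit.metrics.ping.enabled": False,
    "dom.gamepad.enabled": False,
}


def render_user_prefs(prefs):
    """Render a dict of preferences as user_pref() lines (hypothetical helper)."""
    lines = []
    for name, value in prefs.items():
        # json.dumps produces JavaScript-compatible literals:
        # quoted strings, true/false, and plain numbers.
        lines.append(f"user_pref({json.dumps(name)}, {json.dumps(value)});")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_user_prefs(PREFS), end="")
```

    The output can be pasted into the preferences file by hand; the sketch deliberately does not try to locate or modify a Firefox profile.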

    Here's my full firefox config currently:

    https://up1.ca/#nUSA1WtY13ECfmYC5c825w

    Privacy on the web keeps getting harder and harder. Of course, this should only be used in conjunction with maxed-out ad blockers, anti-anti-adblockers, Privacy Badger, and Disconnect.

    We need browsers to start asking permission. When you install an app on Android or iOS it says "here's what it's going to use, do you want this?". The mere presence of the popup would annoy people enough to deter sites from using these APIs.

  • Google has a vested interest in information leakage. I have a suspicion that the Chromium project expresses a strategic desire to shape the direction of browser development away from stopping those leaks. The idea of signing into the browser with an identity is a core feature and in Google's branded version, Chrome, the big idea is that the user is signed into Google's services.

    Google only pitches the idea of multiple identities in the context of sharing devices among several people: https://support.google.com/chrome/answer/2364824?hl=en and even then doesn't do much to surface the idea. https://www.google.com/search?hl=en&as_q=multiple+identities...

  • This is the kind of nonconsensual, surreptitious user tracking that the EU privacy directive 2002/58/EC concerns itself with, not those redundant, stupid cookie consent overlays.

  • Although the emphasis on the actual abuse of newly introduced APIs is much needed, it is probably important to note that they are not uniquely suited for fingerprinting, and that the existence of these properties is not necessarily a product of the ignorance of browser developers or standards bodies. For the most part, these design decisions were made simply because the underlying features were badly needed to provide an attractive development platform, and introducing them did not make the existing browser fingerprinting potential substantially worse.

    Conversely, going after that small set of APIs and ripping them out or slapping permission prompts in front of them is unlikely to meaningfully improve your privacy when visiting adversarial websites.

    A few years back, we put together a less publicized paper that explored the fingerprintable "attack surface" of modern browsers:

    https://www.chromium.org/Home/chromium-security/client-ident...

    Overall, the picture is incredibly nuanced, and purely technical solutions to fingerprinting probably require breaking quite a few core properties of the web.

  • So... what we need is a browser, which says it supports these things but blocks or provides false data on request and looks as ordinary as possible for "regular" browser fingerprinting.

    Is anyone aware of the existence of one?

  • Colour me unsurprised. Disappointed though.

    I'm glad I disabled WebRTC when I first discovered it could be used to expose local IP on a VPN.

    These "extension" technologies should all be optional plugins. Preferably install on demand, but a simple, obvious way to disable them would be acceptable (i.e., more obvious than about:config).

    Not a great deal can be done about font metrics other than my belief that websites shouldn't be able to ferret around my fonts to see what I have. Not like it's a critical need for any site.

  • NoScript is an all-or-nothing approach. Are there any JS-blockers that allow API-level blocks?

  • All of this makes me wonder how some of these interfaces should be more closely guarded by the user agent.

    Perhaps instead of a site probing for capabilities, they should instead publish a list of what the site/page can leverage and what it absolutely needs to work. Maybe meta tags in the head or something like the robots.txt. Browsers can then pull the list and present it to the end user for white-listing.

    You could have a series of tags similar to noscript to decorate broken portions of sites if you wanted to advertise missing features to users and, based on what features they chose to enable/disable for the site, the browser would selectively render them.
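    The declare-then-whitelist idea above can be sketched in a few lines. Everything here is hypothetical: the function name decide_features, the feature names, and the idea that a site publishes "required" and "optional" lists are assumptions made for illustration, not an existing browser mechanism.

```python
def decide_features(required, optional, allowed):
    """Compare a site's declared features with what the user has whitelisted.

    required: features the site says it cannot work without
    optional: features the site can use but does not need
    allowed:  features the user has enabled for this site
    """
    required, optional, allowed = set(required), set(optional), set(allowed)
    declared = required | optional
    return {
        # Features the browser would expose to the page.
        "granted": sorted(declared & allowed),
        # Features the page asked for but the user refused.
        "denied": sorted(declared - allowed),
        # The site is usable only if every required feature was allowed.
        "usable": required <= allowed,
    }


if __name__ == "__main__":
    result = decide_features(
        required=["canvas"],
        optional=["audio", "battery"],
        allowed=["canvas", "audio"],
    )
    print(result)
```

    In this sketch the "denied" list is what a noscript-style fallback tag could key on to explain which parts of the page are disabled.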

  • So given this information, how can we poison the results that the trackers get?
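    One way to picture poisoning, as a rough sketch: if a tracker hashes a set of browser readings into an identifier, then replacing a few readings with fresh random values in each session gives a different identifier every time. The attribute names, the hashing scheme, and the helpers fingerprint and poisoned are all assumptions for illustration; real trackers may combine or match signals quite differently, and random values may themselves stand out.

```python
import hashlib
import json
import random


def fingerprint(attributes):
    """Hash a dict of browser attributes into a short identifier (assumed scheme)."""
    canonical = json.dumps(attributes, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def poisoned(attributes, rng, noisy_keys=("canvas_hash", "audio_hash")):
    """Return a copy in which the chosen readings are replaced by random values."""
    spoofed = dict(attributes)
    for key in noisy_keys:
        if key in spoofed:
            spoofed[key] = f"{rng.getrandbits(64):016x}"
    return spoofed


# Made-up readings for a single browser.
real = {
    "user_agent": "ExampleBrowser/1.0",
    "timezone": "UTC",
    "canvas_hash": "ab12",
    "audio_hash": "cd34",
}

# Two sessions, each with its own random source (seeded here for repeatability).
session_a = poisoned(real, random.Random(1))
session_b = poisoned(real, random.Random(2))

if __name__ == "__main__":
    print("unmodified:", fingerprint(real))
    print("session A: ", fingerprint(session_a))
    print("session B: ", fingerprint(session_b))
```

    The unmodified readings always hash to the same value, which is what makes them trackable; the two poisoned sessions do not match each other or the original.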

  • Some methods of fingerprinting are probably used to distinguish between real users and bots. Bots can use patched headless browsers that masquerade as desktop browsers (for example, as the latest Firefox or Chrome running on Windows). Subtle differences in font rendering or missing audio support can help detect the underlying libraries and platform. Hashing is used to hide the exact matching algorithm from scammers.

    There are a lot of people trying to make money by clicking ads with bots.

    Edit: and by the way disabling JS is an effective method against most of the fingerprinting techniques.

  • What annoys me the most is how many useless cycles these trackers use to track me.

  • WebRTC guys get around this by stating fingerprinting is game over, so don't even bother. They ignore that they are going against the explicitly defined networking (proxy) settings. Browsers are complicit in this. If the application asks "should I use a proxy", then ignores it, silently, wherever it wants, that's deceptive and broken.

    There's still zero (0) use cases to have WebRTC data channels enabled in the background with no indicator.

    If all these APIs are added, the web will turn into a bigger mess than it is. They can't prompt for permissions too much. So they'll skip that, like WebRTC does.

  • Seems like browsers should ask the user's permission to use these HTML5 features. Then whitelist. For example, a site that does nothing with audio should be denied access to the audio stack.

  • I think it's time for HTML--, which would contain no active content at all and simply be a reflowable document display format.

  • There is an acceptable tradeoff between pseudo-anonymous access through browsers and non-anonymous access through native apps.

    To interpret this research as a reason for crippling the web or browsers would be a giant mistake. Crippling browsers will only work against users, who will then be forced by companies into installing apps.

    Two popular shopping companies in India did exactly this: they completely abandoned their websites and went native-app only. This, combined with the large set of permissions requested by apps, led to a worse privacy experience for consumers. As the announcement of Instant Apps at Google I/O demonstrates, the web as an open platform is in peril, and its demise will only be hastened by blindly adopting these types of recommendations.

    Essentially, the web as an open platform will be destroyed in the name of perfect privacy, only to be replaced by inescapable walled gardens. Consider instead that the web allows a motivated user to employ evasion tactics while still offering usability to those who are not interested in privacy. Native apps, where Apple needs a credit card on file to install, offer no such opportunity.

    I am happy that Arvind (author of the paper) in another comment recommends a similar approach:

    """ Personally I think there are so many of these APIs that for the browser to try to prevent the ability to fingerprint is putting the genie back in the bottle. But there is one powerful step browsers can take: put stronger privacy protections into private browsing mode, even at the expense of some functionality. Firefox has taken steps in this direction https://blog.mozilla.org/blog/2015/11/03/firefox-now-offers-.... Traditionally all browsers viewed private browsing mode as protecting against local adversaries and not trackers / network adversaries, and in my opinion this was a mistake. """

    https://news.ycombinator.com/item?id=11730373

  • Over 3,000 top sites using the font technique, and from the description this sounds really wasteful (choosing and drawing in a variety of fonts for no reason other than to sniff out the user).

    Each font is probably associated with a non-trivial caching scheme and other OS resources, not to mention the use of anti-aliasing in rendering, etc. So a web page, doing something you don’t even want, is able to cause the OS to devote maybe 100x more resources to fonts than it otherwise would?

    A simple solution would be to set a hard limit, such as “4 fonts maximum”, for any web site; and, to completely disallow linked domains from using more.
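    A hard per-site limit like the one proposed above might look roughly like this. The class name FontBudget and its interface are hypothetical; the sketch assumes the budget is keyed by the top-level site, so that scripts from linked domains draw on the same allowance rather than getting their own.

```python
class FontBudget:
    """Track how many distinct font families each top-level site has used."""

    def __init__(self, limit=4):
        self.limit = limit
        self._used = {}  # top-level site -> set of normalized font family names

    def request(self, site, family):
        """Return True if the page may use this font, False if over budget."""
        fonts = self._used.setdefault(site, set())
        key = family.strip().lower()
        if key in fonts:
            return True  # already counted; reusing a font is free
        if len(fonts) >= self.limit:
            return False  # budget exhausted; the browser would fall back to a default
        fonts.add(key)
        return True


if __name__ == "__main__":
    budget = FontBudget(limit=4)
    for name in ["Arial", "Georgia", "Verdana", "Courier New", "Comic Sans MS"]:
        print(name, budget.request("shop.example", name))
```

    A script that probes a long list of fonts would get real answers for only the first four distinct families under this scheme.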

  • After reading this it makes me want to disable JavaScript entirely, along with cookies, and go back to text browsing. I've been using Ghostery on my phone, it's been pretty good.

  • Whoa, what's the use case for exposing battery information?

  • Of course this is something you do. Throw it together with all of the other information you can glean from a browser (referrer, IP) and you can get a match with a very high confidence level.

    Shops can do the same with baskets, you find that people are either identified by one very rare feature which reoccurs often or their little graph of 4-5 items which correlate 99% to them.
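    The arithmetic behind "a few attributes add up to a confident match" can be sketched with surprisal: an attribute value shared by a fraction p of users carries -log2(p) bits of identifying information, and the bits add up if the attributes are independent. The frequencies below are invented for illustration and the independence assumption rarely holds exactly in practice, so treat the result as an upper bound on the idea rather than a measurement.

```python
import math


def identifying_bits(frequencies):
    """Sum of -log2(p) over attributes, assuming the attributes are independent."""
    return sum(-math.log2(p) for p in frequencies.values())


# Invented example: fraction of users sharing each observed value.
observed = {
    "user_agent": 1 / 64,
    "timezone": 1 / 16,
    "font_list": 1 / 1024,
    "screen": 1 / 32,
}

bits = identifying_bits(observed)

if __name__ == "__main__":
    print(f"{bits:.1f} bits, roughly one user in {2 ** bits:,.0f}")
```

    With these made-up numbers, four individually unremarkable attributes combine to 25 bits, enough in principle to single out one user among tens of millions.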

  • All these things make the websites the new apps. Most probably we won't need to use many desktop applications a few years later.

  • If you want to see a live demo of all the ways your browser can fingerprint you, this is a great website: https://www.browserleaks.com/

  • Since the original web based ad campaigns were launched we have been tracked. Serious web analytics companies know these tactics already.

    So what exactly is the research contribution being made here? What's new and interesting?

  • I think it's similar to how the Absolute Computrace rootkit identifies Android and Lenovo devices. Each hardware component has a unique ID, like your Ethernet, Bluetooth, even microphones and batteries.

  • Would it be more secure to use Tor than a traditional browser? The only drawback is the longer RTT.

  • Malware filtering is needed.

  • Ahhh. Remember when this was just a Flash problem, and getting rid of Flash was going to rid the world of evil?

    Spoiler: that didn't happen.

  • All this so they can show me ads for stuff I already bought.

  • Well, who would have guessed. Surprise surprise.

    The web is such a shit technology.