Data diffs: Algorithms for explaining what changed in a dataset

  • This is a long-known issue in BI. You can report either as is (i.e. the latest version of the data) or as was (if you store the history, for example as deltas). A good example is subscriptions: someone pays for an annual subscription on the 1st of January 2021, but cancels halfway through and gets a refund. The two views already give different outcomes for that one subscription.

    Meta, Twitter and other advertising networks overwrite data. In a past analysis I oversaw, the numbers differed by anywhere from 25 % to 300 %. This can happen for a variety of reasons. For example, the report you pull tells you you had 1000 impressions, but two weeks later when you download the same report you find out it was actually 900 (100 were removed by bot filtering applied after the fact).

    Perhaps the biggest culprit is Google with conversions. For some reason they attribute conversions to the date the converting user saw the ad. This means that if that person purchases your product within 28 days of first seeing the ad, Google goes back and updates the conversion count for the day they saw the ad.
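
To make the as-is / as-was distinction concrete, here is a minimal sketch, assuming you keep every change as an append-only delta that records both the day a metric is attributed to and the day you learned about it. The `Delta` type, the `report` function, and the dates and amounts are all hypothetical, made up to mirror the impressions and conversions examples above; this is not how any particular ad network or BI tool stores its data.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Delta:
    event_date: date    # the day the metric is attributed to
    recorded_on: date   # the day we learned about this change
    metric: str
    amount: int         # signed change, e.g. -100 for a retroactive correction


def report(log: list[Delta], metric: str, event_date: date,
           as_of: Optional[date] = None) -> int:
    """Sum the deltas for one metric and one attributed day.

    as_of=None gives the "as is" view (everything known today);
    as_of=<date> gives the "as was" view (only what was known by that date).
    """
    return sum(
        d.amount
        for d in log
        if d.metric == metric
        and d.event_date == event_date
        and (as_of is None or d.recorded_on <= as_of)
    )


# Hypothetical data mirroring the examples in the comment above.
log = [
    Delta(date(2021, 3, 1), date(2021, 3, 2), "impressions", 1000),
    # Two weeks later the network filters out 100 bot impressions.
    Delta(date(2021, 3, 1), date(2021, 3, 16), "impressions", -100),
    # A purchase on 21 March is credited back to the day the ad was seen.
    Delta(date(2021, 3, 1), date(2021, 3, 21), "conversions", 1),
]

ad_day = date(2021, 3, 1)
as_was = report(log, "impressions", ad_day, as_of=date(2021, 3, 2))
as_is = report(log, "impressions", ad_day)
print(as_was, as_is)  # 1000 900
```

The point of keeping `recorded_on` separate from `event_date` is that a source which overwrites its own history can still be reported "as was", as long as you snapshot or diff each pull yourself instead of trusting the latest download.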