Applying Textbook Data Structures for Real Life Wins

  • A few years back I had a traveling salesperson problem, I googled a bit, and somehow didn't find anything interesting, so I just created a random loop, and randomly mutated it until it got better. A few months ago I discovered a neat little heuristic: the best loop never self-intersect. And a few days ago, I found that there is a definite algorithm that gives a result whose worst case is bounded WRT to the optimal.

    My question is how do you find all those algorithms? Wikipedia never really feel like a good introductory nor discovery place. In particular, some problems have been perfectly studied by scholars, but you don't find them because you have not found the keywords that will direct google in your search. And I am not a researcher, so I don't keep a tab on a domain, I'm a jack of all trades, I work on a wide set of things.

  • Hey, author here! I've been working on infrastructure and database-y things at Heap for the last couple years, ranging from improving Postgres performance and availability to building out services (like this one!) and refactoring, encapsulating, and optimizing core systems.

    I'll try to answer any questions about the post (technical or otherwise).

    Edit: added more details about me.

  • People are so jaded about whiteboarding algorithms/data structures that it's great to see bonafide examples of their use in production outside of Skiena's war stories. Great content.

  • Somehow "4 hours to 15 minutes" sounds way more impressive than 15x speed up, at least to me. From the article, it's unclear how the hashing mechanism you use actually approximates rank at all; most hashes I'm familiar with are essentially one way functions, ideally preserving no information about the input at all. How are you hashing ID's so that unions result in smaller hashes?

  • I feel like, 25 years post-undergrad and over 10 years post grad school, I should go back and re-read some data structures classics: I spend all of my time running database queries and debugging web service calls and only rarely dipping into real algorithmic optimization. I wonder how much performance optimization I'm leaving on the table because my data structures are so rusty.

  • Some of the more advanced postgres SQL blog posts over the years have come from the Heap team. I'm curious what their good enough, optimized postgres solution was that they replaced with this project. :)

  • Biggest and most challenging trend is to properly devise a data design schema for:

    1. an appropriate privacy and deanonymization level to each data item in a structure (corporate privacy, trade secret, account ID)

    2. an appropriate privacy level to a tuple of data items (i.e., PII; name, ID, birthdate)

    3. And a security context for each class of end-users (support, engineering team, marketing) to use what amount of and degree of deanonymized data items.

  • I don't see the point of hashing the key instead of picking the lexicographically smaller key. It will have the same effect and comparison is O(n), same as hashing.

  • It's a good way to capture user tracking data and be precise in knowing exactly who is doing what on site or mobile app. Hopefully this data stucture can be extended to completely remove every bit of tracking information collected about a user, so that it can comply with GDPR and respect privacy.

    Be careful in tracking users, it may fall foul of GDPR and GDPR compliance. If your company is using analytics like this, it will be almost impossible to remove personally identifiable tracking information from website or mobile app platform.

    This means it will become a nightmare to remove the user tracking information from every system and system logs. This is one of the reasons most solutions in USA are ill-suited for Europe. Hopefully USA can really start respecting privacy and build systems which keeps privacy on top.

    Given most admired companies are built based on invasion of privacy (facebook, google, amazon, netflix), I doubt there is a will to get rid of privacy invading technologies from core. I see some efforts by Apple, but than the larger app eco-system still rely on trading privacy for some free apps or utilities will be hard to go.