Pingora, the proxy that connects Cloudflare to the Internet
> When crashes do occur an engineer needs to spend time to diagnose how it happened and what caused it. Since Pingora's inception we’ve served a few hundred trillion requests and have yet to crash due to our service code.
> In fact, Pingora crashes are so rare we usually find unrelated issues when we do encounter one. Recently we discovered a kernel bug soon after our service started crashing. We've also discovered hardware issues on a few machines, in the past ruling out rare memory bugs caused by our software even after significant debugging was nearly impossible.
That's quite the endorsement of Rust. A lot of people focus on the fact that Rust can't absolutely guarantee freedom from crashes and memory safety issues, which I think misses the point: this kind of experience, running high-traffic Rust services in production for months with barely a single issue, is common in practice.
For any of the Cloudflare team that frequents HN, curious if you have an eventual plan to open-source Pingora? I recognize it may stay proprietary if you consider it to be a differentiator and competitive advantage, but this blog post almost has a tone of "introducing this new technology!" as if it's in the cards for the future.
I'm mildly blown away to read, 'And the NGINX community is not very active, and development tends to be “behind closed doors”.' Is this a reflection of the company, nginx (now owned by F5), going the way of an Oracle-style takeover of WebLogic in another era?
We did the same. We've replaced nginx/lua with a cache server (for video) written in Golang - now serving up to 100 Gbit/s per node. It's more CPU and memory efficient and completely tailored to our needs. We are happy that we moved away from nginx.
Huge congratulations to the tokio.rs team: the async runtime has proven to work well even in such a demanding project.
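For anyone who hasn't used tokio, here's a toy sketch of the pattern against its public API (nothing to do with Pingora's actual code; the addresses are placeholders I made up): every accepted connection becomes a cheap task scheduled on the multi-threaded, work-stealing runtime, rather than an OS thread or process.

```rust
// Cargo.toml: tokio = { version = "1", features = ["full"] }
use tokio::io::copy;
use tokio::net::{TcpListener, TcpStream};

#[tokio::main] // multi-threaded, work-stealing runtime by default
async fn main() -> std::io::Result<()> {
    // Placeholder addresses, for illustration only.
    let listener = TcpListener::bind("127.0.0.1:8080").await?;
    loop {
        let (mut inbound, _peer) = listener.accept().await?;
        // Each connection is a lightweight task, not a thread or process.
        tokio::spawn(async move {
            if let Ok(mut upstream) = TcpStream::connect("127.0.0.1:9000").await {
                let (mut ri, mut wi) = inbound.split();
                let (mut ru, mut wu) = upstream.split();
                // Naive byte pass-through in both directions until either side closes.
                let _ = tokio::try_join!(copy(&mut ri, &mut wu), copy(&mut ru, &mut wi));
            }
        });
    }
}
```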
I share a lot of the feelings towards NGINX that Cloudflare mentions in this blog post. New features like 103 Early Hints and HTTP/3 exist in HAProxy and Caddy, but there is nothing coming in NGINX.
Does anyone know why nginx used separate processes for workers, instead of threads? This post makes it sound like threads are the way to go, but presumably nginx had a reason for using processes back in the day.
What are HTTP status codes greater than 599 used for in practice?
It'd be interesting to see another Cloudflare blog post that just goes into detail on the weird protocol behaviour they've had to work around over the years. I imagine they have more insight into this than pretty much any other organisation on the planet.
Did you guys consider HAProxy? I've only ever heard good things about it - particularly stability (though it probably can't beat Rust), performance, and configurability.
Great write-up!
Would any Cloudflarer involved in this project mind sharing some basic metrics, like LOC, team size, and how long it took from design to first deployment?
Just curious.
In the 3rd-party section there's no mention of HAProxy as a candidate; any specific reason for that?
Sounds good. I never encountered any performance issues with Cloudflare.
If you have time for enhancements, then:
1. An option to hit the cache before Workers (this is why we never use Workers).
2. Rules for blocking traffic at night (time-based rules).
3. Make sure every product is a full replacement: if you offer the same thing as a cloud provider, don't make us write a lot of custom code.
Wow, this is just what I was looking for: a proxy written in a memory-safe language like Rust, with no GC, as an alternative to nginx. Looking forward to the open-source version!
Anyone else immediately Ctrl-F for "open source"? That's all I wanted to read, but I bookmarked the article and put it on my list of things to peruse later.
Was Go considered as the language to write Pingora in? If so, why was Rust chosen?
The post mentions tokio, but I'd be curious to see whether it uses tower or something similar built in-house. For our product (caido.io) we also built a custom HTTP parser, so if you open-source the tool it would be nice to split the parsing into its own crate, giving us an alternative to hyper that can understand malformed requests.
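In case it helps anyone picture what "tower or something similar" means: tower's core abstraction is the Service trait, essentially an async function from request to response. A toy sketch below, with made-up request/response types and logic that have nothing to do with Pingora; the appeal is that middleware like timeouts, retries, and load balancing compose as wrappers around the same trait.

```rust
// Cargo.toml: tower = { version = "0.4", features = ["util"] },
//             tokio = { version = "1", features = ["full"] }
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};
use tower::Service;

// Hypothetical request/response types, for illustration only.
struct ProxyRequest {
    path: String,
}
struct ProxyResponse {
    status: u16,
}

// A toy proxy "phase" expressed as a tower Service.
struct UpstreamPicker;

impl Service<ProxyRequest> for UpstreamPicker {
    type Response = ProxyResponse;
    type Error = std::io::Error;
    type Future = Pin<Box<dyn Future<Output = Result<Self::Response, Self::Error>> + Send>>;

    fn poll_ready(&mut self, _cx: &mut Context<'_>) -> Poll<Result<(), Self::Error>> {
        Poll::Ready(Ok(())) // always ready in this toy example
    }

    fn call(&mut self, req: ProxyRequest) -> Self::Future {
        Box::pin(async move {
            // Real logic would pick an upstream and forward the request;
            // here we just fake a status code.
            let status = if req.path.starts_with("/health") { 200 } else { 502 };
            Ok(ProxyResponse { status })
        })
    }
}

#[tokio::main]
async fn main() {
    use tower::ServiceExt; // brings `oneshot` into scope
    let resp = UpstreamPicker
        .oneshot(ProxyRequest { path: "/health".into() })
        .await
        .unwrap();
    println!("status = {}", resp.status);
}
```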
Really curious, are they using async/await?
Besides comparing this to Nginx plus Lua (OpenResty), has Cloudflare compared it to HAProxy plus Lua or any other similar proxies?
The main issue for me with Rust is that it takes significantly more resources (time, space, memory, CPU) to build projects from source. Building HAProxy is comparatively quick and easy.
The static HAProxy-plus-Lua binary (musl, no PCRE) I use is already growing rather large. I'll bet that Pingora binaries, even using shared libraries, will be at least twice, maybe three times, the size.
Maybe they could find the time to allow RSS readers to read the HTML of this post. I guess using RSS means you are attacking their infrastructure.
I wonder how this is deployed to what is presumably a large number of hosts. Do you build a distribution package out of your Rust build and ship that? If so, what about the Rust standard library? I believe some distributions do provide a package for it, but that means one also has to use the packaged rustc/cargo, which tends to lag behind quite a bit.
Is it open source?
They don't say much about why not Envoy. It would be interesting to hear whether there were concerns with it.
> Our Rust code runs more efficiently compared to our old Lua code.
What a surprise: replacing an interpreted dynamic language with an AOT-compiled static language leads to performance improvements.
I guess the lessons from using Tcl as a configuration language for Apache-based proxies 20 years ago were lost on newer generations.
Why did Google never try to buy Cloudflare?
Should have waited to post this until it was actually ready to be open sourced. Otherwise this is just kinda like "huh, neat" without anything else to do with it.