Facebook Fabric Networking Deconstructed
Thanks @virtuallynathan, @sargun, @tw04. Yes, indeed, the fabric switches in the MDF appear to be Arista 7304s. Once you know the numbers you're looking for, you can work backwards to calculate the appropriate aggregation. We know we need 192 (40GE) ports for the ToRs and 192 for the Spine. In the photo it looks like 60 ports are wired per switch, but at a max of 128 40GE ports per chassis, we would need 3 fully loaded chassis per server Pod (about 285 for 95 Pods). The real trick with Fat-Trees is that you do the port reservations up front and build out the wiring plant knowing your upper limits, to avoid mass rewiring tasks.
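A quick back-of-envelope of that same arithmetic, using only the numbers in the paragraph above (nothing here comes from any Facebook documentation):

```python
# Port budgeting for one server Pod, using the figures quoted above.
TOR_PORTS     = 192   # 40GE ports facing the ToRs in one Pod
SPINE_PORTS   = 192   # 40GE ports facing the spine planes
CHASSIS_PORTS = 128   # max 40GE ports in one fully loaded chassis
PODS          = 95

ports_per_pod   = TOR_PORTS + SPINE_PORTS              # 384 ports
chassis_per_pod = -(-ports_per_pod // CHASSIS_PORTS)   # ceiling division -> 3
total_chassis   = chassis_per_pod * PODS               # 285

print(chassis_per_pod, total_chassis)                  # 3 285
```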
@sargun as you point out, there are large scaling challenges with L2 networks due to the flat addressing and the O(N^2) learning required (even though there are plenty of hacks in place here). L2 networks also offer only limited, difficult-to-scale policy controls, which makes it hard to build differentiated services on them. We have known for a while that L2 networks do not scale very well, but we have been held back by the need to support link-local communication for legacy applications. Once you take this fundamental constraint off the table, L3 networks offer a far superior solution, even though there are still challenges with isolation and mobility.
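A rough sketch of the state-scaling argument, with made-up host and rack counts purely for illustration: in a flat L2 domain every switch can end up learning nearly every MAC, while an L3 design can summarize each rack behind one prefix.

```python
# Illustrative only: the host/rack/switch counts below are invented.
hosts_per_rack = 40
racks          = 2000
switches       = racks + 200           # ToRs plus a fabric/spine tier (hypothetical)

hosts = hosts_per_rack * racks

l2_entries = switches * hosts          # flat L2: every switch may learn every MAC
l3_entries = switches * racks          # routed L3: one summarized prefix per rack

print(f"flat L2 : ~{l2_entries:,} MAC entries across the fabric")
print(f"routed L3: ~{l3_entries:,} prefixes across the fabric")
```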
That being said, I don't believe Facebook builds applications that require link-local non-routables, or that require an affinity between applications and the faux host identity (e.g. the IP address). As you point out, there is a tight coupling between the MAC address, IP address and point of attachment due to a flawed model that has existed since IP split from TCP in version 4 of the protocol (i.e. the one in common use today). All of the monkey-patching (TRILL, SPB, LISP, VXLAN, NVGRE, etc.) has been to deal with this flawed model.
For those existing applications that do suffer from this affinity, the solution du jour has been to use encapsulation protocols (i.e. network virtualization) to solve mobility (the Loc/ID split) while also improving isolation by adding a new network namespace (e.g. ContextID, VNI, switch name, etc.). This could have been avoided if we hadn't lost the inter-networking layer of the stack (see John Day's work for an explanation).
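A minimal sketch of that Loc/ID split idea, with all names and addresses invented for illustration: the identifier the workload binds to (VNI plus inner address) stays stable, while a mapping table binds it to its current locator (the outer tunnel endpoint); migration only rewrites the mapping.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Identity:
    vni: int          # the added namespace (VNI / ContextID / ...)
    inner_addr: str   # the address the application actually binds to

# mapping table: identity -> current locator (outer tunnel endpoint)
mapping = {
    Identity(vni=5001, inner_addr="10.0.0.7"): "192.0.2.11",   # tunnel endpoint A
}

def locate(identity: Identity) -> str:
    """Resolve where to send the encapsulated traffic for this identity."""
    return mapping[identity]

# A migration only rewrites the locator; the application-visible identity,
# and therefore its connections inside the overlay, are untouched.
mapping[Identity(5001, "10.0.0.7")] = "198.51.100.23"          # tunnel endpoint B
print(locate(Identity(5001, "10.0.0.7")))
```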
Even if you build a Fat-Tree network as devised by Charles Leiserson (building on Charlie Clos), you have to realize that, because of the statistical nature of communications, shared resources and pathologies related to out-of-order packets, saturating the bisection is extremely difficult. Some studies show that latency goes exponential at just 40% offered load. In building any network, the ability to maximize throughput depends on the topology first, then routing and flow control. Fat-Trees are designed to maximize the bisection for the worst-case pairs permutation (i.e. each source communicating with a destination across the min-cut) and as such can waste a proportion of the network's capacity depending on the workload. Again, a longer conversation :)
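For reference, the textbook k-ary fat-tree numbers (the Al-Fares-style construction from k-port switches) give a feel for how large the full bisection is on paper; this is the classic topology, not necessarily Facebook's exact parameters.

```python
def fat_tree(k: int, link_gbps: float = 40.0):
    """Classic k-ary fat-tree built from k-port switches."""
    hosts     = k ** 3 // 4            # k pods * (k/2 edge switches) * (k/2 hosts each)
    core      = (k // 2) ** 2          # core switches
    bisection = hosts / 2 * link_gbps  # full bisection: half the hosts talking to the other half
    return hosts, core, bisection

for k in (16, 32, 48):
    hosts, core, bis = fat_tree(k)
    print(f"k={k:2d}: {hosts:6d} hosts, {core:4d} core switches, "
          f"{bis / 1000:7.1f} Tb/s full bisection")
```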
Great work! Very informative.
This photo may change some of your assumptions: https://www.facebook.com/AltoonaDataCenter/photos/pb.4403012...
In later photos, the ToRs appear to be Arista as well. It also appears they tend to have only 24 servers/rack, so a non-blocking ToR could be a 48x10G switch, although I suppose the 4x40 could be used as 16x10G here with the 48-port ToR.
EDIT: Scratch that, I forgot about the OpenCompute design; it looks like they have 42-48 servers per rack in some cases.
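A back-of-envelope oversubscription check for the two ToR configurations being discussed, assuming 10G server links and the 4x40G of uplink mentioned above:

```python
def oversub(servers: int, server_gbps: float, uplink_gbps: float) -> float:
    """Ratio of downlink capacity to uplink capacity at the ToR."""
    return (servers * server_gbps) / uplink_gbps

# 24 servers on a 48x10G ToR, remaining 24x10G used as uplinks -> non-blocking
print(oversub(24, 10, 24 * 10))   # 1.0
# 42-48 servers behind 4x40G of uplink -> roughly 2.6:1 to 3:1
print(oversub(48, 10, 4 * 40))    # 3.0
```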
A couple of notes: misspelling: "end-dhosts"
Probably why they went with layer 3: http://sargun.me/a-critique-of-network-design-1
I'm very interested in their Fastpass research, how it'll allow for better utilization of the network, and whether it'll actually be able to properly utilize the full bisection bandwidth, or even 25% of it.
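For anyone unfamiliar: the core idea in Fastpass (Perry et al., SIGCOMM 2014) is a centralized arbiter that assigns packets to timeslots (and paths) so the fabric never queues. The sketch below is only a heavily simplified illustration of the timeslot part, using a greedy conflict-free matching per slot; real Fastpass uses a more careful allocation algorithm plus path selection.

```python
from collections import deque

def schedule(demands, timeslots):
    """demands: list of (src, dst) packet requests; returns slot -> [(src, dst), ...]."""
    queue, plan = deque(demands), {}
    for slot in range(timeslots):
        busy_src, busy_dst, chosen, deferred = set(), set(), [], deque()
        while queue:
            src, dst = queue.popleft()
            if src in busy_src or dst in busy_dst:
                deferred.append((src, dst))       # conflicts in this slot, retry later
            else:
                busy_src.add(src); busy_dst.add(dst); chosen.append((src, dst))
        plan[slot] = chosen
        queue = deferred
        if not queue:
            break
    return plan

demands = [("A", "C"), ("B", "C"), ("A", "D"), ("B", "D")]
print(schedule(demands, timeslots=4))
# slot 0: A->C, B->D ; slot 1: B->C, A->D  (two conflict-free matchings)
```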