Also speaking from direct personal experience. I have an API SaaS (which, yes, I realize, as a B2B product, has a much smaller audience than a B2C platform, and so probably much lower reads per second... but I want to just talk about writes here, since reads and aggregates-of-reads get cached on so many other levels.)
We have 1.5 trillion distinct rows living on a PG cluster right now. Growing at a higher rate. And that requires... four DB servers (four shards). And no replicas — all our prod query traffic hits these servers directly, as a mixed workload with the writes from our queue-consumer ETL agents.
(The servers are "beefy", sure — but they're not that beefy. They've got 128 cores and 1TB of RAM each. That's not even top-of-the-line these days.)
Despite serving a whole ton of QPS, these DB servers are not CPU-bound or memory-bound or even IO-bound. Our primary horizontal-sharding factor, in fact, is PCIe lanes — we're constrained mainly by the number of NVMe drives we can keep connected performantly per CPU [1] — and thereby, generationally, by "fast storage capacity" per server platform.
One thing that's perhaps "special" about our use-case, though, is that our data is inherently append-only. Which is pretty great for OLTP performance: there's very little write contention, as we just partition the data by insert time (which is also a natural partitioning key[2] for most queries our business layer does) and then only write to the newest partition, with all previous partitions becoming effectively immutable.
Most workloads aren't like this... but they could be, if you model your data carefully, with temporal tables holding versioned records and so forth. You trade off more storage growth, for far less worry about write contention. ("Event streaming" is not the silver-bullet solution it would seem here — as you would still need to create an aggregate to query from; and writing into that aggregate would still be an OLTP write bottleneck. No, the true solution here is data-driven [schema] design — i.e. forcing your API engineers to invite the DBA to feature design meetings, and ensuring the DBA has the perogative and incentives to say "no" to a bad design. Or you could just hire database systems engineers to code your business layer, like we did.)
My point here, is that if you're very careful in designing for not just "scalability" but economies of scale — and if your org is engineering-driven rather than product-driven (as a non-profit social network would likely be!), and so has "mechanical sympathy" — then you can achieve things with just the budget of an average B2B SaaS, that would rival a (smaller) social network.
But with the budget of a benefit corp that nevertheless charges even $1 for each install of its social-network mobile app? Who knows what you could achieve!
(See also: WhatsApp, pre-acquisition. They were just charging $1 for each install. With just 50 employees, that got them a lot of operating budget.)
---
[1] And actually, that's part of why the DBs are not IO-bound. Put enough NVMe disks in RAID0 together, and you get something nearly functionally indistinguishable from memory. (Yes, RAID0 — because our DBs aren't the store-of-record; our data lake is.)
And even where NVMe reads are not indistinguishable from memory reads, they're still complementary. Postgres, in relying on mmap(2)ing heap files and thereby on the the OS page-cache as a buffer pool, makes per-query serial page faults expensive, and causes multiple threads wanting the same cold data to bottleneck by repeatedly becoming coalesced awaiters on the same sequence of pages faulting in. But when you've got highly-concurrent queries [= many different OS threads] that are all page-faulting for different cold data, then you get deep IO queuing, and everything works out optimally. So the expensive cases don't really come up — the page-cache serves the head of the request distribution, while the deep IO queues of the RAID0-NVMe-pool are a perfect match for the long tail.
This makes it almost irrelevant, at runtime, whether data is hot or cold. As long as you have sufficient memory to host the very hottest data (e.g. relatively-low-cardinality fact tables like users/blogs that get joined to everything, and esp. their indices), the rest can be read straight from disk with nearly no penalty.
---
[2] Funny enough, given that we're doing heavy joins for some queries, the one thing we do sometimes run out of at runtime, is timeslices of the Postgres postmaster to coordinate lightweight locking of the global locks table.
We not only have time-partitioned tables, but also something akin to tenant-partitioned schemas. This results in a lot of relations. (It takes hours for us to run vacuumdb --analyze-only.)
And some OLAP-ier queries (that we nevertheless have to run, synchronously, in response to client requests!) need to touch many of the partitions and many of the schemas. That's O(MN) locks they need to take — which take time to write into the global locks table, even though those are only memory writes.
Every once in a while, we have to "consolidate" our table partitions — not by rolling them up with aggregations to reduce row-count, but rather by just copying all the data into fewer, larger partitions — so that we can take fewer access-shared locks per query, so that transactions can spend less time waiting on a handle into the global locks table at startup just to write these O(MN) read locks into it. (In other words, we consolidate data to reduce O(MN) read locks per transaction, to O(N log M) read locks per transaction.)
But that's a PG problem, not a resource problem per se. We've been considering writing a patch for PG to allow relations to be marked as "immutable", where an "immutable" relation doesn't have locks tracked for it (or its indices) at all — but also can't be written to, re-indexed, or even dropped. We'd then just apply this setting to all of our historical partitions. (If you're curious: given its semantics, this property could be enabled for a relation at runtime; but would need an instance restart for disabling it to take effect, as the instance would have no idea what read-locks "would be being held if the table would have been tracking locks", and so couldn't safely do any writes to the table without a barrier that drops all existing MVCC read-states — i.e. an instance restart.) As a bonus, such relations could also have "perfect statistics" calculated for them by ANALYZE; and those stats could then be marked as never needing to be re-calculated — allowing both VACUUM and ANALYZE to be skipped for that relation forevermore.
I don't think this access pattern is generally applicable to social networks, and ditto for your operational considerations.
"Four shards" and "no replicas" is a complete non-starter for a popular social network. That low shard count means a hardware failure on a shard master results in downtime for 25% of your users. Putting ~400 billion rows on a single shard also means backups, restores, schema changes, db cloning, etc all can take a tremendous amount of time. No read replicas means you can't geographically distribute your reads to be closer to users.
I'm not sure why you're assuming that social networks aren't 'very careful in designing for not just "scalability" but economies of scale'. I can assure you, from first-hand experience as both a former member of Facebook's database team and the founding member of Tumblr's database team, that you have a lot of incorrect assumptions here about how social networks design and plan their databases.
I can't comment on any of the PG-specific stuff as I don't work with PG.
Append-only design is not usable today for social networks / user-generated content. That approach is literally illegal in the EU. In years past, afaik Twitter operated this way but I was always skeptical of the cost/benefit math for their approach.
The "200 expensive beefy database servers" I was describing were relative to 2012. By modern standards, they were each a tiny fraction of the size of the hardware you just described. That server count also included replicas, both read-replicas and HA/standby ones (the latter because we did not use online cloning at the time).
I haven't said anything about event sourcing as it generally is not used by social networks, except maybe LinkedIn.
We have 1.5 trillion distinct rows living on a PG cluster right now. Growing at a higher rate. And that requires... four DB servers (four shards). And no replicas — all our prod query traffic hits these servers directly, as a mixed workload with the writes from our queue-consumer ETL agents.
(The servers are "beefy", sure — but they're not that beefy. They've got 128 cores and 1TB of RAM each. That's not even top-of-the-line these days.)
Despite serving a whole ton of QPS, these DB servers are not CPU-bound or memory-bound or even IO-bound. Our primary horizontal-sharding factor, in fact, is PCIe lanes — we're constrained mainly by the number of NVMe drives we can keep connected performantly per CPU [1] — and thereby, generationally, by "fast storage capacity" per server platform.
One thing that's perhaps "special" about our use-case, though, is that our data is inherently append-only. Which is pretty great for OLTP performance: there's very little write contention, as we just partition the data by insert time (which is also a natural partitioning key[2] for most queries our business layer does) and then only write to the newest partition, with all previous partitions becoming effectively immutable.
Most workloads aren't like this... but they could be, if you model your data carefully, with temporal tables holding versioned records and so forth. You trade off more storage growth, for far less worry about write contention. ("Event streaming" is not the silver-bullet solution it would seem here — as you would still need to create an aggregate to query from; and writing into that aggregate would still be an OLTP write bottleneck. No, the true solution here is data-driven [schema] design — i.e. forcing your API engineers to invite the DBA to feature design meetings, and ensuring the DBA has the perogative and incentives to say "no" to a bad design. Or you could just hire database systems engineers to code your business layer, like we did.)
My point here, is that if you're very careful in designing for not just "scalability" but economies of scale — and if your org is engineering-driven rather than product-driven (as a non-profit social network would likely be!), and so has "mechanical sympathy" — then you can achieve things with just the budget of an average B2B SaaS, that would rival a (smaller) social network.
But with the budget of a benefit corp that nevertheless charges even $1 for each install of its social-network mobile app? Who knows what you could achieve!
(See also: WhatsApp, pre-acquisition. They were just charging $1 for each install. With just 50 employees, that got them a lot of operating budget.)
---
[1] And actually, that's part of why the DBs are not IO-bound. Put enough NVMe disks in RAID0 together, and you get something nearly functionally indistinguishable from memory. (Yes, RAID0 — because our DBs aren't the store-of-record; our data lake is.)
And even where NVMe reads are not indistinguishable from memory reads, they're still complementary. Postgres, in relying on mmap(2)ing heap files and thereby on the the OS page-cache as a buffer pool, makes per-query serial page faults expensive, and causes multiple threads wanting the same cold data to bottleneck by repeatedly becoming coalesced awaiters on the same sequence of pages faulting in. But when you've got highly-concurrent queries [= many different OS threads] that are all page-faulting for different cold data, then you get deep IO queuing, and everything works out optimally. So the expensive cases don't really come up — the page-cache serves the head of the request distribution, while the deep IO queues of the RAID0-NVMe-pool are a perfect match for the long tail.
This makes it almost irrelevant, at runtime, whether data is hot or cold. As long as you have sufficient memory to host the very hottest data (e.g. relatively-low-cardinality fact tables like users/blogs that get joined to everything, and esp. their indices), the rest can be read straight from disk with nearly no penalty.
---
[2] Funny enough, given that we're doing heavy joins for some queries, the one thing we do sometimes run out of at runtime, is timeslices of the Postgres postmaster to coordinate lightweight locking of the global locks table.
We not only have time-partitioned tables, but also something akin to tenant-partitioned schemas. This results in a lot of relations. (It takes hours for us to run vacuumdb --analyze-only.)
And some OLAP-ier queries (that we nevertheless have to run, synchronously, in response to client requests!) need to touch many of the partitions and many of the schemas. That's O(MN) locks they need to take — which take time to write into the global locks table, even though those are only memory writes.
Every once in a while, we have to "consolidate" our table partitions — not by rolling them up with aggregations to reduce row-count, but rather by just copying all the data into fewer, larger partitions — so that we can take fewer access-shared locks per query, so that transactions can spend less time waiting on a handle into the global locks table at startup just to write these O(MN) read locks into it. (In other words, we consolidate data to reduce O(MN) read locks per transaction, to O(N log M) read locks per transaction.)
But that's a PG problem, not a resource problem per se. We've been considering writing a patch for PG to allow relations to be marked as "immutable", where an "immutable" relation doesn't have locks tracked for it (or its indices) at all — but also can't be written to, re-indexed, or even dropped. We'd then just apply this setting to all of our historical partitions. (If you're curious: given its semantics, this property could be enabled for a relation at runtime; but would need an instance restart for disabling it to take effect, as the instance would have no idea what read-locks "would be being held if the table would have been tracking locks", and so couldn't safely do any writes to the table without a barrier that drops all existing MVCC read-states — i.e. an instance restart.) As a bonus, such relations could also have "perfect statistics" calculated for them by ANALYZE; and those stats could then be marked as never needing to be re-calculated — allowing both VACUUM and ANALYZE to be skipped for that relation forevermore.