All the best (and most fun) scaling issues are best solved by hacking. Are there really any proper, decent methods out there that can actually take a company from 5 users a day to 5 billion without any problems at all?
I'm aware that you could perhaps "foresee" it and buy a big data center and lots of hardware, but is that really what people mean by an 'out of the box' solution?
Sounds like when Google gets a problem, they create something to fix that, which is pretty damn cool in my opinion.
"...most applications don't use [Google File System (GFS)] today. In fact, we're phasing out GFS in favour of the next-generation file system that is very similar, but it's not GFS anymore."
What could this be? Homegrown? An open source project?
BFT... big freaking tapes.
I wonder if it's really tape though. You can get 'tape' backup systems where the cartridge looks just like a tape, but it's actually a special hard drive. I wouldn't be surprised if that's it.
Otherwise, I wonder how many miles (of tape) long my gmail inbox is?
Urs wouldn't have said tape if he didn't mean tape.
This brings to mind an amusing anecdote. I once saw a conversation about how many miles of tape are storing Viagra ads. Someone quipped, "The only thing that I know for certain is that the trend is up."
According to one source I found[1], 60 meters of DAT tape holds 1300 MB of data. That's 21.67 MB per meter. My gmail inbox is 679 MB, so that's about 31 meters of DAT tape.
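For anyone who wants to plug in their own inbox size, here is a rough Python sketch of the same arithmetic. It only uses the figures quoted above (60 m of DAT tape holding 1300 MB) and assumes the data is spread evenly along the tape; the function name `dat_metres_needed` is made up for illustration.

```python
# Figures quoted in the comment above: 60 m of DAT tape holds 1300 MB.
DAT_LENGTH_M = 60.0
DAT_CAPACITY_MB = 1300.0


def dat_metres_needed(size_mb: float) -> float:
    """Metres of DAT tape needed for size_mb, assuming uniform linear density."""
    density_mb_per_m = DAT_CAPACITY_MB / DAT_LENGTH_M
    return size_mb / density_mb_per_m


density = DAT_CAPACITY_MB / DAT_LENGTH_M   # MB stored per metre of tape
inbox_m = dat_metres_needed(679)           # the 679 MB inbox from the comment

print(f"{density:.2f} MB/m, {inbox_m:.1f} m")
```

That lands on roughly 21.67 MB per meter and a bit over 31 meters, matching the estimate above.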
LTO-5 stores approximately 3 TB with compression (1.5 TB native). The tape is 846 m long.
Let us assume you have 7597 MB of emails in your gmail inbox. That's about 0.48% of the capacity of an uncompressed tape, and 0.48% of 846 m is a smidge over 4 m. And if Google compresses their data, then you're potentially looking at about 2 m.
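A quick sketch of that LTO-5 estimate in Python, in case anyone wants to check it. The tape length and native capacity are the figures quoted above; the 2:1 compression ratio is just the vendor-style "3 TB compressed" number, not anything known about Google's data, and `lto5_metres` is a hypothetical helper name. The 0.48% figure appears to assume binary units (1 TB = 1024 × 1024 MB), so the sketch makes that a switch.

```python
# Figures quoted above: LTO-5 is 846 m long and holds 1.5 TB native.
LTO5_LENGTH_M = 846.0
LTO5_NATIVE_TB = 1.5
COMPRESSION_RATIO = 2.0  # assumption: the nominal "3 TB compressed" figure


def lto5_metres(size_mb: float, binary_units: bool = True,
                compressed: bool = False) -> float:
    """Metres of LTO-5 tape used by size_mb, assuming data fills the tape evenly."""
    mb_per_tb = 1024 * 1024 if binary_units else 1000 * 1000
    capacity_mb = LTO5_NATIVE_TB * mb_per_tb
    if compressed:
        capacity_mb *= COMPRESSION_RATIO
    fraction_of_tape = size_mb / capacity_mb
    return fraction_of_tape * LTO5_LENGTH_M


native_m = lto5_metres(7597)
compressed_m = lto5_metres(7597, compressed=True)
decimal_m = lto5_metres(7597, binary_units=False)

print(f"native: {native_m:.2f} m")
print(f"compressed: {compressed_m:.2f} m")
print(f"native, decimal TB: {decimal_m:.2f} m")
```

With binary units this gives about 4.1 m native and about 2 m compressed; with decimal terabytes the native figure creeps up to roughly 4.3 m, so "a smidge over 4 m" holds either way.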
I love this comment: "The reason why we put it in is not physical data loss, but once in a blue moon you will have a bug that destroys all copies of the online data and your only protection is to have something that is not connected to the same software system." I think that is often overlooked when designing HA storage systems.
It's hard to see SSDs being a clear win on the surface. It would be very interesting, though, to see if the cost picture changes once you consider power savings (much smaller than hoped, but they do exist), how much they'd save on HVAC in the datacenters, etc.
Having worked on power-aware hybrid storage for the last two years, I can say the power savings per GB are pretty much zero; there are significant W/IOPS savings, but they still don't pay for the capital cost. Performance is really the reason to use SSDs.
Think commodity... Have SSDs reached the commodity level yet? From a commodity standpoint, I think it would be far more likely that Google would use an array of SD cards per node. SSDs are really just an array of SD cards anyway, with a pile of hype and marketing on top. You would still get most of the benefits of an SSD. Power usage (especially at idle) might even be lower than an SSD's. Google could develop their own wear-leveling algorithms and the rest of what an SSD controller provides for the internal flash. Replacement costs could be lower over time as well.
I was at a Linux user group meeting recently where a Google sysadmin gave a talk about hard drive reliability. Someone asked him about SSDs, to which he replied that he couldn't talk about it. My take is that they are definitely trying out SSDs, but either 1) they found that SSDs provide a huge competitive benefit and don't want to publicly share that knowledge, or 2) they simply don't have enough data yet. I'm leaning towards #1.
[1] http://research.google.com/pubs/author79.html