Yesterday’s Terabyte is Today’s Petabyte

In the mid-1990s, hearing about someone with a 1 terabyte data warehouse (DWH) was a sort of mystical, illusory event, engendering doubt or even suspicion that it was a ‘fish that got away’ story. The person telling the story was never the one who actually built the DWH; they had merely been exposed to it in some way, and they threw the story around as if it were nothing, loving the awed look on the faces of their audience. Invariably this would be someone from the Information Technology (IT) field, since the business users would be unlikely to know, care, or be surprised that a very large amount of data is needed to answer their questions. So the IT person would also carelessly throw out a rejoinder such as ‘You know, at that size, you can’t simply [insert technique IT people do every day with a ‘normal’ large DWH].’
Fast forward a decade. Today, terabyte+ warehouses are common. However, one hears the same stories with one small difference: replace the word terabyte with petabyte. A petabyte, at 1,000 terabytes, is a seemingly unreachable stretch of data. However, as we all witness increasing processing power and decreasing storage costs, we are seeing enough examples of PB+ warehouses to say, “yesterday’s terabyte is today’s petabyte”.

Before you get a petabyte DWH, you need a petabyte of operational data. Only when a petabyte of data is present to ‘run’ your business can someone say ‘we need to analyze all this data’. Today’s petabyte-operational business is much more likely to be communication- or information-based. For example, AT&T reported a year ago that “AT&T currently carries about 16 petabytes of total IP and data traffic on an average business day”. (With the exponential growth in storable communication, presumably it’s on its way to doubling…) Other companies with petabyte businesses include Google, the major telecommunications firms, and the major web businesses: in short, digital media and telecommunications. It’s nice to know the exception that proves the rule is the PB+ data collection at the Large Hadron Collider.

At a recent conference, a speaker from Facebook described the accelerating growth of their DWH. They reported that in March 2008 they were collecting 200 gigabytes (GB) of data per day; by April 2009 they were collecting 2+ TB of data per day, and by October 2009, 4+ TB per day. If you chart this, you see something approaching a classic exponential curve. While Facebook reports its DWH is closing in on 5 PB today, by the time a reader is absorbing this sentence, it has likely long surpassed that.
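
Those three data points are enough for a quick sanity check. The short Python sketch below derives the implied doubling time; the dates are approximated to the first of each reported month (my assumption, not Facebook’s), and the volumes are only the figures quoted above. It works out to a doubling roughly every four months in the first interval and every six months in the second, which is why the chart looks exponential.

    # Back-of-envelope check on the Facebook growth figures cited above.
    # Dates are approximated to the first of each reported month; the
    # doubling time is derived from the quoted numbers, nothing more.
    import math
    from datetime import date

    points = [
        (date(2008, 3, 1), 200),    # ~200 GB/day
        (date(2009, 4, 1), 2000),   # ~2 TB/day
        (date(2009, 10, 1), 4000),  # ~4 TB/day
    ]

    for (d0, v0), (d1, v1) in zip(points, points[1:]):
        months = (d1 - d0).days / 30.44
        doubling = months * math.log(2) / math.log(v1 / v0)
        print(f"{d0} -> {d1}: daily volume doubles about every {doubling:.1f} months")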

Does this mean that in 2020 more than half of the Fortune 100 will have petabyte-size data warehouses? Probably not. However, they’ll all have TB+ warehouses, and a herd of businesses will be PB+:

• All large and mid-size digital media, social media, and web businesses
• Large and mid-size telecommunication firms, driven by their Call Detail Record databases
• Financial market-based companies (think of tracking every stock market transaction at microsecond granularity), and more and more brick-and-mortar companies (e.g., banks) that have done as little as dip a toe into financial markets, social media, streaming communication, and the like
• Large energy companies recording all seismic and atmospheric ‘communications’ to a very specific latitude/longitude
• The energy grid will be getting close. Cars will likely be talking to the grid to reduce congestion and to enable metered driving in the fast lane, and chances are the cars will be talking to each other as well, spitting out signals every second. Just like that, we’ve added another 100M four-wheeled ‘people’ in our country communicating, and someone will want to analyze it (a rough sizing sketch follows this list).
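
To make that last bullet concrete, here is a minimal back-of-envelope sketch in Python. The 100M fleet size and the one-signal-per-second rate come from the bullet above; the message size is purely an illustrative assumption.

    # Rough estimate of daily data volume from a connected-car fleet.
    # Fleet size and signal rate come from the text; the message size
    # is an illustrative assumption, not a measured figure.
    CARS = 100_000_000        # four-wheeled 'people' on the road
    MSGS_PER_SEC = 1          # one signal per second per car
    BYTES_PER_MSG = 200       # assumed size of one telemetry message

    SECONDS_PER_DAY = 24 * 60 * 60
    bytes_per_day = CARS * MSGS_PER_SEC * BYTES_PER_MSG * SECONDS_PER_DAY

    print(f"{bytes_per_day / 1e15:.1f} PB per day")  # ~1.7 PB/day at these assumptions

At those assumptions the fleet alone produces well over a petabyte a day, which would put an exabyte-scale warehouse less than two years out.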

And, you know, when your car’s antenna is a source for an exabyte data warehouse, you can’t just change the wiper blades, you have to……