
  1. Healthcare.  In this visualization, GE took 500,000 records from the millions in its electronic medical record database and calculated the out-of-pocket and insurer cost of a handful of chronic conditions by age.  One of the few genuinely good uses, and designs, of a radar graph, the visual changes as the user increases and decreases age.  It also has a predictive analytics component in that it answers the question "If I develop hypertension, what will that cost me when I'm 65?"  Like any good analytic, seeing the data raises more actionable and specific questions that this analytic doesn't answer but the data set could.  (A minimal sketch of such a radar view follows this list.)
  2. Digital Media.  This visualization is part of The New York Talk Exchange, a project developed by the Senseable City Lab at MIT.  The potential applications of the analytic are perhaps even more explosive than the specific data used here.  The analytic shows each originating neighborhood within NYC and where its communications over the AT&T network were destined.  Users can see the frequency distribution of endpoints and compare who was talking to whom geographically across sister boroughs.  Imagine creating a site map of your web site, application, team, or workflow and seeing the frequency of where the user, function, business process, or capital goes next.  Predictive analytics can then say: "If we invest in area x, where is that capital, profit opportunity, or waste most likely to go next, and how does that change if I make another investment?"  This specific kind of visualization may be uniquely suited to that type of comparative predictive analytic.  (A sketch of the flow-frequency tabulation behind it also follows the list.)
  3. Retail.  According to Well Formed Data, Sankey diagrams and stacked bar charts informed this time series visualization (a 4MB .pdf download) of how medical journals in related fields merged into a cohesive 'basket' of journals in the emergent field of neuroscience.  While not retail-related on the surface, it points to a very compelling visualization, one I've never seen produced, that would explain how specific products that drive volume and profit affinitize into specific types of market baskets.  Replace:
     a) each journal with a specific retail product, which the user can color-code at run time, perhaaps for on/off promotion status;
     b) the eigenfactor of each journal, represented by the weight/width of its line, with the amount of profit or volume the item produces (again, the user chooses profit, volume, or some other measure at run time);
     c) the ten or so ending blocks of journals with specific types of market baskets; and
     d) the breadth of starting lines, moving from medical disciplines to aisles or departments in a store.
     In effect, the decade-long time series shown here gets compressed into a single, specific shopping trip.  The data can include many trips or a single store, one day or years of history.  The predictive angle is being able to answer questions like "If I promote this item, does it move away from its core 'seven items per basket, quick trip' basket into a 'destination item, weekly stock-up' basket?"  One can also look historically at how different store consumers shop categories, volume vs. profit items, and more.  (A sketch of this basket-migration question follows the list as well.)
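
A minimal sketch of the radar view in item 1, in Python with matplotlib. The conditions and dollar figures are invented for illustration; they are not GE's data.

```python
# Hypothetical radar ("spider") chart: annual cost of chronic conditions
# at a chosen age, out-of-pocket vs. insurer. All numbers are made up.
import numpy as np
import matplotlib.pyplot as plt

conditions = ["Hypertension", "Diabetes", "Asthma", "Arthritis", "Heart disease"]
out_of_pocket = [1200, 2100, 800, 950, 3400]     # assumed $/year at age 65
insurer_cost  = [4800, 9500, 2600, 3100, 14000]  # assumed $/year at age 65

# One angle per spoke; repeat the first point to close each polygon.
angles = np.linspace(0, 2 * np.pi, len(conditions), endpoint=False).tolist()
angles += angles[:1]

fig, ax = plt.subplots(subplot_kw={"polar": True})
for label, costs in [("Out of pocket", out_of_pocket), ("Insurer", insurer_cost)]:
    values = costs + costs[:1]
    ax.plot(angles, values, label=label)
    ax.fill(angles, values, alpha=0.15)

ax.set_xticks(angles[:-1])
ax.set_xticklabels(conditions)
ax.set_title("Hypothetical annual cost by chronic condition, age 65")
ax.legend(loc="lower right")
plt.show()
```

An interactive version would recompute the two cost polygons as the user drags an age slider, which is what gives the GE visual its predictive feel.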
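
The mechanics behind item 2 reduce to a source-to-destination frequency table. Here is a sketch with pandas; the column names and sample records are assumptions, not the actual NYTE data.

```python
# Tabulate where communications (or users, processes, capital) go next,
# per originating source. Sample records are invented for illustration.
import pandas as pd

calls = pd.DataFrame({
    "source":      ["Queens", "Queens", "Brooklyn", "Brooklyn", "Manhattan"],
    "destination": ["London", "Mumbai", "London", "Santo Domingo", "Tokyo"],
})

# Rows: originating neighborhood; columns: destination; cells: call counts.
flow = pd.crosstab(calls["source"], calls["destination"])
print(flow)

# Row-normalize to get each source's frequency distribution of endpoints,
# the quantity the visualization encodes as arc weight.
print(flow.div(flow.sum(axis=1), axis=0))
```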
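
And the predictive question in item 3, whether promotion moves an item between basket types, can be framed the same way. The basket labels and trip counts below are invented.

```python
# For one product: share of its trips by basket type, on vs. off promotion.
# A shift of mass toward "stock-up" under promotion suggests basket migration.
import pandas as pd

trips = pd.DataFrame({
    "on_promo":    [False] * 4 + [True] * 4,
    "basket_type": ["quick trip", "quick trip", "stock-up", "quick trip",
                    "stock-up", "stock-up", "quick trip", "stock-up"],
})

mix = pd.crosstab(trips["on_promo"], trips["basket_type"], normalize="index")
print(mix)
```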

These predictive analytic visualizations start with healthcare as the least complex and become progressively more complex.  Between how frequently the user needs the data refreshed, the volume of data (in the digital media/telco example, clearly tens or hundreds of gigabytes), the processing (in the retail example, calculating eigenvalues in an analytic that might be refreshed hourly, especially for promotional out-of-stocks), and the speed required, a predictive analytic visualization is not something to try at home with an off-the-shelf database platform and hardware.

In the mid-1990s, hearing about someone with a 1-terabyte data warehouse (DWH) was a sort of mystical, illusory event, engendering doubt or even suspicion, like a 'fish that got away' story. The person telling the story was never the one who actually built the DWH; they had merely been exposed to it in some way, and they threw the story around as if it were nothing, loving the awed look on the faces of their audience. Invariably this would be someone from the Information Technology (IT) field, since business users would be unlikely to know, care, or be surprised that a very large amount of data is needed to answer their questions. So the IT person would also carelessly throw out a rejoinder such as 'You know, at that size, you can't simply [insert technique IT people do every day with a "normal" large DWH].'
Fast forward a decade. Today, terabyte+ warehouses are common, yet one hears the same stories with one small difference: replace the word terabyte with petabyte. A petabyte, at 1,000 terabytes, is a seemingly unreachable stretch of data. Yet as processing power keeps rising and storage costs keep falling, we are seeing enough examples of PB+ warehouses to say, "yesterday's terabyte is today's petabyte".

Before you get a petabyte DWH, you need a petabyte of operational data. Only when a petabyte of data is present to 'run' your business can someone say 'we need to analyze all this data'. Today's petabyte-operational business is much more likely to be communication- or information-based. For example, AT&T reported one year ago that "AT&T currently carries about 16 petabytes of total IP and data traffic on an average business day". (With exponential growth in storable communication, presumably that figure is on its way to doubling…) Other companies with petabyte businesses include Google, all the major telecommunications companies, and all the major web businesses: digital media and telecommunications again. It's nice to know the exception that proves the rule is the PB+ data collection at the Large Hadron Collider.

In a recent conference, a member of the Facebook team revealed the accelerating growth of their DWH. In March 2008 they were collecting 200 gigabytes (GB) of data per day; by April 2009 they were collecting 2+ TB of data per day; and by October 2009, 4+ TB of data per day. Chart this and you see something approaching a classic exponential curve. While Facebook reports its DWH is closing in on 5 PB today, by the time a reader absorbs this sentence it has likely long surpassed that.
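
As a sanity check on that curve, here is a quick calculation, using only the three figures reported above, of the doubling time each interval implies. The dates are approximated to the first of each month.

```python
# Doubling times implied by the reported Facebook daily-intake figures.
# Rates are in GB/day; dates are approximated to the first of the month.
from datetime import date
from math import log

points = [(date(2008, 3, 1), 200),     # 200 GB/day
          (date(2009, 4, 1), 2000),    # 2+ TB/day
          (date(2009, 10, 1), 4000)]   # 4+ TB/day

for (d0, r0), (d1, r1) in zip(points, points[1:]):
    months = (d1 - d0).days / 30.44
    doubling = months * log(2) / log(r1 / r0)
    print(f"{d0} -> {d1}: doubling every {doubling:.1f} months")

# Roughly a 3.9-month doubling time over the first interval and 6 months
# over the second: the pace eases, but daily intake still doubles at least
# twice a year, which is exponential growth.
```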

Does this mean that in 2020 more than half of the Fortune 100 will have petabyte-size data warehouses? Probably not. However, they'll all have TB+ warehouses, and a herd of businesses will be PB+:

• All large and mid-size digital media, social media, and web businesses
• Large and mid-size telecommunication firms, driven by their Call Detail Record databases
• Financial market-based companies (think of tracking all stock market transactions to the microsecond level of granularity), and more and more bricks-and-mortar companies (e.g. banks) that have so much as dipped a toe into financial markets, social media, streaming communication, and the like.
• Large energy companies recording all seismic and atmospheric ‘communications’ to a very specific latitude/longitude
• The energy grid will be getting close. Cars will likely be talking to the grid to reduce congestion and to enable metered driving in the fast lane, so chances are the cars will also be talking to each other, spitting out signals every second.  Just like that, we've added another 100M four-wheeled 'people' in our country communicating, and someone will want to analyze it.  (A quick back-of-envelope on that data volume follows this list.)
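
The car example is easy to size with a back-of-envelope calculation. The one-message-per-second rate comes from the bullet above; the 200-byte payload is my assumption.

```python
# Back-of-envelope: 100 million cars, one message per second each.
# The 200-byte payload (signal plus headers) is an assumed figure.
cars = 100_000_000
bytes_per_msg = 200

per_second = cars * bytes_per_msg      # 20 GB every second
per_day = per_second * 86_400          # ~1.7 PB per day
per_year = per_day * 365               # ~0.63 EB per year

print(f"{per_second / 1e9:.0f} GB/s, "
      f"{per_day / 1e15:.1f} PB/day, "
      f"{per_year / 1e18:.2f} EB/year")
```

At roughly 0.6 EB a year of raw signals, the exabyte warehouse in the closing line stops sounding like hyperbole.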

And, you know, when your car’s antenna is a source for an exabyte data warehouse, you can’t just change the wiper blades, you have to……