Although I'm not a Big Data aficionado, I was recently struck by a few statistics from the IDC Digital Universe study: an estimated 1.8 zettabytes of information were created in 2011, roughly 75% of it by individuals. And by 2015, that figure may grow to 7.9 zettabytes.
So, just where does all of this data come from? The more I thought about it, the more I realized that each one of us (in developed countries, at least) kicks off a near-constant, massive stream of data that gets stored somewhere, even if only transiently.
In the course of a seemingly innocuous day, we surely generate more data than we consume - some of which is captured by others, and only some of which we might be privileged enough to retain.
Thus I sat down to think about the typical "digital contrail" I might generate. While I really can't quantify exactly how much stored data each transaction creates, simply ball-parking the numbers would seem to support the Digital Universe claims. And if any of you reading this can help me quantify some of it, I'd be happy to append to this post. Thanks in advance.
Sources of my "Digital Contrail"
- I make a cell phone call: Phone location tracking data (e.g. from cell towers) created and retained by the carrier; phone log files; data created and stored by multiple mobile apps and their own hosting infrastructures
- I browse the web: Site tracking; clickstream storage; site analytics; email storage, including replication on devices as well as replication across geographically mirrored data centers
- Driving my car: Location-based tracking by RFID tags at toll booths; unique instrumentation data such as from OnStar systems
- Go to the bank: Data streams initiated and stored from a simple ATM withdrawal; security analysis of banking transaction patterns; audit and verification trails for individual transactions; mirrored/backed-up data within the bank's data center
- Go to the store: Data streams initiated and stored from a simple credit card transaction; product inventory changes; buying patterns stored and attributed to individual affinity discount programs
- Browse an online store: All of the above, plus clickstream storage and analysis
- Plan some travel: Airline reservations & pricing systems such as SABRE ticketing; airline tracking databases; TSA flyer database updates & analysis
- At my home: electricity usage via smart metering data collection
- Using entertainment: Uploaded photography and video; sales pattern data and DRM data
- Go to the doctor's office: Medical imaging, EMR data, reports, other records
- Somewhere in the background: With everything I do, there are surely security systems kicking off background data processes and feeding analytics DBs
- Also somewhere in the background: Every service is sourced from a data center, where all data (including device data) is surely replicated and backed up, including log files.
I finally thought through a simple habit of mine, and how much storage space it spawned: I would receive an email with a PDF attachment, carefully file the email in a folder, and also copy the PDF into a separate folder related to the project. So I'd have 2 copies of the file on my PC, not to mention another copy on the Exchange server as well as one on the PC backup server - 4 in all (assuming no deduplication system was in place). And if the email had been sent to others besides me... you get the point. I was suddenly sensitized to data growth on a personal scale.
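The arithmetic behind that habit multiplies quickly. Here's a rough back-of-the-envelope sketch (my assumptions, not hard numbers: no deduplication anywhere, and every recipient files the attachment the same way I do):

```python
# Rough estimate of how many stored copies one emailed attachment spawns.
# Assumes no deduplication and that every recipient files things like I do.

def attachment_copies(recipients, local_copies=2, server_copies=1, backup_copies=1):
    """Per recipient: the filed email plus the separate project-folder copy
    (local_copies), the mail server's copy, and the backup of the local PC."""
    per_person = local_copies + server_copies + backup_copies
    return recipients * per_person

print(attachment_copies(1))   # my own habit: 4 copies of a single PDF
print(attachment_copies(10))  # the same PDF sent to a 10-person team: 40
```

Even with generous deduplication, it's easy to see how one person's routine filing habits contribute to the zettabyte totals above.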
So, now I've convinced myself that "the data's out there." I've created scads of data in the past 24-hour stint... and fortunately (or unfortunately) it's all recorded in different repositories. But now I begin to wonder: what *if* some of these structured and unstructured data streams were reconstructed, mashed up, and analyzed? That thought makes me both nervous (from a privacy and security perspective) and excited (from a Big Data and personalization perspective). More later when I stop to think about that one.