Big Data, Big Content -- The New World
The explosion of personal and enterprise digital content has been recognized as one of the most significant characteristics of the decade. The Economist recently pointed to an estimate that 1,200 exabytes (billion gigabytes) of content were created last year, compared with 150 exabytes in 2005. The article went on to predict that newly deployed models will produce ten times as many data streams as their predecessors, and that in 2011 this will increase to 30 times as many.
Mainstream examples of the consumer generation of rich content abound:
- Five years ago, a construction company inspected a building and made a note: “crack on the wall on floor 5”; now it videotapes the whole wall before and after the crack is fixed.
- In 2009, American drones flying over Iraq and Afghanistan sent back 24 years’ worth of video footage.
The only thing predictable about storage requirements is that they will be unpredictable and much bigger than you could possibly imagine. To manage this growth, the focus must shift to managing “Big Content”.
Figure 1: Content Communism Doesn't Work
Content Communism Doesn’t Work – Think Like Bing or Google
When it comes to the cloud, “Content Communism” doesn’t work.
The total amount of content is growing dramatically. However, what most people care about is the “working set” of content. Some simple examples to explain working sets are SharePoint, Email and the Shared File Drive. In SharePoint, people care much more about the latest version than older versions. In the case of email, people look at today’s inbox much more often than they search email from a year ago.
The biggest traditional Big Content problem is the humble shared drive, where shared content is stored. Here, for example, a corporate presentation is stored and it is typical for a person to copy the original, paste it, edit one or two slides and save a new copy. The same slides are stored over and over again. Old presentations are always kept. Everyone has the problem, but no one has the responsibility of fixing it. The biggest new Big Content problem is rich media and large volumes of video.
You can't treat all content the same way. You need to think like Bing or Google; otherwise your CapEx and OpEx will rise exponentially with the volume of content. If search engines treated every web page equally, the page you want could be on page 1 or page 20 of the results -- and that would be useless. Web companies have learned to rank each page automatically and display results by PageRank, without human beings manually categorizing the content. We need an equivalent BlockRank™ for content: the blocks of content people access most are placed on fast local drives, while blocks that have not been accessed for some time are automatically tiered to cloud storage.
Figure 2: The Advantages of Cloud Storage
The Potential of the Cloud and Elastic Cloud Storage
The cloud has some inherent advantages when it comes to storage. You get:
- Instant Thin Provisioning
- Unlimited Elastic Expansion
- Auto-High Availability
- Availability from Anywhere in the World
- No Hardware Maintenance
- Elastic Storage/CPU, Potential DR
- Utility Billing 5x to 10x Lower
Cloud Storage Myths
Some common myths about cloud storage include:
- You will have to rewrite your user and backup applications because cloud storage uses an HTTP/REST API.
- You will have performance issues due to WAN latency and WAN bandwidth costs.
- WAN optimization doesn't work with public clouds.
- The cloud is insecure.
Figure 3: Hybrid Cloud - Simplifying Data Storage
The Cloud-as-a-Tier Architecture: An Integrated Enterprise Cloud Storage Strategy
To exploit the inherent advantages of the cloud, you need an architecture designed for it. This article will examine patterns of usage in a “Cloud-as-a-Tier” architecture and present a framework for how it can be designed to deliver secure, high-performance, tiered enterprise cloud storage. This will be presented in the context of the dynamic lifecycle of content in the cloud, compared with the traditional phases of primary storage, archival, data protection, disaster recovery, and offsite tape.
The article will conclude with a strategy on a new, simpler integrated enterprise cloud storage architecture as an alternative to having to buy, integrate and manage hardware and software for:
- Primary Storage
- Disk Based Backup Storage
- Archival Storage
- Tape Infrastructure and Management
- Replicated Storage for Disaster Recovery
- Offsite Locations for Geo-resilience
Figure 4: Instant Thin Provisioning for Big Content / Multi-Tier Volume Management with Deduplication, BlockRank™ and Transparent Tiering
Instant Thin Provisioning for Big Content
When you are planning a rollout of a Big Content application, one of the big headaches is how much storage to provision. If you plan too far ahead, storage sits unused for years, and by the time you use it the cost per terabyte has gone down dramatically. On the other hand, if you plan with too short a horizon and run out of disk space, you face downtime and major restructuring. As was said in the introduction:
The only thing predictable about storage requirements is that they will be unpredictable and much bigger than you could possibly imagine.
The traditional approach, pioneered by 3PAR, has been to use “Thin Provisioning,” removing the need for up-front capacity allocation. Hybrid cloud storage takes this one step further, allowing a company to instantly provision volumes with as much storage as it could want and pay only for what it is currently using: Thin Provisioning with utility billing.
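The idea can be sketched in a few lines: a thin volume advertises a large logical size but only allocates (and bills for) blocks that are actually written. This is a minimal illustration, not any vendor's implementation; all names and the block size are assumptions.

```python
# Minimal sketch of thin provisioning with utility billing: the volume
# advertises a huge logical size but consumes backing storage only when
# blocks are actually written. Illustrative only; names are assumptions.

BLOCK_SIZE = 4096  # bytes per block (assumed)

class ThinVolume:
    def __init__(self, logical_blocks):
        self.logical_blocks = logical_blocks   # what the host sees
        self.allocated = {}                    # block index -> data, filled lazily

    def write(self, block_index, data):
        if not 0 <= block_index < self.logical_blocks:
            raise IndexError("write past end of logical volume")
        self.allocated[block_index] = data     # storage is consumed only now

    def read(self, block_index):
        # Unwritten blocks read back as zeros, as on a real thin volume.
        return self.allocated.get(block_index, b"\x00" * BLOCK_SIZE)

    def billed_bytes(self):
        # Utility billing: pay only for blocks actually allocated.
        return len(self.allocated) * BLOCK_SIZE

# Provision "1 TB" instantly; pay for only the two blocks written.
vol = ThinVolume(logical_blocks=(1 << 40) // BLOCK_SIZE)
vol.write(0, b"a" * BLOCK_SIZE)
vol.write(123456, b"b" * BLOCK_SIZE)
print(vol.billed_bytes())  # 8192
```

The volume is “provisioned” instantly because no capacity is reserved up front; the bill grows only as data lands.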
Deduplication, BlockRank and Automatic Tiering
The working-set concept discussed earlier can be illustrated with a simple example that brings together BlockRank™, Automatic Tiering and Deduplication: SharePoint. It’s common for a user to create a 5MB PowerPoint presentation and store it as version one. The user then makes some minor changes -- for example, the customer name on the front page -- and stores a second version.
Traditionally, even though there are only minor changes, the second version will be stored as a 5MB file. Deduplication changes this concept by looking at the world as a series of blocks. The same block is never stored more than once. In this example, only the new changed blocks will be incrementally stored, making the second version occupy only tens of kilobytes, not 5MB.
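The mechanics of block-level deduplication can be sketched as a content-addressed store: each unique block is stored once, keyed by its hash, and a file is just a list of block hashes. This is a simplified illustration under assumed names, not a production design.

```python
# Sketch of block-level deduplication: store each unique block once, keyed
# by its SHA-256 fingerprint; a file is just a "recipe" of block hashes.
# Names and block size are illustrative assumptions.
import hashlib

BLOCK = 4096
store = {}  # hash -> block bytes; each unique block is stored exactly once

def put_file(data):
    """Split data into blocks, dedupe into the store, return the recipe."""
    recipe = []
    for i in range(0, len(data), BLOCK):
        chunk = data[i:i + BLOCK]
        h = hashlib.sha256(chunk).hexdigest()
        store.setdefault(h, chunk)   # stored only if never seen before
        recipe.append(h)
    return recipe

def get_file(recipe):
    return b"".join(store[h] for h in recipe)

# Version 1 is eight identical blocks; version 2 changes only the last block.
v1 = b"A" * BLOCK * 8
v2 = b"A" * BLOCK * 7 + b"B" * BLOCK
r1, r2 = put_file(v1), put_file(v2)
# Two 8-block files, but only 2 unique blocks actually stored.
print(len(store))  # 2
```

The second version costs only one new block, which is why a lightly edited 5MB presentation can occupy tens of kilobytes rather than another 5MB.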
Looking at the world as a series of blocks has more advantages than just dramatically reducing the amount of storage required. Just as web search engines assign a PageRank to a web page, a BlockRank™ can be assigned to each block. The blocks of “Hot Content” that users are constantly accessing have a high BlockRank™; those accessed less often grow colder and get a lower BlockRank™. Like PageRank, BlockRank™ is conceptually simple but complex in practice. Automatic Tiering uses the BlockRank™ of each individual block and a BlockRank™-oriented algorithm to choose which tier to store a block in. Blocks automatically and transparently move between fast Solid State Disks (SSD), traditional Serial Attached SCSI (SAS) drives, and elastic cloud storage.
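One plausible shape for such a ranking is a score that grows with access frequency and decays with recency, with thresholds choosing the tier. The scoring function and thresholds below are invented stand-ins to illustrate the concept; the article does not disclose the real BlockRank™ algorithm.

```python
# Sketch of BlockRank-style automatic tiering: a block's rank rises with
# access frequency and decays over time, and rank thresholds pick the tier.
# The formula, half-life, and thresholds are assumed for illustration only.

def block_rank(access_count, seconds_since_access, half_life=86400.0):
    # Rank halves every `half_life` seconds without an access.
    return access_count * 0.5 ** (seconds_since_access / half_life)

def choose_tier(rank):
    if rank >= 10.0:
        return "SSD"        # hot working set stays on flash
    if rank >= 1.0:
        return "SAS"        # warm content on local spinning disk
    return "cloud"          # cold content tiers out to elastic cloud storage

# A block read 50 times an hour ago vs. one read twice a month ago.
hot = block_rank(access_count=50, seconds_since_access=3600)
cold = block_rank(access_count=2, seconds_since_access=30 * 86400)
print(choose_tier(hot), choose_tier(cold))  # SSD cloud
```

In a real system the ranking must run continuously and automatically over billions of blocks, which is where the “simple in concept, complex in practice” observation bites.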
Security in a Cloud-as-a-Tier Architecture
Security breaches normally involve human beings accessing unencrypted data. Recent front-page examples include the UK Government cutting CDs of every citizen’s national insurance and bank details and having them posted to London; they were subsequently sold on the Internet. Engineers were recently found reading young girls’ cloud emails. In five years we will look back at the cloud as increasing security, not decreasing it: the use of sophisticated encryption would have made these security breaches impossible. Security and the cloud is a matter of education.
There are some key architectural issues that are critical to implementing a secure cloud for Big Content. All blocks that move from the on-premises appliance to cloud storage need to be protected with military-grade encryption both at rest and in motion. A second critical component is key management. The key must be stored on the customer’s premises, not in the cloud; otherwise you are again trusting a human being not to use the key to get at what would then become unencrypted content.
There is a subtle additional twist to this. One hybrid cloud storage approach is to aggregate access to cloud storage. In this case, each customer has its own private key; however, the vendor (the aggregator) may be able to access that key. Again, this presents a risk to security.
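The key-management principle can be sketched as follows: blocks are encrypted on the appliance before upload, and only ciphertext ever leaves the premises. The hash-based keystream here is a standard-library stand-in for illustration only; a real appliance would use a vetted cipher such as AES-256, and all names are assumptions.

```python
# Sketch of on-premises key management: the key never leaves the appliance,
# so the cloud (and anyone at the cloud provider) sees only ciphertext.
# The SHA-256 counter keystream is a demo stand-in, NOT production crypto.
import hashlib, os

def keystream_xor(key, nonce, data):
    """XOR data with a SHA-256(key || nonce || counter) keystream (demo only)."""
    out = bytearray()
    counter = 0
    while len(out) < len(data):
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(b ^ k for b, k in zip(data, out))

on_premises_key = os.urandom(32)   # stays on the customer's appliance
nonce = os.urandom(16)             # unique per block
block = b"confidential design data"

cloud_copy = keystream_xor(on_premises_key, nonce, block)  # what the cloud sees
assert cloud_copy != block          # ciphertext only leaves the premises
restored = keystream_xor(on_premises_key, nonce, cloud_copy)
assert restored == block            # only the key holder can recover the data
```

Because decryption requires the on-premises key, neither a cloud operator nor an aggregating vendor holding only the stored blocks can read the content.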
Figure 5: Not all Volumes are Created Equal
Not all Cloud Volumes are Created Equal
Again, when it comes to the cloud, “Content Communism” doesn’t work, and not all volumes are created equal. You don't want your SharePoint database pushed out to the cloud. You do want content that has not been read for some time pushed out. What is required in a cloud-as-a-tier architecture is a way to indicate what pattern of BlockRank™ behavior you want to encourage for each volume, and a way to customize this behavior.
For example, a database volume should have its BlockRank™ behavior driven to SSD. A sequential log file volume should span SSD and SAS drives. A Big Content/BLOB volume should span SSD, SAS and cloud storage, as should a virtual machine volume.
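These per-volume rules amount to a small policy table constraining which tiers each volume type may use. The table below restates the examples from the text; the names and dict representation are illustrative assumptions, not a real product configuration.

```python
# Sketch of per-volume tiering policy: each volume type pins its BlockRank
# behavior to a set of allowed tiers. Table contents follow the examples in
# the text; names and structure are illustrative assumptions.

TIER_POLICY = {
    "database":    ["SSD"],                  # keep the database on flash
    "log":         ["SSD", "SAS"],           # sequential logs span local disks
    "big_content": ["SSD", "SAS", "cloud"],  # BLOBs may tier out to the cloud
    "vm":          ["SSD", "SAS", "cloud"],  # virtual machine volumes likewise
}

def placement_allowed(volume_type, tier):
    """May a block of this volume type live on this tier?"""
    return tier in TIER_POLICY[volume_type]

print(placement_allowed("database", "cloud"))     # False: never push the DB out
print(placement_allowed("big_content", "cloud"))  # True: cold BLOBs can tier out
```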
Live Archival -- Removing the Tears with Tiers
As volumes get bigger and bigger, administrators feel compelled to archive content and remove it from live access, typically for cost reasons, to offline media such as tape on a project or time basis. This can have major implications when the content is needed again. What if content is archived after six months and is then needed for end-of-year reconciliation? What if there is a critical failure in a building’s infrastructure and the urgently needed design data has been archived off to tape?
Archiving becomes even more complicated for applications such as Content Management Systems or SharePoint. You need to archive not only the content, but also related database rows, security policies and more. The multiple assets need to be linked and stored together so that, if necessary, there can be a consistent restore.
Given the choice, most administrators would keep a live archive but don’t because of cost. The cloud, with its out-of-the-box elasticity offering massive volumes and thin provisioning, has the potential to be a solution. The economics add up when you combine block-level deduplication, compression, encryption and use the cloud-as-a-tier for both primary and backup. And the benefits are great.
Figure 6: Fast Off-site Cloud Backup / Application Consistent Cloud Snap – Phase I
Fast Cloud Backup for Big Content
You can’t treat the cloud as a big dumb disk at the end of a WAN. If you do, it will be a disappointing experience, particularly if you are dumb about backup. People jokingly say, “What’s the quickest way to back up to the cloud? Don’t do it.” There is an important truth in this. If the content volume spans SSD, SAS and cloud storage, it will be common for the majority of content to already be tiered into the cloud -- potentially in a live archive. A cloud snapshot, where all of the volume is backed up to the cloud, does not have to move content that is already tiered into the cloud. Only content that exists solely on the appliance has to be deduplicated, compressed and encrypted, then moved to the cloud with WAN optimization.
For the second or any subsequent cloud snapshot, only the blocks changed on the appliance since the last snapshot need to be moved to the cloud. This is a tiny percentage of the volume. Therefore, large multi-terabyte volumes can be simply and rapidly backed up to offsite cloud storage.
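The incremental behavior can be sketched with block fingerprints: a snapshot uploads only blocks whose hashes the cloud has not already seen, so the first snapshot moves everything not yet tiered and each later snapshot moves only the changes. Hashing stands in here for the dedup fingerprints; all names are illustrative.

```python
# Sketch of an incremental cloud snapshot: blocks already in the cloud are
# never re-sent; after the first snapshot, only blocks changed since the
# previous one cross the WAN. SHA-256 stands in for dedup fingerprints.
import hashlib

def fingerprints(blocks):
    return [hashlib.sha256(b).hexdigest() for b in blocks]

def snapshot(blocks, cloud):
    """Upload only blocks the cloud lacks; return what had to be sent."""
    needed = {h: b for h, b in zip(fingerprints(blocks), blocks)
              if h not in cloud}
    cloud.update(needed)            # the cloud now holds these blocks too
    return needed

cloud = {}                          # hash -> block, as stored in the cloud
v1 = [b"block-%d" % i for i in range(100)]
first = snapshot(v1, cloud)         # first snapshot: all 100 blocks move
v2 = list(v1)
v2[3] = b"block-3-edited"           # one block changed since the snapshot
second = snapshot(v2, cloud)        # second snapshot: only 1 block moves
print(len(first), len(second))      # 100 1
```

This is why the second and later snapshots of a multi-terabyte volume can complete quickly over a WAN.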
A second component is also vital. Volumes often do not live in isolation. For example, Content Management Systems and SharePoint operate across multiple volumes. A consistent backup requires consistency across the database, content and log “volume group.” The Microsoft Volume Shadow Copy Service (VSS) is a Microsoft-supported way to manage backup consistently across a volume group.
Fast Cloud Restore for Big Content
I spoke to someone at a Fortune 500 company who said, “People don’t really care about backup, BUT they care massively about getting a quick recovery when they need it!” The cloud architecture has a big impact on recovery in a primary or secondary data center. If your cloud backup is one very large multi-terabyte “lump,” recovery could be a slow and painful process. Fast cloud recovery must use the strengths of the cloud-as-a-tier architecture, which allows volumes to be mounted with the working set on fast SSD and local drives while the rest is tiered into the cloud. The same must be possible for recovery: large multi-terabyte volumes must be instantly mountable without pulling all of the content from the cloud on-premises. Instead, only a small amount of metadata is pulled on-premises when the volume is mounted; blocks of content are then pulled on demand, as needed, with smart read-ahead. The local working set is effectively re-created on the fly, with performance improving until a fully populated working set is again in place.
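The on-demand recovery pattern can be sketched as a lazily populated volume: the mount is instant because only metadata comes down, and each read pulls the needed block plus a small read-ahead window. The class, window size, and counters below are illustrative assumptions.

```python
# Sketch of fast cloud restore: the volume mounts instantly with metadata
# only, and blocks are pulled across the WAN on demand with simple
# read-ahead, rebuilding the local working set on the fly. Names assumed.

class LazyRestoredVolume:
    def __init__(self, cloud_blocks, read_ahead=2):
        self.cloud = cloud_blocks      # block index -> data, held in the cloud
        self.local = {}                # working set, re-created on the fly
        self.read_ahead = read_ahead
        self.wan_fetches = 0           # count WAN round trips for illustration

    def _fetch(self, i):
        if i in self.cloud and i not in self.local:
            self.local[i] = self.cloud[i]  # pulled across the WAN once
            self.wan_fetches += 1

    def read(self, i):
        if i not in self.local:
            self._fetch(i)
            for ahead in range(1, self.read_ahead + 1):
                self._fetch(i + ahead)     # read-ahead for sequential access
        return self.local[i]

vol = LazyRestoredVolume({i: b"data%d" % i for i in range(1000)})
vol.read(0)              # mount was instant; first read pulls blocks 0-2
vol.read(1)              # already local: no WAN round trip
vol.read(2)              # already local: no WAN round trip
print(vol.wan_fetches)   # 3
```

Repeated reads get faster as the working set repopulates, which matches the “faster and faster performance” behavior described above.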
Figure 7: Cloud Clone Disaster Recovery / Simple Disaster Recovery – Physical Appliance Content, Data and VM
Disaster Recovery using Cloud-as-a-Tier: Removing the Tears of Tape
When you ask people how often they test their tape disaster recovery procedures, they go either very quiet or red in the face. The process is so painful, it is not often tested; when it is, it is only from one location. Disasters are by their nature unpredictable. Before 9/11, the procedure for getting offsite tape quickly to Wall Street was by plane. Suddenly there were no planes. When Katrina happened, tapes were stranded in flooded mines.
Being able to simply and painlessly mount an encrypted volume from the cloud has some profound benefits. It is now possible to simply and regularly test your DR procedures. The cloud is accessible from many locations, so DR can be tested from many locations, potentially on a regular round-robin basis.
All that is needed is to take a second appliance and install the configuration files and the private key. Then the cloud volumes can be mounted at a remote location and the content is accessible again.
Figure 8: Buckets and Clouds - What’s in a Bucket? / Snapshot, Cloud Snapshot, Cloud Clone
Buckets and Clouds: What’s in a Bucket?
Bucket is a term often used in discussing cloud storage and is related to snapshot and DR strategies:
- A “snapshot” is traditionally stored on-premises. In the case of the failure of the on-premises storage appliance, this snapshot is lost with the appliance.
- A “cloud snapshot” drives all of the blocks of content to the cloud in an application-consistent, deduplicated, encrypted way. In the case of the failure of the on-premises storage appliance, the “application-consistent cloud snapshot” is available from the primary data center and other geographic locations. The “cloud snapshot” is stored on the same volume as the primary content. A small vulnerability is that rogue software writing to the primary bucket can write to the cloud snapshot.
- A “cloud clone” is similar to a cloud snapshot in that all of the blocks of content are driven to the cloud in an application-consistent, deduplicated, encrypted way. The difference is that they are written to a separate bucket that may be configured to be in a different geographic cloud data center location. In this case, what is delivered is an “application-consistent, isolated cloud snapshot” that is completely insulated from any rogue software writing to the primary volume. This encrypted cloud clone is accessible from any data center with an appliance that has access to the cloud and the relevant configuration information and private keys.
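The isolation property that distinguishes a clone from a snapshot can be shown with a toy bucket model: rogue software with write access to the primary bucket can reach an in-bucket snapshot, but not a clone held in a separate bucket. Bucket names and the dict model are illustrative only.

```python
# Sketch of snapshot-vs-clone isolation: a cloud snapshot lives in the same
# bucket as the primary volume, so a rogue writer to that bucket can reach
# it; a cloud clone lives in a separate bucket (possibly another region)
# that the primary writer cannot touch. All names are illustrative.

primary_bucket = {"vol/block0": b"data", "vol/snap0": b"snapshot"}
clone_bucket = {"vol/clone0": b"snapshot"}   # separate, isolated bucket

def rogue_overwrite(bucket_with_write_access):
    """Simulate rogue software corrupting everything it can write to."""
    for key in bucket_with_write_access:
        bucket_with_write_access[key] = b"corrupted"

rogue_overwrite(primary_bucket)   # rogue writer only has the primary bucket
print(primary_bucket["vol/snap0"])  # the in-bucket snapshot is corrupted
print(clone_bucket["vol/clone0"])   # the isolated clone stays intact
```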
Managing Virtual Machines in a Hybrid Cloud Environment
Big Content and virtual machines have one thing in common – they are big. Large virtual machine libraries can also be stored using the working set model and cloud-as-a-tier architecture. This can be for the whole virtual machine or just the datastore.
Summary: Moving to a Simpler Hybrid Cloud Storage World
The traditional approach to offering enterprise storage across the lifecycle of content is complex and costly with a storage stack of hardware and software for:
- Primary Storage
- Disk Based Backup Storage
- Archival Storage
- Tape Infrastructure and Management
- Replicated Storage for Disaster Recovery
- Offsite Locations for Geo-resilience
Hybrid Cloud Storage with a cloud-as-a-tier architecture is designed to use the inherent advantages of the cloud and dramatically simplify the storage stack using:
- Hybrid Cloud Storage Appliance
- Cloud Provider – e.g. Microsoft Windows Azure and other public and private cloud providers
This makes the cloud look like a regular disk drive/volume and tiers content across fast disks for the working set and the cloud for non-working-set content. Your existing on-premises stack and applications continue to work and stay on-premises. The cloud is seamlessly integrated into your on-premises applications without those applications having to be ported to the cloud.
Big Content will flow into the cloud from existing on-premises stacks and applications. Big Content does not move anywhere near as easily as virtual machines.
Virtualization vendors pioneered the efficient movement of whole virtual machines. Just as smaller planets circle larger ones, the massive gravitational pull of Big Content will draw nimble virtual machines to move toward the content and the cloud provider that manages that content. The infrastructures that manage “Big Content” will use it to control the compute that accesses it either directly, in a hybrid architecture or for Disaster Recovery.
“BIG Content will become the new center of gravity and applications will follow the BIG Content to the cloud”
About Ian Howells
CMO at StorSimple
Ian drives corporate marketing strategy and operations activities globally for StorSimple. He is a 20+ year industry veteran and a pioneer in high-volume inbound marketing for enterprise software. Prior to StorSimple, Ian was the CMO at Alfresco, where he was a core part of the team that built it from a startup to the largest private open source company in the world and the clear leader in open source Enterprise Content Management. Prior to Alfresco, Howells was responsible for worldwide marketing at SeeBeyond before its acquisition by Sun. Howells was the first employee of Documentum in Europe, where he held both European and global marketing roles. Ian started his career at Ingres, where he worked initially in engineering and then in marketing.
StorSimple is the leading provider of hybrid cloud storage solutions for Windows and VMware infrastructures. It has achieved the most stringent “Certified for Windows Server 2008” level of certification and has also achieved VMware Ready status. StorSimple securely and transparently integrates the cloud into on-premises applications and offers a single appliance that delivers high-performance tiered storage, live archiving, cloud-based data protection and disaster recovery, reducing cost by up to 90 percent compared to traditional enterprise storage. StorSimple is based in Silicon Valley and is funded by Ignition Partners, Index Ventures, Mayfield Fund, and Redpoint Ventures.