Bite the Bullet and Throw Away Your Data
There's only one way to keep data storage costs under control, and that's to get rid of unnecessary data. That may go against the grain of IT managers who rightly consider backup and preservation as critical to an organization's health, but when it comes to data storage, there are ways to separate what's useful from what's disposable.
It should come as no surprise to any IT manager that an organization's appetite for data keeps growing every minute. No sooner is a new set of storage tools deployed than it becomes clear there's a need for more, and you start planning the next wave of hardware purchases. More hardware means more floor space, power and cooling -- it's a vicious circle.
The total amount of disk storage shipped last year grew 40.5 percent from 2007, according to a recent study by IDC. If you imagine that this appetite for consumption is going to be questioned at some point in the interest of trying to curtail the IT carbon footprint, then you are not alone. Of course, there is no silver bullet, but there are choices you can make to get rid of unwanted data and free up space. That precious space can then be used for other purposes, thereby limiting the amount of new storage capacity that has to be purchased.
Before getting into "slice and dice" mode, it's important to define exactly what the goal is. The goal is to ensure that spinning disk, i.e., online disk storage, is used only for storing data that is really valuable to the organization. Any data that is not -- and it could be in any shape or form -- needs to be appropriately disposed of. Disposal could mean simply deleting it, or it could mean archiving it on another medium, either a magnetic medium such as tape or an optical medium such as DVD or CD-ROM.
What kind of data can be disposed of? Most often, the kind of data that is "time classified" or unstructured -- that is, data that was important at one point but whose importance faded away as it got older -- or data in the form of flat files without any indexing. Log files are big space hogs. Don't get me wrong, log files are very valuable in performing forensics, root cause analysis and other key analyses. However, once the time frame has expired, those log files are practically useless. If the contents of the log files matter a great deal to your organization, then invest in a log file analysis tool that processes the file, extracts meaningful information, and then keeps it in a database that can be used as a reference. The raw files can then be disposed of. You will be surprised at the amount of space you gain by getting rid of such files.
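As a rough illustration of the "extract, then discard" idea, here is a minimal Python sketch that boils raw log lines down to per-day event counts and stores them in a database. The log format (date, time, severity, message), the table name log_summary and the function name summarize_log are assumptions made for this example; a real log analysis tool would extract far more.

```python
import sqlite3
from collections import Counter


def summarize_log(lines, conn):
    """Count events per (day, severity) and store the totals in SQLite.

    Assumes each line looks like: 'YYYY-MM-DD HH:MM:SS LEVEL message'.
    """
    counts = Counter()
    for line in lines:
        parts = line.split(maxsplit=3)
        if len(parts) < 3:
            continue  # skip lines that do not fit the assumed format
        day, level = parts[0], parts[2]
        counts[(day, level)] += 1

    conn.execute(
        "CREATE TABLE IF NOT EXISTS log_summary "
        "(day TEXT, level TEXT, events INTEGER)"
    )
    conn.executemany(
        "INSERT INTO log_summary VALUES (?, ?, ?)",
        [(day, level, n) for (day, level), n in sorted(counts.items())],
    )
    conn.commit()
```

Once the summary rows are safely stored, the raw file becomes a candidate for archiving or deletion under whatever retention period your organization has set.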
You can start by instituting a policy that removes files that meet certain criteria and are older than a certain time stamp. As long as those files are backed up and can be restored when necessary, they don't need to consume expensive online disk storage. Implementing an archive solution makes this task easier.
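An age-based policy like this can be sketched in a few lines of Python. The function name find_stale_files, the "*.log" pattern and the 90-day threshold are all assumptions chosen for illustration; the sketch only reports candidates and deliberately deletes nothing.

```python
import fnmatch
import os
import time


def find_stale_files(root, pattern="*.log", max_age_days=90, now=None):
    """Return files under root that match pattern and were last
    modified more than max_age_days ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in fnmatch.filter(filenames, pattern):
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) < cutoff:
                stale.append(path)
    return sorted(stale)
```

Feeding the resulting list to an archive job first, and deleting only after the archive has been verified, keeps the policy consistent with the backup requirement.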
The next place to examine for redundancy is duplicate software distributions. One can start by examining the obvious ones -- such as software binaries and media that most people have a tendency to store in an online mode for eternity. Let's face it: If you have no intention of ever using that software, or if it is so old that the chances of installing it are slim to none, is there any need for it to consume disk storage?
Many IT organizations think that an online repository for storing all software versions is prudent. To some extent that's true, but unless there's a "garbage collection" process in place, that repository can quickly become an online dump so vast that no one really knows what is useful and what is not. Implement a policy requiring that if the version is more than two years old, and if it is not deployed anywhere, then it will be copied to an offline medium and stored there in the event someone does need it. Online storage is not the place for it.
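The two-year rule reduces to a simple test. The sketch below assumes you already track, for each stored version, a release date and a count of hosts still running it; the SoftwareVersion record and the should_move_offline function are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class SoftwareVersion:
    name: str
    version: str
    released: date
    deployed_hosts: int  # hosts currently running this version


def should_move_offline(item, today, max_age=timedelta(days=730)):
    """True if the version is older than max_age (about two years)
    and is not deployed anywhere."""
    return item.deployed_hosts == 0 and (today - item.released) > max_age
```

Running this check on a schedule is one way to give the repository the "garbage collection" process it needs.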
Taking It to the Next Level
Beyond the level of data that can be easily dispatched is structured data -- data stored in an application's proprietary structure, such as a relational database. Cleanup of structured data can be challenging, as it is generally hard to figure out what is important and what is not. However, a careful examination of the data will usually reveal characteristics that suggest policies for its retirement or disposal. If this data resides in dedicated table spaces, they can simply be dropped. Of course, before you delete the data, make sure it is backed up in an easily retrievable fashion in the event it is needed for any reason.
Log files associated with structured data are often ignored. These files sometimes are not necessary for the online day-to-day functioning of an application and can be disposed of. For example, if your database generates a lot of archive logs and you are in a position to restore your database in full using a backup copy, then archive logs older than the backup copy itself are not necessary. These can be deleted, and precious space can be gained. This may seem to be stating the obvious, but it is worth noting that due diligence to enforce such simple policies goes a long way in ensuring that precious space is not wasted.
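The archive log rule can be expressed as a simple comparison. This sketch assumes you can obtain a completion time for each archive log and for the most recent backup you have verified as restorable; the function name deletable_archive_logs is made up for the example. Most databases ship their own tooling for pruning archive logs, and that tooling should be preferred over a hand-rolled script.

```python
from datetime import datetime


def deletable_archive_logs(logs, last_good_backup):
    """Return names of archive logs completed before the last verified
    full backup.

    logs maps a log file name to the datetime it was completed.
    """
    return sorted(
        name for name, completed in logs.items()
        if completed < last_good_backup
    )
```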
Another factor to consider is the number of backup copies that are stored online. Many times, it is necessary to store full backup copies of a database or another application online so that you can quickly restore the application in the event of a catastrophe. Here, prudence is warranted, as there is no limit to paranoia. You could keep as many copies as you'd like, but you are actually paying for every gigabyte of storage consumed.
Reclaiming space is the biggest challenge in any organization. It has almost become an art rather than a science. How do you go about determining what is being actively used or consumed and what is not?
One approach is to put some kind of lifecycle on each system or server in the datacenter. If a certain server does not see any activity for a few months, then it is time to start questioning whether the server is really being used. If not, then its space can be reclaimed.
There are software suites available that make this task easier to implement, and these tools are well worth the investment.
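Even without a commercial suite, the lifecycle idea can be prototyped. The sketch below assumes you can collect a last-activity date for each server from your monitoring data; the idle_servers function and the 90-day threshold are illustrative choices, not a recommendation.

```python
from datetime import date


def idle_servers(last_activity, today, idle_days=90):
    """Return hosts with no recorded activity for at least idle_days.

    last_activity maps a host name to the date of its last activity.
    """
    return sorted(
        host for host, last_seen in last_activity.items()
        if (today - last_seen).days >= idle_days
    )
```

The output is a list of servers to question, not to decommission automatically; someone still has to confirm with the owner that the server is no longer needed.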
Driven by Policy
Most disciplined IT environments are driven by policies that are strictly enforced and observed. Data disposal should be treated no differently. Once you have adequately secured data from the risk of being corrupted or lost due to human error -- and, more importantly, secured it for compliance and e-discovery purposes -- you should enforce policies to prevent data sprawl. Only data that absolutely must reside on primary online storage (read: your most expensive) needs to be there. The rest should be moved to a suitable alternative medium -- or simply deleted or destroyed.
Inculcating these habits at every level of an organization is almost a science in itself, but an individual or a group within the organization can be charged with the responsibility of creating and enforcing data management policies. This cross-functional group could be given incentives based on the amount of storage saved through proper enforcement of policies.
Investing in Next-Generation Solutions
Technologies that hold a high promise for making this task easier are server virtualization, thin provisioning and online deduplication. During an asset upgrade, purchase or replacement, these functions should be incorporated into the design and implemented to the fullest extent that's practically possible.
Thin provisioning changes the way storage is provisioned to systems. With traditional or standard methods, the amount of storage provisioned is independent of the amount consumed. Once storage is provisioned, the task of reclamation becomes difficult. Thin provisioning makes it easier by presenting the system with more space than it has physically been given. Actual space is "available on demand" -- i.e., it is allocated only as the system writes to a given area, and not otherwise. This allows all the unwritten space to be kept in a reserve pool, making overall disk space management more efficient.
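The allocate-on-write behavior can be modeled with a toy example. The ThinPool class below is a hypothetical illustration, not how any particular array implements thin provisioning: each volume advertises a size, but a physical block is taken from the shared pool only the first time a given block is written.

```python
class ThinPool:
    """Toy model of a shared pool backing thinly provisioned volumes."""

    def __init__(self, physical_blocks):
        self.free = physical_blocks  # unallocated physical blocks
        self.volumes = {}            # name -> {"size": int, "mapped": set}

    def create_volume(self, name, advertised_blocks):
        # Creating a volume consumes no physical space.
        self.volumes[name] = {"size": advertised_blocks, "mapped": set()}

    def write(self, name, block):
        vol = self.volumes[name]
        if not 0 <= block < vol["size"]:
            raise IndexError("block outside advertised size")
        if block not in vol["mapped"]:
            # First write to this block: allocate from the reserve pool.
            if self.free == 0:
                raise RuntimeError("pool exhausted")
            self.free -= 1
            vol["mapped"].add(block)

    def used(self, name):
        """Physical blocks actually consumed by the volume."""
        return len(self.volumes[name]["mapped"])
```

Note that the advertised sizes can add up to more than the pool holds, which is exactly why thinly provisioned pools need monitoring: the model raises an error when the pool runs dry.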
Online deduplication offers great promise by eliminating redundant chunks of online data. Imagine four servers all running the same exact version of the operating system. All the binary files are the same, but because of the way disk space is provisioned, you are essentially using up four times the amount of space with a redundancy factor of four. Now if you could "deduplicate" this data, you could gain back the space taken by the three redundant copies. The array then spoofs the disk space so that each system appears to have its own unique set of binaries. Imagine the amount of disk space that could be saved if this were scaled to the entire datacenter.
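The four-server example can be checked with a small sketch of hash-based deduplication. This is a simplified model, assuming fixed-size 4 KB chunks and SHA-256 fingerprints; the function name dedup_stats is invented here, and real arrays use their own chunking and indexing schemes.

```python
import hashlib


def dedup_stats(volumes, chunk_size=4096):
    """Return (logical_bytes, stored_bytes) for a set of volumes.

    Each volume is a bytes object. A chunk is stored only once, no
    matter how many volumes contain it.
    """
    unique = {}   # chunk fingerprint -> chunk length
    logical = 0
    for data in volumes:
        for offset in range(0, len(data), chunk_size):
            chunk = data[offset:offset + chunk_size]
            logical += len(chunk)
            unique.setdefault(hashlib.sha256(chunk).digest(), len(chunk))
    return logical, sum(unique.values())
```

For four identical images, stored_bytes comes out to one quarter of logical_bytes, matching the redundancy factor of four described above.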
It goes without saying that server virtualization promises to cut down the amount of disk space consumed by systems. When combined with deduplication and thin provisioning, it offers a crucial disk-space-saving recipe that is almost a must for every IT organization.
Disposing of data is like cleaning your closet. You know most of the stuff in there is no longer needed, but you are always posed with the "What if?" question each time you dare to throw something away.
The bottom line is that you need to bite the bullet and get it done for the greater good of your organization, the industry -- and, ultimately, planet Earth.
Ashish Nadkarni is a principal consultant at GlassHouse Technologies.