IT Focus Area: Infrastructure Optimization
January 9, 2013
The Keys to Calculating Big Data Costs
“If I had only one hour to save the world, I would spend 55 minutes defining the problem, and only five minutes finding the solution,” physicist Albert Einstein once said.
In other words, understanding the problem is as important as solving it.
Today, with organizations possessing an enormous amount of data, powerful insights can be gained by asking the right questions about data, especially big data. Big data is the petabytes and exabytes of unstructured and semi-structured data that companies produce and collect. Big data can help organizations know more about their business and enable them to directly translate that knowledge into better decision-making and overall performance.
Before a company begins its first big data project, it is important to calculate the costs so that it doesn’t overspend. There are seven key areas a company should examine before starting a big data project.
1. Appetite for change
IT leaders should take a hard look at the organization’s change readiness. For example, a company with an aggressive growth strategy and an appetite for innovation may want big data capabilities to better execute the rollout of a make-or-break new product or to deliver more “sticky” services to clients as a new source of revenue. In these cases, a high-visibility project that captures the attention of the entire senior management team makes sense. That sort of transparency tends to add rigor and focus to an information technology (IT) project.
On the other hand, a skunkworks approach might be appropriate for an enterprise that has a more cautious change culture. For a proof-of-concept big data project, moving an existing application from the data warehouse into a leased Hadoop environment may be the best approach. There is less concern about keeping pace with technology lifecycles, and the capital investment is smaller. Scaling a big data project also feels safer to senior management if it is understood that the IT department is building on an earlier success.
2. Project scope
Before taking on a big data project, companies can save time and money by doing some careful calculations and benchmarking. These steps can help determine what sort of investment is really required. In most cases, companies determine that the expense and effort of adding the huge amounts of information needed for big data applications would put too much strain on their current computing environment. For example, a Hadoop framework is typically filled with petabytes of data from enterprise data stores, online transactions and social media.
The challenge, then, is to create a new computing environment in the most cost-efficient, secure way. Business requirements and performance expectations should be well understood. Technology-related issues around data management, storage and testing should be resolved. Technology costs, including hardware, software, network and the associated maintenance, should be calculated. And data center costs, including floor space, power, cooling and the operational cost of managing systems, should be accounted for.
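The cost categories listed above can be rolled up into a simple estimate. The following is a minimal sketch, assuming annual figures are available for each category; the class name `AnnualCosts`, the helper `cost_per_terabyte` and every dollar amount are hypothetical placeholders for illustration, not data from any real project.

```python
from dataclasses import dataclass


@dataclass
class AnnualCosts:
    """Annual cost estimate for a big data environment (placeholder fields)."""
    # Technology costs
    hardware: float
    software: float
    network: float
    maintenance: float
    # Data center costs
    floor_space: float
    power: float
    cooling: float
    operations: float

    def technology_total(self) -> float:
        """Sum of hardware, software, network and maintenance costs."""
        return self.hardware + self.software + self.network + self.maintenance

    def data_center_total(self) -> float:
        """Sum of floor space, power, cooling and operations costs."""
        return self.floor_space + self.power + self.cooling + self.operations

    def total(self) -> float:
        """Combined technology and data center cost."""
        return self.technology_total() + self.data_center_total()


def cost_per_terabyte(costs: AnnualCosts, usable_terabytes: float) -> float:
    """Annual cost divided by usable capacity, for comparing options."""
    if usable_terabytes <= 0:
        raise ValueError("usable_terabytes must be positive")
    return costs.total() / usable_terabytes


# Hypothetical figures, chosen only to make the arithmetic easy to follow.
estimate = AnnualCosts(
    hardware=400_000, software=150_000, network=50_000, maintenance=60_000,
    floor_space=30_000, power=45_000, cooling=25_000, operations=240_000,
)
print(estimate.technology_total(), estimate.data_center_total(), estimate.total())
print(cost_per_terabyte(estimate, usable_terabytes=500))
```

Keeping the technology and data center subtotals separate makes it easier to compare an in-house build against a leased environment, where the data center portion may be bundled into a single fee.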
3. Business problem or opportunity
As with other IT investment decisions, the search for an answer starts with identifying the business problem or opportunity. If the marketing department wants to monitor Twitter sentiment trends about a new product introduction, for example, it may expect virtually real-time analysis. That means low latency and a bigger price tag. A financial services firm or large retailer may be looking to conduct fraud analysis or manage transaction security, which will likely require a robust big data environment. If, however, the task is mining data collected over years and the answers are not time sensitive, or the task is hypothesis-driven analysis, high input/output operations per second (IOPS) is far less of a concern.
4. Time to market
Time to market is another important issue to discuss with the business, particularly when it comes to creating a test bed to simulate a big data environment. The prudent course is to use a data-management and storage stack that is isolated from the company’s production environment. But creating that environment can take months if it is not on the IT department’s critical path of enterprise initiatives. The safer alternative may be one that puts less stress on internal IT staff resources, such as a third-party environment that already has a secure and scalable Hadoop distributed file system. The challenge is to weigh a continuum of factors in working toward the right solution; there is no single right answer.
5. Storage choices
All storage options should be considered. Ranging from high-performance, higher-cost to lower-performance, lower-cost, the alternatives include solid-state drives (SSD), storage area networks (SAN), network-attached storage (NAS) and serial ATA (SATA) disks. For example, a big data application that requires instant response, such as a stock trading program, may require SSDs. Local SATA disks on Hadoop cluster data nodes, the most common setup for a Hadoop cluster, may be a better fit for an application with less stringent response-time requirements.
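One way to reason about this trade-off is to pick the least expensive option that still meets the application's response-time requirement. The sketch below assumes each storage option can be summarized by a typical latency and a cost per terabyte; the `StorageTier` type, the `cheapest_tier` function and all of the numbers are hypothetical placeholders that only preserve the relative ordering described above. They are not vendor or benchmark data.

```python
from typing import List, NamedTuple, Optional


class StorageTier(NamedTuple):
    name: str
    latency_ms: float   # hypothetical typical response time
    cost_per_tb: float  # hypothetical cost per terabyte


# Placeholder figures for illustration only.
TIERS = [
    StorageTier("SSD", 0.2, 2000.0),
    StorageTier("SAN", 2.0, 1200.0),
    StorageTier("NAS", 8.0, 600.0),
    StorageTier("SATA", 12.0, 150.0),
]


def cheapest_tier(tiers: List[StorageTier],
                  max_latency_ms: float) -> Optional[StorageTier]:
    """Return the lowest-cost tier whose latency meets the requirement.

    Returns None when no tier is fast enough.
    """
    candidates = [t for t in tiers if t.latency_ms <= max_latency_ms]
    if not candidates:
        return None
    return min(candidates, key=lambda t: t.cost_per_tb)


# A latency-sensitive workload versus a batch analysis workload.
print(cheapest_tier(TIERS, max_latency_ms=1.0))
print(cheapest_tier(TIERS, max_latency_ms=50.0))
```

In practice the latency and cost figures would come from the company's own prototype tests and quotes rather than from fixed constants.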
6. Data management
Next, think about all the data-management challenges that come with tapping into multiple sources of information. Pictures, video, medical records, Internet search indexing, radio-frequency identification (RFID), sensor networks, genomics and call center data: virtually any form of unstructured data can be stored and processed in a big data storage system. A solid information-management framework becomes paramount. The many interdependencies that exist between data sources, storage tiers, protocols and platforms should be carefully considered before data is exported to a Hadoop environment. Systems administrators will have to juggle multiple tools to operate within various storage environments. The best course of action may be to step back and reassess the information-management framework to identify and remediate potential roadblocks to a big data project. These issues could include availability, recoverability and security.
7. Testing
Big data sets call for a different testing approach than the sampling or exhaustive verification used in traditional data warehousing. A test environment should be built on a Hadoop distributed file system, and testers should be proficient with programming tools such as MapReduce. It is important to remember that conventional structured query language (SQL) tools built for relational databases won’t work directly in that environment. Testers will have to acquire big data skills in Hadoop and the infrastructure that Hadoop sits on, or will have to look elsewhere for support.
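For testers new to the MapReduce programming model, the idea can be illustrated without a cluster. The following is a minimal single-process sketch in plain Python, not Hadoop code: a mapper emits key/value pairs, the pairs are grouped by key, and a reducer combines each group. The function names and the sample records are hypothetical and exist only for illustration.

```python
from collections import defaultdict


def mapper(line):
    """Map step: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reducer(key, values):
    """Reduce step: combine all values emitted for one key."""
    return key, sum(values)


def run_mapreduce(lines, mapper, reducer):
    """Run map, group-by-key (shuffle) and reduce in a single process."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    # Sort keys so the result is deterministic, as a shuffle/sort step would.
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


# Hypothetical call center notes used as sample input.
sample = ["late delivery refund", "refund requested", "late delivery"]
counts = run_mapreduce(sample, mapper, reducer)
print(counts)
```

A real Hadoop job distributes the map and reduce steps across data nodes, but the shape of the logic a tester has to verify is the same: what each mapper emits, how keys are grouped, and what each reducer returns.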
Conducting performance analyses of different storage and data-management techniques is a prerequisite for any big data investment. Can SATA drives or a SAN structure meet the business requirements for speed, cost and security? Or will a distinct, high-speed, special-purpose storage area network that connects with a variety of data servers be required? The answer may be found by testing prototype applications in a variety of secure environments, and finding the right balance of performance and cost for specific service levels.
Finding the Right Balance
A company’s first big data project can present a dramatic departure from business as usual in the IT department. But every company is looking to effectively aggregate, store, manage and analyze the data it has, regardless of the volume. Therefore, it is important that IT staff help the business address problems and opportunities in new ways. For example, working with unstructured or semi-structured data is far different from working with information in traditional relational database management systems (RDBMS). Consequently, it is important for an IT organization to learn and understand unfamiliar technologies like Hadoop and MapReduce. Also, the costs associated with running a new big data environment (hardware, software, data center and operations) should be carefully managed.
By weighing the options, risks and rewards of a big data project, companies can head in the right direction. Then, by finding the right balance of strategic objectives, technology investments and project capabilities, they can start putting big data to work.