Big data has become synonymous with Hadoop and its rich ecosystem, which includes Sqoop, Flume, Pig, Hive, HBase, Spark, Hue, Oozie, ZooKeeper, Ambari and more. This open-source framework helps you build and manage computing environments that store, process and analyze large data sets.

A big data environment needs to be highly distributed, scalable and fault-tolerant. In other words, the infrastructure should consist of a large pool of high-performance compute, memory, storage and network resources. This breaks from the virtualization trend that most companies have only recently learned to manage internally: because of performance requirements, virtualization layers and any other components that can slow down processing are not viable. Big data tools must be installed as close to the physical hardware as possible.

Data volumes tend to be in the petabyte or exabyte range, which significantly exceeds the normal volumes of data in enterprise systems. This data is not necessarily stored for long periods of time (it depends on how valuable it is over time). In many cases, only aggregate data is kept. For example, there’s no advantage in saving detailed readings captured by sensors on power distribution networks beyond a short period. But these volumes, however fleeting, mean that massive storage resources need to be available for that short period.
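To illustrate why keeping only aggregates saves so much space, here is a minimal sketch (the sensor IDs, timestamps and the hourly-mean policy are hypothetical, not from any specific system) that rolls raw per-reading sensor data up into one average per sensor per hour:

```python
from collections import defaultdict

def aggregate_hourly(readings):
    """Collapse raw (sensor_id, unix_ts, value) readings into hourly means.

    Returns a dict mapping (sensor_id, hour_start_ts) -> average value,
    so only one number per sensor per hour needs long-term storage.
    """
    buckets = defaultdict(lambda: [0.0, 0])  # (sensor, hour) -> [sum, count]
    for sensor_id, ts, value in readings:
        hour = ts - (ts % 3600)  # truncate timestamp to the hour boundary
        bucket = buckets[(sensor_id, hour)]
        bucket[0] += value
        bucket[1] += 1
    return {key: total / count for key, (total, count) in buckets.items()}

# Three raw readings shrink to two stored aggregates:
raw = [("s1", 0, 10.0), ("s1", 1800, 20.0), ("s1", 3600, 30.0)]
hourly = aggregate_hourly(raw)
print(hourly)  # {('s1', 0): 15.0, ('s1', 3600): 30.0}
```

At real sensor rates (thousands of readings per second), this kind of rollup can reduce retained volume by several orders of magnitude.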

In the context of big data, data can also flood in quickly and accumulate into enormous data sets in a very short timeframe. The ebb and flow can vary, much like the busiest periods for stock market transactions. For companies, these fluctuations in data flows demand highly elastic processing capacity and availability, with storage capacity to match. According to 2015 figures, every minute we create more than 350,000 tweets on Twitter, over 300 hours of video on YouTube and 171 million emails on Gmail. By way of reference, an aircraft engine can generate more than 330 gigabytes of sensor data in the same amount of time.
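A quick back-of-envelope calculation shows how fast such per-minute rates add up. The sketch below (a simple unit conversion, using the aircraft-engine figure quoted above) converts a sustained gigabytes-per-minute rate into terabytes per day:

```python
def daily_volume_tb(gb_per_minute):
    """Convert a sustained ingest rate in GB/minute to TB/day (1 TB = 1024 GB)."""
    return gb_per_minute * 60 * 24 / 1024

# 330 GB/minute, sustained around the clock:
print(f"{daily_volume_tb(330):.1f} TB/day")  # 464.1 TB/day
```

Even if the engine only runs a few hours per flight, a modest fleet quickly pushes into the petabyte range, which is why storage has to be provisioned elastically rather than for the average case.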

In addition, keep in mind that data comes from a wide range of sources and in many different formats, including relational databases, binary files, text documents, spreadsheets, videos, images, sound clips, etc. This variety of information presents a major challenge for companies that want to integrate, transform, process and store data.

That’s why infrastructure has to be very high-performance, agile and efficient in its use of material resources.

A major challenge for CIOs

It’s easy to see how big data poses a major challenge and why it makes any self-respecting CIO nervous. The potential impacts on IT assets are significant. How many resources will you need? Which ones? What’s the best way to manage all these resources? Will the company be able to deliver the required service levels? At what cost?

IT organizations have adopted exemplary operations management practices based on reference models like ITIL, so do we need to start over on another foundation? Even though Hadoop solutions and distributions are available on the market, like Hortonworks, Cloudera and MapR, their stability is an important issue for operations, particularly with regard to technology change management. In addition, some distributions introduce proprietary components to expand features; this is supposed to bring added value, but it can hinder compatibility with other distributions. For companies hesitant to adopt this technology because of these management issues, there are also cloud-based (SaaS) solutions like Databricks and Seldon. These are some of the things you should consider when choosing a solution.

In the next article in this series, we’ll take a closer look at operating a big data environment and analyze different alternatives and scenarios, including local mode, cloud computing, hybrid and managed services.

Read the next article in our Big Data series: 1, 2, 3... How do I operate all this?