Why a Mere 300 Exabytes In Legacy Data Will Give Us A Headache

Although 90% of the available data in the world was created in the last two years, that still leaves a lot of ‘old data’. In 2010 and 2011 we created a total of 3 Zettabytes of data. If we use a very simplified calculation, it would mean that the amount of ‘old data’ is still approximately 0.3 Zettabyte, or 300 Exabytes. Compared to the 2.5 Exabytes of data that we currently create every day, that looks like nothing to worry about. Unfortunately, that is wrong. Those 300 Exabytes of data will give us headaches and sleepless nights, and they will cost a lot of energy and money.
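The simplified calculation can be written out as a short sketch. It uses only the figures quoted above; the function names are made up for illustration, and the second function is an assumed alternative reading in which the 3 Zettabytes are themselves the 90%.

```python
EB_PER_ZB = 1000  # 1 Zettabyte = 1000 Exabytes


def simple_old_data_eb(recent_zb=3.0, recent_share=0.90):
    """Simplified estimate: old data is the remaining 10%, applied to the 3 ZB."""
    return recent_zb * (1 - recent_share) * EB_PER_ZB


def stricter_old_data_eb(recent_zb=3.0, recent_share=0.90):
    """Assumed alternative: the 3 ZB are 90% of the total, the rest is old data."""
    total_zb = recent_zb / recent_share
    return (total_zb - recent_zb) * EB_PER_ZB


def days_of_current_output(old_eb, daily_eb=2.5):
    """How many days of today's data creation the old data corresponds to."""
    return old_eb / daily_eb


old_eb = simple_old_data_eb()
print(round(old_eb))                          # roughly 300 Exabytes
print(round(stricter_old_data_eb()))          # roughly 333 Exabytes
print(round(days_of_current_output(old_eb)))  # roughly 120 days
```

Either way the order of magnitude is the same: all the old data together equals only a few months of what is created today, which is why it looks harmless at first sight.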

Why? Because a large percentage of those 300 Exabytes resides in legacy systems that are incompatible with modern technology. We cannot switch off those systems and we cannot simply import the data into modern Hadoop platforms. Banks and insurance companies in particular have many legacy systems, some of which have been in place for decades. Due to the many mergers and acquisitions in the finance world, banks sometimes have dozens of separate legacy systems. As Karl Flinders writes in his article, one bank even had 40 different legacy systems. These ageing, cobbled-together legacy systems can often be found in payment and credit card systems, ATMs and branch or channel solutions. The fact that these legacy systems cause companies headaches is illustrated by Deutsche Bank, whose big data plans are held back by its legacy systems.

Banks are not the only ones that have to deal with legacy systems. The car industry has to deal with them as well. Ford Motor Company has data centres running software that is 30 or 40 years old. The pharmaceutical industry, the travel industry and the public sector also have to deal with legacy systems. Replacing these legacy systems is almost impossible. Flinders refers to it as “changing the engines on a Boeing 747 while in flight”.

However hard it may seem, it is not impossible, as was shown by the Commonwealth Bank of Australia. In the past 5 years they have replaced the bank’s entire core system, moved most of the services into the cloud and developed many apps and innovations that brought the bank to the forefront of innovation.

Legacy systems consist of traditional relational database management systems, often running on old and slow machines that cannot handle much data at once. Hence, most of these legacy systems process their data at night, and it can take some time to query the data needed. Real-time processing and analysis of data in legacy systems is impossible. We have to look for solutions that let us continue to use that old data.

One way to deal with legacy systems and big data is to replace a company’s entire legacy system. Apart from the massive risks involved in such an operation, there are also a lot of costs involved, so it is not very likely that many organisations will take up this strategy.

As such, it is important to find ways to have new innovative technologies that allow real-time analysis of various data sets to co-exist with the legacy systems. These systems from the terabyte era still contain valuable (historical) information. There are several ways to keep and use the historical data in the data warehouses:


  1. Macro-batching of the data into the new big data solutions on a periodic timescale, for example every night. This data can then be used together with the ‘new’ data.
  2. Sending periodic summaries of the data in the data warehouse, in order to use the data in those warehouses while preventing continuous querying of that data. Only when certain information is required is the data warehouse queried and that data retrieved.
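The two approaches above can be sketched in a few lines. This is a minimal illustration only: an in-memory SQLite database stands in for the legacy data warehouse, and the `transactions` table, its columns and all function names are hypothetical, not taken from any real system.

```python
import sqlite3


def make_legacy_warehouse():
    """Build an in-memory stand-in for a legacy warehouse (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transactions ("
        "id INTEGER PRIMARY KEY, account TEXT, amount REAL, booked_on TEXT)"
    )
    conn.executemany(
        "INSERT INTO transactions (account, amount, booked_on) VALUES (?, ?, ?)",
        [
            ("A", 100.0, "2013-01-01"),
            ("A", 50.0, "2013-01-02"),
            ("B", 20.0, "2013-01-02"),
        ],
    )
    conn.commit()
    return conn


def macro_batch(conn, since):
    """Option 1: pull all rows booked on or after `since` in one periodic batch."""
    cur = conn.execute(
        "SELECT id, account, amount, booked_on FROM transactions "
        "WHERE booked_on >= ? ORDER BY id",
        (since,),
    )
    return cur.fetchall()


def periodic_summary(conn):
    """Option 2: send only per-account aggregates instead of the raw rows."""
    cur = conn.execute(
        "SELECT account, COUNT(*), SUM(amount) FROM transactions "
        "GROUP BY account ORDER BY account"
    )
    return {account: {"count": n, "total": total} for account, n, total in cur}


conn = make_legacy_warehouse()
nightly_rows = macro_batch(conn, "2013-01-02")  # rows to load into the new platform
summary = periodic_summary(conn)                # compact view of the warehouse
print(len(nightly_rows), summary)
```

In the first option the new platform receives a copy of the raw rows on a schedule; in the second it receives only aggregates and goes back to the warehouse when it needs the detail behind them.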

These solutions make it possible to analyse both unstructured data and structured legacy data within a single integrated architectural framework. Such a platform allows the legacy data to remain within the existing data warehouses and at the same time enables near real-time analyses.

Using middleware to enhance systems and replace the hardware that supports them is, however, not ideal. Another problem for legacy systems is that, with the overall acceptance of big data, a larger percentage of the IT budget will go to big data projects, leaving less money for the legacy systems. In turn, the employees who are able to work with the legacy systems become scarce and thus expensive.

If such a trend continues for too long, there is a danger that the legacy system will fail one day, placing the company in a lot of trouble. The later organisations start replacing these legacy systems, or at least try to make them compatible with big data technologies, the more expensive and difficult it will be.

A less risky but still expensive solution could be to develop a specific algorithm that can transfer millions of lines of old data into modern distributed file systems. Until all data works correctly in the new distributed file systems, the two can co-exist. A paper by Mariano Ceccato, among others, explains how they developed an algorithm to infer a structured data model from a legacy data system in an attempt to restructure that legacy system into an up-to-date and usable data model.

Truly transformative insights can only come when all data is used, including the data from legacy systems with incompatible data formats. Therefore, the data in legacy systems will eventually need to be transferred to massively scalable storage systems, which will then take over the critical search, calculation and reporting functions of those legacy systems.

In the end, the ambition for any organisation with legacy systems should be to truly retire these systems, as companies will not be able to support them forever. If, in the meantime, they integrate that legacy data into one platform for data aggregation, they can already reap the benefits of historical data to create truly valuable insights.

Finally, I came across the video below from EMC, which gives a great explanation of how to deal with legacy systems:


Image Credit: Kinga/Shutterstock