DATA QUALITY FOR ANALYTICS: clean input drives better decisions
Organizations are increasingly relying on analytics and advanced data visualization techniques to deliver incremental business value. However, when their efforts are hampered by data quality issues, the credibility of their entire analytics strategy comes into question. Because analytics traditionally is seen as a presentation of a broad landscape of data points, it is often assumed that data quality issues can be ignored since they would not impact broader trends. But should bad data be ignored to allow analytics to proceed? Or should analytics stall until data quality issues are addressed? In this article, Niko Papadakos, Mohit Sharma, Mohit Arora and Kunal Bahl use a shipping industry scenario to highlight the dependence on quality data and discuss how companies can address data quality in parallel with the deployment of their analytics platforms to deliver even greater business value.
AN ANALYTICS USE CASE: FUEL CONSUMPTION IN THE SHIPPING INDUSTRY
Shipping companies are increasingly analyzing the financial and operational performance of their vessels against competitors, industry benchmarks and other vessels within their fleet. A three-month voyage, such as a round trip from the US West Coast to the Arabian Gulf, can generate a large volume of operational data, most of which is manually collected and reported by the onboard crew.
Fuel is one of the largest cost components for a shipping company. Optimum fuel consumption in relation to the speed of the vessel is a tough balancing act for most companies. The data collected daily by the fleet is essential to analyze the best-fit speed and consumption curve. Figure 1 demonstrates an example of a speed versus fuel consumption exponential curve plotted to determine the optimum speed range at which the ships should operate. With only a few errors made by the crew in entering the data (such as an incorrect placement of a decimal point), the analysis presented is unusable for making decisions. The poor quality of data makes it impossible to determine the relationship between a change in speed and the proportional change in fuel consumption as presented in Figure 1.
Figure 1: Speed – Fuel consumption curves (including data quality issues).
If the outliers are removed, the analysis shown in Figure 2 provides a clear correlation between the speed of the vessel and its fuel consumption.
Figure 2: Speed – Fuel consumption curves (cleaned data by removing outliers).
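The effect of a single misplaced decimal point on the fitted curve can be sketched in a few lines. The example below is only an illustration: it assumes a curve of the form fuel = a * exp(b * speed), fits it by least squares on log(fuel), and uses made-up daily reports rather than the data behind Figures 1 and 2. The function name `fit_exponential` and all values are hypothetical.

```python
import math

def fit_exponential(speeds, fuels):
    """Fit fuel = a * exp(b * speed) by least squares on log(fuel)."""
    n = len(speeds)
    logs = [math.log(f) for f in fuels]
    mean_x = sum(speeds) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in speeds)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(speeds, logs))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b

# Hypothetical daily reports: speed in knots, fuel in tonnes per day.
speeds = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]
clean_fuels = [20.0 * math.exp(0.12 * s) for s in speeds]

# Same reports, but one entry has its decimal point misplaced (x10).
dirty_fuels = list(clean_fuels)
dirty_fuels[1] = clean_fuels[1] * 10

a_clean, b_clean = fit_exponential(speeds, clean_fuels)
a_dirty, b_dirty = fit_exponential(speeds, dirty_fuels)
print(f"clean fit: a={a_clean:.2f}, b={b_clean:.3f}")
print(f"dirty fit: a={a_dirty:.2f}, b={b_dirty:.3f}")
```

With these made-up numbers, one bad entry out of six is enough to turn the fitted exponent negative, which would suggest that fuel consumption falls as speed rises.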
As shown in these examples, most analytics programs are designed based on the belief that removing outliers is all that is needed to make sense of the data, and there are many data analysis tools available that can help with that. However, what if some of those outliers are not outliers and were the result of a scenario that needs to be considered? For instance, in the example, what if some of the outliers were actual fuel consumption points captured when the ship encountered inclement weather? By ignoring these data points, users can make assumptions without considering important dimensions—and that could lead to very different decisions. This approach not only makes the analysis dubious, but also often leads to incorrect conclusions.
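One possible way to avoid discarding legitimate readings is to label unusual points instead of deleting them, and to check each one against available context. The sketch below assumes each daily report carries a `weather` note. That field, the 50 percent tolerance and the sample values are assumptions made for illustration, not part of the scenario above.

```python
from statistics import median

def classify_reports(reports, tolerance=0.5):
    """Label each report 'ok', 'explained' or 'suspect'.

    A report is unusual when its fuel figure deviates from the median of
    reports at the same speed by more than `tolerance` (as a fraction).
    Unusual reports logged in heavy weather are kept as 'explained'.
    """
    by_speed = {}
    for r in reports:
        by_speed.setdefault(r["speed"], []).append(r["fuel"])

    labels = []
    for r in reports:
        typical = median(by_speed[r["speed"]])
        deviation = abs(r["fuel"] - typical) / typical
        if deviation <= tolerance:
            labels.append("ok")
        elif r.get("weather") == "heavy":
            labels.append("explained")
        else:
            labels.append("suspect")
    return labels

# Hypothetical reports at 12 knots (fuel in tonnes per day).
reports = [
    {"speed": 12, "fuel": 84.0, "weather": "calm"},
    {"speed": 12, "fuel": 85.0, "weather": "calm"},
    {"speed": 12, "fuel": 86.0, "weather": "calm"},
    {"speed": 12, "fuel": 140.0, "weather": "heavy"},  # real, weather-driven
    {"speed": 12, "fuel": 850.0, "weather": "calm"},   # likely decimal error
]
labels = classify_reports(reports)
print(labels)
```

Only the "suspect" reports would then be candidates for correction or exclusion, while the "explained" ones remain available as a separate dimension of the analysis.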
In some cases, the practice of removing outliers can lead to the deletion of a significant number of data points from the analysis. But can users get the answer they are looking for by ignoring 40 percent of the data set? Companies need to be able to determine the speed at which vessels are most efficient with a lot more certainty. Data quality issues only reduce the confidence in the analysis conducted. In the shipping example, a difference in speed of 1 to 2 knots can potentially result in a difference of $500,000 to $700,000 in fuel consumption for a round trip US West Coast to Arabian Gulf voyage at the current bunker price.
Does this mean that data needs to be validated 100 percent before it can be used for analytics? Does the entire universe of data need to be clean before it is useful for analytics? Absolutely not. In fact, companies should only clean the data they intend to use. The right approach can help to determine which issues should be addressed to manage data quality.
DATA USED FOR ANALYTICS: WHERE SHOULD I USE MY CLEANSING TOOLS?
Analytics use cases have specific needs in terms of which pieces of data are critical to the analysis. For each piece of data, the rules or standards required to make it suitable for the analysis must also be defined. But not all data standards have equal priority.
For instance, in the shipping example above, it might be more important to ensure that the data used for analysis is accurate as compared to ensuring that all the data is available. In other words, using 80 percent of 100 percent accurate data to generate the trend is better than using 100 percent of data that is only 80 percent accurate. An organization should focus most of its energy on data used by high-impact business processes.
To manage the quality of data, organizations need a robust data quality management framework. This will enable them to control, monitor and improve data as it relates to various analytics use cases.
APPROACH TO DATA QUALITY MANAGEMENT
Data is created during the course of a single business process and it moves across an organization as it goes through the different stages of one or more business processes. As data flows from one place to the next, it transforms and presents itself in other forms. Unless it is managed and governed properly, it can lose its integrity.
Although each type of data needs a distinct plan and approach for management, there is a generic framework that can be leveraged to effectively manage all types of data. As shown in Figure 3, the data quality management framework consists of three components: control, monitor and improve.
Figure 3: Data quality management framework.
The best way to manage the quality of data in an information system is to ensure that only the data that meets the desired standards is allowed to enter the system. This can be achieved by putting strong controls in place at the front end of each data entry system, or by putting validation rules in the integration layer responsible for moving data from one system to another. Unfortunately, this is not always feasible or economically viable when, for example, data is captured manually and then later captured in a system, or when modifications to applications are too expensive, particularly with commercial off-the-shelf (COTS) software.
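A front-end control of this kind can be as simple as a set of validation rules applied before a record is accepted. The following is a minimal sketch under stated assumptions: the field names `speed_knots` and `fuel_tonnes` and the plausible ranges are hypothetical and would need to be defined by the business for a real system.

```python
def validate_report(report):
    """Return a list of rule violations; an empty list means the report may enter."""
    errors = []
    speed = report.get("speed_knots")
    fuel = report.get("fuel_tonnes")
    if speed is None or fuel is None:
        errors.append("speed_knots and fuel_tonnes are required")
        return errors
    # Hypothetical plausibility ranges for a daily vessel report.
    if not 0 <= speed <= 30:
        errors.append("speed_knots outside plausible range 0-30")
    if not 0 <= fuel <= 300:
        errors.append("fuel_tonnes outside plausible range 0-300")
    return errors

accepted = validate_report({"speed_knots": 12.5, "fuel_tonnes": 85.0})
rejected = validate_report({"speed_knots": 12.5, "fuel_tonnes": 850.0})
incomplete = validate_report({"speed_knots": 12.5})
print(accepted, rejected, incomplete)
```

The same function could sit in a data entry form or in an integration layer. In either place, a non-empty result would block the record and tell the person entering it what to correct.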
In one particular case, a company decided against implementing changes to one of its main data capture COTS applications that would have enforced stricter data controls. They relied instead on training, monitoring and reporting on the use of the system to help them improve their business process, and as a result, experienced improved data quality. However, companies that have implemented strong quality controls at the entry gates for every system have realized very effective data quality management.
It is natural to think that if a company has strong controls at each system’s entry gate, then the data managed within the systems will always be high in quality. In reality, as processes mature, people responsible for managing the data change, systems grow old and the quality controls are not always maintained to keep up with the desired data quality levels. This generates the need for periodic data quality monitoring by running validation rules against stored data to ensure the quality meets the desired standards.
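Periodic monitoring of stored data can be sketched as running each validation rule over the records and reporting the share that pass. The rule names, field names and the 95 percent threshold below are illustrative assumptions, not values taken from this article.

```python
def monitor(records, rules, threshold=0.95):
    """Return {rule_name: (pass_rate, below_threshold)} for stored records."""
    results = {}
    for name, rule in rules.items():
        passed = sum(1 for r in records if rule(r))
        rate = passed / len(records) if records else 1.0
        results[name] = (rate, rate < threshold)
    return results

# Hypothetical rules over hypothetical daily vessel reports.
rules = {
    "fuel_present": lambda r: r.get("fuel_tonnes") is not None,
    "speed_in_range": lambda r: r.get("speed_knots") is not None
    and 0 <= r["speed_knots"] <= 30,
}
stored = [
    {"speed_knots": 12.0, "fuel_tonnes": 84.0},
    {"speed_knots": 13.0, "fuel_tonnes": None},   # missing fuel figure
    {"speed_knots": 125.0, "fuel_tonnes": 90.0},  # implausible speed
    {"speed_knots": 14.0, "fuel_tonnes": 95.0},
]
quality = monitor(stored, rules)
for name, (rate, alert) in quality.items():
    print(f"{name}: {rate:.0%} pass{' (below threshold)' if alert else ''}")
```

Run on a schedule, a check like this would show quality drifting below the desired level even when entry controls have not been maintained.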
In addition, as information is copied from one system to another, the company needs to monitor the data to ensure it is consistent across systems or against a “system of record.” Data quality monitors enable organizations to proactively uncover issues before they impact the business decision-making process. As shown in Figure 4, an industry-standard five-dimension model can be leveraged to set up effective data quality monitors.
Figure 4: The five Cs of data quality.
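The cross-system consistency check mentioned above can be illustrated by comparing a downstream copy of the data with the system of record, field by field. This is a rough sketch: the record keys, field names and sample values are hypothetical.

```python
def find_mismatches(system_of_record, copy, fields):
    """Compare a downstream copy with the system of record, keyed by record id.

    Returns tuples of (record id, field, value in system of record, value in copy).
    A record absent from the copy is reported with the field name 'missing'.
    """
    issues = []
    for key, master in system_of_record.items():
        other = copy.get(key)
        if other is None:
            issues.append((key, "missing", None, None))
            continue
        for field in fields:
            if master.get(field) != other.get(field):
                issues.append((key, field, master.get(field), other.get(field)))
    return issues

# Hypothetical voyage records held in two systems.
system_of_record = {
    "V001": {"cargo_type": "crude", "fuel_tonnes": 84.0},
    "V002": {"cargo_type": "LNG", "fuel_tonnes": 91.0},
    "V003": {"cargo_type": "crude", "fuel_tonnes": 88.0},
}
analytics_copy = {
    "V001": {"cargo_type": "crude", "fuel_tonnes": 84.0},
    "V002": {"cargo_type": "LNG", "fuel_tonnes": 9.1},
}
issues = find_mismatches(system_of_record, analytics_copy, ["cargo_type", "fuel_tonnes"])
print(issues)
```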
An example of a monitoring dashboard is shown in Figure 5. It is built to provide early detection of data quality issues. This enables organizations to perform root-cause analysis and to prioritize their investments in training, business process alignment or redesign.
Figure 5: Sample data quality monitoring dashboard.
When data quality monitors report a dip in quality, a number of remediation steps can be taken. As mentioned above, remediation through system enhancements, training and process adjustments involves both technology and people. When a dip in quality occurs, it may be the right time to start a data quality improvement plan. Typically, an improvement plan includes data cleansing, which can be done either manually by business users or via automation. If the business can define rules to fix the data, then data cleansing programs can be easily developed to automate the data improvement process. This, followed by business validation, ensures that the data is back to its desired quality level. Often, organizations make the mistake of ending data quality improvement programs after a round of successful validation.
A critical step that is often missed is enhancing data quality controls to ensure the same issues don’t happen again. This requires a thorough root-cause analysis of the issues and data quality controls that need to be added to the source systems to prevent the same issues from reoccurring. Implementing these steps is even more critical when a project includes reference or master data, such as client, product or market data. Also, organizations that are implementing an integration solution will benefit from taking on this additional effort as it enables quality data to flow across the enterprise in a solution that can be scaled over time.
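Where the business can define a rule to fix the data, the automated cleansing step might look like the sketch below. The rule shown (correct a fuel figure when shifting the decimal point by one place brings it into a plausible range) and the range itself are hypothetical. Every change is logged so that business users can validate it, and anything the rule cannot fix is left untouched for manual review.

```python
def cleanse_fuel(records, low=50.0, high=150.0):
    """Apply a hypothetical business rule for misplaced decimal points.

    If fuel_tonnes is outside [low, high] but dividing or multiplying by 10
    brings it inside, the value is corrected and the change is logged.
    """
    cleaned, log = [], []
    for rec in records:
        fuel = rec["fuel_tonnes"]
        fixed = fuel
        if not low <= fuel <= high:
            for candidate in (fuel / 10, fuel * 10):
                if low <= candidate <= high:
                    fixed = candidate
                    break
        if fixed != fuel:
            log.append((rec["id"], fuel, fixed))
        cleaned.append({**rec, "fuel_tonnes": fixed})
    return cleaned, log

# Hypothetical stored reports.
records = [
    {"id": 1, "fuel_tonnes": 84.0},
    {"id": 2, "fuel_tonnes": 850.0},  # decimal point one place too far right
    {"id": 3, "fuel_tonnes": 8.5},    # decimal point one place too far left
    {"id": 4, "fuel_tonnes": 400.0},  # no simple fix; left for manual review
]
cleaned, change_log = cleanse_fuel(records)
needs_review = [r["id"] for r in cleaned if not 50.0 <= r["fuel_tonnes"] <= 150.0]
print(change_log, needs_review)
```

A routine like this only repairs existing records. It does not replace the root-cause analysis and the added source-system controls described above.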
Most effective data quality management programs are centrally run by an enterprise-level function and are only successful if they are done in partnership with the business. Ultimately, it is the business that owns the data, while the IT teams are the enablers. But how can the business contribute to these seemingly technical programs?
DATA QUALITY IS AS MUCH ABOUT THE PEOPLE AS IT IS ABOUT TECHNOLOGY
In addition to the technical challenges faced by most data projects, there are often organizational hurdles that also must be overcome. This becomes particularly pronounced in organizations where data is vast, diverse and often owned by different departments with conflicting priorities. Therefore, a combination of data governance, stakeholder management and careful planning is needed, along with the right approach and solution. Key challenges that must be addressed for data quality initiatives include the following:
- Stewardship—Like any corporate asset, data needs stewardship. A data steward is needed to provide direction and influence resources to control, monitor and improve data. The data steward should be someone with a strategic understanding of business goals and an interest in building organizational capabilities around data-driven decision making. Having a holistic understanding will help the data steward direct appropriate levels of rigor and priority to improve data quality.
- Business Case—Organizations are unlikely to invest in data quality initiatives just for the sake of improving data quality. A definition of clean data and a justification for why it is important for analytics as well as operations need to be documented. Some of the common themes in the business case include accurate and credible data for reporting, reduced rework at various levels and good quality decisions. The business case should present the data issues as opportunities that can unlock significant gains in the form of analytics and/or become the foundation of future growth.
- Ownership—Often, personnel other than data stewards and data entry personnel (data custodians) use the data for decision making. In that context, it is imperative for custodians to understand the importance of good quality data. The drive and ownership for entering and maintaining good quality data need to grow organically. As an example, the crew onboard a vessel is more likely to take ownership of entering good quality and timely data about port time or fuel consumption if they know that the decisions involving asset utilization and efficiency are driven from data reported by the crew.
- Sustainable Governance—Making data quality issues visible or measuring the quality of data is good information to have, but ultimately does not move the needle in terms of improving data quality. A sustainable governance structure with close cohesion between data stewards, data custodians and a supporting model is required. It is nice to know that the data supporting a certain business process is at 60 percent or 90 percent quality, but that in and of itself will not automatically drive the right behaviors. A balanced approach of educating and training data custodians and enforcing data quality standards is recommended. With a changing business landscape and personnel, reinforcing the correct data entry process from time to time may improve quality. On the other hand, to ensure that overall data quality does not drop over time, effective monitoring and controls are also equally important. Doing one without the other may work in the short term, but may not be sustainable over time. For real change and improvement to happen, organizations need to implement a robust and sustainable data governance model.
- Communication—Any data quality initiative is likely to meet resistance from some groups of stakeholders and poor communication can make matters worse. Therefore, a well-thought-out communication plan must be put in place to inform and educate people about the initiative and quantify how it may impact them. Also, it is important to clarify that the objective is not just to fix the existing bad data, but to also put tools and processes in place to improve and maintain the quality at the source itself. This communication can be in the form of mailers, roadshows or lunch-and-learn sessions. Further, the sponsors and stakeholders must be kept engaged throughout the lifecycle of the program to maintain their support.
- Remediation—Every attempt should be made to make the lives of data stewards easier. They should not view data quality monitoring and remediation routines as excessive or a hindrance to their day-to-day job. If data collection can be integrated and the concept of a single version of truth replicated across the value chain, it will ultimately improve the quality of data. For example, if the operational data captured by a trading organization (such as cargo type, shipment size or counterparty information) is integrated with pipeline or marine systems, it will ultimately enable pipeline and shipping companies to focus on collecting and maintaining data that is intrinsic to their operation.
As organizations increasingly rely on their vast collections of data for analytics in search of a competitive advantage, they need to take a practical and fit-for-purpose approach to data quality management. They can meet this critical dependency for analytics by following these principles:
- Tackle analytics with an eye on data quality
- Use analytics use cases to prioritize data quality hot spots
- Decide on a strategy for outliers and use the 80/20 rule when pruning the data set
- Ensure decisions are trustworthy and make data quality stick by addressing root causes and implementing a monitoring effort
- More than any other program, make this one business-led for optimum results
Niko Papadakos is a Director at Sapient Global Markets in Houston, focusing on data. He has more than 20 years of experience across financial services, energy and transportation. Niko joined Sapient Global Markets in 2004 and has led project engagements in key accounts involving data modeling, reference and market data strategy and implementation, information architecture, data governance and data quality.
is a Senior Manager and Enterprise Architect with eight years of experience in the design and implementation of solutions for oil and gas trading and supply management. During this time, Mohit was engaged in multiple large and complex enterprise transformation programs for oil and gas majors. Most recently, he developed a total cost of ownership (TCO) model for a major North American gas trading implementation.
is a Senior Manager at Sapient Global Markets and is based in Houston. He has over 11 years of experience leading large data management programs for energy trading and risk management clients as well as for major investment banks and asset management firms. Mohit is an expert in data management and has a strong track record of delivering many data programs that include reference data management, trade data centralization, data migration, analytics, data quality and data governance.
Kunal Bahl is a Senior Manager in Sapient Global Markets’ Midstream Practice based in San Francisco. He is focused on Marine Transportation and his recent assignments include leading a data integration and analytics program for an integrated oil company, process automation for another integrated oil company and power trading system integration for a regional transmission authority.