- Data quality problems can hurt companies in multiple ways, from damaging an organization’s reputation to increasing its risk exposure, and they cost U.S. businesses more than $600 billion annually.
- In contrast to transaction datasets, which rarely exceed the capabilities of traditional data processing, unstructured “big data” is difficult to process with traditional tools due to three characteristics (“three Vs”): volume, velocity, and variety.
- Data quality, defined as “fitness for use,” depends upon its context and has at least 26 identified dimensions, of which four are particularly relevant for big data: 1) consistency, 2) provenance and believability, 3) accuracy, and 4) completeness.
- In launching an effort to improve quality, managers should “start small but start now,” first making investments to improve the quality of those pieces of big data that have high functional utility.
Advances in information technology have enabled organizations to collect and store more data than ever before. As data volumes increase, though, so do the complexities of managing that data and the risks of poor data quality. As many companies have discovered, bad data can have a huge, costly impact: customers fail to receive their orders; clients are overcharged (or undercharged); inventory runs out unexpectedly; parts don’t arrive on time; and so on. Such problems can hurt companies in a number of ways, from damage to the organization’s reputation and heightened risk exposure to failures of regulatory compliance and significant capital losses. Although international figures are difficult to obtain, data quality problems currently cost U.S. businesses more than $600 billion annually.
It’s not surprising, then, that most organizations have a robust set of standards and practices to manage, monitor, and clean their transaction data. But, for the most part, similar controls rarely exist for big data. Although transaction datasets can be huge, they almost never exceed the capabilities of traditional data processing approaches, even for large retail chains. Also, transaction data tend to be well structured and generated in relatively small amounts at discrete points in time. Contrast that with the continuous, voluminous video data that a retail chain might collect about its customers as they navigate through the aisles of its stores.
Big data are difficult to process with traditional tools because of characteristics commonly known as the “three Vs.” The first is volume, the amount of data; this sheer size is what most people understand by the term “big.” The second is velocity, the speed at which big data arrive and need to be processed. An example of high-velocity data is the stream collected by sensors on a piece of equipment. The third is variety, which refers to the formats used. In-store customer behavior, for example, can be recorded not only with video but also through signals captured from people’s mobile phones and records of their purchases at checkout. A commonly cited example of big data is social media, which is characterized by all three Vs: every minute, users generate 204 million emails, 1.8 million Facebook likes, 278,000 tweets, and 200,000 photo uploads to Facebook.1
What Is Data Quality?
Data quality is defined as “fitness for use,” which is, above all, a contextual construct. For instance, a semester before the start of a college course, data on the number of students registering for a class may be good (accurate) enough to place an order for textbooks, but that same data might be insufficient for the professor to start creating project teams for that class.
Addressing context is a difficult problem even with transactional data, for which most potential uses are typically known a priori. When dealing with big data, the potential uses are often unknown, and thus managing quality is even more problematic. For example, will a company use social media for evaluating the success of a branding effort, for responding in real time to customer complaints, or for some entirely different purpose? And might that purpose change in some unknown way after the initial analysis? Furthermore, the reasons why people use social media (to tweet about a new restaurant, for instance) might have nothing to do with how a company might want to use that data, requiring that data to be repurposed for the different context.
Academic research on data quality has identified at least 26 of its dimensions. In this article, we focus on the four that are particularly relevant in the world of big data: 1) consistency, 2) provenance and believability, 3) accuracy, and 4) completeness.
1. Consistency.
Even for data that are entirely under an organization’s control, consistency is surprisingly difficult to achieve. The reasons include data entry errors; changes in data over time; the use of synonyms, abbreviations, and acronyms; differences in format, such as voice versus text; differences in semantics and coding, especially across business units and after company acquisitions; and equipment malfunctions for data that are collected by sensors. Consistency is generally even more difficult to achieve for big data. Because many big-data sources are external and unstructured, imposing rules at data entry, which can be done for internal data, becomes impossible. So, for example, the use of video data requires a company to tag and interpret that data, and the technology for doing so is still relatively immature.
Data cleansing software used for traditional data processing can be helpful to some extent, but the processes for cleansing big data are different and have some major limitations. When cleansing software finds an inconsistency in a traditional dataset, the data can be flagged for a human to decide which of the conflicting data entries is correct. But, because of the volume and velocity of big data, human intervention is inadequate for resolving inconsistencies. Instead, artificial intelligence systems built around business rules need to be programmed into the cleansing software so that inconsistencies can be resolved without the need for human intervention.
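Automated resolution of this kind typically amounts to encoding business rules as code. The sketch below is a minimal illustration, assuming a hypothetical source-trust ranking and record layout (none of these names come from any particular product): conflicting records are resolved by preferring the most trusted source, then the most recent timestamp, with no human in the loop.

```python
from datetime import datetime

# Hypothetical source-trust ranking: lower rank = more trusted.
SOURCE_RANK = {"pos_system": 0, "loyalty_app": 1, "social_feed": 2}

def resolve(records):
    """Pick one value from conflicting records without human review.

    Rule 1: prefer the most trusted source.
    Rule 2: break ties with the most recent timestamp.
    """
    return min(
        records,
        key=lambda r: (SOURCE_RANK.get(r["source"], 99),
                       -r["timestamp"].timestamp()),
    )

conflicting = [
    {"source": "social_feed", "value": "Spirit",
     "timestamp": datetime(2024, 5, 1, 12, 30)},
    {"source": "pos_system", "value": "Spirit X200",
     "timestamp": datetime(2024, 5, 1, 12, 0)},
]
best = resolve(conflicting)
print(best["value"])  # the point-of-sale record wins on source trust
```

In practice such rule sets grow large and must themselves be governed, but the pattern — rank, tie-break, resolve — stays the same.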
Moreover, traditional relational database management systems typically support integrity constraints and triggers that help enforce a variety of business rules for improving data consistency. The structure of big data, however, is too irregular for relational databases, so NoSQL database systems must be used. To enforce integrity in such systems, companies must write computer programs in a language such as Java. Systems analysts, who generally act as the interface between managers and database specialists, may now need to increase their programming expertise in order to fulfill their same roles with respect to NoSQL databases.
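Because such integrity checks live in application code rather than in the database engine, they look roughly like the following sketch (shown in Python rather than Java purely for brevity; the field names and rules are hypothetical): every document is validated against required fields, expected types, and business rules before it is written to the store.

```python
# Required fields and their expected types for a hypothetical
# transaction document in a NoSQL store.
REQUIRED = {"customer_id": str, "amount": float, "store": str}

def validate(doc):
    """Return a list of integrity violations (empty if the doc is valid).

    This replicates, in application code, what a relational database
    would enforce with declarative constraints and triggers.
    """
    errors = []
    for field, ftype in REQUIRED.items():
        if field not in doc:
            errors.append(f"missing field: {field}")
        elif not isinstance(doc[field], ftype):
            errors.append(f"bad type for {field}")
    if doc.get("amount", 0) < 0:
        errors.append("amount must be non-negative")
    return errors

print(validate({"customer_id": "C42", "amount": -5.0, "store": "NY01"}))
# ['amount must be non-negative']
```

The burden of keeping these checks complete and consistent across applications is exactly the added programming expertise the article describes.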
2. Provenance and Believability.
Provenance (that is, the history of the source of data) and believability are two related concepts. Believability is the extent to which data are accepted, in a specific context, as true, or at least apparently true. The sources of external big data are generally not as trustworthy as most internal sources; for one thing, much big data was never intended for collection, let alone for analysis to support decision making.
To assess the credibility of big data, companies can track and manage metadata on the lineage of that data—from their source, through transformations and computations, including replications, to their current form. That metadata will then help people assess whether the data can reliably be used in a particular application based on their understanding of the context of that application.
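One lightweight way to carry such lineage metadata is to attach it to the data item itself and append an entry at every transformation. The sketch below is a minimal illustration, with hypothetical source and step names; real lineage systems also record who ran each step and with what code version.

```python
import copy
from datetime import datetime, timezone

def with_lineage(value, source):
    """Wrap a data item with provenance metadata at ingestion time."""
    return {"value": value,
            "lineage": [{"step": "ingest", "source": source,
                         "at": datetime.now(timezone.utc).isoformat()}]}

def transform(item, step_name, fn):
    """Apply a transformation and append it to the lineage trail."""
    out = copy.deepcopy(item)
    out["value"] = fn(out["value"])
    out["lineage"].append({"step": step_name,
                           "at": datetime.now(timezone.utc).isoformat()})
    return out

item = with_lineage(" 600 USD ", source="twitter_firehose")
item = transform(item, "strip_whitespace", str.strip)
item = transform(item, "drop_currency", lambda v: v.split()[0])
print(item["value"], [s["step"] for s in item["lineage"]])
# 600 ['ingest', 'strip_whitespace', 'drop_currency']
```

A user deciding whether to trust the final value can now see both where it came from and every computation it passed through.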
An important role here is that of the data steward, who is responsible for collecting and managing not only the data elements used but also their associated metadata, which should include data on provenance. Master Data Management (MDM) as well as data governance are important to ensure that the metadata on provenance is not only collected in the right places and at the right time, but also managed in a manner that ensures the accuracy and usability of that data. In general, the individuals who are responsible for collecting a specific element of big data also must be responsible for documenting the provenance of that data element.
3. Accuracy.
Problems with data accuracy arise from data entry errors, data integration errors, system errors, and even inaccurate reporting by the data source. To determine the accuracy of a data value, companies should compare that value to a baseline or to a known correct value. But, that process often is difficult because the baseline value is often unknown or indeterminable at the time of measurement. With big data, determining accuracy is even more difficult. For instance, if someone tweets that he just purchased a Spirit bicycle for $600, several elements need validation: did he actually just purchase a bike, was it a Spirit, and was $600 the price paid?
The accuracy of transactional data can be estimated by using historical data and statistical methods. With big data, historical data are often unavailable, although social media data and technologies can be used to obtain baseline estimates. One way is through a variation of crowdsourcing—a way to outsource a task to a large, undefined group of people. Large organizations have used this model to obtain estimates of data that are otherwise difficult to acquire. The approach, tested and validated at Google,2 can even be used to obtain baseline data from experts. This solution is inexpensive, but it is important to offer people some incentive in order to obtain genuine responses.
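Aggregating independent crowd estimates is straightforward; the subtlety is that a robust statistic such as the median resists the wild guesses that inevitably appear. A minimal sketch, with made-up estimate values:

```python
from statistics import mean, median

# Hypothetical independent crowd estimates of a quantity
# (say, weekly units sold), including one wild outlier.
estimates = [102, 98, 95, 110, 101, 99, 400]

print(f"mean:   {mean(estimates):.1f}")   # pulled up by the outlier
print(f"median: {median(estimates):.1f}") # a more robust baseline
```

With the outlier present, the mean lands above 140 while the median stays near the consensus value of about 100 — which is why medians (or trimmed means) are the usual choice for crowd-derived baselines.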
Accuracy also may be verified by cross-checking the same data from different sources. The problem, though, arises when every source offers a different value. Here, users who are familiar with the data may be best able to identify values that don’t make sense. Further, data stewardship to manage the accuracy of data items at their source has been a successful approach for transactional data, and it should work for big data as well. As mentioned earlier, knowing the provenance of the data can go a long way toward understanding its relative accuracy.
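The cross-checking step itself can be automated with a simple majority rule, escalating to a human only when no majority exists. A minimal sketch, with hypothetical source names:

```python
from collections import Counter

def cross_check(values_by_source):
    """Return the majority value across sources, or None when no value
    clears 50% agreement (a case to flag for a data steward)."""
    counts = Counter(values_by_source.values())
    value, n = counts.most_common(1)[0]
    return value if n > len(values_by_source) / 2 else None

sources = {"crm": "Spirit", "pos": "Spirit", "web_scrape": "Sprint"}
print(cross_check(sources))  # Spirit — two of three sources agree
```

This keeps human intervention where it adds the most value: the genuinely ambiguous cases.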
4. Completeness.
The impact of missing transactional data can be very costly. Imagine, for example, a bank losing the record of a big withdrawal, or a retail company misplacing a large order. The impact of incomplete big data also can have serious consequences. Missing sensor data could, for instance, lead to a company’s failure to service an important piece of equipment, such as a jet engine, resulting in an unexpected outage that compromises safety.
The examples above describe data that are missing due to errors in collection, storage, or processing, but completeness also can be a potential issue in systems design. A company might, for instance, miss a negative customer tweet because the person misspelled the firm’s name. Effective systems design would have ensured that any close spellings would be captured. More often, companies miss customer tweets because they do not even have a process in place to capture them. Here, again, the lack of completeness is due to poor systems design. Design issues are common with big data because such data are often collected without a clear understanding of their potential use.
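Catching close spellings is a standard fuzzy-matching task. The sketch below uses Python’s standard-library similarity scorer; the brand name, tweet text, and 0.8 threshold are all illustrative assumptions, and production systems would use more sophisticated matching.

```python
from difflib import SequenceMatcher

def is_close_mention(text, brand, threshold=0.8):
    """Flag text whose tokens nearly match the brand name, catching
    common misspellings that an exact keyword search would miss."""
    brand = brand.lower()
    return any(
        SequenceMatcher(None, token, brand).ratio() >= threshold
        for token in text.lower().split()
    )

# "Acmee" is a misspelling of the (hypothetical) brand "Acme".
print(is_close_mention("terrible service from Acmee today", "Acme"))
# True
```

Building this kind of tolerance in up front is exactly the systems-design discipline the paragraph above calls for.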
Unfortunately, two attributes of data completeness—provenance and granularity—are often overlooked in the design phase. To the extent possible, the provenance of any data item should be collected and stored along with the data item itself. Otherwise, that data item might lack the necessary credibility. Data granularity can be a big problem when the granularity needed during analysis exceeds that of the data collected. Granularity problems arise most often with time and geography. Data items collected hourly might be useless if they need to be analyzed every millisecond. Similarly, data items recorded by state will be of limited value if they need to be analyzed by city or household.
The fundamental problem is that big datasets often are collected before their need or usefulness beyond a particular context is known. Thus the completeness of such data needs to be reassessed periodically, with management recognizing that big data design should be a planned and supported process responsive to evolving company needs.
Of course, the above four quality dimensions—consistency, provenance and believability, accuracy, and completeness—also are important for transactional data, but some of the associated challenges are more pronounced for big data. Consistency in big data is even more critical because of the multiple sources from which that data are collected and because those sources often are external to the organization. The same is true for believability. Accuracy always has been an issue with transactional data and remains an important data quality concern with big data as well. Completeness can be a serious issue with transactional data as there is typically only one source for a data item. Because there are often multiple big data sources for the same data item, the temptation is to believe that completeness is not as great a challenge. But, given the importance of provenance and granularity for completeness in many applications, this quality dimension must still be addressed in master data management and governance efforts.
Although this article addresses four dimensions of data quality that are important in the context of big data, other dimensions also might be relevant. For instance, timeliness might be considered important in the context of many applications. For the retail store chain mentioned earlier, timely video data of shoppers would enable the company to offer customers specific discounts in real time as they stop to examine certain products.
Some managers might believe that the huge amount of good data in a big dataset will overwhelm any errors that might exist. This is just wishful thinking. As the amount of data increases, so does the number of errors. And these errors don’t necessarily cancel each other out; more often, they are systematic, resulting in an inaccurate view of whatever is being analyzed. Managers also might think that because big data are typically used only for making broad decisions, averages and trends matter more than details, and that outlier errors won’t affect averages by much. But systematic errors easily can bias any analysis, resulting in bad long-range planning decisions that can have far-reaching effects.
To be sure, improving the quality of big data requires significant investments in both data management and governance. From a business perspective, will those investments pay off in terms of the value that big data can generate? To answer that question, managers should consider big data efforts much like social media initiatives—specifically, the best strategy might be to start small but to start now. One possible approach is to identify what specific pieces of big data are important from a utility point of view, and then make the investments to improve the quality of just those pieces. Then, as those efforts pay off, managers can consider making additional investments to incorporate subsequent pieces of big data. The lesson here is that big data are like any other kind of data—useful only when that data meet or exceed an acceptable level of quality.
1. T. Dull, “Big Data and the Internet of Things: Two Sides of the Same Coin?”
2. B. Cowgill, J. Wolfers, and E. Zitzewitz, “Using Prediction Markets to Track Information Flows: Evidence From Google.”