Institutions facing data quality issues often respond like teenagers first experiencing acne. After the swell of immediate anxiety subsides, they reach for a simple fix without attempting to address contributing factors or understand underlying causes. There’s only so much a tube of pimple cream can do to counteract infrequent face washing and a steady diet of pizza, chocolate, and sugar.
Many higher education institutions have expressed concerns about data quality, and to some degree their anxiety makes sense. Poor data quality can slow decision-making, erode stakeholder trust in data-based institutional initiatives, and impose a significant cost on the institution.
In response, many immediately look to implement institutional data management processes or technology, such as data profiling or data quality tools. Much like that tube of pimple cream, these are surface solutions that are limited by the quality of the inputs they are fed. Before jumping to these tools, institutions should first define standards for data quality so they can better understand which tools would best address their specific issues.
Defining a Data Quality Diet
While “data quality” may seem like an intuitive concept, its definition varies among stakeholders. For example, two offices may agree that data has to be accurate but disagree about how often it must be updated to meet that standard.
To find consensus, institutional leaders should establish common data quality dimensions and data quality thresholds. A “data quality dimension” is a way to classify institutional information and data-quality needs. Although there is a great deal of discussion about precisely which dimensions should be used to judge data quality, agreeing on a set allows all stakeholders to view data quality through the same lens. Examples of these dimensions might include:
- Completeness: All required data is provided
- Uniqueness: Each entity (e.g., student) is uniquely represented in the database
- Timeliness: Data represents the required period in time
- Validity: Data matches established rules (e.g., format)
- Accuracy: Data accurately represents reality
- Consistency: Different data instances do not provide conflicting information about the same object (e.g., a student should not be recorded as a college freshman in one field and as five years old in another)
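To make the dimensions above more concrete, here is a minimal Python sketch of how a few of them (completeness, uniqueness, validity, and consistency) might be expressed as checks on student records. Everything in it is an assumption for illustration: the field names, the sample records, the email format rule, the reference date, and the minimum plausible age are hypothetical, not an established standard. Timeliness and accuracy are left out because they depend on an outside reference, such as a reporting calendar or a source of truth.

```python
import re
from datetime import date

# Hypothetical student records; field names and values are illustrative only.
records = [
    {"id": "S001", "name": "Ana Ruiz", "dob": date(2005, 3, 14),
     "email": "ana@example.edu", "class_level": "freshman"},
    {"id": "S002", "name": "Ben Cho", "dob": None,
     "email": "ben@example.edu", "class_level": "sophomore"},
    {"id": "S002", "name": "Ben Cho", "dob": date(2004, 7, 2),
     "email": "ben-at-example", "class_level": "sophomore"},
    {"id": "S003", "name": "Cy Park", "dob": date(2019, 1, 9),
     "email": "cy@example.edu", "class_level": "freshman"},
]

REQUIRED = ("id", "name", "dob", "email", "class_level")
EMAIL_RULE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # assumed format rule
AS_OF = date(2024, 9, 1)  # assumed reference date for computing age
MIN_AGE = 14              # assumed minimum plausible age for a college student


def age_on(dob, as_of):
    """Whole years between dob and as_of."""
    before_birthday = (as_of.month, as_of.day) < (dob.month, dob.day)
    return as_of.year - dob.year - before_birthday


def profile(rows):
    """Return the row indexes that fail each dimension's check."""
    issues = {"completeness": [], "uniqueness": [], "validity": [], "consistency": []}
    seen_ids = set()
    for i, row in enumerate(rows):
        # Completeness: every required field is present and non-empty.
        if any(row.get(field) in (None, "") for field in REQUIRED):
            issues["completeness"].append(i)
        # Uniqueness: each student id appears only once.
        if row["id"] in seen_ids:
            issues["uniqueness"].append(i)
        seen_ids.add(row["id"])
        # Validity: the email matches the assumed format rule.
        if row.get("email") and not EMAIL_RULE.match(row["email"]):
            issues["validity"].append(i)
        # Consistency: date of birth does not conflict with college enrollment.
        if row.get("dob") and age_on(row["dob"], AS_OF) < MIN_AGE:
            issues["consistency"].append(i)
    return issues


issues = profile(records)
print(issues)
```

Each list in the result holds the positions of the records that fail that dimension, which gives stakeholders a shared, inspectable definition to argue about rather than an intuition.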
For each dimension, institutional leaders should then establish the threshold by which to judge data quality. For example, an institution might define a student record as complete, and therefore high quality on that dimension, only if it includes the student’s date of birth, name, address, enrollment status, and financial aid status. Outlining what information all stakeholders need enables institutions to set clear, consistent standards for data quality.
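As a rough sketch of what such a threshold could look like in practice, the Python below measures the share of student records that contain all five fields named in the example and compares it to a target. The 95 percent target, the field names, and the sample records are hypothetical; an institution would substitute whatever its stakeholders agree on.

```python
# Fields taken from the example above; the snake_case names are assumed.
REQUIRED_FIELDS = (
    "date_of_birth",
    "name",
    "address",
    "enrollment_status",
    "financial_aid_status",
)
COMPLETENESS_THRESHOLD = 0.95  # hypothetical target agreed on by stakeholders


def is_complete(record, required=REQUIRED_FIELDS):
    """A record is complete when every required field is present and non-empty."""
    return all(record.get(field) not in (None, "") for field in required)


def completeness_rate(records, required=REQUIRED_FIELDS):
    """Fraction of records that contain every required field."""
    if not records:
        return 0.0
    return sum(is_complete(r, required) for r in records) / len(records)


def meets_threshold(records, threshold=COMPLETENESS_THRESHOLD):
    """True when the completeness rate reaches the agreed threshold."""
    return completeness_rate(records) >= threshold


# Illustrative data only: three complete records and one missing an address.
full = {
    "date_of_birth": "2005-03-14",
    "name": "Ana Ruiz",
    "address": "12 Elm St",
    "enrollment_status": "full-time",
    "financial_aid_status": "awarded",
}
sample = [full, dict(full, name="Ben Cho"), dict(full, name="Cy Park"),
          dict(full, name="Di Lee", address="")]

rate = completeness_rate(sample)
print(f"{rate:.0%} complete; meets threshold: {meets_threshold(sample)}")
```

Writing the threshold down as a number makes disagreements visible early: a registrar and a financial aid office can debate 95 percent versus 99 percent before any tool is purchased.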
Gaining Clarity about the Right Tools
Having established data quality guidelines, institutions can then use data profiling tools to review existing data against these dimensions and thresholds. There are a number of data profiling tools currently on the market, but choosing the best fit will depend on a number of factors. Institutional leaders should carefully consider the following before beginning a data quality assessment:
- Cost: Tools like Informatica Data Quality and Oracle Enterprise Data Quality are powerful and offer a great deal of functionality, but they may be cost-prohibitive for some institutions, which might consider lower-cost alternatives. We have found open-source or free tools, such as Talend Open Studio for Data Quality and Datamartist, to be very compelling.
- Functionality: Nearly all data profiling tools enable users to check data for dimensions like completeness and uniqueness, but checking data for consistency and validity is difficult for some lower-end tools, which may not be able to check data across tables or against rules. Institutions should carefully review potential data profiling tools for these capabilities before acquiring one.
- Integration: It is important to ensure that data quality tools can integrate with different databases. Some, such as Informatica and Talend, have a number of popular integrations embedded in their systems (e.g., Salesforce). Others require integration through an Open Database Connectivity (ODBC) or Java Database Connectivity (JDBC) connection. Institutions should review both the intended data profiling tool and database to ensure that the required connection can be established.
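To illustrate what checking data "across tables or against rules" involves, here is a small Python sketch. It uses the standard library's in-memory SQLite database purely as a stand-in for an institutional database that a profiling tool would normally reach through an ODBC or JDBC connection. The table names, columns, sample rows, and the rule that aid awards should belong to currently enrolled students are all hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for a real institutional database.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE students (student_id TEXT, enrollment_status TEXT);
    CREATE TABLE aid_awards (student_id TEXT, term TEXT);
    INSERT INTO students VALUES ('S001', 'enrolled'), ('S002', 'withdrawn');
    INSERT INTO aid_awards VALUES
        ('S001', '2024FA'), ('S002', '2024FA'), ('S009', '2024FA');
    """
)

# Cross-table check: aid awards that point to no known student.
orphans = [
    row[0]
    for row in conn.execute(
        """
        SELECT a.student_id
        FROM aid_awards AS a
        LEFT JOIN students AS s ON s.student_id = a.student_id
        WHERE s.student_id IS NULL
        ORDER BY a.student_id
        """
    )
]

# Rule check (assumed rule): aid awards should belong to enrolled students.
conflicts = [
    row[0]
    for row in conn.execute(
        """
        SELECT a.student_id
        FROM aid_awards AS a
        JOIN students AS s ON s.student_id = a.student_id
        WHERE s.enrollment_status <> 'enrolled'
        ORDER BY a.student_id
        """
    )
]
conn.close()

print("Awards with no matching student:", orphans)
print("Awards for students not enrolled:", conflicts)
```

A tool that can only profile one table at a time would report both tables as clean here, which is why cross-table and rule-based checks are worth confirming before acquiring one.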
These steps will enable higher education leaders to work across their institutions to correct identified data quality issues, such as improving documentation (data dictionaries), working with data stewards to complete missing data fields, or revising data entry policies for different systems. More importantly, these steps will empower stakeholders to take responsibility for the data they handle, leading to an overall improvement in data quality.