The plethora of headlines on machine learning and artificial intelligence highlights not just increased interest but potentially an abrupt shift in business focus toward these new and shiny analytical approaches. While that shift offers the possibility of groundbreaking changes, the speed with which companies are hurtling headlong down this path makes me worry that some fundamental, and definitely less “sexy,” data issues may be overlooked. We know that the world is awash in data. In many industries, digitization means that more and more information is created and stored digitally. Many institutions handle the deluge well; others are struggling to stay afloat. Regardless of where an institution currently stands, I see several data management challenges.
Data experience will be required in nearly all roles.
As digitization leads to more and more data and information being “born digital,” nearly every role in an organization will have a data component. Not everyone is being trained for this eventuality. It is true that young professionals are more comfortable with technology than many of their seasoned counterparts, but data are not a technology issue, and most people have not received any formal training in data management. This knowledge gap holds across industries. Professionals in fields from medicine to human resources are highly skilled individuals who probably do not have the faintest idea what “data management” is or why they need to care. Yet how they classify a patient or employee record, where they file a chart or employment file, and how they describe a particular encounter or performance review all have data management implications. Are the metadata tags available for classification sufficient to capture the issue at hand accurately? Is the organization of the storage for their records transparent enough to allow for future discoverability? Are the terms used to describe a diagnosis or employee action adequate to allow interoperability between different systems and business processes, or sharing with other relevant professionals? All of these are data management activities that are often picked up informally on the job, without strategic consideration of best practices or consistency across the organization. Every institution has silos where local problems are optimized, but few people think critically about optimizing data management globally. If there is a strategic direction for data, it is often created by information technology because many institutions classify data as an IT issue, which creates a disconnect with the business side.
I’ve seen the recognition of this gap at institutions with which I have worked. There has been a sharp increase in demand from staff for data and statistical knowledge, through formal training programs or online classes. “Data literacy” is the hot topic everywhere from corporate boardrooms to consulting research. Even though many of a firm’s employees may work with data regularly, most of that work involves generating or consuming preformatted reports. There doesn’t seem to be a good understanding of what is involved in putting such reports together, or much critical evaluation of what they say or don’t say and show or don’t show. Probably even more importantly, most staff are not capable of, or comfortable with, communicating about data topics with others, even if they have some expertise. This is a critical skill that needs to be emphasized, and I am annoyingly pedantic about this point in the data science classes I teach.
In addition, I’ve seen a need for people to understand not just data but metadata, which I often describe as the unsung hero of data management. People across an institution need to understand the basic data management principles that are necessary for anyone who wants to find, use, or store information. Application developers may not realize that they need metadata, but every time they create a menu or a set of radio buttons in an application, they are using metadata. Administrative assistants or business analysts responsible for creating or maintaining a document sharing site probably don’t realize that they are applying basic data management principles when they decide how to organize things and who can have what kind of access. It will become even more critical to provide this skill to a broad spectrum of employees in order to ensure a basic level of understanding on these topics.
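To make the menu example concrete, here is a minimal sketch of how the choices behind a dropdown amount to a small controlled vocabulary, which is metadata. Everything in it is an assumption for illustration: the `DOCUMENT_TYPES` codes and labels, the `Record` class, and the helper names are hypothetical and do not come from any particular system.

```python
from dataclasses import dataclass

# Hypothetical controlled vocabulary: the allowed values behind a
# "document type" dropdown. Codes and labels are made up for illustration.
DOCUMENT_TYPES = {
    "PERF_REVIEW": "Performance review",
    "OFFER_LETTER": "Offer letter",
    "EXIT_INTERVIEW": "Exit interview",
}


@dataclass
class Record:
    title: str
    document_type: str  # expected to be one of the keys in DOCUMENT_TYPES


def menu_options(vocabulary):
    """Return (code, label) pairs sorted by label, ready to show in a menu."""
    return sorted(vocabulary.items(), key=lambda item: item[1])


def is_valid(record, vocabulary):
    """Check that a record is classified with a term from the vocabulary."""
    return record.document_type in vocabulary


options = menu_options(DOCUMENT_TYPES)
good = Record(title="2023 annual review", document_type="PERF_REVIEW")
bad = Record(title="Misc notes", document_type="OTHER_STUFF")
```

The same dictionary drives both the menu a user sees and the check applied to stored records, which is why a decision that looks like interface design is also a data management decision.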
Planning will need to go beyond managing internal data.
Now that there has been a groundswell in the capture or creation of business data, many institutions are working to enable staff to use those data. This phase leads to the development of “data lakes” or other platforms where transactional or administrative data from a variety of systems are integrated into a single analytical repository. In many industries, this progression is well underway, and employees with data and analytics skills are in high demand to get business value from the data. What will come next is the realization that, while the data companies already have will yield some business value, the questions that can be answered with strictly internal information are limited. External data will be needed to answer the next order of questions, the ones that really matter. It’s not clear that many companies have a good understanding of how to identify, evaluate, or acquire external data. Most of the work I have seen presented focuses on the creation of the data lake and the integration of existing data sources. Granted, some of those existing data may have been acquired externally at some point, but such cases are likely to be few and far between. The rate at which new data products are created, and the high price tags now attached to them, can be daunting. Are companies budgeting for data purchases? Can their data governance programs accommodate externally provided data, for which data ownership and data quality considerations differ drastically from those of their enterprise data? And finally, are the data architecture and platforms configured to bring in external data as part of a production process?
Again, I have seen this firsthand in several institutions. I worked with an institution that piloted an analytics project in which human resources professionals and data scientists used new techniques to predict employee turnover from internal data. It was deemed very successful, not necessarily because it led to accurate predictions but because it highlighted the shortcoming of using only internal data to predict an activity that is affected by external forces. The project showed senior leaders that turnover was more closely related to local employment options than to the personal characteristics of staff. Luckily, the participants understood how to go about finding external data because of the availability of data librarians and other information professionals who are trained to find, evaluate, and acquire external data. I have spent most of my career in a research function, which always requires external data sources to answer questions about the outside world, so this notion is second nature to me. But that is not the case for many departments in a large organization. As more projects and processes require external data, other business lines need a better understanding of the benefits of hiring and working with data professionals such as data librarians.
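A minimal sketch of what bringing in external data can look like in practice follows. It is not the project described above: the employee rows, the regional job-openings figures, and the function names are all hypothetical, made up only to show the mechanics of attaching an external indicator to internal records and checking how many records it covers.

```python
# Hypothetical internal HR records (invented for illustration).
employees = [
    {"employee_id": 1, "region": "north", "tenure_years": 2},
    {"employee_id": 2, "region": "south", "tenure_years": 7},
    {"employee_id": 3, "region": "west", "tenure_years": 4},
]

# Hypothetical external data: local job openings per 1,000 workers by region.
# Note that the external source has no figure for "west".
local_job_openings = {"north": 42.0, "south": 11.5}


def enrich(rows, external, field):
    """Return copies of rows with an external value attached by region.

    Rows whose region is missing from the external source get None,
    so gaps stay visible instead of being silently dropped.
    """
    enriched = []
    for row in rows:
        combined = dict(row)  # copy so the internal data are not modified
        combined[field] = external.get(row["region"])
        enriched.append(combined)
    return enriched


def coverage(rows, field):
    """Share of rows for which the external value was found."""
    if not rows:
        return 0.0
    matched = sum(1 for row in rows if row[field] is not None)
    return matched / len(rows)


combined_rows = enrich(employees, local_job_openings, "local_job_openings")
share_covered = coverage(combined_rows, "local_job_openings")
```

Reporting coverage alongside the join is a deliberate choice in this sketch: externally sourced data rarely line up perfectly with internal records, and the gaps are exactly the kind of data quality question a governance program needs to see.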
Conclusion
I know that the data world is changing faster than ever before as new sources and uses for data are posited every day. I want to make sure that, in the enthusiasm for new and exciting analytical approaches, the basic issues around the data they use are not overlooked. There are lessons to be learned from a variety of industries as digitization and the resulting data become not just the norm but an opportunity for growth. I certainly would not want to impede that development, but I will continue to focus on making sure that people don’t forget some of the first-order data considerations that can be overlooked as the pace of change in data accelerates. Machine learning and artificial intelligence might be the new thing to change the world, but business will always need people who are happy to pay attention to the underlying data issues so that the new kid on the block can deliver.