In my experience with large enterprises, a persistent major limiting factor to using AI/ML technologies is data.

The root cause of why data is usually a problem is that enterprises don’t treat their data as an asset - to them, it is merely a necessary pre-requisite for operational excellence. Often, this goes as far as enterprises treating data as a liability. As an example, many financial services regulations require banks to hold onto data for a certain number of years with full audit trails for decision making - and this is often a headache for CIOs, since it doesn’t actively add value to the business.

It is easy to understand the perspective of the CIO - the short-term ROI makes her conclusion obvious. Time for an anecdote - during my graduate studies, I realized that the easiest way of publishing great research was to take solutions from one area and marry them with problems in other areas. For example, applying a new type of machine learning method to cancer research. This is the hidden potential of data - the prize, if you will - that data collected in one part of the organization happens to be useful to another part of the business.

To effect this change, we first need to start thinking of data as an asset and invest in it even if the immediate reward is unclear.

This post has some thoughts about the problems with data in large enterprises and then a potential solution that I’ve noticed some large tech companies employ to great effect.

What’s wrong with my data?

  1. Referential integrity - the vast majority of discussion around data quality (and perhaps materiality) revolves around a single problem: referential integrity - whether we can trust an entity as represented in one system to be the same entity in another system.
  2. Soft-silos - Enterprises buy lots of enterprise software from large vendors such as Oracle, SAP, IBM, Microsoft, etc. These large vendors obviously build their systems to work in perfect harmony within their walled garden - so Microsoft’s CRM product works best with Microsoft’s SQL Server. This makes their product hard to rip out and makes it easy for the sales rep to sell add-on products. Data has gravity - it attracts other data and applications. This is typically a problem for the customer, since they use multiple vendors and it takes significant gymnastics to get data from one vendor’s pretty garden to another’s - there are logistical issues, duplication, etc.
  3. People - I often quip that the big problem with big data is big people. Typically, career growth in large enterprises relies on a level of diplomacy. One bargaining chip in many such maneuvers is data.
  4. SPOF - finally, since data is an operational side-effect, it grows organically, i.e. it is usually not planned. Organizationally, this means that it requires some last-minute ninja work from some organizational hero. Heroes are great, but they are a single point of failure - what happens when the hero leaves? What happens if she doesn’t have the bandwidth to respond?
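
To make the first problem concrete, here is a minimal sketch of a referential integrity check between two systems. Everything in it is an assumption for illustration: the two extracts, the `customer_id` and `name` fields, and the function name are hypothetical, and real systems rarely share a clean key like this.

```python
# Hypothetical extracts from two systems; all field names are assumptions.
crm_customers = [
    {"customer_id": "C001", "name": "Acme Corp"},
    {"customer_id": "C002", "name": "Globex"},
    {"customer_id": "C003", "name": "Initech"},
]
billing_accounts = [
    {"account_id": "A10", "customer_id": "C001", "name": "Acme Corp"},
    {"account_id": "A11", "customer_id": "C002", "name": "Globex Inc."},
    {"account_id": "A12", "customer_id": "C009", "name": "Umbrella"},
]


def check_referential_integrity(parents, children, key, compare_field):
    """Return (orphans, mismatches) for child rows checked against parent rows.

    orphans: child rows whose key does not exist in the parent system.
    mismatches: (parent, child) pairs that share a key but disagree on compare_field.
    """
    parent_by_key = {row[key]: row for row in parents}
    orphans, mismatches = [], []
    for child in children:
        parent = parent_by_key.get(child[key])
        if parent is None:
            orphans.append(child)
        elif parent[compare_field] != child[compare_field]:
            mismatches.append((parent, child))
    return orphans, mismatches


orphans, mismatches = check_referential_integrity(
    crm_customers, billing_accounts, key="customer_id", compare_field="name"
)
print("Orphaned billing accounts:", [row["account_id"] for row in orphans])
print("Name mismatches:", [(p["name"], c["name"]) for p, c in mismatches])
```

In this toy data, one billing account points at a customer the CRM has never heard of, and another shares a key but not a name. Both are the kind of finding that erodes trust in an entity across systems.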

Alright, so we understand the problem and it is worth fixing!

So, hire a CDO?

One particular solution that lots of enterprises have tried out has been to hire a Chief Data Officer. Kudos for recognizing the issue and trying to fix it, but I have yet to see this work - because:

  1. The expectations are unicorn-like - the CEO and the board feel happy that they’ve hired a unicorn and everything’s going to be fixed.
  2. In many cases, CDOs don’t have a team, so it ends up being a one-man army who is responsible for data-related issues but does not have the capacity to execute.

A solution

My proposal for fixing this issue is to hire product managers for your data and sprinkle them throughout the organization. I recognize that this is an upfront expense to set up, with expected payoff in the 12-18 month timeframe. A few notes:

  1. The core idea is for these product managers to:
    1. Be thoughtful about the data assets for the various lines of business.
    2. Document the data and all its fields in detail.
    3. Ensure that there are common tools that help move data to and from various vendors’ walled gardens.
    4. Advertise their data assets to other parts of the organization.
  2. Have the function be matrixed into business as well as IT.
  3. My hunch is that many DBAs will likely be ideal for this role.
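
As a rough illustration of what "document the data and all its fields" could look like in practice, here is a minimal sketch of a catalog entry. The `DatasetDoc` and `FieldDoc` classes, their attributes, and the example dataset are all hypothetical; they are not taken from any particular catalog product.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FieldDoc:
    name: str
    description: str = ""  # an empty description means "not yet documented"


@dataclass
class DatasetDoc:
    name: str
    owner: str          # business unit accountable for the data (assumed attribute)
    source_system: str  # system that produces the data (assumed attribute)
    fields: List[FieldDoc] = field(default_factory=list)

    def undocumented_fields(self) -> List[str]:
        """Names of fields that still lack a description."""
        return [f.name for f in self.fields if not f.description.strip()]

    def coverage(self) -> float:
        """Share of fields with a description, between 0.0 and 1.0."""
        if not self.fields:
            return 0.0
        documented = len(self.fields) - len(self.undocumented_fields())
        return documented / len(self.fields)


# Hypothetical example entry.
payments = DatasetDoc(
    name="payments",
    owner="retail banking",
    source_system="core ledger",
    fields=[
        FieldDoc("payment_id", "Unique identifier assigned by the ledger"),
        FieldDoc("customer_id", "Customer key as held in the CRM"),
        FieldDoc("amount", "Payment amount in the account currency"),
        FieldDoc("legacy_flag"),
    ],
)
print(payments.undocumented_fields(), payments.coverage())
```

Even something this small gives a data product manager a to-do list (the undocumented fields) and something to advertise to other teams (the documented ones).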

Metrics for Data Product Managers

  1. Speeds and Feeds - The basics of the relationship between the business and its representation in data - simple dashboards that represent:
    1. Where is the data?
    2. How much?
    3. How does it grow?
    4. Who accesses it? How frequently?
  2. Documentation - ensure that 100% of the data is documented in detail, including processes that produce and consume it. This is the key to cross-fertilization.
  3. Simplification - To me, this means reducing redundancy and maintaining the sanctity of source/golden systems. In other words, think of each system as either a source of truth or scratch space. Be religious about deleting data from scratch space. This alone can resolve 30% or so of compliance issues.
  4. Cross-fertilization - I strongly think that data needs to be treated as a horizontal asset. This makes an easy metric - how often does a business unit successfully use data from another? As an example, I have seen this done very well in the financial crime space within banks.
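
Two of the metrics above, "who accesses it, how frequently" and cross-fertilization, can be sketched from an access log. The log format below (dataset, owning business unit, consuming business unit) is an assumption made for illustration, as are the function names; a real log would need mapping from users or services to business units.

```python
from collections import Counter

# Hypothetical access log: (dataset, owning business unit, consuming business unit).
access_log = [
    ("payments", "retail", "retail"),
    ("payments", "retail", "fincrime"),
    ("kyc_profiles", "compliance", "fincrime"),
    ("kyc_profiles", "compliance", "compliance"),
    ("trades", "markets", "markets"),
]


def access_counts(log):
    """Number of accesses per dataset."""
    return Counter(dataset for dataset, _owner, _consumer in log)


def cross_unit_share(log):
    """Fraction of accesses where the consumer is not the owning business unit."""
    if not log:
        return 0.0
    cross = sum(1 for _dataset, owner, consumer in log if owner != consumer)
    return cross / len(log)


print(access_counts(access_log))
print(cross_unit_share(access_log))
```

Tracked over time, a rising cross-unit share would be one rough signal that data is being treated as a horizontal asset. It only counts accesses, so it says nothing about whether the use was successful.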