Why self-service data access is crucial for great data engineering and data science collaboration

This is a re-post of my original article over on LinkedIn.

When creating an organizational structure, one might be confronted with the choice of cross-functional teams versus functional teams, or, put differently: do I put data engineers and data scientists in the same group or not? This article does not favor one way over the other. Both choices have pros and cons, and I'll highlight only those that are relevant to this article. My perspective is heavily influenced by my role as a data engineering leader, and these opinions are my own. I am not writing on behalf of my employer.

Let me start with a general statement: Any organizational structure you create is a decision on what is managed by a process vs what is managed by the structure itself. Everything you decide against implementing in the design of your structure needs a process.

Cross-Functional

A cross-functional team is a great way to create accountability for the outcome. You give the team autonomy and agree on an outcome. The interaction between engineers and scientists is fluid, and the result is presented as one shared outcome. If the quality of the outcome is not as intended, the team can work it out together and come up with a new iteration until the desired outcome is reached. From a director's or team lead's point of view, this is desirable because, to be effective, the team must collaborate and improve their cultural relationship.

On the other hand, I also see drawbacks. People development gets harder because an engineering lead rarely combines people skills, data engineering excellence, and data science excellence. I've come across many people who want to report to a person of their craft and profit from their experience. The only mitigation I know of would be to have a great senior person on the team who leads the people development instead of the team lead. I cannot judge whether that would work out well, as I've never seen this model implemented. There is also an efficiency loss for the data science output because not every task requires the data scientists' craft. I won't drill deeper into the pros and cons of this structure because it is affected less, or not at all, by the problem I want to point out.

Data Science and Engineering in Separate Teams

I've often seen separate data engineering and data science organizations. Data engineers report to data engineering leaders, and data scientists report to data science leaders. That's also how we currently do it at Veeva Link. This approach mitigates the people development problem, at least from a structural perspective (you still need to care about good leaders). It also helps to improve the efficiency of the data science team(s) because you can staff the units in a different ratio (e.g. 3 data engineering teams to 1 data science team).

Decoupling data science from engineering also removes the need for collaboration imposed by the cross-functional structure. Engineering and data science each work on their own agenda at their own speed. When the collaboration is not facilitated through the structure, it needs a process to happen. Most of the models I encountered include a process that makes data scientists request data from a data engineering team in one way or another. It is either a separate database filled by some batch job that requires a code change, or just flat file exports out of some system that are loaded into another environment where scientists can run experiments. In any case, those processes mostly need human interaction on the engineering side to be executed.

Why do I think this is a problem?

It's related to the way both teams work internally. Engineers work on (comparatively) longer tasks: hours, even days, of focus time on a single task that has been analyzed beforehand. The expected result is pretty clear, and every interruption makes it harder to achieve the result in the committed timeframe. In contrast, data scientists work in an explorative mode (not talking about MLOps right now). Experiments and their outcomes determine the following steps, including looking at a new set of potential features or a different dataset. Scientists cannot analyze the task beforehand because that analysis is a significant portion of their work. Discovering twists and turns is the default, not the exception.

Consequently, to produce outcomes, data scientists are forced to rely on a process that contradicts the optimal working conditions of engineers: every data request is an interruption of the engineers' focus time. This creates conflicts on a cultural level and negatively affects both teams' outcomes.

Why does self-service data access help?

Enabling self-service access to data through accessible interfaces mitigates most of the disturbances and waiting times for the respective teams. In addition, a culture of mutual understanding of the different ways of working can yield collaboration results similar to those within cross-functional teams. The access helps the data science teams to operate at their own pace, getting new data whenever needed, and it creates autonomy for both teams, as the interface between them is now a technical one. Removing the need for human interaction also removes the distraction from the data engineering teams.

As a complementary step to the introduction of self-service data access, you can add a great metadata or data catalog tool to further reduce the need for wasteful interactions.
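To make the idea of a technical interface between the teams a bit more concrete, here is a minimal sketch in Python. It is not a description of any real system: the names `DatasetEntry`, `SelfServiceData`, `publish`, `search`, `fetch`, and the `visits` table are all made up for illustration, and an in-memory SQLite database stands in for whatever storage your engineering team actually operates. The point is only that data engineers publish a dataset once, and data scientists can then discover and read it without filing a request.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    """Catalog metadata for one dataset (all fields are illustrative)."""
    name: str
    description: str
    owner_team: str


class SelfServiceData:
    """Hypothetical read-only interface published by data engineering."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn
        self._catalog = {}  # dataset name -> DatasetEntry

    def publish(self, entry: DatasetEntry) -> None:
        # Done once by data engineers when a dataset is ready for consumers.
        self._catalog[entry.name] = entry

    def search(self, keyword: str) -> list:
        # Data scientists look up datasets themselves instead of asking around.
        kw = keyword.lower()
        return [
            entry for entry in self._catalog.values()
            if kw in entry.name.lower() or kw in entry.description.lower()
        ]

    def fetch(self, name: str, limit: int = 100) -> list:
        # Only published datasets are readable; unknown names are rejected.
        if name not in self._catalog:
            raise KeyError(f"dataset {name!r} is not published")
        # The name was checked against the catalog above, so it is safe to
        # use as a table identifier; the limit is passed as a bound parameter.
        query = f'SELECT * FROM "{name}" ORDER BY rowid LIMIT ?'
        return self._conn.execute(query, (limit,)).fetchall()


def build_demo() -> SelfServiceData:
    """Set up a toy dataset so the interface can be tried end to end."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (user_id INTEGER, page TEXT)")
    conn.executemany(
        "INSERT INTO visits VALUES (?, ?)",
        [(1, "home"), (2, "pricing"), (1, "docs")],
    )
    service = SelfServiceData(conn)
    service.publish(
        DatasetEntry("visits", "Page visits per user", "data-engineering")
    )
    return service


if __name__ == "__main__":
    service = build_demo()
    for entry in service.search("visits"):
        print(entry.name, "owned by", entry.owner_team)
    print(service.fetch("visits", limit=2))
```

In this sketch, the only moment an engineer is involved is the `publish` call. Every later `search` or `fetch` happens at the data scientist's own pace, which is the property the argument above relies on.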

Simply said: APIs scale better than people.
