Written by Alex Farquhar and Michael Jones
At loveholidays we are embracing data mesh as a way to improve the organisation and accessibility of our data. In doing so, we have quickly experienced the benefits of easy access to clearly defined, well-structured data (known as Data Products) in helping drive business improvements.
The ability to dive into detailed performance data, or to quickly look at trends, is empowering and addictive. As a consequence, it is becoming the expectation for all data in the business. This attitude is testament to the value of the approach, but it is an expectation that is difficult to meet.
One challenge is the high technical barrier involved. Whilst all teams in our business hold valuable data, many do not have the skills needed to build and maintain data products.
To help address this, we have been making the process of getting data into products on the data mesh as easy as possible. Our goal is to unlock the potential of all data in the business using the least amount of engineering effort.
A People data product
The People team at loveholidays is a good example of a business area without the support to unlock the full value of its data. This has meant that the insight contained in our engagement surveys could often only be shared in summary form.
This presents a risk: the importance of understanding and properly responding to feedback, especially as you scale, cannot be overstated. It's important to note that doing this is not just the People team's job; it's everyone's responsibility.
However, the lack of visibility and access made it very difficult for teams to understand and explore this data. Addressing this gave us an ideal starting point for building low-effort data products.
The data challenge
One of the surveys we send is our version of Spotify’s “team health check”. It asks technology teams to rate their experience across a number of key dimensions: from the health of their code and how easy it is to release, through to the level of support they receive and how much fun they are having.
There are many ways (and products you can buy) to capture and collate this feedback, but we use Google Forms because of its relatively low effort and cost.
After you have sent the survey, each employee’s submission is automatically saved into a Google Sheet. The resulting data is structured with one row per response, containing all of that employee’s answers, and one column per question.
It might seem obvious, but this data isn’t really in a good format for analysis. It’s messy to aggregate across surveys (especially if you add different questions) and hard to cut the data into different dimensions or compare over time.
What we really need is the data pivoted into a long format, with one row per answer to each question. Doing so makes it a lot easier to analyse, create views, and build graphs and charts to show trends.
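To make that concrete, here is a minimal sketch in Python with pandas (the column names and values are invented for the example) of how a wide Google Forms results sheet is reshaped into that long format:

```python
import pandas as pd

# One row per submitted response, one column per question,
# as Google Forms writes it into the results sheet.
wide = pd.DataFrame({
    "Timestamp": ["2023-05-02 09:14", "2023-05-02 09:20"],
    "Team": ["Checkout", "Search"],
    "How healthy is your codebase?": [4, 3],
    "How easy is it to release?": [5, 2],
})

# Melt into long format: one row per question per response.
long = wide.melt(
    id_vars=["Timestamp", "Team"],
    var_name="question",
    value_name="score",
)

print(long)
# Each row is now a single answer, which is much easier to
# aggregate across surveys, slice by team, or trend over time.
```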
Getting it into our data mesh
Getting this data out of Google Sheets and into a data store, in a shape better suited to analysis, is the same challenge we face for all our data products.
The process our technology teams use to build data products follows a standard pattern:
Extract the raw system data sources and load them into BigQuery tables in our “landing area” ready to be processed
Transform the raw data into data products that are loaded into separate BigQuery tables
Enable this data to be shared and explored through clear documentation and standard tooling (we use Connected Sheets, Looker and Looker Studio)
For our survey data we need to follow this same process but make it much lower effort.
Extracting data from Google Sheets
Bringing the raw data into our processing infrastructure was the first automation challenge we faced. We had been using Fivetran successfully in our data mesh for several months to integrate data from multiple sources into BigQuery.
There is a Google Sheets connector, so we immediately tried it. It worked for simple sheets, but we had problems with our survey data. The way the forms had been set up broke the 300-character limit for a BigQuery column name, and there wasn’t an easy way around it.
The solution was to implement our own custom process, as we have a lot of good tooling available to us. Using an Airflow operator pointed at a folder of Google Sheets, we perform the extraction and pivoting in one go.
Applying the transformation
The pivot itself was surprisingly complex to implement, and not something we could reasonably ask non-technical teams to do, so could we create a general solution? After some experimentation, we decided to ingest and pivot the data in Python before writing it to the landing area. Because we use Airflow, we had the perfect setup to make this happen.
Normally we would not do processing directly on the Airflow cluster, because its primary role is to orchestrate workflows rather than do any heavy data processing. But in this case we were confident that the data would stay small enough to process in-memory without negatively impacting our cluster.
What we ended up with was a generic custom operator which can read Google Sheets from a designated Google Drive folder and pivot all of the data into the right format to be loaded into the BigQuery landing area.
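We can’t reproduce the operator verbatim here, but a minimal sketch of the idea, with illustrative names, folder IDs and table names, using the Google provider hooks, pandas and the BigQuery client, looks something like this:

```python
import pandas as pd
from airflow.models import BaseOperator
from airflow.providers.google.suite.hooks.drive import GoogleDriveHook
from airflow.providers.google.suite.hooks.sheets import GSheetsHook
from google.cloud import bigquery


class SheetsFolderToLandingOperator(BaseOperator):
    """Reads every Google Sheet in a Drive folder, pivots the survey
    responses into long format and appends them to a landing table."""

    def __init__(self, folder_id, landing_table,
                 gcp_conn_id="google_cloud_default", **kwargs):
        super().__init__(**kwargs)
        self.folder_id = folder_id
        self.landing_table = landing_table  # e.g. "project.landing.survey_responses"
        self.gcp_conn_id = gcp_conn_id

    def execute(self, context):
        drive = GoogleDriveHook(gcp_conn_id=self.gcp_conn_id).get_conn()
        sheets = GSheetsHook(gcp_conn_id=self.gcp_conn_id)

        # List the response spreadsheets sitting in the survey folder.
        files = drive.files().list(
            q=(f"'{self.folder_id}' in parents and "
               "mimeType='application/vnd.google-apps.spreadsheet'"),
            fields="files(id, name)",
        ).execute()["files"]

        frames = []
        for f in files:
            rows = sheets.get_values(spreadsheet_id=f["id"], range_="A:ZZ")
            wide = pd.DataFrame(rows[1:], columns=rows[0])
            # One row per question per response, tagged with the survey name.
            # (Google Forms always adds a Timestamp column to the results sheet.)
            long = wide.melt(id_vars=["Timestamp"],
                             var_name="question", value_name="score")
            long["score"] = pd.to_numeric(long["score"], errors="coerce")
            long["survey"] = f["name"]
            frames.append(long)

        # Append everything to the BigQuery landing table in one load job.
        bigquery.Client().load_table_from_dataframe(
            pd.concat(frames), self.landing_table
        ).result()
```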
Running the jobs automatically
Once we had data being loaded into the landing area, our standard Data Mesh infrastructure could take over. This system gives people an easy way to set up and configure SQL processing jobs which work on the landing area data. They don’t have to worry about the boring bits like deployment and scheduling; all they need to do is:
write a SQL query that will transform the landing data into a data product
write a small JSON file that configures how often the job should run (as well as some other aspects like where to send alerts)
push the code to GitHub
Once pushed, our infrastructure takes over and dynamically generates an Airflow DAG which will run the SQL query on the specified schedule.
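Generating DAGs from config files like this is a common Airflow pattern. A simplified sketch of the idea, with an illustrative file layout and config fields rather than our actual implementation, might look like:

```python
import json
from pathlib import Path

import pendulum
from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

# Each data product lives in its own folder containing a transform.sql
# and a config.json (schedule, alerting, destination table, ...).
PRODUCTS_DIR = Path(__file__).parent / "data_products"

for product_dir in PRODUCTS_DIR.iterdir():
    if not product_dir.is_dir():
        continue

    config = json.loads((product_dir / "config.json").read_text())
    sql = (product_dir / "transform.sql").read_text()

    dag = DAG(
        dag_id=f"data_product_{product_dir.name}",
        schedule=config["schedule"],  # e.g. "0 6 * * *"
        start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
        catchup=False,
    )

    with dag:
        # Run the team's transform query as a BigQuery job.
        BigQueryInsertJobOperator(
            task_id="run_transform",
            configuration={"query": {"query": sql, "useLegacySql": False}},
        )

    # Register the generated DAG so the scheduler picks it up.
    globals()[dag.dag_id] = dag
```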
Results
The result of this is a data product that is easy to interrogate. Obviously you can query the table in BigQuery, but it is also easy for non-engineers to pull the data into Google Sheets for their own analysis using data connectors.
From there it is simple to pivot and chart the data to help answer any question you have about it.
We are also able to integrate this data into dashboarding tools like Looker Studio, making important KPIs accessible and visible to everyone in a standard way.
Learnings
We created a very straightforward way of building data products from non-standard sources that, whilst not zero effort, was still very low maintenance.
For awkward data sources it’s worth putting the effort in to convert them early on in the process. The further downstream you allow difficult schemas to propagate, the more confusing and complex the work becomes.
This is a general approach we can apply to many other data sources in the business. For example, we could look at employee data to understand how our team changes over time, and connect it to other data sets to correlate with deployments and productivity.
Finally, it really is worthwhile making data products from this type of data. Doing so has helped us realise the goal of making improvements to employee experience everyone's responsibility. Teams now regularly use the survey data to understand their performance and as the basis for key results for team improvement objectives.