March 2019

Data Quality: Enter the 4th Dimension

Data quality is a universal cause of deep pain when establishing a trusted data platform in Data & AI projects. The more systems involved, the harder it gets to clean up, before you even start accounting for how old those systems are, how up to speed the SMEs are, or how poor the front-end validation was. There is a host of potential problems. However, something tells me that the number of projects where the customer has said it's OK if the numbers are wrong is going to remain pretty small.

Scope, Cost, Time – Choose one. But not that one.

Project Management Triangle

Data Quality is a project constraint

Many of you will be familiar with the Project Management Triangle, which dictates that you vary two of Scope, Cost or Time to fix the third, with the end result that Quality, sitting in the middle, gets affected. For most Data & AI projects I have found that cost and time tend to be the least negotiable, so scope gets restricted. Yet somehow Time and Cost get blown out anyway.

Whilst Data & AI is hardly unique in terms of cost and schedule overruns, there is one key driver that is neglected by traditional methods. Leaning once again on Larissa Moss's Extreme Scoping approach, she calls out the reason: in a Data & AI project, Quality – specifically Data Quality – is also fixed. The data must be complete and accurate for it to be usable, and there is no room for negotiation on this. Given that the data effort consumes around 80% of a Data & AI project's budget, this becomes a significant concern.

How do we manage Data Quality as a constraint?

We have to get the business to accept that the traditional levers can't be pulled in the way they are used to, and that requires end user education. The business needs to be made aware that Data Quality is a fixed constraint – one that they are imposing, albeit implicitly. It also has to accept that if Quality is not a variable, then the traditional "pick two to play with" becomes "prepare to vary all three". Larissa Moss refers to this as an "Information Age Mental Model", which prioritises quality of output above all else.

Here is where strong leadership and clear communication come into play. If one part of the business demands a certain piece of information, the Data & AI project team must make clear that to obtain that data at the mandated quality, the business must be prepared to bear the costs of doing so – including the cost of bringing it up to an enterprise-grade, reusable standard, so that it integrates with both past and future components of the solution. This of course does not mean that an infinite budget is opened up for each data item; some data may not be worth the cost of acquisition. What it does mean is that the discussion about costs can be more honest, and the consumer can be more aware of the issues that will arise in trying to obtain their data.

ELT Framework in Microsoft Azure

Azure ELT Framework

The framework shown above is becoming a common pattern for Extract, Load & Transform (ELT) solutions in Microsoft Azure. The key services used in this framework are Azure Data Factory v2 for orchestration, Azure Data Lake Storage Gen2 for storage and Azure Databricks for data transformation. Here are the key benefits each component offers –

  1. Azure Data Factory v2 (ADF) – ADF v2 plays the role of an orchestrator, facilitating data ingestion & movement while letting other services transform the data. This lets a service like Azure Databricks, which is highly proficient at data manipulation, own the transformation process while keeping orchestration independent. It also makes it easier to swap transformation-specific services in & out depending on requirements.
  2. Azure Data Lake Storage Gen2 (ADLS Gen2) – ADLS Gen2 provides a highly scalable and cost-effective storage platform. Built on blob storage, it offers storage suitable for big data analytics while keeping costs low, and it provides granular access controls for enforcing security rules.
  3. Azure Databricks – Databricks is quickly becoming the de facto platform for data engineering & data science in Azure. Leveraging Apache Spark’s capabilities through Dataframe & Dataset APIs and Spark SQL for data interrogation, Spark Streaming for streaming analytics, Spark MLlib for machine learning & GraphX for graph processing, Databricks is truly living up to the promise of a Unified Analytics Platform.
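The separation of orchestration from transformation that this framework relies on can be sketched in plain Python. This is a conceptual illustration only – the function names and path layout are hypothetical stand-ins, not part of any Azure SDK:

```python
# Conceptual sketch: the orchestrator (ADF's role) only sequences steps and
# passes data locations around; the transformation engine (Databricks' role,
# here a plain function) can be swapped without touching orchestration logic.

def ingest(source: str) -> str:
    """Land source data in the raw zone and return its location."""
    return f"raw/{source}"

def transform_with_databricks(input_path: str) -> str:
    """Stand-in for a Databricks notebook activity: curate the raw data."""
    return input_path.replace("raw/", "curated/")

def run_pipeline(source: str, transform=transform_with_databricks) -> str:
    """Orchestrate: ingest, then hand off to whichever engine is supplied."""
    raw_path = ingest(source)
    return transform(raw_path)

print(run_pipeline("crm"))  # curated/crm
```

Because the orchestrator only knows about the `transform` hook, a different engine can be passed in without changing the pipeline – which is the swap-in/swap-out property described above.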

The pattern makes use of Azure Data Lake Storage Gen2 as the final landing layer; however, it can be extended with different serving layers, such as Azure SQL Data Warehouse if an MPP platform is needed or Azure Cosmos DB if a high-throughput NoSQL database is needed.
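One way to keep the landing layer extensible is a zoned, partitioned folder convention in the lake, so serving layers can be bolted on later without reshuffling data. The sketch below builds such paths; the zone/source/entity/date layout is an assumed convention, not something ADLS mandates:

```python
from datetime import date

def landing_path(zone: str, source: str, entity: str, run_date: date) -> str:
    """Build a partitioned lake-style path such as raw/crm/customers/2019/03/01.
    The zone/source/entity/date convention here is illustrative only."""
    return f"{zone}/{source}/{entity}/{run_date:%Y/%m/%d}"

print(landing_path("raw", "crm", "customers", date(2019, 3, 1)))
# raw/crm/customers/2019/03/01
```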

ADF, ADLS & Azure Databricks form the core set of services in this modern ELT framework. Investment in their individual capabilities and their integration with the rest of the Azure ecosystem continues to be made. Some examples of upcoming features include Mapping Data Flows in ADF (currently in private preview), which will let users develop ETL & ELT pipelines using a GUI-based approach, and MLflow in Azure Databricks (currently in public preview), which will provide capabilities for machine-learning experiment tracking, model management & operationalisation. This makes the ELT framework sustainable and future-proof for your data platform.

Agile Zero Sprint for Data & AI projects

Agile methodologies have a patchy track record in Data & AI projects. A lot of this is to do with adopting the methodologies themselves – there is a heap of obstacles in the way that are cultural, process-based and ability-based. I was discussing agile adoption with a client who readily admitted that their last attempt had failed completely. The conversation turned to the concept of the Agile Zero Sprint, and he admitted part of the reason for failure was that they had allowed zero time for their Agile Zero Sprint.

What is an Agile Zero Sprint?

The reality of any technical project is that certain fundamental decisions and planning processes must be worked through before any meaningful work can be done. Data Warehouses are particularly vulnerable to this – you need servers, an agreed design approach and a set of ETL standards before any valuable work can be done, or at least before it can be done without incurring so much technical debt that your project sinks after the first iteration, cleaning up after itself.

So the Agile Zero Sprint is all that groundwork that needs to be done before you get started. It feels "un"-agile, as you can easily spend a couple of months producing nothing of any apparent direct value to the business or customer. The business will of course wonder where the productivity nirvana is – and, particularly galling, you need your brightest and best on it to make sure a solid foundation is put in place, so it is not a particularly cheap phase either. You can take a purist view on the content from the Scrum Alliance or a more pragmatic one from Larissa Moss.

How to structure and sell the Zero sprint

The structure part is actually pretty easy. There is a set of things you need to establish, and these form a fairly stable product backlog. Working out how long they will take isn't that hard either, as experienced team members will be able to tell you how long pieces like the conceptual architecture take. It just needs to be run like a long sprint.

An Agile Zero Sprint prevents clogged pipes

Selling it as part of an Agile project is a bit harder. We try to make this part of the project structure part of the roadmap we lay out in our Data & AI strategy. Because you end up not delivering any business-consumable value, you need to be very clear about what you will deliver, when you will deliver it and what value it adds to the project. It starts smelling a lot like Waterfall at this point, so if the business is skeptical that anything has changed, you have to manage their expectations well. Be clear that once the initial hump is passed, the value will flow – whereas if you skip the groundwork, value will flow earlier than expected at first, but soon after the pipes will clog with technical debt (though you may want to use a different terminology!).

BI User Personas – are you scaring users with the kitchen sink?

BI User Personas are a key part of delivering any BI solution. Throughout my career I have encountered clients who have all faced the problem that their BI Solution has not achieved the adoption they had hoped for. This in turn has reduced the impact of the solution and thus the ROI. A common thread in the examples I have seen is the horrifying kitchen sink that is thrown at every user.

The wrong BI User Persona causes some interesting reactions

The kitchen sink is scary for some!

To explain for those not familiar with the idiom, to include "everything but the kitchen sink" means to "include just about everything, whether needed or not". What it means in this context is that the BI solution presents so many dimensions, measures and KPIs to the user that the experience becomes confusing, overwhelming and, as a consequence, useless.

Why building your BI solution is like making a hit movie

No Hollywood movie is ever made without considering the audience appeal – they even use predictive analytics to drive scripting decisions. So why should your project be any different? You have consumers that need to be satisfied, and their wishes must be taken into account.

A key element of our Data & AI Strategy is to ensure that end users' different needs are planned for. Constructing BI User Personas to define what level of detail gets exposed to each persona helps in this process. To stretch our analogy a little further, your executive team may only care that there *is* a kitchen sink and whether it is working or not. A management team may need to know how hot the water is and how water-efficient the tap is. The analysts will need detailed water usage statistics over time for analysis. Not everyone needs to know the same thing.

Most BI tools allow you to provide different views of the data model so that you can tailor the output of a very complex model to users with simple needs. An executive may only need a few key metrics and dimensions in a dashboard to examine before they pass further analysis downstream. A manager may need a more complex report with interactivity to drill into an issue. The analyst may simply need access to raw data to answer their questions.
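The idea of exposing persona-appropriate slices of a single model can be sketched as follows. The field names and personas are hypothetical, and this stands in for what a BI tool would do with perspectives or role-based views:

```python
# Hypothetical semantic model: one full field list, with per-persona views
# defined as subsets, so simpler audiences never see the kitchen sink.
MODEL_FIELDS = {
    "revenue", "revenue_target", "water_usage_litres",
    "tap_efficiency_rating", "reading_timestamp", "sensor_id",
}

PERSONA_VIEWS = {
    "executive": {"revenue", "revenue_target"},            # a few key metrics
    "manager": {"revenue", "revenue_target",
                "water_usage_litres", "tap_efficiency_rating"},
    "analyst": MODEL_FIELDS,                               # raw access to everything
}

def fields_for(persona: str) -> set:
    """Return only the fields the persona's view exposes."""
    return MODEL_FIELDS & PERSONA_VIEWS[persona]

print(sorted(fields_for("executive")))  # ['revenue', 'revenue_target']
```

The model stays one asset; only the view definitions differ per persona, which is what keeps the complex model usable for simple audiences.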

The same applies to less data-literate users. If they are not technically minded, they may find a much simpler model less intimidating. Data literacy is a whole additional topic, but with proper preparation it can be managed and taught.

BI User Personas drive a smash hit!

Understanding your audience is essential. As part of designing your solution, BI User Personas need to be defined so that each persona gets appropriately tailored content.

Building and understanding the personas of your end user team is of course only part of the equation. There are many human components in a Data & AI Strategy that need to be implemented. Change management, training and ongoing communication help ensure that what you deliver is adopted, and part of the strength of FTS Data & AI is that as part of the FTS Group we can bring our stablemate Cubic in to help with this.