Data Platform

Data Quality: Enter the 4th Dimension


Data quality is a universal source of deep pain when establishing a trusted data platform in Data & AI projects. The more systems involved, the harder it gets to clean up, before you even start accounting for how old those systems are, how up to speed the SMEs are, or how poor the front-end validation was – there's a host of potential problems. However, something tells me the number of projects where the customer says it's OK if the numbers are wrong is going to remain pretty small.

Scope, Cost, Time – Choose one. But not that one.

[Image: Project Management Triangle]

Data Quality is a project constraint

Many of you will be familiar with the Project Management Triangle, which dictates that to fix one of Scope, Cost or Time you must let the other two vary. The end result is that Quality, sitting in the middle, gets affected. In most Data & AI projects I have found Cost and Time tend to be the least negotiable, so Scope gets restricted. Yet somehow Time and Cost get blown out anyway.

Whilst Data & AI is hardly unique in terms of cost and schedule overruns, there is one key driver that traditional methods neglect. Larissa Moss, whose Extreme Scoping approach I have leant on before, calls out the reason: in a Data & AI project, Quality – specifically Data Quality – is also fixed. The data must be complete and accurate for it to be usable, and there is no room for negotiation on this. Given that the data effort consumes around 80% of a Data & AI project's budget, this becomes a significant concern.

How do we manage Data Quality as a constraint?

We have to get the business to accept that the traditional levers can't be pulled in the way they are used to, and that requires end-user education. The business needs to be made aware that Data Quality is a fixed constraint – one that they are imposing, albeit implicitly. If Quality is not a variable, the traditional "pick two to play with" becomes "prepare to vary all three". Larissa Moss refers to this as an "Information Age Mental Model", which prioritises quality of output above all else.

Here is where strong leadership and clear communication come into play. If one part of the business demands a certain piece of information, the Data & AI project team must be clear that, to obtain that data at the mandated quality, the business must be prepared to bear the costs of doing so. That includes the cost of bringing the data up to an enterprise-grade, reusable standard, so it integrates with both past and future components of the solution. This does not mean an infinite budget is opened up for each data item – some data may not be worth the cost of acquisition. What it does mean is that the discussion about costs can be more honest, and the consumer can be more aware of what drives the issues that will arise in obtaining their data.

[Image: ELT Framework in Microsoft Azure]

Azure ELT Framework


The framework shown above is becoming a common pattern for Extract, Load & Transform (ELT) solutions in Microsoft Azure. The key services used in this framework are Azure Data Factory v2 for orchestration, Azure Data Lake Storage Gen2 for storage and Azure Databricks for data transformation. Here are the key benefits each component offers, followed by a short sketch of the transformation step:

  1. Azure Data Factory v2 (ADF) – ADF v2 plays the role of an orchestrator, facilitating data ingestion & movement while letting other services transform the data. This lets a service like Azure Databricks, which is highly proficient at data manipulation, own the transformation process while keeping orchestration independent. It also makes it easier to swap transformation-specific services in & out as requirements change.
  2. Azure Data Lake Storage Gen2 (ADLS) – ADLS Gen2 provides a highly scalable and cost-effective storage platform. Built on blob storage, it offers storage suitable for big data analytics while keeping costs low, along with granular controls for enforcing security rules.
  3. Azure Databricks – Databricks is quickly becoming the de facto platform for data engineering & data science in Azure. Leveraging Apache Spark's capabilities – the DataFrame & Dataset APIs and Spark SQL for data interrogation, Spark Streaming for streaming analytics, MLlib for machine learning and GraphX for graph processing – Databricks is living up to the promise of a Unified Analytics Platform.
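
To make that division of labour concrete, here is a minimal PySpark sketch of the transformation step as it might run in a Databricks notebook triggered by ADF. The storage account, containers, paths and secret scope are hypothetical placeholders, and the account-key configuration shown is just one of several ways Databricks can authenticate to ADLS Gen2.

```python
# Minimal Databricks notebook sketch (spark & dbutils are predefined in
# Databricks): read raw data from ADLS Gen2, transform it, write it back.
# "mydatalake", the containers, paths and secret scope are placeholders.
from pyspark.sql import functions as F

# One way to authenticate to ADLS Gen2 (storage account key held in a
# secret scope); service principals are a common alternative.
spark.conf.set(
    "fs.azure.account.key.mydatalake.dfs.core.windows.net",
    dbutils.secrets.get(scope="data-platform", key="adls-account-key"))

raw_path = "abfss://raw@mydatalake.dfs.core.windows.net/sales/*.csv"
curated_path = "abfss://curated@mydatalake.dfs.core.windows.net/sales_daily"

# Extract: load the raw CSVs that ADF landed in the lake.
raw_df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv(raw_path))

# Transform: a representative aggregation using the DataFrame API.
daily_sales = (raw_df
               .withColumn("sale_date", F.to_date("sale_timestamp"))
               .groupBy("sale_date", "store_id")
               .agg(F.sum("amount").alias("total_amount"),
                    F.count("*").alias("transaction_count")))

# Load: write the curated output back to the lake in a columnar format.
(daily_sales.write
    .mode("overwrite")
    .partitionBy("sale_date")
    .parquet(curated_path))
```

ADF's role is then limited to landing the raw files and invoking this notebook through a Databricks Notebook activity, which is exactly what keeps orchestration and transformation decoupled.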

The pattern makes use of Azure Data Lake Storage Gen2 as the final landing layer; however, it can be extended with different serving layers – Azure SQL Data Warehouse if an MPP platform is needed, Azure Cosmos DB if a high-throughput NoSQL database is needed, and so on.
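
As an illustration of such an extension, the curated output could be pushed to Azure SQL Data Warehouse with the SQL DW connector available in Databricks. This is a hedged sketch: the server, database, table, credentials and staging path are all placeholders, and the connector options should be verified against the current documentation.

```python
# Hypothetical extension: serve the curated data from Azure SQL Data
# Warehouse. All names and credentials below are placeholders.
curated_path = "abfss://curated@mydatalake.dfs.core.windows.net/sales_daily"
daily_sales = spark.read.parquet(curated_path)

(daily_sales.write
    .format("com.databricks.spark.sqldw")
    # JDBC connection to the SQL DW instance (placeholder server/database).
    .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;"
                   "database=mydw;user=loader;password=<secret>")
    # The connector stages data in storage and bulk-loads it with PolyBase.
    .option("tempDir", "abfss://staging@mydatalake.dfs.core.windows.net/sqldw")
    .option("forwardSparkAzureStorageCredentials", "true")
    .option("dbTable", "dbo.SalesDaily")
    .mode("append")
    .save())
```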

ADF, ADLS & Azure Databricks form the core set of services in this modern ELT framework, and investment continues in both their individual capabilities and their integration with the rest of the Azure ecosystem. Upcoming features include Mapping Data Flows in ADF (currently in private preview), which will let users develop ETL & ELT pipelines using a GUI-based approach, and MLflow in Azure Databricks (currently in public preview), which will provide machine-learning experiment tracking, model management & operationalisation. This makes the ELT framework sustainable and future-proof for your data platform.
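
For a flavour of the MLflow tracking API, here is a minimal sketch; the experiment name, parameters and metric are made up, and the surface may change while the feature is in preview.

```python
import mlflow

# Hypothetical experiment: record parameters and metrics for a training run
# so it can be compared against other runs in the Databricks workspace.
mlflow.set_experiment("/Shared/churn-model")

with mlflow.start_run():
    mlflow.log_param("max_depth", 8)
    mlflow.log_param("n_estimators", 200)
    # ... train and evaluate the model here ...
    mlflow.log_metric("auc", 0.87)
```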

SSIS Integration Runtime Connectivity Testing


SSIS Integration Runtime Connectivity Testing is hard because there is no physical Azure VM to log in to as part of Azure Data Factory (ADF). While behind the scenes a VM is effectively spun up, there is no way to access it.

The scenario our Data Platform team faced was reasonably simple: we needed to connect to a 4D database that sat behind the Storman application our customer used, so that we could extract data for their various workloads. Because 4D is not supported by the generic ODBC source in Azure Data Factory, we needed to use the 4D ODBC driver, which meant using SSIS to leverage it.
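
Before involving SSIS at all, it is worth sanity-checking the driver directly from a dev machine. The sketch below uses pyodbc to do that; the driver name, server, credentials and probe query are assumptions that depend on how the 4D ODBC driver is installed and configured in your environment.

```python
import pyodbc

# Hypothetical connectivity test for the 4D ODBC driver from a dev machine.
# Driver name, host and credentials are placeholders; check the 4D ODBC
# driver documentation for the real connection string keywords.
conn_str = (
    "DRIVER={4D ODBC Driver 64-bit};"
    "SERVER=storman.client.internal;"
    "UID=extract_user;"
    "PWD=<secret>;"
)

try:
    conn = pyodbc.connect(conn_str, timeout=10)  # timeout = login timeout (s)
    cursor = conn.cursor()
    cursor.execute("SELECT 1")  # trivial probe query
    print("ODBC connection OK:", cursor.fetchone())
    conn.close()
except pyodbc.Error as exc:
    print("ODBC connection failed:", exc)
```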

The client is well managed in terms of security, so the target system can only be accessed from within their network. Their Azure network was connected to their on-premises network and properly secured, so part of setting up the SSIS Integration Runtime in Azure Data Factory was ensuring it was joined to that virtual network.

Houston, we have a problem

[Image: SSIS Azure Data Factory]

However, despite all this, we couldn't get the ODBC connection to work when deployed. Our first suspect was the driver, due to its stability issues – after all, it frequently crashed Visual Studio 2017 / SSDT, and configuration was a pain. Initially we also couldn't connect from our dev machines because we weren't on the client's VPN (easily fixed, fortunately). Then we had the wrong target server (again, easily fixed).

Once we got on to ADF, our debugging options became more limited, as we were now doing SSIS Integration Runtime Connectivity Testing without all the tools available on our desktops. Initially we struggled because the runtime was very slow at sharing its metadata (package paths, connection managers, etc.), so we weren't even sure it could work with the driver. Eventually we got enough metadata to start configuring the JSON of the task. However, we continued to get errors in ADF that weren't really helping.

Our breakthrough came when we remembered we could connect to the more familiar and mature environment of the SSIS catalog that is deployed alongside the runtime. We configured the package correctly, ran it, and got a more manageable ODBC error: "Cannot reach Destination server". The server responded to ping from our desktops, so as a test we deployed a simple package containing just a script task that pinged the server. This worked fine on our desktops, but when deployed the script task reported failure.
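
Our test package used a script task (written in C#), but the check itself is trivial; here is the same idea sketched in Python, with the host and port as placeholders. A TCP probe on the database port is worth running alongside the ping, because ICMP and the actual service port can be filtered differently.

```python
import socket
import subprocess

HOST = "storman.client.internal"  # placeholder target server
PORT = 19812                      # placeholder port for the 4D database

# ICMP reachability - equivalent to what our script task did.
# "-n" is the Windows count flag; use "-c" on Linux.
ping = subprocess.run(["ping", "-n", "2", HOST],
                      capture_output=True, text=True)
print("ping:", "OK" if ping.returncode == 0 else "FAILED")

# TCP reachability on the service port, which a ping alone won't prove.
try:
    with socket.create_connection((HOST, PORT), timeout=5):
        print(f"tcp {PORT}: OK")
except OSError as exc:
    print(f"tcp {PORT}: FAILED ({exc})")
```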

So a quick connectivity test helped pin it down to a probable network configuration issue. It's now in the Infrastructure team's hands to ensure everything is configured correctly, but at least we have (for now) got SSIS & the ODBC driver off the list of probable causes. It's also taught us a few things about SSIS Integration Runtime Connectivity Testing.