All Posts By Ajit Ananthram

ELT Framework in Microsoft Azure

Azure ELT Framework

By Ajit Ananthram | Data Platform

The framework shown above is becoming a common pattern for Extract, Load & Transform (ELT) solutions in Microsoft Azure. The key services used in this framework are Azure Data Factory v2 for orchestration, Azure Data Lake Gen2 for storage and Azure Databricks for data transformation. Here are the key benefits each component offers –

  1. Azure Data Factory v2 (ADF) – ADF v2 plays the role of an orchestrator, facilitating data ingestion & movement, while letting other services transform the data. This lets a service like Azure Databricks, which is highly proficient at data manipulation, own the transformation process while keeping the orchestration process independent. It also makes it easier to swap transformation-specific services in & out depending on requirements.
  2. Azure Data Lake Gen2 (ADLS) – ADLS Gen2 provides a highly-scalable and cost-effective storage platform. Built on blob storage, ADLS offers storage suitable for big data analytics while keeping costs low. ADLS also offers granular controls for enforcing security rules.
  3. Azure Databricks – Databricks is quickly becoming the de facto platform for data engineering & data science in Azure. Leveraging Apache Spark’s capabilities through the DataFrame & Dataset APIs and Spark SQL for data interrogation, Spark Streaming for streaming analytics, Spark MLlib for machine learning & GraphX for graph processing, Databricks is truly living up to the promise of a Unified Analytics Platform (a minimal transformation sketch follows this list).
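As an illustration of how these pieces fit together, here is a minimal PySpark sketch of a Databricks transformation step: it reads raw CSV files landed in an ADLS Gen2 container, applies a simple aggregation with the DataFrame API and writes the result back to a curated zone as Parquet. The storage account name (mydatalake), container names and column names are placeholders, and authentication to ADLS (e.g. via a service principal or a mount point) is assumed to be configured separately.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# On Databricks a SparkSession is provided as `spark`; getOrCreate() is a no-op there.
spark = SparkSession.builder.getOrCreate()

# Hypothetical ADLS Gen2 paths - storage account, containers and folders are placeholders.
raw_path = "abfss://raw@mydatalake.dfs.core.windows.net/sales/"
curated_path = "abfss://curated@mydatalake.dfs.core.windows.net/sales_by_region/"

# Extract: read the raw CSV files landed by ADF.
sales = (spark.read
         .option("header", "true")
         .option("inferSchema", "true")
         .csv(raw_path))

# Transform: a simple aggregation using the DataFrame API.
sales_by_region = (sales
                   .filter(F.col("amount") > 0)
                   .groupBy("region")
                   .agg(F.sum("amount").alias("total_amount")))

# Load: write the curated output as Parquet for downstream serving layers.
sales_by_region.write.mode("overwrite").parquet(curated_path)
```

In a production pipeline, ADF would trigger a notebook or job containing logic like this as one activity in the overall orchestration, keeping the transformation logic inside Databricks.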

The pattern makes use of Azure Data Lake Gen2 as the final landing layer; however, it can be extended with different serving layers such as Azure SQL Data Warehouse if an MPP platform is needed or Azure Cosmos DB if a high-throughput NoSQL database is needed.

ADF, ADLS & Azure Databricks form the core set of services in this modern ELT framework. Microsoft continues to invest in their individual capabilities and in their integration with the rest of the Azure ecosystem. Upcoming features include Mapping Data Flows in ADF (currently in private preview), which will let users develop ETL & ELT pipelines using a GUI-based approach, and MLflow in Azure Databricks (currently in public preview), which will provide capabilities for machine-learning experiment tracking, model management & operationalisation. This makes the ELT framework sustainable and future-proof for your data platform.
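To give a feel for MLflow's experiment tracking, the snippet below is a minimal sketch using the open-source MLflow Python API; the experiment name, parameters and metric values are purely illustrative. On Azure Databricks the tracking server is managed for you, so nothing beyond the mlflow package is assumed here.

```python
import mlflow

# Illustrative experiment name - on Databricks this maps to a workspace path.
mlflow.set_experiment("/Shared/elt-demo/churn-model")

with mlflow.start_run():
    # Record the hyperparameters and evaluation metrics of a training run.
    mlflow.log_param("max_depth", 5)
    mlflow.log_param("learning_rate", 0.1)
    mlflow.log_metric("auc", 0.91)
    mlflow.log_metric("recall", 0.87)
```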

Confusion Matrix showing True Positives, True Negatives, False Positives and False Negatives

False Negatives: Evaluating Impact in Machine Learning

By Ajit Ananthram | AI & ML

Recently, I had the opportunity to build a regression model for one of FTS Data & AI’s customers in the medical domain. Medical data poses an interesting challenge for machine learning experiments. In most cases when running algorithms for binary classification, the expected result in the training set will contain a large percentage of negatives. For example, the goal of an experiment might be to predict if – based on a set of known clinical test results – a patient has a certain medical condition. The percentage of positive results in such a set, if it is a generic dataset covering a vast number of medical conditions, will most likely be very low. As a result, a machine learning model, when initially tested using a small set of chosen features, will most likely produce a high number of false negatives.
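To illustrate how easily this situation arises, the snippet below is a small synthetic sketch (not the customer's model): on a dataset with roughly 5% positives, a naive baseline that always predicts the majority class looks impressive on accuracy while missing every actual positive, i.e. every positive becomes a false negative. The scikit-learn helpers and the 95/5 class split are assumptions chosen purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic binary dataset with roughly 5% positives, mimicking a rare condition.
X, y = make_classification(n_samples=10000, n_features=20,
                           weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Naive baseline that always predicts the majority (negative) class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
y_pred = baseline.predict(X_test)

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("Accuracy:", accuracy_score(y_test, y_pred))  # roughly 0.95 - deceptively good
print("False negatives:", fn)                       # every actual positive is missed
```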

False negatives, however, are a big problem in experiments involving clinical data: incorrectly categorising a patient as not having a certain medical condition could have disastrous consequences. Once a confusion matrix is built, the model’s effectiveness is measured using indicators such as area under the curve, accuracy, precision, recall and F1 score. In medical datasets, recall plays a big role as it measures the impact of false negatives. It can therefore hold significant weight in determining the most appropriate model for a given experiment.

The definition of recall is –

Recall = (True Positives) / (True Positives + False Negatives)

In the confusion matrix, the denominator in this equation makes up the total number of actual positives. Recall therefore effectively measures the correct positive predictions against the actual number of positives in the dataset. If there were no false negatives, recall would be at the ideal score of 1; however, if a large number of actual positives were predicted as negatives (i.e. false negatives), recall would be much lower.
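As a quick worked example (all counts and labels below are made up for illustration), recall can be computed directly from the confusion matrix counts, or from raw labels using scikit-learn's recall_score.

```python
from sklearn.metrics import recall_score

# Made-up confusion matrix counts.
true_positives = 80
false_negatives = 20

recall = true_positives / (true_positives + false_negatives)
print(recall)  # 0.8 - 20% of the actual positives were missed

# The same measure from labels, using scikit-learn:
y_true = [1, 1, 1, 1, 0]   # actual labels: four positives, one negative
y_pred = [1, 1, 1, 0, 0]   # one actual positive predicted as negative (a false negative)
print(recall_score(y_true, y_pred))  # 0.75
```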

As the model evolves and more relevant features are chosen for prediction, recall should start improving. In domains such as medicine, where false negative predictions can have dire consequences, the recall score should play a vital role in choosing the optimal model.

Getting Started with Chatbots

By Ajit Ananthram | AI & ML

Most retail websites have a chat channel these days and, more often than not, there isn’t a human being on the other end. A trained computer program, i.e. a chatbot, performs the mundane job of answering repetitive questions and never gets tired of doing it. In some cases, the chatbot completes in a matter of seconds tasks that would have been time-consuming for a human being. And this experience is only going to get richer for the user over time.

Learning to Walk before Running

Organisations that are looking to leverage chatbots to bring efficiencies into their customer-centric processes can gain valuable expertise by first building an inward-facing chatbot that assists their staff. By building a chatbot that employees can communicate with, the organisation can provide a valuable service to its staff and in the process, get a detailed understanding of the methodologies & tools required to build a productive chatbot. These learnings can then be applied to chatbots that are made available to customers.

Getting Started

The primary use case for building a bot is automating repetitive manual tasks. In the case of a chatbot, a good use case is answering questions that a user would otherwise have to search for in a published document. Most organisations have an internal wiki page or a corporate policy document which staff need to trawl through manually to get answers to specific questions. Getting a chatbot to simplify and speed up this process can help improve staff productivity.

The technical services and tools required to build a chatbot are now mature. Microsoft’s Azure Bot Service facilitates building & deploying a chatbot and integrating it with knowledge bases stored in cognitive services such as QnA Maker. Once the chatbot has been published, it can be integrated with chat channels such as Skype for Business & Teams.
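To make this concrete, here is a rough Python sketch of querying a published QnA Maker knowledge base over its REST endpoint. The hostname, knowledge base ID and endpoint key are placeholders obtained when the knowledge base is published, and in a real chatbot this call would typically be made through the Bot Framework SDK rather than raw HTTP.

```python
import requests

# Placeholders - available from the QnA Maker portal once the knowledge base is published.
QNA_HOST = "https://my-qna-service.azurewebsites.net"
KB_ID = "00000000-0000-0000-0000-000000000000"
ENDPOINT_KEY = "<endpoint-key>"

def ask(question):
    """Send a question to the knowledge base and return the top-ranked answer."""
    url = f"{QNA_HOST}/qnamaker/knowledgebases/{KB_ID}/generateAnswer"
    headers = {"Authorization": f"EndpointKey {ENDPOINT_KEY}",
               "Content-Type": "application/json"}
    response = requests.post(url, headers=headers, json={"question": question})
    response.raise_for_status()
    answers = response.json().get("answers", [])
    return answers[0]["answer"] if answers else "Sorry, I don't know the answer to that."

print(ask("How many days of annual leave do I get?"))
```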

The Next Steps

Once a chatbot that can answer questions from a knowledge base has been built, it can be made more intelligent by integrating it with cognitive services such as LUIS (Language Understanding Intelligent Service). This makes the chatbot responsive to the actual intents deciphered from the conversation. The models that power these cognitive services are constantly learning, thereby making the chatbots more responsive over time.
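As an indicative example, a chatbot can call the LUIS prediction endpoint to resolve the user's utterance into an intent before deciding how to respond. The sketch below uses the LUIS v3 prediction REST API with placeholder endpoint, app ID and key; in practice the Bot Framework offers built-in LUIS integration that handles this for you.

```python
import requests

# Placeholders - the prediction endpoint, app ID and key come from the LUIS portal.
LUIS_ENDPOINT = "https://my-luis-resource.cognitiveservices.azure.com"
APP_ID = "00000000-0000-0000-0000-000000000000"
PREDICTION_KEY = "<prediction-key>"

def top_intent(utterance):
    """Return the highest-scoring intent LUIS detects in the user's utterance."""
    url = (f"{LUIS_ENDPOINT}/luis/prediction/v3.0/apps/{APP_ID}"
           f"/slots/production/predict")
    params = {"subscription-key": PREDICTION_KEY, "query": utterance}
    response = requests.get(url, params=params)
    response.raise_for_status()
    return response.json()["prediction"]["topIntent"]

print(top_intent("Book two days of annual leave for me next week"))
```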

Once an organisation successfully implements such an inward-facing chatbot, building a customer-facing chatbot becomes a natural extension. The organisation can then look to implement more complex process flows & integrations with internal systems such as CRMs to improve the overall user experience.

Our Experience

At FTS Data & AI, we practise what we preach. We’ve developed a chatbot named ‘fts-bot’, which we’ve integrated with our Teams chat channel. The fts-bot can answer questions from FTS’s employee handbook, thereby eliminating the need for staff to manually search a PDF document. Our staff, especially those who haven’t had a lot of interactions with chatbots, have found the experience productive, and we continue to receive new ideas from technical & non-technical staff.

Conclusion

Chatbots will become ubiquitous on the internet. They will offer customers a personalised user experience and continue to learn with each interaction. Food for thought – which time-consuming process do you currently follow that could be optimised by having a chatbot assist you? Please comment.

DevOps in Database Development

By Ajit Ananthram | AI & ML

When we speak of applications development today, we assume DevOps is an integral part of the software development cycle. Modern microservices-based architectures facilitate the use of DevOps and the benefits of this are well known – agile development, quicker defect resolution, better collaboration, etc. Through containerisation using platforms such as Docker and container orchestrators such as Kubernetes and DC/OS, continuous integration and deployment become essential and not optional steps in daily activities. PaaS offerings in Microsoft Azure like AKS (Azure Kubernetes Service) make management of the platforms even simpler and thereby encourage uptake.

However, while DevOps practices have become mature in the applications development sphere, the same cannot be said of database development. To build a true DataOps team that can integrate agile engineering processes encompassing IT and data teams, a DevOps mindset is essential. Many large enterprises as well as small organisations continue to follow age-old practices for developing data-related artefacts, and as a result we still see a lack of agility and, at times, poor quality.

Microsoft has invested heavily to ensure that database developers can also leverage the benefits that have been reaped by application developers. Today’s SQL Server development IDE, SQL Server Data Tools (SSDT), comes loaded with features that enable a development team to collaborate and follow good programming practices. When combined with Visual Studio Team Services (VSTS), we get the environment needed to engender a DevOps-focused development culture.

Six Steps to DevOps

At FTS Data & AI, we believe DevOps is a foundational step in ensuring high-quality outcomes for our clients. Therefore, we make use of the toolsets made available by Microsoft in our development activities and adhere to strict policies, which are enforced by the tools. If you are looking to enable a similar culture in your database development team, consider the following guidelines –

  1. Version Control – Use a distributed version control system like Git for your database code. Git is ingrained in SSDT and VSTS, and for those who prefer the command line, Git can be used in a PowerShell window. Once you’ve set up a VSTS environment, make use of a SQL Server database project in SSDT for your database development and sync it with Git.
  2. Branching Strategy – Start with a simple branching strategy in Git. There is no one-size-fits-all approach for this, so you’ll need to pick a strategy based on the complexity of the project and the size of the team. As an example, in addition to the master branch, create a dev branch and have the development team work off this branch. Create pull requests to merge the changes into the master branch. Ensure that the master branch is always stable.
  3. Development Environment – Consider making use of SQL Server 2017 hosted on Linux in Docker as a development instance. The containerised SQL Server instance is quick to boot, tear down & replace. PowerShell can be used to issue docker commands, or Kitematic can be used if the preference is for a GUI (a minimal sketch of this step follows the list).
  4. Continuous Integration – VSTS can be configured for automated builds which can be triggered when changes are committed. Configure continuous integration on the dev branch to ensure that the database builds successfully on every commit. 
  5. Continuous Deployment – Automate publishing changes to the QA environment. This will allow testing to commence as soon as changes are committed successfully. When the process becomes mature, deployment to production can also be automated.
  6. Policies – Ensure access to the branches is only given to those who need it. Apply strict policies such as requiring a successful build as a prerequisite for a pull request to succeed. Automatically include code reviewers who would need to approve the changes before pull requests can be completed.
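As a concrete illustration of step 3, the sketch below uses Python's subprocess module to spin up a disposable SQL Server 2017 on Linux container for local development. The container name, SA password and image tag are placeholders, and Docker is assumed to be installed and running; the same docker run command can of course be issued directly from PowerShell.

```python
import subprocess

# Placeholders - pick your own container name and a strong SA password.
CONTAINER_NAME = "sql2017-dev"
SA_PASSWORD = "<YourStrong!Passw0rd>"

# Start a disposable SQL Server 2017 on Linux container for development work.
subprocess.run([
    "docker", "run",
    "--name", CONTAINER_NAME,
    "-e", "ACCEPT_EULA=Y",
    "-e", "SA_PASSWORD=" + SA_PASSWORD,
    "-p", "1433:1433",
    "-d", "mcr.microsoft.com/mssql/server:2017-latest",
], check=True)

# When you're done, tear the instance down (together with its data):
# subprocess.run(["docker", "rm", "-f", CONTAINER_NAME], check=True)
```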

These initial steps will ease the team into the DevOps culture. Look to get these steps right before moving to more advanced areas like automated unit testing, NuGet packaging, coupling database with application changes, etc.

Through a combination of mature tools and strict practices, a DevOps pipeline for database-related development activities is no longer a pipe dream. As MapR’s Chief Architect Ted Dunning has predicted, a sophisticated DataOps team comprising data-focused developers and data scientists will be the way of the future (MapR press release). Sound DevOps practices will be the first step towards getting there.