Historising tables from CDC records in Azure SQL Database

Software Computing

Študent: Tine Ogrinc

Tine Ogrinc is a graduate of the Computer Science - Software Engineering module study program at Academia, College of Short-Cycle Higher Education. He successfully defended his thesis paper in July 2024.

PDF

Diploma paper Tine Ogrinc

This thesis presents a project where a cloud dataflow is set up in Azure Cloud. The data flow begins in a source database, in our case, Azure SQL Database. Debezium was setup on this database, which is a tool that captures changes on configured tables and forwards those changes to a data streaming service, in our case that was Azure Event Hubs.

The arrival of the message on the Event Hubs instance triggered an Azure Functions program, which allows for serverless execution of a program. The program in question was code written in Python, which parsed the message and entered the data in question into a final sink database, in our case that was Synapse Dedicated SQL Pool.

The data was saved in multiple ways, so as to compare different methods of historical data storage, and those tables were also queried to evaluate the performance of those types of historical data storage.

As part of this thesis, 4 hypotheses were evaluated. The first, H1, was that streaming tools are functionally equivalent to the main cloud providers. This hypothesis was confirmed, as I found that streaming services amongst the three main cloud providers had very similar functionalities and behavior.

The next hypothesis, H2, dealt with SQL databases in Azure. Here I believed that Synapse Serverless SQL Pool was the database most suitable for data warehousing. This hypothesis was mistaken, as the Serverless SQL Pool does not allow for any traditional database tables, which were needed in our project. So Synapse Dedicated SQL Pool was chosen as the most suitable database for our purposes.

Hypothesis H3 dealt with the storage of historical data. Here I believed the SCD 6 type of historical data was most suitable. But this was proven to be wrong in our examples, and SCD 2 provides the same performance in querying, but is significantly better for inserting new data.

The last hypothesis, H4, was that relational databases would maintain their dominance amongst databases and that NoSQL and Vector databases would not replace them. This hypothesis was confirmed.


 

Diploma paper Tine Ogrinc

PDF

Diploma paper Tine Ogrinc

Želite biti obveščeni o novicah na Academii?

Ko bo kaj novega vam to enostavno sporočimo na vaš e-naslov.

X