Data preparation: on-premises vs cloud

The first step in a BI project is the preparation of the data in a (pseudo) data warehouse. Most of the time, similar source data is distributed across different source systems, which makes it difficult to get a unified view across that data. The source data structures are also often not optimized for reporting purposes, which can cause performance issues in reporting tools. The ETL process is therefore an important part of building a BI solution.
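To make that concrete, here is a minimal sketch in Python (using pandas) of the kind of transformation an ETL process performs; the two source extracts and their column names are hypothetical.

```python
import pandas as pd

# Hypothetical extracts from two source systems holding similar customer data.
crm_customers = pd.DataFrame({
    "customer_id": [1, 2],
    "name": ["Acme", "Globex"],
    "country": ["BE", "NL"],
})
erp_customers = pd.DataFrame({
    "cust_no": [2, 3],
    "cust_name": ["Globex", "Initech"],
    "country_code": ["NL", "US"],
})

# Transform: align column names so both sources share one schema.
erp_customers = erp_customers.rename(columns={
    "cust_no": "customer_id",
    "cust_name": "name",
    "country_code": "country",
})

# Load: a unified customer dimension for the reporting layer.
dim_customer = (
    pd.concat([crm_customers, erp_customers])
    .drop_duplicates(subset="customer_id")
    .reset_index(drop=True)
)
print(dim_customer)
```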

In recent years, we as data engineers have been confronted with an extra dilemma in this process: where are we going to store our data warehouse? Do we keep all the data on an on-premises server, or do we take the next step and create our warehouse in the cloud? Microsoft's Azure platform contains several solutions for setting up a data warehouse (and a lot more, see Cloud Computing Services | Microsoft Azure), but for now I would like to concentrate on the SQL database and ETL options.

Nearly every data warehouse I have set up was built on a SQL database. From my point of view, the difference between a SQL database on an on-premises server and one in the cloud is minimal, so I'll leave that choice to the DBAs and system admins. It will probably come down to finding the right balance between scalability, management, and cost. More important, from my point of view, are the tools used for the ETL.

For the ETL process on an on-premises SQL Server database, we have been using Microsoft's SQL Server Integration Services (SSIS) for years now. While this solution does its job perfectly when connecting to the sources available in the SSIS Toolbox, connecting to other sources is sometimes quite tricky, especially when we need to connect to APIs to load cloud data. Luckily, Microsoft has released the Power Query source to help with this; before that, we were forced to work with the script source, which required scripting in C#, something not every data engineer is familiar or comfortable with.
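For illustration, here is a small Python sketch of the kind of paginated API call such a script source has to implement; the endpoint, token, and paging parameters are all hypothetical. In SSIS itself this logic would be written in C# inside the script component.

```python
import requests

# Hypothetical REST endpoint of a cloud source; placeholder values only.
API_URL = "https://api.example.com/v1/invoices"
API_TOKEN = "..."  # credential, e.g. from a config file or key vault

def fetch_invoices(page_size=100):
    """Page through the API and yield one invoice record at a time."""
    headers = {"Authorization": f"Bearer {API_TOKEN}"}
    page = 1
    while True:
        response = requests.get(
            API_URL,
            headers=headers,
            params={"page": page, "page_size": page_size},
            timeout=30,
        )
        response.raise_for_status()
        records = response.json()
        if not records:
            break  # no more pages left
        yield from records
        page += 1

for invoice in fetch_invoices():
    print(invoice)  # in SSIS, each record would be pushed to the output buffer
```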

Azure Data Factory, on the other hand, comes with 100+ connectors to all sorts of sources right out of the box. It also comes with a REST source that lets you create your API calls in mere minutes. The only downside is that many of these connectors only seem to work with the Copy and Lookup activities, so you need to land your data in Azure Storage or an Azure database before you can run your actual transformations, which results in a higher cost.
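As a sketch of that copy-then-transform pattern, the following Python snippet first lands a raw API response in Azure Blob Storage (the role a Copy activity plays) and only then transforms it. The endpoint, connection string, container, and field names are placeholders, not a definitive pipeline.

```python
import json
import requests
from azure.storage.blob import BlobServiceClient

# Placeholder values; a real pipeline would pull these from configuration.
API_URL = "https://api.example.com/v1/orders"
CONNECTION_STRING = "..."  # Azure Storage connection string

# Step 1 (copy): land the raw API response in Azure Blob Storage,
# mirroring what a Data Factory Copy activity does for copy-only connectors.
raw = requests.get(API_URL, timeout=30).json()
blob_service = BlobServiceClient.from_connection_string(CONNECTION_STRING)
blob = blob_service.get_blob_client(container="staging", blob="orders/raw.json")
blob.upload_blob(json.dumps(raw), overwrite=True)

# Step 2 (transform): read the staged data back and reshape it for reporting.
staged = json.loads(blob.download_blob().readall())
transformed = [
    {"order_id": order["id"], "amount": order["total"]}  # hypothetical fields
    for order in staged
]
```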

Conclusion

It has become more and more clear that cloud database solutions are the way to go when the need arises to store data from other cloud sources. Azure Data Factory is much better suited to connecting to and transforming data from cloud sources. If that is your situation, it is recommended to take a closer look at the Azure tools.