When it comes to ETL development, optimal performance and scalability are key. In this article, we are going to cover our top 5 Clariba act·in | ETL Framework best practices to achieve this goal.
First impression when using Microsoft SQL Server Integration Services (SSIS), the developer’s first impression is the tool provides powerful features in addition is simple to use. Although this is true, there are some common errors that can make us waste our time. The purpose of this blog article is to cover some of these very common issues and save some time when starting to develop a solution with Microsoft SSIS.
Are you satisfied with the way you currently manage the dependencies in your ETL? Dependencies between jobs (or parts of jobs) are an important aspect of the ETL management. It pertains to questions like: Do you want to execute job B if job A failed? Imagine that you have a job C with sub-job 1 (usual runtime: 3 hours) and sub-job 2 (usual runtime: 2 minutes). If sub-job 1 was successful and sub-job 2 failed, can you gracefully restart job C without the sub-job 1 being restarted again?
As soon as you have more than 1 simple job, you have to manage your dependencies. In this article (part 1 of a series of articles about ETL Dependencies Management) I’ll first list some of the characteristics I’m looking for in an ideal dependency management system. I will then have a look at some of the possibilities offered by SAP Data Services 4. In part 2 (my next post), I will propose the architecture of a possible dependency management system. In part 3, I will go into the details of the implementation in Data Services. I’ll finish with part 4 by telling you about how the implementation went, and if some improvements are possible.
The ideal dependency management system
In this post I will use the word “process” to design a series of ETL operations that have a meaning together. Example: extract a source table, create a dimension, or update a fact table. The objective here is to manage the dependencies between the processes: updating a fact table should probably only be allowed if updating the corresponding dimensions was successful.
A dependency management system should ideally have at least the following characteristics:
- Run a process only if its prerequisites ran correctly
- After a failure, offer the option to re-run all the processes or only the processes which failed
- Trace the outcome of each process (ran successfully, failed, did not run)
- Run dependent processes dynamically (rather than statically, i.e. based on date/time)
Let’s enumerate some of the possibilities offered by Data Services, with their respective pros and cons.
1) One job with all processes inside. This is very easy to implement, dynamic in terms of run times, but it doesn’t allow for concurrent runs. Most importantly, it means that failures have to be managed so that the failure of one process does not stop the whole job.
2) One process per job, with jobs scheduled at specific times. This is very easy to implement, allows concurrent runs, but is not dynamic enough. If the process durations increase with the months/years, jobs may overlap.
3) One main job calling other jobs (for example with execution commands or Web Services).
4) One process per job, all the jobs being scheduled at specific times, but checking in a control table if the pre-requisites ran fine. Otherwise they just sleep for some time before checking again.
5) Use the BOE Scheduler to manage jobs based on events (how-to is well described on the SCN). I’ve not tested it yet, but I like this approach.
By default, the first two possibilities only manage the “flow” side of the dependency management (after A, do B). But they do not manage the conditional side of the dependency management (do B only if A was successful). In both cases, a control table updated by SQL scripts would allow the ETL to check if the prerequisite processes have been run correctly.
What I don’t really like in the solutions 2 to 5 is the fact that it’s difficult to have an overview of what’s going on. You cannot really navigate within the whole ETL easily. The solution 1 gives you this overview, but at the cost of having a potentially huge job (without the possibility of processes running concurrently).
Also note that the solutions with multiple jobs will need to manage the initialization of the global variables.
What I miss in all these solutions is an optimal re-start of the ETL. If 10 of my 50 processes failed, and I want to restart these 10 only, do I really have to start them manually?
In my next blog post I’ll propose an architecture that addresses this optimal restart.
Until then, please let me know your thoughts about how you manage your ETL dependencies. Any of the 5 solutions mentioned before? A mix? Something else? And how well does it work for you.