Informatica PowerCenter, Advanced mapping techniques - Part 1

According to Gartner (2013 Magic Quadrant for Data Integration), Informatica PowerCenter (PWC henceforth) is the market-leading solution for enterprise-level data integration. In this series of articles I'll cover some of the small technical tidbits that can make a world of difference when implementing such functionality, hopefully assisting aspiring ETL developers in their day-to-day engagements. The subject matter is segmented into multiple scenarios spread over a couple of sequels. As always, all readers are encouraged to reply, question and request additional information related to the topic at hand.

Fine-tuning ordinal number generation and/or ranking

When implementing ETL logic in Informatica PowerCenter, best practice suggests sticking exclusively to PWC objects as a rule of thumb. However, at times this guideline may leave ETL execution performance with something to be desired. Before stepping into shady areas such as code overriding, there is sometimes an option to implement the desired functionality with PWC objects used in a way that was not primarily intended. One example of such a scenario is the generation of ordinal numbers.

Informatica PowerCenter offers the Rank Transformation to facilitate ordinal number generation. An alternative approach, often superior in terms of execution time and resource utilization, is a simple combination of a Sorter and an Expression Transformation with variable ports. The key to this approach is an inherent feature of PWC: ports are evaluated in succession from top to bottom (as seen in the Ports tab of the most common transformation objects). How does that help? A variable port that references another variable port defined beneath it reads the value that port held for the previous record, because the lower port has not yet been re-evaluated for the current record; only the first record in the pipeline gets an undefined (default) value instead. In other words, the variable is always exactly one record behind. Combined with the ordering functionality of the Sorter Transformation, this yields a very efficient way to rank records based on arbitrary logic.

I'll illustrate with a scenario: Generate ordinal number values for the following data set, partitioned by CUSTOMER_ID, ordered ascending by CHANGE_DATE_ID.

  • CUSTOMER_ID
  • HOME_ADDRESS
  • BILLING_ADDRESS
  • PHONE_NUMBER
  • CHANGE_DATE_ID

As mentioned, the first step is to sort the data prior to assigning ranks. If we think of the functionality as an SQL rank() function, the sort needs to cover all PARTITION BY and ORDER BY columns. The sort direction for the PARTITION BY columns is irrelevant, while for the ORDER BY columns it needs to be set according to the requirements.

In this case the oldest record will be ranked highest (rank = 1). Scenarios with multiple columns in the PARTITION BY and ORDER BY clauses are a trivial extension and will not be covered within this post.
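
For reference, the outcome we are after corresponds to the following window function - shown here only as an illustrative SQL sketch (the table name CUSTOMER_CHANGES is an assumption), using ROW_NUMBER() since the Expression Transformation simply increments a counter per record:

    SELECT CUSTOMER_ID, HOME_ADDRESS, BILLING_ADDRESS, PHONE_NUMBER, CHANGE_DATE_ID,
           ROW_NUMBER() OVER (PARTITION BY CUSTOMER_ID ORDER BY CHANGE_DATE_ID ASC) AS SEQ_NO
    FROM   CUSTOMER_CHANGES;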

After we've got our data sorted in the right manner, we proceed with assigning rank values. We propagate all required ports to an Expression Transformation and add a variable port for the PARTITION BY column (CUSTOMER_ID), assigning it the value of that column.

We then add, above it, the variable port in which we calculate the rank value (v_Sequence in this case). Because ports are evaluated top to bottom, when v_Sequence is evaluated the CUSTOMER_ID variable beneath it still holds the value of the previous record (its counterpart is exactly one record behind), which is precisely the latency we need. The logic is along the following lines:
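
A sketch of the resulting port layout, top to bottom, with illustrative port names (v_Customer_ID and o_Sequence are assumptions, not taken from the original mapping):

    CUSTOMER_ID     (input/output)
    CHANGE_DATE_ID  (input/output)
    v_Sequence      (variable) = IIF(CUSTOMER_ID = v_Customer_ID, v_Sequence + 1, 1)
    v_Customer_ID   (variable) = CUSTOMER_ID
    o_Sequence      (output)   = v_Sequence

When the customer changes, v_Customer_ID still holds the previous customer and the comparison fails, so the sequence resets to 1; otherwise it keeps incrementing.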

This way the data is ranked efficiently, with performance typically exceeding that of the Rank Transformation provided by PowerCenter.

Change Data Capture (CDC) by CHECKSUM

One piece of functionality inherent to every DWH and its ETL process stack is Change Data Capture, or CDC for short. It means fetching only new and/or changed data from the source systems, applying the required transformation logic and loading that data into the DWH. This approach is commonly known as incremental loading.

Now, the first and sometimes quite comprehensive task when implementing CDC is to precisely identify the data set that needs to be retrieved from the source systems. This can be very easy to implement when, for example, records are tagged with creation and change timestamps, but that is not always the case. The worst-case scenario is when there is no way of identifying new and changed data other than comparing the complete data sets for differences. Considering the volumes of contemporary DWH incremental loads and the number of columns per data object, this task can ramp up to unsustainable infrastructure and processing requirements.

However, there is an approach that can reduce such scenarios to a more manageable outcome: instead of comparing the entire source and target data sets, one can derive a single value for each record on both the source and the target system and compare the two sets of values. An easy way to facilitate this is to calculate a checksum for each and every record entering the comparison. The compared values thus depend on all of the columns used as input parameters for the checksum calculation, yet the final comparison is done between just two values per record. Returning to the first scenario, if we did not have the CHANGE_DATE_ID attribute we could easily calculate a checksum from the following input parameters:

(TO_CHAR(CUSTOMER_ID)||HOME_ADDRESS||BILLING_ADDRESS||PHONE_NUMBER)

By comparing source and target checksums one can easily and efficiently identify the CDC scope. There is a plethora of algorithms for evaluating checksum values - MD5() is used quite commonly - but always consider the collision risk (precision) when choosing the algorithm.
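
As a rough sketch of how this comparison might look inside a PWC Expression Transformation - assuming an unconnected Lookup on the target table (here called LKP_DWH_CUSTOMER) that returns the stored checksum for a given customer, and illustrative port names:

    v_SRC_CHECKSUM (variable) = MD5(TO_CHAR(CUSTOMER_ID) || HOME_ADDRESS || BILLING_ADDRESS || PHONE_NUMBER)
    v_TGT_CHECKSUM (variable) = :LKP.LKP_DWH_CUSTOMER(CUSTOMER_ID)
    o_CHANGE_FLAG  (output)   = IIF(ISNULL(v_TGT_CHECKSUM), 'INSERT',
                                IIF(v_TGT_CHECKSUM != v_SRC_CHECKSUM, 'UPDATE', 'NONE'))

Records flagged as 'NONE' can then be filtered out, leaving only the incremental scope to be processed further.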

Update? No, thanks!

In conventional database systems the update is an often dreaded operation, and rightfully so: it is very expensive in terms of processing time and resources. Sometimes, however, an update can be facilitated without actually updating the data sets; at other times there is just no way around it, in which case the optimal means of performing the update should be considered. As such, this is a comprehensive subject, and within this post I'll revisit some of the possible solutions in design and implementation.

Whenever there is a way to achieve the same functionality without actually performing database updates, it is probably best to go for it. This can be done by dividing the ETL process into multiple steps that insert exclusively into temporary tables on the way to the final data set. The outcome will almost certainly be a much better performing ETL process stack.

On a less trivial note, the same approach can be applied when one is required to, for example, refresh dimensional data, which usually means inserting new data and refreshing the existing data sets while maintaining referential integrity. The process in this scenario is to outer join all data from the source system to the data contained in the corresponding dimension and continue with the following actions (a sketch follows the list):

  • In case of existing records with no changes - skip them or take everything from the dimension
  • In case of updated records - take dimension specific data from DWH (for example surrogate keys) and take the rest from the source
  • In case of new records - generate dimension specific data and take the rest from source
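
A hedged SQL sketch of this classification, assuming an incremental source table STG_CUSTOMER, a dimension DIM_CUSTOMER with surrogate key CUSTOMER_SK, and that new surrogate keys are generated afterwards (for example by a Sequence Generator in the mapping):

    -- Keep existing surrogate keys, take attribute values from the source where available
    SELECT d.CUSTOMER_SK,                                  -- NULL for brand new records: generate later
           NVL(s.CUSTOMER_ID, d.CUSTOMER_ID)         AS CUSTOMER_ID,
           NVL(s.HOME_ADDRESS, d.HOME_ADDRESS)       AS HOME_ADDRESS,
           NVL(s.BILLING_ADDRESS, d.BILLING_ADDRESS) AS BILLING_ADDRESS,
           NVL(s.PHONE_NUMBER, d.PHONE_NUMBER)       AS PHONE_NUMBER
    FROM   STG_CUSTOMER s
           FULL OUTER JOIN DIM_CUSTOMER d ON d.CUSTOMER_ID = s.CUSTOMER_ID;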

After that, the course of action may vary - one could spool the entire data set derived from the three steps above to a temporary table and exchange the dimension and temporary table contents after the load, either by inserts or by database-specific functionality, for example Oracle's exchange partition statement. The latter requires the temporary table to have exactly the same structure as the dimension, and the dimension to be partitioned, usually by a dummy value that is equal for every contained tuple.
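
A minimal sketch of the exchange, assuming the dimension DIM_CUSTOMER is partitioned on such a dummy value into a single partition P_ALL and that TMP_DIM_CUSTOMER has an identical structure:

    ALTER TABLE DIM_CUSTOMER
      EXCHANGE PARTITION P_ALL WITH TABLE TMP_DIM_CUSTOMER
      INCLUDING INDEXES WITHOUT VALIDATION;

The swap is a data dictionary operation, so it completes almost instantly regardless of data volume.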

If there is no way around it, one can choose to perform a database update. This can be done after the three steps mentioned above, with the final result being a temporary table containing either the entire incremental scope or only the data set to be used for the dimension update, and the update itself facilitated by an SQL script executed as post-session processing. This again leaves scripting options to be exploited, such as the MERGE statement in Oracle.
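
For example, a post-session MERGE could look roughly like the following - a sketch assuming the same illustrative tables and columns as above:

    MERGE INTO DIM_CUSTOMER d
    USING TMP_DIM_CUSTOMER t
       ON (d.CUSTOMER_ID = t.CUSTOMER_ID)
    WHEN MATCHED THEN UPDATE SET
         d.HOME_ADDRESS    = t.HOME_ADDRESS,
         d.BILLING_ADDRESS = t.BILLING_ADDRESS,
         d.PHONE_NUMBER    = t.PHONE_NUMBER
    WHEN NOT MATCHED THEN INSERT
         (CUSTOMER_SK, CUSTOMER_ID, HOME_ADDRESS, BILLING_ADDRESS, PHONE_NUMBER)
         VALUES (t.CUSTOMER_SK, t.CUSTOMER_ID, t.HOME_ADDRESS, t.BILLING_ADDRESS, t.PHONE_NUMBER);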

All of the mentioned scenarios will lead to performance improvements compared to straightforward insert/update strategy mappings.

If you have any tips or doubts, please leave a comment below. If not, see you in Part 2 of the series, coming out soon.

SAP BusinessObjects: Through the Eyes of Critics and Clients

In our day-to-day life we are exposed to many purchasing decisions. Going to the supermarket to get some food or buying widgets in an internet shop is no longer a quick and easy task. The diversification of vendors forces us to go through a conscious or unconscious decision making process. It is always interesting to study the reasoning that leads to an action, especially for marketing people, and it is a social phenomenon in constant evolution, ranging from the traditional word of mouth to the most modern surfing of internet forums.

The same happens when a company has the opportunity to make a BI investment and has to choose a vendor that fulfills their particular needs. There are some specialized consulting companies that carry out complex market studies, but the feedback received from clients or contacts is just as influential to the decision making process.

This blog article is focused on the latest studies from Gartner and Forrester regarding SAP BusinessObjects.

The Forrester Wave: Enterprise Business Intelligence Platforms, released during the last quarter of 2010, considers SAP BusinessObjects to be leading the market together with IBM Cognos, Oracle and Information Builders, with SAS, Microsoft and MicroStrategy close behind. The 145 criteria used draw on vendor surveys, product demos and customer reference surveys. The end result is a graph that positions BI vendors in relation to their current offering and strategy.

SAP BusinessObjects is recognized as having very good BI tools for each job:

  • Crystal Reports as top report writer embedded in thousands of applications;
  • Explorer as a combination of the power of OLAP and the simplicity of search, exploration and analysis;
  • SAP BusinessObjects Dashboards is very popular amongst executives, who can view the generated Flash output in disconnected mode with no SAP software installed;
  • BEx, still the most widely used and popular query and analysis tool for SAP BW users;
  • Business Warehouse Accelerator, which uses the speed of an in-memory database to provide a powerful DBMS optimized for BI;
  • SAP is also leading in self-service BI software as a service (SaaS) offerings.

According to Forrester, the downside of having all these tools is poor product integration and a lack of object reuse from product to product, which is expected to improve with the new BI 4.0 release in Q1 2011.

Gartner’s Magic Quadrant for Business Intelligence Platforms, released in January 2011, names the same leading vendors as Forrester and also adds QlikTech thanks to its continued growth. It uses standard criteria divided into two categories: completeness of vision and ability to execute. According to the study, one notable strength is holding the largest share of the BI platform market (combining SAP NetWeaver BW and SAP BusinessObjects), with an installed base of over 46,000 customers and one of the largest channel and services ecosystems. Reporting and ad-hoc querying capabilities were rated highly by customers, whereas other SAP NetWeaver BW tools scored below the market average; in regards to OLAP, BEx Analyzer seems to be more widely used than Voyager. The complements to the BI platform are also seen as positives: collaboration and decision support (StreamWork), Text Analysis, search-based data exploration with the Explorer product, and the enterprise information management products covering data integration, lineage and impact analysis, and data quality. The vision for an in-memory appliance, HANA, promises to solve many of the perennial performance issues of large, complex BI deployments in general (given its SQL and MDX access for third-party BI tools) and SAP BW in particular. As also reported by Forrester, SAP has been one of the first of the leading BI vendors to introduce a SaaS offering, BusinessObjects OnDemand BI. In addition, SAP has made investments in mobile BI with its Sybase acquisition and the SAP BusinessObjects Explorer iPhone/iPad application.

In the cautions part of the report, Gartner mentions some concerns customers have about the roadmap following product changes to support optimizations with the SAP Business Applications and NetWeaver BW products, as well as a perceived higher cost of ownership. Migration, implementation and integration choices are reported to be confusing by some customers. Finally, ratings for support, software quality and sales experience are unfortunately low compared to other vendors in the survey. This is currently being addressed by SAP, which is giving more and more importance to channel partners like Clariba that can provide high-quality, uninterrupted support, much more closely aligned with customer needs.

If you have any questions or comments about the Gartner or Forrester summaries mentioned in this article, or would like more information about Clariba’s SAP BusinessObjects services, please leave a comment below or contact us at info@clariba.com.
