4 + 2 =
Solve this simple math problem and enter the result. E.g. for 1+3, enter 4.

You are here

Transformation is the Differentiator for ETL

When I first started writing about ETL and data warehousing, I wondered why testing was so complicated.  After all, the procedures involve making a copy, doing a cleaning, and making another copy.  It just does not sound that complex.  Isn't it just a really big cookie-cutter database storage device?  Isn’t ETL testing just copying and checking that "before" and "after" match up?  Then I started wondering about that “cleaning” part, transformation, the T in ETL.  What exactly happens there?  You read about the E, T and L steps, and it all sounds very generic and does not go into detail about what actually happens during transformation.  And then I figured out why.

The transformation is determined by the business needs, and the business needs vary wherever you are.  Your industry determines the kind of data you deal with, and how it will need to be used.  A medical database will need to find details of a patient’s medical records.  A retailer may need to track all individual POS sales, both to watch the financial side, and to observe the impact on inventory.  A voter registration database may need to give reports of people using various filters for marketing or polling.  And each of these systems has different details of what kinds of source data it needs as input and how the output data needs to be queried, all of which need to be checked for data quality before anything gets stored.

Some codes have built-in checksum digits, so that a simple mathematical calculation can ensure the validity of a code.  Imagine an inventory system where some new codes are keyed in manually.  At ETL time, the system could throw a warning that a UPC code is invalid (for UPC-A, all single digit mistakes can be detected).  A car dealer can discover that the VIN number for a vehicle was mis-keyed.  A book vendor can know not only that the ISBN number is wrong, but which character the mistake is in, if it is a single digit typo.  A broker or trader can detect an invalid CUSIP of a traded security.

For more complete validation, a more complex validator API must be used.  The codes must map to real things, regardless of whether or not there’s a checksum.  A Social Security Number in a patient or financial record must be an issued number, and should match up with the listed name.  A VIN must match to a real manufactured vehicle, instead of just having a valid checksum.  The UPC must match with a real product that is the one of current interest, and at POS, the cashier ensures that someone did not just switch the stickers.

But not all fields are identifier codes.  You may have a text field, where you want to ensure that only characters are limited to part of the kind listed here: upper case letters, lower case letters, digits, spaces, punctuation.  You may have a field that must have a non-NULL value for the report queries to work.  A date field may need to have standard format, such as MM/DD/YYYY or DD-MM-YY.  A patient’s birthdate must have a non-zero month that’s less than 13, a non-zero day that less than 32 (or less, depending on the month and if it is a leap year), and be less than or equal to the system date.  Perhaps a mandatory field, like a person’s name or transaction date or email, was left blank.

An individual record may be perfectly accurate, but there needs to be a change to the columns.  You may want to change the field order before putting things into storage.  You may need ensure that certain columns are strings of a specific length, so spaces may need to be padded onto the end.  You may need to standardize a field that can have multiple values meaning the same thing, such as TRUE, YES, Y, T, and 1, which may all translate to a positive response.  Perhaps “unit count” and “price per unit” fields go away, to be replaced by “total spent”.  Columns may need to be converted into CSV’s, or vice versa.

You may need to add, delete, or merge records before entry.  You may need to deduplicate entries, so that you don’t end up mailing multiple catalogs to the same person who just happened to place more than one order.  You may want to aggregate records (tracking daily sales or by-customer-per-month sales instead of individual sales, say).  Going the other direction, records with a CSV webcart field may need to be split apart, so a separate record can be made for each individual item sale, for a database tracking individual item sales instead of just a single checkout record.  In fact, the business need may do a combination that aggregates, splits and dedups.

Maybe business requires the new data set to be sorted by a particular field before the load, to cut down on sorting during report generation.

There is a lot more to ETL than just copying and storing.  Transformation is an important factor in determining how ETL needs to work for you.


About The Author: 

Scott Andery is an expert marketer and author who specialize in software testing tools and resources.



Data Warehouse Testing

List mode
Table mode
Icons mode

Displaying 1 - 8 of 8


Testing tool manufacturers world-wide list
10Levels ABID CONSULTING AccelQ Accord Software ActiMind AdaCore
AdaLog AgileLoad AgileWay Agitar Algorismi ALL4TEC
Android Apache Apica Apollo Systems AppAssist.mobi Applitools
AppPerfect Appsee ApTest Assertible Assure Atlassian
AutoIt Consulti .. Automation Anyw .. Automation Cons .. Aztaz Software Backtrace I/O Badboy
BlazeMeter Borvid BrowserStack BSQUARE BStriker Intern .. CA Technologies
Canonical Canoo Catch Limited CelestialTeapot Celtic Testing .. Chris Mallett
Cleanscape CloudQA CodeCentrix CodePlex projec .. Codoid Cogitek
Compuware Configure IT Conflair ConSol Core Services Countersoft
CresTech Softwa .. Cross Browser T .. Crosscheck Netw .. Crowdsourced Te .. Cucumber Ltd Cyara
Cygnet Infotech DareBoost Databene Datamatics Glob .. DevExpress DTM soft
Dynatrace LLC EasyQA Eclipse EKA TechServ Elvior Emmanuel Jorge
Empirix EPAM Systems Equafy Esterel eXept Software .. Experitest
Finaris Froglogic FrontEndART Ltd GeneXus GitHub project gnoso
Google Code Pro .. GrammaTech Gurock Software Hewlett Packard .. Hexawise High-Tech Bridg ..
Hiptest Hitex IBM Rational imbus Shanghai Impetus Inflectra
informUp INTALIO intechnica InTENSO - IT Ex .. Ipswitch Jamo Solutions
JANOVA JAR Technologie .. JBoss Developer jClarity Jellly.io JetBrains
Jively jQuery foundati .. JS Foundation Jspresso Kanoah Software KMS Technology
Kualitee LDRA Limited Litmus LoadFocus Loadster Perfor .. MarathonITE
Marketcircle Marketcircle Maveryx Meliora Ltd Micro Focus Sof .. Microsoft
Mobile Labs Mobile1st Mockaroo, LLC ModalLogic IT Monkop Inc Mozila
MSys Technologi .. Navicat NeoTys NetCart Netvantage Tech .. NORIZZK.COM
Novosync Mobili .. NRG Global NTT Resonant OC Systems Odin Technology OpCord
Oracle Orcanos Original Softwa .. Ossia Conseil OW2 PANAYA
Parasoft PassMark Patterson Consu .. Perfecto Mobile Pivotal Labs Plutora
Polarion Postman (API To .. PractiTest PrimaTest Process One Programming Res ..
Psoda PureLoad PushToTest Python Q-Assurance QA Systems
QACube QASymphony QAWorks QMetry Quali QualiTest
Qualitia Softwa .. Quality First S .. Quotium RadView Softwar .. Ranorex RedLine13
Reflective Solu .. ReQtest RevDeBug Robotium Tech Rogue Wave Soft .. Rommana Softwar ..
RTTS ruby-doc.org Runscope Sandklef GNU La .. Sauce Labs Seapine Softwar ..
SeleniumHQ Sencha Sensiple SmartBear Softw .. SmarteSoft SOASTA
SoftLogica Softomotive Softsmith SolutionSoft Sy .. SonarSource Sourceforge
SqashTeam SQS Software Qu .. Square Stimulus Techno .. Swifting AB Synopsys
T-komp T-Plan TechExcel TechTalk Telerik By Prog .. Tellurium
Test Collab Test Goat Test Recon TestCaseLab testCloud.de Gm .. Testenium
Testim.io Testing Technol .. TestingBot TestLodge Testmunk TestObject GmbH
TestOptimal TestPlant TestPro Testuff The Core Bankin .. The MathWorks
The Open Group thePHP.cc Thoughtbot Thoughtworks Tigris.org Time Simulator
Top-Q Trace Technolog .. TrendIC TRICENTIS Tritusa Pty Ltd TWD Solutions P ..
TypeMock Tyto Software Ubertesters UniTESK Universal Test .. Usetrace Ltd
Utrecht Univers .. Validata Group Vanamco AG Vector Software Veracode Verifaya Corpor ..
Verit VersionOne Viewlike.us Vornex Inc. Watir.com WcfStorm Soluti ..
We Are Mammoth Web Performance .. Wikidi Wintask Wireshark Found .. Worksoft
Xceptance XebiaLabs XK72 Xpand IT XQual YesSo
ZAPTEST Zeenyx Software .. Zephyr Zeta Software zutubi pty

Theme by Danetsoft and Danang Probo Sayekti