Role ING Investment Management
Introduction
I worked for ING Investment Management from June to November 2009.
Prinses Beatrixlaan 15
2595 AK The Hague
Netherlands
At ING I worked initially in the SI&D team (Systems Integration and Development), then moved to the ASD team under **Thomas.
In summary, my duties were to:
1. The night run. Check and manage the various SQL scheduled jobs that ran overnight each night on the SQL 2000 data warehouse (the 500Gb "Delta" and smaller "ODS" databases), the main one being the "night run" (0045_xtwtfsx_DWH_PCK_MAIN_DAG or DWH_PCK_MAIN_DAG), which loaded data into the data warehouse from source files provided by Cloverleaf.
2. Implement the Informatica Powercentre replacement for 1.
3. Investigate problems reported by users in regard to bad data in reports generated by applications that get data from stored procedure calls in the data warehouse. This would typically require a fix in the stored procedure code.
4. Other duties: twice a month I had to come in early (6.30am) and do "the morning checks".
In more detail:
1. The Night Run
The purpose of the Delta database was to carry day-to-day changing data, and historical data, about funds under management. ODS carried generally static customer information.
The night run worked as follows: Files were delivered by upstream applications (such as Cloverleaf) into an input directory, then the night run would run at a scheduled time to import the data from the files into staging tables in the Delta database, and then copy from staging to "GEG" (data) tables, then from GEG to INF (fact/dimension) tables.
This job would typically take 8 hours to run, and this caused problems because it would often still be running in the morning, resulting in:
1. users getting bad data when they ran their own reports on non-updated data, and
2. the chance of blocking and deadlocking occurring between it and their own (read-only) report runs.
3. slow system operation as a result of the night run using up system resources.
My job was to try to work out why it was taking so long to run, and speed it up.
Description of the Night Run
1. Files (such as corbi/corba files, multifonds MF, ??) are delivered by upstream software (Cloverleaf) into source directories. Each file has a source code such as MF for "Multifonds" or FK for "Finance Kit", or SKD for "Simcorp Dimension".
Typical names for these files were: BNP_PORTFOLIO.txt, FK_CORBA.txt
2. The "night run" scheduled job starts.
3. The first part of the job consists of stored procs (or DTS packages, I can't remember which) that read the data from each of these files and load it into the staging tables. Each file source code (such as FK) has its own staging table, and the staging table is truncated before new data is loaded into it. The staging tables had names in the format DWH_TB_????_STAG (DWH for data warehouse, TB for table, then the subject area name, then STAG for staging). A rough sketch of one such load appears at the end of this step.
Once processed the files are moved into a DONE directory.
Once processed, an entry is made to a logging table to say the time at which the data was loaded in correctly.
The subject areas themselves are read from a configuration table.
So at the end there should be 10 or so files processed and loaded into staging.
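A rough sketch of what one of these staging loads looks like in T-SQL (the table names, file path, delimiter options and logging table/columns here are illustrative guesses, not the real ones):
-- Empty the staging table for this subject area, then load the day's file
TRUNCATE TABLE DWH_TB_FK_STAG
BULK INSERT DWH_TB_FK_STAG
FROM '\\fileserver\dwh_input\FK_CORBA.txt'
WITH (FIELDTERMINATOR = ';', ROWTERMINATOR = '\n', FIRSTROW = 2)
-- Record that this subject area loaded correctly (hypothetical logging table)
INSERT INTO DWH_TB_LOG_LOAD (bron, laadtijd, status)
VALUES ('FK', GETDATE(), 'OK')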
4. Once all of the files have been loaded, the next step is to load the data into the GEG (data) tables. These had names in the form: DWH_TB_????_GEG
This is done for each file type (e.g. FK) by running a stored procedure that reads the STAG table and inserts/updates the equivalent data in the GEG table (a rough sketch of the pattern follows this step). This process was fairly lengthy and took 20+ minutes for some of the larger tables of 10 million rows.
When I was analyzing its performance for one of them I found it was generating a cartesian product during part of the query plan, with 1 BILLION rows. No wonder it was taking 20 minutes and steam was coming out of the servers.
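The STAG-to-GEG procedures broadly followed an update-then-insert pattern (SQL 2000 has no MERGE). A minimal sketch of the idea, with invented column names and a made-up staging table name following the naming convention:
-- Update rows that already exist in the GEG table
UPDATE g
SET    g.koers = s.koers,
       g.valuta = s.valuta
FROM   DWH_TB_GEG_INSTRUMENT g
JOIN   DWH_TB_INSTRUMENT_STAG s ON s.instrument_code = g.instrument_code
-- Insert rows that are new
INSERT INTO DWH_TB_GEG_INSTRUMENT (instrument_code, koers, valuta)
SELECT s.instrument_code, s.koers, s.valuta
FROM   DWH_TB_INSTRUMENT_STAG s
WHERE  NOT EXISTS (SELECT 1 FROM DWH_TB_GEG_INSTRUMENT g
                   WHERE g.instrument_code = s.instrument_code)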
5. Once that was complete, the final stage was to update the INF tables from the newly updated GEG tables. Names are in the form: DWH_TB_????_INF. The INF tables contain the same data as the GEG tables except that the INF tables are in data warehouse form, meaning the fact tables hold ID values in place of the actual data values, with the values corresponding to those IDs held in the (several) dimension tables (a sketch of this lookup follows this step).
This job also took 20 minutes for some of the larger tables (total 3 hours over all files).
Some of them took hours, for example DM_IC_FACT_POSITION_DISCRETIONARY
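The GEG-to-INF step is essentially a dimension-key lookup before inserting into the fact table. A minimal sketch of the idea, assuming hypothetical dimension tables and column names (the GEG and fact table names are the real ones mentioned elsewhere in these notes):
-- Replace natural business keys with dimension IDs while loading the fact table
INSERT INTO DM_IC_FACT_POSITION_DISCRETIONARY (portfolio_id, instrument_id, positie_datum, marktwaarde)
SELECT p.portfolio_id,        -- surrogate key from the portfolio dimension
       i.instrument_id,       -- surrogate key from the instrument dimension
       g.positie_datum,
       g.marktwaarde
FROM   DWH_TB_GEG_PORTFOLIO_POSITION_DISCRETIONARY g
JOIN   DIM_PORTFOLIO  p ON p.portfolio_code  = g.portfolio_code
JOIN   DIM_INSTRUMENT i ON i.instrument_code = g.instrument_code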
6. Once the job was complete it was necessary to fill in an Excel spreadsheet every morning, listing what had gone right and what had gone wrong, and email it to other users.
There were many problems with the night run, because often files would be delivered late (so they were not present when the job ran, resulting in errors) or with formatting errors due to unexplained changes in the file format from upstream software changes. If that happened, the night run needed to be re-run, which meant running it in the afternoon so that it completed before the next night's run. Running it in the afternoon made the system slow, which users were not happy about. The stored procedure code was very complicated, so it was challenging to identify the areas that were causing the slow running.
To see how the job ran the previous night, run: exec DWH_SP_RUN_CHECKSTATUS.
Note that each time the night run runs, it marks new data with a unique "runnummer" (run number). This is so you can see how data got updated in the database, and can investigate faults by finding the source file corresponding to the runnummer in the archive.
SELECT * FROM DWH_TB_REP_RUN_PROCES
WHERE status > 30
ORDER BY runnummer DESC
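For example, once you have the runnummer of the failing run, you can pull out the rows that run touched (assuming, as I recall, that the loaded tables carry the runnummer column; the runnummer value here is made up):
SELECT *
FROM   DWH_TB_GEG_PORTFOLIO_POSITION
WHERE  runnummer = 12345   -- runnummer taken from DWH_TB_REP_RUN_PROCES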
Live server: 10.120.3.47 (BS-PIMSQL01)
EXEC DWH_SP_RUN_MOVE_NEXT_DAY moves the files to their correct new location.
xxxx_mtwtfss_CHECK_FILES_NEXT_RUN is a job that runs every 5 minutes and moves files into their correct locations ready for action.
Another explanation of the process is:
Looking at the main DTS package DWH_PCK_MAIN_DAG it can be seen that it runs 4 packages:
DWH_PCK_STAG_ALL
DWH_PCK_LOAD_ALL
DWH_PCK_CALC_ALL
DWH_PCK_INF_ALL
Then each of the above runs a number of packages in a loop, for example the longest-running (2.5 hours) one named DWH_PCK_LOAD_IC_FACT_POSITION_DISCRETIONARY
Looking inside this package, it appears that a stored procedure within it is what takes the time:
DWH_SP_LOAD_IC_FACT_POSITION_DISCRETIONARY
When this stored procedure runs, it declares a cursor based on the values in:
from DWH_VW_LOAD_IC_FACT_POSITION_DISCRETIONARY
where ind_ins_upd = 'U'
This view joins tables: DM_IC_FACT_POSITION_DISCRETIONARY and view DWH_VW_TRANSFORM_IC_FACT_POSITION_DISCRETIONARY (DWH_TB_GEG_PORTFOLIO_POSITION_DISCRETIONARY and others)
Then it calls stored procedure DWH_SP_API_IC_FACT_POSITION_DISCRETIONARY_UPD for each of these (this updates certain values in DM_IC_FACT_POSITION_DISCRETIONARY)
Finally it inserts the data from the view into the large table.
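In outline, the procedure does something like the following (a simplified reconstruction from memory; the cursor column, @positie_id variable and the final insert are illustrative, not the real code):
DECLARE @positie_id int
-- Cursor over the rows that already exist and need updating
DECLARE upd_cursor CURSOR FOR
    SELECT positie_id
    FROM   DWH_VW_LOAD_IC_FACT_POSITION_DISCRETIONARY
    WHERE  ind_ins_upd = 'U'
OPEN upd_cursor
FETCH NEXT FROM upd_cursor INTO @positie_id
WHILE @@FETCH_STATUS = 0
BEGIN
    -- Update the existing fact row via the API procedure
    EXEC DWH_SP_API_IC_FACT_POSITION_DISCRETIONARY_UPD @positie_id
    FETCH NEXT FROM upd_cursor INTO @positie_id
END
CLOSE upd_cursor
DEALLOCATE upd_cursor
-- New rows ('I') are then inserted from the view in one set-based statement
INSERT INTO DM_IC_FACT_POSITION_DISCRETIONARY
SELECT * FROM DWH_VW_LOAD_IC_FACT_POSITION_DISCRETIONARY WHERE ind_ins_upd = 'I'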
Every month it was necessary to configure the daily run to do extra work in the form of the "month run". This was done by getting some files off the network (check with **James Vine first), loading them into some staging tables to check for consistency (use DWH_PCK_STAG_PI_TBL_NORMEN to do this), and then enabling and running the month run as part of the night run.
Also the monthly "Performa" job: yyyy_xxxxxxx_DWH_MAAND_PERFORMA
IC Flex/ IC Web jobs
Informatica Powercentre
This was the new way of loading the data from the source files provided by Cloverleaf into the Data Warehouse.
It differed from the scheduled method of loading in that it worked by being triggered by the appearance of trigger files in the source directory, which was being monitored at all times. Once triggered, Powercentre loaded the files into the staging tables, and then when that was complete it loaded from staging to GEG, and then GEG to INF.
Powercentre had intricate configuration, in that there were objects for each file, STAG table, GEG table, and INF table, each with a list of its columns; workflows showing lines linking the columns in one object with the columns in the next object; and objects representing transformations of the data as it passed through the various stages.
Powercentre has four modules (the following information is a bit incomplete):
1. PowerDesigner. Where the packages are designed
2. Workflow. Where the workflows are designed
3. Monitor. Where running jobs and their current activity are shown
4. ?
Note that Powercentre does everything with direct SQL, not stored procedures, so there is some limitation on performance tuning in this regard.
Powercentre was designed to handle files sent in the new format. Some of the files are already being sent in the new format. For example:
GMDB.MF.PRICE.SOD[T].Publish.xml
GMDB.MF.PRICE.SOD[T].Exception.xml
This means: GMDB Multifonds price Start of day. (other files are end of day, ??)
Bad values in reports
There were several instances where I had to investigate problems reported by users where reports had incorrect values in them. This was standard SQL work, looking at the data and working out where the error came from.
For example **Sheila Patterson had a report "DNB AO-OK" (a report for the Dutch central bank) that was missing some data. I traced it back to a stored procedure:
exec DWH_SP_EXP_DRA_FORM_AO_OK_M 'IF0951', '2009-04-30'
and found that the date criterion in a WHERE clause used a < operator where it should have been <=. I raised a change request to get it changed.
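The kind of difference involved was along these lines (a simplified illustration with invented column and parameter names, not the real procedure code):
DECLARE @rapportagedatum datetime
SET @rapportagedatum = '2009-04-30'
-- Bug: positions dated on the report date itself were excluded
SELECT * FROM DWH_TB_GEG_PORTFOLIO_POSITION WHERE positie_datum <  @rapportagedatum
-- Fix: include the report date
SELECT * FROM DWH_TB_GEG_PORTFOLIO_POSITION WHERE positie_datum <= @rapportagedatum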
Relevant major tables in the Delta database are:
DWH_TB_GEG_PORTFOLIO_POSITION_DISCRETIONARY (250 million rows using 170Gb of space)
DM_IC_FACT_POSITION_DISCRETIONARY (250 million rows using 180Gb of space)
DWH_TB_GEG_INSTRUMENT_MASTER
DWH_TB_GEG_INSTRUMENT
DWH_TB_GEG_PORTFOLIO_POSITION
Doorpricken was a big problem. It was to identify the "parent portfolios" for a particular instrument: first find the portfolios that have the instrument as a member, then find the other portfolios that have those first portfolios themselves as members.
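In SQL terms, a doorpricken lookup for one instrument is roughly the following (a sketch assuming a hypothetical portfolio-membership table; I do not recall the real table or column names):
DECLARE @instrument_id int
SET @instrument_id = 42   -- the instrument being investigated (illustrative value)
-- Level 1: portfolios holding the instrument; level 2: portfolios holding those portfolios
SELECT m1.portfolio_id AS direct_parent,
       m2.portfolio_id AS parent_of_parent
FROM   DWH_TB_GEG_PORTFOLIO_MEMBER m1
LEFT JOIN DWH_TB_GEG_PORTFOLIO_MEMBER m2
       ON m2.member_portfolio_id = m1.portfolio_id
WHERE  m1.member_instrument_id = @instrument_id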
ISIN numbers.
Eindstand and Beginstand (end balance and start balance values from one month to the next).
Other duties
When databases were restored from live to the staging systems, it was necessary to reattach the databases and reassign the database users to the SQL Server logins. This was done by running, for each user:
1. EXEC sp_change_users_login 'Auto_Fix', 'sdt_user'
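To see which users are orphaned in the first place, the same procedure can be run in report mode:
EXEC sp_change_users_login 'Report'   -- lists database users whose SIDs no longer match a server login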
Any changes that had to be made had to be approved by a change manager. Emailing the SI&D change request address sends the request to a change person in SI&D, who checks it and forwards it to the change manager.
Problems faced and fixed
1. The night run was failing in one of the DTS packages, DWH_PCK_LOAD_COUNTERPARTY. The cause was an invalid country code in the source data, which resulted in a foreign key violation when the stored procedure tried to insert the new row into the GEG table (the country column had a foreign key link to the country table, DWH_TB_GEG_COUNTRY, and that country table was missing a code for that country).
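A quick way to find the offending rows is a NOT EXISTS check against the country table (the staging table and column names here are illustrative; DWH_TB_GEG_COUNTRY is the real country table):
SELECT DISTINCT s.country_code
FROM   DWH_TB_COUNTERPARTY_STAG s
WHERE  NOT EXISTS (SELECT 1
                   FROM DWH_TB_GEG_COUNTRY c
                   WHERE c.country_code = s.country_code)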
2. A problem where files started to be sent with or without header rows, causing the DTS imports to fail or the first row not to be imported.
3. Duplicate rows in source files causing multi-column unique or primary key constraint violations. This occurred as a result of the practice of concatenating files of the same name from different days into a single file.
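Finding the duplicates was a standard GROUP BY / HAVING exercise (again with illustrative table and column names):
SELECT portfolio_code, instrument_code, positie_datum, COUNT(*) AS aantal
FROM   DWH_TB_INSTRUMENT_STAG
GROUP BY portfolio_code, instrument_code, positie_datum
HAVING COUNT(*) > 1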