Role Autotrader

Introduction

I worked for Trader Media, the publisher of the "AutoTrader" magazine and website, in Thatcham, Berkshire, from January 2008 to January 2009.

I was in the "Group IT" department.

Futureproof and CMS databases

The "Futureproof" (FP, or TABS) database is a 500Gb SQL Server 2005 database (after upgrade from SQL Server 2000) holding trade (not public) advert bookings for Autotrader magazine and the Autotrader website.

Data was read from the database either by heavy-duty reporting applications (such as the "cutting sheet report", which created a blueprint file to be sent to the printers for production of the magazine), or, I think, by a feed into downstream systems supplying the Autotrader website CMS, supported by the Newton office (**Stuart Brave).

The main application used by users was a GUI application, also called "Futureproof", produced by a software company in Wokingham or Bracknell called Miles33.

The CMS3000 (Credit Management System) database is smaller (150 GB) and holds customer account information. The CMS_Schema database simply tells the application which CMS3000 database to connect to.

The database is backed up once a day; full recovery mode is in use, with transaction log backups every 20 minutes.
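
For comparison, a conventional native SQL Server implementation of that schedule would look something like the following (the database name and backup paths are illustrative; as described below, the actual backups here were filer snapshots rather than native backups):

 -- Put the database into full recovery mode (illustrative database name)
 ALTER DATABASE Futureproof SET RECOVERY FULL;

 -- Daily full backup, run once a day from a SQL Agent job
 BACKUP DATABASE Futureproof TO DISK = N'G:\Backups\Futureproof_full.bak' WITH INIT;

 -- Transaction log backup, run every 20 minutes from a SQL Agent job
 BACKUP LOG Futureproof TO DISK = N'G:\Backups\Futureproof_log.trn' WITH INIT;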

The backups are not normal SQL Server backups. Rather, they are "filer snapshot backups" implemented with special operating system commands (I think) inside the SQL scheduled jobs themselves. This is a hardware-implemented backup methodology: the drive where the files are located is on NetApp (www.netapp.com) "filer" hardware, connected to the servers by a SAN (Storage Area Network) using, I think, iSCSI transport (SCSI over the network), where a LUN is the number of the drive (LUN 1, for example). When a backup is run, the filer takes a "snapshot" of the drive and all the data on it. This does not mean it takes a copy of the data; rather, it takes a note of where all the data is, on a block-by-block basis. Subsequently, as any process (SQL Server or not) adds or updates data on the filer, the filer writes the change to a new (empty) location on the filer disk and marks the original data block as "archived" (or something similar), thereby preserving it for future use such as data recovery. As far as applications running on Windows are concerned, this process is invisible and all they see is the updated data. If data is deleted, the space is not freed up to the operating system.

Then, with each successive backup (database or transaction log), progressively more space is "frozen".

Of course you do eventually need to free up the space, and this is done within the SQL scheduled jobs by adding a parameter to the filer backup command to delete snapshots older than, say, 7 days or 7 backups.

These backups can be managed manually using the SnapManager application.

Because the backups are essentially additive on the system, the free disk space needs to be significantly larger than the size of the databases being backed up in this way; probably double. **Jeff frequently complained that drive space was getting low, causing errors to be logged on the filer controllers.

All of this is duplicated thanks to a Microsoft Cluster Server setup.

Replication is configured to copy all data to a DR (disaster recovery) site at Telecity in Manchester.

Replication involves the log reader agent on the distribution server reading the transaction log of the publisher database and applying the transactions into the distribution database. The distribution process then reads these transactions and commands from the distribution database and applies them to the subscriber on a row-by-row basis. To enable this, the replication software installs stored procedures into the subscription database, which are run by the distribution job to insert, update, or delete rows; there are three such stored procedures for each subscribed table.
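
As a rough sketch of those subscriber-side procedures: transactional replication names them sp_MSins_/sp_MSupd_/sp_MSdel_ per table, and they look something like the following (the table, columns, and parameter lists are illustrative; the real generated procedures carry more logic):

 -- Illustrative shapes of the three auto-generated subscriber procedures
 -- for a hypothetical replicated table dbo.Advert
 CREATE PROCEDURE dbo.sp_MSins_dboAdvert
     @c1 int, @c2 varchar(512)
 AS
     INSERT INTO dbo.Advert (AdvertID, Description) VALUES (@c1, @c2);
 GO

 CREATE PROCEDURE dbo.sp_MSupd_dboAdvert
     @c1 int, @c2 varchar(512), @pkc1 int
 AS
     UPDATE dbo.Advert SET AdvertID = @c1, Description = @c2 WHERE AdvertID = @pkc1;
 GO

 CREATE PROCEDURE dbo.sp_MSdel_dboAdvert
     @pkc1 int
 AS
     DELETE FROM dbo.Advert WHERE AdvertID = @pkc1;
 GO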

There was also replication into an Oracle database subscriber. This caused problems because the SQL statements run over the Oracle OLE DB link against the Oracle subscriber assumed a very basic table schema that did not reflect what SQL Server would actually be trying to insert. So if you set up replication to an Oracle subscriber and accepted the defaults, replication would install fine, but when you started to replicate data, Oracle would give a table column mismatch error, roll back the stored procedure call (made over the Oracle OLE DB link into the Oracle database) and cause the distribution process to fail with an error.

The way around this problem is to:

1. Create your own Oracle script with the correct column datatype definitions (see the example Oracle script after this list).

2. Run the SQL Snapshot job to generate the snapshot files in the snapshot directory (including the auto-generated scripts that create the bad Oracle tables)

3. Copy the correct Oracle scripts (created in 1. above) into the snapshot directory to overwrite the bad ones, and

4. Run the distribution job as normal. This job will run all the files in the directory on the destination Oracle server and thus set up the correct tables on the Oracle database.
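
As a sketch of step 1, the hand-written Oracle script just needs to create the destination table with datatypes and lengths that actually match what the distribution agent will send. The table and columns below are illustrative, not the real Futureproof schema:

 -- Illustrative replacement for one of the auto-generated snapshot scripts.
 -- The key point is that the Oracle datatypes match the SQL Server publisher
 -- columns, e.g. VARCHAR(512) on the publisher becomes VARCHAR2(512) here.
 CREATE TABLE ADVERT
 (
     ADVERT_ID    NUMBER(10)    NOT NULL,
     DESCRIPTION  VARCHAR2(512),
     BOOKED_DATE  DATE,
     CONSTRAINT PK_ADVERT PRIMARY KEY (ADVERT_ID)
 );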

In addition, some other tweaks had to be made, such as adding the -QuotedIdentifier switch to the distribution job to handle quotes correctly.

There were a number of instances of "slow running" on the system. These were often not properly diagnosed, but some of them were caused by a user running a report that locked a vital part of a table, leaving a queue of client requests blocked behind it for several minutes. Part of the slow running was also attributable to a bad network on the filer connection (which had its own dedicated backbone network), and to general network connectivity issues.
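
A minimal sketch of how that kind of blocking could be spotted on SQL Server 2005, using the dynamic management views (nothing here is specific to Futureproof):

 -- Show requests that are currently blocked and which session is blocking them
 SELECT r.session_id,
        r.blocking_session_id,
        r.wait_type,
        r.wait_time,
        r.command
 FROM   sys.dm_exec_requests AS r
 WHERE  r.blocking_session_id <> 0;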

Typical server names are: GTC-SQLTAB01A and B, TCY-SQLTAB01b ??

Also Virtual Servers: TMG-SQLTAB01, TMG-SQLTAB02

And SQL Instances (both on SQL Server 2000 and SQL Server 2005): TMG-SQLTAB01\FPLIVE or TMG-SQLTAB02\CMS3000

AgentBooking application. Uses a Crystal Reports component.

Also the FutureProof_Reports database.

The PAF database is the 4 GB postcodes database. I wrote a .dtsx SSIS package to load the postcodes from six source files into the PAF database once a month (when the files were released by the Post Office).
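
The actual load was an SSIS package, but as an illustration of the same idea, one of the monthly source files could be loaded into a staging table with plain T-SQL along these lines (the file path, table name, and delimiters are all assumptions):

 -- Illustrative load of one PAF source file into a staging table
 BULK INSERT dbo.PAF_Staging
 FROM 'D:\PAF\paf_file1.csv'
 WITH (
     FIELDTERMINATOR = ',',
     ROWTERMINATOR   = '\n',
     FIRSTROW        = 2   -- skip the header row
 );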

CustomerCanvass database.

Also staging, training, dev, and test servers (databases also renamed to FPTraining etc...)

There were two instances of SQL Server running either side of the cluster, one with FP on it, and the other with CMS on it.

SSMS: SQL Server Management Studio.

Upgrade from SQL 2000 to SQL 2005 for Futureproof & CMS

The original production system was at the Telecity (TCY) externally managed colocation data centre in Manchester. Due to insufficient power infrastructure there to support the servers' power and air conditioning requirements, it was not possible to add the new servers needed for the upgrade, so a decision was taken to run the production servers instead in the Getronics (GTC) managed datacentre in Runcorn, which is also an externally managed site.

GTC is a modern datacentre with very high levels of physical security, requiring PIN-code, fingerprint, and card-swipe authorisation to get into the main server room from the public areas of the building. Getting from outside into the building itself required human authentication by the reception security guard, plus additional checks.

With this upgrade it was decided also to move to SQL Server 2005.

What were the upgrade considerations? Converting the existing DTS packages to SSIS is the big part of any such upgrade.

The upgrade considerations were:

1. Convert existing DTS packages to SSIS

2. Establish the logins on the new SQL Server machine, and re-associate them with the database users using EXEC sp_change_users_login 'Auto_Fix', 'sdt_user' (see the sketch after this list)

3. Identify all applications that use the system, and ensure their credentials (standard or integrated security) are well understood and planned for in the migration. This would have prevented a problem where a Crystal Reports application was not correctly reconfigured for the new system and was unable to connect after the upgrade.

4. Do the client applications work with the latest SQL client drivers? And indeed with the SQL Server 2005 database itself?

5. Do scheduled tasks continue to work? One problem we had was a cleanup task that failed under SQL Server 2005, causing the database to mushroom in size: the job archived old entries out of the main car images table, the largest table in the database, so when it stopped running the database kept growing.
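
For item 2, a sketch of the orphaned-user fix on the new server (the user name is just the example from the list above):

 -- List database users whose SIDs no longer match a login on the new server
 EXEC sp_change_users_login 'Report';

 -- Re-link an orphaned database user to the login of the same name
 EXEC sp_change_users_login 'Auto_Fix', 'sdt_user';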

Agresso SQL 2000 database

My main duties were investigating reported anomalies in reports and reproducing the data by running the stored procedures that generate it, in order to identify problems.

I created a job, scheduled to run every minute, to process text files (exported from Excel by users) and import them into the Agresso database. The job used VBScript to process each line of each file sequentially.

Other duties

I configured replication from the SQL Server databases to the Carzone system, which is a SQL Server in the Republic of Ireland (so the data went over the WAN link to there).

Carzone is a big car sales website in Ireland, and carries data fed to it by Trader Media.

Two tables: Vehicle, and Image (contains an image datatype column, for the photos of the cars).

This is in fact an example (from above) of data coming FROM Oracle (at Trader Media) TO Carzone (SQL Server). The distribution database is on SQL Server (server TCYWZD003).

iLO - "Integrated Lights-Out": http://en.wikipedia.org/wiki/HP_Integrated_Lights-Out

This is a second network connection to a server that remains powered on even if the main server is shut down. You can connect to the IP address associated with this second network port, and see a controller web page that gives you access to the basic functions of the server including reboot, boot, and shutdown features. You can also get console functionality, but you might need an advanced licence for this.

Websense, Citrix

Problems faced and fixed:

1. A common problem was where SQL Server had a row of data that was too long for the replication destination Oracle table to hold. It was found that the publisher table column containing the data was too large - larger than the spec for Futureproof allowed. For example, there was a VARCHAR(1024) column where the spec said it should only hold 512 characters; it replicated into a 512-character Oracle column. The publisher column was allowed to be 1024 rather than 512 because it was stated that the Futureproof client GUI application (400 users) would restrict the input to 512 characters. In fact it didn't, and a 700-character string found its way into the table and promptly broke the replication.

The workaround was to manually edit the string down to under 512 characters in the publisher table, tell users not to exceed 512 characters, put a daily check in place for >512-character entries in the table, and remove the offending entry from the distribution database so as to allow replication to continue.
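
A sketch of that daily check, assuming a hypothetical table and column (the real names were in the Futureproof schema):

 -- Flag any rows whose text has crept back over the 512-character limit
 SELECT AdvertID, LEN(Description) AS DescriptionLength
 FROM   dbo.Advert
 WHERE  LEN(Description) > 512;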

2. Replication distribution failing because of an UPDATE statement for a row that didn't exist on the subscriber. The workaround here was to add the row to the subscriber manually and let the command run through, then try to find out why it was out of sync in the first place. The row might not itself be present on the publisher, because it might be a row that was created and then deleted. So once you let replication catch up with itself you might well see the row disappear from the subscriber again, this time completely legitimately.

Another thing you can do is add the -SkipErrors 20598 flag to the distribution job. But be careful, because it will also skip other errors that you DO want to investigate, leaving the subscription database in a potentially indeterminate state. You can see how many transactions have been skipped by looking in the distribution.dbo.MSrepl_errors table.
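
A sketch of how to review what has been recorded or skipped (MSrepl_errors lives in the distribution database; the columns below are a useful subset):

 -- Most recent replication errors logged by the agents
 SELECT TOP 50 time, error_code, error_text, xact_seqno
 FROM   distribution.dbo.MSrepl_errors
 ORDER BY time DESC;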

3. When applying the snapshot from the Oracle publisher to the Carzone SQL Server database, there was a single command lasting over 30 minutes on the network connection between them while the single batch was applied. Unfortunately there was a router in the path between them, and this router had a 30-minute timeout on traffic, so at 30 minutes the connection would abruptly drop - disconnected by the router.

The problem was worked around by configuring a different route between the two servers so that the traffic avoided that router.

This was done by reconfiguration of the VLAN (virtual LAN).

4. Replication filter problem: when we installed replication on the new 2005 server, the log reader started failing with the message:

"log reader cannot run sp_replcmds" or similar.

The cause was an error in the manually created scripts used to install the publisher side of replication. Specifically, the code that creates the stored procedures used by the log reader to put data INTO the distribution database had gone wrong for one of the articles. The code was there, but it failed when run because it mistakenly had the wrong filter name assigned to it, a name that was already correctly used for a previous stored procedure, so it gave a "duplicate" error.

And since the filter applied to just one table, which was quite non-volatile, the problem didn't come to light for some days.
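
One way to sanity-check each article's filter after installing the publisher side is sp_helparticle, along these lines (the publication and article names are illustrative):

 -- Run in the publication database; the output includes the article's filter details
 EXEC sp_helparticle @publication = 'FP_Publication', @article = 'Advert';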

5. Problems caused by not being able to alter the schema of tables that are marked for replication.

6. Problems with large transaction log dumps. On investigation I found a SQL statement that was running every 5 minutes and inserting then deleting 170 MB of data each time. It turned out to be an Agresso stored procedure, and on speaking to the user it emerged that it was pushing in all of its data each time, not just increments. He changed its configuration and this dropped to 500 KB every 5 minutes, resolving the main large transaction log dumps problem.
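
A minimal way of keeping an eye on that sort of log growth, using only standard SQL Server commands:

 -- Log file size and percentage used for every database on the instance
 DBCC SQLPERF(LOGSPACE);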

7. The FP database got marked as Suspect. The solution was to detach and reattach it, or to run: EXEC sp_resetstatus 'FPTraining'

8. Stuart found that a default gateway on one of the TCY servers was incorrectly pointing to a server in GTC rather than the correct machine in TCY. This would have caused a lot of IP requests to go over the slow link from TCY to GTC. It was reconfigured and things ran much faster.