Role SSE Energy smart meter


Overview

This contract ran from Monday 1st July 2019 to Thursday 30th January 2020.

Around one million domestic smart meters installed in people's homes are SMETS1 meters. These act as smart meters only while they remain with the energy supplier (such as SSE) that originally installed them. However, if the customer switches to another provider, they cease being smart and carry on working as dumb meters.

To get around this problem, it is possible to reconfigure all of these meters to connect instead to an independent third-party company called the "DCC" (Data Communications Company). The DCC is able to communicate with all of the meters and pass the data to and from the energy supplier, whoever that is. This means that if the customer decides to switch supplier, the communication with the DCC continues to operate, allowing the new company to carry on operating the meter as a smart meter.

The name of the project is the "SMETS1 Migration" project.

There are around 1 million SMETS1 meters in scope to upgrade. When the project goes live, they will be upgraded in tranches of around 100,000 every month.

My role was to build a database system to allow these meters to be managed, to produce a list of meters that will be ready for migration at particular points in time, and to produce reports about them (in Power BI).

The management of these lists was complicated by several factors:

1. Non-comm excludes - if a meter could not be connected to, it had to be excluded from the migration.

2. Manual_exclusion_list - meters may need to be excluded from the migration for any arbitrary reason

3. ToDo - A list of things that need to be done to a meter, before it can be a candidate for migration.

4. status_chng

5. DMCT

6. EPCL

7. exclusion_code - a list of codes, one for each type of exclusion reason.

Rules had to be applied, such as: if a meter was excluded for any reason, then any other meters situated at the same site as the excluded meter also had to be excluded.
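As a minimal sketch of how that site-level rule could be expressed in Spark SQL from a Python notebook (the tbl_tech_mmt and tbl_tech_manual_exclusion table names and the msn/site_id columns here are illustrative, not the real schema):

site_level_exclusions = spark.sql("""
    -- Every meter sharing a site with an excluded meter is itself excluded.
    SELECT DISTINCT m2.msn
    FROM dbsmets1mig02_technical_build.tbl_tech_mmt m1
    JOIN dbsmets1mig02_technical_build.tbl_tech_manual_exclusion x
      ON m1.msn = x.msn
    JOIN dbsmets1mig02_technical_build.tbl_tech_mmt m2
      ON m2.site_id = m1.site_id
""")
site_level_exclusions.createOrReplaceTempView("vw_site_level_exclusions")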

The process was that we would send a list of around 100,000 meters (each identified by a unique MSN - Meter Serial Number), obtained from our in-house CS (Client Services) mainframe system, to the DCC. The DCC would then send back a list of those that had been approved.

Then we would notify the CS and MDM/Fusion systems of the ones being migrated.

Each of these processes involves producing an output from our own system, called the MMT, in the form of a .csv file, and then receiving a response from the external system (DCC, CS, MDM).

These are the outbound and inbound files respectively.

Each of the categories above had its own outbound and inbound files, each with a particular format (set of columns).

Each of the above items, and others, was represented by data in a "technical" table (one table for each item); "technical" was the name of a database.

Databricks

These tables were not in Microsoft SQL Server; they were Spark SQL tables running on an Azure Delta Lake store (a cloud filesystem supporting relational tables).

The front end was Databricks, also a cloud service.

When data was to be loaded into one of these tables (inbound), it was first sent (ingested) as a .csv file with a defined format into Azure blob storage (optimised for cheap file storage). The external systems (DCC, CS, MDM) sent the files, which the "FileMover" SSE system copied into a folder on the S: drive at SSE. The "StreamSets" jobs would then move the files from the S: drive into a folder on Azure storage.

Databricks "notebooks" were then written to read the data from these blob files, which were defined as read-only database tables (i.e. a table object was defined with the file in Azure holding the physical data, refreshed using the REFRESH TABLE command), and to copy the data into Delta Lake tables, after which the latter could be queried using SQL.

Note that Azure blob storage was effectively write-only from the perspective of StreamSets, and read-only from the perspective of Databricks; i.e. new files had to be sent in with datetime-stamped filenames (to ensure uniqueness), and the new files sat alongside the old ones - they could not replace them. The Databricks commands would then do a left join when inserting into the destination table, so as to import only the records in the source that were not already in the destination. Those records were marked with the source filename (obtainable by running a function on each record of the source).
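A rough sketch of that inbound pattern, written here as a Python notebook (the database, table, column and path names are illustrative, and the real jobs were mostly SQL notebooks; a LEFT ANTI JOIN is the Spark shorthand for the left-join-and-filter described above):

# Define (once) an external read-only table over the inbound folder in Azure
# storage; REFRESH TABLE picks up newly arrived files.
spark.sql("""
    CREATE TABLE IF NOT EXISTS dbsmets1mig02_staging_build.ext_notes_inbound
    (msn STRING, note_text STRING, seq_no BIGINT)
    USING CSV
    OPTIONS (header 'true')
    LOCATION 'dbfs:/mnt/smets1/cs/system_data_build/notes/inbound/'
""")
spark.sql("REFRESH TABLE dbsmets1mig02_staging_build.ext_notes_inbound")

# Copy across only the records not already in the (Delta) staging table,
# tagging each row with the file it came from.
spark.sql("""
    INSERT INTO dbsmets1mig02_staging_build.tbl_stg_notes
    SELECT s.msn, s.note_text, s.seq_no,
           input_file_name() AS source_filename,
           'N' AS Loaded_Stg
    FROM dbsmets1mig02_staging_build.ext_notes_inbound s
    LEFT ANTI JOIN dbsmets1mig02_staging_build.tbl_stg_notes d
      ON s.msn = d.msn AND s.seq_no = d.seq_no
""")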

This process was tolerant of large amounts of data because of the Azure architecture, which can spread the processing load over many processors and servers in the cloud.

Once in the destination tables, called the "staging" tables (in the "staging" database), a second Databricks notebook would then run, copying the new data into the technical tables, with a column "Loaded_Stg" used to show which rows were new.
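Continuing the sketch above (again with illustrative names, apart from Loaded_Stg), the second hop might look like:

# Copy the newly staged rows into the technical table, then flag them as
# loaded so the next run skips them.
spark.sql("""
    INSERT INTO dbsmets1mig02_technical_build.tbl_tech_notes
    SELECT msn, note_text, seq_no, source_filename
    FROM dbsmets1mig02_staging_build.tbl_stg_notes
    WHERE Loaded_Stg = 'N'
""")
spark.sql("""
    UPDATE dbsmets1mig02_staging_build.tbl_stg_notes
    SET Loaded_Stg = 'Y'
    WHERE Loaded_Stg = 'N'
""")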

Note, though, that the MMT loads (files created manually by us, as opposed to by the DCC, CS, and MDM systems) were loaded straight into the technical tables from the source files.

For outbound files, the process went in reverse. Databricks notebooks would run against the technical tables, performing the business logic, and create outbound files in Azure Storage containing the data. The StreamSets processes would copy these out to the S: drive, and the FileMover processes would then move them on to the DCC/CS/MDM.
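An illustrative outbound extract in the same notebook style (the status column, the '0S' Eligible value, the path and the filename convention are assumptions based on the notes further down):

from datetime import datetime

# Select the eligible meters and write them out as a timestamped .csv for
# StreamSets to collect from the outbound folder.
out_df = spark.sql("""
    SELECT msn, status
    FROM dbsmets1mig02_technical_build.tbl_tech_mmt
    WHERE status = '0S'
""")
out_path = ("dbfs:/mnt/smets1/cs/system_data_build/mmt/outbound/"
            + "eligible_" + datetime.now().strftime("%Y%m%d_%H%M%S"))
# coalesce(1) writes a single part file, which would then be renamed/moved on.
out_df.coalesce(1).write.option("header", "true").csv(out_path)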

BUILD/SOGL/PROD: All the above processes were triplicated into these three "environments", corresponding to dev, pre-prod, and prod. Note that SOGL means "simulation of go live".

This means a typical folder structure for the files on the S: drive and in Azure Storage is:

cs\system_data_build\notes\inbound

with system_data_build being replaced by system_data_sogl for pre-prod, and system_data for prod

Business Upload templates: For the MMT inbound files, I wrote a series of spreadsheets (each based on a template) containing the data to go into the .csv file to be created. Users of the spreadsheets would paste the data into these files, then click a button to generate the .csv file, which was placed into the appropriate destination directory ready to be ingested.

Databricks notebooks. Most of these were SQL notebooks, running only SQL statements in Spark SQL.

https://docs.databricks.com/spark/latest/spark-sql/index.html

However, some of them were Python notebooks - those involved in manipulating the files in Azure storage: although SQL could read the files, Python was needed to delete them, check whether they were present, and create them.
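For example, the kinds of dbutils.fs calls used for this from a Python notebook (the folder and filenames here are illustrative):

folder = "dbfs:/mnt/smets1/cs/system_data_build/notes/inbound/"
# Check whether any .csv files are present in the folder.
csv_files = [f for f in dbutils.fs.ls(folder) if f.name.endswith(".csv")]
# Delete a file once it has been processed.
if csv_files:
    dbutils.fs.rm(csv_files[0].path)
# Create a small file (e.g. a completion marker).
dbutils.fs.put(folder + "load_complete.txt", "OK", True)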

Reports: James, mainly, wrote reports in Power BI (Power BI Desktop) to present the table contents to end users.

Notebooks of all types could be scheduled, and provided with parameters.

Parameters, or variables, were notebook-specific and called widgets. They were used to denote the environment (build, sogl, prod) and the environment-specific part of the inbound and outbound file paths.

e.g. the widget $env_tech_db references the database dbsmets1mig02_technical_build, so the table:

dbsmets1mig02_technical_build.tbl_tech_variables

is referenced in a notebook as:

$env_tech_db.tbl_tech_variables

Also, in a Python notebook, dbutils.widgets.get("env_system_data") returns the widget value as a string.
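Putting those together, a rough sketch of how a Python notebook might use the widgets (the widget names and tbl_tech_variables are as above; the default values and the path are illustrative):

# Declare the widgets and read their values.
dbutils.widgets.text("env_tech_db", "dbsmets1mig02_technical_build")
dbutils.widgets.text("env_system_data", "system_data_build")
env_tech_db = dbutils.widgets.get("env_tech_db")
env_system_data = dbutils.widgets.get("env_system_data")
# Same reference as $env_tech_db.tbl_tech_variables in a SQL notebook.
vars_df = spark.sql(f"SELECT * FROM {env_tech_db}.tbl_tech_variables")
# And the environment-specific part of a file path.
inbound_path = f"dbfs:/mnt/smets1/cs/{env_system_data}/notes/inbound/"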

Other notes

mockaroo.com was used initially to generate mock source files.

datetime for now: current_timestamp()

Sequence number, for ordering of inbound datasets

Tech_variables table, to provide a variable

Pipe-delimited inbound files - loaded into the system as a single-column "csv", then in SQL a new table is defined in staging referencing the file loaded into Azure Storage, with pipe as the delimiter.
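Such a staging definition might look like the following sketch (the database, table, columns and location are illustrative):

# External table over a pipe-delimited inbound file in Azure storage.
spark.sql("""
    CREATE TABLE IF NOT EXISTS dbsmets1mig02_staging_build.ext_dmct_inbound
    (msn STRING, exclusion_code STRING, seq_no BIGINT)
    USING CSV
    OPTIONS (header 'false', delimiter '|')
    LOCATION 'dbfs:/mnt/smets1/dcc/system_data_build/dmct/inbound/'
""")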

Header and footer rows.

tbl_Tech_MMT is the master table of MSNs I think, and has the following status for each record (unique on MSN):

0 - Loaded

0S - Eligible

0Z - Excluded

1 - Selected (for next week's migration)

1a - CS Note issued

tbl_Tech_admin_data - holds admin data.

One or two items were also written in Scala, but support for Scala was discontinued.

ifnull(a, b) - returns b if a is null, otherwise a.

To get just the filename (the part after the last "/") from the full path returned by the input_file_name() function:

reverse(left(reverse(input_file_name()), instr(reverse(input_file_name()), "/")-1))

Use of views to get only the latest records.
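A sketch of such a view, using the sequence number to pick the latest row per MSN (the table, view and column names are assumptions):

spark.sql("""
    CREATE OR REPLACE VIEW dbsmets1mig02_technical_build.vw_tech_notes_latest AS
    SELECT msn, note_text, seq_no
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY msn ORDER BY seq_no DESC) AS rn
        FROM dbsmets1mig02_technical_build.tbl_tech_notes
    ) t
    WHERE t.rn = 1
""")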

HP Quality Center tests

Filemover forms

Datatypes: timestamp, string, bigint, date

SQL MERGE command
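e.g. a Delta Lake MERGE to upsert staged records into a technical table (the table and column names are illustrative):

spark.sql("""
    MERGE INTO dbsmets1mig02_technical_build.tbl_tech_mmt AS tgt
    USING dbsmets1mig02_staging_build.tbl_stg_mmt AS src
    ON tgt.msn = src.msn
    WHEN MATCHED THEN UPDATE SET tgt.status = src.status
    WHEN NOT MATCHED THEN INSERT (msn, status) VALUES (src.msn, src.status)
""")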

Error table - records sequence number mismatches and null or blank MSNs.

Activity_Log - records start and end times, success/failure, and the number of records loaded.

ADD/REMOVE/ALL load methodology: if the last column is ALL, then insert all the records; otherwise add or remove records on a row-by-row basis.
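A sketch of that branching, using the manual exclusion list as the example (the "action" column name and the table names are assumptions):

# Read the distinct values of the hypothetical last column, "action".
actions = [r["action"] for r in spark.sql("""
    SELECT DISTINCT action
    FROM dbsmets1mig02_staging_build.tbl_stg_manual_exclusion
""").collect()]

if "ALL" in actions:
    # ALL: insert every record in the file.
    spark.sql("""
        INSERT INTO dbsmets1mig02_technical_build.tbl_tech_manual_exclusion
        SELECT msn, exclusion_code
        FROM dbsmets1mig02_staging_build.tbl_stg_manual_exclusion
    """)
else:
    # Otherwise row by row: ADD rows are inserted, REMOVE rows delete their
    # matching MSN (expressed here as a single Delta MERGE).
    spark.sql("""
        MERGE INTO dbsmets1mig02_technical_build.tbl_tech_manual_exclusion AS tgt
        USING dbsmets1mig02_staging_build.tbl_stg_manual_exclusion AS src
        ON tgt.msn = src.msn
        WHEN MATCHED AND src.action = 'REMOVE' THEN DELETE
        WHEN NOT MATCHED AND src.action = 'ADD' THEN
            INSERT (msn, exclusion_code) VALUES (src.msn, src.exclusion_code)
    """)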

Truncate vs append load.

Initial step: delete partially loaded records left over from a failed previous load.

Next, mark the previous record in Activity_Log as having started but not completed.

Save errors into the tech variables table, to record errors such as bad sequence numbers.
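A sketch of the start-of-load housekeeping described in the last few notes (the Activity_Log column names and the target table are assumptions):

# Find the most recent run; if it started but never completed, remove whatever
# it half-loaded and mark the run accordingly.
last_run = spark.sql("""
    SELECT load_file, status
    FROM dbsmets1mig02_technical_build.tbl_tech_activity_log
    ORDER BY start_time DESC
    LIMIT 1
""").collect()

if last_run and last_run[0]["status"] == "STARTED":
    failed_file = last_run[0]["load_file"]
    spark.sql(f"""
        DELETE FROM dbsmets1mig02_technical_build.tbl_tech_notes
        WHERE source_filename = '{failed_file}'
    """)
    spark.sql(f"""
        UPDATE dbsmets1mig02_technical_build.tbl_tech_activity_log
        SET status = 'STARTED_NOT_COMPLETED'
        WHERE load_file = '{failed_file}'
    """)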