Submit social housing lettings and sales data (CORE)

Exporting to CDS

All data collected by the application needs to be exported to the Consolidated Data Store (CDS), which is a data warehouse based on MS SQL running in the DAP (Data Analytics Platform).

This is done via XML exports saved in an S3 bucket.
Currently, we export the following:

  • Lettings logs
  • Users
  • Organisations

The data mapping for these exports can be found in:

  • Lettings logs: app/services/exports/lettings_log_export_service.rb
  • Organisations: app/services/exports/organisation_export_service.rb
  • Users: app/services/exports/user_export_service.rb

Lettings logs exports are year-specific, so a single export can in principle contain records for three different collection years, creating three distinct collections: previous, current, and next. Although this is technically possible, it is very unlikely to happen in a production environment unless records are updated manually.

Typically, one lettings logs collection (current) is exported most of the time, and two collections (current and previous) are exported during the crossover period. Export objects are still created for all collections, but they are marked as empty when applicable, and no files are generated for these empty exports.
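As a rough illustration of how the previous, current, and next collections relate to a given date, the sketch below derives the three collection start years. It assumes a collection year runs from 1 April to 31 March (as the apr_mar part of the file names suggests). The helper names are hypothetical and are not taken from the application code.

```ruby
require "date"

# Hypothetical helper: start year of the collection year containing `date`.
# Assumes the collection year runs from 1 April to 31 March.
def collection_start_year(date)
  date.month >= 4 ? date.year : date.year - 1
end

# Start years of the previous, current, and next collections for `date`.
def collection_start_years(date)
  current = collection_start_year(date)
  { previous: current - 1, current: current, next: current + 1 }
end

collection_start_years(Date.new(2025, 2, 10))
# => { previous: 2023, current: 2024, next: 2025 }
```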

Users and organisations are not year-specific exports, so there will never be more than one collection for each of these in a single partial export.

Initially, the application's database field names and types were chosen to match the existing CDS data as closely as possible, to minimise the amount of transformation needed. This has led to a less than optimal data model, so increasingly we should transform data at the mapping layer where that benefits our application.

A cron job triggers the export service daily at 5 a.m. in the timezone configured in Rails ("London").


Files Generated by the Export

Several files need to be structured and named in a specific way to be successfully ingested by CDS.

These files include:

Master Manifest

The master manifest is a CSV file that lists all the collections generated during a specific export.

File name format:
Manifest_{today.year}_{month}_{day}_{increment_number}.csv

  • The increment number is included because exports are sometimes manually run multiple times a day. While rare, this may happen when re-exporting data.
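The naming pattern above could be produced along these lines. This is a minimal sketch: the helper name is hypothetical, and the zero-padding of the month, day, and increment number is an assumption, since the format string above does not specify it.

```ruby
require "date"

# Hypothetical helper, not the export service's actual method.
# Assumes a zero-padded month and day and a 4-digit zero-padded increment.
def master_manifest_name(date, increment_number)
  format(
    "Manifest_%<year>d_%<month>02d_%<day>02d_%<increment>04d.csv",
    year: date.year, month: date.month, day: date.day, increment: increment_number
  )
end

master_manifest_name(Date.new(2024, 5, 1), 1) # => "Manifest_2024_05_01_0001.csv"
# A second manual run on the same day takes the next increment number.
master_manifest_name(Date.new(2024, 5, 1), 2) # => "Manifest_2024_05_01_0002.csv"
```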

Collection Archive

The collection archive contains all files for a single collection (e.g., 2024 lettings logs or users). This file is referenced in the master manifest. Each exported collection has its own archive.

File name format:

  • For year-specific collections:
    {collection_name}_{start_year}_{end_year}_apr_mar_{base_number}_{increment}.zip
    Example: core_2024_2025_apr_mar_f0001_inc0001.zip

    • core: The lettings log export collection name
    • 2024_2025_apr_mar: Collection year (apr_mar is hardcoded since the yearly log months are fixed)
    • f0001: Full export increment number, incremented with each full export of the collection
    • inc0001: Partial export increment number, incremented with each partial export. It resets to 0001 when a full export is run.
  • For non-year-specific collections:
    {collection_name}_{start_year}_{end_year}_apr_mar_{base_number}_{increment}.zip
    Example: users_2024_2025_apr_mar_f0001_inc0001.zip

The structure is the same as for year-specific collections, except that the start and end years are hardcoded, since they are not meaningful for these collections. The years are still included in the file name because CDS requires them for import.
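To make the archive naming and the two increment counters concrete, here is a small sketch. Both helpers are hypothetical. The 4-digit padding is inferred from the examples above, and the rule that a full export bumps the f number and resets the inc number to 0001 follows the description of the two counters.

```ruby
# Hypothetical helper mirroring the documented archive name pattern.
def collection_archive_name(collection_name, start_year, base_number, increment)
  format(
    "%<name>s_%<start_year>d_%<end_year>d_apr_mar_f%<base>04d_inc%<inc>04d.zip",
    name: collection_name, start_year: start_year, end_year: start_year + 1,
    base: base_number, inc: increment
  )
end

# Hypothetical helper: returns [base_number, increment] for the next export.
# A full export bumps the base number and resets the increment to 1;
# a partial export only bumps the increment.
def next_increments(base_number, increment, full:)
  full ? [base_number + 1, 1] : [base_number, increment + 1]
end

collection_archive_name("core", 2024, 1, 1)
# => "core_2024_2025_apr_mar_f0001_inc0001.zip"

base, inc = next_increments(1, 7, full: true)
collection_archive_name("core", 2024, base, inc)
# => "core_2024_2025_apr_mar_f0002_inc0001.zip"
```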


Collection Archive Contents

  1. Manifest XML
    File name: manifest.xml

    This file contains the number of records exported as part of the collection. This count must match the number of records in the actual export files. Otherwise, the collection will not be ingested by CDS.

  2. Export XML
    There may be multiple export XML files in the archive, containing the actual collection data.
    Example files can be found in the spec/fixtures/exports folder. These are representative of real exports but are much smaller in size.

    File name format:
    {collection_name}_{start_year}_{end_year}_apr_mar_{base_number}_{increment}_{part_increment}.xml
    Example: core_2024_2025_apr_mar_f0001_inc0001_pt001.xml

    • pt001: Each file contains up to MAX_XML_RECORDS (10,000) records. If more records are exported, they are split into multiple files with incremented part numbers.
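The splitting into part files can be sketched as follows. The helper is hypothetical and works on plain arrays rather than the application's models. It only illustrates the batching rule and the requirement that the count in manifest.xml equals the total number of records across all part files.

```ruby
MAX_XML_RECORDS = 10_000

# Hypothetical helper: maps each part file name to the records it would hold.
# `archive_base_name` is the archive name without the .zip extension.
def split_into_parts(archive_base_name, records)
  records.each_slice(MAX_XML_RECORDS).each_with_index.map { |slice, index|
    [format("%s_pt%03d.xml", archive_base_name, index + 1), slice]
  }.to_h
end

parts = split_into_parts("core_2024_2025_apr_mar_f0001_inc0001", (1..25_000).to_a)

parts.keys
# => ["core_2024_2025_apr_mar_f0001_inc0001_pt001.xml",
#     "core_2024_2025_apr_mar_f0001_inc0001_pt002.xml",
#     "core_2024_2025_apr_mar_f0001_inc0001_pt003.xml"]

# The manifest count must equal the total across all part files.
manifest_record_count = parts.values.sum(&:size) # => 25000
```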

Navigating the Export Code

export_service.rb

Orchestrates all exports and generates the master manifest. Check this file when adding new collections to the daily export or modifying the master manifest.

xml_export_service.rb

Creates Export objects and writes them to S3. Use this file to see how Export objects are created, how export increment numbers are set, and how export records are batched, archived, and written.

{collection}_export_service.rb

Individual collection export service files (e.g., lettings_log_export_service.rb) construct the data export XML content. Modify these files to add new data to an existing collection or change the format of existing fields.

{collection}_export_constants.rb

These collection-specific files define the EXPORT_FIELDS constants. A field will not be exported unless added to this constant.

When adding new fields to year-specific exports, it is often necessary to include them starting from a specific year (typically the most recent). Constants like POST_2024_EXPORT_FIELDS are used for this purpose.
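The idea of year-gated field lists can be illustrated with the sketch below. The field names are placeholders, the helper methods are hypothetical, and the reading of POST_2024_EXPORT_FIELDS as "exported for collections starting in 2024 or later" is an assumption. The real constants and the way they are combined live in the {collection}_export_constants.rb and export service files.

```ruby
require "set"

# Placeholder field names for illustration only.
EXPORT_FIELDS = Set["field_a", "field_b"]
POST_2024_EXPORT_FIELDS = Set["new_field_from_2024"]

# Hypothetical helper: fields exported for a collection starting in `start_year`.
# Assumes POST_2024 fields apply to the 2024 collection year onwards.
def exported_fields(start_year)
  return EXPORT_FIELDS if start_year < 2024

  EXPORT_FIELDS + POST_2024_EXPORT_FIELDS
end

# Drops any attribute that is not in the export list for that year.
def filter_attributes(attributes, start_year)
  attributes.slice(*exported_fields(start_year))
end

log = { "field_a" => 1, "new_field_from_2024" => 2, "internal_only" => 3 }
filter_attributes(log, 2023) # => { "field_a" => 1 }
filter_attributes(log, 2024) # => { "field_a" => 1, "new_field_from_2024" => 2 }
```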


Partial Exports

Partial exports run daily, triggered via a cron job. These include all records updated since the last export.

To determine updated records, the service uses the updated_at and values_updated_at columns:

  • updated_at: Updated whenever a record is edited through the service.
  • values_updated_at: Set in the rare cases where records are updated manually in bulk without updated_at being changed. Not all collections include this field.
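The selection rule described above can be sketched in plain Ruby. The Record struct and helper names are hypothetical stand-ins for the application's models, and the strict "later than the last export" comparison is an assumption. A nil values_updated_at covers collections that do not have that column.

```ruby
# Hypothetical stand-in for an exported model.
Record = Struct.new(:id, :updated_at, :values_updated_at, keyword_init: true)

# True if either timestamp is later than the last export.
# Missing (nil) timestamps are ignored.
def changed_since?(record, last_export_at)
  [record.updated_at, record.values_updated_at].compact.any? { |time| time > last_export_at }
end

def records_to_export(records, last_export_at)
  records.select { |record| changed_since?(record, last_export_at) }
end

last_export_at = Time.utc(2024, 6, 1, 5)
records = [
  Record.new(id: 1, updated_at: Time.utc(2024, 5, 30)),             # unchanged
  Record.new(id: 2, updated_at: Time.utc(2024, 6, 1, 9)),           # edited in the service
  Record.new(id: 3, updated_at: Time.utc(2024, 5, 30),
             values_updated_at: Time.utc(2024, 6, 1, 10))           # bulk manual update
]

records_to_export(records, last_export_at).map(&:id) # => [2, 3]
```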

Triggering a Partial Export

The easiest way to trigger a partial export is through the Sidekiq console:

  1. Log in as a support user and navigate to /sidekiq in the service URL.
  2. Go to the Cron tab (last tab in the top navigation).
  3. Find the data_export_xml job (the only one listed) and click Enqueue Now.

Full Exports

A full re-export of an entire collection may be required if new fields are added or existing fields are re-coded.

Full exports can only be run via a rake task.

If the collection is very large, a full export may fail due to memory issues. In such cases, it is better to split the records into batches of ~60,000 and export them through several partial exports over multiple days. The values_updated_at field can help with this.

The simplest approach is to mark a batch of logs for export each day and allow scheduled morning exports to handle them.
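A batching plan of this kind could look like the sketch below. It only works out which record ids to mark on which day. The helper name is hypothetical, and the actual marking step (for example, setting values_updated_at on that day's batch so the next scheduled export picks it up) is assumed rather than shown.

```ruby
require "date"

BATCH_SIZE = 60_000

# Hypothetical helper: assigns consecutive batches of ids to consecutive days.
def plan_daily_batches(record_ids, first_day, batch_size: BATCH_SIZE)
  record_ids.each_slice(batch_size).each_with_index.map do |batch, index|
    { mark_on: first_day + index, record_ids: batch }
  end
end

plan = plan_daily_batches((1..150_000).to_a, Date.new(2025, 1, 6))

plan.map { |day| day[:record_ids].size }  # => [60000, 60000, 30000]
plan.map { |day| day[:mark_on].to_s }     # => ["2025-01-06", "2025-01-07", "2025-01-08"]
```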