Update exports documentation (#2909)

* Update exports documentation

* lint

* Update timezone info
docs/exports.md
# Exporting to CDS
All data collected by the application needs to be exported to the **Consolidated Data Store (CDS)**, which is a data warehouse based on **MS SQL** running in the **DAP (Data Analytics Platform)**.
This is done via XML exports saved in an **S3 bucket**.
Currently, we export the following:
- **Lettings logs**
- **Users**
- **Organisations**
The data mapping for these exports can be found in:
- **Lettings logs**: `app/services/exports/lettings_log_export_service.rb`
- **Organisations**: `app/services/exports/organisation_export_service.rb`
- **Users**: `app/services/exports/user_export_service.rb`
<!-- Add sales log when sales logs export is added -->
Lettings logs exports are year-specific, so at any given time, there may be records exported for three different years, creating three distinct collections: **previous**, **current**, and **next**. While this is technically possible, it is very unlikely to happen in a production environment unless records are updated manually.
Typically, **one lettings logs collection (current)** is exported most of the time, and **two collections (current and previous)** are exported during the crossover period. Export objects are still created for all collections, but they are marked as **empty** when applicable, and no files are generated for these empty exports.
**Users** and **organisations** are not year-specific exports, so there will never be more than one collection for each of these in a single partial export.
Initially, the application database field names and types were chosen to match the existing CDS data as closely as possible, to minimise the amount of transformation needed. However, this has led to a less-than-optimal data model, and increasingly we should look to transform data at the mapping layer where that benefits our application.
A **cron job** triggers the export service daily at **5 a.m.** in the timezone configured in Rails ("London").
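
The schedule shows up as the `data_export_xml` job in Sidekiq (see "Partial Exports" below). As a rough sketch, assuming a sidekiq-cron style definition, the entry could look like this; the job class name is hypothetical:

```ruby
# Sketch only: a sidekiq-cron entry for the daily export.
# The job name matches the `data_export_xml` job listed in the Sidekiq Cron tab;
# the class name is a placeholder, not the real worker.
Sidekiq::Cron::Job.create(
  name: "data_export_xml",
  cron: "0 5 * * *", # daily at 5 a.m.; how the timezone is applied depends on the cron setup
  class: "DataExportXmlJob"
)
```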
---
## Files Generated by the Export
A number of files need to be structured and named in a specific way to be successfully ingested by CDS.
These files include:
### Master Manifest
The **master manifest** is a CSV file that lists all the collections generated during a specific export.
**File name format**:
`Manifest_{today.year}_{month}_{day}_{increment_number}.csv`
- The increment number is included because exports are sometimes manually run multiple times a day. While rare, this may happen when re-exporting data.
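
For illustration, the name could be assembled like this; the helper name and the zero-padding widths are assumptions, not taken from the codebase:

```ruby
# Sketch only: builds a master manifest file name matching the format above.
def master_manifest_filename(increment_number, today = Time.zone.today)
  format("Manifest_%d_%02d_%02d_%04d.csv", today.year, today.month, today.day, increment_number)
end

master_manifest_filename(2) # => e.g. "Manifest_2025_01_31_0002.csv"
```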
---
### Collection Archive
The **collection archive** contains all files for a single collection (e.g., `2024 lettings logs` or `users`). This file is referenced in the master manifest. Each exported collection has its own archive.
**File name format**:
- **For year-specific collections**:
`{collection_name}_{start_year}_{end_year}_apr_mar_{base_number}_{increment}.zip`
**Example**: `core_2024_2025_apr_mar_f0001_inc0001.zip`
- `core`: The lettings log export collection name
- `2024_2025_apr_mar`: Collection year (`apr_mar` is hardcoded since the yearly log months are fixed)
- `f0001`: Full export increment number, incremented with each full export of the collection
- `inc0001`: Partial export increment number, incremented with each partial export. It resets to `0001` when a full export is run.
- **For non-year-specific collections**:
`{collection_name}_{start_year}_{end_year}_apr_mar_{base_number}_{increment}.zip`
**Example**: `users_2024_2025_apr_mar_f0001_inc0001.zip`
The structure is the same, except the start and end years are hardcoded since they are not meaningful for these collections. The inclusion of years in the file name is necessary for CDS import.
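
As an illustrative sketch (the helper and variable names are made up, not from the codebase), the archive base name could be assembled like this:

```ruby
# Sketch only: assembles the archive base name described above.
# ".zip" is appended for the archive itself; the export XML parts reuse the
# same base with a "_pt{n}.xml" suffix (see the next section).
def archive_base_name(collection, start_year, base_number, increment)
  "#{collection}_#{start_year}_#{start_year + 1}_apr_mar_f#{format('%04d', base_number)}_inc#{format('%04d', increment)}"
end

archive_base_name("core", 2024, 1, 1) # => "core_2024_2025_apr_mar_f0001_inc0001"
```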
---
### Collection Archive Contents
1. **Manifest XML**
**File name**: `manifest.xml`
This file contains the number of records exported as part of the collection. This count must match the number of records in the actual export files. Otherwise, the collection will not be ingested by CDS.
2. **Export XML**
There may be multiple export XML files in the archive, containing the actual collection data.
Example files can be found in the `spec/fixtures/exports` folder. These are representative of real exports but are much smaller in size.
**File name format**:
`{collection_name}_{start_year}_{end_year}_apr_mar_{base_number}_{increment}_{part_increment}.xml`
**Example**: `core_2024_2025_apr_mar_f0001_inc0001_pt001.xml`
- `pt001`: Each file contains up to `MAX_XML_RECORDS` (10,000) records. If more records are exported, they are split into multiple files with incremented part numbers.
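
A rough sketch of the splitting rule (the helper below is illustrative, not the real implementation):

```ruby
# Sketch only: export records are split into part files of up to
# MAX_XML_RECORDS (10,000) records each.
MAX_XML_RECORDS = 10_000

def part_file_names(base_name, record_count)
  parts = (record_count.to_f / MAX_XML_RECORDS).ceil
  (1..parts).map { |part| "#{base_name}_pt#{format('%03d', part)}.xml" }
end

part_file_names("core_2024_2025_apr_mar_f0001_inc0001", 25_000)
# => ["..._pt001.xml", "..._pt002.xml", "..._pt003.xml"]
```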
---
## Navigating the Export Code
### `export_service.rb`
Orchestrates all exports and generates the master manifest. Check this file when adding new collections to the daily export or modifying the master manifest.
### `xml_export_service.rb`
Creates Export objects and writes them to S3. Use this file to see how Export objects are created, how export increment numbers are set, and how export records are batched, archived, and written.
### `{collection}_export_service.rb`
Individual collection export service files (e.g., `lettings_log_export_service.rb`) construct the data export XML content. Modify these files to add new data to an existing collection or change the format of existing fields.
### `{collection}_export_constants.rb`
These collection-specific files define the `EXPORT_FIELDS` constants. A field will not be exported unless added to this constant.
When adding new fields to year-specific exports, it is often necessary to include them starting from a specific year (typically the most recent). Constants like `POST_2024_EXPORT_FIELDS` are used for this purpose.
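
As a shape-only sketch (the module name follows the file path convention above; the field names and collection types below are illustrative):

```ruby
# app/services/exports/lettings_log_export_constants.rb (shape only;
# the field names listed here are not the real export fields)
module Exports
  module LettingsLogExportConstants
    EXPORT_FIELDS = %w[id status startdate].freeze

    # Fields only exported from the 2024 collection year onwards
    POST_2024_EXPORT_FIELDS = %w[field_added_in_2024].freeze
  end
end
```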
---
## Partial Exports
Partial exports run daily, triggered via a cron job. These include all records updated since the last export.
To determine updated records, the service uses the `updated_at` and `values_updated_at` columns:
- **`updated_at`**: Updated whenever a record is edited through the service.
- **`values_updated_at`**: Used in rare cases when records are manually updated in bulk, and `updated_at` is not set. Not all collections include this field.
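
In ActiveRecord terms, the selection is roughly equivalent to the query below (a sketch; how the cutoff timestamp is obtained is an assumption):

```ruby
# Sketch only: records changed since the last export, using either timestamp.
cutoff = Time.zone.parse("2024-05-01 05:00") # stands in for the last export time

LettingsLog.where("updated_at >= :cutoff OR values_updated_at >= :cutoff", cutoff: cutoff)
```

Collections without a `values_updated_at` column would be filtered on `updated_at` alone.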
### Triggering a Partial Export
The easiest way to trigger a partial export is through the **Sidekiq** console:
1. Log in as a support user and navigate to the `/sidekiq` path on the service URL.
2. Go to the **Cron** tab (last tab in the top navigation).
3. Find the `data_export_xml` job (the only one listed) and click **Enqueue Now**.
---
## Full Exports
A full re-export of an entire collection may be required if new fields are added or existing fields are re-coded.
Full exports can only be run via a **rake task**.
<!-- Update this section when sales exports are added, as they will affect rake tasks -->
If the collection size is very large, full exports may fail due to memory issues. In such cases, it is better to batch exports into chunks of ~60,000 records and run several partial exports over multiple days. The `values_updated_at` field can help with this.
The simplest approach is to mark a batch of logs for export each day and allow scheduled morning exports to handle them.
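
As a sketch (the scope used to pick each batch is an assumption; adjust it to whatever subset actually needs re-exporting), marking a day's batch from the Rails console could look like this:

```ruby
# Sketch only: mark ~60,000 logs so the next scheduled partial export picks them up.
ids = LettingsLog.where(values_updated_at: nil).limit(60_000).ids
LettingsLog.where(id: ids).update_all(values_updated_at: Time.zone.now)
```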
