All data collected by the application needs to be exported to the **Consolidated Data Store (CDS)**, which is a data warehouse based on **MS SQL** running in the **DAP (Data Analytics Platform)**.
This is done via XML exports saved in an **S3 bucket**.
Currently, we export the following:

- **Lettings logs**
- **Users**
- **Organisations**

The data mapping for these exports can be found in:

- **Lettings logs**: `app/services/exports/lettings_log_export_service.rb`
- **Organisations**: `app/services/exports/organisation_export_service.rb`
## Files Generated by the Export
There are a number of files that need to be structured and named in a specific way to be successfully ingested by CDS.

These files include:
The **master manifest** is a CSV file that lists all the collections generated during a specific export.
**File name format**:
`Manifest_{today.year}_{month}_{day}_{increment_number}.csv`

- The increment number is included because exports are sometimes manually run multiple times a day. While rare, this may happen when re-exporting data.
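As a rough illustration, the master manifest name could be built like this (a sketch in plain Ruby; the zero-padding of the month, day, and increment number is an assumption about typical manifest names, not something this document specifies):

```ruby
require "date"

# Build a master manifest file name for a given export date and increment.
# Padding widths are assumed, not confirmed by the docs.
def master_manifest_name(date, increment)
  format("Manifest_%04d_%02d_%02d_%04d.csv", date.year, date.month, date.day, increment)
end

puts master_manifest_name(Date.new(2024, 4, 1), 1)
# => Manifest_2024_04_01_0001.csv
```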
The **collection archive** contains all files for a single collection (e.g., `2024 lettings logs` or `users`). This file is referenced in the master manifest. Each exported collection has its own archive.
**File name format**:

- **For year-specific collections**:
  `{collection_name}_{start_year}_{end_year}_apr_mar_{base_number}_{increment}.zip`

  **Example**: `core_2024_2025_apr_mar_f0001_inc0001.zip`
  - `core`: The lettings log export collection name
  - `2024_2025_apr_mar`: Collection year (`apr_mar` is hardcoded since the yearly log months are fixed)
  - `f0001`: Full export increment number, incremented with each full export of the collection
  - `inc0001`: Partial export increment number, incremented with each partial export. It resets to `0001` when a full export is run.

- **For non-year-specific collections**:
  `{collection_name}_{start_year}_{end_year}_apr_mar_{base_number}_{increment}.zip`

  **Example**: `users_2024_2025_apr_mar_f0001_inc0001.zip`

  The structure is the same, except the start and end years are hardcoded since they are not meaningful for these collections. The inclusion of years in the file name is necessary for CDS import.
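The archive naming scheme above can be sketched as a small helper (the `f0001`/`inc0001` padding widths are taken from the example file names; everything else here is a hypothetical illustration, not the service's actual code):

```ruby
# Build a collection archive name from the parts described above.
# For non-year-specific collections the caller passes hardcoded years.
def archive_name(collection, start_year, end_year, full_number, increment)
  format("%s_%d_%d_apr_mar_f%04d_inc%04d.zip",
         collection, start_year, end_year, full_number, increment)
end

puts archive_name("core", 2024, 2025, 1, 1)
# => core_2024_2025_apr_mar_f0001_inc0001.zip
```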
### Collection Archive Contents
1. **Manifest XML**
**File name**: `manifest.xml`

This file contains the number of records exported as part of the collection. This count must match the number of records in the actual export files; otherwise, the collection will not be ingested by CDS.
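For orientation, a manifest carrying the record count might look like the sketch below. The element names here are placeholders, not the real CDS schema; the point is only that the count element must equal the number of records in the export files:

```ruby
# Minimal sketch of a manifest.xml body. Element names are assumed
# placeholders; only the record-count requirement comes from the docs.
def manifest_xml(record_count)
  <<~XML
    <report>
      <records>
        <count-of-records>#{record_count}</count-of-records>
      </records>
    </report>
  XML
end

puts manifest_xml(42)
```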
**File name format**:
`{collection_name}_{start_year}_{end_year}_apr_mar_{base_number}_{increment}_{part_increment}.xml`

**Example**: `core_2024_2025_apr_mar_f0001_inc0001_pt001.xml`

- `pt001`: Each file contains up to `MAX_XML_RECORDS` (10,000) records. If more records are exported, they are split into multiple files with incremented part numbers.
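The part-splitting rule can be sketched like this (plain Ruby; the three-digit `pt` padding comes from the example name, while the helper and its base-name argument are illustrative):

```ruby
MAX_XML_RECORDS = 10_000

# Return the part file names needed to hold record_count records,
# at most MAX_XML_RECORDS per file.
def part_file_names(record_count, base)
  parts = (record_count.to_f / MAX_XML_RECORDS).ceil
  (1..parts).map { |n| format("%s_pt%03d.xml", base, n) }
end

# 25,000 records need three part files: _pt001 to _pt003.
puts part_file_names(25_000, "core_2024_2025_apr_mar_f0001_inc0001")
```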
---
## Navigating the Export Code
### `export_service.rb`

Orchestrates all exports and generates the master manifest. Check this file when adding new collections to the daily export or modifying the master manifest.

### `xml_export_service.rb`

Creates Export objects and writes them to S3. Use this file to see how Export objects are created, how export increment numbers are set, and how export records are batched, archived, and written.

### `{collection}_export_service.rb`

Individual collection export service files (e.g., `lettings_log_export_service.rb`) construct the data export XML content. Modify these files to add new data to an existing collection or change the format of existing fields.

### `{collection}_export_constants.rb`

These collection-specific files define the `EXPORT_FIELDS` constants. A field will not be exported unless it is added to this constant.
When adding new fields to year-specific exports, it is often necessary to include them starting from a specific year (typically the most recent). Constants like `POST_2024_EXPORT_FIELDS` are used for this purpose.
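A minimal sketch of how such year-gated constants could be combined (the constant name `POST_2024_EXPORT_FIELDS` is from the text above; the field values and the helper are invented for illustration):

```ruby
# Example field sets; the values here are placeholders.
EXPORT_FIELDS = %w[id tenancycode startdate].freeze
POST_2024_EXPORT_FIELDS = %w[new_field].freeze

# Only include the post-2024 fields for collections from 2024 onwards.
def fields_for_year(collection_start_year)
  fields = EXPORT_FIELDS.dup
  fields += POST_2024_EXPORT_FIELDS if collection_start_year >= 2024
  fields
end

puts fields_for_year(2024).inspect
```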
## Partial Exports
Partial exports run daily, triggered via a cron job. These include all records updated since the last export.

To determine updated records, the service uses the `updated_at` and `values_updated_at` columns:

- **`updated_at`**: Updated whenever a record is edited through the service.
- **`values_updated_at`**: Used in rare cases when records are manually updated in bulk and `updated_at` is not set. Not all collections include this field.
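The selection rule for the two columns can be sketched without Active Record (the `Record` struct and helper are illustrative; the real service queries these columns in SQL):

```ruby
Record = Struct.new(:id, :updated_at, :values_updated_at)

# A record is included if either timestamp is later than the last export.
def records_to_export(records, last_export_at)
  records.select do |r|
    [r.updated_at, r.values_updated_at].compact.any? { |t| t > last_export_at }
  end
end
```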
### Triggering a Partial Export
The easiest way to trigger a partial export is through the **Sidekiq** console:

1. Log in as a support user and navigate to `/sidekiq` in the service URL.
2. Go to the **Cron** tab (last tab in the top navigation).
3. Find the `data_export_xml` job (the only one listed) and click **Enqueue Now**.
---
## Full Exports
A full re-export of an entire collection may be required if new fields are added or existing fields are re-coded.

Full exports can only be run via a **rake task**.

<!-- Update this section when sales exports are added, as they will affect rake tasks -->

If the collection size is very large, full exports may fail due to memory issues. In such cases, it is better to batch exports into chunks of ~60,000 records and run several partial exports over multiple days. The `values_updated_at` field can help with this.

The simplest approach is to mark a batch of logs for export each day and allow scheduled morning exports to handle them.
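The daily-batch approach can be sketched like this (a hypothetical helper, assuming `values_updated_at` is settable on each log; the scheduled morning partial export then picks up whatever is marked):

```ruby
BATCH_SIZE = 60_000
Log = Struct.new(:id, :values_updated_at)

# Mark the next batch of as-yet-unmarked logs for export by setting
# values_updated_at, and return the batch that was marked.
def mark_batch_for_export(logs, now)
  batch = logs.reject(&:values_updated_at).first(BATCH_SIZE)
  batch.each { |log| log.values_updated_at = now }
  batch
end
```

Running this once per day stamps roughly one chunk at a time, keeping each morning export within the memory limits noted above.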
