GOV.UK PaaS is being decommissioned at the end of this year, and by 23 December 2023 all services hosted on GOV.UK PaaS will need to have migrated to an alternative hosting platform.
Like other DLUHC services, we are moving our service directly to DLUHC-owned AWS infrastructure.
@ -12,8 +12,6 @@ Each data collection window runs from 1 April to 1 April the following year (plu
ADD (Analytics & Data Directorate) statisticians are the other primary users of the service. The data collected is transferred to DLUHC's consolidated data store (CDS) via nightly XML exports to an S3 bucket. CDS ingests and transforms this data, ultimately storing it in an MS SQL database and exposing it to analysts and statisticians via Amazon Workspaces.
![Diagram of the CORE system architecture](https://raw.githubusercontent.com/communitiesuk/submit-social-housing-lettings-and-sales-data/main/docs/images/architecture.drawio.png)
## Users
External data providing organisations have 2 main user types:
On [GOV.UK PaaS](https://www.cloud.service.gov.uk/), service credentials are appended to the environment variable `VCAP_SERVICES` when services [are bound](https://docs.cloud.service.gov.uk/deploying_services/s3/#bind-an-aws-s3-bucket-to-your-app) to an application.
Such services include datastores and S3 buckets.
Our application uses S3 and Redis clients and supports two different ways of parsing their configuration:
* Via the environment variable `VCAP_SERVICES` using the `PaasConfigurationService` class
* Via the environment variables `S3_CONFIG` and `REDIS_CONFIG` using the `EnvConfigurationService` class
`S3_CONFIG` and `REDIS_CONFIG` are populated using a structure similar to that of `VCAP_SERVICES`:
S3_CONFIG:
```json
[
{
"instance_name": "bucket_1",
"credentials": {
"aws_access_key_id": "123",
"aws_secret_access_key": "456",
"aws_region": "eu-west-1",
"bucket_name": "my-bucket"
}
}
]
```
REDIS_CONFIG:
```json
[
{
"instance_name": "redis_1",
"credentials": {
"uri": "redis_uri"
}
}
]
```
In order to switch from using [GOV.UK PaaS](https://www.cloud.service.gov.uk/) provided services to external ones, instances of `PaasConfigurationService` need to be replaced by `EnvConfigurationService`.
This assumes that `S3_CONFIG` and/or `REDIS_CONFIG` are available.
Please check `full_import.rake` and `rack_attack.rb` for examples of how the configuration is used.
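Before replacing `PaasConfigurationService` with `EnvConfigurationService`, it can help to confirm that the variables are actually present in the target environment. The following is a minimal sketch of such a pre-flight check; the helper name `require_env_config` is hypothetical and is not part of the codebase.

```bash
#!/usr/bin/env bash
# Sketch only: require_env_config is a hypothetical helper, not part of the application.

# Succeeds only if every named environment variable is exported and non-empty.
require_env_config() {
  for name in "$@"; do
    if [ -z "$(printenv "$name")" ]; then
      echo "Missing environment variable: $name" >&2
      return 1
    fi
  done
}

# Usage (pass whichever of the two variables the environment needs):
#   require_env_config S3_CONFIG REDIS_CONFIG && echo "configuration present"
```

This only checks that the variables are set; it does not validate that their JSON matches the structure shown above.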
## Current infrastructure
Currently, there are four environments with infrastructure:
- Meta
- Development (Review Apps)
- Staging
- Production
### Meta
This holds the Terraform “backend” and the ECR(s).
The Terraform “backend” consists of:
- S3 buckets - for storing Terraform state files. One for all non-production environments (including the meta environment itself), and another just for production.
- DynamoDB - for managing access and locking of all state files.
The ECR(s) are:
- core - holds the application Docker images.
- db-migration - holds the Docker images curated to help migrate a DB from PaaS to AWS.
- s3-migration - holds the Docker images curated to help migrate S3 files from PaaS to AWS.
N.B. The migration ECRs may or may not be present, depending on whether the Terraform has been configured to create migration infrastructure. The migration infrastructure is only used to help migrate the DB and S3 from PaaS to AWS, so it is usually only present temporarily.
### Development / Staging / Production
These are the main environments holding the “application” infrastructure.
Though not exhaustive, each of them will generally contain the following key components:
- ECS Fargate cluster
- RDS (PostgreSQL database)
- ElastiCache (Redis data store)
- S3 buckets
  - One for Bulk upload (sometimes also referred to as the CSV bucket)
  - One for CDS Export
- VPC
  - Private subnets
  - Public subnets
  - Load Balancer
  - Other misc. networking components (e.g. routing tables, gateways)
- CloudFront (Content Delivery Network)
- AWS Shield (DDoS protection, when enabled)
- WAF (Firewall)
### Development / Review Apps
The development environment is used for Review Apps, and has some infrastructure that is created per-review-app and some that is shared by all apps.
In general, each review app has its own ECS Fargate cluster and Redis instances (plus any infrastructure to enable this), while the rest is shared.
### Where to find the infrastructure?
The infrastructure is managed as code.
In the terraform folder of the codebase, there will be dedicated sub-folders for each of the aforementioned environments, where all the infrastructure for them is defined.
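As a rough illustration of working with one of these per-environment folders, the sketch below wraps `terraform init` and `terraform plan` using Terraform's `-chdir` option. The helper name `tf_plan` and the `terraform/<environment>` folder layout are assumptions for illustration; check the actual sub-folder names in the repository, and note that any backend options or credentials your setup requires are omitted here.

```bash
#!/usr/bin/env bash
# Sketch only: tf_plan is a hypothetical helper, and the folder layout
# (terraform/<environment>) is an assumption based on the description above.

# Initialise and plan the Terraform configuration for one environment folder.
tf_plan() {
  env_dir="terraform/$1"
  terraform -chdir="$env_dir" init && terraform -chdir="$env_dir" plan
}

# Usage, from the root of the repository (the folder name is a placeholder):
#   tf_plan <environment>
```

Running `plan` rather than `apply` keeps the example read-only, so it can be used to inspect what Terraform thinks is defined for an environment without changing anything.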
## Deployment (Pipeline — Recommended)
@ -61,224 +64,50 @@ To deploy you need to:
6. Post success message on Slack.
7. Tag tickets as ‘Released’ and move tickets to done on JIRA.
## Deployment (Manual)
It is unlikely you will need to deploy manually as the GitHub actions method supersedes this one. This application is running on [GOV.UK PaaS](https://www.cloud.service.gov.uk/). To deploy you need to:
1. Contact your organisation manager to get an account in `dluhc-core` organization and in the relevant spaces (staging/production).
2. [Install the Cloud Foundry CLI](https://docs.cloudfoundry.org/cf-cli/install-go-cli.html)
3. Login:
```bash
cf login -a api.london.cloud.service.gov.uk -u <your_username>
```
4. Set your deployment target (staging/production):
```bash
cf target -o dluhc-core -s <deploy_environment>
```
5. Deploy:
```bash
cf push dluhc-core --strategy rolling
```
This will use the [manifest file](https://github.com/communitiesuk/submit-social-housing-lettings-and-sales-data/blob/main/manifest.yml).
A failed GitHub deployment action will occasionally leave a Cloud Foundry deployment in a broken state. As a result, all subsequent GitHub deployment actions will also fail with the message `Cannot update this process while a deployment is in flight`. To clear this state, cancel the in-flight deployment:
```bash
cf cancel-deployment dluhc-core
```
You would then need to check the logs and fix the issue that caused the initial deployment to fail.
## CI/CD
When a commit is made to `main` the following GitHub action jobs are triggered:
1. **Test**: RSpec runs our test suite
2. **Deploy**: If the Test stage passes, this job will deploy the app to our GOV.UK PaaS account using the Cloud Foundry CLI
2. **AWS Deploy**: If the Test stage passes, this job will deploy the app to AWS
When a pull request is opened to `main` only the Test stage runs.
## Review apps
When a pull request is opened a review app will be spun up. Each review app connects to its own PostgreSQL and Redis instances and has its own worker.
When a pull request is opened a review app will be spun up. Each review app has its own ECS Fargate cluster and Redis instances (plus any infrastructure to enable this), while the rest is shared.
The review app GitHub pipeline is independent of any test pipeline, so it will attempt to deploy regardless of the state the code is in.
The usual seeding process takes place when the review app boots, so there will be some minimal data that can be used to log in with. 2FA has been disabled in the review apps for easier access.
The app boots in a new environment called `review`. As such, this is the environment you should filter by for Sentry errors or to change any config.
The app boots in a new environment called `development`. As such, this is the environment you should filter by for Sentry errors or to change any config.
After a successful deployment, a comment will be added to the pull request with the URL to the review app for your convenience. When a pull request is updated (e.g. more code is added), the new code will be re-deployed.
Once a pull request has been closed, the review app infrastructure will be torn down to save on costs. Should you wish to re-open a closed pull request, the review app will be spun up again.
### How to fix review app deployment failures
One reason a review app deployment might fail is that it is attempting to run migrations which conflict with data in the database. For example you might have introduced a unique constraint, but the database associated with the review app has duplicate data in it that would violate this constraint, and so the migration cannot be run. There are two main ways to remedy this:
**Method 1 - Edit database via console**
1. Log in to Cloud Foundry
```bash
cf login -a api.london.cloud.service.gov.uk -u <your_username>
```
* Your username should be the email address you signed up to GOV.UK PaaS with.
* Choose the dev environment whilst logging in.
2. If you were already logged in to Cloud Foundry, then instead just target the dev environment
```bash
cf target -o dluhc-core -s dev
```
3. Find the name of your app
```bash
cf apps
```
* The app name will be in this format: `dluhc-core-review-<pull-request-number>`.
One reason a review app deployment might fail is that it is attempting to run migrations which conflict with data in the database. For example you might have introduced a unique constraint, but the database associated with the review app has duplicate data in it that would violate this constraint, and so the migration cannot be run.
## Destroying/recreating infrastructure
Things to watch out for when destroying/creating infra:
- All resources
  - The `prevent_destroy` argument of the `lifecycle` meta-argument will stop you destroying things. Set it to `false` before trying to destroy.
- Database
  - `skip_final_snapshot` being `false` will prevent you from destroying the DB without creating a final snapshot.
- Load Balancer
  - Sometimes when creating infra, you may see the following error message during a `terraform apply`: `failure configuring LB attributes: InvalidConfigurationRequest: Access Denied for bucket: <load-balancer-access-log-bucket-name>. Please check S3bucket permission`. To get around this, you may have to wait a few minutes and try applying again to ensure everything is fully updated (the error shouldn’t appear on the second attempt). It’s unclear what the exact cause is, but as this is related to infra that enables load balancer access logging, it is suspected there might be a delay in the S3 bucket permissions being realised or in the load balancer recognising it can access the bucket.
- S3
  - Terraform won’t let you delete buckets that have objects in them.
- Secrets
  - If you destroy secrets, they will actually be marked as ‘scheduled to delete’, which takes effect after a minimum of 7 days. You can’t recreate secrets with the same name during this period. If you want to destroy a secret immediately, you need to do it from the command line (using your staging developer role, rather than your MHCLG-wide role used to apply Terraform) with this command: `aws secretsmanager delete-secret --force-delete-without-recovery --secret-id <secret-arn>`. (Note that if a secret is marked as scheduled to delete, you can undo this in the console to make it an ‘active’ secret again.)
  - You may need to manually re-enter secret values into Secrets Manager at some point. When you do, just paste the secret value as plain text (don’t enter a key name, or format it as JSON).
- ECS
  - Sometimes task definitions don’t get deleted. You may need to manually delete them.
  - After destroying the DB, you’ll need to make sure the ad hoc ECS task which seeds the database gets run in order to set up the database correctly.
- SNS
  - When creating an email subscription in an environment, Terraform will look up the email to use as the subscription endpoint from Secrets Manager. If you haven’t already created this secret (e.g. by running `terraform apply -target="module.monitoring" -var="create_secrets_first=true"`), the subscription creation will error because the secret doesn’t exist yet and so its value can’t be retrieved. If this happens, go to Secrets Manager in the console and enter the desired email (as plain text, no quotation marks or anything else required) as the value of the secret (which is most likely called `MONITORING_EMAIL`). Then run another apply with Terraform, and this time it should succeed.
- AWS CloudWatch (for general application / infrastructure logging)
- Sentry (for application error logging)
We use self-hosted Prometheus and Grafana for monitoring infrastructure metrics. These are run in a dedicated Gov PaaS space called "monitoring" and are deployed as Docker images using GitHub action pipelines. The repository for these and more information is here: [dluhc-data-collection-monitoring](https://github.com/communitiesuk/dluhc-data-collection-monitoring).
### CloudWatch
The CloudWatch service can be accessed from the AWS Console. You should authenticate onto the infrastructure environment whose logs you want to check.
From CloudWatch, navigate to the desired log group (e.g. for the app task running on ECS) and open the desired log stream, in order to read its log “events”.
Alternatively, you can navigate to the specific AWS service / resource in question (e.g. ECS tasks), select the instance of interest (e.g. a specific ECS task), and find the “logs” tab (or similar) to view the log “events”.
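The same log events can also be read from a terminal. The sketch below assumes AWS CLI v2 (which provides `aws logs tail`) and credentials for the relevant environment; the helper name `tail_log_group` is hypothetical, and the log group name is a placeholder you would look up first.

```bash
#!/usr/bin/env bash
# Sketch only: tail_log_group is a hypothetical helper; requires AWS CLI v2.

# Stream events from a CloudWatch log group, starting from a relative time
# (defaults to the last hour) and following new events as they arrive.
tail_log_group() {
  log_group="$1"
  since="${2:-1h}"
  aws logs tail "$log_group" --since "$since" --follow
}

# Usage:
#   aws logs describe-log-groups --query 'logGroups[].logGroupName' --output text
#   tail_log_group <log-group-name> 30m
```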
## Performance monitoring and alerting
### Sentry
To access Sentry, ensure you have been added to the DLUHC account.
Generally, error logs in Sentry will also be present somewhere in the CloudWatch logs, but they are easier to assess in Sentry (e.g. the number of occurrences over a time period). The logs in Sentry are created by the application when it makes `Rails.logger.error` calls.
For application error and performance monitoring we use managed [Sentry](https://sentry.io/organizations/dluhc-core). You will need to be added to the DLUHC account to access this. It triggers Slack notifications to the #team-data-collection-alerts channel for all application errors in staging and production, and for any controller endpoints that have a P95 transaction duration > 250ms over a 24-hour period.
## Debugging
### Application infrastructure
For debugging or investigating infrastructure issues, you can use the AWS CloudWatch automatic dashboards, e.g. to check whether the database is running low on storage space, or how long ECS has had very high compute usage.
They can be found in the CloudWatch service in the AWS Console by going to dashboards → automatic dashboards and selecting the desired dashboard (e.g. Elastic Container Service).
Alternatively, you can navigate to the AWS resource in question (e.g. the RDS database), select the instance of interest, and open the “monitoring” / “metrics” tab (or similar), which can provide further useful information.
## Logs
### Exec into a container
You can open a terminal directly on a running container / app, in order to run some commands that may help with debugging an issue.
To do this, you will need to “exec” into the container.
#### Prerequisites
- AWS CLI
- AWS Session Manager plugin (see “Install the Session Manager plugin for the AWS CLI” in the AWS Systems Manager documentation)
- AWS access
For log persistence we use a managed ELK (Elasticsearch, Logstash, Kibana) stack provided by [Logit](https://logit.io/). You will need to be added to the DLUHC account to access this. Logs are retained for 14 days with a daily limit of 2GB.
#### Accessing the rails console
1. Find the name of the relevant cluster
2. Find the ARN of a relevant task
3. In a shell using suitable AWS credentials for the relevant account (e.g. the development, staging, or production account), run `aws ecs execute-command --cluster <cluster-name> --task <task-arn> --interactive --command "rails c"`
Logs are also available from Gov PaaS directly via CLI:
N.B. You can run other commands on the container similarly.
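The three steps for reaching the Rails console can be sketched as a pair of shell helpers. This is an illustration only: the function names `first_task_arn` and `rails_console` are hypothetical, and the sketch assumes the first task listed for the cluster is the one you want. If the cluster runs several services, narrow the lookup (for example with the `--service-name` option of `aws ecs list-tasks`).

```bash
#!/usr/bin/env bash
# Sketch only: both helpers are hypothetical and assume the AWS CLI and the
# Session Manager plugin are installed, with credentials for the right account.

# Print the ARN of the first task listed for the given cluster.
first_task_arn() {
  local cluster
  cluster="$1"
  aws ecs list-tasks --cluster "$cluster" --query 'taskArns[0]' --output text
}

# Open a Rails console on the first task listed for the given cluster.
rails_console() {
  local cluster
  local task_arn
  cluster="$1"
  task_arn="$(first_task_arn "$cluster")" || return 1
  # With --output text, an empty result is printed as the string "None".
  if [ -z "$task_arn" ] || [ "$task_arn" = "None" ]; then
    echo "No tasks found in cluster: $cluster" >&2
    return 1
  fi
  aws ecs execute-command --cluster "$cluster" --task "$task_arn" \
    --interactive --command "rails c"
}

# Usage:
#   aws ecs list-clusters --output text   # step 1: find the cluster name
#   rails_console <cluster-name>          # steps 2 and 3
```

To run a different command on the container, replace `"rails c"` in the final call.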