CLDC-2556 Update infrastructure documentation (#2045)

* Update gov paas - aws documentation * Remove architecture diagram
2 years ago · 01d9ab0de4
6 changed files with 122 additions and 262 deletions
--- a/README.md
+++ b/README.md
@@ -13,10 +13,6 @@ Ruby on Rails app that handles the submission of lettings and sales of social ho
 * [API browser](https://communitiesuk.github.io/submit-social-housing-lettings-and-sales-data/api) (using this [OpenAPI specification](docs/api/v1.json))
 * [Design history](https://core-design-history.herokuapp.com)
 ## System architecture
 ![View of system architecture](docs/images/architecture.drawio.png)
 ## User interface
 ![View of the logs list](docs/images/service.png)
--- a/docs/adr/adr-020-migration-to-aws.md
+++ b/docs/adr/adr-020-migration-to-aws.md
@@ -0,0 +1,9 @@
 ---
 parent: Architecture decisions
 ---
 # 020: Migration to AWS
 GOV.UK PaaS is being decommissioned at the end of this year, and by 23 December 2023 all services hosted on GOV.UK PaaS will need to have migrated to an alternative hosting platform.
 Like other DLUHC services, we are moving our service directly to DLUHC-owned AWS infrastructure.
--- a/docs/images/architecture.drawio.png
+++ b/docs/images/architecture.drawio.png
--- a/docs/index.md
+++ b/docs/index.md
@@ -12,8 +12,6 @@ Each data collection window runs from 1 April to 1 April the following year (plu
 ADD (Analytics & Data Directorate) statisticians are the other primary users of the service. The data collected is transferred to DLUHC's consolidated data store (CDS) via nightly XML exports to an S3 bucket. CDS ingests and transforms this data, ultimately storing it in an MS SQL database and exposing it to analysts and statisticians via Amazon Workspaces.
 ![Diagram of the CORE system architecture](https://raw.githubusercontent.com/communitiesuk/submit-social-housing-lettings-and-sales-data/main/docs/images/architecture.drawio.png)
 ## Users
 External data providing organisations have 2 main user types:
--- a/docs/infrastructure.md
+++ b/docs/infrastructure.md
@@ -4,48 +4,51 @@ nav_order: 5
 # Infrastructure
-## Configuration
+## Current infrastructure
-
+
-On [GOV.UK PaaS](https://www.cloud.service.gov.uk/), service credentials are appended to the environment variable `VCAP_SERVICES` when services [are bound](https://docs.cloud.service.gov.uk/deploying_services/s3/#bind-an-aws-s3-bucket-to-your-app) to an application.
+Currently, there are four environments with infrastructure:
-Such services include datastores and S3 buckets.
+- Meta
-
+- Development (Review Apps)
-Our application uses S3 and Redis clients and supports two different ways of parsing their configuration:
+- Staging
-* Via the environment variable `VCAP_SERVICES` using the `PaasConfigurationService` class
+- Production
-* Via the environment variables `S3_CONFIG` and `REDIS_CONFIG` using the `EnvConfigurationService` class
+
-
+### Meta
-`S3_CONFIG` and `REDIS_CONFIG` are populated using a similar structure than `VCAP_SERVICES`:
+This holds the Terraform “backend” and the ECR(s).
-
+The Terraform “backend” consists of:
-S3_CONFIG:
+- S3 buckets - for storing Terraform state files. One for all non-production environments (including the meta environment itself), and another just for production.
-```json
+- DynamoDB - for managing access and locking of all state files.
-[
+
-  {
+The ECR(s) are:
-    "instance_name": "bucket_1",
+- core - holds the application Docker images.
-    "credentials": {
+- db-migration - holds the Docker images curated to help migrate a DB from PaaS to AWS.
-      "aws_access_key_id": "123",
+- s3-migration - holds the Docker images curated to help migrate S3 files from PaaS to AWS.
-      "aws_secret_access_key": "456",
+N.B. the migration ECRs may or may not be present, depending on whether the Terraform has been configured to create migration infrastructure. The migration infrastructure is only used to help migrate the DB and S3 from PaaS to AWS, so it is usually only present temporarily.
-      "aws_region": "eu-west-1",
+
-      "bucket_name": "my-bucket"
+### Development / Staging / Production
-    }
+These are the main environments holding the “application” infrastructure. 
-  }
+Though not exhaustive, each of them will generally contain the following key components:
-]
+- ECS Fargate cluster
-```
+- RDS (PostgreSQL database)
-
+- ElastiCache (Redis data store)
-REDIS_CONFIG:
+- S3 buckets
-```json
+    - One for Bulk upload (sometimes also referred to as the CSV bucket)
-[
+    - One for CDS Export 
-  {
+- VPC
-    "instance_name": "redis_1",
+- Private subnets
-    "credentials": {
+- Public subnets
-      "uri": "redis_uri"
+- Load Balancer
-    }
+- Other misc. networking components (e.g. routing tables, gateways)
-  }
+- CloudFront (Content Delivery Network)
-]
+- AWS Shield (DDoS protection, when enabled)
-```
+- WAF (Firewall)
-
+
-In order to switch from using [GOV.UK PaaS](https://www.cloud.service.gov.uk/) provided services to external ones, instances of `PaasConfigurationService` need to be replaced by `EnvConfigurationService`.
+### Development / Review Apps
-This assumes that `S3_CONFIG` or/and `REDIS_CONFIG` are available.
+The development environment is used for Review Apps, and has some infrastructure that is created per-review-app and some that is shared by all apps. 
-
+In general, each review app has its own ECS Fargate cluster and Redis instances (plus any infrastructure to enable this), while the rest is shared.
-Please check `full_import.rake` and `rack_attack.rb` for examples of how the configuration is used.
+
 Where to find the Infrastructure?
 The infrastructure is managed as code. 
 In the terraform folder of the codebase, there will be dedicated sub-folders for each of the aforementioned environments, where all the infrastructure for them is defined.
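
As a rough sketch of working with that layout, the helper below maps an environment name to its Terraform sub-folder. The `terraform/<environment>` paths and the function name `terraform_dir_for` are assumptions for illustration only; check the actual folder names in the repository.

```bash
# Hypothetical helper: assumes the sub-folders are named after the environments
# (terraform/meta, terraform/development, ...). Adjust to the real layout.
terraform_dir_for() {
  case "$1" in
    meta|development|staging|production)
      printf 'terraform/%s\n' "$1"
      ;;
    *)
      echo "unknown environment: $1" >&2
      return 1
      ;;
  esac
}
```

From the repository root, something like `cd "$(terraform_dir_for staging)" && terraform init && terraform plan` would then show pending changes for that environment without applying them.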
 ## Deployment (Pipeline — Recommended)
@@ -61,224 +64,50 @@ To deploy you need to:
 6. Post success message on Slack.
 7. Tag tickets as ‘Released’ and move tickets to done on JIRA.
 ## Deployment (Manual)
 It is unlikely you will need to deploy manually as the GitHub actions method supersedes this one. This application is running on [GOV.UK PaaS](https://www.cloud.service.gov.uk/). To deploy you need to:
 1. Contact your organisation manager to get an account in `dluhc-core` organization and in the relevant spaces (staging/production).
 2. [Install the Cloud Foundry CLI](https://docs.cloudfoundry.org/cf-cli/install-go-cli.html)
 3. Login:
    ```bash
    cf login -a api.london.cloud.service.gov.uk -u <your_username>
    ```
 4. Set your deployment target (staging/production):
    ```bash
    cf target -o dluhc-core -s <deploy_environment>
    ```
 5. Deploy:
    ```bash
    cf push dluhc-core --strategy rolling
    ```
    This will use the [manifest file](https://github.com/communitiesuk/submit-social-housing-lettings-and-sales-data/blob/main/manifest.yml)
 Once the app is deployed:
 1. Get a Rails console:
    ```bash
    cf ssh dluhc-core-staging -t -c "/tmp/lifecycle/launcher /home/vcap/app 'rails console' ''"
    ```
 2. Check logs:
    ```bash
    cf logs dluhc-core-staging --recent
    ```
 ### Troubleshooting deployments
 A failed GitHub deployment action will occasionally leave a Cloud Foundry deployment in a broken state. As a result, all subsequent GitHub deployment actions will also fail with the message `Cannot update this process while a deployment is in flight`. To clear this, cancel the in-flight deployment:
 ```bash
 cf cancel-deployment dluhc-core
 ```
 You would then need to check the logs and fix the issue that caused the initial deployment to fail.
 ## CI/CD
 When a commit is made to `main` the following GitHub action jobs are triggered:
 1. **Test**: RSpec runs our test suite
-2. **Deploy**: If the Test stage passes, this job will deploy the app to our GOV.UK PaaS account using the Cloud Foundry CLI
+2. **AWS Deploy**: If the Test stage passes, this job will deploy the app to AWS
 When a pull request is opened to `main` only the Test stage runs.
 ## Review apps
-When a pull request is opened a review app will be spun up. The reviews apps connect to their own PostgreSQL and Redis instances with its own worker.
+When a pull request is opened a review app will be spun up. Each review app has its own ECS Fargate cluster and Redis instances (plus any infrastructure to enable this), while the rest is shared.
 The review app github pipeline is independent of any test pipeline and therefore it will attempt to deploy regardless of the state the code is in.
 The usual seeding process takes place when the review app boots, so there will be some minimal data that can be used to log in with. 2FA has been disabled in the review apps for easier access.
-The app boots in a new environment called `review`. As such this is the environment you should filter by for sentry errors or to change any config.
+The app boots in a new environment called `development`. As such, this is the environment you should filter by for Sentry errors or to change any config.
 After a successful deployment, a comment will be added to the pull request with the URL to the review app for your convenience. When a pull request is updated (e.g. more code is added), the new code will be re-deployed.
 Once a pull request has been closed, the review app infrastructure will be torn down to save on costs. Should you wish to re-open a closed pull request, the review app will be spun up again.
-### How to fix review app deployment failures 
+### Review app deployment failures 
-
+
-One reason a review app deployment might fail is that it is attempting to run migrations which conflict with data in the database. For example you might have introduced a unique constraint, but the database associated with the review app has duplicate data in it that would violate this constraint, and so the migration cannot be run. There are two main ways to remedy this:
+One reason a review app deployment might fail is that it is attempting to run migrations which conflict with data in the database. For example you might have introduced a unique constraint, but the database associated with the review app has duplicate data in it that would violate this constraint, and so the migration cannot be run.
-
+
-**Method 1 - Edit database via console**
+## Destroying/recreating infrastructure
-1. Log in to Cloud Foundry
+
-    ```bash
+Things to watch out for when destroying/creating infra:
-    cf login -a api.london.cloud.service.gov.uk -u <your_username>
+- All resources
-    ```
+    - The lifecycle meta-argument `prevent_destroy` will stop you destroying things. Best to set this to `false` before trying to destroy!
-    * Your username should be the email address you signed up to GOVUK PaaS with.
+- Database
-    * Choose the dev environment whilst logging in.
+    - `skip_final_snapshot` being `false` will prevent you from destroying the db without creating a final snapshot.
-2. If you were already logged in then Cloud Foundry, then instead just target the dev environment
+- Load Balancer
-    ```bash
+    - Sometimes when creating infra, you may see the following error message during a `terraform apply`: `failure configuring LB attributes: InvalidConfigurationRequest: Access Denied for bucket: <load-balancer-access-log-bucket-name>. Please check S3bucket permission`. To get around this, you may have to wait a few minutes and try applying again once everything has fully updated (the error shouldn’t appear on the second attempt). It’s unclear what the exact cause is, but as this relates to infra that enables load balancer access logging, it is suspected there is a delay in the S3 bucket permissions taking effect or in the load balancer recognising it can access the bucket.
-    cf target -o dluhc-core -s dev
+- S3
-    ```
+    - Terraform won’t let you delete buckets that have objects in them.
-3. Find the name of your app
+- Secrets
-    ```bash
+    - If you destroy secrets, they will actually be marked as ‘scheduled to delete’, which will take effect after a minimum of 7 days. You can’t recreate secrets with the same name during this period. If you want to destroy immediately, you need to do it from the command line (using your staging developer role, rather than your MHCLG-wide role used to apply Terraform) with this command: `aws secretsmanager delete-secret --force-delete-without-recovery --secret-id <secret-arn>`. (Note that if a secret is marked as scheduled to delete, you can undo this in the console to make it an ‘active’ secret again.)
-    cf apps
+    - You may need to manually re-enter secret values into Secrets Manager at some point. When you do, just paste the secret value as plain text (don’t enter a key name, or format it as JSON).
-    ```
+- ECS
-    * The app name will be in this format: `dluhc-core-review-<pull-request-number>`.
+    - Sometimes task definitions don’t get deleted. You may need to manually delete them.
-4. Open a console for your app
+    - After destroying the db, you’ll need to make sure the ad hoc ECS task which seeds the database gets run in order to set up the database correctly.
-    ```bash
+- SNS
-    cf ssh <app-name-here> -t -c "/tmp/lifecycle/launcher /home/vcap/app 'rails console' ''"
+    - When creating an email subscription in an environment, Terraform will look up the email to use as the subscription endpoint from Secrets Manager. If you haven’t already created this (e.g. by running `terraform apply -target="module.monitoring" -var="create_secrets_first=true"`) then the subscription creation will error, because it can’t retrieve the value of the secret (which doesn’t exist yet). If this happens, you’ll need to go to Secrets Manager in the console and enter the desired email (as plaintext, no quotation marks or anything else required) as the value of the secret (which is most likely called `MONITORING_EMAIL`). Then run another apply with Terraform and this time it should succeed.
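
The S3 and Secrets points above can be sketched as two small helpers that print the relevant AWS CLI commands for review rather than running them. The function names are hypothetical. The `delete-secret` command is the one quoted above, and `aws s3 rm --recursive` is assumed here to be an acceptable way to empty a bucket (note that it does not remove old object versions in a versioned bucket).

```bash
# Print the command that force-deletes a secret immediately (no recovery window).
force_delete_secret_cmd() {
  if [ -z "$1" ]; then
    echo "usage: force_delete_secret_cmd <secret-arn>" >&2
    return 1
  fi
  printf 'aws secretsmanager delete-secret --force-delete-without-recovery --secret-id "%s"\n' "$1"
}

# Print the command that empties a bucket so Terraform can then delete it.
empty_bucket_cmd() {
  if [ -z "$1" ]; then
    echo "usage: empty_bucket_cmd <bucket-name>" >&2
    return 1
  fi
  printf 'aws s3 rm "s3://%s" --recursive\n' "$1"
}
```

Printing instead of executing is a deliberate choice for destructive operations: the output can be checked against the intended environment and then pasted into a shell that is using the appropriate role.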
    ```
 5. Edit the database as appropriate, e.g. delete dodgy data and recreate correctly
 **Method 2 - Nuke and restart**
 1. Find the name of your app
    ```bash
    cf apps
    ```
    * The app name will be in this format: `dluhc-core-review-<pull-request-number>`.
 2. Delete the app
    ```bash
    cf delete <app-name-here>
    ```
 3. Find the name of the matching Postgres service
 	```bash
    cf services
    ```
 	* The service name will be in this format: `dluhc-core-review-<pull-request-number>-postgres`.
 4. Delete the service
 	```bash
    cf delete-service <service-name-here>
    ```
    * Use `cf services` or `cf service <service-name-here>` to check the operation status.
    * There's no need to delete the Redis service.
 5. Re-run the whole review app pipeline in GitHub
    * If it fails it's likely that the deletion from the previous step hadn't completed yet. So just wait a few minutes and re-run the pipeline again.
 ## Setting up Infrastructure for a new environment
 ### Staging
 1. Login:
    ```bash
    cf login -a api.london.cloud.service.gov.uk -u <your_username>
    ```
 2. Set your deployment target (staging):
    ```bash
    cf target -o dluhc-core -s staging
    ```
 3. Create required Postgres, Redis and S3 bucket backing services (this will take ~15 mins to finish creating):
    ```bash
    cf create-service postgres tiny-unencrypted-13 dluhc-core-staging-postgres
    cf create-service redis micro-6.x dluhc-core-staging-redis
    cf create-service aws-s3-bucket default dluhc-core-staging-csv-bucket
    cf create-service aws-s3-bucket default dluhc-core-staging-import-bucket
    cf create-service aws-s3-bucket default dluhc-core-staging-export-bucket
    ```
 4. Deploy manifest:
    ```bash
    cf push dluhc-core-staging --strategy rolling
    ```
 5. Bind S3 services to app:
    ```bash
    cf bind-service dluhc-core-staging dluhc-core-staging-csv-bucket
    cf bind-service dluhc-core-staging dluhc-core-staging-redis
    cf bind-service dluhc-core-staging dluhc-core-staging-import-bucket -c '{"permissions": "read-write"}'
    cf bind-service dluhc-core-staging dluhc-core-staging-export-bucket -c '{"permissions": "read-write"}'
    ```
 6. Create service keys for accessing the S3 buckets from outside Gov PaaS:
    ```bash
    cf create-service-key dluhc-core-staging-csv-bucket csv-bucket -c '{"allow_external_access": true}'
    cf create-service-key dluhc-core-staging-import-bucket data-import -c '{"allow_external_access": true}'
    cf create-service-key dluhc-core-staging-export-bucket data-export -c '{"allow_external_access": true, "permissions": "read-only"}'
    ```
 ### Production
 1. Login:
    ```bash
    cf login -a api.london.cloud.service.gov.uk -u <your_username>
    ```
 2. Set your deployment target (production):
    ```bash
    cf target -o dluhc-core -s production
    ```
 3. Create required Postgres, Redis and S3 bucket backing services (this will take ~15 mins to finish creating):
    ```bash
    cf create-service postgres small-ha-13 dluhc-core-production-postgres
    cf create-service redis micro-ha-6.x dluhc-core-production-redis
    cf create-service aws-s3-bucket default dluhc-core-production-csv-bucket
    cf create-service aws-s3-bucket default dluhc-core-production-import-bucket
    cf create-service aws-s3-bucket default dluhc-core-production-export-bucket
    ```
 4. Deploy manifest:
    ```bash
    cf push dluhc-core-production --strategy rolling
    ```
 5. Bind S3 services to app:
    ```bash
    cf bind-service dluhc-core-production dluhc-core-production-csv-bucket
    cf bind-service dluhc-core-production dluhc-core-production-redis
    cf bind-service dluhc-core-production dluhc-core-production-import-bucket -c '{"permissions": "read-write"}'
    cf bind-service dluhc-core-production dluhc-core-production-export-bucket -c '{"permissions": "read-write"}'
    ```
 6. Create service keys for accessing the S3 buckets from outside Gov PaaS:
    ```bash
    cf create-service-key dluhc-core-production-csv-bucket dluhc-core-production-csv-bucket-service-key -c '{"allow_external_access": true}'
    cf create-service-key dluhc-core-production-import-bucket data-import -c '{"allow_external_access": true}'
    cf create-service-key dluhc-core-production-export-bucket data-export -c '{"allow_external_access": true, "permissions": "read-only"}'
    ```
--- a/docs/monitoring.md
+++ b/docs/monitoring.md
@@ -2,20 +2,48 @@
 nav_order: 6
 ---
-# Monitoring
+# Logs and Debugging
 ## Logs
 Logs can be found in two locations:
 - AWS CloudWatch (for general application / infrastructure logging)
 - Sentry (for application error logging)
-We use self-hosted Prometheus and Grafana for monitoring infrastructure metrics. These are run in a dedicated Gov PaaS space called "monitoring" and are deployed as Docker images using GitHub action pipelines. The repository for these and more information is here: [dluhc-data-collection-monitoring](https://github.com/communitiesuk/dluhc-data-collection-monitoring).
+### CloudWatch
 The CloudWatch service can be accessed from the AWS Console. You should authenticate onto the infrastructure environment whose logs you want to check.
 From CloudWatch, navigate to the desired log group (e.g. for the app task running on ECS) and open the desired log stream, in order to read its log “events”.
 Alternatively, you can navigate to the specific AWS service / resource in question (e.g. ECS tasks), select the instance of interest (e.g. a specific ECS task), and find the “logs” tab (or similar) to view the log “events”.
-## Performance monitoring and alerting
+### Sentry
 To access Sentry, ensure you have been added to the DLUHC account.
 Generally, error logs in Sentry will also be present somewhere in the CloudWatch logs, but they will be easier to assess here (e.g. number of occurrences over a time period). The logs in Sentry are created by the application when it makes `Rails.logger.error` calls.
-For application error and performance monitoring we use managed [Sentry](https://sentry.io/organizations/dluhc-core). You will need to be added to the DLUHC account to access this. It triggers slack notifications to the #team-data-collection-alerts channel for all application errors in staging and production and for any controller endpoints that have a P95 transaction duration > 250ms over a 24 hour period.
+## Debugging
 ### Application infrastructure
 For debugging / investigating infrastructure issues you can use the AWS CloudWatch automatic dashboards (e.g. to check whether the database is running out of storage space, or how long ECS has had very high compute usage).
 They can be found in the CloudWatch service on the AWS console, by going to dashboards → automatic dashboards, and selecting the desired dashboard (e.g. Elastic Container Service).
 Alternatively, you can navigate to the AWS resource in question (e.g. RDS database), select the instance of interest, and open the “monitoring” / “metrics” tab (or similar), which can provide additional useful information.
-## Logs
+### Exec into a container
 You can open a terminal directly on a running container / app, in order to run some commands that may help with debugging an issue. 
 To do this, you will need to “exec” into the container.
 #### Prerequisites
 - AWS CLI
 - AWS Session Manager plugin (see “Install the Session Manager plugin for the AWS CLI” in the AWS Systems Manager documentation)
 - AWS access
-For log persistence we use a managed ELK (Elasticsearch, Logstash, Kibana) stack provided by [Logit](https://logit.io/). You will need to be added to the DLUHC account to access this. Logs are retained for 14 days with a daily limit of 2GB.
+#### Accessing the rails console
 1. Find the cluster name of the relevant cluster
 2. Find the task arn of a relevant task
 3. In a shell using suitable AWS credentials for the relevant account (e.g. the development, staging, or production account), run `aws ecs execute-command --cluster cluster-name --task task-arn --interactive --command "rails c"`
-Logs are also available from Gov PaaS directly via CLI:
+N.B. You can run other commands on the container similarly.
 ```bash
 cf logs <gov-paas-space-name> --recent
 ```
 env=staging
 taskArns=$(aws ecs list-tasks --cluster "core-$env-app" --query "taskArns[*]")
 aws ecs describe-tasks --cluster "core-$env-app" --tasks "${taskArns[@]}" --query "tasks[*].{arn:taskArn, status:lastStatus, startedAt:startedAt, group:group, image:containers[0].image}" --output text
 ```
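
For the exec steps described under “Accessing the rails console”, a minimal sketch of assembling the command is shown below. The function name `ecs_exec_cmd` is hypothetical, and it only prints the command so it can be checked before running. The cluster name and task ARN still need to be looked up first (for example with `aws ecs list-clusters` and `aws ecs list-tasks --cluster <cluster-name>`).

```bash
# Print the `aws ecs execute-command` invocation for a given cluster and task.
# The third argument is the command to run in the container (defaults to "rails c").
ecs_exec_cmd() {
  if [ -z "$1" ] || [ -z "$2" ]; then
    echo "usage: ecs_exec_cmd <cluster-name> <task-arn> [command]" >&2
    return 1
  fi
  printf 'aws ecs execute-command --cluster "%s" --task "%s" --interactive --command "%s"\n' \
    "$1" "$2" "${3:-rails c}"
}
```

Passing a different third argument (e.g. `ecs_exec_cmd my-cluster my-task "bash"`) covers the “other commands” case mentioned above, assuming that command exists in the container image.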
 ### Database
 In order to investigate or look more closely at the database, you can exec into a container as above, and use the rails console to query the database.