This repository contains a solution for Multimodal PII (Personally Identifiable Information) Redaction using Amazon Bedrock. The solution provides a workflow for detecting and redacting PII from emails and attachments using various AWS services including Amazon Bedrock Data Automation, Amazon Bedrock Guardrails, Lambda, S3, DynamoDB, and more.
The following diagram outlines the solution architecture.
The diagram illustrates the backend PII detection and redaction workflow and the frontend application user interface orchestrated by AWS Lambda and Amazon EventBridge. The process follows these steps:
- The workflow starts with the user sending an email to the incoming email server hosted on Amazon Simple Email Service (Amazon SES)(optional).
- Amazon SES then stores the emails and attachments in an Amazon Simple Storage Service (S3) landing bucket.
- An S3 event notification triggers the initial processing AWS Lambda function that generates a unique case ID and creates a tracking record in Amazon DynamoDB.
- Lambda orchestrates the PII detection and redaction workflow by extracting email body and attachments from the email and saving in raw email bucket followed by invoking Amazon Bedrock Data Automation and Amazon Bedrock Guardrails for detecting and redacting PII.
- Amazon Bedrock Data Automation processes attachments to extract text from the files.
- Amazon Bedrock Guardrails detects and redacts the PII from both email body and text from attachments, and then stores the redacted content in another S3 bucket.
- DynamoDB tables are updated with email messages, folders metadata, and email filtering rules.
- An Amazon EventBridge Scheduler is used to run the Rules Engine Lambda on a schedule which will process new emails that have yet to be categorized into folders based on enabled email filtering rules criteria.
- The Rules Engine Lambda also communicates with DynamoDB to access the messages table and the rules table.
- Users can access the application user interface where Amazon API Gateway manages user API requests and routes requests to render the user interface through S3 static hosting.
- A Portal API Lambda fetches the case details based on user requests.
- The static assets served by API Gateway are stored in a private S3 bucket.
- Amazon CloudWatch and AWS CloudTrail provide visibility into the PII detection and redaction process, while Amazon Simple Notification Service delivers real-time alerts for any failures, ensuring immediate attention to issues.
This application requires the installation of the following software tools:
- Python v3.12 or higher
- Node v18 or higher
- NPM v9.8 or higher
- AWS CDK v2.166 or higher
- Terminal/CLI such as macOS Terminal, PowerShell or Windows Terminal, or the Linux command line. AWS CloudShell can also be used when all code is located within an AWS account.
All CloudFormation stacks need to be deployed within the same AWS account.
An existing VPC that contains 3 private subnets with no internet access is needed.
The solution contains 3 stacks (2 required, 1 optional) that will be deployed in your AWS account:
-
S3Stack - Provisions the core infrastructure including S3 buckets for raw and redacted email storage with automatic lifecycle policies, a DynamoDB table for email metadata tracking with time-to-live (TTL) and global secondary indexes, and VPC security groups for secure Lambda function access. It also creates IAM roles with comprehensive permissions for S3, DynamoDB, and Bedrock services, forming the secure foundation for the entire PII detection and redaction workflow.
-
ConsumerStack - Provisions the core processing infrastructure including Amazon Bedrock Data Automation projects for document text extraction and Bedrock Guardrails configured to anonymize comprehensive PII entities, along with Lambda functions for email and attachment processing with SNS topics for success/failure notifications. It also creates SES receipt rules for incoming email handling when a domain is configured, Lambda layers for dependency management, and S3 event notifications to trigger the email processing workflow automatically.
-
PortalStack (optional) - Provisions the optional web interface including a regional API Gateway, DynamoDB tables for folders and rules management, and S3 buckets for static web assets. It also creates EventBridge schedulers for automated rules processing, lambda functions for portal API handling and email forwarding functionality when configured.
Move directly to the Solution Deployment section below if you are not using Amazon SES Below Amazon SES Setup is optional. One can test the code without this setup as well. Steps to test the application with or without Amazon SES is covered in Testing section.
Set up Amazon SES with prod access and verify the domain/email identities for which the solution is to work. We also need to add the MX records in the DNS provider maintaining the domain. Please refer to the links below:
Create credentials for SMTP and save it in AWS Secrets Manager secret with name SmtpCredentials
. Please note that an IAM user is created for this process
If using any other name for secret update the context.json
line secret_name
with the name of the secret created.
Key for the user name in the secret should be smtp_username
and key for password should be smtp_password
when storing the same in AWS Secrets Manager
Run all of the following commands from within a terminal/CLI environment.
Clone the repository
git clone https://github.com/aws-samples/sample-bda-redaction.git
The infra/cdk.json
file tells the CDK Toolkit how to execute your app.
cd sample-bda-redaction/infra/
Optional: Create and activate a new Python virtual environment (make sure you are using python 3.12 as lambda is in CDK is configured for same. If using some other python version update CDK code to reflect the same in lambda runtime)
python3 -m venv .venv
. .venv/bin/activate
Upgrade pip
pip install --upgrade pip
Install Python packages
pip install -r requirements.txt
Create context.json
file
cp context.json.example context.json
Update the context.json
file with the correct configuration options for the environment.
Property Name | Default | Description | When to Create |
---|---|---|---|
vpc_id | "" |
VPC ID where resources will be deployed | VPC needs to be created prior to execution |
raw_bucket | "" |
S3 bucket storing raw messages and attachments | Created during CDK deployment |
redacted_bucket_name | "" |
S3 bucket storing redacted messages and attachments | Created during CDK deployment |
inventory_table_name | "" |
DynamoDB table name storing redacted message details | Created during CDK deployment |
resource_name_prefix | "" |
Prefix used when naming resources during the stack creation | During stack creation |
retention | 90 |
Number of days for retention of the messages in the redacted and raw S3 buckets | During stack creation |
The following properties are only required when the portal is being provisioned.
Property Name | Default | Description |
---|---|---|
environment | development |
The type of environment where resources will be provisioned. Values are development or production |
Use cases that require the usage of Amazon SES to manage redacted email messages will need to set the following configuration variables. Otherwise, these are optional.
Property Name | Description | Comment |
---|---|---|
domain | The verified domain or email name that is used for Amazon SES | This can be left blank if not setting up Amazon SES |
auto_reply_from_email | Email address of the "from" field of the email message. Also used as the email address where emails are forwarded from the Portal application | This can be left blank if not setting up the Portal |
secret_name | AWS Secrets Manager secret containing SMTP credentials for forward email functionality from the portal |
Run the following commands from the root of the infra
directory:
Bootstrap the AWS account to use AWS CDK
cdk bootstrap
At this point you can now synthesize the CloudFormation template for this code. Additional environment variables before the cdk synth suppresses the warnings. The deployment process should take approximately 10 min for a first-time deployment to complete.
JSII_DEPRECATED=quiet JSII_SILENCE_WARNING_UNTESTED_NODE_VERSION=quiet cdk synth --no-notices
Replace <<resource_name_prefix>>
with its chosen value and then run:
JSII_DEPRECATED=quiet JSII_SILENCE_WARNING_UNTESTED_NODE_VERSION=quiet cdk deploy <<resource_name_prefix>>-S3Stack <<resource_name_prefix>>-ConsumerStack --no-notices
Before starting the test make sure Amazon SES Email Receiving rule set created by the <<resource_name_prefix>>-ConsumerStack
stack is active. We can check by executing below command and make sure name in the output is <<resource_name_prefix>>-rule-set
aws ses describe-active-receipt-rule-set
If the name does not match or the output is blank, execute below to activate the same:
# Replace <<resource_name_prefix>> with resource_name_prefix used in context.json
aws ses set-active-receipt-rule-set --rule-set-name <<resource_name_prefix>>-rule-set
Once we have the correct rule set active, to test the application using Amazon SES, you just need to send email to the verified email or domain in Amazon SES, which will automatically trigger the redaction pipeline.
You can track the progress in the DynamoDB table <<inventory_table_name>>
. Inventory table name can be found on the resources tab in the AWS CloudFormation Console for the <<resource_name_prefix>>-S3Stack
stack and Logical ID EmailInventoryTable
.
A unique <<case_id>>
is generated and used in the DynamoDB inventory table for each email being processed. Once redaction is completed you can find the redacted email body in <<redacted_bucket_name>>/redacted/<<today_date>>/<<case_id>>/email_body/
and redacted attachments in <<redacted_bucket_name>>/redacted/<<today_date>>/<<case_id>>/attachments/
As described earlier the solution is used to redact any PII data in email body and attachments so to test the application we need to provide an email file which needs to be redacted.
We can do that without Amazon SES as well by directly uploading an email file to the raw S3 bucket. Raw bucket name can be found on the output tab in the AWS CloudFormation Console for <<resource_name_prefix>>-S3Stack stack and Export Name RawBucket
.
This will trigger the workflow of redacting the email body and attachment by S3 event notification triggering the Lambda.
For convenience a sample email is available in "infra/pii_redaction/sample_email" directory of the repository. Below are the steps to test application without Amazon SES using the same email file.
# Replace <<raw_bucket>> with raw bucket name created during deployment:
aws s3 cp pii_redaction/sample_email/ccvod0ot9mu6s67t0ce81f8m2fp5d2722a7hq8o1 s3://<<raw_bucket>>/domain_emails/
The above will trigger the redaction of email process. You can track the progress in the DynamoDB table <<inventory_table_name>>
. A unique <<case_id>>
is generated and used in DynamoDB inventory table for each email being processed.
Inventory table name can be found on the resources tab in the AWS CloudFormation Console for <<resource_name_prefix>>-S3Stack stack and Logical ID EmailInventoryTable
.
Once redaction is completed you can find the redacted email body in <<redacted_bucket_name>>/redacted/<<today_date>>/<<case_id>>/email_body/
and redacted attachments in <<redacted_bucket_name>>/redacted/<<today_date>>/<<case_id>>/attachments/
IMPORTANT: The installation of the portal is completely optional. You can skip this section and check the AWS console of the AWS account where the solution is deployed to view the resources created.
The portal serves as a web interface to manage the PII-redacted emails processed by the backend AWS infrastructure, allowing users to organize, view, and forward sanitized email content.
- List Messages: View processed emails with redacted content
- Forward Messages: Send redacted emails to other recipients (when SES is configured)
- Export Messages: Export email data for external use
- Message Details: View individual email content and attachments
- Create Folders: Organize emails into custom folders
- List Folders: View and manage existing folder structure
- Delete Folders: Remove folders when no longer needed
- Create Rules: Define automated email categorization rules
- List Rules: View existing email filtering rules
- Toggle Rules: Enable/disable rules without deletion
- Delete Rules: Remove rules permanently
This application requires the installation of the following software tools:
Synthesize the CloudFormation template for this code by navigating to the directory root of the solution. Then run the following command:
cd sample-bda-redaction/infra/
Optional: Create and activate a new Python virtual environment (if the virtual environment has not been created previously):
python3 -m venv .venv
. .venv/bin/activate
pip install -r requirements.txt
At this point you can synthesize the CloudFormation template for this code.
JSII_DEPRECATED=quiet JSII_SILENCE_WARNING_UNTESTED_NODE_VERSION=quiet cdk synth --no-notices
Deploy the React-based portal. Replace <<resource_name_prefix>>
with its chosen value:
JSII_DEPRECATED=quiet JSII_SILENCE_WARNING_UNTESTED_NODE_VERSION=quiet cdk deploy <<resource_name_prefix>>-PortalStack --no-notices
The first-time deployment should take approximately 10 minutes to complete.
The portal is protected by Basic Authentication. When using Basic Access Authentication the credentials are stored in AWS Secrets Manager using the secret provisioned in the PortalStack
CloudFormation stack that has been created via AWS CDK. The CloudFormation stack resource is named PiiRedactionPortalAuthSecret
.
Create a new environment file by navigating to the root of the app
directory and update the following variables in the .env
file (by copying the .env.example
file to .env
) using the following command to create the .env
file using a terminal/CLI environment:
cp .env.example .env
You can also create the file using your preferred text editor as well.
Environment Variable Name | Default | Description | Required |
---|---|---|---|
VITE_APIGW | "" |
Hostname from the API Gateway invoke URL without the path (remove /portal from the value) |
Yes |
VITE_BASE | /portal |
It specifies the path used to request the static files needed to render the portal | Yes |
VITE_API_PATH | /api |
It specifies the path needed to send requests to the API Gateway | Yes |
VITE_EMAIL_ENABLED | false |
It enables/disables the forward email function. Values are true to enable the feature or false to disable it. It should be set to false if you have not set up Amazon SES |
Yes |
Run the following commands from within a terminal/CLI environment:
Before running any of the following commands, navigate to the root of the app
directory to build this application for production by running the following commands:
Install NPM packages (first time only)
npm install
Build the files
npm run build
After the build succeeds, transfer all of the files within the dist/ directory into the Amazon S3 bucket that is designated for these assets (specified in the PortalStack provisioned via CDK).
Example:
aws s3 sync dist/ s3://<<name-of-s3-bucket>> --delete
<<name-of-s3-bucket>>
is the S3 bucket that was created in the <<resource-name-prefix>>-PortalStack
CloudFormation stack with the Logical ID of PrivateWebHostingAssets. This value can be obtained from the Resources tab of the CloudFormation stack in the AWS Console.
This value is also output during the cdk deploy
process when the PortalStack has been successfully completed.
Use the API Gateway invoke URL from the API Gateway that has been created during the cdk deploy
process to access the portal from a web browser. You can find this URL by following these steps:
- Navigate to the AWS Console
- Navigate to API Gateway and find the API Gateway that has been created during the
cdk deploy
process. The name of the API Gateway can be found in the Resources section of the<<resource-name-prefix>>-PortalStack
CloudFormation stack. - Click on the Stages link in the left-hand menu.
- Ensure that the ** portal** stage is selected
- Find the Invoke URL and copy that value
- Enter that value in the address bar of your web browser
You should now see the portal's user interface visible within the web browser. If any emails have been processed, they will be listed on the home page of the portal.
To avoid incurring future charges, follow these steps to remove the resources created by this solution:
- Delete the contents of the S3 buckets created by the solution:
- Raw email bucket
- Redacted email bucket
- Portal static assets bucket (if portal was deployed)
- Delete or disable the Amazon SES rule step created by the solution using below cli command:
#to disable the rule set use below command
aws ses set-active-receipt-rule-set
#to delete the rule set use below command
# Replace <<resource_name_prefix>> with resource_name_prefix used in context.json
aws ses delete-receipt-rule-set --rule-set-name <resource_name_prefix>>-rule-set
- Remove the CloudFormation stacks in the following order: (if deployed)
cdk destroy <<resource_name_prefix>>-PortalStack
cdk destroy <<resource_name_prefix>>-ConsumerStack
cdk destroy <<resource_name_prefix>>-S3Stack
- CDK Destroy does not remove the access log Amazon S3 bucket created as part of the deployment. You can get access log bucket name in the output tab of stack
<<resource_name_prefix>>-S3Stack
with export nameAccessLogsBucket
. Execute below steps to delete the access log bucket:
- Delete the contents of the access log bucket, follow link S3
- Access log bucket is version enabled and deleting the content of the bucket in the above step does not delete versioned objects in the bucket. That need to removed separately using below aws cli commands:
#to remove versioned objects use below aws cli command aws s3api delete-objects --bucket ${accesslogbucket} --delete "$(aws s3api list-object-versions --bucket ${accesslogbucket} --query='{Objects: Versions[].{Key:Key,VersionId:VersionId}}')" #once versioned objects are removed we need to remove the delete markers of the versioned objects using below aws cli command aws s3api delete-objects --bucket ${accesslogbucket} --delete "$(aws s3api list-object-versions --bucket ${accesslogbucket} --query='{Objects: DeleteMarkers[].{Key:Key,VersionId:VersionId}}')"
- Delete the access log Amazon S3 bucket using below aws cli command:
#delete the access log bucket itself using below aws cli command aws s3api delete-bucket --bucket ${accesslogbucket}
- If you have configured Amazon SES:
- Remove the verified domain/email identities
- Delete the MX records from your DNS provider
- Delete the SMTP credentials from AWS Secrets Manager
- Delete any CloudWatch Log groups created by the Lambda functions
Note: The VPC and its associated resources as prerequisites for this solution may not be deleted if they may be used by other applications.