Best of all: Automatic log output scheme for AWS IoT solutions

Abstract

  • The number of solutions using AWS IoT is growing every day!

    • The number of IoT devices connecting to AWS is also increasing
  • With the increase in the number of devices, AWS server developers are spending more time investigating logs when there is a failure between a device and AWS IoT.

  • DevOps operators need to create optimal and easy log output schemes and automate them so that they can focus on development

As Is

  • Logs are retrieved from S3 manually with the AWS CLI.

  • Development work is halted due to inquiries on Slack.

  • Not everyone can retrieve logs whenever they need them, which extends investigation time.

To Be

  • Realize automatic log output while keeping costs as low as possible with a serverless architecture.

  • Reduce communication time on Slack to the bare minimum.

  • Enable anyone to execute log outputs from anywhere as long as they have Slack installed on their PC or phone.

Sequence

  • I will adopt AWS Batch to generate a single Zip file consolidating a large number of log files.

  • Please note that the Slack slash command requires a response within 3 seconds.

AWS Architecture

Command Specifications

Guidelines

/iot-log [Identifier] [Category] [Date(UTC)]

Identifier:

  • ThingName

  • Certificate ID

  • etc...

Category:

Limit the values accepted for the category parameter according to the constraints of the slash command and AWS Batch

  • telemetry

  • command

  • lifecycle

  • etc...

Date:

  • 2023/01/01

  • Note that the date here is in UTC, in line with the AWS S3 directory layout

Examples of Slash Commands

# If you want to retrieve telemetry for THING00001 on 2023/10/01
/iot-log THING00001 telemetry 2023/10/01
# If you want to retrieve telemetry and command for THING00001 on 2023/10/01
/iot-log THING00001 telemetry,command 2023/10/01
# If you want to retrieve all devices such as airframes and apps that connected/disconnected to AWS IoT Core on 2023/10/01
/iot-log client_id lifecycle 2023/10/01
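
The command format above can be parsed with a few lines of Python. The following is a minimal sketch, not the actual implementation: the names IotLogRequest, ALLOWED_CATEGORIES, and parse_iot_log_text are hypothetical, the allow-list contains only the three categories named in this article, and ignoring any extra trailing tokens is an assumption.

```python
from dataclasses import dataclass
from datetime import date, datetime

# Hypothetical allow-list: only the categories named in this article.
ALLOWED_CATEGORIES = {"telemetry", "command", "lifecycle"}


@dataclass(frozen=True)
class IotLogRequest:
    identifier: str
    categories: tuple
    log_date: date


def parse_iot_log_text(text):
    """Parse '<Identifier> <Category[,Category...]> <YYYY/MM/DD>'."""
    parts = text.split()
    if len(parts) < 3:
        raise ValueError("usage: /iot-log [Identifier] [Category] [Date(UTC)]")
    # Assumption: tokens after the third one are ignored.
    identifier, raw_categories, raw_date = parts[:3]

    categories = tuple(c for c in raw_categories.split(",") if c)
    unknown = [c for c in categories if c not in ALLOWED_CATEGORIES]
    if not categories or unknown:
        raise ValueError(f"unsupported category: {unknown or raw_categories}")

    # strptime raises ValueError for malformed or impossible dates.
    log_date = datetime.strptime(raw_date, "%Y/%m/%d").date()
    return IotLogRequest(identifier, categories, log_date)
```

Rejecting unknown categories before a job is submitted keeps malformed requests from consuming Batch resources.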

Display Image on Slack

We are issuing signed URLs for the zipped objects on S3. Please specify an appropriate value for the expiration time.

Here, we describe an example of how the Slack slash command will appear when generating a zip file for telemetry and command logs for the ThingName THING00001 on 2023/10/01.

syuhei-honma 12:21 PM
/iot-log THING00001 telemetry,command 2023/10/01 dev
-----------------------------------------------


IoT Log Output App 12:21 PM
Target Identifier: THING00001
Log Category: telemetry,command
Specified Log Date: 2023/10/01

Generating log file...
-----------------------------------------------


IoT Log Output App 12:22 PM
Target Identifier: THING00001
Log Category: telemetry,command
Specified Log Date: 2023/10/01

Output Progress: xxx%
-----------------------------------------------


IoT Log Output App 12:22 PM
Output Progress: 100%
Target Identifier: THING00001
Log Category: telemetry,command
Specified Log Date: 2023/10/01

Log file generation complete!!

Please access the link below to download the log file.
The download link is valid for up to 60 minutes.
https://xxxxx.s3.ap-northeast-1.amazonaws.com/slack/iot-logs/YYYYMMDDhhmmss/20231001_THING00001.zip?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=xxxxx&X-Amz-Date=xxxxx&X-Amz-Expires=xxxx&X-Amz-SignedHeaders=host&X-Amz-Signature=xxxxxx
-----------------------------------------------
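
One way to send messages like those in the transcript above is to POST JSON to the response_url that Slack includes in the slash command payload. The sketch below uses only the standard library. The function names are hypothetical, and the choice of response_type "in_channel" is an assumption. Slack documents a limit on how many times a response_url can be used, so check that limit before sending frequent progress updates this way.

```python
import json
import urllib.request


def build_status_text(identifier, categories, log_date, status_line):
    """Format a message shaped like the ones in the transcript."""
    return (
        f"Target Identifier: {identifier}\n"
        f"Log Category: {','.join(categories)}\n"
        f"Specified Log Date: {log_date}\n"
        f"\n{status_line}"
    )


def build_slack_request(response_url, text):
    """Build (but do not send) a POST request for Slack's response_url."""
    payload = {"response_type": "in_channel", "text": text}
    return urllib.request.Request(
        response_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_to_slack(response_url, text, timeout=5):
    """Send the message and return the HTTP status code."""
    request = build_slack_request(response_url, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Keeping request construction separate from sending makes the message format easy to check without network access.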

Specification for Log Output Generation Progress

Please decide whether to define the progress status in % or string during implementation.

Assumed Progress (Tentative):

  • starting: 10%

  • progress: 10~99%

    • is starting

    • is done

  • completed: 100%
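
If the percentage form is chosen, the tentative ranges above could be computed as in this sketch. The linear mapping from processed files to 10~99% is an assumption, and progress_percent is a hypothetical helper name.

```python
def progress_percent(state, done=0, total=0):
    """Map a job state to the tentative percentages listed above.

    state: "starting", "progress", or "completed"
    done/total: files processed so far and files to process
    (only used while state is "progress").
    """
    if state == "starting":
        return 10
    if state == "completed":
        return 100
    if state == "progress":
        if total <= 0:
            return 10
        ratio = min(max(done / total, 0.0), 1.0)
        # Scale linearly into 10..99 so 100% is reserved for completion.
        return min(99, 10 + int(ratio * 89))
    raise ValueError(f"unknown state: {state}")
```

Reserving 100% for the completed state means the final message is the only one that can show it, even when every file has already been processed.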

Directory Structure of the zip file

The directory structure of the downloaded zip file after extraction is shown below.

The directory structure varies depending on the parameters of the slash command.

# This is just a reference example
.
├── telemetry_20231001.log
└── command_20231001.log
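
A zip file with this layout can be produced with Python's standard zipfile module. The sketch below assumes the log chunks for each category have already been downloaded into memory. The function name build_log_zip and the newline handling between chunks are assumptions.

```python
import io
import zipfile


def build_log_zip(chunks_by_category, date_compact):
    """Return zip bytes with one '<category>_<date>.log' entry per category.

    chunks_by_category: {"telemetry": [b"...", ...], "command": [...]}
    date_compact: the log date as YYYYMMDD, e.g. "20231001"
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for category, chunks in sorted(chunks_by_category.items()):
            # Ensure each chunk ends with a newline so records do not merge.
            merged = b"".join(
                chunk if chunk.endswith(b"\n") else chunk + b"\n"
                for chunk in chunks
            )
            archive.writestr(f"{category}_{date_compact}.log", merged)
    return buffer.getvalue()
```

For very large logs, writing to a temporary file instead of an in-memory buffer would keep memory usage within the job's limit.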

Construction Flow

  1. Creating Slack App:

    • Create a new app on Slack API's Your Apps page.

    • Create a slash command and specify the request URL. This URL will later become the endpoint for your AWS Lambda function.

  2. Creating AWS Lambda Function:

    • Create a Lambda function to receive requests from Slack.

    • Within this function, write the code to initiate AWS Batch jobs.

  3. Building & Implementing AWS Batch:

    • Construct AWS Batch job definitions and job queues.

    • Within the job's Docker container, download multiple log files from S3, aggregate them into a single Zip file, and re-upload it to S3.

    • Depending on the process, notify the progress status to Slack.

    • Generate signed URLs for the Zip files uploaded to S3.

  4. Launching Batch Job from Lambda:

    • Within the Lambda function, call the API to initiate the Batch job.
  5. Notification of Log File to Slack:

    • Once the Batch job is completed, notify Slack of the signed URL of the log file in S3.
  6. Security:

    • Verify Slack's signature to ensure that the request is genuinely from Slack.

    • Appropriately set S3 bucket policies and IAM roles to prevent unauthorized access.

  7. Testing:

    • Execute the slash command in Slack, check if AWS Batch is activated, a Zip file is generated, and Slack is notified.

    • Access the signed URL to ensure that the Zip file can be downloaded.
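
For the security step, Slack signs each request with HMAC-SHA256 over the string "v0:<timestamp>:<raw body>" using the app's signing secret. It sends the result in the X-Slack-Signature header and the timestamp in X-Slack-Request-Timestamp. A minimal verification sketch follows. The function name and the 5-minute tolerance parameter are choices made here for illustration.

```python
import hashlib
import hmac
import time


def verify_slack_signature(signing_secret, timestamp, body, signature,
                           now=None, max_skew_seconds=300):
    """Return True if the request signature matches and is recent.

    timestamp: value of the X-Slack-Request-Timestamp header
    body: the raw, unparsed request body as text
    signature: value of the X-Slack-Signature header
    """
    now = time.time() if now is None else now
    try:
        request_time = int(timestamp)
    except (TypeError, ValueError):
        return False
    # Reject old requests to limit replay attacks.
    if abs(now - request_time) > max_skew_seconds:
        return False

    base_string = f"v0:{timestamp}:{body}".encode("utf-8")
    digest = hmac.new(signing_secret.encode("utf-8"), base_string,
                      hashlib.sha256).hexdigest()
    expected = f"v0={digest}"
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode("utf-8"),
                               (signature or "").encode("utf-8"))
```

The check must run against the raw body exactly as received, before any form decoding, or the computed digest will not match.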

AWS S3 Specifications

The bucket management settings below are examples. Please design optimally as needed.

Logs between devices and AWS IoT are output to the dev-iot-logs bucket.

The output destination for log output is also managed in the dev-iot-logs bucket.

| Bucket Name | Input Directory Name | MQTT Topic | Remarks |
| --- | --- | --- | --- |
| dev-iot-logs | lifecycle | $aws/events/presence/connected/+, $aws/events/presence/disconnected/+ | Connection/Disconnection detection to AWS IoT Core |
| dev-iot-logs | telemetry | iot/telemetry/# | Telemetry topics from devices |
| dev-iot-logs | commands | iot/commands/# | Command topics to devices |

# Reference: Input Directory Structure
dev-iot-logs
├── lifecycle
├── telemetry
└── commands
# Reference: Output Directory Structure
dev-iot-logs
└── slack
    └── iot-logs
        └── Request Time (e.g., YYYYMMDDhhmmss)
            └── 20231001_THING00001.zip
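
The output key and its signed URL could be built as in the sketch below. The S3 client is passed in as a parameter (in practice boto3.client("s3")). Rendering the request time as a 14-digit UTC timestamp, the helper names, and the 3600-second default (matching the 60-minute validity mentioned earlier) are assumptions.

```python
from datetime import datetime, timezone


def build_output_key(identifier, log_date_compact, requested_at):
    """Return 'slack/iot-logs/<request time>/<date>_<identifier>.zip'."""
    # Assumption: the request-time directory is a UTC timestamp.
    stamp = requested_at.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"slack/iot-logs/{stamp}/{log_date_compact}_{identifier}.zip"


def presign_download_url(s3_client, bucket, key, expires_in=3600):
    """Return a time-limited download URL for the uploaded zip file."""
    return s3_client.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in,
    )


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   key = build_output_key("THING00001", "20231001", datetime.now(timezone.utc))
#   url = presign_download_url(boto3.client("s3"), "dev-iot-logs", key)
```

A presigned URL can stop working earlier than ExpiresIn if the credentials that signed it expire first, which is worth checking when the job role uses short-lived credentials.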

S3 Lifecycle Rule Setting

Here are the lifecycle settings details.

The settings below are examples. Please design optimally as needed.

| Key | Value |
| --- | --- |
| Lifecycle Rule Name | Slack2IoTLogs |
| Rule Scope Selection | Restrict the scope of this rule using one or more filters |
| Filter Type: Prefix | slack |
| Lifecycle Rule Actions | Expire the current version of objects, Permanently delete the non-current versions of objects |
| Current Version Expiration: Days from object creation | 1 day |
| Non-current Version Permanent Deletion: Days since object became non-current version | 1 day |
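
The same rule can be expressed as the configuration dictionary that boto3's put_bucket_lifecycle_configuration accepts. This is a sketch of the example settings above, and the helper name build_lifecycle_configuration is hypothetical.

```python
def build_lifecycle_configuration(prefix="slack", days=1):
    """Return a lifecycle configuration matching the example settings."""
    return {
        "Rules": [
            {
                "ID": "Slack2IoTLogs",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                # Expire current versions N days after creation.
                "Expiration": {"Days": days},
                # Permanently delete versions N days after they
                # become non-current.
                "NoncurrentVersionExpiration": {"NoncurrentDays": days},
            }
        ]
    }


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="dev-iot-logs",
#       LifecycleConfiguration=build_lifecycle_configuration(),
#   )
```

Note that put_bucket_lifecycle_configuration replaces the bucket's whole lifecycle configuration, so any existing rules need to be included in the same call.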

AWS Lambda Specifications

  • Only one AWS Lambda will be prepared to be triggered by a Slack slash command.

  • To keep the implementation minimal, use a Lambda function URL (HTTPS endpoint) instead of API Gateway + Lambda.

  • Utilize convenient tools like CDK, SAM, etc., for infrastructure construction.

  • Implement in Python.

  • When a single AWS Lambda function submits jobs to more than one AWS Batch environment, define AssumeRole in its IAM role.

Example Source Code

To trigger AWS Batch jobs from AWS Lambda, you can follow the steps below using the AWS SDK for Python (boto3):

  1. Setting up IAM Role:

    • It's crucial to assign an appropriate IAM role to the AWS Lambda function to allow communication with AWS Batch. This role should include the batch:SubmitJob permission along with any other necessary permissions.
  2. Creating AWS Lambda Function:

    • Create a new Lambda function using either the Lambda console or AWS CLI.
  3. Installing Dependencies:

    • Install the dependencies, including the boto3 library, as needed.
  4. Implementing the Code:

    • The Python code snippet below illustrates an example of an AWS Lambda function that triggers an AWS Batch job:
import boto3

def lambda_handler(event, context):
    # Create AWS Batch client
    batch_client = boto3.client('batch')

    # Specify job queue and job definition
    job_queue = 'your-job-queue-name'
    job_definition = 'your-job-definition-name'

    # Specify job name
    job_name = 'example-job-name'

    # Submit the job. submit_job() has no priority argument;
    # the priority is configured on the job queue itself.
    response = batch_client.submit_job(
        jobName=job_name,
        jobQueue=job_queue,
        jobDefinition=job_definition
    )

    # Log the Job ID
    print(f'Job ID: {response["jobId"]}')

    return {
        'statusCode': 200,
        'body': f'Job {response["jobId"]} submitted successfully.'
    }

In this code snippet, boto3 is used to create an AWS Batch client and the submit_job method is used to submit a new job. It's necessary to specify the job queue, job definition, and job name. The priority is not passed to submit_job; it is set on the job queue.

When used as a handler for an AWS Lambda function, this code snippet will submit a new AWS Batch job each time the function is triggered. Furthermore, this function logs the job ID and returns the job ID in the response.
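
To connect the handler to the slash command, the function also has to read Slack's form-encoded body and pass the three arguments on as Batch job parameters. The sketch below assumes the event shape of a Lambda function URL (body plus isBase64Encoded). The helper names and the SLACK_RESPONSE_URL environment variable are hypothetical, and signature verification (step 6 of the construction flow) is omitted for brevity. The Batch client is passed in as a parameter so the logic can be exercised without AWS access.

```python
import base64
import json
import re
from urllib.parse import parse_qs

USAGE = "Usage: /iot-log [Identifier] [Category] [Date(UTC)]"


def decode_body(event):
    """Return the raw request body of a function URL event as text."""
    body = event.get("body") or ""
    if event.get("isBase64Encoded"):
        body = base64.b64decode(body).decode("utf-8")
    return body


def handle_slash_command(event, batch_client, job_queue, job_definition):
    """Submit a Batch job and return an immediate acknowledgement."""
    form = parse_qs(decode_body(event))
    text = form.get("text", [""])[0]
    response_url = form.get("response_url", [""])[0]

    parts = text.split()
    if len(parts) < 3:
        return {"statusCode": 200, "body": USAGE}
    identifier, category, log_date = parts[:3]

    # Batch job names allow only letters, digits, hyphens, and underscores.
    safe_name = re.sub(r"[^A-Za-z0-9_-]", "-", identifier)[:100]
    batch_client.submit_job(
        jobName=f"iot-log-{safe_name}",
        jobQueue=job_queue,
        jobDefinition=job_definition,
        # Fills Ref::Identifier, Ref::Category, Ref::Date in the job command.
        parameters={"Identifier": identifier, "Category": category,
                    "Date": log_date},
        # Hypothetical variable so the job can report back to Slack.
        containerOverrides={"environment": [
            {"name": "SLACK_RESPONSE_URL", "value": response_url}]},
    )
    ack = {"response_type": "in_channel", "text": "Generating log file..."}
    return {"statusCode": 200,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps(ack)}
```

Because the handler only submits the job and returns, the slow work happens in AWS Batch and the reply is not held up by log processing, which helps with Slack's 3-second response limit.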

AWS Batch Specifications

The settings below are examples. Please design optimally as needed.

A serverless architecture is adopted, and computing is prepared in Fargate.

ECR

  • Create in a private repository

  • ECR repository name: slack2log (example)

  • Scan frequency: on push

Job Queue

| Key | Value |
| --- | --- |
| Job Queue Name | slack |
| Priority | 1 |
| Scheduling Policy ARN (Optional) | Not specified |
| Orchestration Type | Fargate |
| Enable Job Queue | Enabled |
| Compute Environment | default (FARGATE_SPOT) |

Job Definition

| Key | Value |
| --- | --- |
| Job Type | Single-node |
| Name | example-job-name |
| Execution Timeout | 3600 (1 hour) |
| Scheduling Priority | Disabled |
| Platform Type | Fargate |
| Fargate Platform Version | Default value |
| Assign Public IP | Enabled |
| Image | <your_account_id>.dkr.ecr.ap-northeast-1.amazonaws.com/slack2log:latest |
| Command (JSON) | ["python3","main.py","generate-ziplog","--identifier","Ref::Identifier","--category","Ref::Category","--date","Ref::Date"] |
| vCPU | 1.0 |
| Memory | 2 GB (2048 MB) |
| Job Role Configuration | arn:aws:iam::<your_account_id>:role/JobRole |
| Execution Role | arn:aws:iam::<your_account_id>:role/TaskExecutionRole |
| Log Configuration | awslogs: /batch/slack |
Add any other settings as needed.
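
The Command (JSON) setting implies that main.py exposes a generate-ziplog subcommand with --identifier, --category, and --date options. A possible entry point is sketched below with argparse. The article does not show main.py, so the structure here, including splitting the category on commas, is an assumption.

```python
import argparse


def build_parser():
    """Build a parser for: main.py generate-ziplog --identifier ... ."""
    parser = argparse.ArgumentParser(prog="main.py")
    subcommands = parser.add_subparsers(dest="command", required=True)

    generate = subcommands.add_parser(
        "generate-ziplog", help="Zip the logs for one identifier and date")
    generate.add_argument("--identifier", required=True)
    # "telemetry,command" becomes ["telemetry", "command"].
    generate.add_argument(
        "--category", required=True,
        type=lambda value: [c for c in value.split(",") if c])
    generate.add_argument("--date", required=True,
                          help="Log date in UTC, e.g. 2023/10/01")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.command, args.identifier, args.category, args.date)
```

AWS Batch substitutes the Ref::Identifier, Ref::Category, and Ref::Date placeholders with the parameters given at job submission, so the script receives plain strings.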

Creation of CloudWatch Log Group

| Key | Value |
| --- | --- |
| Log Group Name | /batch/slack |
| Retention Period Setting | 30 days |

During Operation (Prepare during implementation)

  • Describe usage and documentation on Slack channel's Canvas

  • List frequently used slash commands (such as telemetry and commands)

Conclusion

Troubleshooting in IoT solutions requires real-time investigation and root cause analysis.
Let's use the scheme proposed here to achieve swift DevOps.