Automated Cloud-Based Recon

Two years ago, I embarked on a journey to become a better, well-rounded security practitioner and really ramped up my side projects. Whether you are red team, blue team, purple team, a hacker, an aspiring practitioner, or a seasoned expert, there is a huge opportunity to leverage cloud, automation, and modern technologies to leapfrog information security programs and capabilities. Although this topic focuses on bug bounty, the principles, examples, and underlying foundation are applicable to the majority of technology domains.

My hope as you read this is that it sparks ideas within your own roles, responsibilities, and research, as well as paving a path to begin learning the benefits of a cloud-based ecosystem. I have attempted to extract relevant code snippets that you can modify or at least leverage as a useful guide.

Part 1 - The psychology

First off, this project has been one of the longest daily grind projects that I have ever embarked on. I have had what STÖK would refer to as "bug fever" since participating in the H1-2010 virtual event last year, combined with my extreme competitiveness, relentless ambition, continuous learning, and a curiosity to understand just about everything.

Additionally, I quickly realized that my current approach to bug bounties was not compatible with the level of results that I wanted. After prioritizing family, balancing a full-time job, and maintaining life's responsibilities, I was left with a window of time from about 10 pm to 1 am to bounty hunt. Let's be honest, having two young kids can be exhausting, so I found that I could typically allocate four days max per week because I still needed sleep. With around 12 hours per week, I fell into a constant cycle that I will refer to as the bug bounty dilemma.

This unproductive cycle continued and I decided that I needed to change the approach.

Objective

Establish a scalable, cost-effective, cloud-based platform that fully automates bug bounty program reconnaissance, transforming raw data into actionable intelligence.

The overall cloud-based platform requires one API call containing the program name and intended operation; it then initiates an automated recon process and outputs usable files, charts, and dashboards to home in on potential bugs.

Principles

  • A framework that scales horizontally to support all available bug bounty programs.
  • A framework that supports modular components to continually add depth. Depth may consist of new tooling, bug classes, or expansive automated checks.
  • Data analysis across all programs that can highlight anomalies, identify similar configurations across programs, and recursively leverage outputs.
  • 100% cloud-based with no personal device dependencies.
  • Codified infrastructure as code (IaC) so that the environment is reusable, ephemeral whenever possible, can be re-deployed, and dynamically generated.
  • The environment must comply with security best practices (i.e. no secrets in source code, least privilege, and data encryption in-transit and at-rest).
  • Robust documentation for clarity around inputs, outputs, code comments, and overall objectives.
  • The base framework needs to have minimal cost. Preferably, the base functionality costs should be covered by one low-severity bounty (under $250/month for cloud costs?).
  • Version control is essential to maintain code continuity, avoid accidental code loss, and share with the industry.

Reference Architecture

The platform leverages a multitude of cloud-native services, woven together to deliver modular data products which are processed and transformed into actionable intelligence.

The reference architecture contains components which are modularized into three primary automation categories.

  • Program Generation - Management, maintenance, and tracking of bug bounty programs.
  • Program Operations - Technical reconnaissance functions such as API calls to third-party data sources, DNS enumeration, web crawling, port scanning, and web fuzzing.
  • Program Analysis - Processing raw output from operations, preparing data inputs for recursive operations, normalizing data for visualization and reporting, and creation of intelligence through data correlation.
Recon Automation Reference Architecture

Process Overview

Each of these processes is described in more detail later in the article.

  1. Programs are bulk loaded from platform integrations, with support for manually loading individual programs.
  2. Static program information is stored in a DynamoDB database.
  3. Operations are initiated via API Gateway GET and POST requests.
  4. The requests are proxied to an initial Lambda function.
  5. The Lambda initiates a Step Functions state machine and the parameters define the intended workflow path.
  6. Operations are orchestrated using Step Functions.
  7. Each operation has its own Lambda to minimize complexity.
  8. Active operations are initiated using ephemeral Digital Ocean Droplets.
  9. Environment specific variables are stored in AWS Parameter Store while API keys and credentials are stored in AWS Secrets Manager.
  10. All generated data is stored in S3.
  11. Python Pandas DataFrames are utilized to process and clean data.
  12. The output is aggregated and viewable via AWS QuickSight.
  13. Visual relationships are generated using GraphDB technology.
  14. Data is indexed using AWS Glue and searched using AWS Athena.
  15. Initial development is performed in SageMaker Jupyter Notebooks for establishing new functions.
  16. Stable development is performed in an AWS Cloud9 IDE and incorporated into a Python package prior to deployment.
  17. All source code is stored and managed within a GitHub repository.

Reference Architecture Details

1.  Programs are bulk loaded from platform integrations, with support for manually loading individual programs.

The first step is to import and load programs across the major bug bounty platforms. Future updates will focus on direct platform integrations; however, the bulk load functionality currently uses data from Arkadiy Tetelman's GitHub repository (https://github.com/arkadiyt/bounty-targets-data), which is updated hourly. Thank you, Arkadiy, for the valuable resources that you've developed and published! The bulk load automation leverages the raw data files within the repo.

The bulk load is initiated using an AWS Lambda function. It utilizes Pandas DataFrames to store, normalize, and load the data into a DynamoDB table. Here is an example of the Lambda function.
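The exact production function is not reproduced here, but a minimal sketch of the approach might look like the following, assuming the hackerone_data.json file path in the bounty-targets-data repo, a hypothetical parse_scope helper, and a load_programs function (shown in step 2):

import pandas as pd
import requests

# Raw data file within the bounty-targets-data repository (assumed path)
HACKERONE_DATA = 'https://raw.githubusercontent.com/arkadiyt/bounty-targets-data/main/data/hackerone_data.json'

def parse_scope(targets):
    # Hypothetical helper - flatten the in-scope asset identifiers for a program
    return [asset.get('asset_identifier') for asset in targets.get('in_scope', [])]

def lambda_handler(event, context):
    # Pull the hourly-updated bulk data and normalize it into a DataFrame
    programs = requests.get(HACKERONE_DATA, timeout=30).json()
    df = pd.json_normalize(programs)

    # Derive a simplified scope column using the scope-parsing helper
    df['scope'] = [parse_scope(program.get('targets', {})) for program in programs]

    # Hand the normalized DataFrame off to the DynamoDB load function (sketched in step 2)
    load_programs(df)
    return {'statusCode': 200, 'body': f'{len(df)} programs processed'}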

The additional functions being called are related to scope parsing. For successful automation, each variation of scope needs to be handled via code. Further background on this is described within the challenges section of this article. The code and parsing are not yet perfect, but they handle the majority of variations.

2.  Static program information is stored in a DynamoDB database.

Each program is loaded and stored into a DynamoDB table. There are still additional programs that need to be added, but it currently has 653 programs available to initiate recon against.

Example code (not holistic) to load the programs into DynamoDB is:
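A minimal sketch of that load function (the table name and field handling are assumptions rather than the exact production code):

import json
import boto3

def load_programs(df):
    # Hypothetical table name; the partition key is assumed to be the program handle
    table = boto3.resource('dynamodb').Table('bugbounty-programs')

    # DynamoDB rejects float and NaN values, so clean the DataFrame and coerce
    # numeric types to strings before writing
    records = json.loads(df.fillna('').to_json(orient='records'), parse_float=str)

    # batch_writer() buffers and retries the underlying put_item calls
    with table.batch_writer() as batch:
        for item in records:
            batch.put_item(Item=item)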

The same general functions are leveraged for manual, individual program loads, which can be submitted via AWS API Gateway combined with a Lambda function. A POST request can be initiated using the following payload.
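A hypothetical example is shown below; the /program path and the payload field names are illustrative rather than the production schema:

curl -X POST -H "x-api-key: fakeapikeyvalue" -H "Content-Type: application/json" -d '{"program": "example", "platform": "hackerone", "inscope": ["*.example.com"], "outofscope": ["community.example.com"]}' "https://api.brevityinmotion.com/program"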

The section of the Lambda function that receives the POST payload is:
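A simplified sketch, assuming the hypothetical payload fields above and an API Gateway proxy integration:

import json
import boto3

def lambda_handler(event, context):
    # The proxy integration delivers the POST body as a JSON string
    payload = json.loads(event.get('body') or '{}')

    # Field names mirror the hypothetical payload above
    item = {
        'program': payload.get('program'),
        'platform': payload.get('platform'),
        'inscope': payload.get('inscope', []),
        'outofscope': payload.get('outofscope', [])
    }

    # Hypothetical table name
    boto3.resource('dynamodb').Table('bugbounty-programs').put_item(Item=item)

    return {'statusCode': 200, 'body': json.dumps({'loaded': item['program']})}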

Once programs are loaded into DynamoDB, they are ready for use within the automated recon operations.

3.  Operations are initiated via API Gateway GET and POST requests.

The ease of use and flexibility offered with AWS API Gateway makes it an extremely valuable service for automation (AWS API Gateway documentation).

The AWS API Gateway supports a REST-based URL structure where requests can be submitted against defined paths. Additionally, parameters can be provided via POST and GET requests. Each path can map to a different Lambda to expose functions directly through the API.

The primary path is /operation/recon and it accepts two parameters: the program name and the intended operation.

The initiating request would look like:

curl -X GET -H "x-api-key: fakeapikeyvalue" -H "Content-Type: application/json" "https://api.brevityinmotion.com/operation/recon?program=tesla&operation=initial"

Although the screenshot shows that authorization is not applied (and will be delivered in later phases), the API Gateway is configured to require a valid API key for authentication. It is important to configure authentication for the API; otherwise, the only form of security will be the API ID within the URL (which also disappears if using a custom domain name). The API key value is passed as the header "x-api-key" (exemplified in the above curl command).

The API Gateway supports custom domain names, which can be configured within the "Custom domain names" section of the service. It requires adding entries within the authoritative DNS zone and applying a certificate to the API, which can be quickly generated utilizing the AWS Certificate Manager (ACM) service.

4.  The requests are proxied to an initial Lambda function.

The API Gateway can be configured to pass through (proxy) traffic directly to an AWS Lambda function. This adds to the ease of transforming Python functions with arguments into API calls with parameters, requiring less than five minutes of configuration.

5.  The Lambda initiates a Step Functions state machine and the parameters define the intended workflow path.

The receiving Lambda function has a minimal code base with a primary intent to capture the API parameters and pass them as variables to the AWS Step Functions service.
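A minimal sketch of that receiving function (the state machine ARN would normally be retrieved from Parameter Store; it is shown as a placeholder here):

import json
import boto3

sfn = boto3.client('stepfunctions')
STATE_MACHINE_ARN = 'arn:aws:states:us-east-1:123456789012:stateMachine:brevity-recon'  # placeholder

def lambda_handler(event, context):
    # Capture the query string parameters passed through the API Gateway proxy
    params = event.get('queryStringParameters') or {}
    program = params.get('program')
    operation = params.get('operation', 'initial')

    # Start the Step Functions workflow; these values drive the Choice states downstream
    execution = sfn.start_execution(
        stateMachineArn=STATE_MACHINE_ARN,
        input=json.dumps({'program': program, 'operation': operation})
    )

    return {'statusCode': 200,
            'body': json.dumps({'executionArn': execution['executionArn']})}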

6.  Operations are orchestrated using Step Functions.

The initial input into the AWS Step Functions workflow includes the program name and operation. Choices can be added within Step Functions to bypass steps or go directly to a specific action in response to the provided operation parameter.

AWS Step Functions workflow

Step Functions is a powerful tool for orchestrating a workflow, providing out-of-band processes using callbacks, initiating other native AWS services, and sending relevant notifications. It is extremely modular, and new capabilities can be added as they are developed.

Additional information on Step Functions usage, benefits, and approach is detailed at https://www.brevityinmotion.com/aws-step-functions-to-accelerate-bug-bounty-recon-workflows/.

Replicating this approach makes it possible to take Python code, or any other functionality supported by AWS Lambda, and make it callable as an API. This is extremely powerful for an unlimited number of use cases.

7.  Each operation has its own Lambda to minimize complexity.

By following many of the modular principles that Daniel Miessler shared in his Mechanizing the Methodology presentation, there is no piping of tools. Each component of the workflow has one function to perform. As tempting as it is to pipe a big one-liner (some enticing examples here), there is value to be gained from every component being isolated, and the modularity makes troubleshooting easier.

With this approach, data can be reused, programs can be monitored, and exploitation can be tailored to data-driven decisions. Operations still leverage an input/output pipeline, but the chaining happens after post-processing the raw output of the tooling. The post-processing adds additional metadata, correlations, and insights to the data artifacts that are produced. For example, here is a script specific to Project Discovery's httpx tool, which generates the program-specific shell script, installs the software, normalizes the output, and loads it into S3.
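A condensed sketch of that pattern (bucket names, paths, and flags are illustrative rather than the exact production script):

import boto3

def generate_httpx_script(program, input_bucket, raw_bucket):
    # Build a program-specific shell script for httpx and stage it in S3.
    # Bucket names, object keys, and httpx flags below are assumptions.
    script = f"""#!/bin/bash
# Install httpx from Project Discovery
go install -v github.com/projectdiscovery/httpx/cmd/httpx@latest

# Pull the program's resolved domains generated by earlier steps
aws s3 cp s3://{input_bucket}/input/{program}/domains-{program}.txt .

# Probe the domains and emit JSON output for downstream normalization
httpx -l domains-{program}.txt -json -o httpx-{program}.json

# Upload the raw output for post-processing
aws s3 cp httpx-{program}.json s3://{raw_bucket}/httpx/{program}/httpx-{program}.json
"""
    s3 = boto3.client('s3')
    s3.put_object(Bucket=input_bucket,
                  Key=f'input/{program}/run-httpx-{program}.sh',
                  Body=script.encode('utf-8'))
    return script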

8.  Active operations are initiated using ephemeral Digital Ocean Droplets.

Some of the orchestration steps initiate active recon against a program. All of the active reconnaissance is performed from ephemeral Digital Ocean droplets.

Each recon action has a dedicated Lambda function to perform the task. The Lambda function dynamically generates a startup script to install the applicable tools, generates program-specific configuration files for the tools, and calls the Digital Ocean API to create a Droplet and run the startup file.
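A simplified sketch of the Droplet creation call (the secret name, naming convention, and Droplet sizing are assumptions):

import boto3
import requests

def launch_recon_droplet(program, user_data):
    # Retrieve the Digital Ocean API token just-in-time from Secrets Manager
    secrets = boto3.client('secretsmanager')
    do_token = secrets.get_secret_value(SecretId='digitalocean-api-key')['SecretString']

    payload = {
        'name': f'recon-{program}',     # naming convention used for later cleanup
        'region': 'nyc3',
        'size': 's-1vcpu-1gb',
        'image': 'ubuntu-20-04-x64',
        'user_data': user_data          # the dynamically generated startup script
    }

    # Create the ephemeral Droplet; cloud-init runs the userdata on first boot
    response = requests.post('https://api.digitalocean.com/v2/droplets',
                             headers={'Authorization': f'Bearer {do_token}'},
                             json=payload, timeout=30)
    return response.json()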

If the need to debug and monitor the output directly on the Droplet arises, the output can be monitored while connected to the Droplet.
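Assuming the standard cloud-init based Ubuntu Droplet image, where userdata output is written to the cloud-init log, the output can typically be followed with:

tail -f /var/log/cloud-init-output.log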

The Lambda generates a script that installs the local tools. For the Droplet to establish connectivity to an AWS S3 bucket, the Lambda pulls an access key/secret key pair from AWS Secrets Manager and writes it into the script. There is tremendous value in the just-in-time approach because the secret token can be rotated centrally and routinely while being incorporated automatically into the workflow. A userdata startup script is generated to build the directory structure and then download the uniquely generated operation scripts from S3 using the AWS CLI.

All functionality is initiated and run within the bounds of the startup script so it is primarily a script that executes more scripts. Upon performing the intended operation and any necessary data processing, the output from the operation is uploaded to the raw and refined S3 buckets.

Once all of the data is uploaded, the final command of the startup script shuts down the Droplet. There is a Lambda function in AWS that queries the Digital Ocean API and destroys any Droplets that are in an "off" state. At some point, filtering functionality needs to be added based on a naming convention so that it does not accidentally delete other unrelated Droplets. The AWS EventBridge service maintains the scheduled (cron) rule that runs this Lambda every 5 minutes, so there is never an ephemeral Droplet left in a shutdown state for longer than 5 minutes, which supports the cost principle.
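A sketch of that cleanup function (the secret name is illustrative, and the naming-convention filter mentioned above is not yet included):

import boto3
import requests

def lambda_handler(event, context):
    # Retrieve the Digital Ocean API token just-in-time from Secrets Manager
    secrets = boto3.client('secretsmanager')
    do_token = secrets.get_secret_value(SecretId='digitalocean-api-key')['SecretString']
    headers = {'Authorization': f'Bearer {do_token}'}

    # List all Droplets and destroy any that have finished their work (status "off")
    droplets = requests.get('https://api.digitalocean.com/v2/droplets',
                            headers=headers, timeout=30).json().get('droplets', [])
    for droplet in droplets:
        if droplet['status'] == 'off':
            requests.delete(f"https://api.digitalocean.com/v2/droplets/{droplet['id']}",
                            headers=headers, timeout=30)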

9. Environment specific variables are stored in AWS Parameter Store while API keys and credentials are stored in AWS Secrets Manager.

Amazon provides two services for persistently storing variables, and both have APIs for codified retrieval. AWS Secrets Manager is utilized to store secret values and is where the Digital Ocean API key, Amass config file API keys, and AWS access key/secret keys are stored. With the ease of API integration, the secrets can be retrieved and utilized for the just-in-time (JIT) design principle.

Some of this example code is taken directly from the pre-canned code within the AWS Console.
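A condensed sketch along the lines of the console's sample code (the secret name and region are illustrative):

import boto3
from botocore.exceptions import ClientError

def get_secret(secret_name, region_name='us-east-1'):
    # Retrieve a secret value just-in-time from Secrets Manager
    client = boto3.client('secretsmanager', region_name=region_name)
    try:
        response = client.get_secret_value(SecretId=secret_name)
    except ClientError as error:
        # Surface errors such as ResourceNotFoundException or AccessDeniedException
        raise error
    return response['SecretString']

# Example just-in-time retrieval of the Digital Ocean API token
do_token = get_secret('digitalocean-api-key')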

For persistent storage of configuration variables, the AWS Parameter Store (part of AWS Systems Manager) service is utilized. Parameter Store is much less expensive and similarly scales to fulfill the use cases. Example parameters include bucket names and Step Functions ARNs.

Each Lambda function typically begins with a function to retrieve any applicable variables necessary to perform the task.
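For example, a small helper along these lines (parameter names are illustrative):

import boto3

def get_parameters(names, region_name='us-east-1'):
    # Retrieve environment-specific variables from Parameter Store
    ssm = boto3.client('ssm', region_name=region_name)
    response = ssm.get_parameters(Names=names, WithDecryption=True)
    return {param['Name']: param['Value'] for param in response['Parameters']}

# Example usage at the start of a Lambda handler
config = get_parameters(['raw-bucket-name', 'refined-bucket-name', 'recon-statemachine-arn'])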

10.  All generated data is stored in S3.

The data tier is a critical component of the entire automation platform. The data is grouped into the following categories:

  • Datasets - Bucket containing copies of datasets aggregated from external sources which may not be accessible or integrated directly. Examples include downloads of the MaxMind database and CSV files.
  • Input data - As programs are generated, any configuration files and shell scripts are loaded into a file path of input/<program>/<files>. These files are copied from the program-level directory at runtime to the ephemeral operations and to the locally running server or container.
  • Raw data - This bucket contains the raw output from the tools. Additionally, all of the raw web responses are stored in the bucket. The raw data bucket is synchronized to the persistent EC2 server using aws s3 sync commands. This persistent server can always be running searches across the new data, or across the entirety of the data, as new functionality, keywords, and capabilities are added. The current approach utilizes sift (keyword/regex search), semgrep (static code analysis), and nuclei templates (using the --passive flag).
  • Refined data - This bucket contains post-processed output files established from the tooling. This could be files containing added fields such as program name, base URLs, or aggregate/combined data.
  • Presentation data - This bucket contains files formatted and ready for reporting and dashboards. Processes either utilize the data directly to generate charts and graphs using libraries like Seaborn, Matplotlib, and NetworkX, or the bucket functions as a location that data presentation tools (AWS QuickSight, AWS Glue, Neo4j) can point to directly.

11.  Python Pandas DataFrames are utilized to process and clean data.

The Python Pandas library is a powerful data analysis library that is utilized extensively by data scientists. For additional examples of working with Pandas, there is a walkthrough using the Rapid7 Sonar dataset at https://www.brevityinmotion.com/external-ip-domain-reconnaissance-and-attack-surface-visualization-in-under-2-minutes/ which utilizes Jupyter notebooks.

Python Pandas can load various filetypes, including .csv and .json files, into an in-memory data structure referred to as a DataFrame. It can be manipulated, transformed, and reshaped to enhance or augment the initial data. It can also integrate directly with S3 buckets. Lastly, it can output the data back into a .csv or .json file. For example, when preparing HTTPX output for a GoSpider site crawl, the output is loaded into a Pandas DataFrame and manipulated before being handed to GoSpider.

Here is an example use case of loading the raw JSON HTTPX output, adding the program name, parsing the URL field to add a baseURL column, and then outputting it back to S3 as a JSON file.
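A sketch of that transformation (bucket names, paths, and the s3fs dependency for direct S3 reads/writes are assumptions):

import pandas as pd
from urllib.parse import urlparse

def refine_httpx_output(program, raw_bucket, refined_bucket):
    # httpx emits one JSON object per line, so read it as JSON Lines directly
    # from S3 (requires the s3fs package alongside Pandas)
    raw_path = f's3://{raw_bucket}/httpx/{program}/httpx-{program}.json'
    df = pd.read_json(raw_path, lines=True)

    # Add the program name so the records remain attributable after aggregation
    df['program'] = program

    # Parse the URL field into a base URL column for downstream tools such as GoSpider
    df['baseurl'] = df['url'].apply(
        lambda u: f'{urlparse(u).scheme}://{urlparse(u).netloc}')

    # Write the refined output back to S3 as JSON Lines
    refined_path = f's3://{refined_bucket}/httpx/{program}/httpx-{program}.json'
    df.to_json(refined_path, orient='records', lines=True)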

12.  The output is aggregated and viewable via AWS QuickSight.

AWS QuickSight is a powerful dashboard and reporting capability that can transform refined data into Super-fast, Parallel, In-memory Calculation Engine (SPICE) data.

The dashboards are customizable and can show various views, data elements, and filtered queries. Here are some example dashboards using HTTPX output data.

13.  Visual relationships are generated using GraphDB technology.

The initial visualizations developed for this were built using AWS Neptune. However, due to some limitations with Neptune's visualizations, it was swapped out for the third-party Neo4j hosted graph service (PaaS) sold through Google Cloud. It provides robust relationship mappings between programs, domains, IP addresses, URLs, and ASNs.

Due to cost concerns at scale, it has not been extensively matured and integrated, but the lowest tier options are effective for analyzing and visualizing an individual program or a handful of programs. This will eventually become a larger focus by applying fraud-analysis-style techniques to discover anomalies or shared dependencies and third-party libraries between programs. Code samples for Neo4j will be shared at a later time once further matured.

Neo4j Graph Analysis

To at least establish some graph functionality, NetworkX in combination with Bokeh can be leveraged within a Jupyter notebook to run ad-hoc analytics against single programs.

NetworkX displayed using Bokeh

The code that can be utilized and adapted is:
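A minimal sketch along those lines (column names are illustrative, and it assumes Bokeh 2.x, where from_networkx is exposed via bokeh.plotting):

import networkx as nx
import pandas as pd
from bokeh.plotting import figure, from_networkx, show

# Load refined httpx output for a single program (path and columns are illustrative)
df = pd.read_json('httpx-program.json', lines=True)

# Relate each probed URL to the host/IP it resolved to
G = nx.Graph()
for _, row in df.iterrows():
    G.add_edge(row['url'], row['host'])

# Lay out and render the relationship graph in the notebook/browser
plot = figure(title='Program relationship graph', tools='pan,wheel_zoom,box_zoom,reset',
              x_range=(-2.1, 2.1), y_range=(-2.1, 2.1))
graph = from_networkx(G, nx.spring_layout, scale=2, center=(0, 0))
plot.renderers.append(graph)
show(plot)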

14.  Data is indexed using AWS Glue and searched using AWS Athena.

Another powerful tool/service combination for data processing is AWS Glue for indexing files and AWS Athena for searching the indexed data. Running a multitude of tools across hundreds of programs quickly results in an overload of output files that can be overwhelming to manage and search.

AWS Glue is very nearly a point-and-click solution for indexing a repository of data. It can be pointed at an S3 directory path to recursively index similar filetypes. For example, hundreds of HTTPX output JSON files may be stored in the structure of:

  • s3://bucketname/httpx/programA/httpx-output.json
  • s3://bucketname/httpx/programB/httpx-output.json
  • s3://bucketname/httpx/programC/httpx-output.json

The Glue crawler can be pointed to the s3://bucketname/httpx/* path; it will crawl the files, auto-discover the columns, and generate a searchable table to be utilized with Athena.

With 91 programs loaded, there are just over 10 million URLs with corresponding metadata that are indexed and searchable.

Overview of AWS Glue table

At the completion of every recon workflow, the Glue crawler is initiated to update the table in preparation for data analysis.
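That initiation can be as small as a boto3 call (the crawler name is illustrative):

import boto3

def start_httpx_crawler(crawler_name='bugbounty-httpx-crawler'):
    # Kick off the Glue crawler after a recon workflow completes
    glue = boto3.client('glue')
    try:
        glue.start_crawler(Name=crawler_name)
    except glue.exceptions.CrawlerRunningException:
        # The crawler is already processing a previous run; skip rather than fail the workflow
        pass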

Once the table is updated, the files can be searched using Presto syntax within AWS Athena. The following is an example query. Each query can also be initiated via API and the results are written to a defined S3 bucket location.
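A hypothetical example (database, table, column, and bucket names are illustrative) that shows both the Presto-style SQL and how the query can be initiated via the API with results written to S3:

import boto3

athena = boto3.client('athena')

# Example Presto/Athena query against the Glue-crawled httpx table
QUERY = """
SELECT DISTINCT url
FROM bugbounty.httpx
WHERE webserver LIKE '%Apache%'
  AND status_code = 200
"""

response = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={'Database': 'bugbounty'},
    ResultConfiguration={'OutputLocation': 's3://brevity-presentation/athena/'}
)
print(response['QueryExecutionId'])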

For example, a use case may be to query all URLs across every program for a specific URL path or web server type and then output the list of baseURLs to an S3 bucket. A scanning tool such as Nuclei can then leverage the results as an input file to run the applicable templates against the identified targets. This follows the objective of transforming 10 million random URLs into a much more finite subset with a higher likelihood of containing a discoverable vulnerability. Not only does this make vulnerability discovery more efficient, it decreases the volume of noise against the program while increasing the likelihood of signal for the researcher.

15.  Initial development is performed in SageMaker Jupyter Notebooks for establishing new functions.

When developing within cloud environments, leveraging Jupyter notebooks has been an accelerator for debugging, troubleshooting, and extensive native cloud service integration. It may not be as common for an initial development environment, and it does result in some technical debt when porting the code into a more robust IDE, but it makes it extremely easy to develop quick integrations and scalable in-memory data analysis, and it is a great fit for someone trying to improve their skills with languages such as Python or with APIs. Since a Jupyter notebook can break code into individual cells that run independently (similar to debugger breakpoints), it is extremely easy to resolve errors within the code.

When working within the AWS ecosystem, the AWS SageMaker service provides a managed Jupyter notebook environment. When integrating with other AWS services, it removes the challenges of handling authentication and authorization as it can run using an assigned execution IAM role; thus eliminating the need for managing access within the code. The downside is that this environment can be expensive if left running persistently.

Google Cloud provides a free managed Jupyter notebook environment and is a good option to investigate (http://colab.research.google.com/).

For local installations, Anaconda is a popular Jupyter based open source data science toolkit (https://www.anaconda.com/products/individual).

16.  Stable development is performed in an AWS Cloud9 IDE and incorporated into a Python package prior to deployment.

As the codebase grew and needed to be managed within a more robust development IDE with capabilities such as linting and GitHub integration, it was migrated into an AWS Cloud9 workspace. Because one of the principles was to avoid any personal device dependencies (except for access to a browser), Cloud9 has been leveraged. It is a browser-based IDE, supports persistent storage, automatically turns off after a timeout period for cost savings, and includes a command line. So far, Cloud9 has supported the project flawlessly. Similar to the Jupyter notebooks within SageMaker, the Cloud9 IDE supports running under an instance profile, which provides access and authorization across the AWS ecosystem. This is beneficial when using the CLI for deployments or testing code integrations.

AWS Cloud9 IDE browser-based interface

17.  All source code is stored and managed within a GitHub repository.

To maintain resiliency, continuity, and avoid loss of data, all code is stored and managed within GitHub. It is currently within a private repository, but the majority of code will be migrated to an open-source public repository after the upcoming DEF CON 29 Recon Village presentation on Aug 7th.

Challenges Overcome/Lessons Learned

Difficulty keeping scripts and programs up-to-date as syntax, secrets, and scopes evolve.

Solution - Utilize a Just-In-Time (JIT) model by closing the gap between generation and runtime.

  • Input files for the tools and technologies are generated at runtime, rather than storing pre-generated scripts and commands.
  • With the JIT approach, command syntax can be maintained in one location and variables such as program name, scope, and secrets can be written immediately before operation execution.

Object-based storage solutions (e.g., S3) are difficult to search across unstructured and semi-structured data.

Solution - Run a persistent EC2 server that uses an S3 sync CLI command to maintain a copy of the S3 bucket files on an EBS storage volume. This also seems to be the most cost-effective option. Although it duplicates S3 data to EBS, the amount of non-stop processing, searching, and analysis that can be applied against the data should ultimately provide a consistent return on investment (ROI).

How should files and programs be structured?

Solution - Separate programs as much as possible with a directory per program, and then keep them organized within the same directory structure. Some situations requiring holistic metrics need data to remain within the same path, but it can still be separated by directory.

  • Without directories per program, there were issues with pulling program files to ephemeral systems beyond the intended scope and operation.
  • There were issues with too many files in directories if raw data and outputs were not separated out via programs, domains, operations, etc.

Maintaining scope across automation is difficult

There are many different ways that scope can be defined and articulated. There is a growing list of variations that must be programmed into the automation. Examples of variations that occur and must be handled:

- https://site.com/inscope
- site.com/inscope
- .site.com
- https://site.com/inscope/*
- <bughunterdomain>.site.com
- 192.168.0.1/16
- 192.168.0.45
- https://github.com/inscoperepo
- com.application.id
- Unstructured text describing scope

These in-scope examples are also similar to how out-of-scope is defined.

Solution - Review outputs for unintended out-of-scope or broken functionality each time new data is generated. Continue to add the handling and logic for the variations, covering both in-scope and out-of-scope.

Conclusion

Although this is not a step-by-step tutorial for building the ecosystem, the goal was to provide numerous examples, ideas, and strategies that have led to success. Although the bug bounty community is super competitive, we are all ultimately working towards the same goal of improving the security of software and systems across the planet.

Certainly reach out to me if you have questions or find success in applying these concepts to your own specific use case. This article will continue to be updated as the automation systems evolve.

The source code will be made public and available at https://github.com/brevityinmotion/brevityrecon.

Be sure to check out my talk with added commentary and walkthroughs of this ecosystem at my DEF CON 29 Recon Village presentation - "Let the bugs come to me - How to build cloud-based recon automation at scale" on Aug 7th.