The definition of a web crawler from Wikipedia is: "A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing (web spidering)."
Which Serverless framework?
When designing this project, I considered several Serverless architectures and frameworks. I found a very good article (here) that describes the differences between some of the most used ones: Serverless.com, AWS SAM and AWS CDK. I decided to try AWS SAM. Learning SAM was pretty straightforward, and it makes it very easy to define, test and deploy a CloudFormation stack with Lambdas, SQS queues, permission policies and DynamoDB tables.
BFS Algorithm
If we think of the web as a directed graph where pages are nodes and hyperlinks are edges, then a web crawler can be implemented using a Breadth-First Search (BFS) algorithm that:
- Extracts a URL from a queue.
- Marks it as visited and stores the result.
- Adds all links found on the URL to the queue.
- Repeats until the queue is empty or a maximum number of levels has been visited (important mainly when crawling across multiple domains). A minimal single-process sketch of this loop follows.
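For reference, here is a minimal single-process sketch of that loop in Node.js. fetchLinks is a hypothetical helper that downloads a page and returns the URLs found on it; the serverless design below replaces the in-memory queue and set with SQS and DynamoDB.

// Minimal single-process BFS crawl loop (illustrative sketch, not the Lambda code).
// fetchLinks(url) is a hypothetical helper returning the hyperlinks found on a page.
async function crawl(startUrl, maxLevels) {
  const visited = new Set();
  const queue = [{ url: startUrl, level: 0 }];

  while (queue.length > 0) {
    const { url, level } = queue.shift();            // 1. extract a URL from the queue
    if (visited.has(url) || level > maxLevels) continue;
    visited.add(url);                                // 2. mark it as visited and store it
    const links = await fetchLinks(url);
    links.forEach((link) =>
      queue.push({ url: link, level: level + 1 }));  // 3. add all links found to the queue
  }
  return visited;                                    // 4. stop when the queue is empty
}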
Serverless Design
The following diagram shows a distributed version of the BFS algorithm described above.
Workers are shown in blue and message queues in yellow. For more background on these components, please refer to the post “Designing high scalable systems“.
The following sections describe each Lambda involved in detail:
1. PostJobs
This is a simple AWS Lambda used to start the web-crawling process. It is triggered by the REST endpoint POST /jobs and adds a single message to the UrlFrontier AWS SQS queue.
POST /jobs with payload:
{ "url": "https://<start-url>.com/", "source": "", "level": 0 }
2. UrlFrontierWorker
This Lambda is triggered by new messages added to the AWS SQS queue UrlFrontierQueue. The purpose of this function is to filter out URLs that have already been visited and to store the ones that have not been visited yet.
If the URL has not been visited already, it is added to the DB and forwarded to the parser function by adding a message to the SQS queue UrlParserQueue. A key-value store such as AWS DynamoDB can be used to filter and store URLs.
Note that this queue has a delay of 1 second to avoid sending too many requests to the same host from the same IP.
For each message in the batch (the default batch size is 10), the function checks whether the URL has already been visited by looking it up in a DynamoDB table.
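A sketch of this worker, assuming the "already visited" check is implemented with a DynamoDB conditional write (which also keeps duplicate deliveries idempotent), might look like this:

const AWS = require('aws-sdk');
const db = new AWS.DynamoDB.DocumentClient();
const sqs = new AWS.SQS();

// src/handlers/url-frontier-worker.lambdaHandler -- sketch only
exports.lambdaHandler = async (event) => {
  for (const record of event.Records) { // up to BatchSize messages per invocation
    const { url, source, level } = JSON.parse(record.body);

    try {
      // Store the URL only if it has never been seen ("url" is the table's primary key).
      await db.put({
        TableName: process.env.TABLE_NAME,
        Item: { url, source, level },
        ConditionExpression: 'attribute_not_exists(#u)',
        ExpressionAttributeNames: { '#u': 'url' },
      }).promise();
    } catch (err) {
      if (err.code === 'ConditionalCheckFailedException') continue; // already visited
      throw err;
    }

    // New URL: forward it to the parser for link extraction.
    await sqs.sendMessage({
      QueueUrl: process.env.URL_PARSER_QUEUE_URL,
      MessageBody: JSON.stringify({ url, source, level }),
    }).promise();
  }
};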
3. UrlParserWorker
The purpose of this Lambda is to extract URLs from the pages identified by the URLs added to the UrlParserQueue. Libraries such as “axios” and “cheerio” can be used to download the HTML and extract the links from it; more advanced techniques can be used to render the page and identify URLs generated dynamically by its JavaScript.
Each new URL that is extracted from the HTML is added back to UrlFrontierQueue for analysis, storage and further parsing.
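For illustration, the download-and-extract step with axios and cheerio could be sketched as follows, capped by the MAX_NEW_URLS variable defined in the template (a sketch only; the real handler may handle errors and URL normalisation differently):

const AWS = require('aws-sdk');
const axios = require('axios');
const cheerio = require('cheerio');
const sqs = new AWS.SQS();

// src/handlers/url-parser-worker.lambdaHandler -- sketch only
exports.lambdaHandler = async (event) => {
  for (const record of event.Records) {
    const { url, level } = JSON.parse(record.body);

    const { data: html } = await axios.get(url);  // download the page
    const $ = cheerio.load(html);                 // parse the HTML

    // Collect absolute URLs from <a href="..."> tags, capped at MAX_NEW_URLS.
    const links = $('a[href]')
      .map((_, a) => new URL($(a).attr('href'), url).href)
      .get()
      .slice(0, Number(process.env.MAX_NEW_URLS));

    // Send every link back to the frontier queue, one BFS level deeper.
    for (const link of links) {
      await sqs.sendMessage({
        QueueUrl: process.env.URL_FRONTIER_QUEUE_URL,
        MessageBody: JSON.stringify({ url: link, source: url, level: level + 1 }),
      }).promise();
    }
  }
};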
4. GetResultsFunction
This function simply reads the URLs stored in the DB and returns them as JSON via the endpoint “GET /results”. Here is an example of the endpoint output:
{
  "items": [
    {
      "source": "",
      "url": "https://<base_domain>.com/",
      "level": 0
    },
    {
      "source": "https://<base_domain>.com/",
      "url": "https://<base_domain>.com/i/switch/",
      "level": 1
    },
    [...]
    {
      "source": "https://<base_domain>.com/i/coronavirus-business-hub/",
      "url": "https://<base_domain>.com/blog/government-support-start-ups-coronavirus/",
      "level": 4
    }
  ],
  "count": 30
}
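The handler behind this endpoint can be little more than a DynamoDB scan. A sketch (a real implementation may need to paginate with LastEvaluatedKey for larger crawls):

const AWS = require('aws-sdk');
const db = new AWS.DynamoDB.DocumentClient();

// src/handlers/get-results.lambdaHandler -- sketch only
exports.lambdaHandler = async () => {
  // Scan reads the whole table, which is fine for a demo-sized crawl.
  const { Items, Count } = await db.scan({ TableName: process.env.TABLE_NAME }).promise();

  return {
    statusCode: 200,
    body: JSON.stringify({ items: Items, count: Count }),
  };
};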
A few considerations
- It is clear from the diagram that the dependency between UrlParserWorker and UrlFrontierWorker is cyclic, and this can result in expensive infinite loops. To mitigate this risk, the following precautions have been taken (see the sketch after this list):
  - The level is increased on every iteration of the UrlParserWorker and the crawl stops once MAX_BFS_LEVELS is reached.
  - The maximum time a message can stay in a queue has been limited via the queues' MessageRetentionPeriod.
- SQS “exactly once” vs “at least once”. The message queues in this project are standard queues with “at least once” delivery, meaning duplicates are possible. This is acceptable in this context, as “exactly once” (FIFO queues) would cost more in terms of resources and billing and would not bring substantial advantages.
- A DynamoDB SimpleTable is used, meaning the structure is defined by whatever data is written to the table. This is acceptable for the complexity of this project.
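As an illustration of the first point, the loop protection boils down to a single guard in the parser worker, using the MAX_BFS_LEVELS variable defined in the template (a sketch; names other than the environment variable are illustrative):

// Loop-protection sketch: only expand a page's links while below the maximum depth.
// Duplicates are already filtered by the conditional write in UrlFrontierWorker.
const MAX_BFS_LEVELS = Number(process.env.MAX_BFS_LEVELS || 4);

function shouldExpand(message) {
  // e.g. { url: '...', level: 4 } is not expanded further when MAX_BFS_LEVELS is 4
  return message.level < MAX_BFS_LEVELS;
}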
SAM configuration
The following YAML defines the architecture described in the previous section:
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: >
  bf-web-crawler
  Sample SAM Template for bf-web-crawler

# More info about Globals: https://github.com/awslabs/serverless-application-model/blob/master/docs/globals.rst
Globals:
  Function:
    Timeout: 30
    Runtime: nodejs14.x
    Environment: # Inject environment variables
      Variables:
        URL_FRONTIER_QUEUE_URL:
          Ref: UrlFrontierQueue
        URL_PARSER_QUEUE_URL:
          Ref: UrlParserQueue
        TABLE_NAME:
          Ref: UrlsTable
        MAX_BFS_LEVELS: 4
        MAX_NEW_URLS: 50

Resources:
  UrlFrontierQueue:
    Type: AWS::SQS::Queue
    Properties:
      QueueName: sam-url-frontier-queue
      DelaySeconds: 1 # For "politeness"
      MessageRetentionPeriod: 60 # 1 min
  UrlParserQueue:
    Type: AWS::SQS::Queue
    Properties:
      QueueName: sam-url-parser-queue
      MessageRetentionPeriod: 60 # 1 min
  GetResultsFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: src/handlers/get-results.lambdaHandler
      Architectures:
        - x86_64
      Events:
        WebCrawler:
          Type: Api
          Properties:
            Path: /results
            Method: get
      Policies:
        - DynamoDBCrudPolicy:
            TableName:
              Ref: UrlsTable
  PostJobsFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: src/handlers/post-jobs.lambdaHandler
      Architectures:
        - x86_64
      Events:
        WebCrawler:
          Type: Api
          Properties:
            Path: /jobs
            Method: post
      Policies:
        - SQSSendMessagePolicy:
            QueueName: !GetAtt UrlFrontierQueue.QueueName
        - DynamoDBCrudPolicy:
            TableName:
              Ref: UrlsTable
  UrlFrontierWorker:
    Type: AWS::Serverless::Function
    Properties:
      Handler: src/handlers/url-frontier-worker.lambdaHandler
      Events:
        SQSQueueEvent:
          Type: SQS
          Properties:
            Queue: !GetAtt UrlFrontierQueue.Arn
            BatchSize: 10
      Policies:
        - SQSSendMessagePolicy:
            QueueName: !GetAtt UrlParserQueue.QueueName
        - DynamoDBCrudPolicy:
            TableName:
              Ref: UrlsTable
  UrlParserWorker:
    Type: AWS::Serverless::Function
    Properties:
      Handler: src/handlers/url-parser-worker.lambdaHandler
      Events:
        SQSQueueEvent:
          Type: SQS
          Properties:
            Queue: !GetAtt UrlParserQueue.Arn
            BatchSize: 1
      Policies:
        - SQSSendMessagePolicy:
            QueueName: !GetAtt UrlFrontierQueue.QueueName
  UrlsTable:
    Type: AWS::Serverless::SimpleTable
    Properties:
      PrimaryKey:
        Name: url
        Type: String

Outputs:
  WebCrawlerApi:
    Description: "API Gateway endpoint URL for the Prod stage of the web crawler API"
    Value: !Sub "https://${ServerlessRestApi}.execute-api.${AWS::Region}.amazonaws.com/Prod/"
Deploy
The following command deploys the application to AWS.
sam deploy --guided
The application code is zipped, uploaded to S3 and then deployed as the different Lambdas. This step assumes the AWS CLI is installed and configured with aws configure.
Seeing it in action
Run locally
The following command allows you to run any Lambda locally. This command needs AWS CLI installed and configured, and Docker installed and running locally.
sam local invoke UrlFrontierWorker --event events/event-sqs-url-frontier.json
In the cloud
It was particularly exciting to see the system in action and have the Lambda correctly exchange messages.
This system is currently deployed at: https://wr3xz0pkej.execute-api.eu-west-1.amazonaws.com/Prod
To launch a new job:
curl --location --request POST 'https://<base_lambda_url>.amazonaws.com/Prod/jobs/' \
--header 'Content-Type: application/json' \
--data-raw '{
"url": "https://<base_domain>.com/",
"source": "",
"level": 0
}'
Get results
curl --location --request GET 'https://<base_lambda_url>.amazonaws.com/Prod/results'
