The definition of a web crawler from Wikipedia is: "A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing (web spidering)."
Which Serverless framework?
When designing this project, I considered several Serverless architectures and frameworks. I found a very good article (here) that describes the differences between some of the most used ones: Serverless.com, AWS SAM and AWS CDK. I decided to try AWS SAM. Learning SAM was pretty straightforward, and it makes it very easy to define, test and deploy a CloudFormation stack with Lambdas, SQS queues, permission policies and DynamoDB tables.
BFS Algorithm
If we think of the web as a directed graph where pages are nodes and hyperlinks are edges, then a web crawler can be implemented using a Breadth-First Search (BFS) algorithm that:
- Extracts a URL from a queue.
- Marks it as visited and stores the result.
- Adds all links found on the URL to the queue.
- Repeats until the queue is empty or a maximum number of levels has been visited (important mainly when crawling across multiple domains). A minimal single-process sketch of this loop follows.
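For reference, here is a minimal single-process sketch of that loop in Node.js. fetchLinks is a hypothetical helper that downloads a page and returns the URLs found on it; the serverless design below replaces the in-memory queue and set with SQS and DynamoDB.

// Minimal single-process BFS crawl loop (illustrative sketch, not the Lambda code).
// fetchLinks(url) is a hypothetical helper returning the hyperlinks found on a page.
async function crawl(startUrl, maxLevels) {
  const visited = new Set();
  const queue = [{ url: startUrl, level: 0 }];

  while (queue.length > 0) {
    const { url, level } = queue.shift();            // 1. extract a URL from the queue
    if (visited.has(url) || level > maxLevels) continue;
    visited.add(url);                                // 2. mark it as visited and store it
    const links = await fetchLinks(url);
    links.forEach((link) =>
      queue.push({ url: link, level: level + 1 }));  // 3. add all links found to the queue
  }
  return visited;                                    // 4. stop when the queue is empty
}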
Serverless Design
The following diagram shows a distributed version of the BFS algorithm described above.
Workers are shown in blue and message queues in yellow. For more background on these components, please refer to the post “Designing high scalable systems“.
The following sections describe each Lambda involved in detail:
1. PostJobs
This is a simple AWS Lambda used to start the web-crawling process. It is triggered by the REST endpoint POST /jobs and adds a single message to the UrlFrontier AWS SQS queue.
POST /jobs with payload:
{ "url": "https://<start-url>.com/", "source": "", "level": 0 }
2. UrlFrontierWorker
This Lambda is triggered by new messages added to the AWS SQS queue UrlFrontierQueue. The purpose of this function is to filter out URLs that have already been visited and to store the ones that have not been visited yet.
If the URL has not been visited already, it is added to the DB and forwarded to the parser function by adding a message to the SQS queue UrlParserQueue. A key-value store such as AWS DynamoDB can be used to filter and store URLs.
Note that this queue has a delay of 1 second to avoid sending too many requests to the same host from the same IP.
For each message in the batch (the default batch size is 10), the function checks whether the URL has already been visited by looking it up in a DynamoDB table.
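A sketch of this worker, assuming the "already visited" check is implemented with a DynamoDB conditional write (which also keeps duplicate deliveries idempotent), might look like this:

const AWS = require('aws-sdk');
const db = new AWS.DynamoDB.DocumentClient();
const sqs = new AWS.SQS();

// src/handlers/url-frontier-worker.lambdaHandler -- sketch only
exports.lambdaHandler = async (event) => {
  for (const record of event.Records) { // up to BatchSize messages per invocation
    const { url, source, level } = JSON.parse(record.body);

    try {
      // Store the URL only if it has never been seen ("url" is the table's primary key).
      await db.put({
        TableName: process.env.TABLE_NAME,
        Item: { url, source, level },
        ConditionExpression: 'attribute_not_exists(#u)',
        ExpressionAttributeNames: { '#u': 'url' },
      }).promise();
    } catch (err) {
      if (err.code === 'ConditionalCheckFailedException') continue; // already visited
      throw err;
    }

    // New URL: forward it to the parser for link extraction.
    await sqs.sendMessage({
      QueueUrl: process.env.URL_PARSER_QUEUE_URL,
      MessageBody: JSON.stringify({ url, source, level }),
    }).promise();
  }
};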
3. UrlParserWorker
The purpose of this Lambda is to extract URLs from the pages identified by the URLs added to the UrlParserQueue. Libraries such as “axios” and “cheerio” can be used to download the HTML and extract the links from it; more advanced techniques can be used to render the page and identify URLs generated dynamically by its JavaScript.
Each new URL that is extracted from the HTML is added back to UrlFrontierQueue for analysis, storage and further parsing.
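For illustration, the download-and-extract step with axios and cheerio could be sketched as follows, capped by the MAX_NEW_URLS variable defined in the template (a sketch only; the real handler may handle errors and URL normalisation differently):

const AWS = require('aws-sdk');
const axios = require('axios');
const cheerio = require('cheerio');
const sqs = new AWS.SQS();

// src/handlers/url-parser-worker.lambdaHandler -- sketch only
exports.lambdaHandler = async (event) => {
  for (const record of event.Records) {
    const { url, level } = JSON.parse(record.body);

    const { data: html } = await axios.get(url);  // download the page
    const $ = cheerio.load(html);                 // parse the HTML

    // Collect absolute URLs from <a href="..."> tags, capped at MAX_NEW_URLS.
    const links = $('a[href]')
      .map((_, a) => new URL($(a).attr('href'), url).href)
      .get()
      .slice(0, Number(process.env.MAX_NEW_URLS));

    // Send every link back to the frontier queue, one BFS level deeper.
    for (const link of links) {
      await sqs.sendMessage({
        QueueUrl: process.env.URL_FRONTIER_QUEUE_URL,
        MessageBody: JSON.stringify({ url: link, source: url, level: level + 1 }),
      }).promise();
    }
  }
};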
4. GetResultsFunction
This function simply reads the URLs stored in the DB and returns them as JSON via the endpoint “GET /results”. Here is an example of the endpoint output:
{
  "items": [
    {
      "source": "",
      "url": "https://<base_domain>.com/",
      "level": 0
    },
    {
      "source": "https://<base_domain>.com/",
      "url": "https://<base_domain>.com/i/switch/",
      "level": 1
    },
    [...]
    {
      "source": "https://<base_domain>.com/i/coronavirus-business-hub/",
      "url": "https://<base_domain>.com/blog/government-support-start-ups-coronavirus/",
      "level": 4
    }
  ],
  "count": 30
}
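The handler behind this endpoint can be little more than a DynamoDB scan. A sketch (a real implementation may need to paginate with LastEvaluatedKey for larger crawls):

const AWS = require('aws-sdk');
const db = new AWS.DynamoDB.DocumentClient();

// src/handlers/get-results.lambdaHandler -- sketch only
exports.lambdaHandler = async () => {
  // Scan reads the whole table, which is fine for a demo-sized crawl.
  const { Items, Count } = await db.scan({ TableName: process.env.TABLE_NAME }).promise();

  return {
    statusCode: 200,
    body: JSON.stringify({ items: Items, count: Count }),
  };
};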
A few considerations
- It is clear from the diagram that the dependency between UrlParserWorker and UrlFrontierWorker is cyclic, and this can result in expensive infinite loops. To mitigate this risk, the following precautions have been taken (see the sketch after this list):
  - The level is increased on every iteration of the UrlParserWorker and the crawl stops once MAX_BFS_LEVELS is reached.
  - The maximum time a message can stay in a queue has been limited via the queues' MessageRetentionPeriod.
- SQS “exactly once” vs “at least once”. The message queues in this project are standard queues with “at least once” delivery, meaning duplicates are possible. This is acceptable in this context, as “exactly once” (FIFO queues) would cost more in terms of resources and billing and would not bring substantial advantages.
- A DynamoDB SimpleTable is used, meaning the structure is defined by whatever data is written to the table. This is acceptable for the complexity of this project.
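As an illustration of the first point, the loop protection boils down to a single guard in the parser worker, using the MAX_BFS_LEVELS variable defined in the template (a sketch; names other than the environment variable are illustrative):

// Loop-protection sketch: only expand a page's links while below the maximum depth.
// Duplicates are already filtered by the conditional write in UrlFrontierWorker.
const MAX_BFS_LEVELS = Number(process.env.MAX_BFS_LEVELS || 4);

function shouldExpand(message) {
  // e.g. { url: '...', level: 4 } is not expanded further when MAX_BFS_LEVELS is 4
  return message.level < MAX_BFS_LEVELS;
}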
SAM configuration
The following YAML defines the architecture described in the previous section:
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: >
  bf-web-crawler
  Sample SAM Template for bf-web-crawler

# More info about Globals: https://github.com/awslabs/serverless-application-model/blob/master/docs/globals.rst
Globals:
  Function:
    Timeout: 30
    Runtime: nodejs14.x
    Environment: # Inject environment variables
      Variables:
        URL_FRONTIER_QUEUE_URL:
          Ref: UrlFrontierQueue
        URL_PARSER_QUEUE_URL:
          Ref: UrlParserQueue
        TABLE_NAME:
          Ref: UrlsTable
        MAX_BFS_LEVELS: 4
        MAX_NEW_URLS: 50

Resources:
  UrlFrontierQueue:
    Type: AWS::SQS::Queue
    Properties:
      QueueName: sam-url-frontier-queue
      DelaySeconds: 1 # For "politeness"
      MessageRetentionPeriod: 60 # 1 min
  UrlParserQueue:
    Type: AWS::SQS::Queue
    Properties:
      QueueName: sam-url-parser-queue
      MessageRetentionPeriod: 60 # 1 min
  GetResultsFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: src/handlers/get-results.lambdaHandler
      Architectures:
        - x86_64
      Events:
        WebCrawler:
          Type: Api
          Properties:
            Path: /results
            Method: get
      Policies:
        - DynamoDBCrudPolicy:
            TableName:
              Ref: UrlsTable
  PostJobsFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: src/handlers/post-jobs.lambdaHandler
      Architectures:
        - x86_64
      Events:
        WebCrawler:
          Type: Api
          Properties:
            Path: /jobs
            Method: post
      Policies:
        - SQSSendMessagePolicy:
            QueueName: !GetAtt UrlFrontierQueue.QueueName
        - DynamoDBCrudPolicy:
            TableName:
              Ref: UrlsTable
  UrlFrontierWorker:
    Type: AWS::Serverless::Function
    Properties:
      Handler: src/handlers/url-frontier-worker.lambdaHandler
      Events:
        SQSQueueEvent:
          Type: SQS
          Properties:
            Queue: !GetAtt UrlFrontierQueue.Arn
            BatchSize: 10
      Policies:
        - SQSSendMessagePolicy:
            QueueName: !GetAtt UrlParserQueue.QueueName
        - DynamoDBCrudPolicy:
            TableName:
              Ref: UrlsTable
  UrlParserWorker:
    Type: AWS::Serverless::Function
    Properties:
      Handler: src/handlers/url-parser-worker.lambdaHandler
      Events:
        SQSQueueEvent:
          Type: SQS
          Properties:
            Queue: !GetAtt UrlParserQueue.Arn
            BatchSize: 1
      Policies:
        - SQSSendMessagePolicy:
            QueueName: !GetAtt UrlFrontierQueue.QueueName
  UrlsTable:
    Type: AWS::Serverless::SimpleTable
    Properties:
      PrimaryKey:
        Name: url
        Type: String

Outputs:
  WebCrawlerApi:
    Description: "API Gateway endpoint URL for the Prod stage of the web crawler API"
    Value: !Sub "https://${ServerlessRestApi}.execute-api.${AWS::Region}.amazonaws.com/Prod/"
Deploy
The following command deploys the application to AWS.
sam deploy --guided
The application code is zipped, uploaded to S3 and then deployed as the different Lambdas. This step assumes the AWS CLI is installed and configured with aws configure.
Seeing it in action
Run locally
The following command allows you to run any Lambda locally. This command needs AWS CLI installed and configured, and Docker installed and running locally.
sam local invoke UrlFrontierWorker --event events/event-sqs-url-frontier.json
In the cloud
It was particularly exciting to see the system in action and have the Lambda correctly exchange messages.
This system is currently deployed at: https://wr3xz0pkej.execute-api.eu-west-1.amazonaws.com/Prod
To launch a new job:
curl --location --request POST 'https://<base_lambda_url>.amazonaws.com/Prod/jobs/' \
--header 'Content-Type: application/json' \
--data-raw '{
"url": "https://<base_domain>.com/",
"source": "",
"level": 0
}'
Get results
curl --location --request GET 'https://<base_lambda_url>.amazonaws.com/Prod/results'
