babyspider documentation

babyspider

Babyspider is a simple web crawler for single domain websites.

Installation

Babyspider is available as a Python wheel package installable with pipx (https://github.com/pypa/pipx) where X.Y.Z is the version number:

pipx install babyspider-X.Y.Z-py3-none-any.whl

If pipx is not available and cannot be installed, you can also use pip (https://github.com/pypa/pip), which is installed as a part of Python. To install babyspider version X.Y.Z for a single user with pip, run:

pip install --user babyspider-X.Y.Z-py3-none-any.whl

Installing babyspider globally with pip is not recommended and may not be allowed by your system, but can be done by removing the --user option.
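For example, a global installation uses the same command with the --user option removed:

pip install babyspider-X.Y.Z-py3-none-any.whl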

To verify the installation was successful, run:

babyspider --version

This should display the version number and exit:

babyspider 1.0.0

Usage

Once installed, the simplest usage is to just give babyspider a domain:

babyspider www.example.com

This will crawl https://www.example.com and download the URLs it finds to a local folder with a timestamp suffix, of the form: www.example.com_20250629_1231

Babyspider also supports various options, such as providing more feedback while crawling with --verbose, leaving off the timestamp with --no-timestamp, and setting a base directory for output with --output-path. For example, running the following commands:

babyspider -vo /opt/website_backups www.example.com
babyspider -vo /opt/website_backups http://example.org
babyspider -vo /opt/website_backups https://school.example.edu/links

would crawl three websites, the second using HTTP instead of HTTPS and the third starting from the links page, and save the output in three folders with names based on the date and time, something like:

/opt/website_backups/www.example.com_20250131_1200
/opt/website_backups/example.org_20250131_1232
/opt/website_backups/school.example.edu_20250131_1258

More or less detail

For even more detail while running, you can add additional --verbose or -v options, such as this very noisy example:

babyspider -vvvv example.com

Conversely, you can limit output to just errors with -q or --quiet, or even hide errors and critical messages with -qqq.
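For example, to show only errors, or to run almost silently:

babyspider -q example.com
babyspider -qqq example.com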

Downloading images

The --images or -i option will tell babyspider to download not only links, but also any image tags:

babyspider --images example.com

Note that, currently, this will not download background images set with style sheets, just images included directly.

Sitemaps

If the website provides them, babyspider downloads all links listed in its sitemaps. This is done before crawling for links. You can download just the sitemap links, or ignore them entirely, by setting the --use-sitemaps option to only or never, respectively.
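For example:

babyspider --use-sitemaps only www.example.com
babyspider --use-sitemaps never www.example.com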

Running as a module

If needed, babyspider may also be run as a module:

python3 -m babyspider www.example.com

Viewing downloaded content

You can generally view the downloaded content in your local web browser by opening the index.html file in the download directory.
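For example, using the timestamped folder name from the Usage section above, on most Linux desktops you could run:

xdg-open www.example.com_20250629_1231/index.html

(On macOS, the equivalent command is open.)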

Babyspider doesn’t rewrite anything it downloads, so any links or images that point to full URLs will still be pulled from the network. Currently, it also doesn’t pull down style sheets or images they reference.

So while the results won’t be the same as if the content were hosted on a webserver, the main content should be available for basic sites.

Respecting robots.txt settings

Babyspider supports and respects robots.txt files (https://www.robotstxt.org/).

Requests per second are converted into a minimum crawl delay, and the largest of that delay, the crawl delay from robots.txt, and the user-requested crawl delay is used.
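In other words, the effective delay is the largest of the three values. A minimal sketch of that selection in Python (the names here are illustrative, not babyspider’s actual internals):

def effective_crawl_delay(requests_per_second, robots_delay, requested_delay):
    # A requests-per-second limit is equivalent to a minimum delay between requests.
    minimum_delay = 1.0 / requests_per_second
    return max(minimum_delay, robots_delay, requested_delay)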

It uses a user agent string of babyspider/ followed by the version number (for example, babyspider/1.0.0) and will not download any resource forbidden by the robots.txt file.

Unlike sitemap usage, neither of these behaviors can be overridden.

Using different HTML parsers

Babyspider uses BeautifulSoup for HTML parsing and defaults to Python’s built-in HTML parser.

If you need more browser-like (html5lib) or faster (lxml) parsing, you can install additional parsers as described in:

https://beautiful-soup-4.readthedocs.io/en/latest/#installing-a-parser
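Both are regular Python packages that can typically be installed with pip, for example:

pip install lxml
pip install html5lib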

These can then be used by passing the BeautifulSoup parser name to babyspider with the -p or --parser flag. For convenience, the --fast option is an alias for -p lxml, and --good is likewise an alias for -p html5lib.
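For example, the following two commands are equivalent:

babyspider --parser lxml example.com
babyspider --fast example.com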

Exit codes

Babyspider will set an exit code of 1 if any errors occurred, or 0 if there were none.
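This makes failures easy to detect in a shell script, for example:

babyspider www.example.com || echo "babyspider reported errors" >&2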

Additional options

For more details, babyspider supports the standard --help option:

babyspider --help

Architectural design

Babyspider is intended as a simple web crawler suitable for backing up simple content sites where important resources are either directly linked, or referenced in sitemaps. It makes keeping timestamped sets of backups easy, and allows for manual browsing of files without a webserver, aside from having to manually select the index page for directories.

Sites with dynamic content generated by services from form input or other posted data are specifically not meant for babyspider to crawl.

It does not currently grab stylesheet content or JavaScript, but could easily be expanded to do so in the same way it can optionally download image links.

It is structured as a single package, with the interface and primary entrypoint at the top level (babyspider.__init__) and the crawling logic in a submodule (babyspider.crawler).
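Concretely, the layout looks something like this (the file names shown here are illustrative):

src/
    babyspider/
        __init__.py    (command line interface and entrypoint)
        crawler.py     (crawling logic, including the Crawler class)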

Web crawling is naturally suited to recursion. Babyspider’s crawler is implemented as a class primarily to make a larger amount of state available without passing an excessive number of arguments to the main recursive function. The downside to using a single class is that all of the methods live in a single file, unlike multiple modules of plain functions. If additional features are added, it may be worth exploring either moving some functionality out of class methods into functions, or breaking some of the functionality into one or more support classes. As always, the end goal is to make the code more readable, and any such changes should be made with this goal in mind.

While the babyspider.crawler.Crawler class does need a fair amount of internal state, care has been taken to make methods public where possible to allow for code reuse.

The code is also deliberately single-threaded. The primary bottleneck in downloading a large amount of content from a website is the web server itself, along with not overloading it or appearing malicious and getting restricted. This means that multiple processes would only complicate the design, since respecting download limits for simpler (and likely lower-traffic) sites would not yield major speed improvements.

The final noteworthy aspect of babyspider’s design is its focus on providing a friendly user interface. It requires minimal options and has sensible defaults based on the options provided. It strives to be like Python itself by making simple things easy and harder things possible.

Development

Babyspider uses poetry (https://python-poetry.org) for package management. The pyproject.toml includes a development package group with black, pytest, pylint, mypy, and ipython. To set up a development virtual environment, run:

poetry install

You can build the package with:

poetry build

Dependencies can be updated with:

poetry update

After any code changes, it is recommended to run black for style formatting, pylint for code linting, and mypy for type checking:

poetry run black src tests
poetry run pylint src tests
poetry run mypy

Tests can be run with:

poetry run pytest

The HTML documentation can be generated with a few more steps:

$(poetry env activate)
cd docs; make html; cd -
deactivate

The results will be in the docs/build/html directory. Other build targets can also be used if the appropriate software is installed.

Version management is also easily handled by poetry. For details, run:

poetry version --help