babyspider package

Module contents

babyspider - A single-domain web crawler example project

babyspider.build_parser()

Build and return a command line argument parser for the babyspider webcrawler.

Return type:

ArgumentParser

Returns:

The configured babyspider argument parser

babyspider.run(args=None)

Run the babyspider web crawler as a command line tool with the given arguments.

Parameters:

args (Optional[Sequence[str]]) – A sequence of command line argument strings. This defaults to sys.argv

Return type:

int

Returns:

Either 0 if no errors or 1 if errors occurred
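
For example, a minimal invocation from Python rather than the shell. The positional URL argument is an assumption here; the actual option names are defined by build_parser():

    import sys

    import babyspider

    # Parse ["example.com"] as if it came from the command line, then exit
    # with the crawler's status: 0 on success, 1 if errors occurred.
    sys.exit(babyspider.run(["example.com"]))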

Submodules

babyspider.crawler module

babyspider.crawler - Functions to crawl a web domain and download resources

class babyspider.crawler.Crawler(base_output_dir=PosixPath('.'), parser='html.parser', crawl_delay=3, use_sitemaps='first')

Bases: object

A simple webcrawler that limits itself to a single domain.

See crawl() to begin crawling a site.

base_output_dir

The base directory where the download directory will be created. Defaults to pathlib.Path('.')

parser

A valid BeautifulSoup HTML parser, typically one of ‘html.parser’, ‘lxml’, or ‘html5lib’. Defaults to DEFAULT_HTML_PARSER

crawl_delay

The number of seconds to pause between download requests. Defaults to DEFAULT_CRAWL_DELAY. Minimum value of 0

use_sitemaps

When sitemaps should be used, one of ‘first’, ‘never’, or ‘only’. Defaults to ‘first’

max_redirects

The maximum number of redirects to follow when resolving URLs before downloading. Defaults to 10. A minimum of 5 is recommended by RFC 1945

http_timeout

Number of seconds to wait before giving up on any request. Defaults to 300

chunk_size

Size of chunks used when streaming downloads. Defaults to 8192

index_filename

The name assigned when downloading directory indexes. Defaults to ‘index.html’
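
A minimal construction sketch using only the documented constructor parameters:

    from pathlib import Path

    from babyspider.crawler import Crawler

    crawler = Crawler(
        base_output_dir=Path("/tmp"),  # the download directory is created under /tmp
        parser="html.parser",          # stdlib parser, no extra dependency
        crawl_delay=5,                 # pause 5 seconds between requests
        use_sitemaps="first",          # try sitemaps before falling back to link crawling
    )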

static check_parser(parser)

Raise a ValueError if the given parser is not available in BeautifulSoup.

Parameters:

parser (str) – The parser string to validate

Raises:

ValueError – If the requested parser is not available

Return type:

None
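
Because this is a static method that raises rather than returns, a typical use is as a guard before construction; the fallback choice here is illustrative:

    from babyspider.crawler import Crawler

    parser = "lxml"
    try:
        Crawler.check_parser(parser)
    except ValueError:
        # lxml is not installed; fall back to the always-available stdlib parser
        parser = "html.parser"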

chunk_size = 8192

crawl(url_or_domain, output_dir_name=None, images=False, add_timestamp=None)

Download the resource at the given URL and, if it is HTML, do the same for all linked resources on the same domain. Returns the full output path, the canonicalized starting URL, and the number of errors.

Parameters:
  • url_or_domain (str) – The URL or domain to start the web crawl from

  • output_dir_name (str | None) – The output directory to create under the base output directory, where found resources will be downloaded. Defaults to the domain name of the starting URL

  • images (bool) – If True, scan image tags for URLs to download, not just links

  • add_timestamp (bool | None) – If True, add a timestamp to the end of the output directory name. This defaults to False except when no output directory name is given

Raises:
  • ValueError – If the starting URL is forbidden by the site’s robots.txt file, has too many redirects, or if use_sitemaps is set to ‘only’ but there are no sitemaps

Return type:

tuple[str, str, int]

Returns:

The name of the full output directory, the canonical starting URL, and the total number of errors during the process

This is a wrapper around self.__crawl() that determines and validates the starting URL, resets internal found-URL tracking, and sets up the robots.txt parser before starting the crawl.
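
A sketch of a complete crawl using the documented signature and return tuple; the domain is illustrative:

    from babyspider.crawler import Crawler

    crawler = Crawler(crawl_delay=3)
    output_dir, start_url, errors = crawler.crawl(
        "example.com",  # a bare domain or a full URL both work
        images=True,    # also download resources referenced by image tags
    )
    print(f"Saved to {output_dir}; started at {start_url}; {errors} error(s)")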

static domain_from_url(url)

Return just the domain from a URL, or None if there is no domain.

Without a scheme, the URL is just considered a path and no domain will be found!

Parameters:

url (str) – The URL to extract a domain from

Raises:

ValueError – Typically only raised when a non-string is given or urllib.parse.urlparse() otherwise fails

Return type:

str | None

Returns:

The URL’s domain, stripped of any username, password, and port
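
Illustrative calls; the exact return values are inferred from the description above rather than taken from the source:

    from babyspider.crawler import Crawler

    Crawler.domain_from_url("https://user:pw@example.com:8080/a")  # 'example.com'
    Crawler.domain_from_url("example.com/a")  # None: no scheme, so treated as a path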

download_resource(url, output_dir)

Download the URL to the appropriate path in the output directory, creating subdirectories as needed. Return the path to the created file and its canonical source URL. Skip downloading if the file already exists.

Parameters:
  • url (str) – The URL of the resources to download

  • output_dir (str | Path) – The output directory to start at when determining where to place the resource locally. The full path will include any subdirectories from the URL path. For example, if url is http://example.com/some/file.html and the output_dir is /tmp, the file will be saved as /tmp/some/file.html.

Raises:
  • IOError – If the URL could not be normalized by get_canonical_url(), or if there is a problem creating intermediate directories or writing the file to disk

  • ValueError – If the URL cannot be stripped, is on a different domain after redirects, or already exists locally but as a directory instead of a file

  • RequestException – If there is a network issue downloading the file

Return type:

tuple[Path, str]

Returns:

A tuple of the path where the file was downloaded and the final URL it was pulled from
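
A sketch mirroring the path example given for output_dir. The read_robots_txt() call is included because URL normalization via get_canonical_url() depends on it (see below):

    from babyspider.crawler import Crawler

    crawler = Crawler()
    crawler.read_robots_txt("http://example.com/robots.txt")
    path, final_url = crawler.download_resource(
        "http://example.com/some/file.html", "/tmp"
    )
    # path should be Path('/tmp/some/file.html'); final_url is the URL
    # actually downloaded after following any redirects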

download_sitemap_urls(sitemap_url, output_dir)

Download all the URLs from the given sitemap URL.

Parameters:
  • sitemap_url (str) – The URL of the sitemap to download and read; the URLs it contains are then downloaded

  • output_dir (str | Path) – The parent directory where downloaded sitemap URLs will be saved. Passed to download_resource()

Return type:

int

Returns:

The number of errors that occurred. This may be one for every URL in the sitemap if they all failed, or just one if downloading the sitemap itself failed
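
A short usage sketch; the URLs are illustrative, and the robots.txt setup is assumed to be needed because downloads go through download_resource():

    from babyspider.crawler import Crawler

    crawler = Crawler()
    crawler.read_robots_txt("https://example.com/robots.txt")
    errors = crawler.download_sitemap_urls(
        "https://example.com/sitemap.xml", "/tmp/example.com"
    )
    if errors:
        print(f"{errors} download(s) failed")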

get_canonical_url(url)

Get the final URL location following any redirects.

This requires that the internal robots.txt parser be configured. See read_robots_txt().

Parameters:

url (str) – The URL to resolve

Raises:

requests.RequestException – Raised if any URL is forbidden by the robots.txt or if there are more than max_redirects redirects

Return type:

str

Returns:

The canonicalized URL after following any redirects
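
Since the internal robots.txt parser must be configured first, a minimal sequence looks like this; the redirect outcome in the comment is illustrative:

    from babyspider.crawler import Crawler

    crawler = Crawler()
    crawler.read_robots_txt("https://example.com/robots.txt")
    canonical = crawler.get_canonical_url("http://example.com")
    # e.g. 'https://example.com/' after an HTTP-to-HTTPS redirect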

get_local_path(url)

Determine a relative local path and file name for a given URL.

Parameters:

url (str) – The URL to determine a local file path from

Raises:

ValueError – Typically only raised when a non-string is given or urllib.parse.urlparse() otherwise fails

Return type:

Path

Returns:

A relative local path based on the URL, with the default self.index_filename appended if the path from the URL is a directory.
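
Illustrative calls; the expected paths are inferred from the index_filename behavior described above:

    from babyspider.crawler import Crawler

    crawler = Crawler()
    crawler.get_local_path("https://example.com/docs/")        # Path('docs/index.html')
    crawler.get_local_path("https://example.com/docs/a.html")  # Path('docs/a.html')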

get_new_domain_urls(file_path, base_url, images=False, known_urls=None)

Scan the given file and, if it is parsable as HTML, extract and canonicalize any URLs for other resources on the base URL’s domain that are not already known.

Parameters:
  • file_path (Path) – The path to the file to scan for URLs

  • base_url (str) – The original URL the file was downloaded from. Used to convert relative links in the file into complete URLs and to determine if absolute URLs are on the same domain and can be included

  • images (bool) – If True, also extract URLs from image tags

  • known_urls (set[str] | None) – A set of URLs to exclude from the returned result

Raises:

IOError – If there are problems reading the file

Return type:

set[str]

Returns:

The set of new URLs found in the file. This will be empty if the file could not be parsed as HTML
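
A sketch scanning a previously downloaded page; the paths and URLs are illustrative:

    from pathlib import Path

    from babyspider.crawler import Crawler

    crawler = Crawler()
    new_urls = crawler.get_new_domain_urls(
        Path("/tmp/example.com/index.html"),
        "https://example.com/",               # resolves relative links, filters other domains
        images=True,
        known_urls={"https://example.com/"},  # already downloaded, so excluded
    )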

get_urls_from_sitemap(sitemap)

Return all the URLs from a sitemap in a set.

Parameters:

sitemap (bytes) – The raw bytes of the contents of a sitemap to extract URLs from.

Return type:

set[str]

Returns:

The set of URLs extracted
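
Since the method takes raw bytes, the sitemap can come from disk or straight from a response body; the file path here is illustrative:

    from pathlib import Path

    from babyspider.crawler import Crawler

    crawler = Crawler()
    raw = Path("/tmp/example.com/sitemap.xml").read_bytes()
    urls = crawler.get_urls_from_sitemap(raw)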

http_timeout = 300

index_filename = 'index.html'

max_redirects = 10

read_robots_txt(robots_url)

Read the robots.txt file into a new internal parser for use.

This is required before using other methods such as get_canonical_url().

Parameters:

robots_url (str) – The URL of the robots.txt file to read

Raises:

URLError – Usually raised only if something goes wrong with an SSL handshake while downloading

Return type:

None

static strip_url(url)

Return a URL without any params, query, or fragment components.

Parameters:

url (str) – The URL to strip

Raises:

ValueError – Typically only raised when a non-string is given or urllib.parse.urlparse() otherwise fails

Return type:

str

Returns:

The URL stripped of params, query, and fragment, if any
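
An illustrative call; the expected value follows from the description:

    from babyspider.crawler import Crawler

    Crawler.strip_url("https://example.com/a?page=2#top")  # 'https://example.com/a'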