babyspider package¶
Module contents¶
babyspider - A single-domain web crawler example project
- babyspider.build_parser()¶
Build and return a command line argument parser for the babyspider webcrawler.
- Return type:
ArgumentParser
- Returns:
The configured babyspider argument parser
- babyspider.run(args=None)¶
Run the babyspider web crawler as a command line tool with the given arguments.
- Parameters:
args (Optional[Sequence[str]]) – A sequence of command line argument strings. Defaults to sys.argv
- Return type:
int
- Returns:
Either 0 if no errors or 1 if errors occurred
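For example, a crawl can be started programmatically rather than from the shell. The snippet below is a minimal sketch; the exact flags accepted by the parser are defined by build_parser() and are not listed here, so the argument list passed to run() is an assumption.

    import sys
    import babyspider

    # Inspect the available options (the flags are defined by build_parser()).
    parser = babyspider.build_parser()
    parser.print_help()

    # Run the crawler with explicit arguments instead of sys.argv; passing the
    # start URL as a lone positional argument is an assumption.
    sys.exit(babyspider.run(["https://example.com"]))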
Submodules¶
babyspider.crawler module¶
babyspider.crawler - Functions to crawl a web domain and download resources
- class babyspider.crawler.Crawler(base_output_dir=PosixPath('.'), parser='html.parser', crawl_delay=3, use_sitemaps='first')¶
Bases: object
A simple web crawler that limits itself to a single domain.
See crawl() to begin crawling a site.
- base_output_dir¶
The base directory where the download directory will be created. Defaults to pathlib.Path('.')
- parser¶
A valid BeautifulSoup HTML parser, typically one of ‘html.parser’, ‘lxml’, or ‘html5lib’. Defaults to DEFAULT_HTML_PARSER
- crawl_delay¶
The number of seconds to pause between download requests. Defaults to DEFAULT_CRAWL_DELAY. Minimum value of 0
- use_sitemaps¶
When sitemaps should be used, one of ‘first’, ‘never’, or ‘only’. Defaults to ‘first’
- max_redirects¶
The maximum number of redirects to follow when resolving URLs before downloading. Defaults to 10. A minimum of 5 is recommended by RFC 1945
- http_timeout¶
Number of seconds to wait before giving up on any request. Defaults to 300
- chunk_size¶
Size of chunks used when streaming downloads. Defaults to 8192
- index_filename¶
The name assigned when downloading directory indexes. Defaults to ‘index.html’
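A minimal construction sketch, based only on the constructor signature and attribute descriptions above; the output path is a placeholder:

    from pathlib import Path
    from babyspider.crawler import Crawler

    crawler = Crawler(
        base_output_dir=Path("/tmp/crawls"),  # where output directories are created
        parser="html.parser",                 # BeautifulSoup parser to use
        crawl_delay=5,                        # seconds to pause between requests
        use_sitemaps="never",                 # skip sitemap discovery
    )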
- static check_parser(parser)¶
Raise a ValueError if the given parser is not available in BeautifulSoup.
- Parameters:
parser (str) – The parser string to validate
- Raises:
ValueError – If the requested parser is not available
- Return type:
None
- chunk_size = 8192¶
- crawl(url_or_domain, output_dir_name=None, images=False, add_timestamp=None)¶
Download the resource at the given URL and, if it is HTML, do the same for all linked resources on the same domain. Returns the full output path, the canonicalized starting URL, and the number of errors.
- Parameters:
url_or_domain (str) – The URL or domain to start the web crawl from
output_dir_name (str | None) – The output directory to create under the base output directory, where found resources will be downloaded. Defaults to the domain name of the starting URL
images (bool) – If True, scan image tags for URLs to download, not just links
add_timestamp (bool | None) – If True, add a timestamp to the end of the output directory name. Defaults to False except when no output directory name is given
- Raises:
ValueError – If the starting URL is forbidden by the site’s robots.txt file, has too many redirects, or if use_sitemaps is set to ‘only’ but there are no sitemaps
- Return type:
tuple[str, str, int]
- Returns:
The name of the full output directory, the canonical starting URL, and the total number of errors during the process
This is a wrapper around self.__crawl() that validates and determines the starting URL, resets internal found-URL tracking, and sets up the robots.txt parser before starting the crawl.
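A sketch of a basic crawl using the defaults documented above; example.com stands in for a real site:

    from babyspider.crawler import Crawler

    crawler = Crawler(crawl_delay=1)
    output_dir, start_url, errors = crawler.crawl(
        "https://example.com",  # starting URL or domain
        images=True,            # also download images referenced by <img> tags
    )
    print(f"Saved {start_url} and linked resources to {output_dir} with {errors} error(s)")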
- static domain_from_url(url)¶
Return just the domain from a URL or None if there is no domain
Without a scheme, the URL is just considered a path and no domain will be found!
- Parameters:
url (str) – The URL to extract a domain from
- Raises:
ValueError – Typically only raised when a non-string is given or urllib.parse.urlparse() otherwise fails
- Return type:
str | None
- Returns:
The URL’s domain, stripped of any username, password, and port
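For illustration, the expected values below follow the description above (credentials and port stripped; a scheme-less URL is treated as a path), not verified output:

    >>> from babyspider.crawler import Crawler
    >>> Crawler.domain_from_url("https://user:pass@example.com:8080/page")
    'example.com'
    >>> Crawler.domain_from_url("just/a/path") is None
    True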
- download_resource(url, output_dir)¶
Download the URL to the appropriate path in the output directory, creating subdirectories as needed. Return the path to the created file and its canonical source URL. Skip downloading if the file already exists.
- Parameters:
url (str) – The URL of the resource to download
output_dir (str | Path) – The output directory to start at when determining where to place the resource locally. The full path will include any subdirectories from the URL path. For example, if url is http://example.com/some/file.html and output_dir is /tmp, the file will be saved as /tmp/some/file.html.
- Raises:
IOError – If the URL could not be normalized by get_canonical_url() or there is a problem creating intermediate directories or writing the file to disk
ValueError – If the URL cannot be stripped, is on a different domain after redirects, or already exists locally but as a directory instead of a file
RequestException – If there is a network issue downloading the file
- Return type:
tuple[Path, str]
- Returns:
A tuple of the path where the file was downloaded and the final URL it was pulled from
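A sketch of downloading a single resource; because get_canonical_url() is used internally, the robots.txt parser is read first (see read_robots_txt()). Paths and URLs are placeholders:

    from babyspider.crawler import Crawler

    crawler = Crawler()
    crawler.read_robots_txt("https://example.com/robots.txt")
    local_path, final_url = crawler.download_resource(
        "http://example.com/some/file.html", "/tmp"
    )
    # Per the example above, local_path would be /tmp/some/file.html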
- download_sitemap_urls(sitemap_url, output_dir)¶
Download all the URLs from the given site map URL.
- Parameters:
sitemap_url (str) – The URL of the sitemap to download and read; the URLs it contains are then downloaded
output_dir (str | Path) – The parent directory where downloaded sitemap URLs will be saved. Passed to download_resource()
- Return type:
int
- Returns:
The number of errors that occurred. This may be one for every URL in the sitemap if they all failed, or just one if downloading the sitemap itself failed
- get_canonical_url(url)¶
Get the final URL location following any redirects.
This requires that the internal robots.txt parser be configured. See read_robots_txt().
- Parameters:
url (str) – The URL to resolve
- Raises:
requests.RequestException – Raised if any URL is forbidden by robots.txt or if there are more than self.max_redirects redirects
- Return type:
str
- Returns:
The canonicalized URL after following any redirects
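A short sketch, assuming robots.txt has already been read as required above; the URLs are placeholders:

    from babyspider.crawler import Crawler

    crawler = Crawler()
    crawler.read_robots_txt("https://example.com/robots.txt")
    canonical = crawler.get_canonical_url("http://example.com/old-page")
    # canonical is the URL the server ultimately redirects to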
- get_local_path(url)¶
Determine a relative local path and file name for a given URL.
- Parameters:
url (str) – The URL to determine a local file path from
- Raises:
ValueError – Typically only raised when a non-string is given or urllib.parse.urlparse() otherwise fails
- Return type:
Path
- Returns:
A relative local path based on the URL, with the default self.index_filename appended if the path from the URL is a directory
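The values below are illustrative only, inferred from the description above (index_filename is appended when the URL path is a directory); the exact return values are assumptions:

    >>> from babyspider.crawler import Crawler
    >>> crawler = Crawler()
    >>> crawler.get_local_path("https://example.com/docs/guide.html")
    PosixPath('docs/guide.html')
    >>> crawler.get_local_path("https://example.com/docs/")
    PosixPath('docs/index.html')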
- get_new_domain_urls(file_path, base_url, images=False, known_urls=None)¶
Scan the given file and, if it is parsable as HTML, extract and canonicalize any URLs to other resources on the base URL’s domain that are not already known.
- Parameters:
file_path (Path) – The path to the file to scan for URLs
base_url (str) – The original URL the file was downloaded from. Used to convert relative links in the file into complete URLs and to determine whether absolute URLs are on the same domain and can be included
images (bool) – If True, also extract URLs from image tags
known_urls (set[str] | None) – A set of URLs to exclude from the returned result
- Raises:
IOError – If there are problems reading the file
- Return type:
set[str]
- Returns:
The set of new URLs found in the file. This will be empty if the file could not be parsed as HTML
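A sketch of rescanning a previously downloaded page for new same-domain links; the file path is hypothetical, and reading robots.txt first is assumed to be needed so that found URLs can be canonicalized:

    from pathlib import Path
    from babyspider.crawler import Crawler

    crawler = Crawler()
    crawler.read_robots_txt("https://example.com/robots.txt")
    new_urls = crawler.get_new_domain_urls(
        Path("/tmp/example.com/index.html"),  # file previously saved by download_resource()
        "https://example.com/",               # URL the file was downloaded from
        images=True,                          # also pull URLs from <img> tags
        known_urls={"https://example.com/"},  # already-seen URLs to exclude
    )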
- get_urls_from_sitemap(sitemap)¶
Return all the URLs from a site map in a set.
- Parameters:
sitemap (bytes) – The raw bytes of the contents of a sitemap to extract URLs from
- Return type:
set[str]
- Returns:
The set of URLs extracted
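A sketch that fetches a sitemap with requests and extracts its URLs; the sitemap URL is a placeholder, and the method takes the raw sitemap bytes rather than a URL:

    import requests
    from babyspider.crawler import Crawler

    crawler = Crawler()
    response = requests.get("https://example.com/sitemap.xml", timeout=30)
    urls = crawler.get_urls_from_sitemap(response.content)  # raw bytes in, set of URLs out
    print(f"{len(urls)} URLs found in the sitemap")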
- http_timeout = 300¶
- index_filename = 'index.html'¶
- max_redirects = 10¶
- read_robots_txt(robots_url)¶
Read the robots.txt file into a new internal parser for use.
This is required before using other methods such as get_canonical_url().
- Parameters:
robots_url (str) – The URL of the robots.txt file to read
- Raises:
URLError – Usually only raised if something goes wrong with an SSL handshake while downloading
- Return type:
None
- static strip_url(url)¶
Return a URL without any params, query, or fragment components.
- Parameters:
url (str) – The URL to strip
- Raises:
ValueError – Typically only raised when a non-string is given or urllib.parse.urlparse() otherwise fails
- Return type:
str
- Returns:
The URL stripped of params, query, and fragment, if any
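An illustrative value, following the description above (params, query, and fragment removed); not verified output:

    >>> from babyspider.crawler import Crawler
    >>> Crawler.strip_url("https://example.com/page?q=1#top")
    'https://example.com/page'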