buildxml.xmlgetter.spider.IndexSpider

The most important method is getNextPage; its documentation describes in detail how a page is retrieved.

__init__(self, source_name, start_url, depth=2, content_selectors=[], mime_types=set(), etags=set(), md5hashes=set(), hashes=set(), headers={u'User-Agent': USER_AGENT}, last_update=LAST_QUERY_DEFAULT)
(Constructor)

Initialize the IndexSpider and its variables.

Parameters:

source_name (string) - The source's name.
start_url (string) - The URL from which to start the crawling.
depth (int) - The maximum depth to which the spider should follow links into the site.
content_selectors (list) - A list of content selectors, which are used to get the relevant content region from the page.
mime_types (set) - A set of acceptable MIME types.
etags (set) - A set of ETags.
md5hashes (set) - A set of MD5 hashes.
hashes (set) - A set of SHA-256 hashes.
headers (dict) - A dictionary of header fields to send to the server.
last_update (datetime) - The date of the last run.

Overrides: log.BaseLogger.__init__
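The signature above uses mutable defaults (a list, several sets and a dict). Whether the constructor copies them is not stated here, so a cautious caller may prefer to pass fresh containers for every spider. The following is a minimal sketch of that idea, written for Python 3; spider_kwargs is a hypothetical helper, not part of the module, and it only builds the keyword arguments.

```python
def spider_kwargs(content_selectors=None, mime_types=None, etags=None,
                  md5hashes=None, hashes=None):
    """Hypothetical helper: build fresh keyword arguments for one spider.

    Every call returns new containers, so two spiders never share state
    through a default argument.
    """
    return {
        "content_selectors": list(content_selectors or []),
        "mime_types": set(mime_types or ()),
        "etags": set(etags or ()),
        "md5hashes": set(md5hashes or ()),
        "hashes": set(hashes or ()),
    }


first = spider_kwargs(mime_types={"text/html"})
second = spider_kwargs()
first["etags"].add('"abc"')  # does not leak into `second`
```

The resulting dictionary could then be passed on as IndexSpider(source_name, start_url, **first), assuming the keyword names shown in the signature.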

getNextPage(self)

Get the next page.

Pops the first URL from the URLStack, retrieves the page and filters it for content. All links in the content that do not leave the site are pushed onto the stack. Finally, the method returns a five-tuple containing url, soup, content, head and data.
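The pop, filter and push cycle can be sketched as follows. This is an illustration in Python 3 under stated assumptions, not the spider's actual code: same_site_links and crawl_order are hypothetical names, "does not leave the site" is taken to mean "same host", the depth rule is a guess at how the depth parameter applies, and link_table stands in for real page retrieval.

```python
from urllib.parse import urljoin, urlparse


def same_site_links(page_url, hrefs):
    """Resolve hrefs against page_url and keep those on the same host."""
    host = urlparse(page_url).netloc
    kept = []
    for href in hrefs:
        absolute = urljoin(page_url, href)
        parts = urlparse(absolute)
        if parts.scheme in ("http", "https") and parts.netloc == host:
            kept.append(absolute)
    return kept


def crawl_order(start_url, link_table, max_depth=2):
    """Return the order in which URLs are visited.

    link_table maps a URL to the hrefs found on that page; it replaces
    network access so the sketch stays self-contained.
    """
    stack = [(start_url, 0)]
    seen = {start_url}
    visited = []
    while stack:
        url, depth = stack.pop(0)  # take the first URL from the stack
        visited.append(url)
        if depth >= max_depth:
            continue  # assumed rule: do not follow links beyond max_depth
        for link in same_site_links(url, link_table.get(url, [])):
            if link not in seen:
                seen.add(link)
                stack.append((link, depth + 1))
    return visited
```

Relative links are resolved with urljoin before the host comparison, so "/a" and "b" on a page count as on-site while mailto: links and other hosts are dropped.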

soup is an instance of BeautifulSoup.BeautifulSoup. content is one as well, but covers only the part of the page that is selected through the _content_selectors variable. head contains the HTTP headers of the server's response, and data contains the raw data of the response.

If the server redirects, the spider follows and tries to retrieve the redirection's target. It does so only once, to avoid loops.
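A single-hop redirect policy of this kind might look like the sketch below (Python 3). The fetch callable, the set of status codes and the function name are assumptions made for illustration; the spider's own HTTP layer is not shown in this documentation.

```python
from urllib.parse import urljoin

# Assumed set of redirect status codes.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def fetch_following_one_redirect(url, fetch):
    """Fetch url and follow at most one redirect.

    fetch is a hypothetical callable: fetch(url) -> (status, headers, body).
    Returns (final_url, status, headers, body). If the redirect target
    redirects again, that second redirect response is returned as-is.
    """
    status, headers, body = fetch(url)
    if status in REDIRECT_CODES and "Location" in headers:
        target = urljoin(url, headers["Location"])  # Location may be relative
        status, headers, body = fetch(target)
        url = target
    return url, status, headers, body
```

Because the second response is never inspected for a further Location header, two pages that redirect to each other cost exactly two requests.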

The function returns False if the MIME type of the retrieved page is not in _mime_types, if the content's hash value is in _hashes or _md5hashes, or if the ETag is in _etags. It also returns False if the value of the Last-Modified header is earlier than _last_update.
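The hash comparison could be sketched as below (Python 3). It assumes that the stored values are hexadecimal digests computed over the raw response bytes; the documentation does not say which encoding the spider uses, and is_known_content is a hypothetical name.

```python
import hashlib


def is_known_content(data, md5hashes, sha256hashes):
    """Return True if the raw bytes match a stored MD5 or SHA-256 digest.

    Assumption: both sets hold lowercase hex digests of the raw data.
    """
    md5_digest = hashlib.md5(data).hexdigest()
    sha256_digest = hashlib.sha256(data).hexdigest()
    return md5_digest in md5hashes or sha256_digest in sha256hashes
```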

The checks for the MIME type, the ETag and the Last-Modified header are performed on the result of an HTTP HEAD request.
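A header-only decision of this kind might be written as follows (Python 3). The sketch makes several assumptions that this documentation does not confirm: head is a plain dict with canonical header names, an empty mime_types set accepts every type, ETags are compared as exact strings, and last_update is a timezone-aware datetime. should_skip is a hypothetical name.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def should_skip(head, mime_types, etags, last_update):
    """Decide from HEAD response headers whether a page can be skipped."""
    content_type = head.get("Content-Type", "")
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime_types and mime not in mime_types:
        return True  # unwanted MIME type
    etag = head.get("ETag")
    if etag is not None and etag in etags:
        return True  # ETag already seen
    last_modified = head.get("Last-Modified")
    if last_modified:
        try:
            modified = parsedate_to_datetime(last_modified)
        except (TypeError, ValueError):
            return False  # unparseable date: fetch the page to be safe
        if modified < last_update:
            return True  # not modified since the last run
    return False


example_head = {
    "Content-Type": "text/html; charset=utf-8",
    "ETag": '"abc"',
    "Last-Modified": "Wed, 21 Oct 2015 07:28:00 GMT",
}
cutoff = datetime(2015, 1, 1, tzinfo=timezone.utc)
print(should_skip(example_head, {"text/html"}, set(), cutoff))  # prints False
```

Deciding on the HEAD response means the body is only downloaded for pages that pass all three checks; the hash comparison necessarily happens later, once the body is available.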

Returns: tuple or bool: Five-tuple or False, see above.
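Because the return value is either a five-tuple or False, callers need to test for False before unpacking. The loop below is a sketch in Python 3; StubSpider is a hypothetical stand-in that only mimics the two public method names listed for IndexSpider, and the meaning of hasMorePages (True while URLs remain) is an assumption.

```python
class StubSpider:
    """Hypothetical stand-in that replays prepared getNextPage results."""

    def __init__(self, results):
        self._results = list(results)

    def hasMorePages(self):
        return bool(self._results)

    def getNextPage(self):
        return self._results.pop(0)


def collect_pages(spider):
    """Drain the spider and return (url, size of raw data) per accepted page."""
    pages = []
    while spider.hasMorePages():
        result = spider.getNextPage()
        if result is False:
            continue  # page was filtered out; move on to the next URL
        url, soup, content, head, data = result
        pages.append((url, len(data)))
    return pages
```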

_getPage(self, url)

Get a page as a BeautifulSoup.BeautifulSoup instance.

Returns: tuple or False: A four-tuple containing the page wrapped as a BeautifulSoup instance, the content also as a BeautifulSoup instance, the headers of the response, and the raw data.

Class IndexSpider

__init__(self, source_name, start_url, depth=2, content_selectors=[], mime_types=set(), etags=set(), md5hashes=set(), hashes=set(), headers={u'User-Agent': USER_AGENT}, last_update=LAST_QUERY_DEFAULT)
(Constructor)

hasMorePages(self)

getNextPage(self)

_getHead(self, url)

_getPage(self, url)
