Package buildxml :: Package xmlgetter :: Module spider :: Class IndexSpider

Class IndexSpider


A minimalistic web crawler.

The most important method is getNextPage. See its documentation for a detailed description of how a page is retrieved.

Instance Methods
 
__init__(self, source_name, start_url, depth=2, content_selectors=[], mime_types=set(), etags=set(), md5hashes=set(), hashes=set(), headers={u'User-Agent': USER_AGENT}, last_update=LAST_QUERY_DEFAULT)
Initialize the IndexSpider and its variables.
bool hasMorePages(self)
Are there still pages to crawl?
tuple or bool getNextPage(self)
Get the next page.
HTTPResponse or False _getHead(self, url)
Issues a HEAD request to the given URL.
tuple or False _getPage(self, url)
Get a page as a BeautifulSoup.BeautifulSoup instance.

Inherited from request.BaseRequester (private): _requestURL

Inherited from log.BaseLogger: logger

Inherited from log.BaseLogger (private): _getLogger

Class Variables

Inherited from log.BaseLogger (private): _loggers

Instance Variables
URLStack _urls = None
The URL stack.
string _start_url = None
The starting point for the crawl.
URLStack _base_url = None
The base of the start URL.
datetime _last_update = None
The date of the last run.
list _mime_types = None
A whitelist of MIME types to be returned.
set _etags = None
A set of ETags for pages.
set _md5hashes = None
A set of MD5 hashes of the pages' content.
list _content_selectors = None
A list of content selectors.
dict _headers = None
Dictionary of headers to send to the server.
set _hashes = None
A set of SHA-256 hashes of the pages' content.

Inherited from log.BaseLogger (private): _source_name

Method Details

__init__(self, source_name, start_url, depth=2, content_selectors=[], mime_types=set(), etags=set(), md5hashes=set(), hashes=set(), headers={u'User-Agent': USER_AGENT}, last_update=LAST_QUERY_DEFAULT)
(Constructor)

Initialize the IndexSpider and its variables.

Parameters:
  • source_name (string) - The source's name.
  • start_url (string) - The URL from which to start the crawling.
  • depth (int) - The maximum depth to which the spider should follow links into the page.
  • content_selectors (list) - A list of content selectors, which are used to get the relevant content region from the page.
  • mime_types (set) - A set of acceptable MIME types.
  • etags (set) - A set of ETags.
  • md5hashes (set) - A set of MD5 hashes.
  • hashes (set) - A set of SHA-256 hashes.
  • headers (dict) - A dictionary of header fields to send to the server.
  • last_update (datetime) - The date of the last run.
Overrides: log.BaseLogger.__init__

hasMorePages(self)

Are there still pages to crawl?

Returns: bool
Returns True if the spider has more pages to crawl, False otherwise.
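
A typical caller alternates hasMorePages and getNextPage until the stack is empty. The following is a minimal sketch of such a loop, written for Python 3. StubSpider and crawl are hypothetical names, and StubSpider is only a stand-in that replays canned results, since the real IndexSpider needs network access.

```python
class StubSpider:
    """Hypothetical stand-in exposing the two public IndexSpider methods."""

    def __init__(self, results):
        # Each entry is a five-tuple or False, as documented for getNextPage.
        self._results = list(results)

    def hasMorePages(self):
        return bool(self._results)

    def getNextPage(self):
        return self._results.pop(0)


def crawl(spider):
    """Drain the spider and collect the URLs of pages that were not skipped."""
    fetched = []
    while spider.hasMorePages():
        page = spider.getNextPage()
        if page is False:
            # The page was filtered out or could not be retrieved.
            continue
        url, soup, content, head, data = page
        fetched.append(url)
    return fetched


spider = StubSpider([
    ("http://example.org/", "<soup>", "<content>", {}, b"raw"),
    False,
    ("http://example.org/a", "<soup>", "<content>", {}, b"raw"),
])
print(crawl(spider))
```

Because getNextPage may return False while pages remain on the stack, the loop tests the result with `is False` rather than stopping at the first skipped page.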

getNextPage(self)

Get the next page.

Pops the first URL from the URLStack, retrieves the page, and filters it for content. All links in the content that do not leave the site are pushed onto the stack. Finally, returns a five-tuple containing url, soup, content, head and data.

soup is an instance of BeautifulSoup.BeautifulSoup. content is also a BeautifulSoup instance, but covers only the part of the page selected through the _content_selectors variable. head contains the HTTP headers of the server's response, and data contains the raw data of the response.
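
The link-following step can be sketched as below, in Python 3. This is an illustration only: same_site_links is a hypothetical helper, a plain list stands in for URLStack, and "does not leave the site" is assumed to mean "same host as the base URL", which may differ from the rule the real URLStack applies.

```python
from urllib.parse import urljoin, urlparse


def same_site_links(page_url, hrefs, base_url):
    """Resolve hrefs against page_url and keep those on the host of base_url."""
    base = urlparse(base_url)
    kept = []
    for href in hrefs:
        absolute = urljoin(page_url, href)
        target = urlparse(absolute)
        # Assumption: a link stays on the site if it is HTTP(S) on the same host.
        if target.scheme in ("http", "https") and target.netloc == base.netloc:
            kept.append(absolute)
    return kept


stack = ["http://example.org/"]  # plain list standing in for URLStack
url = stack.pop(0)               # take the first URL
links = same_site_links(
    url,
    ["/about", "news/today.html", "http://other.example.com/", "mailto:me@example.org"],
    base_url="http://example.org/",
)
stack.extend(links)              # queue the on-site links for later visits
print(stack)
```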

If the server redirects, the spider follows the redirect and tries to retrieve its target. It does so only once, to avoid loops.
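
The single-redirect rule can be sketched as follows, in Python 3. The names fetch_with_single_redirect, fetch and REDIRECT_CODES are hypothetical, and returning False on a second redirect is an assumption; the documentation does not say what the spider does in that case.

```python
from urllib.parse import urljoin

REDIRECT_CODES = {301, 302, 303, 307, 308}


def fetch_with_single_redirect(url, fetch):
    """Fetch url, following at most one redirect.

    fetch is a hypothetical callable returning (status, headers, body).
    Returns (final_url, headers, body), or False if the target redirects again.
    """
    status, headers, body = fetch(url)
    if status in REDIRECT_CODES and "Location" in headers:
        # Location may be relative, so resolve it against the requested URL.
        url = urljoin(url, headers["Location"])
        status, headers, body = fetch(url)
        if status in REDIRECT_CODES:
            # Never follow a second hop; this is what prevents redirect loops.
            return False
    return url, headers, body


# Canned responses standing in for a real HTTP client.
responses = {
    "http://example.org/old": (301, {"Location": "/new"}, b""),
    "http://example.org/new": (200, {}, b"hello"),
    "http://example.org/loop": (302, {"Location": "/loop"}, b""),
}
fetch = responses.__getitem__
print(fetch_with_single_redirect("http://example.org/old", fetch))
print(fetch_with_single_redirect("http://example.org/loop", fetch))
```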

The function returns False if the retrieved page's MIME type is not in _mime_types, if the content's hash value is in _hashes or _md5hashes, or if the ETag is in _etags. It also returns False if the value of the Last-Modified header is earlier than _last_update.

The checks for the correct MIME type, the ETag and the Last-Modified header are performed on the result of an HTTP HEAD request.
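
The filtering rules can be sketched as two small Python 3 functions. skip_by_head and skip_by_body are hypothetical names, not part of the class. The sketch assumes that headers behaves like a dict, that last_update is a timezone-aware datetime, and that the hash sets hold hex digests of the raw response bytes; none of these details is confirmed by the documentation.

```python
import hashlib
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def skip_by_head(headers, mime_types, etags, last_update):
    """Return True if the HEAD response headers say the page should be skipped."""
    # Drop parameters such as "; charset=utf-8" before comparing MIME types.
    mime = headers.get("Content-Type", "").split(";", 1)[0].strip().lower()
    if mime not in mime_types:
        return True
    if headers.get("ETag") in etags:
        return True
    last_modified = headers.get("Last-Modified")
    if last_modified is not None:
        try:
            if parsedate_to_datetime(last_modified) < last_update:
                return True
        except (TypeError, ValueError):
            # Unparseable date: assume the page may have changed.
            pass
    return False


def skip_by_body(data, md5hashes, hashes):
    """Return True if the body was seen before (assumes hex digests of raw bytes)."""
    return (hashlib.md5(data).hexdigest() in md5hashes
            or hashlib.sha256(data).hexdigest() in hashes)


last_run = datetime(2020, 1, 1, tzinfo=timezone.utc)
head = {"Content-Type": "text/html; charset=utf-8",
        "Last-Modified": "Wed, 21 Oct 2015 07:28:00 GMT"}
print(skip_by_head(head, {"text/html"}, set(), last_run))
```

Doing the header checks first means a page that is unchanged or of the wrong type can be skipped without downloading its body.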

Returns: tuple or bool
Five-tuple or False, see above.

_getHead(self, url)

Issues a HEAD request to the given URL.

Returns: HTTPResponse or False
The response to the HEAD request, or False if an error occurred.
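
A HEAD request with the same "response or False" contract can be sketched with the Python 3 standard library. get_head is a hypothetical function, not the actual _getHead, which goes through the inherited _requestURL method and may behave differently.

```python
import urllib.error
import urllib.request


def get_head(url, headers=None, timeout=10):
    """Issue a HEAD request; return the response object, or False on any error."""
    try:
        request = urllib.request.Request(url, headers=headers or {}, method="HEAD")
        return urllib.request.urlopen(request, timeout=timeout)
    except (urllib.error.URLError, ValueError, OSError):
        # URLError covers HTTP and connection errors; ValueError covers malformed URLs.
        return False


# A malformed URL fails before any network traffic happens.
print(get_head("not-a-valid-url"))
```

On success the caller is responsible for closing the returned response.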

_getPage(self, url)

Get a page as a BeautifulSoup.BeautifulSoup instance.

Returns: tuple or False
A four-tuple containing the page wrapped as a BeautifulSoup instance, the content also as a BeautifulSoup instance, the headers of the response, and the raw data.