Home | Trees | Indices | Help |
---|
|
A minimalistic web crawler.
The most important function is getNextPage; its documentation describes in detail how a page is retrieved.
|
Return type |
---|
bool |
tuple or bool |
HTTPResponse or False |
tuple or False |
|
Type | Name | Default | Description |
---|---|---|---|
URLStack | _urls | None | The URL stack. |
string | _start_url | None | The starting point for the crawl. |
URLStack | _base_url | None | The base of the start URL. |
datetime | _last_update | None | The date of the last run. |
list | _mime_types | None | A whitelist of MIME types to be returned. |
set | _etags | None | A set of ETags for pages. |
set | _md5hashes | None | A set of MD5 hashes of the pages' content. |
list | _content_selectors | None | A list of content selectors. |
dict | _headers | None | Dictionary of headers to send to the server. |
set | _hashes | None | A set of SHA-256 hashes of the pages' content. |
|
Initialize the
|
Are there still pages to crawl?
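Whether pages remain comes down to whether the URL stack is empty. The following is a minimal sketch of such a stack, not the library's actual URLStack: the class name SimpleURLStack, its methods, and the rule that a URL is queued only once are all assumptions made for illustration.

```python
from collections import deque


class SimpleURLStack:
    """Hypothetical stand-in for a URL stack; not the real URLStack API."""

    def __init__(self, start_url):
        self._pending = deque([start_url])  # URLs still to be crawled
        self._seen = {start_url}            # every URL ever pushed

    def push(self, url):
        # Assumption: a URL that was queued before is not queued again.
        if url not in self._seen:
            self._seen.add(url)
            self._pending.append(url)

    def pop(self):
        # Return the first (oldest) pending URL.
        return self._pending.popleft()

    def has_next(self):
        # True while there are still pages to crawl.
        return bool(self._pending)
```

A crawl loop would then call pop() for as long as has_next() returns True.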
|
Get the next page. Pops the first URL from the URLStack, retrieves the page, filters it for content, finds all links in the content that do not leave the site, pushes them onto the stack and finally returns a five-tuple.
If the server redirects, the spider follows the redirect and tries to retrieve its target. It does so only once, to avoid loops.
If the MIME type of the retrieved page is not in _mime_types, if the hash value of its content is in one of the sets _hashes or _md5hashes, or if its ETag is in _etags, the function returns a bool instead of the five-tuple.
The checks of the MIME type, the ETag and the Last-Modified header are performed on the result of an HTTP HEAD request.
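The skip conditions and the same-site link filter described above can be sketched as a few small functions. This is an illustration under stated assumptions, not the crawler's own code: the function names are hypothetical, "does not leave the site" is taken to mean "the resolved URL starts with the base URL", and recording a new hash on first sight is an assumption as well.

```python
import hashlib
from urllib.parse import urldefrag, urljoin


def media_type(content_type):
    """Strip parameters such as '; charset=utf-8' from a Content-Type value."""
    return content_type.split(";", 1)[0].strip().lower()


def should_skip_head(content_type, etag, mime_types, etags):
    """Decide from HEAD response headers whether a page can be skipped."""
    if media_type(content_type) not in mime_types:
        return True  # MIME type is not on the whitelist
    if etag is not None and etag in etags:
        return True  # ETag was seen before
    return False


def is_known_content(body, hashes):
    """Return True if body (bytes) was seen before, else record its hash."""
    digest = hashlib.sha256(body).hexdigest()
    if digest in hashes:
        return True
    hashes.add(digest)  # assumption: new content is remembered right away
    return False


def same_site_links(page_url, hrefs, base_url):
    """Resolve hrefs against page_url and keep those under base_url."""
    links = []
    for href in hrefs:
        # Make the link absolute and drop any '#fragment' part.
        absolute = urldefrag(urljoin(page_url, href)).url
        if absolute.startswith(base_url) and absolute not in links:
            links.append(absolute)
    return links
```

Keeping the header checks separate from the content hash check mirrors the description: the former need only a HEAD response, while the latter requires the page body.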
|
Issues a HEAD request to the given URL.
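A HEAD request that yields either a response object or False might look like the sketch below. It uses Python 3's standard http.client module; the function name head, its parameters, and the choice to map every failure to False are assumptions for illustration and are not taken from the crawler's source.

```python
import http.client
from urllib.parse import urlsplit


def head(url, headers=None, timeout=10.0):
    """Issue an HTTP HEAD request; return the HTTPResponse, or False on failure."""
    parts = urlsplit(url)
    connection_classes = {
        "http": http.client.HTTPConnection,
        "https": http.client.HTTPSConnection,
    }
    if parts.scheme not in connection_classes or not parts.hostname:
        return False  # unsupported scheme or missing host
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    try:
        # parts.port raises ValueError if the port is malformed.
        conn = connection_classes[parts.scheme](
            parts.hostname, parts.port, timeout=timeout
        )
    except ValueError:
        return False
    try:
        conn.request("HEAD", path, headers=headers or {})
        response = conn.getresponse()
        response.read()  # a HEAD response has no body; this completes the exchange
        return response
    except (OSError, http.client.HTTPException):
        return False  # connection, timeout or protocol error
    finally:
        conn.close()
```

The status code and headers such as ETag or Last-Modified remain readable on the returned object through response.status and response.getheader(...) after the connection is closed.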
|
Get a page as a
|
Generated by Epydoc 3.0.1 on Thu Sep 16 13:42:04 2010 | http://epydoc.sourceforge.net |