A minimalistic web crawler.
The most important function is getNextPage. See its documentation for a detailed description of how a page is retrieved.

Method return types, in listing order: bool; tuple or bool; HTTPResponse or False; tuple or False.


| Type | Name | Default | Description |
|---|---|---|---|
| URLStack | _urls | None | The URL stack. |
| string | _start_url | None | The starting point for the crawl. |
| URLStack | _base_url | None | The base of the start URL. |
| datetime | _last_update | None | The date of the last run. |
| list | _mime_types | None | A whitelist of MIME types to be returned. |
| set | _etags | None | A set of ETags for pages. |
| set | _md5hashes | None | A set of MD5 hashes of the pages' content. |
| list | _content_selectors | None | A list of content selectors. |
| dict | _headers | None | Dictionary of headers to send to the server. |
| set | _hashes | None | A set of SHA-256 hashes of the pages' content. |

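
The sets _etags and _hashes suggest a simple form of duplicate detection. The following is a minimal sketch of that idea in Python 3, not the crawler's actual code: the class SeenPages and its method is_duplicate are hypothetical names, and recording unseen values as a side effect is an assumption.

```python
import hashlib


class SeenPages:
    """Hypothetical helper: remembers ETags and content hashes of pages."""

    def __init__(self):
        self.etags = set()   # plays the role of _etags
        self.hashes = set()  # plays the role of _hashes (SHA-256 hex digests)

    def is_duplicate(self, content, etag=None):
        """Return True if the ETag or the content hash was seen before.

        content must be bytes. Unseen values are recorded as a side
        effect (an assumption of this sketch).
        """
        digest = hashlib.sha256(content).hexdigest()
        seen = digest in self.hashes or (
            etag is not None and etag in self.etags
        )
        self.hashes.add(digest)
        if etag is not None:
            self.etags.add(etag)
        return seen
```

A crawler could call is_duplicate once per retrieved page and skip the page when it returns True.
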
Initialize the

Are there still pages to crawl?

Get the next page. Pops the first URL from the URLStack, retrieves the page, filters it for content, finds all links in the content that do not leave the site, pushes them onto the stack, and finally returns a five-tuple.

If the server redirects, the spider follows the redirect and tries to retrieve its target. It does so only once, to avoid loops.

If the MIME type of the retrieved page is not in _mime_types, the hash of its content is in _hashes or _md5hashes, or its ETag is in _etags, the function does not return the five-tuple.

The MIME type, the ETag and the Last-Modified header are checked on the result of an HTTP HEAD request.
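
The pop, retrieve, extract and push cycle described above can be sketched as follows. This is an illustration in Python 3 under stated assumptions, not the documented implementation: crawl_step, internal_links, LinkCollector and the fetch callable are hypothetical names, "does not leave the site" is interpreted as "same host and a path below the base URL", and the sketch returns a two-tuple because the contents of the five-tuple are not given here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href values of all <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def internal_links(page_url, html, base_url):
    """Return absolute links found in html that stay below base_url."""
    collector = LinkCollector()
    collector.feed(html)
    base = urlparse(base_url)
    result = []
    for href in collector.links:
        absolute = urljoin(page_url, href)  # resolve relative links
        target = urlparse(absolute)
        # Assumption: "on the site" means same host, path under the base.
        if target.netloc == base.netloc and target.path.startswith(base.path):
            result.append(absolute)
    return result


def crawl_step(stack, visited, fetch, base_url):
    """Pop one URL, fetch it, and push unseen internal links.

    Returns (url, html), or False when the stack is empty.
    """
    if not stack:
        return False
    url = stack.pop(0)  # take the first URL, as in the description
    visited.add(url)
    html = fetch(url)   # fetch is any callable mapping a URL to HTML text
    for link in internal_links(url, html, base_url):
        if link not in visited and link not in stack:
            stack.append(link)
    return url, html
```

Passing fetch in as a parameter keeps the sketch independent of any HTTP library and makes it easy to try with a dictionary of canned pages.
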

Issues a HEAD request to the given URL.
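
A HEAD request of this kind might look like the sketch below, written with Python 3's urllib rather than whatever HTTP library the crawler itself uses. The function names head and mime_type_allowed are hypothetical, as is the choice to return False on any network error. Note that urllib follows redirects on its own, which differs from the single redirect described for getNextPage.

```python
import urllib.error
import urllib.request


def head(url, headers=None, timeout=10):
    """Issue a HEAD request; return the response object, or False on failure."""
    request = urllib.request.Request(url, headers=headers or {}, method="HEAD")
    try:
        return urllib.request.urlopen(request, timeout=timeout)
    except (urllib.error.URLError, OSError):
        # Assumption: every network or protocol error maps to False.
        return False


def mime_type_allowed(response, mime_types):
    """Check the Content-Type of a response against a whitelist."""
    # get_content_type() strips parameters such as "; charset=utf-8".
    return response.headers.get_content_type() in mime_types
```

The ETag and Last-Modified values, when the server sends them, are available from the same response as response.headers["ETag"] and response.headers["Last-Modified"].
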

Get a page as a

Generated by Epydoc 3.0.1 on Thu Sep 16 13:42:04 2010 | http://epydoc.sourceforge.net