Class SyncPlugin_studentenwerk

Spiders the webpage of the Studentenwerk starting from the sitemap under http://www.studentenwerk.uni-freiburg.de/index.php?id=272 .

Finds all URLs in the content region of the sitemap's HTML and recurses to a level of DEPTH. It currently generates an XMLEntry only for pages with a content type of u'text/html'; indexing of PDF, DOC, PS, XSL, and PPT is not yet supported.
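
The depth-limited spidering described above can be sketched roughly as follows. This is not the IndexSpider implementation: the in-memory FAKE_SITE table, the crawl function, and all URLs in it are hypothetical stand-ins so that the example runs without network access, and it uses a breadth-first queue where the real spider may recurse differently.

```python
from collections import deque

# Hypothetical stand-in for the website: URL -> (content type, outgoing links).
FAKE_SITE = {
    "/sitemap": ("text/html", ["/mensa", "/wohnen", "/plan.pdf"]),
    "/mensa": ("text/html", ["/mensa/speiseplan", "/sitemap"]),
    "/wohnen": ("text/html", []),
    "/plan.pdf": ("application/pdf", []),
    "/mensa/speiseplan": ("text/html", ["/mensa/archiv"]),
    "/mensa/archiv": ("text/html", []),
}


def crawl(start, depth, site=FAKE_SITE, wanted_type="text/html"):
    """Return the URLs of wanted pages reachable within `depth` link hops."""
    seen = {start}
    queue = deque([(start, 0)])
    indexed = []
    while queue:
        url, level = queue.popleft()
        content_type, links = site[url]
        if content_type != wanted_type:
            continue  # skip PDF and other unsupported types
        indexed.append(url)
        if level == depth:
            continue  # do not follow links beyond the depth limit
        for link in links:
            if link in site and link not in seen:
                seen.add(link)
                queue.append((link, level + 1))
    return indexed


print(crawl("/sitemap", 1))
```

With a depth of 1, only the sitemap and the HTML pages it links to directly are indexed; the PDF is fetched from the table but never turned into an entry.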

Instance Methods

__init__(self, source_name, url, NO_NET=False)
Initialize the plugin.

_getData(self)
Returns bool. We request our page data from the IndexSpider until it runs out of new pages.

Inherited from xmlgetter.plugin.BaseSyncPlugin: entries_written, run, source_name, stats, url

Inherited from xmlgetter.plugin.BaseSyncPlugin (private): _consolidate, _loadState, _writeEntries, _writeState

Inherited from xmlgetter.request.BaseRequester (private): _requestURL

Inherited from xmlgetter.log.BaseLogger: logger

Inherited from xmlgetter.log.BaseLogger (private): _getLogger

Class Variables

Inherited from xmlgetter.log.BaseLogger (private): _loggers

Instance Variables

_index_spider = None
An instance of IndexSpider, which handles the spidering.

Inherited from xmlgetter.plugin.BaseSyncPlugin (private): _NO_NET, _base_url, _entries, _entries_written, _from_date, _intermediate_temp_filename, _intermediate_xml_filename, _stats, _temp_filename, _url, _xml_filename

Inherited from xmlgetter.log.BaseLogger (private): _source_name

Method Details

__init__(self, source_name, url, NO_NET=False)
(Constructor)

Initialize the plugin.

An IndexSpider instance with the appropriate initial values is created and assigned to _index_spider. During indexing we only consider the div with an id of middle, for both content gathering and link extraction.

Parameters:

source_name - The name of the source.
url - The starting point for the retrieval of the source's data.
NO_NET - If True, operate on the data of the last run instead of retrieving new data.

Overrides: xmlgetter.log.BaseLogger.__init__
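
Restricting both content gathering and link extraction to the div with an id of middle might look like the sketch below. It uses Python 3's standard html.parser rather than whatever IndexSpider uses internally; RegionParser, extract_region, and the SAMPLE markup are hypothetical names and data introduced only for illustration.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class RegionParser(HTMLParser):
    """Collect text and links found inside the element with a given id."""

    def __init__(self, region_id="middle"):
        super().__init__()
        self.region_id = region_id
        self.depth = 0  # nesting depth inside the region; 0 means outside
        self.text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        if self.depth:
            self.depth += 1
            if tag == "a" and attrs.get("href"):
                self.links.append(attrs["href"])
        elif attrs.get("id") == self.region_id:
            self.depth = 1  # entering the content region

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.text.append(data.strip())


def extract_region(html, region_id="middle"):
    """Return (text, links) taken only from the region with `region_id`."""
    parser = RegionParser(region_id)
    parser.feed(html)
    parser.close()
    return " ".join(parser.text), parser.links


SAMPLE = """
<html><body>
<div id="left"><a href="/nav">Navigation</a></div>
<div id="middle"><h1>Mensa</h1><p>Open <a href="/mensa/plan">daily</a>.<br>Welcome</p></div>
<div id="footer"><a href="/imprint">Imprint</a></div>
</body></html>
"""

text, links = extract_region(SAMPLE)
print(links)
```

Navigation and footer links fall outside the region, so they are neither indexed as content nor followed by the spider. The depth counter assumes reasonably well-formed markup; a real page may need a more forgiving parser.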

_getData(self)

We request our page data from the IndexSpider until it runs out of new pages. Each page coming from the spider is guaranteed to have a MIME type we requested and to be unique (no duplicates), so we can focus on extracting the content and generating one XMLEntry per page.

Returns: bool: False if an error occurred and the data is unusable, True otherwise.
Overrides: xmlgetter.plugin.BaseSyncPlugin._getData
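
The loop that _getData describes can be sketched as below, under stated assumptions: Page, Entry, ListSpider, its next_page method, and get_data are all hypothetical stand-ins written in Python 3, not the real IndexSpider or XMLEntry interfaces. The spider is assumed to filter by MIME type and drop duplicates itself, which is the guarantee the description above relies on.

```python
from dataclasses import dataclass


@dataclass
class Page:  # hypothetical stand-in for what the spider hands out
    url: str
    content_type: str
    body: str


@dataclass
class Entry:  # hypothetical stand-in for XMLEntry
    url: str
    content: str


class ListSpider:
    """Hypothetical spider that filters by type and skips duplicate URLs."""

    def __init__(self, pages, wanted_type="text/html"):
        self._pages = list(pages)
        self._wanted_type = wanted_type
        self._seen = set()

    def next_page(self):
        """Return the next new page of the wanted type, or None when exhausted."""
        while self._pages:
            page = self._pages.pop(0)
            if page.content_type == self._wanted_type and page.url not in self._seen:
                self._seen.add(page.url)
                return page
        return None


def get_data(spider, entries):
    """Append one Entry per page; return False if an error makes the data unusable."""
    try:
        page = spider.next_page()
        while page is not None:
            entries.append(Entry(url=page.url, content=page.body.strip()))
            page = spider.next_page()
    except Exception:  # broad on purpose in this sketch: any failure means False
        return False
    return True
```

Because type filtering and deduplication live in the spider, the plugin-side loop stays a plain "fetch, extract, append" cycle and only has to report success or failure.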
