
Class SyncPlugin_studentenwerk


Spiders the website of the Studentenwerk, starting from the sitemap at http://www.studentenwerk.uni-freiburg.de/index.php?id=272.

Finds all URLs in the content region of the sitemap's HTML and recurses to a level of DEPTH. It currently generates an XMLEntry only for pages that have a content-type of u'text/html'; indexing of PDF, DOC, PS, XSL, PPT is not yet supported.
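The traversal described above can be sketched as a depth-limited crawl that indexes only text/html pages. This is a minimal illustration, not the plugin's actual code: the names crawl, LinkCollector, fake_fetch and FAKE_SITE are hypothetical, the fetch callable stands in for real network access, and the way DEPTH is counted (the start page is level 0) is an assumption. It is written for Python 3, while the documented module appears to target Python 2.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href values of all <a> tags fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, fetch, depth):
    """Visit pages level by level, up to `depth` links away from start_url.

    `fetch(url)` is a hypothetical callable returning (content_type, body).
    Only text/html pages are indexed and searched for further links.
    """
    seen = {start_url}
    frontier = [start_url]
    indexed = []
    for _ in range(depth + 1):
        next_frontier = []
        for url in frontier:
            content_type, body = fetch(url)
            if content_type != "text/html":
                continue  # PDF, DOC, ... are skipped
            indexed.append(url)
            collector = LinkCollector()
            collector.feed(body)
            for href in collector.links:
                target = urljoin(url, href)
                if target not in seen:  # never queue a URL twice
                    seen.add(target)
                    next_frontier.append(target)
        frontier = next_frontier
    return indexed


# In-memory stand-in for a website, so the sketch runs without network access.
FAKE_SITE = {
    "http://example.org/sitemap": ("text/html", '<a href="/a">A</a><a href="/menu.pdf">PDF</a>'),
    "http://example.org/a": ("text/html", '<a href="/b">B</a><a href="/sitemap">back</a>'),
    "http://example.org/menu.pdf": ("application/pdf", ""),
    "http://example.org/b": ("text/html", ""),
}


def fake_fetch(url):
    return FAKE_SITE[url]


pages = crawl("http://example.org/sitemap", fake_fetch, 1)
```

With a depth of 1, only the sitemap and the HTML page it links to directly are indexed; the PDF is fetched but produces no entry.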

Instance Methods
 
__init__(self, source_name, url, NO_NET=False)
Initialize the plugin.
bool
_getData(self)
Requests page data from the IndexSpider until it runs out of new pages.

Inherited from xmlgetter.plugin.BaseSyncPlugin: entries_written, run, source_name, stats, url

Inherited from xmlgetter.request.BaseRequester (private): _requestURL

Inherited from xmlgetter.log.BaseLogger: logger

Inherited from xmlgetter.log.BaseLogger (private): _getLogger

Class Variables

Inherited from xmlgetter.log.BaseLogger (private): _loggers

Instance Variables
  _index_spider = None
An instance of IndexSpider, which handles the spidering.

Inherited from xmlgetter.log.BaseLogger (private): _source_name

Method Details

__init__(self, source_name, url, NO_NET=False)
(Constructor)

Initialize the plugin.

An IndexSpider instance with the appropriate initial values is created and assigned to _index_spider. During indexing, only the div with an id of middle is considered, both for content gathering and for link extraction.

Parameters:
  • source_name - The name of the source.
  • url - The starting point for the retrieval of the source's data.
  • NO_NET - If True, operate on the data of the last run instead of retrieving new data.
Overrides: xmlgetter.log.BaseLogger.__init__
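Restricting both content gathering and link extraction to a single div can be sketched with the standard-library HTML parser. This is only an illustration of the idea; the class name RegionExtractor and the sample markup are hypothetical, and how IndexSpider actually selects the region is not shown in this documentation.

```python
from html.parser import HTMLParser


class RegionExtractor(HTMLParser):
    """Collects text and links found inside the div with a given id."""

    def __init__(self, region_id):
        super().__init__()
        self.region_id = region_id
        self.div_depth = 0  # > 0 while the parser is inside the target div
        self.text_parts = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.div_depth:
            if tag == "div":
                self.div_depth += 1  # nested div inside the region
            elif tag == "a" and attrs.get("href"):
                self.links.append(attrs["href"])
        elif tag == "div" and attrs.get("id") == self.region_id:
            self.div_depth = 1  # entering the target region

    def handle_endtag(self, tag):
        if self.div_depth and tag == "div":
            self.div_depth -= 1  # reaches 0 when the region closes

    def handle_data(self, data):
        if self.div_depth and data.strip():
            self.text_parts.append(data.strip())


SAMPLE_HTML = (
    '<div id="nav"><a href="/home">Home</a></div>'
    '<div id="middle"><p>Mensa <a href="/menu">menu</a></p>'
    "<div>Opening hours</div></div>"
    '<div id="footer">Imprint</div>'
)

extractor = RegionExtractor("middle")
extractor.feed(SAMPLE_HTML)
```

Navigation and footer links never reach extractor.links, so the spider would neither index nor follow them.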

_getData(self)

Requests page data from the IndexSpider until it runs out of new pages. Each page coming from the spider is guaranteed to have a requested MIME type and to be unique (no duplicates), so this method can focus on extracting the content and generating one XMLEntry per page.

Returns: bool
False if an error occurred and the data is unusable, True otherwise.
Overrides: xmlgetter.plugin.BaseSyncPlugin._getData
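The control flow of such a method might look like the following sketch. The real IndexSpider interface is not documented here, so FakeSpider, next_page, get_data and write_entry are all hypothetical stand-ins, and a plain dict takes the place of an XMLEntry.

```python
class FakeSpider:
    """Stand-in for IndexSpider; yields (url, content) pairs, then None."""

    def __init__(self, pages):
        self._pages = iter(pages)

    def next_page(self):
        # Returns None once the spider has run out of new pages.
        return next(self._pages, None)


def get_data(spider, write_entry):
    """Pull pages until the spider is exhausted, writing one entry per page.

    Returns False if any step raised an error, True otherwise.
    """
    try:
        while True:
            page = spider.next_page()
            if page is None:
                break
            url, content = page
            # Assumption: a dict stands in for the real XMLEntry object.
            write_entry({"url": url, "content": content})
    except Exception:
        # A sketch-level catch-all; real code would log and narrow this.
        return False
    return True


entries = []
ok = get_data(FakeSpider([("http://example.org/a", "A"),
                          ("http://example.org/b", "B")]), entries.append)
```

Because the spider already filters by MIME type and removes duplicates, the loop needs no checks of its own beyond detecting the end of the page stream.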