Package buildxml :: Package xmlgetter :: Module spider :: Class URLStack

Class URLStack


URL stack for spidering websites.

This is a simple stack whose push operation takes an additional argument: the level of the parent document of the element to be pushed. If the level of the new element would exceed the limit, the element is not added. The stack also keeps track of the elements it has held and refuses to add an element that was already on the stack in the past.
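The behaviour described above can be illustrated with a minimal sketch. This is not the buildxml source: the class name SketchURLStack is hypothetical, the source_name argument and the log.BaseLogger base class are left out, and two details are assumptions, namely that a pushed URL gets the level parent_level + 1 and that a URL is recorded as checked at the moment it is pushed.

```python
class SketchURLStack(object):
    """Illustrative stand-in for URLStack; not the buildxml implementation."""

    def __init__(self, max_level):
        self._urls_info = []        # entries shaped {'url': url, 'level': level}
        self._checked_urls = set()  # every URL that has ever been pushed
        self._max_level = max_level

    def push(self, url, parent_level):
        # Refuse URLs whose parent is already at the limit or that were seen before.
        if parent_level >= self._max_level or url in self._checked_urls:
            return
        # Assumption: a link sits one level below the document it was found in.
        self._urls_info.append({'url': url, 'level': parent_level + 1})
        self._checked_urls.add(url)

    def pop(self):
        return self._urls_info.pop()

    def __len__(self):
        return len(self._urls_info)

    @property
    def checked_urls(self):
        return self._checked_urls


stack = SketchURLStack(max_level=2)
stack.push('http://example.com/', 0)
stack.push('http://example.com/', 0)      # duplicate, ignored
stack.push('http://example.com/deep', 2)  # parent_level == max_level, ignored
print(len(stack))  # 1
```

Only the first push succeeds here: the second is rejected as a duplicate and the third because its parent is already at the maximum level.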

Instance Methods

__init__(self, source_name, max_level)
    Initialize the URLStack.
push(self, url, parent_level)
    Push the URL on top of stack, if parent_level < max_level and url has not already been checked.
pop(self), returns dict
    Pop the top element from the stack.
__len__(self), returns int
    Make len() work on URLStacks.
checked_urls(self), returns list of string
    Return set of checked_urls.

Inherited from log.BaseLogger: logger

Inherited from log.BaseLogger (private): _getLogger

Class Variables

Inherited from log.BaseLogger (private): _loggers

Instance Variables
list of dict _urls_info = None
A list of dictionaries of the form {u'url': url, u'level': level}.
list _urls = None
The stack of only the URLs, not the level information.
set _checked_urls = None
A set of URLs that have already been on the stack.
int _max_level = -1
The maximum depth to which URLs should be accepted on the stack.

Inherited from log.BaseLogger (private): _source_name

Method Details

__init__(self, source_name, max_level)
(Constructor)

Initialize the URLStack.

Parameters:
  • source_name (string) - The name of the source.
  • max_level (int) - The maximum level from which to accept new elements for the stack.
Overrides: log.BaseLogger.__init__

push(self, url, parent_level)

Push the URL onto the top of the stack if parent_level < max_level and the URL has not already been checked. Otherwise the URL is not pushed and the stack is left unchanged.

Parameters:
  • url (string) - The URL to push on the stack.
  • parent_level (int) - The level of the document in which the link was found.
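The acceptance condition for push() can be written as a small predicate. The helper name would_accept and its argument list are hypothetical and exist only for illustration; the real method presumably reads max_level and the checked URLs from the instance.

```python
def would_accept(url, parent_level, max_level, checked_urls):
    """Hypothetical helper mirroring the documented push() condition."""
    return parent_level < max_level and url not in checked_urls


seen = {'http://example.com/a'}
results = [
    would_accept('http://example.com/b', 0, 2, seen),  # new URL, shallow parent
    would_accept('http://example.com/b', 2, 2, seen),  # parent already at the limit
    would_accept('http://example.com/a', 0, 2, seen),  # already checked
]
print(results)  # [True, False, False]
```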

pop(self)

Pop the top element from the stack.

Returns: dict
The URL information of the popped element.

__len__(self)
(Length operator)

Make len() work on URLStacks.

Returns: int
The size of the stack.
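Together, pop() and len() are enough to drive a depth-limited crawl loop. The sketch below uses a hypothetical stand-in class (TinyStack) and a made-up in-memory link graph (LINKS) instead of real HTTP requests. It assumes that the popped dict carries 'url' and 'level' keys, as described for _urls_info, and that a pushed URL is one level deeper than its parent.

```python
class TinyStack(object):
    """Hypothetical stand-in exposing push(), pop() and len() as documented."""

    def __init__(self, max_level):
        self._items, self._seen, self._max_level = [], set(), max_level

    def push(self, url, parent_level):
        if parent_level < self._max_level and url not in self._seen:
            # Assumption: a pushed URL is one level deeper than its parent.
            self._items.append({'url': url, 'level': parent_level + 1})
            self._seen.add(url)

    def pop(self):
        return self._items.pop()

    def __len__(self):
        return len(self._items)


# Made-up site graph: page name -> pages it links to.
LINKS = {'index': ['a', 'b'], 'a': ['b', 'c'], 'b': ['index'], 'c': ['d']}

stack = TinyStack(max_level=2)
stack.push('index', 0)
visited = []
while len(stack) > 0:
    info = stack.pop()
    visited.append(info['url'])
    # The popped level becomes the parent level of every link on that page.
    for link in LINKS.get(info['url'], []):
        stack.push(link, info['level'])

print(visited)  # ['index', 'b', 'a']
```

Pages 'c' and 'd' are never visited because their parents already sit at the maximum level, and 'index' is not revisited because it has been on the stack before.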

checked_urls(self)

Return the set of checked URLs.

Returns: list of string.
The list of already checked (popped) URLs.
Decorators:
  • @property

Instance Variable Details

_urls

The stack of only the URLs, not the level information. Required to check if the URL has been on the stack before.
Type: list
Value: None