Class BeautifulStoneSoup

This class contains the basic parser and search code. It defines
a parser that knows nothing about tag behavior except for the
following:

  You can't close a tag without closing all the tags it encloses.
  That is, "<foo><bar></foo>" actually means
  "<foo><bar></bar></foo>".

[Another possible explanation is "<foo><bar /></foo>", but since
this class defines no SELF_CLOSING_TAGS, it will never use that
explanation.]

This class is useful for parsing XML or made-up markup languages,
or when BeautifulSoup makes an assumption counter to what you were
expecting.
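
The closing rule above can be illustrated without Beautiful Soup itself. The following is a minimal standalone sketch, not the library's implementation: `balance` is a hypothetical helper that keeps open tags on a stack and, when an end tag arrives, first closes everything opened after the matching start tag. It only understands attribute-free tags, and dropping stray end tags is an assumption of this sketch.

```python
import re

# Matches only simple tags such as <foo> or </foo> (no attributes).
TAG = re.compile(r"<(/?)(\w+)>")


def balance(markup):
    """Rewrite markup so every end tag also closes the tags it encloses."""
    out = []
    stack = []  # names of currently open tags, innermost last
    pos = 0
    for match in TAG.finditer(markup):
        out.append(markup[pos:match.start()])  # text before this tag
        pos = match.end()
        closing, name = match.group(1), match.group(2)
        if not closing:
            stack.append(name)
            out.append(match.group(0))
        elif name in stack:
            # Close every enclosed tag, then the named tag itself.
            while True:
                open_name = stack.pop()
                out.append("</%s>" % open_name)
                if open_name == name:
                    break
        # An end tag with no matching start tag is dropped (assumption).
    out.append(markup[pos:])
    while stack:  # close whatever is still open at end of input
        out.append("</%s>" % stack.pop())
    return "".join(out)


print(balance("<foo><bar></foo>"))
```

With no self-closing tags defined, the only reading available to such a parser is the one that inserts the missing end tag.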

Instance Methods

__init__(self, markup="", parseOnlyThese=None, fromEncoding=None, markupMassage=True, smartQuotesTo=XML_ENTITIES, convertEntities=None, selfClosingTags=None, isHTML=False)
The Soup object is initialized as the 'root tag', and the provided markup (which can be a string or a file-like object) is fed into the underlying parser.

convert_charref(self, name)
This method fixes a bug in Python's SGMLParser.

_feed(self, inDocumentEncoding=None, isHTML=False)

__getattr__(self, methodName)
This method routes method call requests to either the SGMLParser superclass or the Tag superclass, depending on the method name.

isSelfClosingTag(self, name)
Returns true iff the given string is the name of a self-closing tag according to this parser.

reset(self)
Reset this instance.

popTag(self)

pushTag(self, tag)

endData(self, containerClass=NavigableString)

_popToTag(self, name, inclusivePop=True)
Pops the tag stack up to and including the most recent instance of the given tag.

_smartPop(self, name)
We need to pop up to the previous tag of this type, unless one of this tag's nesting reset triggers comes between this tag and the previous tag of this type, OR unless this tag is a generic nesting trigger and another generic nesting trigger comes between this tag and the previous tag of this type.

unknown_starttag(self, name, attrs, selfClosing=0)

unknown_endtag(self, name)

handle_data(self, data)

_toStringSubclass(self, text, subclass)
Adds a certain piece of text to the tree as a NavigableString subclass.

handle_pi(self, text)
Handle a processing instruction as a ProcessingInstruction object, possibly one with a %SOUP-ENCODING% slot into which an encoding will be plugged later.

handle_comment(self, text)
Handle comments as Comment objects.

handle_charref(self, ref)
Handle character references as data.

handle_entityref(self, ref)
Handle entity references as data, possibly converting known HTML and/or XML entity references to the corresponding Unicode characters.

handle_decl(self, data)
Handle DOCTYPEs and the like as Declaration objects.

parse_declaration(self, i)
Treat a bogus SGML declaration as raw data.

Inherited from Tag: __call__, __contains__, __delitem__, __eq__, __getitem__, __iter__, __len__, __ne__, __nonzero__, __repr__, __setitem__, __str__, __unicode__, childGenerator, clear, decompose, fetch, fetchText, find, findAll, findChild, findChildren, first, firstText, get, getString, getText, has_key, index, prettify, recursiveChildGenerator, renderContents, setString, text

Inherited from Tag (private): _convertEntities, _getAttrMap, _invert, _sub_entity

Inherited from PageElement: append, extract, fetchNextSiblings, fetchParents, fetchPrevious, fetchPreviousSiblings, findAllNext, findAllPrevious, findNext, findNextSibling, findNextSiblings, findParent, findParents, findPrevious, findPreviousSibling, findPreviousSiblings, insert, nextGenerator, nextSiblingGenerator, parentGenerator, previousGenerator, previousSiblingGenerator, replaceWith, replaceWithChildren, setup, substituteEncoding, toEncoding

Inherited from PageElement (private): _findAll, _findOne, _lastRecursiveChild

Inherited from object: __delattr__, __format__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __setattr__, __sizeof__, __subclasshook__

Inherited from sgmllib.SGMLParser: close, convert_codepoint, convert_entityref, error, feed, finish_endtag, finish_shorttag, finish_starttag, get_starttag_text, goahead, handle_endtag, handle_starttag, parse_endtag, parse_pi, parse_starttag, report_unbalanced, setliteral, setnomoretags, unknown_charref, unknown_entityref

Inherited from sgmllib.SGMLParser (private): _convert_ref

Inherited from markupbase.ParserBase: getpos, parse_comment, parse_marked_section, unknown_decl, updatepos

Inherited from markupbase.ParserBase (private): _parse_doctype_attlist, _parse_doctype_element, _parse_doctype_entity, _parse_doctype_notation, _parse_doctype_subset, _scan_name

Class Variables
  SELF_CLOSING_TAGS = {}
  NESTABLE_TAGS = {}
  RESET_NESTING_TAGS = {}
  QUOTE_TAGS = {}
  PRESERVE_WHITESPACE_TAGS = []
  MARKUP_MASSAGE = [(re.compile('(<[^<>]*)/>'), lambda x: x.grou...
  ROOT_TAG_NAME = u'[document]'
  HTML_ENTITIES = "html"
  XML_ENTITIES = "xml"
  XHTML_ENTITIES = "xhtml"
  ALL_ENTITIES = "xhtml"
  STRIP_ASCII_SPACES = {9: None, 10: None, 12: None, 13: None, 3...

Inherited from Tag: BARE_AMPERSAND_OR_BRACKET, XML_ENTITIES_TO_SPECIAL_CHARS, XML_SPECIAL_CHARS_TO_ENTITIES, string

Inherited from sgmllib.SGMLParser: entity_or_charref, entitydefs

Inherited from sgmllib.SGMLParser (private): _decl_otherchars

Properties

Inherited from object: __class__

Method Details

__init__(self, markup="", parseOnlyThese=None, fromEncoding=None, markupMassage=True, smartQuotesTo=XML_ENTITIES, convertEntities=None, selfClosingTags=None, isHTML=False)
(Constructor)

The Soup object is initialized as the 'root tag', and the
provided markup (which can be a string or a file-like object)
is fed into the underlying parser.

sgmllib will process most bad HTML, and the BeautifulSoup
class has some tricks for dealing with some HTML that kills
sgmllib, but Beautiful Soup can nonetheless choke or lose data
if your data uses self-closing tags or declarations
incorrectly.

By default, Beautiful Soup uses regexes to sanitize input,
avoiding the vast majority of these problems. If the problems
don't apply to you, pass in False for markupMassage, and
you'll get better performance.

The default parser massage techniques fix the two most common
instances of invalid HTML that choke sgmllib:

 <br/> (No space between name of closing tag and tag close)
 <! --Comment--> (Extraneous whitespace in declaration)

You can pass in a custom list of (RE object, replace method)
tuples to get Beautiful Soup to scrub your input the way you
want.
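
As a rough illustration of what the massage step does, the sketch below applies a list of (RE object, replace method) tuples using only the standard `re` module. The two default patterns are copied from the MARKUP_MASSAGE class variable; the `massage` helper and the extra blank-line rule are hypothetical and exist only for this example.

```python
import re

# The two default patterns, copied from the MARKUP_MASSAGE class variable.
DEFAULT_MASSAGE = [
    (re.compile(r"(<[^<>]*)/>"), lambda m: m.group(1) + " />"),
    (re.compile(r"<!\s+([^<>]*)>"), lambda m: "<!" + m.group(1) + ">"),
]

# Hypothetical extra rule for this sketch: collapse runs of blank lines.
CUSTOM_MASSAGE = DEFAULT_MASSAGE + [
    (re.compile(r"\n{3,}"), lambda m: "\n\n"),
]


def massage(markup, rules):
    """Apply each (pattern, replacement function) pair in order."""
    for pattern, replacement in rules:
        markup = pattern.sub(replacement, markup)
    return markup


print(massage("<br/>", DEFAULT_MASSAGE))
print(massage("<! --Comment-->", DEFAULT_MASSAGE))
```

A list shaped like `CUSTOM_MASSAGE` is the kind of value that would be passed as the markupMassage argument in place of True or False.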

Overrides: markupbase.ParserBase.__init__

convert_charref(self, name)

This method fixes a bug in Python's SGMLParser.

Overrides: sgmllib.SGMLParser.convert_charref

__getattr__(self, methodName)
(Qualification operator)

This method routes method call requests to either the SGMLParser superclass or the Tag superclass, depending on the method name.

Overrides: Tag.__getattr__

reset(self)

Reset this instance. Loses all unprocessed data.

Overrides: markupbase.ParserBase.reset

_popToTag(self, name, inclusivePop=True)

Pops the tag stack up to and including the most recent instance of the given tag. If inclusivePop is false, pops the tag stack up to but *not* including the most recent instance of the given tag.
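
The difference between the inclusive and non-inclusive pop can be shown on a plain list of tag names. This is a standalone sketch under that simplification, not the method's actual body: `pop_to_tag` is a hypothetical helper, and the real method works on Tag objects rather than strings.

```python
def pop_to_tag(stack, name, inclusive_pop=True):
    """Pop from the end of `stack` back to the most recent `name`.

    Returns the popped names, outermost first. The stack is modified
    in place. If `name` is not open, nothing is popped.
    """
    for i in range(len(stack) - 1, -1, -1):
        if stack[i] == name:
            cut = i if inclusive_pop else i + 1
            popped = stack[cut:]
            del stack[cut:]
            return popped
    return []


stack = ["html", "body", "p", "b"]
print(pop_to_tag(stack, "p"), stack)

stack = ["html", "body", "p", "b"]
print(pop_to_tag(stack, "p", inclusive_pop=False), stack)
```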

_smartPop(self, name)

We need to pop up to the previous tag of this type, unless
one of this tag's nesting reset triggers comes between this
tag and the previous tag of this type, OR unless this tag is a
generic nesting trigger and another generic nesting trigger
comes between this tag and the previous tag of this type.

Examples:
 <p>Foo<b>Bar *<p>* should pop to 'p', not 'b'.
 <p>Foo<table>Bar *<p>* should pop to 'table', not 'p'.
 <p>Foo<table><tr>Bar *<p>* should pop to 'tr', not 'p'.

 <li><ul><li> *<li>* should pop to 'ul', not the first 'li'.
 <tr><table><tr> *<tr>* should pop to 'table', not the first 'tr'
 <td><tr><td> *<td>* should pop to 'tr', not the first 'td'
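
The examples above can be reproduced with a small standalone sketch of the decision rule. It is an approximation under stated assumptions, not the method's source: `smart_pop` is a hypothetical helper that works on a list of tag names, and the two nesting tables are trimmed-down stand-ins chosen for these examples. In BeautifulStoneSoup itself, NESTABLE_TAGS and RESET_NESTING_TAGS are empty, so behavior like this would only appear in a subclass that fills them in.

```python
# Hypothetical, trimmed-down nesting tables (assumptions for this sketch).
# Each nestable tag maps to the tags that reset its nesting.
NESTABLE_TAGS = {
    "ul": [],
    "li": ["ul", "ol"],
    "table": [],
    "tr": ["table"],
    "td": ["tr"],
}
RESET_NESTING_TAGS = {"p", "ul", "li", "table", "tr", "td"}


def smart_pop(stack, name):
    """Return the open-tag stack as it would look just before `name` opens."""
    triggers = NESTABLE_TAGS.get(name)
    is_nestable = triggers is not None
    is_reset_nesting = name in RESET_NESTING_TAGS

    for i in range(len(stack) - 1, -1, -1):
        open_tag = stack[i]
        if open_tag == name and not is_nestable:
            # Non-nestable tag: pop up to and including its last occurrence.
            return stack[:i]
        if is_nestable and open_tag in triggers:
            # One of this tag's own reset triggers: pop up to, not including.
            return stack[:i + 1]
        if (not is_nestable and is_reset_nesting
                and open_tag in RESET_NESTING_TAGS):
            # Generic trigger meets another generic trigger: same treatment.
            return stack[:i + 1]
    return list(stack)


print(smart_pop(["p", "b"], "p"))
print(smart_pop(["li", "ul", "li"], "li"))
```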

unknown_starttag(self, name, attrs, selfClosing=0)

Overrides: sgmllib.SGMLParser.unknown_starttag

unknown_endtag(self, name)

Overrides: sgmllib.SGMLParser.unknown_endtag

handle_data(self, data)

Overrides: sgmllib.SGMLParser.handle_data

handle_pi(self, text)

Handle a processing instruction as a ProcessingInstruction object, possibly one with a %SOUP-ENCODING% slot into which an encoding will be plugged later.

Overrides: sgmllib.SGMLParser.handle_pi

handle_comment(self, text)

Handle comments as Comment objects.

Overrides: sgmllib.SGMLParser.handle_comment

handle_charref(self, ref)

Handle character references as data.

Overrides: sgmllib.SGMLParser.handle_charref

handle_entityref(self, ref)

Handle entity references as data, possibly converting known HTML and/or XML entity references to the corresponding Unicode characters.
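
The conversion described here can be sketched in a few lines. This is a standalone approximation, not the class's code: `entity_to_text` is a hypothetical helper, the mode strings mirror the HTML_ENTITIES, XML_ENTITIES and XHTML_ENTITIES constants listed under Class Variables, and the exact rule for which mode converts which reference is an assumption of this sketch.

```python
from html.entities import name2codepoint

# The five predefined XML entities.
XML_ENTITIES_TO_CHARS = {
    "amp": "&", "lt": "<", "gt": ">", "quot": '"', "apos": "'",
}


def entity_to_text(ref, convert_entities=None):
    """Return the text for the entity reference `ref` (name without & and ;).

    Unknown references, and all references when conversion is off,
    are passed through unchanged as data.
    """
    if convert_entities in ("html", "xhtml") and ref in name2codepoint:
        return chr(name2codepoint[ref])
    if convert_entities in ("xml", "xhtml") and ref in XML_ENTITIES_TO_CHARS:
        return XML_ENTITIES_TO_CHARS[ref]
    return "&%s;" % ref


print(entity_to_text("eacute", "html"))
print(entity_to_text("eacute", "xml"))
```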

Overrides: sgmllib.SGMLParser.handle_entityref

handle_decl(self, data)

Handle DOCTYPEs and the like as Declaration objects.

Overrides: sgmllib.SGMLParser.handle_decl

parse_declaration(self, i)

Treat a bogus SGML declaration as raw data. Treat a CDATA declaration as a CData object.

Overrides: markupbase.ParserBase.parse_declaration

Class Variable Details

MARKUP_MASSAGE

Value:
[(re.compile('(<[^<>]*)/>'), lambda x: x.group(1) + ' />'),
 (re.compile('<!\s+([^<>]*)>'), lambda x: '<!' + x.group(1) + '>')]

STRIP_ASCII_SPACES

Value:
{9: None, 10: None, 12: None, 13: None, 32: None}
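
A mapping of code points to None has the shape that a Unicode string's `translate` method expects for deleting characters. The sketch below shows one plausible use, testing whether a string consists only of ASCII whitespace; the helper name `is_ascii_whitespace_only` is hypothetical, and how the class itself uses the table is not stated in this documentation.

```python
# Code points: tab, line feed, form feed, carriage return, space.
STRIP_ASCII_SPACES = {9: None, 10: None, 12: None, 13: None, 32: None}


def is_ascii_whitespace_only(text):
    """True if nothing is left after deleting the listed characters."""
    return text.translate(STRIP_ASCII_SPACES) == ""


print("a b\tc".translate(STRIP_ASCII_SPACES))
print(is_ascii_whitespace_only(" \t\r\n"))
```

Characters outside the table, such as the non-breaking space U+00A0, are left in place, so a string containing them does not count as whitespace-only here.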