
Class BeautifulSoup


This parser knows the following facts about HTML:

* Some tags have no closing tag and should be interpreted as being
  closed as soon as they are encountered.

* The text inside some tags (e.g. 'script') may contain tags which
  are not really part of the document and which should be parsed
  as text, not tags. If you want to parse the text as tags, you can
  always fetch it and parse it explicitly.

* Tag nesting rules:

  Most tags can't be nested at all. For instance, the occurrence of
  a <p> tag should implicitly close the previous <p> tag.

   <p>Para1<p>Para2
    should be transformed into:
   <p>Para1</p><p>Para2

  Some tags can be nested arbitrarily. For instance, the occurrence
  of a <blockquote> tag should _not_ implicitly close the previous
  <blockquote> tag.

   Alice said: <blockquote>Bob said: <blockquote>Blah
    should NOT be transformed into:
   Alice said: <blockquote>Bob said: </blockquote><blockquote>Blah

  Some tags can be nested, but the nesting is reset by the
  interposition of other tags. For instance, a <tr> tag should
  implicitly close the previous <tr> tag within the same <table>,
  but not close a <tr> tag in another table.

   <table><tr>Blah<tr>Blah
    should be transformed into:
   <table><tr>Blah</tr><tr>Blah
    but,
   <tr>Blah<table><tr>Blah
    should NOT be transformed into:
   <tr>Blah<table></tr><tr>Blah

Differing assumptions about tag nesting rules are a major source
of problems with the BeautifulSoup class. If BeautifulSoup is not
treating as nestable a tag your page author treats as nestable,
try ICantBelieveItsBeautifulSoup, MinimalSoup, or
BeautifulStoneSoup before writing your own subclass.
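
The nesting rules above can be modeled with a stack of open tags. The following is a toy sketch, not the library's implementation: NESTABLE and RESET_NESTING are cut-down stand-ins for the NESTABLE_TAGS and RESET_NESTING_TAGS class variables, and open_tag and implicit_closes are hypothetical helpers introduced only for illustration.

```python
# Cut-down stand-ins for NESTABLE_TAGS and RESET_NESTING_TAGS (assumption:
# only the tags used in the examples above are listed).
NESTABLE = {
    "blockquote": [],
    "table": [],
    "tr": ["table", "tbody", "tfoot", "thead"],
}
RESET_NESTING = {"blockquote", "table", "tr", "p"}


def open_tag(stack, name):
    """Push `name` onto `stack`; return the tags implicitly closed first."""
    triggers = NESTABLE.get(name)  # None means "not nestable"
    cut = None  # open tags from this index onward get closed
    for i in range(len(stack) - 1, -1, -1):
        open_name = stack[i]
        if triggers is None and open_name == name:
            cut = i  # close back to and including the earlier same-named tag
            break
        if triggers is not None:
            resets = open_name in triggers
        else:
            resets = name in RESET_NESTING and open_name in RESET_NESTING
        if resets:
            cut = i + 1  # stop just below the tag that resets nesting
            break
    closed = stack[cut:][::-1] if cut is not None else []
    if cut is not None:
        del stack[cut:]
    stack.append(name)
    return closed


def implicit_closes(names):
    """Feed a sequence of start tags; collect what each one closed."""
    stack = []
    return [open_tag(stack, name) for name in names]


print(implicit_closes(["p", "p"]))             # [[], ['p']]
print(implicit_closes(["table", "tr", "tr"]))  # [[], [], ['tr']]
print(implicit_closes(["tr", "table", "tr"]))  # [[], [], []]
```

In this model a second <p> closes the first, a second <tr> closes the first only when no <table> sits between them, and nested <blockquote> tags close nothing.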

Instance Methods
 
__init__(self, *args, **kwargs)
The Soup object is initialized as the 'root tag', and the provided markup (which can be a string or a file-like object) is fed into the underlying parser.
 
start_meta(self, attrs)
Beautiful Soup can detect a charset included in a META tag, try to convert the document to that charset, and re-parse the document from the beginning.

Inherited from BeautifulStoneSoup: __getattr__, convert_charref, endData, handle_charref, handle_comment, handle_data, handle_decl, handle_entityref, handle_pi, isSelfClosingTag, parse_declaration, popTag, pushTag, reset, unknown_endtag, unknown_starttag

Inherited from Tag: __call__, __contains__, __delitem__, __eq__, __getitem__, __iter__, __len__, __ne__, __nonzero__, __repr__, __setitem__, __str__, __unicode__, childGenerator, clear, decompose, fetch, fetchText, find, findAll, findChild, findChildren, first, firstText, get, getString, getText, has_key, index, prettify, recursiveChildGenerator, renderContents, setString, text

Inherited from Tag (private): _convertEntities, _getAttrMap, _invert, _sub_entity

Inherited from PageElement: append, extract, fetchNextSiblings, fetchParents, fetchPrevious, fetchPreviousSiblings, findAllNext, findAllPrevious, findNext, findNextSibling, findNextSiblings, findParent, findParents, findPrevious, findPreviousSibling, findPreviousSiblings, insert, nextGenerator, nextSiblingGenerator, parentGenerator, previousGenerator, previousSiblingGenerator, replaceWith, replaceWithChildren, setup, substituteEncoding, toEncoding

Inherited from PageElement (private): _findAll, _findOne, _lastRecursiveChild

Inherited from object: __delattr__, __format__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __setattr__, __sizeof__, __subclasshook__

Inherited from sgmllib.SGMLParser: close, convert_codepoint, convert_entityref, error, feed, finish_endtag, finish_shorttag, finish_starttag, get_starttag_text, goahead, handle_endtag, handle_starttag, parse_endtag, parse_pi, parse_starttag, report_unbalanced, setliteral, setnomoretags, unknown_charref, unknown_entityref

Inherited from sgmllib.SGMLParser (private): _convert_ref

Inherited from markupbase.ParserBase: getpos, parse_comment, parse_marked_section, unknown_decl, updatepos

Inherited from markupbase.ParserBase (private): _parse_doctype_attlist, _parse_doctype_element, _parse_doctype_entity, _parse_doctype_notation, _parse_doctype_subset, _scan_name

Class Variables
  SELF_CLOSING_TAGS = buildTagMap(None, ('br', 'hr', 'input', 'i...
  PRESERVE_WHITESPACE_TAGS = set(['pre', 'textarea'])
  QUOTE_TAGS = {'script': None, 'textarea': None}
  NESTABLE_INLINE_TAGS = 'span', 'font', 'q', 'object', 'bdo', '...
  NESTABLE_BLOCK_TAGS = 'blockquote', 'div', 'fieldset', 'ins', ...
  NESTABLE_LIST_TAGS = {'ol': [], 'ul': [], 'li': ['ul', 'ol'], ...
  NESTABLE_TABLE_TAGS = {'table': [], 'tr': ['table', 'tbody', '...
  NON_NESTABLE_BLOCK_TAGS = 'address', 'form', 'p', 'pre'
  RESET_NESTING_TAGS = buildTagMap(None, NESTABLE_BLOCK_TAGS, 'n...
  NESTABLE_TAGS = buildTagMap([], NESTABLE_INLINE_TAGS, NESTABLE...
  CHARSET_RE = re.compile("((^|;)\s*charset=)([^;]*)", re.M)

Inherited from BeautifulStoneSoup: ALL_ENTITIES, HTML_ENTITIES, MARKUP_MASSAGE, ROOT_TAG_NAME, STRIP_ASCII_SPACES, XHTML_ENTITIES, XML_ENTITIES

Inherited from Tag: BARE_AMPERSAND_OR_BRACKET, XML_ENTITIES_TO_SPECIAL_CHARS, XML_SPECIAL_CHARS_TO_ENTITIES, string

Inherited from sgmllib.SGMLParser: entity_or_charref, entitydefs

Inherited from sgmllib.SGMLParser (private): _decl_otherchars
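
As a rough illustration of what the CHARSET_RE pattern listed above captures, the sketch below applies it to the kind of value found in a META tag's content attribute. The sample string and the rewrite_charset helper are hypothetical, and this does not claim to reproduce what start_meta does internally. The pattern is written here as a raw string, which compiles to the same regular expression.

```python
import re

# Same pattern as the CHARSET_RE class variable, written as a raw string.
CHARSET_RE = re.compile(r"((^|;)\s*charset=)([^;]*)", re.M)

# Hypothetical content attribute of a META http-equiv tag.
content = "text/html; charset=ISO-8859-1"

match = CHARSET_RE.search(content)
# Group 1 is the "; charset=" prefix, group 3 is the declared charset.
declared = match.group(3) if match else None


def rewrite_charset(value, new_charset):
    """Replace the declared charset while keeping the prefix intact."""
    return CHARSET_RE.sub(lambda m: m.group(1) + new_charset, value)


print(declared)                           # ISO-8859-1
print(rewrite_charset(content, "utf-8"))  # text/html; charset=utf-8
```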

Properties

Inherited from object: __class__

Method Details

__init__(self, *args, **kwargs)
(Constructor)

The Soup object is initialized as the 'root tag', and the
provided markup (which can be a string or a file-like object)
is fed into the underlying parser.

sgmllib will process most bad HTML, and the BeautifulSoup
class has some tricks for dealing with some HTML that kills
sgmllib, but Beautiful Soup can nonetheless choke or lose data
if your data uses self-closing tags or declarations
incorrectly.

By default, Beautiful Soup uses regexes to sanitize input,
avoiding the vast majority of these problems. If the problems
don't apply to you, pass in False for markupMassage, and
you'll get better performance.

The default parser massage techniques fix the two most common
instances of invalid HTML that choke sgmllib:

 <br/> (No space between name of closing tag and tag close)
 <! --Comment--> (Extraneous whitespace in declaration)

You can pass in a custom list of (RE object, replace method)
tuples to get Beautiful Soup to scrub your input the way you
want.
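
A list of (RE object, replace method) tuples of the kind described above might look like the following sketch. The two patterns are illustrative guesses aimed at the two cases just listed and are not necessarily the library's own defaults; apply_massage is a hypothetical helper that shows the effect of running each pair over the markup in order.

```python
import re

# Illustrative (RE object, replace method) pairs; assumed, not the defaults.
massage = [
    # "<br/>" -> "<br />": insert a space before the closing "/>".
    (re.compile(r"(<[^<>]*)/>"), lambda m: m.group(1) + " />"),
    # "<! --Comment-->" -> "<!--Comment-->": drop whitespace after "<!".
    (re.compile(r"<!\s+([^<>]*)>"), lambda m: "<!" + m.group(1) + ">"),
]


def apply_massage(markup, fixes):
    """Run each (pattern, replacement) pair over the markup in order."""
    for pattern, replacement in fixes:
        markup = pattern.sub(replacement, markup)
    return markup


print(apply_massage("<br/>", massage))            # <br />
print(apply_massage("<! --Comment-->", massage))  # <!--Comment-->
```

Presumably such a list is what the markupMassage argument accepts in place of False, but check the constructor's source before relying on that.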

Overrides: markupbase.ParserBase.__init__

Class Variable Details

SELF_CLOSING_TAGS

Value:
buildTagMap(None, ('br', 'hr', 'input', 'img', 'meta', 'spacer',
                   'link', 'frame', 'base', 'col'))

NESTABLE_INLINE_TAGS

Value:
'span', 'font', 'q', 'object', 'bdo', 'sub', 'sup', 'center'

NESTABLE_BLOCK_TAGS

Value:
'blockquote', 'div', 'fieldset', 'ins', 'del'

NESTABLE_LIST_TAGS

Value:
{'ol': [], 'ul': [], 'li': ['ul', 'ol'], 'dl': [], 'dd': ['dl'],
 'dt': ['dl']}

NESTABLE_TABLE_TAGS

Value:
{'table': [], 'tr': ['table', 'tbody', 'tfoot', 'thead'],
 'td': ['tr'], 'th': ['tr'], 'thead': ['table'], 'tbody': ['table'],
 'tfoot': ['table'],}

RESET_NESTING_TAGS

Value:
buildTagMap(None, NESTABLE_BLOCK_TAGS, 'noscript',
            NON_NESTABLE_BLOCK_TAGS, NESTABLE_LIST_TAGS,
            NESTABLE_TABLE_TAGS)

NESTABLE_TAGS

Value:
buildTagMap([], NESTABLE_INLINE_TAGS, NESTABLE_BLOCK_TAGS,
            NESTABLE_LIST_TAGS, NESTABLE_TABLE_TAGS)
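
The values above suggest that buildTagMap folds a mix of tuples, dictionaries, and single names into one dictionary. The sketch below is an assumption about that behavior rather than the module's actual code: build_tag_map is a hypothetical stand-in that merges dictionaries as they are and maps every other name to the given default.

```python
def build_tag_map(default, *portions):
    """Hypothetical stand-in for buildTagMap (assumed behavior)."""
    built = {}
    for portion in portions:
        if isinstance(portion, dict):
            built.update(portion)      # a map: merge it as-is
        elif isinstance(portion, (list, tuple, set)):
            for name in portion:       # a sequence: map each name to default
                built[name] = default
        else:
            built[portion] = default   # a single name
    return built


# Values copied from the class variable details above.
NESTABLE_INLINE_TAGS = ('span', 'font', 'q', 'object', 'bdo', 'sub', 'sup',
                        'center')
NESTABLE_BLOCK_TAGS = ('blockquote', 'div', 'fieldset', 'ins', 'del')
NON_NESTABLE_BLOCK_TAGS = ('address', 'form', 'p', 'pre')
NESTABLE_LIST_TAGS = {'ol': [], 'ul': [], 'li': ['ul', 'ol'], 'dl': [],
                      'dd': ['dl'], 'dt': ['dl']}
NESTABLE_TABLE_TAGS = {'table': [],
                       'tr': ['table', 'tbody', 'tfoot', 'thead'],
                       'td': ['tr'], 'th': ['tr'], 'thead': ['table'],
                       'tbody': ['table'], 'tfoot': ['table']}

NESTABLE_TAGS = build_tag_map([], NESTABLE_INLINE_TAGS, NESTABLE_BLOCK_TAGS,
                              NESTABLE_LIST_TAGS, NESTABLE_TABLE_TAGS)
RESET_NESTING_TAGS = build_tag_map(None, NESTABLE_BLOCK_TAGS, 'noscript',
                                   NON_NESTABLE_BLOCK_TAGS,
                                   NESTABLE_LIST_TAGS, NESTABLE_TABLE_TAGS)

print(NESTABLE_TAGS['tr'])           # ['table', 'tbody', 'tfoot', 'thead']
print('p' in NESTABLE_TAGS)          # False
print('p' in RESET_NESTING_TAGS)     # True
```

Under this reading, a tag missing from NESTABLE_TAGS (such as 'p') is not nestable, while a tag mapped to a list is nestable and the list names the enclosing tags that reset its nesting.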