This parser knows the following facts about HTML:

* Some tags have no closing tag and should be interpreted as being closed as soon as they are encountered.

* The text inside some tags (e.g. 'script') may contain tags which are not really part of the document and which should be parsed as text, not tags. If you want to parse the text as tags, you can always fetch it and parse it explicitly.

* Tag nesting rules:

  Most tags can't be nested at all. For instance, the occurrence of a <p> tag should implicitly close the previous <p> tag.

      <p>Para1<p>Para2
  should be transformed into:
      <p>Para1</p><p>Para2

  Some tags can be nested arbitrarily. For instance, the occurrence of a <blockquote> tag should _not_ implicitly close the previous <blockquote> tag.

      Alice said: <blockquote>Bob said: <blockquote>Blah
  should NOT be transformed into:
      Alice said: <blockquote>Bob said: </blockquote><blockquote>Blah

  Some tags can be nested, but the nesting is reset by the interposition of other tags. For instance, a <tr> tag should implicitly close the previous <tr> tag within the same <table>, but not close a <tr> tag in another table.

      <table><tr>Blah<tr>Blah
  should be transformed into:
      <table><tr>Blah</tr><tr>Blah
  but,
      <tr>Blah<table><tr>Blah
  should NOT be transformed into
      <tr>Blah<table></tr><tr>Blah

Differing assumptions about tag nesting rules are a major source of problems with the BeautifulSoup class. If BeautifulSoup is not treating as nestable a tag your page author treats as nestable, try ICantBelieveItsBeautifulSoup, MinimalSoup, or BeautifulStoneSoup before writing your own subclass.
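The three nesting cases described above can be sketched in a few lines of Python. This is a simplified illustration, not the BeautifulSoup implementation: the `NESTABLE` table and the `tags_to_close` helper are hypothetical names, and the table covers only the tags used in the examples.

```python
# Hypothetical, cut-down rule table (an assumption for this sketch).
# A tag missing from the table is treated as non-nestable.
# An empty list means "nests arbitrarily"; a non-empty list names the
# enclosing tags that reset nesting for that tag.
NESTABLE = {
    "blockquote": [],
    "table": [],
    "tr": ["table"],
}


def tags_to_close(open_tags, new_tag):
    """Return the open tags to close (innermost first) before new_tag opens.

    open_tags lists the currently open tag names, outermost first.
    """
    triggers = NESTABLE.get(new_tag)  # None means "not nestable"
    for depth in range(len(open_tags) - 1, -1, -1):
        tag = open_tags[depth]
        if triggers is None and tag == new_tag:
            # Non-nestable: close everything through the previous occurrence.
            return open_tags[depth:][::-1]
        if triggers is not None and tag in triggers:
            # Reset trigger: close what was opened inside it, keep the trigger.
            return open_tags[depth + 1:][::-1]
    return []


print(tags_to_close(["p"], "p"))                    # ['p']
print(tags_to_close(["blockquote"], "blockquote"))  # []
print(tags_to_close(["table", "tr"], "tr"))         # ['tr']
print(tags_to_close(["tr", "table"], "tr"))         # []
```

The four calls mirror the <p>, <blockquote>, and two <tr> examples: only the first and third produce an implicit close.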
SELF_CLOSING_TAGS = buildTagMap(None, ('br', 'hr', 'input', 'i
PRESERVE_WHITESPACE_TAGS = set(['pre', 'textarea'])
QUOTE_TAGS = {'script': None, 'textarea': None}
NESTABLE_INLINE_TAGS = 'span', 'font', 'q', 'object', 'bdo', '
NESTABLE_BLOCK_TAGS = 'blockquote', 'div', 'fieldset', 'ins',
NESTABLE_LIST_TAGS = {'ol': [], 'ul': [], 'li': ['ul', 'ol'],
NESTABLE_TABLE_TAGS = {'table': [], 'tr': ['table', 'tbody', '
NON_NESTABLE_BLOCK_TAGS = 'address', 'form', 'p', 'pre'
RESET_NESTING_TAGS = buildTagMap(None, NESTABLE_BLOCK_TAGS, 'n
NESTABLE_TAGS = buildTagMap([], NESTABLE_INLINE_TAGS, NESTABLE
CHARSET_RE = re.compile("((^|;)\s*charset=)([^;]*)", re.M)
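
As an illustration of what CHARSET_RE matches, the sketch below applies the same pattern to a Content-Type style value. The sample string and the `replace_charset` helper are assumptions made for this example; this page does not say where the class uses the pattern.

```python
import re

# Same pattern as the class variable, written as a raw string.
CHARSET_RE = re.compile(r"((^|;)\s*charset=)([^;]*)", re.M)


def replace_charset(content, new_charset):
    """Hypothetical helper: swap the charset value, keep the matched prefix."""
    return CHARSET_RE.sub(lambda m: m.group(1) + new_charset, content)


match = CHARSET_RE.search("text/html; charset=ISO-8859-1")
print(match.group(1))  # the prefix up to and including "charset="
print(match.group(3))  # ISO-8859-1
print(replace_charset("text/html; charset=ISO-8859-1", "utf-8"))
# text/html; charset=utf-8
```

Group 1 captures everything up to the value and group 3 captures the value itself, so a replacement can rewrite the charset while leaving the rest of the string untouched.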
The Soup object is initialized as the 'root tag', and the provided markup (which can be a string or a file-like object) is fed into the underlying parser.

sgmllib will process most bad HTML, and the BeautifulSoup class has some tricks for dealing with some HTML that kills sgmllib, but Beautiful Soup can nonetheless choke or lose data if your data uses self-closing tags or declarations incorrectly.

By default, Beautiful Soup uses regexes to sanitize input, avoiding the vast majority of these problems. If the problems don't apply to you, pass in False for markupMassage, and you'll get better performance.

The default parser massage techniques fix the two most common instances of invalid HTML that choke sgmllib:

    <br/> (no space between the tag name and the closing '/>')
    <! --Comment--> (extraneous whitespace in declaration)

You can pass in a custom list of (RE object, replace method) tuples to get Beautiful Soup to scrub your input the way you want.
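
The shape of such a list of (RE object, replace method) tuples can be sketched as follows. The two patterns are written for this example to cover the two cases named above and are not guaranteed to match the library's defaults; `EXAMPLE_MASSAGE` and `massage` are hypothetical names.

```python
import re

# Each entry pairs a compiled pattern with a function that receives the
# match object and returns the replacement text.
EXAMPLE_MASSAGE = [
    # <br/> -> <br />  (insert a space before the closing "/>")
    (re.compile(r"(<[^<>]*)/>"), lambda m: m.group(1) + " />"),
    # <! --Comment--> -> <!--Comment-->  (drop whitespace after "<!")
    (re.compile(r"<!\s+([^<>]*)>"), lambda m: "<!" + m.group(1) + ">"),
]


def massage(markup, rules=EXAMPLE_MASSAGE):
    """Apply every (pattern, replacement) pair to the markup, in order."""
    for pattern, replacement in rules:
        markup = pattern.sub(replacement, markup)
    return markup


print(massage("<br/>"))            # <br />
print(massage("<! --Comment-->"))  # <!--Comment-->
```

A list built this way is what the docstring describes passing as markupMassage; `massage` only imitates the scrubbing step so that the rules can be tried on their own.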
Generated by Epydoc 3.0.1 on Thu Sep 16 13:42:08 2010 | http://epydoc.sourceforge.net |