Class UnicodeDammit

A class for detecting the encoding of a *ML document and converting it to a Unicode string. If the source encoding is windows-1252, it can also replace MS smart quotes with their HTML or XML equivalents.
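
The detect-then-convert behaviour described above can be illustrated with a minimal Python 3 sketch. It is not the class's actual implementation: the function name decode_with_candidates and the default candidate list are assumptions made for illustration, and the sketch simply tries each candidate encoding in order until one decodes cleanly.

```python
def decode_with_candidates(data, candidates=("utf-8", "windows-1252")):
    """Try each candidate encoding in turn (hypothetical helper).

    Returns a (text, encoding) pair for the first encoding that decodes
    the bytes without error, or (None, None) if none of them do.
    """
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: move on to the next one.
            continue
    return None, None


# Valid UTF-8 is accepted by the first candidate.
text, used = decode_with_candidates(b"caf\xc3\xa9")

# Bytes 0x93/0x94 are not valid UTF-8, so the windows-1252 guess is used
# and they decode to curly double quotes (U+201C, U+201D).
quoted, quoted_encoding = decode_with_candidates(b"\x93hi\x94")
```

Trying stricter encodings first matters in a scheme like this: a single-byte encoding such as windows-1252 accepts most byte sequences, so it is best kept as a late fallback.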

Instance Methods

__init__(self, markup, overrideEncodings=[], smartQuotesTo='xml', isHTML=False)

_subMSChar(self, orig)
Changes an MS smart quote character to an XML or HTML entity.

_convertFrom(self, proposed)

_toUnicode(self, data, encoding)
Given a string and its encoding, decodes the string into Unicode.

_detectEncoding(self, xml_data, isHTML=False)
Given a document, tries to detect its XML encoding.

find_codec(self, charset)

_codec(self, charset)

_ebcdic_to_ascii(self, s)

Class Variables

CHARSET_ALIASES = {"macintosh": "mac-roman", "x-sjis": "shift-...
EBCDIC_TO_ASCII_MAP = None
MS_CHARS = {'\x80':('euro', '20AC'), '\x81': ' ', '\x82':('sbq...

Method Details

_toUnicode(self, data, encoding)

Given a string and its encoding, decodes the string into Unicode. encoding is a string recognized by encodings.aliases.
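
As a rough illustration of what "recognized by encodings.aliases" means, the Python 3 sketch below decodes a byte string using an alias name from the standard library's alias table. The function to_unicode is a hypothetical stand-in, not the method documented here, and it does nothing beyond calling bytes.decode.

```python
from encodings.aliases import aliases


def to_unicode(data, encoding):
    """Decode bytes with any encoding name Python's codec registry accepts
    (hypothetical stand-in for illustration only)."""
    return data.decode(encoding)


# The alias table maps alternative spellings to codec module names.
codec_module = aliases["latin1"]  # "latin_1"

# The alias can be passed straight to decode(); 0xE9 is e-acute in Latin-1.
text = to_unicode(b"caf\xe9", "latin1")
```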


Class Variable Details

CHARSET_ALIASES

Value:
{"macintosh": "mac-roman", "x-sjis": "shift-jis"}
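
One plausible use of an alias table like this is to translate charset names that appear in documents into names Python's codec registry accepts. The Python 3 sketch below assumes that purpose; the function resolve_charset is hypothetical and is not claimed to match find_codec or _codec.

```python
import codecs

# Same two entries as the CHARSET_ALIASES value shown above.
CHARSET_ALIASES = {"macintosh": "mac-roman", "x-sjis": "shift-jis"}


def resolve_charset(charset):
    """Return a codec name Python can look up, or None (hypothetical helper)."""
    if not charset:
        return None
    name = charset.lower()
    # Swap in the alias if one is defined, otherwise keep the name as given.
    candidate = CHARSET_ALIASES.get(name, name)
    try:
        codecs.lookup(candidate)
    except LookupError:
        return None
    return candidate


resolved = resolve_charset("x-sjis")           # "shift-jis"
unknown = resolve_charset("no-such-charset")   # None
```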

MS_CHARS

Value:
{'\x80':('euro', '20AC'),
 '\x81': ' ',
 '\x82':('sbquo', '201A'),
 '\x83':('fnof', '192'),
 '\x84':('bdquo', '201E'),
 '\x85':('hellip', '2026'),
 '\x86':('dagger', '2020'),
 '\x87':('Dagger', '2021'),
 '\x88':('circ', '2C6'),
 '\x89':('permil', '2030'),
 '\x8A':('Scaron', '160'),
 '\x8B':('lsaquo', '2039'),
 '\x8C':('OElig', '152'),
 '\x8D': '?',
 '\x8E':('#x17D', '17D'),
 '\x8F': '?',
 '\x90': '?',
 '\x91':('lsquo', '2018'),
 '\x92':('rsquo', '2019'),
 '\x93':('ldquo', '201C'),
 '\x94':('rdquo', '201D'),
 '\x95':('bull', '2022'),
 '\x96':('ndash', '2013'),
 '\x97':('mdash',
...
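
Each tuple in the table pairs an entity name with a hexadecimal code point, which suggests how a smart quote byte could be turned into either an HTML named entity or an XML numeric character reference. The Python 3 sketch below works on that assumption, using only a few entries visible above. The names MS_CHARS_SUBSET and replace_smart_quotes, and the 'html' value for smart_quotes_to, are hypothetical; this is not presented as the code behind _subMSChar.

```python
import re

# A few entries copied from the table above: byte -> (entity name, hex code point).
MS_CHARS_SUBSET = {
    b"\x80": ("euro", "20AC"),
    b"\x85": ("hellip", "2026"),
    b"\x91": ("lsquo", "2018"),
    b"\x92": ("rsquo", "2019"),
    b"\x93": ("ldquo", "201C"),
    b"\x94": ("rdquo", "201D"),
}


def replace_smart_quotes(data, smart_quotes_to="xml"):
    """Replace mapped windows-1252 bytes with entities (hypothetical helper)."""

    def substitute(match):
        entry = MS_CHARS_SUBSET.get(match.group(0))
        if entry is None:
            # Byte is not in this subset: leave it untouched.
            return match.group(0)
        name, hex_code = entry
        if smart_quotes_to == "xml":
            # Numeric character reference, e.g. &#x201C;
            return b"&#x" + hex_code.encode("ascii") + b";"
        # Named entity, e.g. &ldquo;
        return b"&" + name.encode("ascii") + b";"

    # 0x80-0x9F is the byte range covered by the table.
    return re.sub(rb"[\x80-\x9f]", substitute, data)


as_xml = replace_smart_quotes(b"\x93hi\x94")            # b"&#x201C;hi&#x201D;"
as_html = replace_smart_quotes(b"\x93hi\x94", "html")   # b"&ldquo;hi&rdquo;"
```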