Module removeInvalidUTF8

This program removes invalid UTF-8 multibyte sequences from a file, processing it line by line. It also collapses sequences of whitespace into a single space and replaces some invalid XHTML entities.

Usage: $ ./removeInvalidUTF8.py [options]

Command line options:

  -i <file> | --input=<file>  File with invalid UTF-8 control characters
  -o <file> | --output=<file> Output file
  -h        | --help          This text
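
The options above could be parsed along the following lines. This is only a sketch: the use of `getopt` and the helper name `parse_options` are assumptions for illustration, not taken from the module's source.

```python
import getopt


def parse_options(argv):
    """Parse argv (without the program name).

    Hypothetical helper: returns (infile, outfile, show_help).
    Raises getopt.GetoptError on an unknown option.
    """
    opts, _args = getopt.getopt(argv, "i:o:h", ["input=", "output=", "help"])
    infile = outfile = None
    show_help = False
    for opt, value in opts:
        if opt in ("-i", "--input"):
            infile = value
        elif opt in ("-o", "--output"):
            outfile = value
        elif opt in ("-h", "--help"):
            show_help = True
    return infile, outfile, show_help
```

A caller would typically pass `sys.argv[1:]` and print the usage text when `show_help` is set or a file name is missing.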

Author: Johannes Schwenk

Copyright: 2010, Johannes Schwenk

Version: 2.0

Date: 2010-09-15

Functions
 
main(argv)
Parse the command line options and call the function removeInvalidUTF8FromFile to do all the real work.
 
removeInvalidUTF8FromFile(infile, outfile)
Opens the input and output files, telling the codec to replace each invalid multibyte sequence with '\ufffd'.
 
version()
Displays version information for this program.
 
usage()
Display help on program usage and version information.
Function Details

removeInvalidUTF8FromFile(infile, outfile)

Opens the input and output files, telling the codec to replace each invalid multibyte sequence with '\ufffd'. It also replaces each sequence of whitespace with a single space, using a regular expression. Finally, a range of Unicode control characters is removed.
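
The behaviour described above can be sketched in Python 3 roughly as follows. This is not the module's actual source: the snake_case function name, the helper `clean_line`, and the exact control-character range are assumptions, and the XHTML entity replacement is omitted because the documentation does not say which entities are affected.

```python
import re

# Runs of whitespace are collapsed to a single space.
WHITESPACE_RUN = re.compile(r"\s+")
# Assumed range: C0 controls, DEL and C1 controls. The range the
# original program removes is not documented.
CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f-\x9f]")


def clean_line(line):
    """Collapse whitespace first, then drop remaining control characters."""
    line = WHITESPACE_RUN.sub(" ", line)
    return CONTROL_CHARS.sub("", line)


def remove_invalid_utf8_from_file(infile, outfile):
    """Copy infile to outfile line by line, cleaning each line.

    errors="replace" makes the UTF-8 decoder substitute U+FFFD
    for every byte sequence it cannot decode.
    """
    with open(infile, "r", encoding="utf-8", errors="replace") as src, \
         open(outfile, "w", encoding="utf-8", newline="\n") as dst:
        for line in src:
            # Strip the line ending first so it is not collapsed to a space.
            body = line.rstrip("\r\n")
            dst.write(clean_line(body) + "\n")
```

Because the whitespace pass runs first, control characters that Python treats as whitespace (tab, vertical tab, form feed and similar) end up as a single space rather than being deleted; reversing the two steps would delete them instead.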