.TH "HXUNENT" "1" "10 Jul 2011" "7.x" "HTML-XML-utils"
.de d \" begin display
.sp
.in +4
.nf
.ft CR
.CDS
..
.de e \" end display
.CDE
.in -4
.fi
.ft R
.sp
..
.SH NAME
hxunent \- replace HTML predefined character entities by UTF-8
.SH SYNOPSIS
.B hxunent
.RB "[\| " \-b " \|]"
.RB "[\| " \-f " \|]"
.RI "[\| " file " \|]"
.SH DESCRIPTION
.LP
The
.B hxunent
command reads the
.I file
(or standard input) and copies it to standard output with &-entities
replaced by their equivalent character (encoded as UTF-8). E.g., &quot; is
replaced by " and &lt; is replaced by <.
.SH OPTIONS
The following options are supported:
.TP 10
.B \-b
The five built-in entities of XML (&lt; &gt; &quot; &apos; &amp;) are not
replaced but copied unchanged. This is necessary if the output has to
be valid XML or SGML.
.TP
.B \-f
This option changes how unknown entities and lone ampersands are
handled. Normally they are copied unchanged, but this option tries to
"fix" them by replacing the ampersand by &amp;. Such stray ampersands
are often the result of copying and pasting URLs into a document, in
which case this option indeed fixes them and makes the document valid.
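.SH EXAMPLES
.LP
The following invocations are an illustrative sketch of the behavior
described above, not output taken from the program's documentation.
They assume a POSIX shell; the sample input strings are made up for
the purpose.
.d
echo '&lt;p&gt;caf&eacute;&lt;/p&gt;' | hxunent
echo '&lt;p&gt;caf&eacute;&lt;/p&gt;' | hxunent \-b
echo 'a.cgi?x=1&y=2' | hxunent \-f
.e
The first command should replace all three kinds of entity, printing
literal angle brackets and the accented letter encoded as UTF-8. The
second should replace only &eacute; and leave &lt; and &gt; as they
are. In the third, the ampersand does not start a known entity, so
.B \-f
should turn it into &amp;.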
.SH "DIAGNOSTICS"
The program's exit value is 0 if all went well, otherwise:
.TP 10
.B 1
The input couldn't be read (file not found, file not readable, etc.).
.TP
.B 2
Wrong command line arguments.
.SH "SEE ALSO"
.BR asc2xml (1),
.BR xml2asc (1),
.BR UTF-8 " (RFC 3629, which obsoletes RFC 2279)"
.SH BUGS
.LP
The program assumes entities are as defined by HTML. It doesn't read a
document's DTD to find the actual definitions in use in a document.
With
.BR \-f ,
it will even remove all entities that are not HTML entities.