.\" -*- mode: troff; coding: utf-8 -*-
.\" Automatically generated by Pod::Man 5.01 (Pod::Simple 3.43)
.\"
.\" Standard preamble:
.\" ========================================================================
.de Sp \" Vertical space (when we can't use .PP)
.if t .sp .5v
.if n .sp
..
.de Vb \" Begin verbatim text
.ft CW
.nf
.ne \\$1
..
.de Ve \" End verbatim text
.ft R
.fi
..
.\" \*(C` and \*(C' are quotes in nroff, nothing in troff, for use with C<>.
.ie n \{\
.    ds C` ""
.    ds C' ""
'br\}
.el\{\
.    ds C`
.    ds C'
'br\}
.\"
.\" Escape single quotes in literal strings from groff's Unicode transform.
.ie \n(.g .ds Aq \(aq
.el       .ds Aq '
.\"
.\" If the F register is >0, we'll generate index entries on stderr for
.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index
.\" entries marked with X<> in POD.  Of course, you'll have to process the
.\" output yourself in some meaningful fashion.
.\"
.\" Avoid warning from groff about undefined register 'F'.
.de IX
..
.nr rF 0
.if \n(.g .if rF .nr rF 1
.if (\n(rF:(\n(.g==0)) \{\
.    if \nF \{\
.        de IX
.        tm Index:\\$1\t\\n%\t"\\$2"
..
.        if !\nF==2 \{\
.            nr % 0
.            nr F 2
.        \}
.    \}
.\}
.rr rF
.\" ========================================================================
.\"
.IX Title "Test::utf8 3"
.TH Test::utf8 3 2023-07-26 "perl v5.38.0" "User Contributed Perl Documentation"
.\" For nroff, turn off justification.  Always turn off hyphenation; it makes
.\" way too many mistakes in technical documents.
.if n .ad l
.nh
.SH NAME
Test::utf8 \- handy utf8 tests
.SH SYNOPSIS
.IX Header "SYNOPSIS"
.Vb 3
\&  # check the string is good
\&  is_valid_string($string);   # check the string is valid
\&  is_sane_utf8($string);      # check not double encoded
\&
\&  # check the string has certain attributes
\&  is_flagged_utf8($string1);   # has utf8 flag set
\&  is_within_ascii($string2);   # only has ascii chars in it
\&  isnt_within_ascii($string3); # has chars outside the ascii range
\&  is_within_latin_1($string4); # only has latin\-1 chars in it
\&  isnt_within_latin_1($string5); # has chars outside the latin\-1 range
.Ve
.SH DESCRIPTION
.IX Header "DESCRIPTION"
This module is a collection of tests useful for dealing with utf8 strings in
Perl.
.PP
This module has two types of tests: the validity tests check that a string is
valid and not corrupt, whereas the characteristics tests check that a string
has a given set of characteristics.
.SS "Validity Tests"
.IX Subsection "Validity Tests"
.ie n .IP "is_valid_string($string, $testname)" 4
.el .IP "is_valid_string($string, \f(CW$testname\fR)" 4
.IX Item "is_valid_string($string, $testname)"
Checks if the string is "valid", i.e. this passes and returns true unless
the internal utf8 flag has been set on a scalar that isn't made up of a valid
utf\-8 byte sequence.
.Sp
This should \fInever\fR happen and, in theory, this test should always pass. Unless
you (or a module you use) go monkeying around inside a scalar using Encode's
private functions or XS code, you shouldn't ever end up in a situation where
you've got a corrupt scalar.  But if you do, this function should help you
detect the problem.
.Sp
To be clear, here's an example of the error case this can detect:
.Sp
.Vb 4
\&  my $mark = "Mark";
\&  my $leon = "L\ex{e9}on";
\&  is_valid_string($mark);  # passes, not utf\-8
\&  is_valid_string($leon);  # passes, not utf\-8
\&
\&  my $iloveny = "I \ex{2665} NY";
\&  is_valid_string($iloveny);      # passes, proper utf\-8
\&
\&  my $acme = "L\ex{c3}\ex{a9}on";
\&  Encode::_utf8_on($acme);      # (please don\*(Aqt do things like this)
\&  is_valid_string($acme);       # passes, proper utf\-8 byte sequence upgraded
\&
\&  Encode::_utf8_on($leon);      # (this is why you don\*(Aqt do things like this)
\&  is_valid_string($leon);       # fails! the byte \ex{e9} isn\*(Aqt valid utf\-8
.Ve
.ie n .IP "is_sane_utf8($string, $name)" 4
.el .IP "is_sane_utf8($string, \f(CW$name\fR)" 4
.IX Item "is_sane_utf8($string, $name)"
This test fails if the string contains something that looks like it
might be dodgy utf8, i.e. something that looks like the multi-byte
sequence for a latin\-1 character but that perl hasn't been instructed
to treat as such.  Strings that contain no such sequences always pass.
.Sp
Some examples may help:
.Sp
.Vb 2
\&  # This will pass as it\*(Aqs a normal latin\-1 string
\&  is_sane_utf8("Hello L\ex{e9}on");
\&
\&  # this will fail because the \ex{c3}\ex{a9} looks like the
\&  # utf8 byte sequence for e\-acute
\&  my $string = "Hello L\ex{c3}\ex{a9}on";
\&  is_sane_utf8($string);
\&
\&  # this will pass because the utf8 is correctly interpreted as utf8
\&  Encode::_utf8_on($string);
\&  is_sane_utf8($string);
.Ve
.Sp
Obviously this isn't a hundred percent reliable.  The edge case where
this will wrongly fail is where you have \f(CW\*(C`\ex{c2}\*(C'\fR (which is "LATIN CAPITAL
LETTER A WITH CIRCUMFLEX") or \f(CW\*(C`\ex{c3}\*(C'\fR (which is "LATIN CAPITAL LETTER
A WITH TILDE") followed by one of the latin\-1 punctuation symbols.
.Sp
.Vb 4
\&  # a capital letter A with circumflex surrounded by smart quotes
\&  # this will fail because it\*(Aqll see the "\ex{c2}\ex{94}" and think
\&  # it\*(Aqs actually the utf8 sequence for the end smart quote
\&  is_sane_utf8("\ex{93}\ex{c2}\ex{94}");
.Ve
.Sp
However, since this rarely comes up, this test is reasonably reliable
in most cases.  Still, take care in cases where dynamic data is placed
next to latin\-1 punctuation, as this can produce spurious failures.
.Sp
There are two situations that cause this test to fail.  Either the string
contains utf8 byte sequences but hasn't been flagged as utf8 (this
normally means that you got it from an external source like a C library;
when Perl needs to store a string internally as utf8 it does its own
encoding and flagging transparently), or a utf8 flagged string contains
byte sequences that, when translated to characters, themselves look like
a utf8 byte sequence.  The test diagnostics tell you which is the case.
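.Sp
As a rough sketch of the two failure cases (the variable names are
illustrative, and the double encoding is simulated here with Encode's
public \f(CW\*(C`encode\*(C'\fR and \f(CW\*(C`decode\*(C'\fR functions rather than taken from a real
external source):
.Sp
.Vb 3
\&  use Test::More;
\&  use Test::utf8;
\&  use Encode qw(encode decode);
\&
\&  # case 1: raw utf8 bytes in a string that was never decoded (unflagged)
\&  my $bytes = encode("UTF\-8", "L\ex{e9}on");   # the bytes 4c c3 a9 6f 6e
\&  is_sane_utf8($bytes);                       # expected to fail
\&
\&  # case 2: the utf8 bytes were misread as latin\-1, giving a flagged
\&  # string whose characters \ex{c3}\ex{a9} look like a utf8 sequence
\&  my $double = decode("iso\-8859\-1", $bytes);
\&  is_sane_utf8($double);                      # expected to fail
\&
\&  # decoding the bytes as utf8 gives the intended string
\&  is_sane_utf8(decode("UTF\-8", $bytes));      # expected to pass
\&
\&  done_testing();
.Ve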
.SS "String Characteristic Tests"
.IX Subsection "String Characteristic Tests"
These routines allow you to check the range of characters in a string.
Note that these routines are blind to the actual encoding perl
internally uses to store the characters; they just check whether the
string contains only characters that can be represented in the named
encoding:
.IP is_within_ascii 4
.IX Item "is_within_ascii"
Tests that a string only contains characters that are in the ASCII
character set.
.IP is_within_latin_1 4
.IX Item "is_within_latin_1"
Tests that a string only contains characters that are in latin\-1.
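.PP
To illustrate that these checks look at characters rather than at the
internal storage, here is a small sketch (the variable name is
illustrative) that uses the core \f(CW\*(C`utf8::upgrade\*(C'\fR function to change only
the internal representation of a string:
.PP
.Vb 2
\&  use Test::More;
\&  use Test::utf8;
\&
\&  my $leon = "L\ex{e9}on";
\&  utf8::upgrade($leon);          # same characters, now stored as utf8
\&
\&  is_flagged_utf8($leon);        # expected to pass, the flag is now set
\&  is_within_latin_1($leon);      # expected to pass, \ex{e9} is in latin\-1
\&
\&  # expected to fail, \ex{2665} is outside latin\-1
\&  is_within_latin_1("I \ex{2665} NY");
\&
\&  done_testing();
.Ve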
.PP
These routines simply check whether a scalar is or isn't flagged as utf8
by perl's internals:
.ie n .IP "is_flagged_utf8($string, $name)" 4
.el .IP "is_flagged_utf8($string, \f(CW$name\fR)" 4
.IX Item "is_flagged_utf8($string, $name)"
Passes if the string is flagged by perl's internals as utf8, fails if
it's not.
.IP isnt_flagged_utf8($string,$name) 4
.IX Item "isnt_flagged_utf8($string,$name)"
The opposite of \f(CW\*(C`is_flagged_utf8\*(C'\fR, passes if and only if the string
isn't flagged as utf8 by perl's internals.
.Sp
Note: you can refer to this function as \f(CW\*(C`isn\*(Aqt_flagged_utf8\*(C'\fR if you
really want to.
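.Sp
A brief sketch of how the two flag tests might be used together (the
strings are illustrative, and the expected results assume the literals
are compiled without \f(CW\*(C`use utf8\*(C'\fR in effect):
.Sp
.Vb 2
\&  use Test::More;
\&  use Test::utf8;
\&
\&  my $mark = "Mark";
\&  isnt_flagged_utf8($mark);      # expected to pass, plain ascii literal
\&
\&  my $iloveny = "I \ex{2665} NY";
\&  is_flagged_utf8($iloveny);     # expected to pass, \ex{2665} needs utf8 storage
\&
\&  utf8::upgrade($mark);          # same characters, different storage
\&  is_flagged_utf8($mark);        # expected to pass after the upgrade
\&
\&  done_testing();
.Ve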
.SH AUTHOR
.IX Header "AUTHOR"
Written by Mark Fowler \fBmark@twoshortplanks.com\fR
.SH COPYRIGHT
.IX Header "COPYRIGHT"
Copyright Mark Fowler 2004,2012.  All rights reserved.
.PP
This program is free software; you can redistribute it
and/or modify it under the same terms as Perl itself.
.SH BUGS
.IX Header "BUGS"
None known.  Please report any to me via the CPAN RT system.  See
http://rt.cpan.org/ for more details.
.SH "SEE ALSO"
.IX Header "SEE ALSO"
Test::DoubleEncodedEntities for testing for double encoded HTML
entities.