! Aware to Perl: How can I match strings with multibyte characters?

How can I match strings with multibyte characters?


This is hard, and there's no good way. Perl does not directly support wide characters. It pretends that a byte and a character are synonymous. The following set of approaches was offered by Jeffrey Friedl, whose article in issue #5 of The Perl Journal talks about this very matter.

Let's suppose you have some weird Martian encoding where pairs of ASCII uppercase letters encode single Martian letters (for example, the two bytes "CV" make a single Martian letter, as do the two bytes "SG", "VS", "XX", etc.). Other bytes represent single characters, just like ASCII.

So, the Martian string "I am CVSGXX!" uses 12 bytes to encode the nine characters 'I', ' ', 'a', 'm', ' ', 'CV', 'SG', 'XX', '!'.

Now, say you want to search for the single character /GX/. Perl doesn't know about Martian, so it'll find the two bytes "GX" in the "I am CVSGXX!" string, even though that character isn't there: it just looks like it is because "SG" is next to "XX", but there's no real "GX". This is a big problem.
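
The false match is easy to reproduce. The following is a minimal sketch that uses only the example string from above; the variable names are illustrative, and the splitting pattern is one possible reading of the Martian rules (a pair of uppercase letters, or else any single byte).

```perl
use strict;
use warnings;

my $martian = "I am CVSGXX!";

# Perl counts bytes, so it sees 12 "characters" rather than nine.
my $byte_count = length $martian;

# Split by Martian rules: a pair of uppercase letters is one character,
# any other single byte (including the lone "I") is a character by itself.
my @chars = $martian =~ m/([A-Z][A-Z]|.)/gs;

# A plain byte-wise match reports GX, although no Martian character GX exists.
my $false_hit = $martian =~ /GX/ ? 1 : 0;
my $real_hit  = (grep { $_ eq 'GX' } @chars) ? 1 : 0;

print "bytes: $byte_count, characters: ", scalar(@chars), "\n";
print "byte-wise match: $false_hit, character-wise match: $real_hit\n";
```

Run as written, this should print "bytes: 12, characters: 9" followed by "byte-wise match: 1, character-wise match: 0".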

Here are a few ways, all painful, to deal with it:

   $martian =~ s/([A-Z][A-Z])/ $1 /g; # Make sure adjacent "martian" bytes
                                      # are no longer adjacent.
   print "found GX!\n" if $martian =~ /GX/;

Or like this:

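   @chars = $martian =~ m/([A-Z][A-Z]|.)/gs;
   # above is conceptually similar to:     @chars = $text =~ m/(.)/g;
   # (the "." alternative also keeps a lone uppercase byte such as "I",
   # which [^A-Z] would silently skip)
   foreach $char (@chars) {
       if ($char eq 'GX') {
           # print first, then leave the loop; "print ..., last" would run
           # last while building print's argument list and print nothing
           print "found GX!\n";
           last;
       }
   }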

Or like this:

   while ($martian =~ m/\G([A-Z][A-Z]|.)/gs) {  # \G probably unneeded
       if ($1 eq 'GX') {
           print "found GX!\n";  # print before last, or nothing is printed
           last;
       }
   }

Or like this:

   die "sorry, Perl doesn't (yet) have Martian support )-:\n";
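
The tokenizing approaches above can be wrapped in a small helper so that the search works on Martian characters instead of bytes. This is only a sketch: the subroutine names martian_chars and martian_index are hypothetical and not part of Perl or any module.

```perl
use strict;
use warnings;

# Hypothetical helper: split a byte string into Martian characters.
# A pair of ASCII uppercase letters is one character; any other byte
# (including a lone uppercase letter) is a character on its own.
sub martian_chars {
    my ($bytes) = @_;
    return $bytes =~ m/\G([A-Z][A-Z]|.)/gs;
}

# Hypothetical helper: character index of the first occurrence of $want,
# or -1 if that Martian character does not occur.
sub martian_index {
    my ($bytes, $want) = @_;
    my @chars = martian_chars($bytes);
    for my $i (0 .. $#chars) {
        return $i if $chars[$i] eq $want;
    }
    return -1;
}

print martian_index("I am CVSGXX!", "GX"), "\n";  # -1: GX is not a real character here
print martian_index("I am CVSGXX!", "SG"), "\n";  # 6: seventh Martian character
```

The index returned counts Martian characters, not bytes, so it cannot be passed directly to substr on the original byte string.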

In addition, a sample program which converts half-width to full-width katakana (in Shift-JIS or EUC encoding) is available from CPAN as

There are many double- (and multi-) byte encodings commonly used these days. Some versions of these have 1-, 2-, 3-, and 4-byte characters, all mixed.
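
The same tokenizing idea extends to encodings that mix character widths. The sketch below uses UTF-8 as the example, which is an assumption of this illustration and is not named in the text above. The pattern is deliberately simplified: it recognizes the general shape of 1- to 4-byte sequences but does not reject every invalid one.

```perl
use strict;
use warnings;

# Simplified UTF-8 shapes: one lead byte followed by 0 to 3 continuation
# bytes. Some invalid sequences (for example certain overlong forms) are
# not rejected by this pattern.
my $utf8_char = qr/[\x00-\x7F]
                 | [\xC2-\xDF][\x80-\xBF]
                 | [\xE0-\xEF][\x80-\xBF]{2}
                 | [\xF0-\xF4][\x80-\xBF]{3}/x;

# "a", U+00E9 (2 bytes), U+20AC (3 bytes), U+1F600 (4 bytes), as raw bytes.
my $bytes = "a\xC3\xA9\xE2\x82\xAC\xF0\x9F\x98\x80";

# \G keeps each match anchored where the previous one ended, so the
# tokenizer stops at the first byte that fits none of the shapes.
my @chars  = $bytes =~ m/\G($utf8_char)/g;
my @widths = map { length $_ } @chars;

print "bytes: ", length($bytes), ", characters: ", scalar(@chars), "\n";
print "widths: @widths\n";
```

Here the 10 input bytes split into four characters of widths 1, 2, 3, and 4.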

Source: Perl FAQ: Regexps
Copyright: Copyright (c) 1997 Tom Christiansen and Nathan Torkington.

(Corrections, notes, and links courtesy of RocketAware.com)
