
How do I remove HTML from a string?


The most correct way (albeit not the fastest) is to use HTML::Parse from CPAN, part of the libwww-perl distribution, which is a must-have for all web hackers.
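As a rough sketch of the parser-based route, the following uses HTML::Parser rather than the HTML::Parse interface named above; that swap is an assumption of this example, made because HTML::Parse's own documentation describes it as a backwards-compatibility module. The sketch also assumes HTML::Parser is installed from CPAN. The helper name html_to_text is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser ();    # assumption: installed from CPAN, not a core module

# Hypothetical helper: collect only the text content of an HTML string.
sub html_to_text {
    my ($html) = @_;
    my $text = '';
    my $parser = HTML::Parser->new(
        api_version => 3,
        # The 'dtext' argspec hands the callback text with entities
        # such as &lt; already decoded.
        text_h => [ sub { $text .= $_[0] }, 'dtext' ],
    );
    $parser->parse($html);
    $parser->eof;        # signal end of document and flush buffered text
    return $text;
}

print html_to_text(qq{<p title="a > b">1 &lt; 2</p>}), "\n";
```

Because the parser tokenizes tags itself, the quoted ">" inside the attribute is not mistaken for the end of the tag, and entity decoding comes along with the text handler.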

Many folks attempt a simple-minded regular expression approach, like s/<.*?>//g, but that fails in many cases because tags may continue over line breaks, they may contain quoted angle brackets, or HTML comments may be present. Plus, folks forget to convert entities, like &lt; for example.
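To make those failure modes concrete, here is a small sketch that runs the naive substitution over one input per problem. The helper name naive_strip and the sample strings are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The naive substitution quoted above, wrapped in a hypothetical helper.
sub naive_strip {
    my ($html) = @_;
    $html =~ s/<.*?>//g;
    return $html;
}

my @cases = (
    # Tag spanning a line break: without /s, "." will not cross "\n",
    # so the opening tag survives. Result: <a\nhref="x.html">link
    qq{<a\nhref="x.html">link</a>},

    # Quoted ">" in an attribute: the match ends at the first ">".
    # Result:  b" src="i.png">text
    q{<img alt="a > b" src="i.png">text},

    # Comment containing ">": cut short the same way.
    # Result:  b) -->text
    q{<!-- if (a > b) -->text},

    # Entities are left undecoded.
    # Result: 1 &lt; 2
    q{<b>1 &lt; 2</b>},
);

# Brackets make leading spaces in the leftovers visible.
print "[", naive_strip($_), "]\n" for @cases;
```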

Here's one "simple-minded" approach that works for most files:

    #!/usr/bin/perl -p0777
    s/<(?:[^>'"]*|(['"]).*?\1)*>//gs
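In the one-liner, -0777 makes Perl read each input file in one gulp and -p applies the substitution and prints the result, so tags that span lines are seen whole. The same substitution can be sketched as a subroutine for use inside a larger program. The names strip_tags and decode_basic_entities are hypothetical, and the four-entry entity table is only an illustration of the entity-conversion step, not a complete decoder.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The substitution from the one-liner, wrapped in a hypothetical helper.
# A tag body is any run of non-quote, non-">" characters or a complete
# quoted string, so a ">" inside quotes does not end the tag.
sub strip_tags {
    my ($html) = @_;
    $html =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
    return $html;
}

# Hypothetical, deliberately tiny entity table; real pages use many more.
my %entity = (lt => '<', gt => '>', quot => '"', amp => '&');

sub decode_basic_entities {
    my ($text) = @_;
    # One pass only, so "&amp;lt;" becomes "&lt;" and is not decoded twice.
    $text =~ s/&(lt|gt|quot|amp);/$entity{$1}/g;
    return $text;
}

my $html = qq{<a\nhref="x.html" title="a > b">1 &lt; 2</a>};
print decode_basic_entities(strip_tags($html)), "\n";    # prints "1 < 2"

# Known gap: a ">" inside an HTML comment still ends the match early.
print strip_tags(q{<!-- if (a > b) -->text}), "\n";      # prints " b) -->text"
```

The second call shows why this only works for most files: comments are not treated specially, so anything after a ">" inside a comment leaks into the output.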

If you want a more complete solution, see the 3-stage striphtml program in http://www.perl.com/CPAN/authors/Tom_Christiansen/scripts/striphtml.gz .


Source: Perl FAQ: Networking
Copyright: Copyright (c) 1997 Tom Christiansen and Nathan Torkington.


