Convert utf8 into html entities in Perl

Andrew Newby

I've inherited a horrible site, so please forgive this question :) Rather than upgrading the site to full UTF-8, we need to convert any non-ASCII characters into HTML entities of the form:

&#ord_value;

This test script does this for one character:

$foo =~ s/(\x{ed})/to_ord($1)/e;

sub to_ord {
    return ("&#" . ord($_[0]). ";")
}

What I need to do, though, is trigger this on anything with an ord value greater than 127. Is there an easy way to do this? I've looked into the character classes but can't see anything that fits the bill.

FWIW, I've made them aware that the way they currently store the data is horrible, and will cause issues with people trying to search on the HTML entities - but this is out of my control.

UPDATE: This works, but I'm sure there must be a better way to do it - so please do share if you have a suggestion :)

s/([^a-z \.,-_0-9])/to_ord($1)/eg
perl

Answers

answered 3 months ago ikegami #1

s/(...)/ ... /eg;

Choices of patterns:

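  • [^\x00-\x21\x23-\x25\x28-\x3B\x3D\x3F-\x7F] (Escape non-ASCII, plus the HTML-special characters " & ' < >.)
  • [^\x09\x0A\x0D\x20-\x21\x23-\x25\x28-\x3B\x3D\x3F-\x7E] (Escape non-ASCII and control characters other than tab, LF and CR, plus the same HTML-special characters.)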
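
To check what each class actually matches, something like the following can help. This is only a sketch: the helper name matched_chars and the sample string are made up for illustration.

```perl
use strict;
use warnings;

# The two character classes from the list above, precompiled.
my $non_ascii         = qr/[^\x00-\x21\x23-\x25\x28-\x3B\x3D\x3F-\x7F]/;
my $non_ascii_or_ctrl = qr/[^\x09\x0A\x0D\x20-\x21\x23-\x25\x28-\x3B\x3D\x3F-\x7E]/;

# Hypothetical helper: return the code points of every character in $str
# that the given pattern matches.
sub matched_chars {
    my ($pattern, $str) = @_;
    return map { ord($_) } $str =~ /($pattern)/g;
}

# Sample input (an assumption): "a", BEL (0x07), "<", i-acute (0xED), TAB.
my $sample = "a\x{07}<\x{ED}\t";

my @first  = matched_chars($non_ascii,         $sample);  # "<" and i-acute
my @second = matched_chars($non_ascii_or_ctrl, $sample);  # BEL, "<" and i-acute

print join(",", @first),  "\n";  # 60,237
print join(",", @second), "\n";  # 7,60,237
```

The tab is left alone by both classes, while BEL is only caught by the second one.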

Choices of replacement expressions:

  • "&#".ord($1).";"
  • sprintf("&#x%X;", ord($1)) (Extra CPU, but reduced bandwidth.)
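
For a side-by-side look at the two replacement expressions, here is a small sketch using the character from the question. The sub names dec_entity and hex_entity are hypothetical wrappers, not anything from the original code.

```perl
use strict;
use warnings;

# Decimal form, as in the first replacement expression.
sub dec_entity {
    my ($char) = @_;
    return "&#" . ord($char) . ";";
}

# Hexadecimal form, as in the second replacement expression.
sub hex_entity {
    my ($char) = @_;
    return sprintf("&#x%X;", ord($char));
}

my $char = "\x{ED}";  # LATIN SMALL LETTER I WITH ACUTE, as in the question

print dec_entity($char), "\n";  # &#237;
print hex_entity($char), "\n";  # &#xED;
```

Both forms are numeric character references for the same code point, so a browser renders them identically.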

For example,

s/([^\x09\x0A\x0D\x20-\x21\x23-\x25\x28-\x3B\x3D\x3F-\x7E])/ sprintf("&#x%X;", ord($1)) /eg;
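
One thing to watch: ord() has to see characters, not UTF-8 bytes, or each byte of a multi-byte sequence gets its own entity. Below is a minimal end-to-end sketch, assuming the data arrives as raw UTF-8 bytes. The sub name escape_html_chars and the sample input are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical wrapper: decode UTF-8 bytes into characters, then apply the
# substitution from the example above.
sub escape_html_chars {
    my ($bytes) = @_;

    # Assumption: the input is UTF-8 encoded. After decoding, ord() returns
    # the code point of each character rather than a single byte value.
    my $text = decode('UTF-8', $bytes);

    $text =~ s/([^\x09\x0A\x0D\x20-\x21\x23-\x25\x28-\x3B\x3D\x3F-\x7E])/ sprintf("&#x%X;", ord($1)) /eg;
    return $text;
}

# "cafe" with an e-acute (two bytes in UTF-8), followed by a tag.
my $bytes = "caf\xC3\xA9 <b>";
print escape_html_chars($bytes), "\n";  # caf&#xE9; &#x3C;b&#x3E;
```

Note that this pattern also escapes < and >, so it suits plain text. If the stored data already contains HTML markup that must survive, a narrower class would be needed.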
