Simple solution to strange ֪ characters entities

emiddleton · n00b Joined: 22 Sep 2002 Posts: 8

Would it be possible to filter the output with something like the following?

/&amp;#/&#/

There is a bug somewhere in the forum's version of phpBB that converts every & into &amp;, which destroys all the character entities used to encode non-European languages under the iso-8859-1 encoding. If you make this change, it is possible to see the correct characters in at least some browsers (obviously only if the correct fonts are installed).
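A minimal sketch of that filter, written in Python rather than the forum's PHP. The function name `restore_numeric_entities` is made up for illustration, and the sketch assumes the filter runs on output that has already been escaped:

```python
def restore_numeric_entities(escaped_html: str) -> str:
    """Turn the double-escaped prefix "&amp;#" back into "&#".

    Hypothetical helper: it mirrors the plain substitution suggested
    above and leaves every other "&amp;" in the text untouched.
    """
    return escaped_html.replace("&amp;#", "&#")


sample = "&amp;#1502; and a plain &amp; sign"
print(restore_numeric_entities(sample))
```

Only an ampersand that is immediately followed by # is changed, so an ordinary escaped & in a post stays escaped.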

For more information about character entities, see:
http://www.w3.org/TR/2000/REC-xml-20001006#sec-references

rac · Posted: Mon Jan 27, 2003 6:40 pm

In case anyone is wondering, the regex reads: /& amp;#/&#/ (disregard the space: I included it to defeat the feature being discussed). We need to study the issue a bit to make sure there are no unwanted side effects, but thanks a lot for the suggestion.

My current feeling is that this would break a lot of posts with '&' in them, and that's not acceptable. Do you have a suggestion for a way around this, or can you convince me I'm being silly?
_________________
For every higher wall, there is a taller ladder

emiddleton · n00b Joined: 22 Sep 2002 Posts: 8

Thanks for the response.

(All spaces in the following character entities are put in to stop the conversion)

Could you give an example of how not converting & characters to & amp; could cause the & character to display incorrectly?

The reference I quoted above explains what character entities are. Basically, any Unicode character that can't be represented in the current encoding (the forum uses iso-8859-1, so anything that isn't a Western European language) is converted into one of two possible sequences.

' & # ' [0-9]+ ';' decimal
' & # x' [0-9a-fA-F]+ ';' hexadecimal

e.g. & # 1502;
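Because the two forms are this strict, a filter can be limited to well-formed references only. The following is a rough Python sketch under that assumption; the names `DOUBLE_ESCAPED_REFERENCE` and `restore_well_formed_references` are hypothetical, and this is not what phpBB itself does:

```python
import re

# "&amp;" followed by the body of a well-formed numeric reference:
#   '#' [0-9]+ ';'           decimal
#   '#x' [0-9a-fA-F]+ ';'    hexadecimal
DOUBLE_ESCAPED_REFERENCE = re.compile(r"&amp;(#(?:[0-9]+|x[0-9a-fA-F]+);)")


def restore_well_formed_references(escaped_html: str) -> str:
    """Unescape "&amp;" only when a complete numeric reference follows it."""
    return DOUBLE_ESCAPED_REFERENCE.sub(r"&\1", escaped_html)


for text in ("&amp;#1502;", "&amp;#x5DE;", "AT&amp;T", "&amp;#include"):
    print(text, "->", restore_well_formed_references(text))
```

Text such as an escaped & in "AT&T", or an & followed by # and something other than digits and a semicolon, passes through unchanged.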

The number is the character's code point in the ISO/IEC 10646 character set. It's really not all that difficult. These are not random sequences of letters and numbers, and they may not contain spaces. You could also change the encoding to utf-8 (which encodes English characters exactly as ASCII does), which would cause these characters to be encoded without the use of character entities. If you go this way, be careful to specify the encoding in the HTTP headers as well.
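To make the difference concrete, here is a small Python sketch using the example code point 1502. The variable names are illustrative, and the header string only shows the general shape of a Content-Type declaration, not the forum's actual configuration:

```python
# Code point 1502 is U+05DE, HEBREW LETTER MEM.
mem = chr(1502)

# iso-8859-1 has no byte for this character, so it can only be carried
# as a numeric character reference.
as_latin1 = mem.encode("iso-8859-1", errors="xmlcharrefreplace")

# utf-8 encodes the same character directly, with no entity involved.
as_utf8 = mem.encode("utf-8")

# Plain English text is byte-for-byte the same in ASCII and utf-8.
assert "forum".encode("utf-8") == "forum".encode("ascii")

# Illustrative header only: the charset has to be declared to the browser.
content_type_header = "Content-Type: text/html; charset=utf-8"

print(as_latin1)  # b'&#1502;'
print(as_utf8)    # b'\xd7\x9e'
print(content_type_header)
```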

The Unicode standard page is at

http://www.unicode.org/standard/

(Unicode uses the same character set as ISO/IEC 10646)