HOWTO: Using UTF-8 on Gentoo (edited)

gna · n00b Joined: 19 Mar 2003 Posts: 38 Location: Beijing

I think all the packages mentioned in the howto have ebuilds. Can you be a bit more precise about what kind of ebuild?

skyfolly · Posted: Tue Jul 27, 2004 8:52 am

Gatak · Apprentice Joined: 04 Jan 2004 Posts: 174

I have one problem with UTF-8: I cannot mount a Windows XP share with UTF-8. All extended characters come out wrong or are simply missing.

But if I mount a Samba share from Windows XP, UTF-8 works.

I tried mount -o iocharset=utf8 with no luck.

EDIT: It works now with:

gna · n00b Joined: 19 Mar 2003 Posts: 38 Location: Beijing

Actually you are still using samba to mount your Windows XP partition. That is what the

Gatak · Apprentice Joined: 04 Jan 2004 Posts: 174

I think you are misunderstanding what I wanted to do. I am not mounting a partition, but a Windows share over the network.

gna · n00b Joined: 19 Mar 2003 Posts: 38 Location: Beijing

I have tried this on a Win2k share and am also having similar problems.
Why did you choose cp850?
Is cp850 the default codepage on your Windows XP?
What is the default nls in your kernel?

thanks

Gatak · Apprentice Joined: 04 Jan 2004 Posts: 174

The codepage should be irrelevant when using UTF-8 (Unicode). This is the whole point of Unicode.

My default NLS in the kernel is UTF-8.

Windows XP and Windows 2000 use Unicode for SMB shares, not single-byte codepages. This is why it is so strange that Samba required me to choose one.

cp850 is a Western European ("DOS Latin-1") codepage, which is why I tested it. Windows 2000/XP use codepages for non-Unicode applications only.

Traditionally, a character was stored in 8 bits, which allows 256 different values. Naturally, 256 characters aren't enough to describe all languages and writing systems. Therefore codepages were developed so applications could know which character a specific byte represents.

If two users were to talk to each other over the net, their systems would need to use the same codepage or characters would end up wrong.

Unicode was developed to remedy this. Unicode is large enough to describe most (all?) languages in the world, so the need for other codepages is removed. The biggest remaining problem is having full Unicode fonts. The fullest one I know is Arial Unicode MS. It has about 55000 characters defined.
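
The codepage problem described above can be shown with a short Python sketch. It is only an illustration, not something from the thread: the byte 0xE9 and the two codepages (latin-1 and cp850) are picked as examples, and the variable names are made up.

```python
# One byte, two codepages: the reader has to know which table the writer used.
raw = bytes([0xE9])

latin1_text = raw.decode("latin-1")  # the byte read as ISO-8859-1
cp850_text = raw.decode("cp850")     # the same byte read as DOS Western European

# Unicode gives each character one code point, independent of any codepage.
# UTF-8 is one way of storing those code points as bytes.
e_acute = "\u00e9"                   # LATIN SMALL LETTER E WITH ACUTE
utf8_bytes = e_acute.encode("utf-8")

print(repr(latin1_text), repr(cp850_text))
print(utf8_bytes, utf8_bytes.decode("utf-8") == e_acute)
```

The two decodes disagree about the same byte, which is the mismatch two users with different codepages would see. The UTF-8 round trip needs no codepage agreement, only that both sides use UTF-8.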

gna · n00b Joined: 19 Mar 2003 Posts: 38 Location: Beijing

I agree that it should not be necessary to specify a codepage and, preferably, no iocharset either. It seems that that is the way it is intended to work. That it is not working is either a bug or a configuration error.

Two more suggestions:

In the kernel configuration check
File Systems -> Network File Systems -> SMB File System support -> Use a default NLS -> utf8
It seems you can specify two default NLS's in the kernel, one for smbfs and one for other stuff.

Also try using the cifs filesystem. Just replace smbfs with cifs in your mount command (assuming it is configured in the kernel). cifs doesn't have a codepage option and is supposed to have better international support than smbfs. cifs is now recommended over smbfs for all except old SMB systems. Documentation is in /usr/src/linux/fs/cifs/README

If you can't get it to work then it might be good to ask a question on the linux cifs mailing list and/or file a bug report.

Leo Lausren · Posted: Tue Aug 24, 2004 6:25 am

max4ever · Posted: Thu Sep 02, 2004 8:54 pm

umm so if i did this

Gatak · Apprentice Joined: 04 Jan 2004 Posts: 174

Only if the application you use has a font which includes these characters, and only if the application supports UTF-8.

max4ever · Posted: Fri Sep 03, 2004 11:27 am

Hmm, and how can I find out if a font has "support" for those characters? For example, I'm having problems with mplayer displaying subtitles correctly. Can you suggest a font with UTF-8 support and antialiasing?

Gatak · Apprentice Joined: 04 Jan 2004 Posts: 174

You can try to load the font in a character map program. I think there is one in Gnome. It allows you to see which characters exist in the font. Then you have to use that font in mplayer.

But remember, the subtitles that you load in mplayer may not be encoded with UTF-8, but some other local encoding. Mplayer would need to support that one.
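
One rough way to check the point about subtitle encodings is to test whether the file's bytes decode as UTF-8 and, if not, convert them from whatever encoding they were really written in. The Python sketch below is an illustration only: the helper names are made up, the sample text is invented, and cp1252 is just an assumed source encoding. The real encoding of a given subtitle file has to be known or guessed.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (pure ASCII also passes)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from a known legacy encoding into UTF-8."""
    return data.decode(source_encoding).encode("utf-8")


# Invented sample: a subtitle line saved in a legacy Western European encoding.
legacy_line = "Cr\u00e8me br\u00fbl\u00e9e".encode("cp1252")

print(is_valid_utf8(legacy_line))        # the legacy bytes are not valid UTF-8
converted = to_utf8(legacy_line, "cp1252")
print(is_valid_utf8(converted))          # after conversion they are
```

For a real file the bytes would come from reading the subtitle file in binary mode. Passing this check does not prove a file was meant as UTF-8, but failing it is a strong hint that it is in some local encoding.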

andrewski · Posted: Sat Oct 02, 2004 2:50 am

It'd be great if you could post a bit on the various fonts that are necessary to complete the effort to actually "see" UTF, i.e. console font, *term font. In all my searching, I haven't been able to figure that one out!

Also, where does CONSOLETRANSLATION from /etc/rc.conf come in? Perhaps that's necessary to seal the deal, as it were?

Thanks for a nice howto.

obmun · Posted: Sat Oct 02, 2004 11:22 am

@andrewski:

Forget about UTF-8 in the console. It won't work completely (compose chars won't work). For more info take a look at this post. There I have some info about console fonts. Essentially you have to use a console font with a Unicode map. It's also good to have a font that makes use of the full 512 available glyphs (and not just one with 256).

CONSOLETRANSLATION tells setfont which translation map to use to translate program output from 8-bit to the UTF-8 the kernel expects when you're not using UTF-8 (the kernel is always in UTF-8; it always expects to receive Unicode chars). If apps are already sending UTF-8 chars, the translation map is not necessary, so CONSOLETRANSLATION should be commented out if you're using UTF-8 as your default encoding.

talon · n00b Joined: 11 Jun 2003 Posts: 13

My major problem in porting my machine to UTF-8 was that all gtk-1 apps didn't display chars correctly. After a long time of experimenting I figured out how to do it right. You have to add the following line to your ~/.gtkrc.mine:

Haqqax · n00b Joined: 11 Jul 2004 Posts: 35

Can anyone shed some light on how to force (or whether it can be done at all) KDE apps to work with Unicode Plane1 characters?
I have been testing a little over the last two days. I managed to create a font with just a few characters encoded in Plane 1 (they start at 0x12000; I am trying to make my Linux support Akkadian cuneiform). I installed it and used Perl to create a text file and an HTML file for tests. The HTML file has both plain text chars and character entity references.

The only applications that process and display these files correctly are Firefox (it does display cuneiform texts :-) ) and Thunderbird (I sent a cuneiform e-mail to myself, and when it arrived it was displayed correctly :-) ). All the other applications, including but not limited to OpenOffice, Konqueror and standard KDE apps, do not parse UTF from Plane 1 correctly (they split one code point into 2 chars) and of course do not display the text correctly. I am particularly disappointed by OpenOffice in this matter.
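
The "one code split into 2 chars" symptom is what you would expect if an application stores text as UTF-16 and treats each 16-bit unit as a character, since code points beyond the Basic Multilingual Plane need two units (a surrogate pair). Whether that is what happens in the applications named above is an assumption, not something confirmed in this thread. The Python sketch below uses 0x12000, the first code point mentioned in the post; the function name is made up.

```python
def surrogate_pair(code_point):
    """Split a code point beyond the BMP into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF use surrogate pairs")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


plane1_char = "\U00012000"

high, low = surrogate_pair(ord(plane1_char))
print(hex(high), hex(low))

# One code point, but two 16-bit units in UTF-16 and four bytes in UTF-8.
utf16_bytes = plane1_char.encode("utf-16-be")
utf8_bytes = plane1_char.encode("utf-8")
print(len(plane1_char), len(utf16_bytes) // 2, len(utf8_bytes))
```

A small test file for other applications could be made the same way, by writing such a string to disk encoded as UTF-8.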

Can my KDE be cured? Does my success with Firefox and Thunderbird mean that other GTK editors may work equally well?

gna · n00b Joined: 19 Mar 2003 Posts: 38 Location: Beijing

Actually this topic is of interest to me too. I know that a lot of applications ignore the supplementary planes. There is a UTF-8 project at freedesktop.org that is trying to make a list of software that is not Unicode compliant. In particular they have a list of Unicode software that doesn't work for the supplementary planes. Unfortunately this list is very short. But if you do find out something, please report here and let us all know.

What software did you use to make your font? It would be helpful to know so that more people know how to do testing.

thanks

Haqqax · n00b Joined: 11 Jul 2004 Posts: 35

numerodix · l33t Joined: 18 Jul 2002 Posts: 743 Location: nl.eu

Ok, so I finally succeeded in getting this to work, my /etc/env.d/02locale now looks like this:

Haqqax · n00b Joined: 11 Jul 2004 Posts: 35

Gatak · Apprentice Joined: 04 Jan 2004 Posts: 174

I think most Gnome applications support Unicode, at least if compiled with accessibility support. In GEdit, for example, I can view all sorts of Unicode characters. I suppose I still need TrueType or OpenType fonts on the system that support Unicode.

Haqqax · n00b Joined: 11 Jul 2004 Posts: 35

Haqqax · n00b Joined: 11 Jul 2004 Posts: 35

I've got one more question: are Hebrew niqud and Arabic vowels displayed correctly on your Gentoo boxes? On my box they are displayed, but are not positioned correctly on their characters.

And, of course, Arabic ligatures are broken by the vowels.

Is it working for anyone?

obmun · Posted: Mon Oct 11, 2004 3:07 pm

@numerodix:

Console and UTF-8? Bad mixture. Take a look at this post. There I analyze the problem. Conclusion? It's a kernel problem.