Buildroot Archive mirror
* [Buildroot] User question UTF-8
@ 2015-09-15 17:11 Steve Calfee
  2015-09-15 21:21 ` Thomas Petazzoni
  2015-09-15 21:21 ` Arnout Vandecappelle
  0 siblings, 2 replies; 5+ messages in thread
From: Steve Calfee @ 2015-09-15 17:11 UTC
  To: buildroot

Hi,

I am trying to port a python application to buildroot/busybox. It
needs to read disk files from removable drives. The filenames may
contain utf-8 chars.

Currently ls from busybox prints ? for the utf-8 non-ascii chars. Both
from console on minicom and from ssh (which should handle utf-8).

There seem to be lots of config knobs.

I assume utf-8 chars are somehow related to locales? I enabled locales
in the internal glibc toolchain.

BR2_arm=y
BR2_TOOLCHAIN_BUILDROOT_GLIBC=y
BR2_TOOLCHAIN_BUILDROOT_CXX=y
BR2_ENABLE_LOCALE_PURGE=y
BR2_GENERATE_LOCALE="en_US.UTF-8"
BR2_TARGET_OPTIMIZATION="-Os -pipe"
# BR2_TARGET_GENERIC_GETTY is not set
# BR2_TARGET_GENERIC_REMOUNT_ROOTFS_RW is not set
BR2_PACKAGE_LIBPTHREAD_STUBS=y
# BR2_TARGET_ROOTFS_TAR is not set
BR2_TARGET_SHEEVAPLUG=y


Busybox also has locale settings:
grep LOCAL output/build/busybox-1.23.2/.config
CONFIG_LOCALE_SUPPORT=y
# CONFIG_UNICODE_USING_LOCALE is not set
# CONFIG_FEATURE_UNIX_LOCAL is not set
# CONFIG_HUSH_LOCAL is not set

From googling, Linux always supports anything for filenames, since it
just uses bytes not unicode for filenames.

But I seem to be missing something. My generated system does not seem
to properly handle utf-8. I am guessing until that works the python os
module is also not going to handle utf-8. And indeed it does not work
now.

Regards, Steve


* [Buildroot] User question UTF-8
  2015-09-15 17:11 [Buildroot] User question UTF-8 Steve Calfee
@ 2015-09-15 21:21 ` Thomas Petazzoni
  2015-09-15 21:39   ` Steve Calfee
  2015-09-15 21:21 ` Arnout Vandecappelle
  1 sibling, 1 reply; 5+ messages in thread
From: Thomas Petazzoni @ 2015-09-15 21:21 UTC
  To: buildroot

Dear Steve Calfee,

On Tue, 15 Sep 2015 10:11:56 -0700, Steve Calfee wrote:

> I am trying to port a python application to buildroot/busybox. It
> needs to read disk files from removable drives. The filenames may
> contain utf-8 chars.

Are you actually sure they are UTF-8 encoded? I don't think characters
in FAT16/32 filesystems are typically encoded as UTF-8, but rather some
weird Windows-specific code page encoding.

According to
https://msdn.microsoft.com/en-us/library/windows/desktop/dd317748(v=vs.85).aspx:

"""
NTFS stores file names in Unicode. In contrast, the older FAT12, FAT16,
and FAT32 file systems use the OEM character set. For more information,
see Code Pages.
"""

> Currently ls from busybox prints ? for the utf-8 non-ascii chars. Both
> from console on minicom and from ssh (which should handle utf-8).

Can you instead try if a UTF-8 encoded text file prints correctly? If
it does, then the problem is really more of a filesystem character
encoding issue than a problem in the UTF-8 support.
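One way to run that kind of check, as a rough sketch: take the raw bytes (of a file's contents, or of a file name) and see whether they decode as UTF-8. The helper name is_valid_utf8 and the sample byte strings are made up for illustration.

```python
def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# "café" encoded as UTF-8: é is the two bytes 0xC3 0xA9.
utf8_sample = b"caf\xc3\xa9"
# The same word in a single-byte code page such as Latin-1: é is 0xE9.
latin1_sample = b"caf\xe9"

print(is_valid_utf8(utf8_sample))
print(is_valid_utf8(latin1_sample))
```

The same function can be fed the contents of a file opened in binary mode. Bytes that fail this test point to a code-page encoding rather than UTF-8.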

Best regards,

Thomas
-- 
Thomas Petazzoni, CTO, Free Electrons
Embedded Linux, Kernel and Android engineering
http://free-electrons.com


* [Buildroot] User question UTF-8
  2015-09-15 17:11 [Buildroot] User question UTF-8 Steve Calfee
  2015-09-15 21:21 ` Thomas Petazzoni
@ 2015-09-15 21:21 ` Arnout Vandecappelle
  2015-09-15 21:49   ` Steve Calfee
  1 sibling, 1 reply; 5+ messages in thread
From: Arnout Vandecappelle @ 2015-09-15 21:21 UTC
  To: buildroot

On 15-09-15 19:11, Steve Calfee wrote:
> Hi,
> 
> I am trying to port a python application to buildroot/busybox. It
> needs to read disk files from removable drives. The filenames may
> contain utf-8 chars.
> 
> Currently ls from busybox prints ? for the utf-8 non-ascii chars. Both
> from console on minicom and from ssh (which should handle utf-8).

 Busybox ls will print all non-ASCII characters as ? unless UNICODE_SUPPORT is
enabled. Our default busybox config doesn't have UNICODE_SUPPORT enabled. So do
'make busybox-menuconfig' and enable UNICODE_SUPPORT. You'll also need to enable
WCHAR in the toolchain - but since you use glibc, it always has WCHAR enabled.
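
 To confirm the option took effect after a rebuild, the generated busybox
.config can be inspected. A small sketch, assuming the option shows up as
CONFIG_UNICODE_SUPPORT (following the CONFIG_ prefix seen in the busybox
.config quoted in this thread); the function name and the sample text are
illustrative only, not taken from a real build.

```python
def config_enabled(config_text, option):
    """Return True if a Kconfig-style .config text sets `option` to y."""
    wanted = option + "=y"
    for line in config_text.splitlines():
        if line.strip() == wanted:
            return True
    return False


# Made-up sample in the style of a busybox .config file.
sample = """\
CONFIG_LOCALE_SUPPORT=y
# CONFIG_UNICODE_USING_LOCALE is not set
# CONFIG_UNICODE_SUPPORT is not set
"""

print(config_enabled(sample, "CONFIG_LOCALE_SUPPORT"))
print(config_enabled(sample, "CONFIG_UNICODE_SUPPORT"))
```

 On a real build the text would be read from the .config file under
output/build/busybox-1.23.2/.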

> 
> There seem to be lots of config knobs.
> 
> I assume utf-8 chars are somehow related to locales? I enabled locales
> in the internal glibc toolchain.
> 
> BR2_arm=y
> BR2_TOOLCHAIN_BUILDROOT_GLIBC=y
> BR2_TOOLCHAIN_BUILDROOT_CXX=y
> BR2_ENABLE_LOCALE_PURGE=y
> BR2_GENERATE_LOCALE="en_US.UTF-8"
> BR2_TARGET_OPTIMIZATION="-Os -pipe"
> # BR2_TARGET_GENERIC_GETTY is not set
> # BR2_TARGET_GENERIC_REMOUNT_ROOTFS_RW is not set
> BR2_PACKAGE_LIBPTHREAD_STUBS=y
> # BR2_TARGET_ROOTFS_TAR is not set
> BR2_TARGET_SHEEVAPLUG=y
> 
> 
> Busybox also has locale settings:
> grep LOCAL output/build/busybox-1.23.2/.config
> CONFIG_LOCALE_SUPPORT=y
> # CONFIG_UNICODE_USING_LOCALE is not set
> # CONFIG_FEATURE_UNIX_LOCAL is not set
> # CONFIG_HUSH_LOCAL is not set
> 
> From googling, Linux always supports anything for filenames, since it
> just uses bytes not unicode for filenames.
> 
> But I seem to be missing something. My generated system does not seem
> to properly handle utf-8. I am guessing until that works the python os
> module is also not going to handle utf-8. And indeed it does not work
> now.

 Busybox and python are completely unrelated. In python 2, you'll have to
explicitly encode/decode the filenames with the appropriate character set. The
default character set is ascii, not utf-8. In python 3, there is an environment
variable that you can set to default to utf-8, though.
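
 As a rough illustration of explicit decoding: asking os.listdir for a
byte-string path returns the raw bytes of each name, which can then be
decoded with whatever character set the file system actually uses. This is
only a sketch; decode_name, list_names and the utf-8 default are assumptions
made for the example.

```python
import os


def decode_name(raw, encoding="utf-8"):
    """Decode one raw file name; undecodable bytes become U+FFFD."""
    return raw.decode(encoding, "replace")


def list_names(directory, encoding="utf-8"):
    """List a directory by bytes path and decode each entry explicitly."""
    if not isinstance(directory, bytes):
        directory = directory.encode(encoding)
    return sorted(decode_name(raw, encoding) for raw in os.listdir(directory))


if __name__ == "__main__":
    for name in list_names("."):
        print(repr(name))
```

 Using "replace" keeps a listing going when a name is not valid in the
chosen encoding, at the cost of losing the original bytes for that name.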

 Regards,
 Arnout

> 
> Regards, Steve


-- 
Arnout Vandecappelle                          arnout at mind be
Senior Embedded Software Architect            +32-16-286500
Essensium/Mind                                http://www.mind.be
G.Geenslaan 9, 3001 Leuven, Belgium           BE 872 984 063 RPR Leuven
LinkedIn profile: http://www.linkedin.com/in/arnoutvandecappelle
GPG fingerprint:  7493 020B C7E3 8618 8DEC 222C 82EB F404 F9AC 0DDF


* [Buildroot] User question UTF-8
  2015-09-15 21:21 ` Thomas Petazzoni
@ 2015-09-15 21:39   ` Steve Calfee
  0 siblings, 0 replies; 5+ messages in thread
From: Steve Calfee @ 2015-09-15 21:39 UTC
  To: buildroot

Hi Thomas,

On Tue, Sep 15, 2015 at 2:21 PM, Thomas Petazzoni
<thomas.petazzoni@free-electrons.com> wrote:
> Dear Steve Calfee,
>
> On Tue, 15 Sep 2015 10:11:56 -0700, Steve Calfee wrote:
>
>> I am trying to port a python application to buildroot/busybox. It
>> needs to read disk files from removable drives. The filenames may
>> contain utf-8 chars.
>
> Are you actually sure they are UTF-8 encoded? I don't think characters
> in FAT16/32 filesystems are typically encoded as UTF-8, but rather some
> weird Windows-specific code page encoding.
>

Yes, I am sure. I wrote the vfat fs from Ubuntu 14.04. I have also
successfully read the fs from a recent Raspbian on a Raspberry Pi.

>
>> Currently ls from busybox prints ? for the utf-8 non-ascii chars. Both
>> from console on minicom and from ssh (which should handle utf-8).
>
> Can you instead try if a UTF-8 encoded text file prints correctly? If
> it does, then the problem is really more of a filesystem character
> encoding issue than a problem in the UTF-8 support.
>
I did what you suggested, I wrote a text file with a few non-ascii
chars to the flash drive.

When I moved it back to the dockstar running my buildroot/busybox build,
I cat'ed the file to minicom. One of the names is Irish: Siobhán
McCarthy. It printed a diamond with a ? in it. Same when printed from
the ssh terminal.

I also tried "export LANG=en_US.UTF-8"; still the same. I am not sure
what the environment vars LC_ALL and LANG do, and I did not enable them
in busybox anyway.

Thanks for the suggestions. I thought that all you Europeans here
would have solved this. Is there a defconfig that supports utf-8?

Steve


* [Buildroot] User question UTF-8
  2015-09-15 21:21 ` Arnout Vandecappelle
@ 2015-09-15 21:49   ` Steve Calfee
  0 siblings, 0 replies; 5+ messages in thread
From: Steve Calfee @ 2015-09-15 21:49 UTC
  To: buildroot

Hi Arnout,

On Tue, Sep 15, 2015 at 2:21 PM, Arnout Vandecappelle <arnout@mind.be> wrote:
>
>  Busybox and python are completely unrelated. In python 2, you'll have to
> explicitly encode/decode the filenames with the appropriate character set. The
> default character set is ascii, not utf-8. In python 3, there is an environment
> variable that you can set to default to utf-8, though.
>
>  Regards,
>  Arnout
>

Yes, I know. I started by discovering the utf-8 problem in python using
os.walk. I then tried to simplify the problem by getting the filenames
to work with just the OS (Linux 4.1.4 and busybox).

Python 2.x just barely supports utf-8 and seems to require that the OS
agree with the support, at least for os.walk. It took lots of fiddling
to get it to work on Raspbian too.
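
For reference, one way to make os.walk less dependent on the locale setup
is to hand it a byte-string path, so the names come back undecoded and can
be decoded explicitly. A minimal sketch along those lines; walk_decoded is
a hypothetical helper, and utf-8 is assumed only because that is what the
drive in this thread is said to use.

```python
import os


def walk_decoded(top, encoding="utf-8"):
    """Yield decoded file paths under `top`, walking with a bytes path."""
    if not isinstance(top, bytes):
        top = top.encode(encoding)
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            # dirpath and name are both bytes here; decode only at the end.
            yield os.path.join(dirpath, name).decode(encoding, "replace")


if __name__ == "__main__":
    for path in walk_decoded("."):
        print(repr(path))
```

Because the walk itself works on bytes, a name that is not valid utf-8
shows up with replacement characters instead of raising an error partway
through the tree.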

Thanks for helping,

Steve

