ISO-8859-x encodings and invalid bytes
from Fred@programming.dev to python@programming.dev on 25 Jun 2024 10:42
https://programming.dev/post/16012518
Hi,
I have to interface with systems that use iso-8859-x encoding (not by choice…), and I’m surprised that the following doesn’t throw an error:
>>> str(bytes(range(256)), encoding="iso-8859-1", errors="strict")
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0¡¢£¤¥¦§¨©ª«¬\xad®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ'
Bytes in the 0x80-0x9f range are not valid iso-8859-1, and I was expecting the above to raise a UnicodeDecodeError; instead it looks like those bytes are passed through.
I’m perfectly happy with this behaviour, but I would like to make sure I can depend on it. Can I take an arbitrary byte buffer, decode it as ISO-8859-1, and never get an error? Is it guaranteed to be lossless?
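For reference, here is a minimal sketch of how the round trip could be checked empirically. The helper name undecodable_bytes is made up for illustration, and the check only shows what the running interpreter does; it is not a guarantee from a specification. As far as I can tell, some other parts of ISO-8859-x (iso-8859-6, for example) leave byte values unmapped in CPython, so the result for iso-8859-1 may not carry over to every "x".

```python
# Round-trip every possible byte value through ISO-8859-1.
data = bytes(range(256))
text = data.decode("iso-8859-1", errors="strict")

# Each byte maps to the code point with the same number.
assert [ord(ch) for ch in text] == list(range(256))
# Encoding back gives the original bytes.
assert text.encode("iso-8859-1", errors="strict") == data


def undecodable_bytes(codec: str) -> list[int]:
    """Return the byte values that `codec` rejects in strict mode.

    Hypothetical helper, only valid for single-byte codecs.
    """
    bad = []
    for value in range(256):
        try:
            bytes([value]).decode(codec, errors="strict")
        except UnicodeDecodeError:
            bad.append(value)
    return bad


print(undecodable_bytes("iso-8859-1"))
```

Running the same helper against each iso-8859-x codec actually in use would show whether any of them can raise on arbitrary input.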
#python
I was curious, so I did some searches on this topic for you and found these pages:
The second link in particular notes:
Whether depending on this is actually correct or not is beyond me, but it seems like people have actually been using that pass-through behavior in practice and put it into things like Python2 -> 3 migration guides.
The first link suggests that the seemingly undefined ranges are valid as C0 and C1 control codes which may be why it doesn’t throw errors.
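A quick way to see this from Python, as a sketch using the standard unicodedata module: the characters produced by decoding bytes 0x80 to 0x9f are reported with the general category "Cc" (control), the same category as the C0 controls below 0x20. This only illustrates how Python's Unicode database classifies them; it does not settle what the ISO standard itself says.

```python
import unicodedata

# Decode the byte range the original post was worried about.
c1 = bytes(range(0x80, 0xA0)).decode("iso-8859-1")
# And the low control range, for comparison.
c0 = bytes(range(0x00, 0x20)).decode("iso-8859-1")

# Collect the distinct Unicode general categories in each range.
c1_categories = {unicodedata.category(ch) for ch in c1}
c0_categories = {unicodedata.category(ch) for ch in c0}

print(c1_categories, c0_categories)
```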
Maybe I’m tripping here but this kinda also explains why the human genome contains lots of noncoding DNA.
I think it has more to do with the fact Mother Nature is really inefficient and allocated much more DNA storage than necessary.
Thank you for the pointers @e0qdk@reddthat.com
My use case certainly falls into the one described by ESR: I only really need to understand markup that falls in the ASCII range and pass the rest through unmodified.
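A rough sketch of what that could look like, assuming the markup is tag-like and the tag names are plain ASCII. The function rename_tag and the sample input are invented for illustration and are not taken from any real system. The bytes are decoded as ISO-8859-1, only the ASCII markup is touched, and everything else is encoded back byte for byte.

```python
import re


def rename_tag(raw: bytes, old: str, new: str) -> bytes:
    """Rename an ASCII tag, leaving every other byte untouched.

    Hypothetical helper: `old` and `new` are assumed to be ASCII.
    """
    # Every byte becomes the code point with the same number.
    text = raw.decode("iso-8859-1")
    # Match "<old" or "</old" followed by whitespace, ">" or "/".
    pattern = re.compile(r"(</?)" + re.escape(old) + r"(?=[\s>/])")
    text = pattern.sub(lambda m: m.group(1) + new, text)
    # Non-ASCII characters map straight back to their original bytes.
    return text.encode("iso-8859-1")


# Sample input with a Latin-1 letter and two C1 bytes in the payload.
raw = b"<b>caf\xe9 \x80\x9f</b>"
out = rename_tag(raw, "b", "strong")
print(out)
```

The encode step can only fail if the replacement text introduces characters above U+00FF, which should not happen while the edits stay within ASCII.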