Why and How Does Python Use Bloom Filters in String Processing? (codeconfessions.substack.com)
from learnbyexample@programming.dev to python@programming.dev on 15 Sep 2023 13:10
https://programming.dev/post/3029652

#python

qwop@programming.dev on 15 Sep 2023 22:33

The article says that CPython represents strings as UTF-8, which is not correct. The details about how the representation works are right; it just isn’t UTF-8.

That’s just a minor point though, nice article.

abhi9u@lemmy.world on 17 Sep 2023 06:09

Hi @qwop, I am the author. Thank you for reading and for the kind words. I would like to understand the error I made better so that I don’t repeat it in future, and so that I can fix it if possible. Could you please clarify?

qwop@programming.dev on 19 Sep 2023 11:53

UTF-8 is an encoding for Unicode, which means it’s a way of representing a Unicode string as actual bytes on a computer.

It is variable length and works by using the first bits of each byte to indicate how many bytes are needed to represent the current character.
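As a small illustration of the variable-length point, here is a sketch that uses only the built-in `str.encode`. The helper name `utf8_info` and the sample characters are made up for this example.

```python
def utf8_info(ch):
    """Return (byte count, first byte as a bit string) for one character in UTF-8."""
    encoded = ch.encode("utf-8")
    return len(encoded), format(encoded[0], "08b")

# Sample characters: 'a', e-acute, euro sign, and an emoji.
for ch in ["a", "\u00e9", "\u20ac", "\U0001F600"]:
    length, lead = utf8_info(ch)
    print(f"U+{ord(ch):04X}: {length} byte(s), first byte {lead}")
```

The first byte starts with `0`, `110`, `1110` or `11110` for characters that need 1, 2, 3 or 4 bytes respectively.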

Python also uses an encoding, as you describe in the article, but it’s different from UTF-8. Unlike UTF-8, all characters in Python’s representation of a Unicode string use the same number of bytes, which is the maximum that any individual character in that string needs.
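One rough way to see this from Python itself is to grow a string by one character and compare `sys.getsizeof` before and after. This is only a sketch: it assumes CPython (other implementations may report sizes differently), and `bytes_per_char` is a hypothetical helper written for this comment, not anything from the interpreter.

```python
import sys

def bytes_per_char(filler, widest=""):
    """Estimate storage per character by growing a string by one character.

    Relies on CPython's sys.getsizeof; other implementations may differ.
    """
    shorter = filler * 1000 + widest
    longer = filler * 1001 + widest
    return sys.getsizeof(longer) - sys.getsizeof(shorter)

print(bytes_per_char("a"))                 # ASCII-only string
print(bytes_per_char("\u20ac"))            # string of euro signs
print(bytes_per_char("\U0001F600"))        # string of emoji
print(bytes_per_char("a", "\U0001F600"))   # ASCII text with one emoji appended
```

On CPython 3.3 or later I would expect this to print 1, 2, 4 and 4. The last case is the interesting one: a single emoji makes every `a` in the string take four bytes.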

I’d probably mess up a more detailed explanation of UTF-8 or Python’s representation, so I’ll let you look into how they work in more detail if you’re interested.

abhi9u@lemmy.world on 19 Sep 2023 14:59

Thank you! That’s helpful. I spent quite some time trying to understand the difference between UTF-8 and Python’s representation, and arrived at the same understanding as you describe. However, most external documents simply say that strings in Python are UTF-8, which made me think I might be missing something and that it would be safer to write UTF-8.

I will look further into the code, as you suggested.