deddit.petersanchez.com

Parsing a CSV file really really fast in Python (datapythonista.me)
from rimu@piefed.social to python@programming.dev on 18 Mar 2024 23:47
https://piefed.social/post/64395

If you care about performance, you may want to avoid CSV files. But our data sources are often like our family: we don’t get to choose them. So in this blog post we’ll see how to process a CSV file as fast as possible.

#python


stevedidwhat_infosec@infosec.pub on 19 Mar 2024 00:21

Okay so would it be faster to convert it to something better and then do something faster with this better format?

Edit: I guess looking at the numbers, they’re already pretty low there. Idk how much faster it’d really be and whether or not it’d be worth doing.

What’s even the “gold standard” for logging stuff I guess?

sugar_in_your_tea@sh.itjust.works on 19 Mar 2024 03:52

That really depends on how much of it you’re doing. If you’re just handling a few files at a time, the difference between 0.1s and 3s isn’t that big of a deal. If you’re handling thousands or even millions in a day, making it more efficient can save an order of magnitude in cost.

We use CSVs at work, but it’s not a common thing so we just use the built-in csv library. If we did more with it, pandas would be the way to go (or maybe we’d rewrite that service in Rust).
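
For reference, a minimal sketch of the built-in csv approach mentioned above. The column names and sample data are made up for illustration, and the helper name `sum_column` is hypothetical; this is not the commenter's actual code.

```python
import csv
import io


def sum_column(csv_text: str, column: str) -> float:
    """Sum one numeric column of a CSV document using the stdlib csv module."""
    # DictReader uses the first row as the header and yields one dict per row.
    reader = csv.DictReader(io.StringIO(csv_text))
    return sum(float(row[column]) for row in reader)


# Hypothetical sample data, invented for this sketch.
SAMPLE = "name,amount\nalice,1.5\nbob,2.5\n"
total = sum_column(SAMPLE, "amount")
print(total)
```

For occasional small files this is usually enough; it parses row by row in pure Python, which is why the faster engines discussed in the linked post matter once volumes grow.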

NostraDavid@programming.dev on 19 Mar 2024 11:05

What’s even the “gold standard” for logging stuff I guess?

structlog. Or just Structured Logging in general.

Don’t do:

logging.info(f"{something} happened!")

But do

log.info("thing-happened", thing=something)  # where log = structlog.get_logger()

Why? Your event becomes a category, which makes it easily searchable/findable. You can output either human-readable stuff (the typical {date}, {loglevel}, {event}) or just straight-up JSONL (one JSON object/dict per line). If you have JSON logs, you can use jq to query/filter/manipulate them. If you have something like ELK, you can insert your logs there and create dashboards.

It’s amazing - though it may break your brain initially.
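
A rough stdlib-only sketch of the JSONL idea described above, for anyone who wants to see the shape of the output before installing anything. This is not structlog's API: the helper `log_event` and the logger name `demo` are hypothetical, and the standard `logging.info()` does not accept arbitrary keyword arguments like `thing=`, which is why the fields are serialized by hand here.

```python
import json
import logging
import sys


def log_event(logger: logging.Logger, event: str, **fields) -> str:
    """Emit one JSON object per line: the event name plus key/value context."""
    # The event name stays constant; the variable parts go into separate fields.
    line = json.dumps({"event": event, **fields}, sort_keys=True)
    logger.info(line)
    return line


# Print only the message so every output line is a valid JSON document.
logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")
log = logging.getLogger("demo")  # hypothetical logger name

line = log_event(log, "thing-happened", thing="csv-import", rows=3)
```

Because each line is a self-contained JSON object, a tool like jq can filter on the `event` field directly instead of pattern-matching on free text.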

NostraDavid@programming.dev on 19 Mar 2024 11:07

Also, regarding better formats: Parquet is relatively nice. Smaller files, though not human-readable. Use Parquet if you read the data often, or have IO issues (file “too large” as CSV).
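
A small sketch of a one-off CSV-to-Parquet conversion, assuming pyarrow is installed. The function name `csv_to_parquet`, the file names, and the sample data are made up for illustration.

```python
import os
import tempfile


def csv_to_parquet(csv_path: str, parquet_path: str) -> int:
    """Convert a CSV file to Parquet and return the number of rows written."""
    # Imported here so the sketch only needs pyarrow when actually called.
    import pyarrow.csv as pv
    import pyarrow.parquet as pq

    table = pv.read_csv(csv_path)        # column types are inferred from the data
    pq.write_table(table, parquet_path)  # columnar, compressed on-disk format
    return table.num_rows


def demo() -> int:
    """Write a tiny hypothetical CSV, convert it, and read the Parquet file back."""
    import pyarrow.parquet as pq

    with tempfile.TemporaryDirectory() as tmp:
        csv_path = os.path.join(tmp, "sample.csv")
        parquet_path = os.path.join(tmp, "sample.parquet")
        with open(csv_path, "w", newline="") as f:
            f.write("id,value\n1,10\n2,20\n3,30\n")
        csv_to_parquet(csv_path, parquet_path)
        return pq.read_table(parquet_path).num_rows
```

Whether the conversion pays off depends on how often the file is read afterwards; for a file parsed once, the extra write may cost more than it saves.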

fartsparkles@sh.itjust.works on 19 Mar 2024 00:22

Holy shit, switching to PyArrow is going to make me seem like a mystical wizard when I merge in the morning. I’ve easily halved the execution time of a horrible but unavoidable job (yay crappy vendor “API” that returns a huge CSV).

AlecSadler@sh.itjust.works on 19 Mar 2024 04:18

You and me both. I’ve been parsing around 10-100 million row CSVs lately and…this will hopefully help.

SpaceNoodle@lemmy.world on 19 Mar 2024 04:19

LOL just use fscanf() you silly goose