deddit.petersanchez.com

Scrapping websites and finding unique values
from dudesss@lemmy.ca to programming@programming.dev on 07 May 2026 11:39
https://lemmy.ca/post/64491926

Does anyone have recommendations for how I can scrap a website and extract unique names, such as product names?

I was thinking of using some website scrapping tool, then a local LLM to find unique product names.

#programming

exu@feditown.com on 07 May 2026 11:54

Usually you’d have to be the owner or operator of a website to scrap it. I guess hacking into the server and deleting all data would also work.

dudesss@lemmy.ca on 07 May 2026 11:57

I was thinking of doing it once a day, even if I have to initiate it manually for it to be legal. It would only be for personal use, not for anything public or commercial.

It would save me time from manually copying the HTML over to an LLM or something.

hendrik@palaver.p3x.de on 07 May 2026 12:37

Just read the robots.txt and obey the rules. Also set your user agent string properly. We’ve had crawlers on the internet forever, and that’s the long-accepted way for website owners to give or revoke consent. Either you match a disallow directive and need to stop, or you’re completely fine to scrape it.
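
For anyone who wants to see what that check looks like in practice, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The robots.txt content, the `example.com` URLs, the user agent string, and the `is_allowed` helper are all made up for illustration. A real script would fetch the site's actual robots.txt instead of using an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
# A real script would download https://<site>/robots.txt instead.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""

# Hypothetical user agent string identifying the scraper.
USER_AGENT = "my-personal-scraper/0.1"


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/products", "https://example.com/admin/users"):
        verdict = "ok to fetch" if is_allowed(ROBOTS_TXT, USER_AGENT, url) else "disallowed"
        print(url, "->", verdict)
```

Sending the same user agent string in the actual HTTP requests keeps the check and the fetching consistent.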

dudesss@lemmy.ca on 07 May 2026 12:56

Neat, I’ve never heard of these.

exu@feditown.com on 07 May 2026 13:42

I was joking about your use of scrap and scrapping, as in to remove or to cancel :)

Web scraping only has one p.

Nomad@infosec.pub on 07 May 2026 11:57

You are probably talking about scraping a website. There are already tools that make this easy. Last time I had to do something like that, I used Scrapy.

Lysergid@lemmy.ml on 07 May 2026 12:14

Scrapy was created exactly for this use case. I used to work on a product-info scraping project before LLMs existed, so you don’t really have to use an LLM. It’s usually semi-structured data. Your biggest pain will likely be SPAs, where JS needs to run in order to load the content. If you need to render SPAs, check out Selenium WebDriver or similar.
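
To illustrate the "no LLM needed" point, here is a rough sketch that pulls unique product names out of semi-structured HTML using only Python's standard-library `html.parser`. The `product-name` class, the sample markup, and the `ProductNameParser` / `unique_product_names` names are assumptions made up for this example. A real site will use its own markup, and this approach does not run any JS, so it won't help with SPAs.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link", "wbr"}


class ProductNameParser(HTMLParser):
    """Collects the text of every element carrying a given CSS class."""

    def __init__(self, css_class="product-name"):  # class name is hypothetical
        super().__init__()
        self.css_class = css_class
        self._depth = 0      # > 0 while inside a matching element
        self._buffer = []    # text fragments of the current match
        self.names = []      # all names found, duplicates included

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth:
            self._depth += 1
        elif self.css_class in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            # Collapse runs of whitespace into single spaces.
            name = " ".join("".join(self._buffer).split())
            if name:
                self.names.append(name)

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def unique_product_names(html, css_class="product-name"):
    """Return names in first-seen order, deduplicated case-insensitively."""
    parser = ProductNameParser(css_class)
    parser.feed(html)
    parser.close()
    seen, unique = set(), []
    for name in parser.names:
        key = name.casefold()
        if key not in seen:
            seen.add(key)
            unique.append(name)
    return unique


# Made-up sample markup for demonstration.
SAMPLE_HTML = """
<ul>
  <li><span class="product-name">Blue Widget</span> $5</li>
  <li><span class="product-name">blue widget</span> $6</li>
  <li><span class="product-name featured">Red <b>Gadget</b></span> $9</li>
</ul>
"""

if __name__ == "__main__":
    print(unique_product_names(SAMPLE_HTML))
```

Deduplicating on a case-folded key treats "Blue Widget" and "blue widget" as the same product while keeping the first spelling seen. Fuzzier matching (typos, reordered words) would need more than this.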

sbeak@sopuli.xyz on 08 May 2026 00:10

The term is to “scrape”, “scraping a website”. See that singular p?

Another good example is the word “illustrate”. You would say “illustrating”, not “illustratting”, as the magic e is replaced with “ing”. A closer example to “scrape” would be the word “gape”: you would say “a gaping hole”, not “a gapping hole”.

I hope this helps! English is a weird language. To “scrape” means to collect, while to “scrap” means to discard.