deddit.petersanchez.com

Having trouble with web scraping. Help please.
from 3rr4tt1c@programming.dev to python@programming.dev on 01 May 2025 02:31
https://programming.dev/post/29528816

I was applying for a job and had to answer a question about web scraping, which I’m not familiar with. I answered all the other questions with no issue, so I decided I might as well put in the effort to learn the basics and see if I could do it in a day.
Yes, it was *somewhat* easier than I expected, but I still had to watch like 4 YouTube videos and read a bunch of Reddit and Stack Overflow posts.
I got the code working, but I decided to run it again to double-check. It stopped working. Not sure why.
Testing is also annoying because the “web page” is a Google Doc and constantly reloads or something. It takes forever to get proper results from my print statements.
I attached an image with the question. I haven’t heard back from them, and I’ve seen other people post what I think might be this exact question online, so hopefully I’m not doing anything illegal.
At this point, I just want to solve it. Here’s the code:

from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

def createDataframe(url): #Make the data easier to handle
    #Get the page's html data using BeautifulSoup
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')

    #Extract the table's headers and column structure
    table_headers = soup.find('tr', class_='c8')
    table_headers_titles = table_headers.find_all('td')
    headers = [header.text for header in table_headers_titles]

    #Extract the table's row data
    rows = soup.find_all('tr', class_='c4')
    row_data_outer = [row.find_all('td') for row in rows]
    row_data = [[cell.text.strip() for cell in row] for row in row_data_outer]

    #Create a dataframe using the extracted data
    df = pd.DataFrame(row_data, columns=headers)
    return df

def printMessage(dataframe): #Print the message gotten from the organised data
    #Drop rows that have missing coordinates
    dataframe = dataframe.dropna(subset=['x-coordinate', 'y-coordinate'], inplace=True)

    #Convert the coordinate columns to integers so they can be used
    dataframe['x-coordinate'] = dataframe['x-coordinate'].astype(int)
    dataframe['y-coordinate'] = dataframe['y-coordinate'].astype(int)

    #Determine how large the grid to be printed is
    max_x = int(dataframe['x-coordinate'].max())
    max_y = int(dataframe['y-coordinate'].max())

    #Create an empty grid
    grid = np.full((max_y + 1, max_x + 1), " ")

    #Fill the grid with the characters using coordinates as the indices
    for _, row in dataframe.iterrows():
        x = row['x-coordinate']
        y = row['y-coordinate']
        char = row['Character']
        grid[y][x] = char
    for row in grid:
        print("".join(row))

test = 'https://docs.google.com/document/d/e/2PACX-1vQGUck9HIFCyezsrBSnmENk5ieJuYwpt7YHYEzeNJkIb9OSDdx-ov2nRNReKQyey-cwJOoEKUhLmN9z/pub'
printMessage(createDataframe(test))

My most recent error:

C:\Users\User\PycharmProjects\dataAnnotationCodingQuestion\.venv\Scripts\python.exe C:\Users\User\PycharmProjects\dataAnnotationCodingQuestion\.venv\app.py 
Traceback (most recent call last):
  File "C:\Users\User\PycharmProjects\dataAnnotationCodingQuestion\.venv\app.py", line 50, in <module>
    printMessage(createDataframe(test))
  File "C:\Users\User\PycharmProjects\dataAnnotationCodingQuestion\.venv\app.py", line 30, in printMessage
    dataframe['x-coordinate'] = dataframe['x-coordinate'].astype(int)
                                ~~~~~~~~~^^^^^^^^^^^^^^^^
TypeError: 'NoneType' object is not subscriptable

Process finished with exit code 1

#python

eager_eagle@lemmy.world on 01 May 2025 02:47

Your dataframe is None because you’re dropping the rows in place. Either set inplace=False in that dropna call, or remove the reassignment to dataframe.
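
A minimal sketch of those two fixes side by side, in case it helps. The column names match the post, but the toy values are invented for illustration and are not the real table:

```python
import pandas as pd

# Toy data standing in for the scraped table (values are invented for illustration).
df = pd.DataFrame({
    "x-coordinate": [0, None, 1],
    "Character": ["A", "B", "C"],
    "y-coordinate": [0, 1, None],
})

# Fix 1: keep the reassignment and let dropna return a new dataframe
# (inplace=False is the default, so the argument can simply be omitted).
cleaned = df.dropna(subset=["x-coordinate", "y-coordinate"])

# Fix 2: keep inplace=True, but do not reassign the result.
df_copy = df.copy()
df_copy.dropna(subset=["x-coordinate", "y-coordinate"], inplace=True)

# Both approaches keep the same rows.
print(cleaned.equals(df_copy))
```

Either way works; the only combination that breaks is inplace=True together with the reassignment.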

3rr4tt1c@programming.dev on 01 May 2025 04:02

3rr4tt1c@programming.dev on 01 May 2025 04:06

Thanks, I set inplace=False like you suggested. I thought setting it to True meant that it modified the original dataframe. Why does it work with False?

alphapuggle@programming.dev on 01 May 2025 05:55

inplace=True modifies the original dataframe, while inplace=False returns the result as a new dataframe. When you assign the result of an inplace=True call back to a variable (dataframe = dataframe…(inplace=True)), you overwrite the dataframe with the return value of that call, which is None because the change happened in place.
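
A quick way to see the difference in return values, as a small sketch on a made-up two-row dataframe (the data is hypothetical, and this reflects the behaviour of the pandas versions I'm aware of):

```python
import pandas as pd

# Made-up data: one row has a missing x-coordinate.
df = pd.DataFrame({"x-coordinate": [0, None], "Character": ["A", "B"]})

# inplace=False (the default): df is untouched and a new dataframe comes back.
new_df = df.dropna(subset=["x-coordinate"])
print(len(df), len(new_df))

# inplace=True: df itself loses the row, and the call hands back None.
result = df.dropna(subset=["x-coordinate"], inplace=True)
print(result is None, len(df))
```

Assigning `result` back to the dataframe variable is exactly what produces the 'NoneType' object is not subscriptable error on the next line.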

fubarx@lemmy.world on 01 May 2025 03:47

There are obviously fields with missing values.

You could loop and test each value against None. Or if you want to just skip when something’s missing, surround every assignment with a try and in the except clause, continue so the loop keeps going.
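
A rough sketch of the try/except-and-continue idea, using plain dictionaries instead of a dataframe. The sample rows and the `parse_points` helper are hypothetical. One thing worth knowing: scraped cells are strings, so a blank cell arrives as an empty string rather than None, and dropna does not treat an empty string as missing.

```python
# Rows as they might come out of the scraper: every cell is a string, and a
# blank cell is assumed to be an empty string (sample values are invented).
rows = [
    {"x-coordinate": "0", "Character": "#", "y-coordinate": "0"},
    {"x-coordinate": "", "Character": "*", "y-coordinate": "1"},  # missing x
    {"x-coordinate": "1", "Character": "*", "y-coordinate": "0"},
]


def parse_points(rows):
    """Return (x, y, char) tuples, skipping rows whose coordinates are unusable."""
    points = []
    for row in rows:
        try:
            x = int(row["x-coordinate"])
            y = int(row["y-coordinate"])
        except (KeyError, TypeError, ValueError):
            # Missing key, None, or non-numeric text: skip this row and keep going.
            continue
        points.append((x, y, row["Character"]))
    return points


points = parse_points(rows)
print(points)
```

Catching only the specific exceptions keeps unrelated bugs from being silently swallowed.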

If this is for a job interview, I’d err on the side of verbosity. Break it all into distinct, easy to read steps: load, process, output, logging, exception handling, comments, etc.

Otherwise you’ll get dinged on style and maintainability.
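
One possible way to lay out those steps, as a sketch only. The function names (`load`, `process`, `render`) are made up for illustration, and the load step is stubbed with hard-coded points so the example runs without a network call. Like the original code, it treats y as the row index counting from the top.

```python
import logging

logger = logging.getLogger(__name__)


def load():
    """Stand-in for the fetch-and-parse step; returns (x, y, char) tuples."""
    # Hypothetical data; real code would request the page and parse the table.
    return [(0, 0, "H"), (1, 0, "i"), (0, 1, "!")]


def process(points):
    """Build a 2D grid of characters from (x, y, char) tuples."""
    if not points:
        return []
    width = max(x for x, _, _ in points) + 1
    height = max(y for _, y, _ in points) + 1
    grid = [[" "] * width for _ in range(height)]
    for x, y, char in points:
        grid[y][x] = char
    logger.info("Built a %d x %d grid", width, height)
    return grid


def render(grid):
    """Join the grid rows into one printable string."""
    return "\n".join("".join(row) for row in grid)


def main():
    print(render(process(load())))


if __name__ == "__main__":
    main()
```

Keeping each step in its own function also means the grid logic can be tested without hitting the Google Doc at all.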

3rr4tt1c@programming.dev on 01 May 2025 04:10

If this is for a job interview, I’d err on the side of verbosity. Break it all into distinct, easy to read steps: load, process, output, logging, exception handling, comments, etc.

I turned it in days ago so there’s nothing I can do about it. But I’ll keep that in mind for the future.