Python vs. Go: The Hunt for the Best GPU Price


Finding Digital Gold: Building Reliable Systems to Monitor GPU Prices

In a world where thousands of shoppers compete for a handful of reasonably priced graphics cards, you need more than just persistence — you need a reliable, scalable architecture. When scanning hundreds of pages across dozens of retailers, your solution's resilience isn't just a nice-to-have, it's the foundation of a system that works consistently day after day. Like finding a needle in the digital haystack, this requires thoughtful design patterns and robust concurrency models. This article explores how to architect systems that can reliably monitor vast product catalogs for those rare affordable deals.

The "Why": Our Quest for the Best Graphics Card Price

Imagine you want to buy a new graphics card. The price changes daily across different pages of your favorite price comparison site. Checking all of them manually is tedious. This is a perfect job for a computer! Web scraping is the art of writing a program to automatically visit websites and extract specific information.

Our mission, should we choose to accept it, is to scrape the graphics card listings from idealo.fr. The listings are spread across multiple pages, and we want to do this as fast as possible.

If you visit the pages one by one, your program will spend most of its time waiting for the website to respond. We need to overlap that waiting by handling several pages concurrently. A fantastic pattern for this is the producer-consumer model.

  • The Producer: Its job is to generate the list of tasks. In our case, it will produce the URLs for each page of the graphics card listings.
  • The Consumer: Its job is to take a task (a page URL), perform the work (visit the page and extract the name and price of each card listed), and then grab the next task.

We'll have multiple consumers working at the same time. Let's see how to build this with Python and Go.

The Python Way: Simulating Concurrency with Coroutines

Python's journey into concurrency is fascinating. Before the modern async/await syntax, the foundation was laid by generators and the yield keyword. Understanding this foundation helps appreciate the power of the tools we have today. We can use yield to create coroutines that can be paused and resumed, allowing us to manually switch between tasks to simulate concurrency.

Let's build a system where a producer first creates a list of all URLs, and then a pool of consumers works through that list.

Coroutines for Cooperative Multitasking

Before diving into code, let's understand what "cooperative multitasking" means. Unlike preemptive multitasking where the operating system forcefully switches between tasks, cooperative multitasking relies on each task voluntarily yielding control back to a scheduler. Think of it like a polite conversation where each person speaks, then pauses to let others talk.

In our case, we'll define a "consumer" as a coroutine that, when called, does its work and then yields, signaling to our scheduler that it's done with its current task and is ready for another. The key word here is "cooperative" - our consumers will politely hand control back to the scheduler rather than hogging all the processing time.
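
Before we scale this up to the scraper, here is a minimal, standalone sketch of the two moves every coroutine-based design relies on: priming the generator with next() so it runs to its first yield, and resuming it with .send(). The greeter coroutine is purely illustrative and not part of the scraper.

def greeter():
    print("Ready.")
    while True:
        name = (yield)    # Pause here until a value is sent in.
        if name is None:  # Sentinel value to stop.
            break
        print(f"Hello, {name}!")

g = greeter()
next(g)           # Prime the coroutine: run it up to the first yield.
g.send("Ada")     # Resume it; prints "Hello, Ada!" and pauses again.
try:
    g.send(None)  # Stop signal; the generator returns, raising StopIteration.
except StopIteration:
    pass

The scraper below applies exactly this prime-and-send dance, just with URLs instead of names.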

# You would need to install these libraries:
# pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup
from collections import deque
import time

# The Producer is now a simple function that prepares all the work upfront.
def producer(num_pages):
    base_url = "https://www.idealo.fr/cat/16073/cartes-graphiques.html?q=graphics%20card"
    print("Producer is creating the list of all URLs...")
    urls = [f"{base_url}&page={i}" for i in range(1, num_pages + 1)]
    print(f"Producer finished. {len(urls)} URLs created.")
    return urls

# The Consumer is a coroutine that waits to be sent a URL.
def consumer(consumer_id, all_results):
    """
    This coroutine will:
    1. Start and print its ID.
    2. Enter an infinite loop.
    3. In the loop, it will `yield`, pausing itself and waiting for a URL.
    4. When a URL is sent to it via .send(), it resumes and scrapes the data.
    5. The scraped data is added to a shared list `all_results`.
    """
    print(f"Consumer {consumer_id}: Ready and waiting for a URL.")
    while True:
        try:
            # This is the pause point. It waits for the scheduler to .send() a URL.
            url = (yield)
            if url is None:
                # Sentinel value to stop the consumer.
                break

            print(f"Consumer {consumer_id}: Got task! Scraping {url}")
            response = requests.get(url) # This is a blocking call.

            if response.status_code == 200:
                soup = BeautifulSoup(response.text, 'html.parser')
                items = soup.find_all('div', class_='offerList-item')
                print(f"--- Consumer {consumer_id} found {len(items)} items ---")
                for item in items:
                    title_el = item.find('div', class_='offerList-item-title')
                    price_el = item.find('div', class_='offerList-item-price-value')
                    if title_el and price_el:
                        title = title_el.get_text(strip=True)
                        price = price_el.get_text(strip=True)
                        all_results.append({'id': consumer_id, 'title': title, 'price': price})
            else:
                print(f"Consumer {consumer_id}: Got status {response.status_code}")

        except GeneratorExit:
            # This is called when we .close() the generator.
            print(f"Consumer {consumer_id}: Shutting down.")
            break

# The "Scheduler": A manual loop to manage consumers and tasks.
def main_cooperative():
    # 1. Producer creates all the tasks.
    task_queue = deque(producer(num_pages=5))

    # 2. Create a shared list for all results.
    final_results = []

    # 3. Create a pool of consumers.
    num_consumers = 3
    consumer_pool = [consumer(i, final_results) for i in range(num_consumers)]

    # 4. Prime the coroutines by calling next() on them once.
    # This runs them to their first `yield` statement, making them ready to receive data.
    for c in consumer_pool:
        next(c)

    print("\nScheduler starting... Distributing tasks to consumers.")
    # 5. Distribute work until the queue is empty.
    while task_queue:
        for c in consumer_pool:
            if not task_queue:
                break

            url = task_queue.popleft()
            # Send the URL to the waiting consumer. This makes it resume.
            c.send(url)

    # 6. All tasks are done. Send a signal to stop the consumers.
    for c in consumer_pool:
        try:
            c.send(None) # or c.close()
        except StopIteration:
            pass # Generator is already closed.

    print(f"\nAll scraping finished. Total items scraped: {len(final_results)}")
    # print(final_results)


if __name__ == "__main__":
    main_cooperative()

Breaking Down the Worker Pool Mechanism

This pattern is powerful because it perfectly demonstrates how to use a small, fixed number of workers to process a potentially massive number of tasks. Let's break it down with an analogy: a restaurant kitchen.

Restaurant kitchen analogy for producer-consumer pattern showing chefs (consumers) working from a ticket rail (task queue) coordinated by a head chef (scheduler)
  1. The Ticket Rail (task_queue): The producer is like the front of house, taking orders and pinning tickets to the central rail. The rail can hold hundreds of tickets (our deque of URLs); the number of tasks can be virtually infinite.

  2. The Chefs (consumer_pool): We have a small crew of chefs (num_consumers = 3 in our case). The chefs don't create orders; each one takes a ticket from the rail, cooks the dish (scrapes the page), and comes back for the next ticket. These chefs are our limited, precious resources.

  3. Getting Ready to Cook (next(c)): Priming is like each chef calling out "Ready!" Each consumer runs its code to the first yield and then pauses, effectively saying, "I'm ready for my next ticket."

  4. The Head Chef (The Scheduler Loop): The while task_queue: loop is the head chef coordinating the kitchen. As long as there are tickets on the rail, it cycles through the chefs (for c in consumer_pool:), grabs a ticket (task_queue.popleft()), and hands it to the next free chef (c.send(url)).

  5. "Next Ticket!" (yield): When a chef finishes a dish (scrapes a URL), they return to the rail for the next ticket. In our code, this is when the consumer's internal logic loops back to the url = (yield) statement: it pauses automatically, waiting for the scheduler to .send() the next task.

This model is incredibly scalable. If the task_queue has 10 URLs or 10 million URLs, we still only need our 3 consumer instances in memory. The program will just run for longer, with the scheduler efficiently dispatching tasks to the small pool of workers until the job is done.

How This Simpler Model Works

This version is much more direct:

  1. One-Way Communication: The scheduler .send()s URLs to the consumers. The consumers no longer yield data back. They simply add their findings to a shared final_results list.
  2. Clearer yield: The line url = (yield) now has a single purpose: pause the function until a URL is sent to it. This is the fundamental feature of a coroutine.
  3. Cooperative, Not Concurrent: It's crucial to understand that this is cooperative multitasking, not true concurrency. The requests.get(url) call is blocking. When one consumer is making a network request, the entire program waits. We are just manually switching between tasks in an orderly fashion.

This "native" approach is a fantastic way to understand the mechanics of yield and coroutines, which are the building blocks for Python's modern asyncio library.

The Modern Solution: asyncio

To solve the "blocking" problem and achieve true I/O concurrency, Python provides the asyncio library. It gives us a professional, highly-performant event loop (our scheduler) and requires I/O libraries (like aiohttp) that are specially designed to be non-blocking.

Here is a conceptual look at the asyncio version. Notice how async def, await, and asyncio.Queue replace our manual logic.

# You would need to install these libraries:
# pip install aiohttp beautifulsoup4
import asyncio
import aiohttp
from bs4 import BeautifulSoup
import time

# Producer async function to create tasks
async def producer(queue, num_pages):
    base_url = "https://www.idealo.fr/cat/16073/cartes-graphiques.html?q=graphics%20card"
    print("Producer is creating tasks and adding them to the queue...")

    for i in range(1, num_pages + 1):
        url = f"{base_url}&page={i}"
        print(f"Producer: Adding {url} to queue")
        await queue.put(url)

    # Add sentinel values to signal consumers to stop: one per consumer.
    # (Keep this count in sync with num_consumers in main_async.)
    for _ in range(3):
        await queue.put(None)
    print("Producer finished adding all URLs to queue.")

# Consumer async function that processes URLs from the queue
async def consumer(consumer_id, queue, results):
    print(f"Consumer {consumer_id}: Ready and waiting for URLs.")
    while True:
        # Wait for a URL from the queue - THIS IS NON-BLOCKING
        url = await queue.get()

        if url is None:  # Sentinel value to stop
            print(f"Consumer {consumer_id}: Received stop signal, shutting down.")
            queue.task_done()
            break

        print(f"Consumer {consumer_id}: Processing {url}")
        try:
            # The key difference: aiohttp's request is non-blocking.
            # While this consumer awaits the response, the event loop
            # can run other consumers. (For brevity we open a new
            # ClientSession per URL; in production you would typically
            # reuse one session per consumer.)
            async with aiohttp.ClientSession() as session:
                async with session.get(url) as response:
                    if response.status == 200:
                        # Parse HTML with BeautifulSoup
                        html = await response.text()
                        soup = BeautifulSoup(html, 'html.parser')
                        items = soup.find_all('div', class_='offerList-item')

                        print(f"--- Consumer {consumer_id} found {len(items)} items ---")
                        for item in items:
                            title_el = item.find('div', class_='offerList-item-title')
                            price_el = item.find('div', class_='offerList-item-price-value')

                            if title_el and price_el:
                                title = title_el.get_text(strip=True)
                                price = price_el.get_text(strip=True)
                                results.append({
                                    'id': consumer_id,
                                    'title': title,
                                    'price': price
                                })
                    else:
                        print(f"Consumer {consumer_id}: Got status {response.status}")

        except Exception as e:
            print(f"Consumer {consumer_id}: Error processing {url}: {e}")
        finally:
            # Mark the task as done
            queue.task_done()

# Main function using asyncio
async def main_async():
    # Create a queue to communicate between producer and consumers
    task_queue = asyncio.Queue()
    results = []

    # Start the producer
    producer_task = asyncio.create_task(producer(task_queue, num_pages=5))

    # Start the consumers
    num_consumers = 3
    consumer_tasks = [
        asyncio.create_task(consumer(i, task_queue, results))
        for i in range(num_consumers)
    ]

    # Wait for the producer to finish
    await producer_task

    # Wait for all consumers to process all items
    await task_queue.join()

    # Wait for all consumers to finish
    await asyncio.gather(*consumer_tasks)

    print(f"\nAll scraping finished. Total items scraped: {len(results)}")
    # We could print or save the results here
    # print(results)

if __name__ == "__main__":
    start_time = time.time()
    asyncio.run(main_async())
    print(f"Total execution time: {time.time() - start_time:.2f} seconds")

Breaking Down the Asyncio Implementation

The asyncio version is much more elegant than our manual scheduler. Let's break down how it works:

  1. Modern Event Loop: Instead of our manual "head chef" scheduler, asyncio provides a professional event loop. This loop efficiently manages which coroutine runs when, automatically switching between them at await points.

  2. The Queue (asyncio.Queue): Unlike our manual deque, the asyncio.Queue is specifically designed for concurrent access. The await queue.get() and await queue.put() operations are cooperative - they allow other code to run while waiting.

  3. True Non-Blocking I/O: When a consumer reaches await session.get(url), it doesn't block the entire program like requests.get(url) did. Instead, it tells the event loop: "I'm waiting for network I/O, please run something else until my data arrives."

  4. Cooperative Tasks: Each asyncio.create_task() creates a task that the event loop manages. These tasks cooperatively yield control at await points, allowing other tasks to run.

  5. Elegant Flow Control: The queue.task_done() and await queue.join() pattern provides a clean way to know when all work is complete. No need to manually track which consumers are working.

Here's a visual representation of the flow:

asyncio.run(main_async())
    ├─► task_queue = asyncio.Queue()
    ├─► producer_task = asyncio.create_task(producer(...))
    │    └─► for i in range(...):
    │        └─► await queue.put(url)  ─┐
    │                                   │
    ├─► consumer_tasks = [              │
    │      asyncio.create_task(consumer(0, ...)),
    │      asyncio.create_task(consumer(1, ...)),
    │      asyncio.create_task(consumer(2, ...))
    │    ]                              │
    │    └─► while True:                │
    │        └─► url = await queue.get() ◄┘
    │        └─► async with session.get(url): # Non-blocking!
    │        └─► queue.task_done()
    ├─► await producer_task  # Wait for producer to finish
    ├─► await queue.join()   # Wait for queue to be empty
    └─► await asyncio.gather(*consumer_tasks)  # Clean up

The real magic happens during network I/O. Let's say we have three consumers (A, B, C):

  1. Consumer A: Starts downloading a page (slow operation)
  2. Event loop: "A is waiting for I/O, let's run B"
  3. Consumer B: Also starts downloading (another slow operation)
  4. Event loop: "B is also waiting, let's run C"
  5. Consumer C: Starts its download
  6. Event loop: "Everyone's waiting for I/O. I'll monitor all three connections."
  7. When any download completes, the event loop wakes up the corresponding consumer.

This creates a dramatic speedup compared to our blocking coroutine version. If each page takes 1 second to download, the blocking version takes about 5 seconds for 5 pages. With 3 consumers, the asyncio version finishes in roughly 2 seconds: two rounds of overlapping downloads, first three pages, then the remaining two.
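
To make that timing claim concrete without hammering a real website, here is a self-contained sketch that keeps the producer/consumer/queue structure but replaces the download with asyncio.sleep(1). The fake_producer and fake_consumer names and the 1-second sleep are illustrative stand-ins, not part of the scraper. With 5 simulated pages and 3 consumers, it finishes in about 2 seconds rather than 5.

import asyncio
import time

async def fake_producer(queue, num_pages, num_consumers):
    for i in range(1, num_pages + 1):
        await queue.put(f"page-{i}")
    for _ in range(num_consumers):
        await queue.put(None)  # One stop signal per consumer.

async def fake_consumer(queue):
    while True:
        url = await queue.get()
        if url is None:
            queue.task_done()
            break
        await asyncio.sleep(1)  # Stand-in for a 1-second page download.
        queue.task_done()

async def demo():
    queue = asyncio.Queue()
    consumers = [asyncio.create_task(fake_consumer(queue)) for _ in range(3)]
    await fake_producer(queue, num_pages=5, num_consumers=3)
    await queue.join()  # Wait until every queued item has been processed.
    await asyncio.gather(*consumers)

start = time.perf_counter()
asyncio.run(demo())
print(f"Elapsed: {time.perf_counter() - start:.1f}s")  # About 2 seconds, not 5.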

asyncio is the standard for high-performance I/O in Python. It's more verbose than the Go equivalent but essential for modern network applications.

The Go Way: Concurrency as a Native Language Feature

Go was built for concurrency. It uses goroutines (lightweight threads) and channels (pipes for communication) to make concurrent programming feel natural.

Our Go Scraper with go-rod

Many modern sites use JavaScript to load content. A simple HTTP request won't work. We need a real browser. The go-rod package lets us control a browser programmatically, making it perfect for this task.

// You would need to install go-rod:
// go get github.com/go-rod/rod
package main

import (
	"fmt"
	"log"
	"strings"
	"sync"
	"time"

	"github.com/go-rod/rod"
	"github.com/go-rod/rod/lib/launcher"
)

// Producer: sends page URLs into a channel.
func producer(baseURL string, numPages int, urlChannel chan<- string) {
	fmt.Println("Producer starting...")
	for i := 1; i <= numPages; i++ {
		pageURL := fmt.Sprintf("%s&page=%d", baseURL, i)
		fmt.Println("Produced:", pageURL)
		urlChannel <- pageURL
	}
	close(urlChannel) // Signal that no more URLs are coming.
	fmt.Println("Producer finished.")
}

// Consumer: receives a URL, launches a browser, and scrapes the page.
func consumer(id int, urlChannel <-chan string, wg *sync.WaitGroup) {
	defer wg.Done()
	fmt.Printf("Consumer %d starting...\n", id)

	// Launch a browser for each consumer. For high-volume scraping,
	// you might share a single browser instance between consumers.
	browser := rod.New().ControlURL(launcher.New().MustLaunch()).MustConnect()
	defer browser.MustClose()

	for url := range urlChannel {
		fmt.Printf("Consumer %d consuming: %s\n", id, url)

		page := browser.MustPage(url).MustWaitLoad()

		// MustElement blocks until the main offer list appears in the DOM.
		page.MustElement(".offerList")

		// Find all product containers
		// Note: These selectors might change if the website updates.
		items, err := page.Elements(".offerList-item")
		if err != nil {
			log.Printf("Consumer %d: Could not find items on %s: %v", id, url, err)
			continue
		}

		fmt.Printf("--- Consumer %d found %d items on %s ---\n", id, len(items), url)
		for _, item := range items {
			// In go-rod, it's often more reliable to check for an element's existence before using it.
			titleEl, err := item.Element(".offerList-item-title")
			if err != nil { continue }
			priceEl, err := item.Element(".offerList-item-price-value")
			if err != nil { continue }

			title := strings.TrimSpace(titleEl.MustText())
			price := strings.TrimSpace(priceEl.MustText())
			fmt.Printf("  - %s: %s\n", title, price)
		}
		// Close the page to free resources, then pause briefly to be
		// respectful to the server.
		page.MustClose()
		time.Sleep(time.Second)
	}
	fmt.Printf("Consumer %d shutting down.\n", id)
}

func main() {
	baseUrl := "https://www.idealo.fr/cat/16073/cartes-graphiques.html?q=graphics%20card"
	numPagesToScrape := 3

	urlChannel := make(chan string, numPagesToScrape)
	var wg sync.WaitGroup

	// Start 3 consumer goroutines.
	for i := 1; i <= 3; i++ {
		wg.Add(1)
		go consumer(i, urlChannel, &wg)
	}

	go producer(baseUrl, numPagesToScrape, urlChannel)

	wg.Wait()
	fmt.Println("All work finished.")
}

The Go version is remarkably clean. The go keyword, channels, and sync.WaitGroup handle all the complex synchronization. go-rod provides the power to deal with modern, JavaScript-heavy websites, and it fits perfectly into Go's concurrent model.

Here's a visual representation of the flow in Go:

main()
    ├─► urlChannel := make(chan string, numPagesToScrape)
    ├─► wg := sync.WaitGroup{}
    ├─► for i := 1; i <= 3; i++ {
    │    └─► wg.Add(1)
    │    └─► go consumer(i, urlChannel, &wg) ──┐
    │        └─► browser := rod.New()...       │
    │        └─► for url := range urlChannel   │
    │            └─► page := browser.MustPage(url)
    │            └─► // scrape data            │
    │        └─► wg.Done()                     │
    │                                          │
    ├─► go producer(baseUrl, numPagesToScrape, urlChannel)
    │    └─► for i := 1; i <= numPages; i++ {
    │        └─► urlChannel <- pageURL ────────┘
    │    └─► close(urlChannel)
    └─► wg.Wait() // Block until all consumers finish

The key differences from the asyncio model:

  1. Parallelism Across OS Threads: Goroutines are lightweight tasks that the Go runtime multiplexes onto a pool of OS threads, so unlike asyncio's tasks, which all run in a single thread, goroutines can execute in parallel across CPU cores.

  2. Channel-Based Communication: Instead of a queue with .put() and .get() operations, Go uses channels with the <- operator, which are built into the language.

  3. Automatic Scheduling: Go's runtime handles the scheduling of goroutines across available CPU cores automatically, without explicit await points.

  4. WaitGroup for Synchronization: Instead of await queue.join(), Go uses sync.WaitGroup to track when all consumers are done.

  5. Channel Closing as Signal: Instead of sending sentinel values (None), Go idiomatically signals completion by closing the channel, which causes the range loop to exit, as the sketch below shows.
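
To isolate points 2, 4, and 5 from the browser-automation details, here is a minimal, self-contained sketch that keeps only the concurrency skeleton: time.Sleep stands in for the page download (purely illustrative), closing the channel replaces the None sentinels, and the WaitGroup replaces queue.join().

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	urls := make(chan string, 5)
	var wg sync.WaitGroup

	// Start 3 consumers; each exits once the channel is closed and drained.
	for i := 1; i <= 3; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for url := range urls { // Ends automatically after close(urls).
				time.Sleep(time.Second) // Stand-in for downloading the page.
				fmt.Printf("Consumer %d finished %s\n", id, url)
			}
		}(i)
	}

	// Producer: send 5 URLs, then close the channel instead of sending sentinels.
	for i := 1; i <= 5; i++ {
		urls <- fmt.Sprintf("page-%d", i)
	}
	close(urls)

	wg.Wait() // Block until every consumer goroutine has returned.
	fmt.Println("All work finished.")
}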

Final Comparison

Feature              | Python (Native yield)                | Python (asyncio)                        | Go
---------------------|--------------------------------------|-----------------------------------------|------------------------------------------
Concurrency          | Cooperative multitasking (simulated) | True I/O concurrency (single-threaded)  | True parallelism (multi-threaded)
Complexity           | High (manual scheduler)              | Medium (requires the async ecosystem)   | Low (built into the language)
Verbosity            | High                                 | Medium                                  | Low
Real-world scraping  | Possible, but inefficient for I/O    | Excellent, with a rich ecosystem        | Excellent, especially with tools like go-rod

Conclusion: The Right Tool for the Job

Starting with Python's native coroutines shows us why tools like asyncio were created. They automate the complex scheduling logic and enable true I/O concurrency, making them essential for modern network applications in Python.

However, when we compare this to Go, we see a language where concurrency was not an afterthought, but a foundational principle. The simplicity and power of goroutines and channels make writing robust, high-performance concurrent programs feel natural. For a task like our GPU price hunt, Go's brevity and raw power are difficult to ignore. It doesn't just offer a way to do concurrency; it offers a better way to think about it.