Web Scraping with Python — Complete Guide

Web Scraping with Python

A complete technical reference — tools, patterns, ethics & code

What is web scraping?

Automated extraction of data from websites

Web scraping is the process of programmatically fetching web pages and extracting structured data from HTML, JSON, or XML content. Python is the dominant language for scraping due to its rich ecosystem of HTTP clients, HTML parsers, and browser automation tools.

When to use each approach

Static HTML

requests + BeautifulSoup

Content already present in raw HTML. No JavaScript execution needed. Fast, lightweight, ideal for most news sites, Wikipedia, forums.

JS-rendered

Selenium / Playwright

Content loaded by JavaScript after page load — SPAs, infinite scroll, AJAX-driven tables. Needs a real or headless browser.

Large scale

Scrapy framework

Multi-page crawling with link following, item pipelines, item validation, and rate limiting (AutoThrottle) out of the box. Distributed crawling is possible through extensions such as scrapy-redis.

API-first

Direct API calls

Many sites expose undocumented internal APIs that the browser calls via XHR/Fetch. Inspecting the browser's DevTools Network tab often reveals endpoints that return clean, structured JSON.
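As a rough illustration of the API-first approach, the sketch below walks a paginated JSON endpoint. The response shape (`items`, `next_page`), the canned responses, and the `fetch_page` callable are all hypothetical stand-ins, not any real site's API. The fetcher is injected so the pagination logic can be tried offline; in practice it might wrap something like `requests.get(url, params={"page": n}, timeout=10).json()`.

```python
import json
from typing import Callable, Iterator

# Hypothetical canned responses standing in for an internal JSON endpoint.
FAKE_RESPONSES = {
    1: '{"items": [{"id": 1, "name": "Widget"}, {"id": 2, "name": "Gadget"}], "next_page": 2}',
    2: '{"items": [{"id": 3, "name": "Gizmo"}], "next_page": null}',
}

def fake_fetch_page(page: int) -> dict:
    """Stand-in for an HTTP call; returns the decoded JSON body for one page."""
    return json.loads(FAKE_RESPONSES[page])

def iter_items(fetch_page: Callable[[int], dict], start: int = 1,
               max_pages: int = 100) -> Iterator[dict]:
    """Yield every item, following the (assumed) next_page cursor."""
    page = start
    for _ in range(max_pages):          # hard cap guards against cursor loops
        body = fetch_page(page)
        yield from body.get("items", [])
        page = body.get("next_page")
        if page is None:
            break

names = [item["name"] for item in iter_items(fake_fetch_page)]
print(names)
```

Because the data already arrives structured, no HTML parsing or selectors are needed; the only logic left is following the cursor and capping the number of pages.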

Core flow

Request (HTTP GET/POST) › Parse (HTML / JSON) › Extract (CSS / XPath) › Clean (regex / strip) › Store (CSV / DB / API)

Each stage of the scraping pipeline has its own tools and best practices.
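The stages can be sketched end to end on a small inline document. This is a minimal illustration that uses only the standard library so it runs without installs; the HTML snippet, class names, and field names are made up for the example, and the Request stage is replaced by a hard-coded string.

```python
import csv
import io
import re
from html.parser import HTMLParser

# Stand-in for the Request stage: HTML that would normally come from an HTTP GET.
HTML = """
<ul>
  <li class="book"><span class="title"> Dune </span><span class="price">£ 9.99</span></li>
  <li class="book"><span class="title">Emma</span><span class="price">£12.50 </span></li>
</ul>
"""

class SpanCollector(HTMLParser):
    """Parse stage: collect (class, text) pairs for every <span> element."""
    def __init__(self):
        super().__init__()
        self.current = None      # class of the <span> we are inside, if any
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.current = dict(attrs).get("class")

    def handle_data(self, data):
        if self.current:
            self.pairs.append((self.current, data))

    def handle_endtag(self, tag):
        if tag == "span":
            self.current = None

parser = SpanCollector()
parser.feed(HTML)

# Extract stage: split the collected text into titles and prices.
titles = [text for cls, text in parser.pairs if cls == "title"]
prices = [text for cls, text in parser.pairs if cls == "price"]

# Clean stage: strip whitespace and pull the number out of the price string.
rows = []
for title, price in zip(titles, prices):
    amount = float(re.search(r"\d+(?:\.\d+)?", price).group())
    rows.append({"title": title.strip(), "price": amount})

# Store stage: write CSV to an in-memory buffer (a file works the same way).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "price"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

In a real scraper the Parse and Extract stages would more likely use BeautifulSoup or lxml; the hand-written parser here only keeps the example dependency-free.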

Library comparison

Library	Renders JS	Speed	Best for	Install
requests	✗	Very fast	HTTP fetch, APIs	`pip install requests`
BeautifulSoup4	✗	Fast	HTML/XML parsing	`pip install beautifulsoup4`
lxml	✗	Very fast	XPath, large docs	`pip install lxml`
Scrapy	✗	Fast async	Large crawls, pipelines	`pip install scrapy`
Selenium	✓	Slow	Legacy JS sites	`pip install selenium`
Playwright	✓	Moderate	Modern SPAs, async	`pip install playwright`, then `playwright install`
httpx	✗	Very fast	Async HTTP, HTTP/2	`pip install httpx`

Selector types

CSS selector · XPath · Regex · JSON / jmespath

CSS selector

Familiar from web dev. Supported natively by BeautifulSoup and Playwright.

soup.select('table.results tr td:nth-child(2)')
soup.select_one('h1.title').get_text(strip=True)
soup.select('a[href^="/product/"]')
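For comparison, the other selector types named above can be sketched with the standard library alone. The markup and JSON below are invented for the example. Note that `xml.etree.ElementTree` understands only a subset of XPath and needs well-formed markup; for real pages, lxml (listed in the comparison table) accepts full XPath 1.0 expressions.

```python
import json
import re
import xml.etree.ElementTree as ET

SNIPPET = """<div>
  <table class="results">
    <tr><td>Widget</td><td>9.99</td></tr>
    <tr><td>Gadget</td><td>19.50</td></tr>
  </table>
</div>"""

# XPath: second cell of every row in the results table.
root = ET.fromstring(SNIPPET)
second_cells = [td.text for td in root.findall(".//table[@class='results']/tr/td[2]")]

# Regex: workable for small, regular patterns, brittle for nested markup.
prices = re.findall(r"<td>(\d+\.\d{2})</td>", SNIPPET)

# JSON: when the data arrives as JSON, plain key access replaces selectors.
# (The third-party jmespath package offers a query syntax for deeper nesting.)
payload = json.loads('{"items": [{"name": "Widget"}, {"name": "Gadget"}]}')
names = [item["name"] for item in payload["items"]]

print(second_cells, prices, names)
```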

Anti-blocking techniques

Rotate User-Agents

Send realistic browser headers. Use the fake-useragent library to cycle headers per request.

Rate limiting

Add random delays between requests: time.sleep(random.uniform(1, 3)) — mimics human browsing pace.

Proxy rotation

Route requests through rotating residential or datacenter proxies to avoid IP bans on high-volume scraping.

Session handling

Use requests.Session() to persist cookies across requests — required for sites behind login walls.
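Two of these techniques, header rotation and randomised delays, fit in a few lines. The sketch below is illustrative only: the User-Agent strings are placeholders rather than real browser signatures, and `build_headers` and `polite_delay` are hypothetical helper names. The sleep function is injectable so the demo finishes instantly.

```python
import random
import time

# Placeholder User-Agent strings; real ones would come from current browsers
# or from a library such as fake-useragent.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_0) ExampleBrowser/1.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
]

def build_headers(rng=random):
    """Pick a User-Agent at random and return request headers."""
    return {"User-Agent": rng.choice(USER_AGENTS), "Accept-Language": "en-GB,en;q=0.9"}

def polite_delay(low=1.0, high=3.0, sleep=time.sleep, rng=random):
    """Sleep for a random interval in [low, high] seconds and return it."""
    pause = rng.uniform(low, high)
    sleep(pause)
    return pause

# Demo with a no-op sleep so the example finishes instantly.
headers = build_headers()
pause = polite_delay(sleep=lambda seconds: None)
print(headers["User-Agent"], round(pause, 2))
```

With requests, each call would then look roughly like `session.get(url, headers=build_headers(), timeout=10)` followed by `polite_delay()`.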

Pattern 1 — static HTML scraper

import requests
from bs4 import BeautifulSoup
import time, random

headers = {"User-Agent": "Mozilla/5.0 (compatible; MyBot/1.0)"}
session = requests.Session()
session.headers.update(headers)

def scrape_page(url):
    resp = session.get(url, timeout=10)
    resp.raise_for_status()                  # raises on 4xx / 5xx
    time.sleep(random.uniform(0.5, 1.5))     # polite pause after every request
    # Pass bytes so the parser can detect the encoding from the page itself
    soup = BeautifulSoup(resp.content, "lxml")
    return soup

soup = scrape_page("https://books.toscrape.com/")
books = soup.select("article.product_pod")   # CSS selector

for book in books:
    title = book.select_one("h3 a")["title"]
    price = book.select_one("p.price_color").text.strip()
    rating = book.select_one("p.star-rating")["class"][1]   # second class holds the rating word
    print(f"{title} | {price} | {rating}")

Pattern 2 — async multi-page (httpx + asyncio)

import asyncio, httpx
from bs4 import BeautifulSoup

async def fetch(client, url):
    r = await client.get(url, timeout=15)
    r.raise_for_status()                     # raises httpx.HTTPStatusError on 4xx / 5xx
    return r.text

async def main(urls):
    async with httpx.AsyncClient() as client:
        tasks = [fetch(client, u) for u in urls]
        pages = await asyncio.gather(*tasks) # results keep the order of urls
    return pages

urls = [f"https://example.com/page/{i}" for i in range(1, 6)]
pages = asyncio.run(main(urls))              # fetches all 5 concurrently
soups = [BeautifulSoup(html, "lxml") for html in pages]   # parse after fetching
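`asyncio.gather` starts every request at once, which is fine for five URLs but can overwhelm a server with hundreds. One common way to cap the number of in-flight requests is an `asyncio.Semaphore`. The sketch below assumes a hypothetical `fake_fetch` coroutine in place of the real `client.get` call so that it runs offline; the peak counter exists only to make the limit visible.

```python
import asyncio

async def fake_fetch(url: str) -> str:
    """Stand-in for client.get(url); sleeps briefly instead of using the network."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"

async def fetch_all(urls, limit=2, fetch=fake_fetch):
    """Fetch every URL with at most `limit` requests in flight at once."""
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def bounded(url):
        nonlocal in_flight, peak
        async with semaphore:                # waits here once `limit` is reached
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fetch(url)
            finally:
                in_flight -= 1

    pages = await asyncio.gather(*(bounded(u) for u in urls))
    return pages, peak

urls = [f"https://example.com/page/{i}" for i in range(1, 6)]
pages, peak = asyncio.run(fetch_all(urls))
print(len(pages), "pages fetched, peak concurrency", peak)
```

Swapping `fake_fetch` for the `fetch(client, url)` coroutine from the pattern above (with the client bound beforehand) would apply the same limit to real requests.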

Pattern 3 — JS-rendered with Playwright

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://spa-site.example.com")
    page.wait_for_selector("table.data-loaded")  # wait for JS render

    rows = page.query_selector_all("table tr")
    for row in rows:
        cells = [td.inner_text() for td in row.query_selector_all("td")]
        print(cells)
    browser.close()

Pattern 4 — save to CSV

import csv

data = [{"title": "Book A", "price": "£9.99"}]   # your scraped list

with open("results.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "price"])
    writer.writeheader()
    writer.writerows(data)
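When a crawl is re-run, a database can reject rows it has already seen, which a CSV file cannot. A minimal sketch with the standard-library `sqlite3` module follows; the `books` table, its schema, and the choice of `title` as the unique key are assumptions made for the example.

```python
import sqlite3

data = [
    {"title": "Book A", "price": "£9.99"},
    {"title": "Book B", "price": "£4.50"},
    {"title": "Book A", "price": "£9.99"},   # duplicate from a re-crawl
]

conn = sqlite3.connect(":memory:")           # use a file path to persist
conn.execute("CREATE TABLE IF NOT EXISTS books (title TEXT PRIMARY KEY, price TEXT)")

# Named placeholders let executemany() consume the dicts directly;
# INSERT OR IGNORE skips rows whose title is already stored.
with conn:                                   # commits on success, rolls back on error
    conn.executemany(
        "INSERT OR IGNORE INTO books (title, price) VALUES (:title, :price)", data
    )

stored = conn.execute("SELECT title, price FROM books ORDER BY title").fetchall()
print(stored)
conn.close()
```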

Production scraping pipeline

1. Discover

Sitemap.xml, seed URLs, link following. Tools: Scrapy Spider, urllib.robotparser

2. Fetch

HTTP GET with retries, timeout, session cookies, proxy rotation, rate-limit delays

3. Render

For JS sites only: headless Chrome via Playwright. For static: skip this stage

4. Parse

BeautifulSoup / lxml (HTML) · json.loads (APIs) · pdfplumber (PDFs)

5. Clean

strip whitespace · normalise dates · validate types · deduplicate records

6. Store

CSV (csv) · SQLite (sqlite3) · PostgreSQL (psycopg2) · MongoDB (pymongo)
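The Clean stage is the easiest to skip and the one that most affects data quality. The sketch below shows one possible shape for it, covering the four tasks listed: stripping whitespace, normalising dates, validating types, and deduplicating. The record fields, the two accepted date formats, and the deduplication key are assumptions for illustration.

```python
from datetime import datetime

RAW = [
    {"title": "  Dune ", "price": "9.99", "published": "01/06/1965"},
    {"title": "Emma", "price": "n/a", "published": "1815-12-23"},
    {"title": "Dune", "price": "9.99", "published": "1965-06-01"},
]

DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")      # assumed input formats

def normalise_date(text):
    """Return an ISO 8601 date string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

def clean(records):
    """Strip whitespace, coerce types, drop invalid rows, and deduplicate."""
    seen, cleaned = set(), []
    for rec in records:
        title = rec["title"].strip()
        try:
            price = float(rec["price"])
        except ValueError:
            continue                          # type validation failed: skip the row
        row = {"title": title, "price": price,
               "published": normalise_date(rec["published"])}
        key = (row["title"], row["published"])
        if key not in seen:                   # keep only the first occurrence
            seen.add(key)
            cleaned.append(row)
    return cleaned

cleaned = clean(RAW)
print(cleaned)
```

Here the second record is dropped because its price is not numeric, and the third is dropped as a duplicate once its date has been normalised to the same ISO form as the first.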

Error handling pattern

import time, requests

def fetch_with_retry(url, retries=3, backoff=2):
    for attempt in range(retries):
        try:
            r = requests.get(url, timeout=10)
            r.raise_for_status()
            return r
        except requests.HTTPError as e:
            status = e.response.status_code
            if status == 404:
                return None                         # skip missing pages
            if status != 429 and status < 500:
                raise                               # other 4xx: retrying will not help
            # 429 (rate-limited) and 5xx fall through to the back-off below
        except requests.RequestException:
            pass                                    # timeout / connection error: retry
        if attempt < retries - 1:
            time.sleep(backoff ** attempt)          # 1 s, 2 s, 4 s, ...
    return None                                     # all attempts failed

Common HTTP status codes

Code	Meaning	What to do
200	OK	Request succeeded. Proceed to parse.
301/302	Redirect	requests follows these automatically. Check the final URL (resp.url).
403	Forbidden	Usually a headers or auth problem. Send a realistic User-Agent or the required cookies.
404	Not Found	Skip the page and log the URL for review.
429	Too Many Requests	Rate limited. Back off and retry with exponential delay.
503	Service Unavailable	The server may be overloaded. Retry later.
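The status codes above can be folded into a single dispatch function that a crawler consults after each response. This is a sketch of one possible policy, not a standard; the action names and the `next_action` helper are hypothetical. Redirects are omitted because requests follows them before the caller sees a status.

```python
def next_action(status: int) -> str:
    """Map an HTTP status code to what the scraper should do next."""
    if 200 <= status < 300:
        return "parse"
    if status == 429 or 500 <= status < 600:
        return "retry"            # back off, then try again
    if status == 404:
        return "skip"             # log the URL and move on
    if status in (401, 403):
        return "fix-request"      # check headers, cookies, or credentials
    return "raise"                # anything unexpected: surface it

for code in (200, 403, 404, 429, 503):
    print(code, next_action(code))
```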

Legal & ethical principles

Respect robots.txt

Always parse and honour the site’s robots.txt before crawling. In Python, use urllib.robotparser.RobotFileParser. Disallowed paths must not be fetched.

Rate limit your requests

Do not hammer servers. Add a random delay of at least 1–2 s between requests. High-volume scraping can degrade service for real users, which can create legal liability in some jurisdictions.

Check the Terms of Service

Many sites explicitly prohibit automated scraping in their ToS. Violating the ToS may constitute breach of contract, even if the data is publicly visible. Prefer official APIs when available.

Personal data & copyright

Under GDPR (EU) and similar laws, collecting personal data without a lawful basis is unlawful. Scraped content remains copyrighted by its authors, so re-publishing it without permission may infringe copyright.

robots.txt check — code

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

if rp.can_fetch("*", "https://example.com/products"):
    print("Allowed, proceeding")
else:
    print("Disallowed by robots.txt, skipping")
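The same parser can be exercised offline by feeding it rules directly, which is handy for unit tests and for reading a site's Crawl-delay. In this sketch the rules and the `MyBot` agent name are invented for the example; `parse()` and `crawl_delay()` are part of the standard-library `RobotFileParser`.

```python
from urllib.robotparser import RobotFileParser

# Example rules; a real crawler would download them from /robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())            # parse() takes an iterable of lines

AGENT = "MyBot"                               # hypothetical crawler name
allowed = rp.can_fetch(AGENT, "https://example.com/products")
blocked = rp.can_fetch(AGENT, "https://example.com/admin/users")
delay = rp.crawl_delay(AGENT)                # None when no Crawl-delay is given

print(allowed, blocked, delay)
```

Passing the crawler's own agent name instead of `"*"` lets the parser apply any rules a site has written specifically for that bot.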

Use the official API when one exists

Many services (Twitter/X, Reddit, GitHub, Google Maps, OpenWeatherMap) provide free or paid APIs that are faster, more reliable, and legally unambiguous compared to scraping. Always check for a public API before building a scraper. Scraping should be a last resort when no API exists.

GDPR · CFAA (US) · robots.txt · Terms of Service · Copyright law · Rate limiting
