Web Scraping with Python — Complete Guide

A complete technical reference — tools, patterns, ethics & code
What is web scraping?
Automated extraction of data from websites
Web scraping is the process of programmatically fetching web pages and extracting structured data from HTML, JSON, or XML content. Python is the dominant language for scraping due to its rich ecosystem of HTTP clients, HTML parsers, and browser automation tools.
When to use each approach
Static HTML
requests + BeautifulSoup
Content already present in raw HTML. No JavaScript execution needed. Fast, lightweight, ideal for most news sites, Wikipedia, forums.
JS-rendered
Selenium / Playwright
Content loaded by JavaScript after page load — SPAs, infinite scroll, AJAX-driven tables. Needs a real or headless browser.
Large scale
Scrapy framework
Multi-page crawling with link following, pipelines, item validation, rate limiting, and distributed crawling support out of the box.
API-first
Direct API calls
Many sites expose undocumented internal APIs the browser calls via XHR/Fetch. Inspecting DevTools Network tab often reveals cleaner endpoints.
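When such an endpoint exists, calling it directly is usually simpler and more robust than parsing HTML. A minimal sketch, where the /api/products URL and the response shape are hypothetical:

import requests

# Hypothetical JSON endpoint discovered in the DevTools Network tab
resp = requests.get(
    "https://example.com/api/products?page=1",
    headers={"User-Agent": "Mozilla/5.0 (compatible; MyBot/1.0)"},
    timeout=10,
)
resp.raise_for_status()
for item in resp.json().get("products", []):  # response shape is an assumption
    print(item.get("name"), item.get("price"))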
Core flow
1. Request: HTTP GET/POST
2. Parse: HTML / JSON
3. Extract: CSS / XPath
4. Clean: regex / strip
5. Store: CSV / DB / API
Each stage of the scraping pipeline has its own tools and best practices.
Library comparison
Library | Renders JS | Speed | Best for | Install
requests | No | Very fast | HTTP fetch, APIs | pip install requests
BeautifulSoup4 | No | Fast | HTML/XML parsing | pip install beautifulsoup4
lxml | No | Very fast | XPath, large docs | pip install lxml
Scrapy | No | Fast (async) | Large crawls, pipelines | pip install scrapy
Selenium | Yes | Slow | Legacy JS sites | pip install selenium
Playwright | Yes | Moderate | Modern SPAs, async | pip install playwright
httpx | No | Very fast | Async HTTP, HTTP/2 | pip install httpx
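Of these, Scrapy is the only full framework rather than a library. A minimal spider, sketched against books.toscrape.com (run it with scrapy runspider spider.py -o books.csv):

import scrapy

class BooksSpider(scrapy.Spider):
    name = "books"
    start_urls = ["https://books.toscrape.com/"]

    def parse(self, response):
        # Scrapy's own CSS selectors; ::attr() and ::text extract values
        for book in response.css("article.product_pod"):
            yield {
                "title": book.css("h3 a::attr(title)").get(),
                "price": book.css("p.price_color::text").get(),
            }
        # Follow pagination until no "next" link remains
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)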
Selector types
The four common styles are CSS selectors, XPath, regex, and JSON queries via jmespath.
CSS selectors
Familiar from web development, and supported natively by BeautifulSoup and Playwright.

soup.select("table.results tr td:nth-child(2)")
soup.select_one("h1.title").get_text(strip=True)
soup.select('a[href^="/product/"]')
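XPath, regex, and jmespath follow the same idea. Minimal sketches with illustrative paths and keys, where page_source stands in for a fetched HTML string:

from lxml import html
import re
import jmespath  # pip install jmespath

page_source = "<table class='results'><tr><td>a</td><td>£9.99</td></tr></table>"

# XPath (lxml): second cell of every row in the results table
tree = html.fromstring(page_source)
cells = tree.xpath("//table[@class='results']//tr/td[2]/text()")

# Regex: a last resort, for simple well-delimited tokens like prices
prices = re.findall(r"£\d+\.\d{2}", page_source)

# jmespath: declarative queries over parsed JSON
names = jmespath.search("products[].name", {"products": [{"name": "Book A"}]})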
Anti-blocking techniques
Rotate User-Agents
Send realistic browser headers. Use the fake-useragent library to cycle headers per request.
Rate limiting
Add random delays between requests: time.sleep(random.uniform(1, 3)) — mimics human browsing pace.
Proxy rotation
Route requests through rotating residential or datacenter proxies to avoid IP bans on high-volume scraping.
Session handling
Use requests.Session() to persist cookies across requests — required for sites behind login walls.
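A minimal sketch combining these techniques; the proxy URL is a placeholder, and fake-useragent is assumed to be installed (pip install fake-useragent):

import time, random
import requests
from fake_useragent import UserAgent

ua = UserAgent()
session = requests.Session()  # persists cookies across requests
proxies = {"http": "http://proxy.example:8080",
           "https": "http://proxy.example:8080"}  # placeholder proxy

def polite_get(url):
    session.headers["User-Agent"] = ua.random  # rotate UA per request
    resp = session.get(url, timeout=10, proxies=proxies)
    time.sleep(random.uniform(1, 3))  # mimic human browsing pace
    return resp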
Pattern 1 — static HTML scraper
import requests
from bs4 import BeautifulSoup
import time, random

headers = {"User-Agent": "Mozilla/5.0 (compatible; MyBot/1.0)"}
session = requests.Session()
session.headers.update(headers)

def scrape_page(url):
    resp = session.get(url, timeout=10)
    resp.raise_for_status()  # raises on 4xx / 5xx
    soup = BeautifulSoup(resp.text, "lxml")
    return soup

soup = scrape_page("https://books.toscrape.com/")
books = soup.select("article.product_pod")  # CSS selector
for book in books:
    title = book.select_one("h3 a")["title"]
    price = book.select_one("p.price_color").text.strip()
    rating = book.select_one("p.star-rating")["class"][1]
    print(f"{title} | {price} | {rating}")
    time.sleep(random.uniform(0.5, 1.5))
Pattern 2 — async multi-page (httpx + asyncio)
import asyncio, httpx
from bs4 import BeautifulSoup  # parse the fetched pages afterwards

async def fetch(client, url):
    r = await client.get(url, timeout=15)
    return r.text

async def main(urls):
    async with httpx.AsyncClient() as client:
        tasks = [fetch(client, u) for u in urls]
        pages = await asyncio.gather(*tasks)
        return pages

urls = [f"https://example.com/page/{i}" for i in range(1, 6)]
pages = asyncio.run(main(urls))  # fetches all 5 concurrently
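On real sites you usually want to cap concurrency rather than fire every request at once. One option is asyncio.Semaphore, as a drop-in change to fetch above (the limit of 5 is arbitrary):

sem = asyncio.Semaphore(5)  # at most 5 requests in flight at a time

async def fetch(client, url):
    async with sem:
        r = await client.get(url, timeout=15)
        return r.text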
Pattern 3 — JS-rendered with Playwright
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://spa-site.example.com")
    page.wait_for_selector("table.data-loaded")  # wait for JS render
    rows = page.query_selector_all("table tr")
    for row in rows:
        cells = [td.inner_text() for td in row.query_selector_all("td")]
        print(cells)
    browser.close()
Pattern 4 — save to CSV
import csv

data = [{"title": "Book A", "price": "£9.99"}]  # your scraped list

with open("results.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "price"])
    writer.writeheader()
    writer.writerows(data)
Production scraping pipeline
1. Discover
Sitemap.xml, seed URLs, link following. Tools: Scrapy Spider, urllib.robotparser
2. Fetch
HTTP GET with retries, timeout, session cookies, proxy rotation, rate-limit delays
3. Render
For JS sites only: headless Chrome via Playwright. For static: skip this stage
4. Parse
BeautifulSoup / lxml (HTML) · json.loads (APIs) · pdfplumber (PDFs)
5. Clean
strip whitespace · normalise dates · validate types · deduplicate records
6. Store
CSV (csv) · SQLite (sqlite3) · PostgreSQL (psycopg2) · MongoDB (pymongo)
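A sketch of the Store stage using sqlite3 from the standard library; the table name, columns, and rows are illustrative. The UNIQUE constraint plus INSERT OR IGNORE also covers the deduplication mentioned in the Clean stage:

import sqlite3

conn = sqlite3.connect("results.db")
conn.execute("CREATE TABLE IF NOT EXISTS books (title TEXT UNIQUE, price TEXT)")

rows = [("Book A", "£9.99"), ("Book B", "£12.50")]  # your cleaned records
conn.executemany("INSERT OR IGNORE INTO books VALUES (?, ?)", rows)  # dedupe on title
conn.commit()
conn.close()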
Error handling pattern
import time, requests

def fetch_with_retry(url, retries=3, backoff=2):
    for attempt in range(retries):
        try:
            r = requests.get(url, timeout=10)
            r.raise_for_status()
            return r
        except requests.HTTPError as e:
            if e.response.status_code == 429:  # rate-limited
                time.sleep(backoff ** attempt)
            elif e.response.status_code == 404:
                return None  # skip missing pages
            else:
                raise
        except requests.RequestException:
            time.sleep(backoff ** attempt)
    return None
Common HTTP status codes
200 OK: request succeeded; proceed to parse.
301/302 Redirect: requests follows redirects automatically; check the final URL via r.url.
403 Forbidden: headers or auth problem; add a realistic User-Agent or cookies.
404 Not Found: skip silently; log the URL for review.
429 Too Many Requests: rate limited; back off and retry with exponential delay.
503 Service Unavailable: the server may be overloaded; retry later.
Legal & ethical principles
  • Respect robots.txt
    Always parse and honour the site’s robots.txt before crawling. Python: urllib.robotparser.RobotFileParser. Disallowed paths must not be fetched.
  • Rate limit your requests
    Do not hammer servers. Add at least 1–2 s random delay between requests. High-volume scraping can degrade service for real users — a legal liability in many jurisdictions.
  • Check the Terms of Service
    Many sites explicitly prohibit automated scraping in their ToS. Violating ToS may constitute breach of contract, even if data is publicly visible. Prefer official APIs when available.
  • Personal data & copyright
    Under GDPR (EU) and similar laws, scraping personal data without lawful basis is illegal. Scraped content remains copyrighted by its authors — re-publishing without permission may infringe copyright.
robots.txt check — code
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

if rp.can_fetch("*", "https://example.com/products"):
    print("Allowed — proceeding")
else:
    print("Disallowed by robots.txt — skipping")
Use the official API when one exists
Many services (Twitter/X, Reddit, GitHub, Google Maps, OpenWeatherMap) provide free or paid APIs that are faster, more reliable, and legally unambiguous. Always check for a public API before building a scraper; scraping should be a last resort when no API exists.
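For example, GitHub's documented REST API returns clean JSON with no HTML parsing at all:

import requests

# Public, documented endpoint; no scraping required
resp = requests.get("https://api.github.com/repos/python/cpython", timeout=10)
resp.raise_for_status()
repo = resp.json()
print(repo["full_name"], repo["stargazers_count"])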
