Modern websites detect and block crawlers within minutes. This guide covers the exact proxy strategy, request patterns, and tool stack that data engineers use to crawl millions of pages in 2026 without triggering bans or CAPTCHAs.
You spin up a scraper, point it at a target site, and within 15 minutes your IP is banned. You switch IPs and the same thing happens. You add delays — still banned. You try a headless browser — CAPTCHA wall.
This is the standard experience for anyone crawling modern websites without the right proxy infrastructure.
Anti-bot systems in 2026 are sophisticated: they track request velocity, browser fingerprints, IP reputation, TLS fingerprints, header ordering, and behavioral patterns simultaneously. A crawler that trips any one signal gets blocked, and the IP can stay burned for that site long afterward.
This guide covers what actually works: the proxy types, rotation strategies, request patterns, and tool choices that let you crawl at scale without getting blocked.
Before choosing solutions, understand what you're up against:
Rate limiting is the simplest defense: more than N requests per minute from a single IP triggers a block. Thresholds vary wildly:
- News sites: 30–60 req/min before a soft block
- E-commerce (Amazon, Walmart): 5–10 req/min before a CAPTCHA
- LinkedIn: 1–2 profile views/min from the same IP before a challenge
- Real estate (Zillow, Redfin): 10–20 req/min before a 429 response
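One way to stay under per-IP thresholds like these is to track a request budget for each (site, proxy) pair. The sketch below is a sliding-window limiter; `PerIpBudget` and its parameters are illustrative names, and the limit you pass in is your own estimate for the target, not a value any site publishes.

```python
import time
from collections import defaultdict, deque


class PerIpBudget:
    """Sliding-window budget: at most `limit` requests per `window` seconds
    for each (site, proxy) pair. Hypothetical helper, not a library class."""

    def __init__(self, limit, window=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._hits = defaultdict(deque)

    def wait_time(self, site, proxy):
        """Seconds to wait before this proxy may hit this site again (0.0 if free)."""
        now = self._clock()
        hits = self._hits[(site, proxy)]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) < self.limit:
            return 0.0
        return self.window - (now - hits[0])

    def record(self, site, proxy):
        """Call once per request actually sent."""
        self._hits[(site, proxy)].append(self._clock())
```

Before each request, call `wait_time(site, proxy)`; if it is non-zero, either sleep for that long or pick a different proxy.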
Every IP has a reputation score based on historical abuse. Scores are shared across vendors (Cloudflare, Akamai, DataDome, and PerimeterX all use shared threat intelligence):
- Datacenter IPs: bad reputation by default
- Residential pool IPs: degraded by abuse from thousands of other crawlers
- Fresh mobile/residential IPs: clean reputation
JavaScript-based bots are fingerprinted by:
- Navigator properties (plugins, platform, vendor)
- Canvas / WebGL rendering hash
- Font enumeration
- Audio context fingerprint
- WebRTC local IP leak
- TLS ClientHello fingerprint (JA3/JA4 hash)
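Both curl and Python's requests library have immediately recognizable TLS fingerprints. Selenium and Puppeteer drive a real browser, so their TLS handshake looks like Chrome's, but unless patched they expose automation markers such as navigator.webdriver.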
ML models analyze session behavior:
- Time between requests (too regular = bot)
- Mouse movement / scroll patterns (headless = no movement)
- Page interaction order (bot = linear, human = random)
- Referrer chain (bots often have no referrer or always the same one)
Honeypot traps are invisible links in the page HTML that only bots follow. Following one means a permanent ban regardless of IP or behavior.
Datacenter IPs are fast and cheap, but have the worst reputation scores. Most major sites block datacenter ASNs at the firewall level.
Use for:
- Internal APIs with no bot protection
- Sites you have explicit permission to crawl
- Development/testing (not production crawling)
- High-volume low-risk targets (public government data, academic databases)

Avoid for:
- E-commerce price scraping (Amazon, eBay)
- Social media (LinkedIn, Instagram, Facebook)
- Any Cloudflare/Akamai/DataDome protected site
Rotating residential proxies draw on large pools of residential IPs, sourced from real user devices via SDK partnerships. Each request can come from a different residential IP in the target country or city.
Use for:
- General e-commerce scraping
- Price monitoring across many retailers
- SEO rank checking (geo-targeted SERPs)
- Ad verification

Limitations:
- Shared pools: the same IPs are used by thousands of other crawlers
- Reputation degrades over time on heavily targeted sites
- Variable speed (depends on the residential device)
- Not ideal for session-dependent scraping (login, cart flows)
Sticky residential sessions work like rotating residential proxies, but you can lock to one IP for a defined window (5–30 minutes). This is critical for:
- Login flows (the session must stay on one IP)
- Multi-step checkout scraping
- User profile data (session cookies must persist)
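Providers usually expose sticky sessions through their own gateway options, which differ by vendor. As a provider-neutral sketch, the same idea can be approximated client-side by pinning each logical session to one proxy for a fixed time-to-live. `StickySessions`, its 10-minute default, and the session IDs are assumptions for illustration.

```python
import random
import time


class StickySessions:
    """Pin each session ID to one proxy until `ttl` seconds have elapsed.
    Hypothetical client-side helper; real providers offer their own mechanism."""

    def __init__(self, proxies, ttl=600.0, clock=time.monotonic):
        self.proxies = list(proxies)
        self.ttl = ttl
        self._clock = clock  # injectable for testing
        self._assigned = {}  # session_id -> (proxy, time assigned)

    def proxy_for(self, session_id):
        now = self._clock()
        entry = self._assigned.get(session_id)
        if entry and now - entry[1] < self.ttl:
            return entry[0]  # still inside the sticky window
        # First use or window expired: pick a fresh proxy and restart the window.
        proxy = random.choice(self.proxies)
        self._assigned[session_id] = (proxy, now)
        return proxy
```

A login flow would call `proxy_for("account-42")` before every request in that flow, so all of its steps leave from the same IP until the window expires.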
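Mobile proxies use real 4G/5G carrier IPs, the same IP class that millions of real mobile users come from. Because carriers place many subscribers behind each address through carrier-grade NAT, sites cannot block mobile carrier ranges without blocking legitimate traffic.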
Use for:
- Targets that block all residential proxies (heavily protected social/e-commerce)
- Mobile-specific content (app store data, mobile-only pages)
- Any target using advanced bot detection (DataDome, PerimeterX, Kasada)
Cost: higher than residential, but the significantly lower block rate often yields a better cost per successful request.
For small crawls, a simple request loop with proxy rotation works:
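A minimal sketch of such a loop, using only Python's standard library. The proxy URLs are placeholders, and `opener_for`, `crawl`, and the optional `fetch` hook (included so the loop can be exercised without network access) are illustrative names, not part of any library.

```python
import itertools
import urllib.request

# Placeholder endpoints; substitute your provider's gateway URLs.
PROXIES = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
]


def opener_for(proxy_url):
    """Return an opener that routes HTTP and HTTPS traffic through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def crawl(urls, proxies=PROXIES, fetch=None, timeout=15):
    """Fetch each URL through the next proxy in round-robin order.

    Returns {url: body}, or {url: exception} where the request failed.
    """
    rotation = itertools.cycle(proxies)
    results = {}
    for url in urls:
        proxy = next(rotation)
        try:
            if fetch is not None:
                results[url] = fetch(url, proxy)
            else:
                with opener_for(proxy).open(url, timeout=timeout) as resp:
                    results[url] = resp.read()
        except OSError as exc:  # URLError, HTTPError and timeouts all derive from OSError
            results[url] = exc
    return results
```

Round-robin is the simplest rotation policy; it spreads load evenly but does nothing about proxies that have already been blocked.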
Use a thread pool with a proxy rotation manager:
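The following is one possible shape for that setup, not a reference implementation. `ProxyManager` hands out proxies round-robin under a lock and retires any proxy that fails `max_failures` times in a row; `fetch(url, proxy)` is a caller-supplied function, and all names and defaults here are assumptions.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ProxyManager:
    """Thread-safe round-robin rotation that retires repeatedly failing proxies."""

    def __init__(self, proxies, max_failures=3):
        self._lock = threading.Lock()
        self._proxies = list(proxies)
        self._failures = {p: 0 for p in self._proxies}
        self._max_failures = max_failures
        self._index = 0

    def acquire(self):
        with self._lock:
            if not self._proxies:
                raise RuntimeError("no healthy proxies left")
            proxy = self._proxies[self._index % len(self._proxies)]
            self._index += 1
            return proxy

    def report(self, proxy, ok):
        """Record the outcome of a request; drop the proxy after too many failures."""
        with self._lock:
            if ok:
                self._failures[proxy] = 0
                return
            self._failures[proxy] += 1
            if self._failures[proxy] >= self._max_failures and proxy in self._proxies:
                self._proxies.remove(proxy)


def crawl_concurrently(urls, manager, fetch, workers=8):
    """Fetch URLs in parallel; returns {url: body or None on failure}."""

    def task(url):
        proxy = manager.acquire()
        try:
            body = fetch(url, proxy)
        except Exception:
            manager.report(proxy, ok=False)
            return url, None
        manager.report(proxy, ok=True)
        return url, body

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(task, urls))
```

Keeping health tracking inside the manager means worker threads never need to coordinate with each other directly; they only acquire a proxy and report the result.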
For serious crawls (millions of pages), Scrapy with a rotating proxy middleware is the industry standard:
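A rough sketch of what such a middleware can look like. Scrapy downloader middlewares are plain classes, and Scrapy's built-in `HttpProxyMiddleware` honors `request.meta["proxy"]`. The `PROXY_LIST` setting, the `pin_proxy` meta key, and the module path in the comments are hypothetical names introduced for this example, so check the hook signature against the Scrapy version you run.

```python
import random

# settings.py (module path is hypothetical):
# DOWNLOADER_MIDDLEWARES = {"myproject.middlewares.RotatingProxyMiddleware": 350}
# PROXY_LIST = ["http://user:pass@proxy1.example.com:8000"]


class RotatingProxyMiddleware:
    """Downloader middleware that assigns a proxy to each outgoing request."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("PROXY_LIST is empty")
        self.proxies = list(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is a custom setting defined by this example, not a Scrapy built-in.
        return cls(crawler.settings.getlist("PROXY_LIST"))

    def process_request(self, request, spider=None):
        # Requests flagged with the (hypothetical) pin_proxy key keep their proxy,
        # which is how a sticky session would opt out of rotation.
        if request.meta.get("pin_proxy") and "proxy" in request.meta:
            return None
        request.meta["proxy"] = random.choice(self.proxies)
        return None  # let the request continue through the downloader chain
```

The priority of 350 is chosen so that this class runs before the built-in proxy middleware (priority 750 in Scrapy's defaults), which then applies the chosen proxy. Because retried requests pass through `process_request` again, each retry gets a fresh proxy unless it is pinned.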
Never send bare minimal headers. Match a real browser's full header set:
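As an illustration, the helper below builds a header set modelled on a desktop Chrome navigation request. The specific version strings are assumptions and go stale quickly, so copy current values from a real browser's network inspector. `browser_headers` is a hypothetical helper that returns a plain dict usable with any HTTP client.

```python
from urllib.parse import urlsplit

# Modelled on a desktop Chrome navigation request; version strings are assumptions.
BASE_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
    ),
    "Accept": (
        "text/html,application/xhtml+xml,application/xml;q=0.9,"
        "image/avif,image/webp,image/apng,*/*;q=0.8"
    ),
    "Accept-Language": "en-US,en;q=0.9",
    # Only advertise encodings your HTTP client can actually decode.
    "Accept-Encoding": "gzip, deflate, br",
    "Upgrade-Insecure-Requests": "1",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
}


def browser_headers(url, referer=None):
    """Return a browser-like header dict for a navigation to `url`."""
    headers = dict(BASE_HEADERS)
    if referer:
        # Compare scheme and host; this ignores the finer "same-site" case.
        same_origin = urlsplit(referer)[:2] == urlsplit(url)[:2]
        headers["Referer"] = referer
        headers["Sec-Fetch-Site"] = "same-origin" if same_origin else "cross-site"
    return headers
```

Two caveats: Chrome also sends `sec-ch-ua` client hints that must agree with the User-Agent, and some HTTP clients reorder or re-case headers, so verify what actually goes over the wire.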
Fixed delays are bot signatures. Use randomized delays that mimic human reading time:
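One way to do this is to draw each pause from a skewed distribution rather than a fixed or uniform one. The sketch below uses a log-normal distribution, which gives mostly short pauses with an occasional long one. The 4-second median and the clamp bounds are arbitrary assumptions to tune per target, and the function names are illustrative.

```python
import math
import random
import time


def human_delay(median=4.0, sigma=0.5, floor=1.0, ceiling=30.0):
    """Return a delay in seconds drawn from a log-normal distribution.

    exp(mu) is the median of a log-normal, so mu = log(median).
    The result is clamped to the range [floor, ceiling].
    """
    delay = random.lognormvariate(math.log(median), sigma)
    return min(max(delay, floor), ceiling)


def polite_sleep(sleep=time.sleep):
    """Sleep for one randomized delay and return how long it was."""
    delay = human_delay()
    sleep(delay)
    return delay
```

Call `polite_sleep()` between requests on the same session. The `sleep` parameter exists only so the function can be tested without actually waiting.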
Maintain cookies across requests like a real browser — don't start a fresh session for every request:
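A minimal standard-library sketch: one cookie jar is attached to one opener, and that opener is reused for every request in the session. `make_session` is a hypothetical helper and the proxy URL in the usage note is a placeholder.

```python
import http.cookiejar
import urllib.request


def make_session(proxy_url=None):
    """Return (opener, jar); cookies set by responses are stored in `jar`
    and sent back automatically on later requests made with `opener`."""
    jar = http.cookiejar.CookieJar()
    handlers = [urllib.request.HTTPCookieProcessor(jar)]
    if proxy_url:
        handlers.append(
            urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
        )
    return urllib.request.build_opener(*handlers), jar
```

For example, `opener, jar = make_session("http://user:pass@proxy1.example.com:8000")` followed by repeated `opener.open(url)` calls keeps cookies and exit IP together. Pairing one jar with one proxy matters because a session cookie that suddenly appears from a different IP is itself a detection signal.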
Parse only visible links. Honeypot links are typically hidden via CSS:
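The sketch below shows one way to filter links using the standard-library HTML parser. It skips anchors that are hidden, or that sit inside a hidden ancestor, through inline `style`, the `hidden` attribute, or `aria-hidden="true"`. It is deliberately limited: it cannot see rules in external stylesheets or class-based hiding, which require a rendered DOM, and it assumes reasonably well-formed markup. All names are illustrative.

```python
import re
from html.parser import HTMLParser

HIDDEN_STYLE = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.I)
# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


def _is_hidden(attrs):
    a = dict(attrs)
    return (
        "hidden" in a
        or a.get("aria-hidden") == "true"
        or HIDDEN_STYLE.search(a.get("style") or "") is not None
    )


class VisibleLinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
        self._stack = []  # one flag per open element: is it or an ancestor hidden?

    def handle_starttag(self, tag, attrs):
        inherited = self._stack[-1] if self._stack else False
        hidden = inherited or _is_hidden(attrs)
        if tag not in VOID_TAGS:
            self._stack.append(hidden)
        href = dict(attrs).get("href")
        if tag == "a" and href and not hidden:
            self.links.append(href)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._stack:
            self._stack.pop()


def visible_links(html):
    """Return hrefs of anchors not hidden by inline styles or attributes."""
    parser = VisibleLinkParser()
    parser.feed(html)
    return parser.links
```

For targets that hide traps through stylesheets, checking computed visibility in a real browser is the more reliable route.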
Always check robots.txt before crawling. Ignoring it is both legally risky and a crawling red flag: some sites list trap paths under Disallow that only a crawler ignoring robots.txt would ever request.
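Python's standard library ships `urllib.robotparser` for this. The sketch below wraps it in two small helpers, one that parses robots.txt text you already have and one that downloads it. The helper names and the `example-crawler` user agent are assumptions.

```python
from urllib import robotparser
from urllib.parse import urljoin

USER_AGENT = "example-crawler"  # hypothetical crawler name


def parse_robots(robots_txt):
    """Build rules from robots.txt content that has already been downloaded."""
    rules = robotparser.RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def fetch_robots(site_root):
    """Download and parse <site_root>/robots.txt (performs a network request)."""
    rules = robotparser.RobotFileParser(urljoin(site_root, "/robots.txt"))
    rules.read()
    return rules


def allowed(rules, url, user_agent=USER_AGENT):
    """True if robots.txt permits user_agent to fetch url."""
    return rules.can_fetch(user_agent, url)
```

Filter every candidate URL through `allowed(...)` before queueing it. `rules.crawl_delay(USER_AGENT)` returns the site's requested delay, if it declares one, which can feed directly into your throttling.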