
This repository shows how to build a Gemini-powered web scraper using Python and LLMs to extract structured data from complex web pages, without writing custom parsing logic.
Read the full tutorial: How to Leverage Gemini AI for Web Scraping
- Fetches HTML from any public webpage
- Converts HTML to Markdown using markdownify
- Sends it to Gemini AI with a natural language prompt
- Extracts structured data in JSON format
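The four steps above can be sketched as a small pipeline. The helper names, the prompt wording, and the model name (`gemini-1.5-flash`) are assumptions for illustration, not the repository's exact code:

```python
# Sketch of the fetch -> Markdown -> Gemini -> JSON pipeline (hypothetical helpers).
import json


def build_prompt(markdown: str) -> str:
    """Wrap the page's Markdown in a natural-language extraction prompt."""
    return (
        "Extract the product name, price, and description from the page below. "
        "Respond with a JSON array of objects only.\n\n" + markdown
    )


def scrape(url: str) -> list:
    # Third-party imports are kept inside the function so the sketch can be
    # imported even before the packages are installed.
    import requests
    import google.generativeai as genai
    from markdownify import markdownify as md

    html = requests.get(url, timeout=30).text          # 1. fetch HTML
    markdown = md(html)                                # 2. convert to Markdown
    model = genai.GenerativeModel("gemini-1.5-flash")  # model name is an assumption
    response = model.generate_content(build_prompt(markdown))  # 3. prompt Gemini
    return json.loads(response.text)                   # 4. parse the JSON output
```

Converting to Markdown first strips boilerplate tags and attributes, so the prompt sent to Gemini is shorter and cheaper than raw HTML.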
- google-generativeai - Gemini API for LLM-powered parsing
- requests - for basic HTTP requests (if not using a proxy)
- beautifulsoup4 - for basic HTML parsing (optional)
- markdownify - converts HTML into cleaner Markdown
- python-dotenv - for managing API keys and environment variables
- Clone this repo:
git clone https://github.com/yourusername/gemini-ai-web-scraper.git
cd gemini-ai-web-scraper
- Install dependencies:
pip install google-generativeai python-dotenv requests beautifulsoup4 markdownify
- Add your Gemini API key to the script or set it as an environment variable.
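One way to wire this up is a small helper that prefers a `.env` file (via python-dotenv) but falls back to plain environment variables. The variable name `GEMINI_API_KEY` and the helper itself are assumptions, not part of the repository:

```python
# Read the Gemini key from the environment, with optional .env support.
import os


def load_gemini_key() -> str:
    """Return the API key; loads a .env file if python-dotenv is installed."""
    try:
        from dotenv import load_dotenv
        load_dotenv()  # populates os.environ from a .env file, if one exists
    except ImportError:
        pass  # python-dotenv not installed; use plain environment variables
    key = os.getenv("GEMINI_API_KEY", "")
    if not key:
        raise RuntimeError("Set GEMINI_API_KEY in your environment or .env file")
    return key
```

Keeping the key out of the script means you can commit the code without leaking credentials.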
Even with Gemini handling the parsing, scraping can still run into IP blocks, CAPTCHAs, and anti-bot systems. Crawlbase Smart Proxy solves that:
- Avoid IP blocks with automatic rotation
- Bypass CAPTCHAs seamlessly
- Skip proxy management
- Get clean, parsed HTML for better AI input
import requests
import time

# Route requests through Crawlbase Smart Proxy
proxy_url = "http://_USER_TOKEN_@smartproxy.crawlbase.com:8012"
proxies = {"http": proxy_url, "https": proxy_url}

url = "https://example.com/protected-page"

time.sleep(2)  # Short delay between requests to mimic human behavior
# verify=False disables TLS certificate verification so the proxy can
# intercept HTTPS traffic; timeout avoids hanging on slow targets.
response = requests.get(url, proxies=proxies, verify=False, timeout=30)
print(response.text)
Replace _USER_TOKEN_ with your Crawlbase Smart Proxy token, which you receive after signing up on Crawlbase.