Web Scraper Chrome Extension
A powerful Chrome extension that saves web pages as static HTML with all images, organized in a clean folder structure. Features advanced sitemap support, batch downloading, and a beautiful tree-view interface for easy page selection.
✨ Features
🎯 Core Functionality
- Save Complete Pages: Downloads full HTML with all images to local folders
- Smart Image Handling: Updates image paths to relative references, maintaining page structure
- Auto-Scroll: Automatically scrolls to load lazy-loaded images before scraping
- Organized Storage: Saves to
domain_timestamp/page_X/ folders with separate images/ subdirectories
📋 Sitemap Integration
- Automatic Sitemap Loading: Fetches and parses website sitemap.xml files
- Sitemap Index Support: Automatically detects and loads multiple sub-sitemaps
- Tree View Interface: Groups URLs by category with expandable/collapsible sections
- Smart Selection: Parent checkboxes select all children, with indeterminate states
- Live Statistics: Real-time count of total URLs and selected pages
- Text Extraction: Extract page copy to Markdown for AI processing
- Hybrid Discovery Engine: High-performance “Smart Scan” that prioritizes sitemaps and seeds from the live tab for 100% compatibility with modern SPAs (React, Vue, Next.js)
- Advanced Crawling: Background crawler explores site structure with robust regex and depth-priority logic
⚡ Batch Operations
- Background Processing: Service worker handles batch downloads independently
- Progress Notifications: Desktop notifications show download progress
- Badge Counter: Extension icon displays current page number during batch downloads
- Persistent Sessions: Downloads continue even if popup is closed
- Resume Support: Reopen popup anytime to see live progress
- Smart Session Detection: Automatically detects domain changes and prompts for a clean slate
- Completion Dashboard: High-fidelity success/error overlay with persistent status tracking
- Error Logs: Immediate visibility into failed downloads with dedicated failure tracking
🎨 Modern UI
- Gradient Design: Beautiful purple/blue gradient header
- Intuitive Controls: Clear buttons with icons for all actions
- Visual Feedback: Color-coded status messages (success/error/info)
- Responsive Layout: Smooth hover effects and transitions
- Statistics Dashboard: Live metrics for total and selected URLs
🔄 Two Download Modes
Manual Mode (Save Page X)
- Load sitemap or navigate to desired page
- Click “Save Page X” to download current page
- Extension automatically navigates to next URL in sitemap
- Reopen popup and click again for next page
- Progress tracked across sessions
Batch Mode (Save Selected Pages)
- Load sitemap to see all available URLs
- Check desired pages (or use Select All)
- Click “Save Selected Pages”
- All pages download automatically in background
- Receive notifications for progress and completion
📁 Folder Structure
Downloads are organized as:
domain_timestamp/
├── page_1/
│ ├── index.html
│ └── images/
│ ├── 1_image1.jpg
│ ├── 2_image2.png
│ └── ...
├── page_2/
│ ├── index.html
│ └── images/
│ └── ...
└── ...
🧪 Unit Tests
The project includes a suite of unit tests for the core logic (sitemap parsing, scraping, naming).
-
Install dependencies:
-
Run tests:
See TESTS-README.md for more details on the testing setup and architecture.
🚀 How to Test Locally
To test the extension on your local machine:
-
Open Chrome Extensions:
- Open Google Chrome browser
- Navigate to
chrome://extensions in the address bar
-
Enable Developer Mode:
- Toggle “Developer mode” switch in the top-right corner to “On”
-
Load the Extension:
- Click “Load unpacked” button on the left
- Navigate to the project’s
extension/ directory and click “Select”
- Ensure
manifest.json is in the extension/ folder
-
Verify Installation:
- Extension should appear in your list of installed extensions
- Click the extension icon in Chrome toolbar (or pin it from the puzzle piece menu)
🎮 How to Use
Single Page Download
- Navigate to any web page
- Click the extension icon
- Click “💾 Save Page” button
- Page and images download to your Downloads folder
Sitemap-Based Download
- Navigate to any website
- Click extension icon
- Click “📋 Load Sitemap” to fetch sitemap.xml
- Browse URLs in tree structure (grouped by category)
- Expand categories by clicking the arrow or category name
- Check boxes for pages you want to download
- Choose your mode:
- Manual: Click “💾 Save Page X” to download one at a time
- Batch: Click “⚡ Save Selected Pages” for automatic batch download
Intelligent Scanning
- Click “🔍 Smart Website Scan” to start the hybrid discovery engine
- The engine will automatically check for sitemaps and fallback to a crawler if needed
- Watch the page tree grow in real-time as new pages are discovered
- Use the “🔄” icon in the header to start a fresh session at any time
Managing Downloads
- Statistics: View total URLs and selected count in real-time
- Select All/Deselect All: Quick buttons for bulk selection
- Progress Tracking: Check extension icon badge during batch downloads
- Notifications: Receive desktop alerts for batch progress
- Session Management: Click “✓ Done” to clear session and start fresh
- Text Extraction: Check “Extract Copy to Markdown” to get a
website_copy.md file with your download.
🛠️ Technical Details
Technologies Used
- Manifest V3: Latest Chrome extension architecture
- Service Worker: Background processing for reliable batch downloads
- Chrome APIs:
chrome.downloads - File downloads
chrome.scripting - Content injection
chrome.storage - State persistence
chrome.notifications - Desktop notifications
chrome.tabs - Tab management
chrome.action - Badge updates
Key Features
- Auto-Scroll Algorithm: Multi-target scrolling for modern framework compatibility
- Tree Structure: URLs grouped by first path segment for intuitive navigation
- Hybrid Discovery: Priority-based engine (Sitemap -> Tab Seed -> Background Crawl)
- URL Normalization: Strict deduplication of trailing slashes and hashes
- State Persistence: Session data survives popup closure and navigation
- Error Handling: Dedicated completion dashboard with persistent state tracking
- Base64 Encoding: HTML converted to data URLs for reliable downloads
Permissions Required
activeTab - Access current tab content
scripting - Inject scraping scripts
downloads - Save files to disk
storage - Persist session state
notifications - Show progress alerts
tabs - Manage tab navigation
📦 How to Submit to Chrome Web Store
To publish your extension on the Chrome Web Store:
-
Package the Extension:
- Create a
.zip file of the extension/ directory
- Ensure
manifest.json is at the root of the zip file
- Do not include
.git, node_modules, or development files. Only zip the contents of the extension/ folder.
-
Go to Chrome Developer Dashboard:
-
Add a New Item:
- Click “Add new item” button
- Upload your
.zip file
-
Complete Store Listing:
- Fill out all required information:
- Description: Detailed feature list and benefits
- Icons: Provide 128x128, 48x48, and 16x16 PNG icons
- Screenshots: Add 1280x800 or 640x400 screenshots showcasing features
- Category: Choose “Productivity” or “Developer Tools”
- Language: Select primary language
-
Privacy Practices:
- Declare data usage and privacy policy
- Explain permissions required and why
- State that no user data is collected or transmitted
-
Submit for Review:
- Click “Submit for review”
- Review typically takes 1-3 business days
- Address any feedback from Chrome Web Store team
Once approved, your extension will be publicly available on the Chrome Web Store.
🤝 Contributing
Contributions are welcome! Please feel free to submit issues or pull requests.
📄 License
This project is open source and available under the MIT License.