The obvious approach to monitoring a web page is: fetch it, compare it to yesterday's version, alert if different. The problem is that this produces 95% false positives. Dates change. Ad banners rotate. Cookie consent prompts flicker in and out. Social proof numbers increment.
If every pixel change triggers an alert, users stop reading alerts. We built PricePulse around the opposite goal: only alert when something that actually affects buying decisions changes.
Here's exactly how we do it, and why we made each technical decision along the way.
Why Not Puppeteer?
The first question every engineer asks when building a web scraper is: headless browser or simple HTTP? Puppeteer gives you JavaScript rendering. Playwright gives you JavaScript rendering. They also give you:
- Cold start times of 5–15 seconds per page
- 2 GB+ memory requirements
- Complex Chromium dependency management on serverless
- $50+/month for a dedicated instance to run it reliably
For pricing pages specifically, most pricing data is in the initial HTML, not rendered client-side. We tested 200 SaaS pricing pages. 87% delivered all price and plan information in the raw HTML response. The other 13% are JavaScript-rendered; for those, we use a CSS selector targeting strategy that works around it (more on this below).
We chose node-fetch + cheerio. It's 10x faster, costs pennies, and handles 87% of the market without any headless browser complexity. The remaining 13% we handle with smart selector fallback and a client-side rendering hint system (in roadmap for Q3).
Step 1: Fetching the Page
The fetch logic is deliberately simple, but there are a few non-obvious choices in it:
// scripts/monitor-run.js
async function fetchPage(url) {
  const response = await fetch(url, {
    headers: {
      'User-Agent': 'Mozilla/5.0 (compatible; PricePulseBot/1.0)',
      'Accept': 'text/html,application/xhtml+xml',
      'Accept-Language': 'en-US,en;q=0.9',
    },
    redirect: 'follow',
    signal: AbortSignal.timeout(15000), // 15s timeout
  });

  if (!response.ok) {
    throw new Error(`HTTP ${response.status}`);
  }

  return response.text();
}
User-Agent disclosure: We identify ourselves as PricePulseBot. This is intentional. We respect robots.txt. If a company doesn't want their pricing page monitored, we honor that. In practice, fewer than 2% of pricing pages block our bot โ companies generally want their pricing to be visible.
15-second timeout: Slow pages get retried automatically. Three consecutive fetch failures set the monitor to paused status and notify the user. We never silently fail.
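In sketch form, the retry-then-pause policy looks like this. The `recordFetchResult` helper and the in-memory monitor object are illustrative, not our actual schema; the real version persists these fields to the database and emails the user on pause:

```javascript
// Illustrative sketch of the failure policy described above.
const MAX_CONSECUTIVE_FAILURES = 3;

function recordFetchResult(monitor, ok) {
  if (ok) {
    // Any successful fetch resets the failure streak.
    return { ...monitor, consecutiveFailures: 0 };
  }
  const failures = (monitor.consecutiveFailures ?? 0) + 1;
  return {
    ...monitor,
    consecutiveFailures: failures,
    // Pause (and notify) after three consecutive failures - never fail silently.
    status: failures >= MAX_CONSECUTIVE_FAILURES ? 'paused' : monitor.status,
  };
}
```

A paused monitor stops consuming checks until the user re-enables it, which is what turns "the page has been down for a day" into an explicit notification rather than a silent gap in the history.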
Step 2: Extracting Pricing Content
Raw HTML is full of noise: navigation, footers, scripts, style tags, analytics trackers. We use Cheerio to surgically extract the content that matters.
The key insight: pricing content lives in predictable DOM locations. We target a prioritized list of CSS selectors:
// scripts/noise-filter.js
import * as cheerio from 'cheerio';

const PRICING_SELECTORS = [
  '[id*="pricing"]',          // id="pricing", id="pricing-section"
  '[class*="pricing"]',       // class="pricing-table", etc.
  '[id*="plans"]',            // id="plans"
  '[class*="plan"]',          // class="plan-card"
  '[data-section="pricing"]', // data attributes
  'main',                     // fallback: entire main content
];

const NOISE_SELECTORS = [
  'script', 'style', 'noscript',
  'nav', 'header', 'footer',
  '[class*="cookie"]',
  '[class*="banner"]',
  '[class*="toast"]',
  '[class*="chat"]',        // Intercom, Drift, etc.
  '[id*="intercom"]',
  '[class*="testimonial"]', // social proof numbers
  '[class*="review"]',
  'time',                   // relative timestamps
  '[datetime]',
];

function extractPricingContent(html) {
  const $ = cheerio.load(html);

  // Remove all noise elements first
  NOISE_SELECTORS.forEach((sel) => $(sel).remove());

  // Try to find the pricing section specifically
  for (const sel of PRICING_SELECTORS) {
    const el = $(sel).first();
    if (el.length && el.text().trim().length > 200) {
      return normalizeText(el.text());
    }
  }

  // Fallback: cleaned body text
  return normalizeText($('body').text());
}
Step 3: Normalization
Even after targeting the right DOM section, the extracted text still has noise: extra whitespace, non-breaking spaces, invisible characters used for layout. Normalization makes the diff deterministic:
function normalizeText(text) {
  return text
    .replace(/\u00a0/g, ' ')           // non-breaking spaces
    .replace(/\u200b/g, '')            // zero-width spaces
    .replace(/\s+/g, ' ')              // collapse whitespace
    .replace(/[\u201c\u201d]/g, '"')   // smart quotes
    .replace(/[\u2018\u2019]/g, "'")   // smart apostrophes
    .replace(/\$(\d+)\.00\b/g, '$$$1') // "$19.00" -> "$19"
    .trim();
}
The price normalization ($19.00 → $19) is worth calling out. Without it, a site that switches from displaying "$19.00" to "$19" would register as a change. It's not.
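Here's that normalization applied to a representative messy string. The version below is self-contained, with the smart-quote character classes written out as unicode escapes:

```javascript
// Normalization makes the subsequent hash/diff deterministic.
function normalizeText(text) {
  return text
    .replace(/\u00a0/g, ' ')           // non-breaking spaces
    .replace(/\u200b/g, '')            // zero-width spaces
    .replace(/\s+/g, ' ')              // collapse whitespace
    .replace(/[\u201c\u201d]/g, '"')   // smart quotes
    .replace(/[\u2018\u2019]/g, "'")   // smart apostrophes
    .replace(/\$(\d+)\.00\b/g, '$$$1') // "$19.00" -> "$19"
    .trim();
}

// Non-breaking spaces, newlines, and trailing ".00" all disappear:
console.log(normalizeText('Pro plan\u00a0 \n $19.00 / month'));
// -> 'Pro plan $19 / month'
```

Because "$19.00" and "$19" normalize to the same string, they also produce the same content hash in the next step, so cosmetic reformatting never looks like a change.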
Step 4: Diffing
We compute a SHA-256 hash of the normalized content. If the hash matches the stored snapshot, nothing changed; no further processing is needed. This makes the happy path extremely cheap: one hash comparison, no LLM calls, no complex diff computation.
import { createHash } from 'crypto';

function contentHash(text) {
  return createHash('sha256').update(text).digest('hex');
}

// In the main monitoring loop:
const currentHash = contentHash(currentContent);

if (currentHash === snapshot.content_hash) {
  // Nothing changed - update next_check_at and move on
  await markChecked(monitor.id);
  return;
}

// Content changed - compute a human-readable diff
const diff = computeDiff(snapshot.content, currentContent);
When content does change, we compute a word-level diff. Line-level diffs are too coarse for pricing content: a plan description might change from "5 users" to "10 users" within a single line. Word-level catches that.
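Word-level diffing can be built from a standard longest-common-subsequence (LCS) pass over tokens. The sketch below is illustrative; `computeDiff` in the snippet above may differ in detail:

```javascript
// Minimal word-level diff via longest common subsequence.
function wordDiff(oldText, newText) {
  const a = oldText.split(/\s+/).filter(Boolean);
  const b = newText.split(/\s+/).filter(Boolean);

  // dp[i][j] = LCS length of a[i..] and b[j..]
  const dp = Array.from({ length: a.length + 1 }, () =>
    new Array(b.length + 1).fill(0)
  );
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      dp[i][j] = a[i] === b[j]
        ? dp[i + 1][j + 1] + 1
        : Math.max(dp[i + 1][j], dp[i][j + 1]);
    }
  }

  // Walk the table, emitting unchanged/removed/added words in order.
  const ops = [];
  let i = 0, j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      ops.push({ type: 'same', word: a[i] }); i++; j++;
    } else if (dp[i + 1][j] >= dp[i][j + 1]) {
      ops.push({ type: 'removed', word: a[i] }); i++;
    } else {
      ops.push({ type: 'added', word: b[j] }); j++;
    }
  }
  while (i < a.length) ops.push({ type: 'removed', word: a[i++] });
  while (j < b.length) ops.push({ type: 'added', word: b[j++] });
  return ops;
}
```

On "Up to 5 users" vs "Up to 10 users" this emits removed "5" and added "10" while the surrounding words stay untouched, which is exactly the granularity a line-level diff would flatten into "this whole line changed".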
Step 5: The Noise Filter (The Hard Part)
This is where most monitoring tools fail. They detect a change, alert immediately, and produce noise. PricePulse runs every detected change through a noise scoring algorithm before deciding to alert.
A change gets a confidence score from 0 to 1. Changes above 0.4 trigger an alert. Below 0.4, we store the change silently (for the audit log) but don't email anyone.
Why 0.4? We tuned this threshold against a test corpus of 500 real pricing page changes. At 0.4, we catch 97% of genuine pricing changes while suppressing 94% of noise. The 3% of missed genuine changes are typically minor copy tweaks ("up to 5 users" → "up to 5 team members") that don't affect pricing.
The scoring logic:
const NOISE_PATTERNS = [
  { pattern: /\d+ (users|customers|companies|teams) trust/i, weight: -0.3 },
  { pattern: /\d+ (reviews|ratings|stars)/i, weight: -0.3 },
  { pattern: /(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)/i, weight: -0.2 },
  { pattern: /\d{4}/, weight: -0.1 }, // years
  { pattern: /(cookie|gdpr|privacy|consent)/i, weight: -0.4 },
];

const SIGNAL_PATTERNS = [
  { pattern: /\$\d+/, weight: +0.5 }, // price
  { pattern: /(\/mo|\/month|\/year|\/yr|per month|per year)/i, weight: +0.4 },
  { pattern: /(free plan|free tier|forever free)/i, weight: +0.5 },
  { pattern: /(enterprise|starter|pro|business|team|growth)/i, weight: +0.3 },
  { pattern: /(limit|include|feature|seat|user)/i, weight: +0.2 },
];

function scoreChange(removedText, addedText) {
  const changedText = removedText + ' ' + addedText;
  let score = 0.2; // base score: any change has some signal

  SIGNAL_PATTERNS.forEach(({ pattern, weight }) => {
    if (pattern.test(changedText)) score += weight;
  });
  NOISE_PATTERNS.forEach(({ pattern, weight }) => {
    if (pattern.test(changedText)) score += weight;
  });

  return Math.max(0, Math.min(1, score));
}
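To make the 0.4 threshold concrete, here's the scorer run end-to-end on two representative diffs. The pattern lists are condensed to the entries that fire in these examples; the scoring function is the same shape as the one above:

```javascript
// Condensed pattern lists for demonstration purposes only.
const NOISE_PATTERNS = [
  { pattern: /\d+ (users|customers|companies|teams) trust/i, weight: -0.3 },
  { pattern: /(cookie|gdpr|privacy|consent)/i, weight: -0.4 },
];
const SIGNAL_PATTERNS = [
  { pattern: /\$\d+/, weight: 0.5 },
  { pattern: /(\/mo|\/month|\/year|\/yr|per month|per year)/i, weight: 0.4 },
];

function scoreChange(removedText, addedText) {
  const changedText = removedText + ' ' + addedText;
  let score = 0.2; // base score: any change has some signal
  SIGNAL_PATTERNS.forEach(({ pattern, weight }) => {
    if (pattern.test(changedText)) score += weight;
  });
  NOISE_PATTERNS.forEach(({ pattern, weight }) => {
    if (pattern.test(changedText)) score += weight;
  });
  return Math.max(0, Math.min(1, score));
}

// A real price change scores 0.2 + 0.5 + 0.4, clamped to 1 - well above 0.4:
console.log(scoreChange('$19/mo', '$29/mo')); // 1
// A cookie banner tweak scores 0.2 - 0.4, clamped to 0 - stored silently:
console.log(scoreChange('We use cookies', 'This site uses cookies for consent')); // 0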
Step 6: Storage
We store two things for every snapshot: the content hash (for fast comparison on subsequent checks) and the normalized text (for diff computation when a change is detected).
Full HTML storage was considered and rejected. A single HTML snapshot for a complex SaaS pricing page can be 500 KB–2 MB. At scale (10,000 monitors × daily checks × 90-day retention), that's up to ~1.8 TB of retained snapshots. Normalized text reduces this by over 97%: most pricing pages compress to 5–15 KB of content.
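The arithmetic behind that estimate, as a quick sanity check (assuming the 2 MB worst-case HTML snapshot versus ~10 KB of normalized text, decimal units):

```javascript
// Back-of-the-envelope storage math: 10,000 monitors, daily checks,
// 90-day retention, so 90 retained snapshots per monitor at steady state.
const monitors = 10_000;
const retainedSnapshots = 90;
const rawHtmlBytes = 2_000_000;  // 2 MB worst-case raw HTML snapshot
const normalizedBytes = 10_000;  // ~10 KB of normalized text

const rawTB = (monitors * retainedSnapshots * rawHtmlBytes) / 1e12;
const textGB = (monitors * retainedSnapshots * normalizedBytes) / 1e9;

console.log(`${rawTB} TB of raw HTML vs ${textGB} GB of normalized text`);
// -> "1.8 TB of raw HTML vs 9 GB of normalized text"
```

Nine gigabytes of text fits comfortably in a Postgres database; 1.8 TB of HTML does not, which is what pushed us toward storing only the normalized extraction.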
We use Supabase for structured data (monitors, users, alerts, diffs) and keep the normalized content directly in JSONB. For the future HTML snapshot feature (in roadmap), we'll use Cloudflare R2.
Step 7: Alerting
When a change passes the confidence threshold, we insert an alert row and an async job sends the email. The email includes the diff in a human-readable format โ removed text in red, added text in green, with surrounding context.
We use Resend for email delivery. The transactional volume on a free tier is plenty for early-stage, and their React Email SDK makes templating clean. The alert email includes:
- Which monitor triggered (name + URL)
- The confidence score (so users can calibrate their sensitivity)
- 3 lines of context before and after the change
- A direct link to view the full diff in the dashboard
- One-click mute button for this monitor
The Scheduler
The monitoring engine runs on a cron schedule via cron-job.org (free, external HTTP cron). Every 15 minutes, it hits our /api/monitor-check endpoint, which pulls every monitor due for a check and runs it through the pipeline above: fetch, extract, normalize, hash, diff, score, alert.
We chose external HTTP cron over GitHub Actions for a simple reason: GitHub Actions requires repository write permissions to use workflow dispatch, which is a meaningful security surface area for a production app. cron-job.org sends a plain HTTP request. No credentials, no repository access.
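The endpoint's core scheduling decision reduces to a status-and-timestamp filter. This is an illustrative sketch (`dueMonitors` is a hypothetical helper; in production the equivalent is a single indexed query against Supabase):

```javascript
// Sketch of the due-check performed on each cron tick: a monitor is due
// if it's active and its next_check_at timestamp has passed.
function dueMonitors(monitors, now = new Date()) {
  return monitors.filter(
    (m) => m.status === 'active' && new Date(m.next_check_at) <= now
  );
}

// Example: paused and not-yet-due monitors are skipped.
const monitors = [
  { id: 1, status: 'active', next_check_at: '2025-01-01T00:00:00Z' },
  { id: 2, status: 'paused', next_check_at: '2025-01-01T00:00:00Z' },
  { id: 3, status: 'active', next_check_at: '2099-01-01T00:00:00Z' },
];
console.log(dueMonitors(monitors, new Date('2025-06-01T00:00:00Z')).map((m) => m.id));
// -> [ 1 ]
```

Because `markChecked` advances next_check_at after every successful run, each 15-minute tick only touches the monitors that are actually due, keeping per-tick work roughly constant.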
What We're Still Building
The current implementation is MVP-grade and handles the 87% of pricing pages that deliver content in initial HTML. What's next:
- JavaScript-rendered pages: Using a lightweight headless browser pool (Browserless.io) for the 13% of SPAs where pricing is client-side rendered.
- Semantic diff: Today we do word-level diffing. We're experimenting with an LLM-based semantic understanding of what changed: "price increased from $19 to $29" rather than raw text tokens.
- Screenshot diffing: Visual confirmation alongside text diff. Useful for detecting layout changes that signal a pricing restructure even when text is similar.
- Custom selector support: Let users specify exactly which CSS selector to monitor on a page, for precise targeting.
The code is not open source (yet). We're considering open-sourcing the noise filter and diff algorithm once they're more mature. If you're building something similar and want to compare notes, email us.
Why This Matters
The technical challenge of building a pricing monitor isn't fetching web pages; any intern can do that. The challenge is making it useful without being noisy. A monitoring tool that emails you every time a date changes is worse than no monitoring at all, because you stop reading the alerts.
We built PricePulse because we were burned by missing competitor pricing changes. The noise filter is what makes it possible to actually trust the alerts you receive.
Try the demo → see what the alerts actually look like
Interactive demo with real diff examples, or sign up to start monitoring.