The obvious approach to monitoring a web page is: fetch it, compare it to yesterday's version, alert if different. The problem is that this produces 95% false positives. Dates change. Ad banners rotate. Cookie consent prompts flicker in and out. Social proof numbers increment.
If every pixel change triggers an alert, users stop reading alerts. We built PricePulse around the opposite goal: only alert when something that actually affects buying decisions changes.
Here's exactly how we do it, and why we made each technical decision along the way.
Why Not Puppeteer?
The first question every engineer asks when building a web scraper is: headless browser or simple HTTP? Puppeteer gives you JavaScript rendering. Playwright gives you JavaScript rendering. They also give you:
- Cold start times of 5–15 seconds per page
- 2 GB+ memory requirements
- Complex Chromium dependency management on serverless
- $50+/month for a dedicated instance to run it reliably
For pricing pages specifically, most pricing data is in the initial HTML, not rendered client-side. We tested 200 SaaS pricing pages. 87% delivered all price and plan information in the raw HTML response. The other 13% are JavaScript-rendered; for those, we use a CSS selector targeting strategy that works around it (more on this below).
We chose node-fetch + cheerio. It's 10x faster, costs pennies, and handles 87% of the market without any headless browser complexity. The remaining 13% we handle with smart selector fallback and a client-side rendering hint system (in roadmap for Q3).
Step 1: Fetching the Page
The fetch logic is deliberately simple, but there are a few non-obvious choices in it:
// scripts/monitor-run.js
async function fetchPage(url) {
  const response = await fetch(url, {
    headers: {
      'User-Agent': 'Mozilla/5.0 (compatible; PricePulseBot/1.0)',
      'Accept': 'text/html,application/xhtml+xml',
      'Accept-Language': 'en-US,en;q=0.9',
    },
    redirect: 'follow',
    signal: AbortSignal.timeout(15000), // 15s timeout
  });

  if (!response.ok) {
    throw new Error(`HTTP ${response.status}`);
  }

  return response.text();
}
User-Agent disclosure: We identify ourselves as PricePulseBot. This is intentional. We respect robots.txt. If a company doesn't want their pricing page monitored, we honor that. In practice, fewer than 2% of pricing pages block our bot โ companies generally want their pricing to be visible.
15-second timeout: Slow pages get retried automatically. Three consecutive fetch failures set the monitor to paused status and notify the user. We never silently fail.
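In sketch form, the retry-then-pause policy looks like this. The `recordFetchResult` helper and the in-memory monitor object are illustrative, not our actual schema; the real version persists these fields to the database and emails the user on pause:

```javascript
// Illustrative sketch of the failure policy described above.
const MAX_CONSECUTIVE_FAILURES = 3;

function recordFetchResult(monitor, ok) {
  if (ok) {
    // Any successful fetch resets the failure streak.
    return { ...monitor, consecutiveFailures: 0 };
  }
  const failures = (monitor.consecutiveFailures ?? 0) + 1;
  return {
    ...monitor,
    consecutiveFailures: failures,
    // Pause (and notify) after three consecutive failures - never fail silently.
    status: failures >= MAX_CONSECUTIVE_FAILURES ? 'paused' : monitor.status,
  };
}
```

A paused monitor stops consuming checks until the user re-enables it, which is what turns "the page has been down for a day" into an explicit notification rather than a silent gap in the history.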
Step 2: Extracting Pricing Content
Raw HTML is full of noise: navigation, footers, scripts, style tags, analytics trackers. We use Cheerio to surgically extract the content that matters.
The key insight: pricing content lives in predictable DOM locations. We target a prioritized list of CSS selectors:
// scripts/noise-filter.js
import * as cheerio from 'cheerio';

const PRICING_SELECTORS = [
  '[id*="pricing"]',          // id="pricing", id="pricing-section"
  '[class*="pricing"]',       // class="pricing-table", etc.
  '[id*="plans"]',            // id="plans"
  '[class*="plan"]',          // class="plan-card"
  '[data-section="pricing"]', // data attributes
  'main',                     // fallback: entire main content
];

const NOISE_SELECTORS = [
  'script', 'style', 'noscript',
  'nav', 'header', 'footer',
  '[class*="cookie"]',
  '[class*="banner"]',
  '[class*="toast"]',
  '[class*="chat"]',        // Intercom, Drift, etc.
  '[id*="intercom"]',
  '[class*="testimonial"]', // social proof numbers
  '[class*="review"]',
  'time',                   // relative timestamps
  '[datetime]',
];

function extractPricingContent(html) {
  const $ = cheerio.load(html);

  // Remove all noise elements first
  NOISE_SELECTORS.forEach((sel) => $(sel).remove());

  // Try to find the pricing section specifically
  for (const sel of PRICING_SELECTORS) {
    const el = $(sel).first();
    if (el.length && el.text().trim().length > 200) {
      return normalizeText(el.text());
    }
  }

  // Fallback: cleaned body text
  return normalizeText($('body').text());
}
Step 3: Normalization
Even after targeting the right DOM section, the extracted text still has noise: extra whitespace, non-breaking spaces, invisible characters used for layout. Normalization makes the diff deterministic:
function normalizeText(text) {
  return text
    .replace(/\u00a0/g, ' ')           // non-breaking spaces
    .replace(/\u200b/g, '')            // zero-width spaces
    .replace(/\s+/g, ' ')              // collapse whitespace
    .replace(/[\u201c\u201d]/g, '"')   // smart quotes
    .replace(/[\u2018\u2019]/g, "'")   // smart apostrophes
    .replace(/\$(\d+)\.00\b/g, '$$$1') // "$19.00" -> "$19"
    .trim();
}
The price normalization ($19.00 → $19) is worth calling out. Without it, a site that switches from displaying "$19.00" to "$19" would register as a change. It's not.
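Here's that normalization applied to a representative messy string. The version below is self-contained, with the smart-quote character classes written out as unicode escapes:

```javascript
// Normalization makes the subsequent hash/diff deterministic.
function normalizeText(text) {
  return text
    .replace(/\u00a0/g, ' ')           // non-breaking spaces
    .replace(/\u200b/g, '')            // zero-width spaces
    .replace(/\s+/g, ' ')              // collapse whitespace
    .replace(/[\u201c\u201d]/g, '"')   // smart quotes
    .replace(/[\u2018\u2019]/g, "'")   // smart apostrophes
    .replace(/\$(\d+)\.00\b/g, '$$$1') // "$19.00" -> "$19"
    .trim();
}

// Non-breaking spaces, newlines, and trailing ".00" all disappear:
console.log(normalizeText('Pro plan\u00a0 \n $19.00 / month'));
// -> 'Pro plan $19 / month'
```

Because "$19.00" and "$19" normalize to the same string, they also produce the same content hash in the next step, so cosmetic reformatting never looks like a change.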
Step 4: Diffing
We compute a SHA-256 hash of the normalized content. If the hash matches the stored snapshot, nothing changed; no further processing is needed. This makes the happy path extremely cheap: one hash comparison, no LLM calls, no complex diff computation.
import { createHash } from 'crypto';

function contentHash(text) {
  return createHash('sha256').update(text).digest('hex');
}

// In the main monitoring loop:
const currentHash = contentHash(currentContent);

if (currentHash === snapshot.content_hash) {
  // Nothing changed - update next_check_at and move on
  await markChecked(monitor.id);
  return;
}

// Content changed - compute a human-readable diff
const diff = computeDiff(snapshot.content, currentContent);
When content does change, we compute a word-level diff. Line-level diffs are too coarse for pricing content: a plan description might change from "5 users" to "10 users" within a single line. Word-level catches that.
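Word-level diffing can be built from a standard longest-common-subsequence (LCS) pass over tokens. The sketch below is illustrative; `computeDiff` in the snippet above may differ in detail:

```javascript
// Minimal word-level diff via longest common subsequence.
function wordDiff(oldText, newText) {
  const a = oldText.split(/\s+/).filter(Boolean);
  const b = newText.split(/\s+/).filter(Boolean);

  // dp[i][j] = LCS length of a[i..] and b[j..]
  const dp = Array.from({ length: a.length + 1 }, () =>
    new Array(b.length + 1).fill(0)
  );
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      dp[i][j] = a[i] === b[j]
        ? dp[i + 1][j + 1] + 1
        : Math.max(dp[i + 1][j], dp[i][j + 1]);
    }
  }

  // Walk the table, emitting unchanged/removed/added words in order.
  const ops = [];
  let i = 0, j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      ops.push({ type: 'same', word: a[i] }); i++; j++;
    } else if (dp[i + 1][j] >= dp[i][j + 1]) {
      ops.push({ type: 'removed', word: a[i] }); i++;
    } else {
      ops.push({ type: 'added', word: b[j] }); j++;
    }
  }
  while (i < a.length) ops.push({ type: 'removed', word: a[i++] });
  while (j < b.length) ops.push({ type: 'added', word: b[j++] });
  return ops;
}
```

On "Up to 5 users" vs "Up to 10 users" this emits removed "5" and added "10" while the surrounding words stay untouched, which is exactly the granularity a line-level diff would flatten into "this whole line changed".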
Step 5: The Noise Filter (The Hard Part)
This is where most monitoring tools fail. They detect a change, alert immediately, and produce noise. PricePulse runs every detected change through a noise scoring algorithm before deciding to alert.
A change gets a confidence score from 0 to 1. Changes above 0.4 trigger an alert. Below 0.4, we store the change silently (for the audit log) but don't email anyone.
Why 0.4? We tuned this threshold against a test corpus of 500 real pricing page changes. At 0.4, we catch 97% of genuine pricing changes while suppressing 94% of noise. The 3% of missed genuine changes are typically minor copy tweaks ("up to 5 users" → "up to 5 team members") that don't affect pricing.
The scoring logic:
const NOISE_PATTERNS = [
  { pattern: /\d+ (users|customers|companies|teams) trust/i, weight: -0.3 },
  { pattern: /\d+ (reviews|ratings|stars)/i, weight: -0.3 },
  { pattern: /(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)/i, weight: -0.2 },
  { pattern: /\d{4}/, weight: -0.1 }, // years
  { pattern: /(cookie|gdpr|privacy|consent)/i, weight: -0.4 },
];

const SIGNAL_PATTERNS = [
  { pattern: /\$\d+/, weight: +0.5 }, // price
  { pattern: /(\/mo|\/month|\/year|\/yr|per month|per year)/i, weight: +0.4 },
  { pattern: /(free plan|free tier|forever free)/i, weight: +0.5 },
  { pattern: /(enterprise|starter|pro|business|team|growth)/i, weight: +0.3 },
  { pattern: /(limit|include|feature|seat|user)/i, weight: +0.2 },
];

function scoreChange(removedText, addedText) {
  const changedText = removedText + ' ' + addedText;
  let score = 0.2; // base score: any change has some signal

  SIGNAL_PATTERNS.forEach(({ pattern, weight }) => {
    if (pattern.test(changedText)) score += weight;
  });
  NOISE_PATTERNS.forEach(({ pattern, weight }) => {
    if (pattern.test(changedText)) score += weight;
  });

  return Math.max(0, Math.min(1, score));
}
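To make the 0.4 threshold concrete, here's the scorer run end-to-end on two representative diffs. The pattern lists are condensed to the entries that fire in these examples; the scoring function is the same shape as the one above:

```javascript
// Condensed pattern lists for demonstration purposes only.
const NOISE_PATTERNS = [
  { pattern: /\d+ (users|customers|companies|teams) trust/i, weight: -0.3 },
  { pattern: /(cookie|gdpr|privacy|consent)/i, weight: -0.4 },
];
const SIGNAL_PATTERNS = [
  { pattern: /\$\d+/, weight: 0.5 },
  { pattern: /(\/mo|\/month|\/year|\/yr|per month|per year)/i, weight: 0.4 },
];

function scoreChange(removedText, addedText) {
  const changedText = removedText + ' ' + addedText;
  let score = 0.2; // base score: any change has some signal
  SIGNAL_PATTERNS.forEach(({ pattern, weight }) => {
    if (pattern.test(changedText)) score += weight;
  });
  NOISE_PATTERNS.forEach(({ pattern, weight }) => {
    if (pattern.test(changedText)) score += weight;
  });
  return Math.max(0, Math.min(1, score));
}

// A real price change scores 0.2 + 0.5 + 0.4, clamped to 1 - well above 0.4:
console.log(scoreChange('$19/mo', '$29/mo')); // 1
// A cookie banner tweak scores 0.2 - 0.4, clamped to 0 - stored silently:
console.log(scoreChange('We use cookies', 'This site uses cookies for consent')); // 0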
Step 6: Storage
We store two things for every snapshot: the content hash (for fast comparison on subsequent checks) and the normalized text (for diff computation when a change is detected).
Full HTML storage was considered and rejected. A single HTML snapshot for a complex SaaS pricing page can be 500 KB–2 MB. At scale (10,000 monitors × daily checks × 90-day retention), that's up to ~1.8 TB of retained snapshots. Normalized text reduces this by over 97%: most pricing pages compress to 5–15 KB of content.
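The arithmetic behind that estimate, as a quick sanity check (assuming the 2 MB worst-case HTML snapshot versus ~10 KB of normalized text, decimal units):

```javascript
// Back-of-the-envelope storage math: 10,000 monitors, daily checks,
// 90-day retention, so 90 retained snapshots per monitor at steady state.
const monitors = 10_000;
const retainedSnapshots = 90;
const rawHtmlBytes = 2_000_000;  // 2 MB worst-case raw HTML snapshot
const normalizedBytes = 10_000;  // ~10 KB of normalized text

const rawTB = (monitors * retainedSnapshots * rawHtmlBytes) / 1e12;
const textGB = (monitors * retainedSnapshots * normalizedBytes) / 1e9;

console.log(`${rawTB} TB of raw HTML vs ${textGB} GB of normalized text`);
// -> "1.8 TB of raw HTML vs 9 GB of normalized text"
```

Nine gigabytes of text fits comfortably in a Postgres database; 1.8 TB of HTML does not, which is what pushed us toward storing only the normalized extraction.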
We use Supabase for structured data (monitors, users, alerts, diffs) and keep the normalized content directly in JSONB. For the future HTML snapshot feature (in roadmap), we'll use Cloudflare R2.
Step 7: Alerting
When a change passes the confidence threshold, we insert an alert row and an async job sends the email. The email includes the diff in a human-readable format โ removed text in red, added text in green, with surrounding context.
We use Resend for email delivery. The transactional volume on a free tier is plenty for early-stage, and their React Email SDK makes templating clean. The alert email includes:
- Which monitor triggered (name + URL)
- The confidence score (so users can calibrate their sensitivity)
- 3 lines of context before and after the change
- A direct link to view the full diff in the dashboard
- One-click mute button for this monitor
The Scheduler
The monitoring engine runs on a cron schedule via cron-job.org (free, external HTTP cron). Every 15 minutes, it hits our /api/monitor-check endpoint, which pulls every monitor due for a check and runs it through the pipeline above: fetch, extract, normalize, hash, diff, score, alert.
We chose external HTTP cron over GitHub Actions for a simple reason: GitHub Actions requires repository write permissions to use workflow dispatch, which is a meaningful security surface area for a production app. cron-job.org sends a plain HTTP request. No credentials, no repository access.
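The endpoint's core scheduling decision reduces to a status-and-timestamp filter. This is an illustrative sketch (`dueMonitors` is a hypothetical helper; in production the equivalent is a single indexed query against Supabase):

```javascript
// Sketch of the due-check performed on each cron tick: a monitor is due
// if it's active and its next_check_at timestamp has passed.
function dueMonitors(monitors, now = new Date()) {
  return monitors.filter(
    (m) => m.status === 'active' && new Date(m.next_check_at) <= now
  );
}

// Example: paused and not-yet-due monitors are skipped.
const monitors = [
  { id: 1, status: 'active', next_check_at: '2025-01-01T00:00:00Z' },
  { id: 2, status: 'paused', next_check_at: '2025-01-01T00:00:00Z' },
  { id: 3, status: 'active', next_check_at: '2099-01-01T00:00:00Z' },
];
console.log(dueMonitors(monitors, new Date('2025-06-01T00:00:00Z')).map((m) => m.id));
// -> [ 1 ]
```

Because `markChecked` advances next_check_at after every successful run, each 15-minute tick only touches the monitors that are actually due, keeping per-tick work roughly constant.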
What We're Still Building
The current implementation is MVP-grade and handles the 87% of pricing pages that deliver content in initial HTML. What's next:
- JavaScript-rendered pages: Using a lightweight headless browser pool (Browserless.io) for the 13% of SPAs where pricing is client-side rendered.
- Semantic diff: Today we do word-level diffing. We're experimenting with an LLM-based semantic understanding of what changed: "price increased from $19 to $29" rather than raw text tokens.
- Screenshot diffing: Visual confirmation alongside text diff. Useful for detecting layout changes that signal a pricing restructure even when text is similar.
- Custom selector support: Let users specify exactly which CSS selector to monitor on a page, for precise targeting.
The code is not open source (yet). We're considering open-sourcing the noise filter and diff algorithm once they're more mature. If you're building something similar and want to compare notes, email us.
Why This Matters
The technical challenge of building a pricing monitor isn't fetching web pages; any intern can do that. The challenge is making it useful without being noisy. A monitoring tool that emails you every time a date changes is worse than no monitoring at all, because you stop reading the alerts.
We built PricePulse because we were burned by missing competitor pricing changes. The noise filter is what makes it possible to actually trust the alerts you receive.
Try the demo → see what the alerts actually look like
Interactive demo with real diff examples, or sign up to start monitoring.