Using Cloudflare Workers HTMLRewriter to extend Ghost(Pro)

I set up my blog on Ghost a couple of months ago, and overall have really enjoyed the experience. The editor is simple to use, my site is super fast, and I don't have to worry about any upkeep or maintenance of servers.

However, when I added my new Ghost(Pro) blog to my Cloudflare account, so I could start playing with some Cloudflare features, I was met with the following from within Cloudflare.

Ghost(Pro) blog CNAME record within Cloudflare

I searched around and found that this was added relatively recently when Ghost and Cloudflare started working together. I then found a post by Scott Helme, and much of this code and research is based on his blog post, which I'd encourage you to read here.

Why I wanted more control

There were a number of reasons I wanted more control of my blog:

  • I wanted to set security-related HTTP headers such as X-Frame-Options, Report-To, and Referrer-Policy.
  • Better analytics.
  • I still had some legacy code in sub-directories on my origin server, not hosted by Ghost. I wanted to still be able to use these alongside my Ghost(Pro) subscription.
  • Extending Ghost functionality.

Better Analytics

This is something I'm still working on, but Ghost(Pro) bases its pricing model on page views, and yet doesn't give you any way to actually monitor this. They recommend third-party analytics tools, but these are never accurate with the prevalence of adblockers nowadays (I even had their CEO recommend I use Cloudflare DNS analytics to monitor page-views...). The Ghost team state they "never disable sites for traffic spikes", which is fantastic, but not knowing the actual metrics they're using to potentially bill me in the future is still very disconcerting. From their pricing page:

Views refer to the number of requests to your site each month. These are tracked much like an analytics 'pageview' - and are incremented with each page or API request.

This is something I'd really love to see Ghost(Pro) improve - not so much with a fully-fledged analytics product, but just by giving users access to the data (pageviews) they're already tracking. I currently have a (vague) idea of my billable traffic numbers by referencing Google Analytics and Workers Analytics, but as an end-user I have no idea how these numbers actually stack up against the server logs Ghost(Pro) are using.

Extending Ghost Functionality

I was looking to add native lazy-loading images to my blog posts, as I have a post with quite a few large images. My first inclination was to check Ghost's UI to see if it had support for this. Nope - no problem, this is a very new feature for browsers - I submitted a feature request.

I next started looking for Ghost plugins to extend my blog, but plugins don't really seem to exist with Ghost. I can't find any concept of "plugins", beyond just self-modifying the source code, but I then wouldn't be able to use the Ghost(Pro) hosting - one of the benefits of which is being always up to date without any effort on my part. Ghost is entirely open-source though, so this is definitely an option if you don't mind self-hosting.

Enter Cloudflare Workers

Given all of my requirements, Cloudflare Workers was a perfect choice - I just had to find a way to make it work with my Ghost(Pro) blog. Thanks to Scott's blog post, which I mentioned earlier, this was easier than I anticipated. I've made some changes for easier configuration, implemented HTMLRewriter, and improved a few things.

Ghost Settings

Firstly, make sure your Ghost(Pro) blog is set to your ghost.io URL, and that any "custom domain" setup with Ghost(Pro) is disabled. This can be controlled at https://my.ghost.org/.

Ghost(Pro) blog configuration

To prevent any kind of duplicate content and ensure your visitors always end up on your own domain, rather than your ghost.io domain, we need to inject a tiny line of JavaScript at the top of the page. My ghost.io domain is jamesross.ghost.io.

From your Ghost blog settings (found at yourdomain.ghost.io/ghost), head to "Code Injection", and in the "Site Header", add the following.

Ghost blog code injection configuration
<script id="redirect">"jamesross.ghost.io"==window.location.hostname&&(window.location.href="https://jross.me"+window.location.pathname);</script>

Make replacements as necessary for your domain and ghost.io domain. This will redirect any visitors on your ghost.io domain to your real domain. I've added an id to the script here to make it easier to identify and remove programmatically on our real domain later.
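
For readability, the minified one-liner is roughly equivalent to the sketch below. The `ghostRedirectTarget` function and its parameters are hypothetical names, introduced only so the logic can be exercised outside a browser. Like the original, it keeps only the path and drops any query string or hash.

```javascript
// Hypothetical helper mirroring the injected one-liner: returns the URL to
// redirect to, or null when the visitor is already on the real domain.
function ghostRedirectTarget(hostname, pathname, ghostHost, realOrigin) {
	if (hostname !== ghostHost) {
		return null; // not on the ghost.io domain, nothing to do
	}
	return realOrigin + pathname; // same path, real domain
}

// In the browser this would be wired up along these lines:
// const target = ghostRedirectTarget(window.location.hostname,
// 	window.location.pathname, 'jamesross.ghost.io', 'https://jross.me');
// if (target) { window.location.href = target; }
```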

Cloudflare Workers

Cloudflare Workers allow you to write JavaScript code and have that code run in all of Cloudflare's data centers, sitting in front of your site, before requests hit your origin. They can be used for all kinds of things, from transforming responses, accessing external APIs, and manipulating page headers to running entire applications. Check out some of their tutorials if you're interested in learning more.

For our blog, we're essentially going to be using Cloudflare Workers as a proxy to our ghost.io domain, rather than using a CNAME record. This will give us much more control over our domain, as we don't have to hand over control to Ghost(Pro).

const domain = 'jross.me';
const ghostDomain = 'jamesross.ghost.io';

// Headers to add to your pages. In this example, I'm adding security headers
// such as X-Frame-Options, Referrer-Policy, etc.
// I also append a X-CF-Worker header for easy debugging
const addHeaders = {
	"X-XSS-Protection": "1; mode=block",
	"X-Frame-Options": "DENY",
	"X-Content-Type-Options": "nosniff",
	"Referrer-Policy": "no-referrer-when-downgrade",
	"Strict-Transport-Security": "max-age=31536000; includeSubDomains; preload",
	"X-CF-Worker": "true"
};

// paths to pass-through to your origin. This is useful if you want to use your
// naked domain for your blog, but still have sub-folders or files on your origin
// you want to pass-through. These act as prefixes and are checked with `startsWith`
const originPaths = [
	'/favicon.ico',
	'/textscreens/'
];

// attributes potentially containing our .ghost.io domain
const attrs = ['href', 'src', 'content'];

// our HTMLRewriter element handler. We use a global one for ease-of-use
class elementHandler{
	element(element){
		// remove script to redirect from ghost.io to your own domain
		if(element.tagName === 'script' && element.getAttribute('id') === 'redirect'){
			return element.remove();
		}
		// don't manipulate noscript elements - causes issues with AMP validation
		if(element.tagName === 'noscript'){
			this.ignore = true;
			return;
		}
		// don't manipulate any code-based tags to prevent escaping of text output
		if(element.tagName === 'pre' || element.tagName === 'code' || (element.tagName === 'span' && element.hasAttribute('class') && element.getAttribute('class').includes('token'))){
			this.ignore = true;
			return;
		}
		this.buffer = ''; // initialise text buffer for this element
		// set loading=lazy on all images that don't already have a `loading` attribute
		if(element.tagName === 'img'){
			const lazyLoadingVal = element.getAttribute('loading');
			if(!lazyLoadingVal){
				element.setAttribute('loading', 'lazy');
			}
		}
		// update element attributes that contain our ghost domain, with our real domain
		attrs.forEach(attr => {
			const attrVal = element.getAttribute(attr);
			if(element && attrVal && attrVal.includes(ghostDomain)){
				element.setAttribute(attr, attrVal.replace(new RegExp(ghostDomain, 'g'), domain));
			}
		});
	}
	text(text){
		if(this.ignore){ return; } // don't manipulate this element
		this.buffer += text.text; // concatenate new text with existing text buffer
		if(text.lastInTextNode){
			// this is the last bit of text in the chunk. Search and replace text
			text.replace(this.buffer.replace(new RegExp(ghostDomain, 'g'), domain), {html: true});
			this.buffer = '';
		}else{
			// This wasn't the last text chunk, and we don't know if this chunk will
			// participate in a match. We must remove it so the client doesn't see it
			text.remove();
		}
	}
}

// setup page listener
addEventListener('fetch', event => {
	event.respondWith(handleRequest(event.request));
});

async function handleRequest(req){
	const url = new URL(req.url);
	const realDomain = url.hostname.toString();
	// handle requests to other subdomains, etc. this worker may be processing and redirect
	// useful for redirecting things like blog.yourdomain.com to yourdomain.com
	if(realDomain !== domain){
		let newUrl = new URL(req.url);
		newUrl.hostname = domain;
		let redirectHeaders = new Headers();
		redirectHeaders.set('Location', newUrl.toString());
		return new Response('', {
			status: 301,
			headers: redirectHeaders
		});
	}
	// pass through paths to origin
	const passThrough = originPaths.length && originPaths.find(path => {
		return url.pathname.startsWith(path);
	});
	if(passThrough){
		console.log('origin passthrough', url.toString());
		return fetch(req, {
			// Cloudflare-specific options must be nested under `cf`
			cf: {
				minify: {
					javascript: true,
					css: true,
					html: true
				},
				polish: 'lossless'
			}
		});
	}

	const fullUrl = new URL(req.url).toString().replace(domain, ghostDomain);
	const init = {
		headers: req.headers,
		method: req.method,
		cf: {
			minify: {
				javascript: true,
				css: true,
				html: true
			},
			polish: 'lossless'
		}
	};
	if(req.method !== 'GET' && req.method !== 'HEAD'){
		init.body = req.body;
	}
	// fetch page from ghost.io
	const response = await fetch(fullUrl, init);

	// don't stream responses for non-html content
	if(response.headers.has("Content-Type") && !response.headers.get("Content-Type").includes("text/html")){
		// if robots or sitemap.xml, run our replace over the entire file
		if(url.pathname === '/robots.txt' || url.pathname.endsWith('.xml')){
			const text = await response.text();
			const modified = text.replace(new RegExp(ghostDomain, 'g'), domain);
			return new Response(modified, {
				status: response.status,
				statusText: response.statusText,
				headers: response.headers
			});
		}
		return new Response(response.body, {
			status: response.status,
			statusText: response.statusText,
			headers: response.headers
		});
	}

	// helper URL to redirect /ghost/ to your ghost.io domain for quick editing
	if(url.pathname === '/ghost' || url.pathname === '/ghost/'){
		let redirectHeaders = new Headers();
		redirectHeaders.set('Location', `https://${ghostDomain}/ghost/`);
		return new Response('', {
			status: 302,
			headers: redirectHeaders
		});
	}

	// handle the ghost.io domain responding with a different path
	// this is common for things like preview links when in the admin editor
	const originPath = new URL(response.url).pathname;
	if(originPath !== url.pathname){
		const redirectHeaders = new Headers();
		redirectHeaders.set('Location', `https://${domain}` + originPath);
		return new Response('', {
			status: 301,
			headers: redirectHeaders
		});
	}

	// handle HTML rewriting. This uses the Cloudflare HTMLRewriter API to manipulate the response
	// in real-time as it's streamed, so to minimise effect on TTFB
	const rewriter = new HTMLRewriter();
	rewriter.on('*', new elementHandler());
	const transformed = rewriter.transform(response);

	// add headers to transformed response
	const newHdrs = new Headers(transformed.headers);
	Object.keys(addHeaders).forEach(name => {
		newHdrs.set(name, addHeaders[name]);
	});

	// pass the body stream through rather than buffering it, so the rewrite stays streamed
	return new Response(transformed.body, {
		status: response.status,
		statusText: response.statusText,
		headers: newHdrs
	});
}

The amount of code running here may look overwhelming, but it's relatively simple and straightforward for a Cloudflare Worker. I've commented the code pretty extensively, but I'll walk you through some of the interesting parts. Feel free to reach out to me on Twitter or in the comments if you have any questions, or have any improvements to my Worker.

Note, after removing the CNAME for Ghost(Pro), if you don't have an origin server and don't want to use that feature of this worker, you'll need to create a "dummy" A record for your domain, proxied through Cloudflare and pointing to any IP, such as 8.8.8.8. Without a proxied record, the Worker won't run.

Configuration

const domain = 'jross.me';
const ghostDomain = 'jamesross.ghost.io';

const addHeaders = {};

These are the 3 main points of configuration for the script. If you tweak these and nothing else, the script should work as expected. You can add any headers here, including things like a Content-Security-Policy; I've just limited them in this example for brevity.
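
As a rough illustration of how this map is used, the sketch below copies a set of configured headers onto an existing set of response headers, much as the Worker does at the end of `handleRequest`. The `applyHeaders` name is hypothetical and the Content-Security-Policy value is only a placeholder, not a recommended policy. It assumes a runtime with the standard `Headers` class (Workers, modern browsers, or Node 18+).

```javascript
// Example configuration; the CSP value is only a placeholder policy.
const exampleHeaders = {
	"X-Frame-Options": "DENY",
	"Content-Security-Policy": "default-src 'self'"
};

// Hypothetical helper: returns a new Headers object containing the original
// headers plus (or overridden by) every entry in `extra`.
function applyHeaders(original, extra) {
	const merged = new Headers(original);
	Object.keys(extra).forEach(name => {
		merged.set(name, extra[name]);
	});
	return merged;
}

const result = applyHeaders(new Headers({ "Content-Type": "text/html" }), exampleHeaders);
console.log(result.get("X-Frame-Options")); // prints "DENY"
```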

const originPaths = [
	'/favicon.ico',
	'/textscreens'
];

This is another configuration point where you can configure paths that will completely bypass your Ghost(Pro) blog, and go straight to your origin. In my example, I send requests to /textscreens straight to my origin, as I have some legacy code running here that I haven't moved to serverless yet. All of these paths act as prefixes, so requests to /textscreens/hello will fall through to my origin in this example. If you have no use for this feature, simply set it to an empty array [].
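
The prefix check itself can be sketched in isolation. The `isOriginPath` helper and `examplePaths` array are hypothetical names for illustration; the Worker performs the equivalent check inline with `find` and `startsWith`.

```javascript
// Hypothetical helper: true when `pathname` starts with any configured prefix.
function isOriginPath(pathname, prefixes) {
	return prefixes.some(prefix => pathname.startsWith(prefix));
}

const examplePaths = ['/favicon.ico', '/textscreens'];
console.log(isOriginPath('/textscreens/hello', examplePaths)); // prints true
console.log(isOriginPath('/my-post/', examplePaths)); // prints false
```

Because these are plain prefixes, '/textscreens' would also match a path such as '/textscreens-archive'. Adding a trailing slash ('/textscreens/') restricts the match to paths inside that directory, at the cost of no longer matching '/textscreens' itself.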

HTMLRewriter

This next section of code is really cool, and takes advantage of the new (beta) HTMLRewriter API in Workers. I'm doing a lot of things here:

  • Removing the inline <script> we added earlier to prevent unnecessary inline JS from existing on the page (useful for strict CSPs).
  • Ignoring certain elements such as code blocks, noscript tags, etc. that we don't want to manipulate. Manipulating these can result in weirdness such as double-encoding of characters.
  • Scanning element attributes for references to my ghost.io domain, and making replacements as necessary.
  • Adding loading=lazy to all img elements that don't already contain a loading attribute. This uses the new native lazy-loading feature within Chrome to lazy-load images, which can help a lot with page weight, especially on mobile devices. Other browsers are likely to implement this new specification soon.
  • Finding and replacing all other text content that contains my ghost.io domain. Due to the way that chunking works in the HTMLRewriter API, a single piece of text can arrive split across several chunks, so we have to buffer the text until the end of the text node, removing each chunk as it comes in, and then replacing the final chunk with the full text. A huge thanks to harris on the Cloudflare team for detailing this solution.
class elementHandler{
	element(element){
		// remove script to redirect from ghost.io to your own domain
		if(element.tagName === 'script' && element.getAttribute('id') === 'redirect'){
			return element.remove();
		}
		// don't manipulate noscript elements - causes issues with AMP validation
		if(element.tagName === 'noscript'){
			this.ignore = true;
			return;
		}
		// don't manipulate any code-based tags to prevent escaping of text output
		if(element.tagName === 'pre' || element.tagName === 'code' || (element.tagName === 'span' && element.hasAttribute('class') && element.getAttribute('class').includes('token'))){
			this.ignore = true;
			return;
		}
		this.buffer = ''; // initialise text buffer for this element
		// set loading=lazy on all images that don't already have a `loading` attribute
		if(element.tagName === 'img'){
			const lazyLoadingVal = element.getAttribute('loading');
			if(!lazyLoadingVal){
				element.setAttribute('loading', 'lazy');
			}
		}
		// update element attributes that contain our ghost domain, with our real domain
		attrs.forEach(attr => {
			const attrVal = element.getAttribute(attr);
			if(element && attrVal && attrVal.includes(ghostDomain)){
				element.setAttribute(attr, attrVal.replace(new RegExp(ghostDomain, 'g'), domain));
			}
		});
	}
	text(text){
		if(this.ignore){ return; } // don't manipulate this element
		this.buffer += text.text; // concatenate new text with existing text buffer
		if(text.lastInTextNode){
			// this is the last bit of text in the chunk. Search and replace text
			text.replace(this.buffer.replace(new RegExp(ghostDomain, 'g'), domain), {html: true});
			this.buffer = '';
		}else{
			// This wasn't the last text chunk, and we don't know if this chunk will
			// participate in a match. We must remove it so the client doesn't see it
			text.remove();
		}
	}
}
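
To see why the buffering in `text()` matters, here is a small simulation that runs outside a Worker. The `fakeChunk` helper and `BufferedReplacer` class are hypothetical stand-ins for illustration: they mimic only the `text`, `lastInTextNode`, `replace` and `remove` members used above, and do a literal `split`/`join` replacement rather than a `RegExp` one for simplicity.

```javascript
const ghostHost = 'jamesross.ghost.io';
const realHost = 'jross.me';

// Minimal stand-in for an HTMLRewriter text chunk (only the parts used here).
function fakeChunk(text, lastInTextNode, output) {
	return {
		text,
		lastInTextNode,
		replace(value) { output.push(value); }, // emit the replacement text
		remove() { /* emit nothing for this chunk */ }
	};
}

// Same buffering idea as the handler above, reduced to the text() method.
class BufferedReplacer {
	constructor() { this.buffer = ''; }
	text(chunk) {
		this.buffer += chunk.text;
		if (chunk.lastInTextNode) {
			// full text is now available, so the domain cannot be split any more
			chunk.replace(this.buffer.split(ghostHost).join(realHost));
			this.buffer = '';
		} else {
			chunk.remove();
		}
	}
}

// The domain is split across two chunks, so a per-chunk replace would miss it.
const output = [];
const replacer = new BufferedReplacer();
replacer.text(fakeChunk('Visit jamesross.gh', false, output));
replacer.text(fakeChunk('ost.io today', true, output));
console.log(output.join('')); // prints "Visit jross.me today"
```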

HTMLRewriter essentially lets you transform the HTML response in real-time, using jQuery-like selectors, whilst streaming the output to the client. Cloudflare claims this uses much less overall CPU and RAM than traditional methods, and it also avoids delaying time to first byte (TTFB), which is huge.
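
My handler is attached with the catch-all `*` selector, so every element passes through it. As a sketch of the selector-based alternative, a handler could be registered for `img` elements only. The `LazyImageHandler` class and `fakeElement` helper are hypothetical names for illustration, and the `HTMLRewriter` calls are left in comments because that class only exists inside the Workers runtime.

```javascript
// Hypothetical handler that only deals with images.
class LazyImageHandler {
	element(element) {
		// leave any existing `loading` attribute untouched
		if (!element.getAttribute('loading')) {
			element.setAttribute('loading', 'lazy');
		}
	}
}

// Inside a Worker, it would be registered against the `img` selector:
// const rewriter = new HTMLRewriter().on('img', new LazyImageHandler());
// return rewriter.transform(response);

// Outside a Worker, a tiny fake element is enough to exercise the logic.
function fakeElement(attributes) {
	const attrs = new Map(Object.entries(attributes));
	return {
		getAttribute: name => (attrs.has(name) ? attrs.get(name) : null),
		setAttribute: (name, value) => { attrs.set(name, value); }
	};
}

const img = fakeElement({ src: '/photo.jpg' });
new LazyImageHandler().element(img);
console.log(img.getAttribute('loading')); // prints "lazy"
```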

There's so much more that can be done with HTMLRewriter, and I'd be very interested to hear about any use-cases you develop. Shoot me a message on Twitter or drop a comment here and I may add a link to your post below.

Closing Thoughts

I acknowledge that Ghost(Pro) controlling your domain's Cloudflare settings is probably best for the vast majority of their users. As a developer who likes to tinker with things though, this left me with less control of my domain than I'd like, and without access to Cloudflare Workers, something I've started to use heavily.

Ghost is an awesome, simple blogging platform. It may be a little too simple, but with this setup I now have the best of both worlds: my Ghost(Pro) hosted blog with no servers for me to worry about, and my Cloudflare Worker, giving me full control over my domain again.