My blog was hosted on PostHaven for about 12 years. It’s a pretty good platform and has served me well, but I wanted to move my blog to a Markdown-powered static site. Unfortunately, PostHaven doesn’t provide an export option, probably because it’s not in their financial interest. Oh well, I’ll scrape my own blog and extract the posts.
My first attempt was to use requests and BeautifulSoup to fetch the URLs from the archives page. But the archives page is lazy-loaded using JavaScript, and I was not in the mood to learn Selenium for this task.
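For reference, the dead end looked something like this (a minimal sketch of that first attempt, using the same CSS selector as the working script below):

import requests  # pip install requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = requests.get("https://blog.amjith.com/archive").text
soup = BeautifulSoup(html, "html.parser")
# This comes back empty: the archive list is rendered client-side by
# JavaScript, so the links never appear in the raw HTML that requests sees.
links = [a["href"] for a in soup.select(".archive-list ul li a")]
print(links)  # []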
I remembered Simon Willison’s shot-scraper tool, which is a CLI for taking screenshots of websites. A quick look at the documentation showed fully functional examples of selectively scraping a website using CSS selectors and returning the results as JSON.
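Its javascript subcommand runs a snippet of JavaScript on a page after it has loaded and prints whatever the snippet returns as JSON. For example, this one-liner from the docs prints a page’s title:

shot-scraper javascript https://datasette.io "document.title"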
Here’s the final script I used to scrape my blog and load the posts into a SQLite database using the sqlite-utils library.
import json

from sqlite_utils import Database  # pip install sqlite-utils
import runez  # pip install runez

archives = [
    "https://blog.amjith.com/archive",
    "https://blog.amjith.com/archive?page=2",
]
blog_urls = []

# Give the lazy-loaded archive list a second to render, then resolve the
# promise with the href of every post link on the page.
archive_js = """new Promise(done => setInterval(() => {done(
    Array.from(
        document.querySelectorAll(".archive-list ul li a")).map(x => x.href))
}, 1000));"""

# Iterate over each archive page and grab the URLs for the individual posts.
for archive_page in archives:
    r = runez.run("shot-scraper", "javascript", archive_page, archive_js)
    urls = json.loads(r.output)
    blog_urls.extend(urls)

# Extract the title, raw HTML body, publish date, and tags from a post page.
post_js = """new Promise(done => setInterval(() => {
    done({
        title: document.querySelector(".post-title h2").innerText,
        rawbody: document.querySelector(".post-body").innerHTML,
        date: document.querySelector(".posthaven-formatted-date").getAttribute("data-unix-time"),
        tags: Array.from(document.querySelectorAll("header .tags a")).map(x => x.innerText),
    })
}, 5));"""

blog_posts = []

# Iterate over each blog URL and fetch the title, post, tags, and date.
for url in blog_urls:
    print("Fetching", url)
    r = runez.run("shot-scraper", "javascript", url, post_js)
    content = json.loads(r.output)
    content["url"] = url
    blog_posts.append(content)

db = Database("blog.db")
# pk="id" creates an auto-incrementing integer primary key, since the
# scraped records don't carry an id of their own.
db["posts"].insert_all(blog_posts, pk="id")
Now I have a SQLite database with a posts table holding all my blog posts. Next, I used markdownify to convert the HTML snippets to Markdown and wrote them out as individual files compatible with the Hugo static site format.
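markdownify exposes a single function that takes an HTML string and returns Markdown; this example is lifted from its README:

from markdownify import markdownify as md

md('<b>Yay</b> <a href="http://github.com">GitHub</a>')
# '**Yay** [GitHub](http://github.com)'

Here’s the conversion script: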
import os
from datetime import datetime

import sqlite_utils
from markdownify import markdownify as md  # pip install markdownify

db = sqlite_utils.Database("blog.db")
for row in db["posts"].rows:
    # PostHaven exposes the publish date as a unix timestamp.
    ts = datetime.fromtimestamp(int(row["date"]))
    slug = row["url"].rsplit("/", 1)[-1]
    date = ts.isoformat()  # convert ts to ISO 8601
    year = ts.strftime("%Y")
    os.makedirs(year, exist_ok=True)  # one folder per year
    filename = f"{year}/{slug}.md"
    with open(filename, "w") as f:
        # Hugo front matter
        f.write("---\n")
        f.write(f'title: "{row["title"]}"\n')
        f.write(f"date: {date}\n")
        # tags come back as a JSON array string, which is also valid YAML
        f.write(f"tags: {row['tags']}\n")
        f.write(f'url: "/blog/{slug}"\n')
        f.write("---\n\n")
        # the post body, converted from HTML to Markdown
        f.write(md(row["rawbody"]))
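Each post ends up as a Markdown file with YAML front matter that Hugo understands. A generated file looks roughly like this (hypothetical values):

---
title: "A sample post"
date: 2015-06-01T12:34:56
tags: ["python", "sqlite"]
url: "/blog/a-sample-post"
---

The post body, converted from HTML to Markdown...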
We’re all done. Welcome to my new blog.
Now that I own all my content and am no longer locked into a vendor, maybe I’ll write more often.