Stream state machine: HTML parser
Today I want to talk about an interesting and powerful technique that can be used to extract URLs from HTML pages: the state machine.
Basically, a state machine is a program defined by a set of states and rules for moving between them:
- a state represents some condition of the program;
- one or more states can be used as initial states (there can also be zero, one, or more final states);
- each state can have transitions to itself, to other states, or nowhere.
I think the basic idea is clear. Now let's look at the parsing problem and some potential solutions, and then think about how to use a state machine for HTML parsing.
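For instance, here is a minimal sketch of such a machine in TypeScript (the language used later in this post); the states and the accepted language are made up purely for illustration:

```typescript
// A toy state machine (illustrative names): it accepts strings that
// consist of one or more 'a' characters followed by a single 'b'.
type State = "start" | "sawA" | "accept" | "reject";

// The transition function: current state + input symbol -> next state
function step(state: State, ch: string): State {
  switch (state) {
    case "start":
      // the only valid first symbol is 'a'
      return ch === "a" ? "sawA" : "reject";
    case "sawA":
      // stay on repeated 'a', finish on 'b', fail on anything else
      if (ch === "a") return "sawA";
      return ch === "b" ? "accept" : "reject";
    default:
      // 'accept' and 'reject' are final: any extra input invalidates 'accept'
      return "reject";
  }
}

// Run the machine over the whole input, one symbol at a time
function matches(input: string): boolean {
  let state: State = "start";
  for (const ch of input) {
    state = step(state, ch);
  }
  return state === "accept";
}
```

Here `matches("aab")` walks start → sawA → sawA → accept and returns true, while `matches("ba")` falls into reject on the first symbol.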
Parsing problem
The problem itself is really simple:
- download the HTML;
- parse/extract the data (in this case, URLs).
This problem is very sensitive to time and space complexity, because we need to parse a lot of HTML pages in minimal time and memory.
Possible solutions
Download the whole HTML, transform it into a DOM, extract 'a' tags and their href attributes
The main problem with this solution is that it's slow and heavy. We also have to use external libraries to build and work with the DOM.
Download the whole HTML and use regular expressions to extract URLs
The main problem with this solution is that it requires a non-trivial regular expression, is really error prone, and can also be slow (depending on the regular expression).
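As a hedged illustration of that point (the helper name and the pattern are mine, and the pattern is deliberately naive), even a "simple" href regex has to deal with optional whitespace and both quote styles, and it still misses unquoted values, commented-out markup, tags split across stream chunks, and more:

```typescript
// A naive regex-based extractor (a sketch only, with made-up names).
// It only finds quoted href values inside 'a' tags; plenty of valid
// HTML will slip past it.
function extractUrls(html: string): string[] {
  const urls: string[] = [];
  const re = /<a\s[^>]*href\s*=\s*["']([^"']+)["']/gi;
  let match: RegExpExecArray | null;
  while ((match = re.exec(html)) !== null) {
    urls.push(match[1]); // the captured href value
  }
  return urls;
}
```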
Use forward and backward search to find the required characters and track a virtual state
This solution requires a lot of data reading and forward/backward index movement, so it can be slow. Virtual states can also be hard to debug.
State machine
HTML parsing state machine
Let's look at a simple state machine for parsing URLs from HTML:
- the initial state is just raw HTML;
- from the 'html' state we can go to the 'tag' state if we find the symbol '<', or stay where we are;
- once in the 'tag' state, we can go to the possible 'a tag' state, stay in 'tag', or go to 'tag end';
- if we confirm an 'a' tag, we move from one state to another until we find 'href\s*=' or '>';
- from the '=' state we can go through a ' or " state to the 'URL' state, or go to the 'URL' state directly;
- when the 'URL' state is reached, we can finally extract the URL from the 'a' tag and go to the 'tag end' state;
- then we go through the 'tag end' state back to the 'html' state.
This is the basic idea of how we can easily parse URLs from HTML. It probably needs some extra checks and additional states/transitions, but why is this approach so useful?
- First, at each step we look only at the current symbol;
- Second, we look at each symbol just once;
- Finally, in each state we need just a few checks to move to the next state.
This means that we can extract all URLs in linear time.
Code implementation
Now let's try to write a simple implementation of this approach:
// Package.json with the correct name and version for the User-Agent header
const packagenpm = require("./package.json");
// Node.js built-in URL module for url parsing
import * as URL from "url";
// HTTP and HTTPS clients for download
import * as HTTP from "http";
import * as HTTPS from "https";
// KeepAlive agents for http and https (if we want to build a crawler)
const HTTP_AGENT = require("agentkeepalive");
const HTTPS_AGENT = HTTP_AGENT.HttpsAgent;
// HTTP KeepAlive agents
const HTTP_KEEP_ALIVE_AGENT = new HTTP_AGENT({
keepAlive: true,
keepAliveMsecs: 1000,
freeSocketKeepAliveTimeout: 30000,
timeout: 60000,
maxSockets: 100,
maxFreeSockets: 10,
});
// HTTPS KeepAlive agents
const HTTPS_KEEP_ALIVE_AGENT = new HTTPS_AGENT({
keepAlive: true,
keepAliveMsecs: 1000,
freeSocketKeepAliveTimeout: 30000,
timeout: 60000,
maxSockets: 100,
maxFreeSockets: 10,
});
// Our State Machine Parser class
class Parser {
  // URL to download
  private url: string;
  // Parsed URL
  private urlValue: URL.UrlWithParsedQuery;
  // HashSet with extracted urls
  private urls = new Set();
  // Current parsed URL for the state machine
  private parseUrl = "";
  // State machine initial state
  private parseState = 0;

  // Constructor to save the raw and parsed URL
  public constructor(url: string) {
    this.url = url;
    this.urlValue = URL.parse(url, true);
  }

  // Run method to download and parse the html
  public process() {
    return this.download();
  }
  // Method for html parsing
  private parserHTML(html: string) {
    // Go through all characters of the current chunk and switch states
    for (let i = 0; i < html.length; i++) {
      if (this.parseState === 0) {
        // 'html' state
        if (html[i] === "<") {
          // if we find the open tag symbol, go to the 'tag' state
          this.parseState = 1;
        }
      } else if (this.parseState === 1) {
        // 'tag' state
        if (html[i] === "a" || html[i] === "A") {
          // if we find an 'a' symbol, go to the possible 'a tag' state
          this.parseState = 2;
        } else if (html[i] !== " ") {
          // if we find some other non-space symbol, then it's not an 'a' tag
          // and we go to the 'tag end' state
          i--;
          this.parseState = 11;
        } else {
          // otherwise it's a space symbol and we just stay in the current state
          this.parseState = 1;
        }
      } else if (this.parseState === 2) {
        // possible 'a tag' state
        if (html[i] === " ") {
          // if we find a space symbol, then it's an 'a' tag
          this.parseState = 3;
        } else {
          // otherwise it's some other tag and we go to the 'tag end' state
          i--;
          this.parseState = 11;
        }
      } else if (this.parseState === 3) {
        // 'a tag' state
        this.parseUrl = "";
        if (html[i] === "h" || html[i] === "H") {
          // go through the 'href' states
          this.parseState = 4;
        } else if (html[i] === ">") {
          // if we find the tag end symbol, just go to the 'tag end' state
          i--;
          this.parseState = 11;
        } else {
          // if we find some other symbol, stay in the 'a tag' state
          this.parseState = 3;
        }
      } else if (this.parseState === 4) {
        // 'h' state
        if (html[i] === "r" || html[i] === "R") {
          // go through the 'href' states
          this.parseState = 5;
        } else if (html[i] === ">") {
          // if we find the tag end symbol, just go to the 'tag end' state
          i--;
          this.parseState = 11;
        } else {
          // if we find some other symbol, go back to the 'a tag' state
          this.parseState = 3;
        }
      } else if (this.parseState === 5) {
        // 'r' state
        if (html[i] === "e" || html[i] === "E") {
          // go through the 'href' states
          this.parseState = 6;
        } else if (html[i] === ">") {
          // if we find the tag end symbol, just go to the 'tag end' state
          i--;
          this.parseState = 11;
        } else {
          // if we find some other symbol, go back to the 'a tag' state
          this.parseState = 3;
        }
      } else if (this.parseState === 6) {
        // 'e' state
        if (html[i] === "f" || html[i] === "F") {
          // go through the 'href' states
          this.parseState = 7;
        } else if (html[i] === ">") {
          // if we find the tag end symbol, just go to the 'tag end' state
          i--;
          this.parseState = 11;
        } else {
          // if we find some other symbol, go back to the 'a tag' state
          this.parseState = 3;
        }
      } else if (this.parseState === 7) {
        // 'f' state
        if (html[i] === " ") {
          // skip spaces before the '=' symbol
          this.parseState = 7;
        } else if (html[i] === "=") {
          this.parseState = 8;
        } else {
          // if we find some other symbol, go back to the 'a tag' state
          this.parseState = 3;
        }
      } else if (this.parseState === 8) {
        // '=' state
        if (html[i] === " ") {
          // if we find a space symbol, just ignore it
          this.parseState = 8;
        } else if (html[i] === "\"" || html[i] === "'") {
          // if we find a quote symbol, go to the quote state
          this.parseState = 9;
        } else {
          // otherwise go to the 'URL' state
          i--;
          this.parseState = 10;
        }
      } else if (this.parseState === 9) {
        // quote state
        if (html[i] === " ") {
          // ignore space symbols
          this.parseState = 9;
        } else {
          // then go to the 'URL' state
          i--;
          this.parseState = 10;
        }
      } else if (this.parseState === 10) {
        // 'URL' state
        if (html[i] === "\"" || html[i] === "'" || html[i] === " ") {
          // if we reached the url end symbol (space or quote), go to the 'tag end' state
          i--;
          this.parseState = 11;
        } else {
          // stay in the 'URL' state and accumulate the current url
          this.parseUrl += html[i];
        }
      } else if (this.parseState === 11) {
        // 'tag end' state
        if (html[i] === ">") {
          // if we reached the tag end symbol, save the current parsed url
          // and go back to the raw html state
          this.parseState = 0;
          // Add the current parsed URL to the HashSet
          this.addUrl(this.parseUrl);
        } else {
          // skip all non tag end symbols
          this.parseState = 11;
        }
      }
    }
  }
  // Method to save a parsed url into the HashSet
  private addUrl(url: string) {
    // if the url is empty, do nothing
    if (!url) return;
    // resolve the url relative to the download url
    url = URL.resolve(this.url, url);
    // if it's an external url, just ignore it
    if (!url.startsWith(`${this.urlValue.protocol}//${this.urlValue.hostname}`)) return;
    // add the normalized url into the HashSet
    this.urls.add(url);
  }
  // Method to download the url
  private download() {
    return new Promise((resolve, reject) => {
      // Determine which client we will use
      const isHttps = this.urlValue.protocol.indexOf("https") === 0;
      // Create an HTTP or HTTPS request
      const req: HTTP.ClientRequest = ((isHttps ? HTTPS : HTTP) as any).request({
        host: this.urlValue.host,
        port: isHttps ? 443 : 80,
        path: this.urlValue.pathname,
        method: "GET",
        agent: isHttps ? HTTPS_KEEP_ALIVE_AGENT : HTTP_KEEP_ALIVE_AGENT,
        headers: {
          "User-Agent": `${packagenpm.name}/${packagenpm.version} (+https://CrazySquirrel.ru/)`
        }
      }, (res: HTTP.IncomingMessage) => {
        // Get the response status and headers
        const status = res.statusCode;
        const headers = res.headers;
        // Get the response content type
        const contentType = headers["content-type"] || "";
        // If the url doesn't exist, ignore it
        if (status !== 200) {
          return reject(new Error("Not 200 OK"));
        }
        // If the url isn't an html page, ignore it
        if (contentType.indexOf("text/html") === -1) {
          return reject(new Error("Not HTML"));
        }
        // Get the encoding from the response
        const encoding = (contentType.split("charset=")[1] || "utf-8").trim().toLowerCase();
        // Set the correct encoding for data processing
        res.setEncoding(encoding);
        // Add a listener for data processing:
        // parse each chunk separately
        res.on("data", this.parserHTML.bind(this));
        // Add a listener for the processing end
        res.on("end", () => resolve(this.urls.values()));
        // Add an error listener
        res.on("error", reject);
      });
      // Add an error listener
      req.on("error", reject);
      // Send the request
      req.end();
    });
  }
}
// Initialize parser and download example page
(new Parser("https://crazysquirrel.ru/"))
.process()
.then(console.log)
.catch(console.log);
This is Node.js code which implements the downloader and the parsing state machine.
It can be improved by changing the 'else if' chain to a switch statement or a table lookup, but the main idea stays the same:
- we get HTML data from the stream in chunks;
- parse each chunk on the fly;
- and save the URLs into the HashSet.
When we have received and processed all the data, we just return the HashSet values.
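As a rough sketch of the table-lookup idea (reusing a few of the state numbers from the code above; the 'href' matching states and the `i--` retry trick are elided, so the transitions are approximate):

```typescript
// Transition table: each state maps the current character to the next state
const STATES: { [state: number]: (ch: string) => number } = {
  0: (ch) => (ch === "<" ? 1 : 0),                                  // 'html' state
  1: (ch) => (ch === "a" || ch === "A" ? 2 : ch === " " ? 1 : 11),  // 'tag' state
  2: (ch) => (ch === " " ? 3 : 11),                                 // possible 'a tag' state
  3: () => 3,                                                       // 'a tag' state (href matching elided)
  11: (ch) => (ch === ">" ? 0 : 11),                                // 'tag end' state
};

// The driver loop no longer branches on the state: it just indexes the table
function run(html: string, state = 0): number {
  for (const ch of html) {
    state = STATES[state](ch);
  }
  return state;
}
```

Feeding `run("<p>")` walks 0 → 1 → 11 → 0, while `run("<a href")` ends up parked in the 'a tag' state.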
Summary
This approach has a few nice properties:
- linear time complexity;
- space complexity that depends linearly on the HTML chunk size and the number of distinct URLs;
- code that can be easily optimised by the compiler.
This is an example of how a simple and powerful algorithm can be used to efficiently solve a specific task.