Stream state machine: HTML parser
Today I want to talk about an interesting and powerful technique that can be used to extract URLs from HTML pages: the state machine.
Basically, a state machine is a program defined by a set of states and rules for moving between them:
- a state represents some condition of the program;
- one or more states can be used as initial states (there can also be zero, one, or more final states);
- each state can have transitions to itself, to other states, or nowhere.
I think the basic idea is clear. Now let's look at the parsing problem and some potential solutions, and then think about how to use a state machine for HTML parsing.
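For instance, here is a minimal sketch of such a machine in TypeScript (the language used later in this post); the states and the accepted language are made up purely for illustration:

```typescript
// A toy state machine (illustrative names): it accepts strings that
// consist of one or more 'a' characters followed by a single 'b'.
type State = "start" | "sawA" | "accept" | "reject";

// The transition function: current state + input symbol -> next state
function step(state: State, ch: string): State {
  switch (state) {
    case "start":
      // the only valid first symbol is 'a'
      return ch === "a" ? "sawA" : "reject";
    case "sawA":
      // stay on repeated 'a', finish on 'b', fail on anything else
      if (ch === "a") return "sawA";
      return ch === "b" ? "accept" : "reject";
    default:
      // 'accept' and 'reject' are final: any extra input invalidates 'accept'
      return "reject";
  }
}

// Run the machine over the whole input, one symbol at a time
function matches(input: string): boolean {
  let state: State = "start";
  for (const ch of input) {
    state = step(state, ch);
  }
  return state === "accept";
}
```

Here `matches("aab")` walks start → sawA → sawA → accept and returns true, while `matches("ba")` falls into reject on the first symbol.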
Parsing problem
The problem itself is really simple:
- download the HTML;
- parse/extract the data (in this case, URLs).
This problem is very sensitive to time and space complexity, because we need to parse a lot of HTML pages in minimal time and memory.
Possible solutions
Download the whole HTML, transform it into a DOM, extract 'a' tags and their href attributes
The main problem with this solution is that it's slow and heavy. We also have to use external libraries to build and work with the DOM.
Download the whole HTML and use regular expressions to extract URLs
The main problem with this solution is that it requires a non-trivial regular expression, is really error prone, and can also be slow (depending on the regular expression).
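As a hedged illustration of that point (the helper name and the pattern are mine, and the pattern is deliberately naive), even a "simple" href regex has to deal with optional whitespace and both quote styles, and it still misses unquoted values, commented-out markup, tags split across stream chunks, and more:

```typescript
// A naive regex-based extractor (a sketch only, with made-up names).
// It only finds quoted href values inside 'a' tags; plenty of valid
// HTML will slip past it.
function extractUrls(html: string): string[] {
  const urls: string[] = [];
  const re = /<a\s[^>]*href\s*=\s*["']([^"']+)["']/gi;
  let match: RegExpExecArray | null;
  while ((match = re.exec(html)) !== null) {
    urls.push(match[1]); // the captured href value
  }
  return urls;
}
```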
Use forward and backward search to find the required characters and track a virtual state
This solution requires a lot of data reading and forward/backward index movement, so it can be slow. Virtual states can also be hard to debug.
State machine
HTML parsing state machine
Let's look at a simple state machine for parsing URLs from HTML:
- the initial state is just raw HTML;
- from the 'html' state we can go to the 'tag' state if we find the symbol '<', or stay where we are;
- once in the 'tag' state, we can go to the possible 'a tag' state, stay in 'tag', or go to 'tag end';
- if we confirm an 'a' tag, we move from one state to another until we find 'href\s*=' or '>';
- from the '=' state we can go through a ' or " state to the 'URL' state, or go to the 'URL' state directly;
- when the 'URL' state is reached, we can finally extract the URL from the 'a' tag and go to the 'tag end' state;
- then we go through the 'tag end' state back to the 'html' state.
This is the basic idea of how we can easily parse URLs from HTML. It probably needs some extra checks and additional states/transitions, but why is this approach so useful?
- First, at each step we look only at the current symbol;
- Second, we look at each symbol just once;
- Finally, in each state we need just a few checks to move to the next state.
This means that we can extract all URLs in linear time.
Code implementation
Now let's try to write a simple implementation of this approach:
// Package.json with the correct name and version for the User-Agent header
const packagenpm = require("./package.json");
// Node.js built-in URL module for url parsing
import * as URL from "url";
// HTTP and HTTPS clients for download
import * as HTTP from "http";
import * as HTTPS from "https";
// KeepAlive agents for http and https (if we want to build a crawler)
const HTTP_AGENT = require("agentkeepalive");
const HTTPS_AGENT = HTTP_AGENT.HttpsAgent;
// HTTP KeepAlive agents
const HTTP_KEEP_ALIVE_AGENT = new HTTP_AGENT({
keepAlive: true,
keepAliveMsecs: 1000,
freeSocketKeepAliveTimeout: 30000,
timeout: 60000,
maxSockets: 100,
maxFreeSockets: 10,
});
// HTTPS KeepAlive agents
const HTTPS_KEEP_ALIVE_AGENT = new HTTPS_AGENT({
keepAlive: true,
keepAliveMsecs: 1000,
freeSocketKeepAliveTimeout: 30000,
timeout: 60000,
maxSockets: 100,
maxFreeSockets: 10,
});
// Our State Machine Parser class
class Parser {
  // URL to download
  private url: string;
  // Parsed URL
  private urlValue: URL.UrlWithParsedQuery;
  // HashSet with extracted urls
  private urls = new Set();
  // Current parsed URL for the state machine
  private parseUrl = "";
  // State machine initial state
  private parseState = 0;

  // Constructor to save the raw and parsed URL
  public constructor(url: string) {
    this.url = url;
    this.urlValue = URL.parse(url, true);
  }

  // Run method to download and parse the html
  public process() {
    return this.download();
  }
  // Method for html parsing
  private parserHTML(html: string) {
    // Go through all characters of the current chunk and switch states
    for (let i = 0; i < html.length; i++) {
      if (this.parseState === 0) {
        // 'html' state
        if (html[i] === "<") {
          // if we find the open tag symbol, go to the 'tag' state
          this.parseState = 1;
        }
      } else if (this.parseState === 1) {
        // 'tag' state
        if (html[i] === "a" || html[i] === "A") {
          // if we find an 'a' symbol, go to the possible 'a tag' state
          this.parseState = 2;
        } else if (html[i] !== " ") {
          // if we find some other non-space symbol, then it's not an 'a' tag
          // and we go to the 'tag end' state
          i--;
          this.parseState = 11;
        } else {
          // otherwise it's a space symbol and we just stay in the current state
          this.parseState = 1;
        }
      } else if (this.parseState === 2) {
        // possible 'a tag' state
        if (html[i] === " ") {
          // if we find a space symbol, then it's an 'a' tag
          this.parseState = 3;
        } else {
          // otherwise it's some other tag and we go to the 'tag end' state
          i--;
          this.parseState = 11;
        }
      } else if (this.parseState === 3) {
        // 'a tag' state
        this.parseUrl = "";
        if (html[i] === "h" || html[i] === "H") {
          // go through the 'href' states
          this.parseState = 4;
        } else if (html[i] === ">") {
          // if we find the tag end symbol, just go to the 'tag end' state
          i--;
          this.parseState = 11;
        } else {
          // if we find some other symbol, stay in the 'a tag' state
          this.parseState = 3;
        }
      } else if (this.parseState === 4) {
        // 'h' state
        if (html[i] === "r" || html[i] === "R") {
          // go through the 'href' states
          this.parseState = 5;
        } else if (html[i] === ">") {
          // if we find the tag end symbol, just go to the 'tag end' state
          i--;
          this.parseState = 11;
        } else {
          // if we find some other symbol, go back to the 'a tag' state
          this.parseState = 3;
        }
      } else if (this.parseState === 5) {
        // 'r' state
        if (html[i] === "e" || html[i] === "E") {
          // go through the 'href' states
          this.parseState = 6;
        } else if (html[i] === ">") {
          // if we find the tag end symbol, just go to the 'tag end' state
          i--;
          this.parseState = 11;
        } else {
          // if we find some other symbol, go back to the 'a tag' state
          this.parseState = 3;
        }
      } else if (this.parseState === 6) {
        // 'e' state
        if (html[i] === "f" || html[i] === "F") {
          // go through the 'href' states
          this.parseState = 7;
        } else if (html[i] === ">") {
          // if we find the tag end symbol, just go to the 'tag end' state
          i--;
          this.parseState = 11;
        } else {
          // if we find some other symbol, go back to the 'a tag' state
          this.parseState = 3;
        }
      } else if (this.parseState === 7) {
        // 'f' state
        if (html[i] === " ") {
          // skip spaces before the '=' symbol
          this.parseState = 7;
        } else if (html[i] === "=") {
          this.parseState = 8;
        } else {
          // if we find some other symbol, go back to the 'a tag' state
          this.parseState = 3;
        }
      } else if (this.parseState === 8) {
        // '=' state
        if (html[i] === " ") {
          // if we find a space symbol, just ignore it
          this.parseState = 8;
        } else if (html[i] === "\"" || html[i] === "'") {
          // if we find a quote symbol, go to the quote state
          this.parseState = 9;
        } else {
          // otherwise go to the 'URL' state
          i--;
          this.parseState = 10;
        }
      } else if (this.parseState === 9) {
        // quote state
        if (html[i] === " ") {
          // ignore space symbols
          this.parseState = 9;
        } else {
          // then go to the 'URL' state
          i--;
          this.parseState = 10;
        }
      } else if (this.parseState === 10) {
        // 'URL' state
        if (html[i] === "\"" || html[i] === "'" || html[i] === " ") {
          // if we reached the url end symbol (space or quote), go to the 'tag end' state
          i--;
          this.parseState = 11;
        } else {
          // stay in the 'URL' state and accumulate the current url
          this.parseUrl += html[i];
        }
      } else if (this.parseState === 11) {
        // 'tag end' state
        if (html[i] === ">") {
          // if we reached the tag end symbol, save the current parsed url
          // and go back to the raw html state
          this.parseState = 0;
          // Add the current parsed URL to the HashSet
          this.addUrl(this.parseUrl);
        } else {
          // skip all non tag end symbols
          this.parseState = 11;
        }
      }
    }
  }
  // Method to save a parsed url into the HashSet
  private addUrl(url: string) {
    // if the url is empty, do nothing
    if (!url) return;
    // resolve the url relative to the download url
    url = URL.resolve(this.url, url);
    // if it's an external url, just ignore it
    if (!url.startsWith(`${this.urlValue.protocol}//${this.urlValue.hostname}`)) return;
    // add the normalized url into the HashSet
    this.urls.add(url);
  }
  // Method to download the url
  private download() {
    return new Promise((resolve, reject) => {
      // Determine which client we will use
      const isHttps = this.urlValue.protocol.indexOf("https") === 0;
      // Create an HTTP or HTTPS request
      const req: HTTP.ClientRequest = ((isHttps ? HTTPS : HTTP) as any).request({
        host: this.urlValue.host,
        port: isHttps ? 443 : 80,
        path: this.urlValue.pathname,
        method: "GET",
        agent: isHttps ? HTTPS_KEEP_ALIVE_AGENT : HTTP_KEEP_ALIVE_AGENT,
        headers: {
          "User-Agent": `${packagenpm.name}/${packagenpm.version} (+https://CrazySquirrel.ru/)`
        }
      }, (res: HTTP.IncomingMessage) => {
        // Get the response status and headers
        const status = res.statusCode;
        const headers = res.headers;
        // Get the response content type
        const contentType = headers["content-type"] || "";
        // If the url doesn't exist, ignore it
        if (status !== 200) {
          return reject(new Error("Not 200 OK"));
        }
        // If the url isn't an html page, ignore it
        if (contentType.indexOf("text/html") === -1) {
          return reject(new Error("Not HTML"));
        }
        // Get the encoding from the response
        const encoding = (contentType.split("charset=")[1] || "utf-8").trim().toLowerCase();
        // Set the correct encoding for data processing
        res.setEncoding(encoding);
        // Add a listener for data processing:
        // parse each chunk separately
        res.on("data", this.parserHTML.bind(this));
        // Add a listener for the processing end
        res.on("end", () => resolve(this.urls.values()));
        // Add an error listener
        res.on("error", reject);
      });
      // Add an error listener
      req.on("error", reject);
      // Send the request
      req.end();
    });
  }
}
// Initialize parser and download example page
(new Parser("https://crazysquirrel.ru/"))
.process()
.then(console.log)
.catch(console.log);
This is Node.js code which implements the downloader and the parsing state machine.
It can be improved by changing the 'else if' chain to a switch statement or a table lookup, but the main idea stays the same:
- we get HTML data from the stream in chunks;
- parse each chunk on the fly;
- and save the URLs into the HashSet.
When we have received and processed all the data, we just return the HashSet values.
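As a rough sketch of the table-lookup idea (reusing a few of the state numbers from the code above; the 'href' matching states and the `i--` retry trick are elided, so the transitions are approximate):

```typescript
// Transition table: each state maps the current character to the next state
const STATES: { [state: number]: (ch: string) => number } = {
  0: (ch) => (ch === "<" ? 1 : 0),                                  // 'html' state
  1: (ch) => (ch === "a" || ch === "A" ? 2 : ch === " " ? 1 : 11),  // 'tag' state
  2: (ch) => (ch === " " ? 3 : 11),                                 // possible 'a tag' state
  3: () => 3,                                                       // 'a tag' state (href matching elided)
  11: (ch) => (ch === ">" ? 0 : 11),                                // 'tag end' state
};

// The driver loop no longer branches on the state: it just indexes the table
function run(html: string, state = 0): number {
  for (const ch of html) {
    state = STATES[state](ch);
  }
  return state;
}
```

Feeding `run("<p>")` walks 0 → 1 → 11 → 0, while `run("<a href")` ends up parked in the 'a tag' state.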
Summary
This approach has a few nice properties:
- linear time complexity;
- space complexity that depends linearly on the HTML chunk size and the number of distinct URLs;
- code that can be easily optimised by the compiler.
This is an example of how a simple and powerful algorithm can be used to efficiently solve a specific task.