Extracting Content of HTML Page in Javascript

For my latest project, I had to crawl websites returned by search engines and extract their contents for later processing by an ML sentiment analysis model. I expected extracting article content to be tedious: web pages behave differently, use different markup, and contain a lot of text unrelated to the article itself.

Luckily, this problem has already been tackled by the Readability package. To use it, you create an object, passing a DOM document to its constructor.

import readability from "@mozilla/readability";

const reader = new readability.Readability(doc.window.document);

Since our code runs in AWS Lambda, we don't have access to the browser's DOM parsing capabilities. So we'll use the JSDOM library to build a DOM from the fetched HTML; this is where the doc object passed to Readability above comes from.

import jsdom from "jsdom";

const res = await fetch(url);
const html = await res.text();
const doc = new jsdom.JSDOM(html);

Once we've created our Readability object, we call the parse method.

const article = reader.parse();
console.log(`extracted article text: ${article.textContent}`);

The parse method returns an object containing the following properties:

  • title: article title;
  • content: HTML string of processed article content;
  • textContent: text content of the article, with all the HTML tags removed;
  • length: length of an article, in characters;
  • excerpt: article description or short excerpt from the content;
  • byline: author metadata;
  • dir: content direction;
  • siteName: name of the site;
  • lang: content language;
  • publishedTime: published time;
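Not every page contains something Readability recognizes as an article, and in that case parse can come back as null, so it is worth guarding before reading any of these properties. Below is a minimal sketch of such a guard. The summarizeArticle helper and the sample object are hypothetical and only illustrate the shape of the result; they are not part of the Readability API.

```javascript
// Hypothetical helper (not part of Readability): reduces a parse() result
// to the few fields a downstream text-processing step is likely to need.
const summarizeArticle = (article) => {
  // Guard against a null result (no article content detected).
  if (!article) {
    return null;
  }

  return {
    title: article.title ?? '',
    text: (article.textContent ?? '').trim(),
    length: article.length ?? 0,
    lang: article.lang ?? null,
  };
};

// Stand-in object shaped like a parse() result, for illustration only.
const sample = {
  title: 'Hello',
  textContent: '  Some text.  ',
  length: 14,
  lang: 'en',
};

console.log(summarizeArticle(sample));
console.log(summarizeArticle(null));
```

In the real pipeline, the argument would be the value returned by reader.parse().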

Polishing the solution

The result, however, is not ideal. The extracted text still contains footnote markers and some weird characters, and we want to remove both before passing the text on. Let's have a look at a footnote remover.

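const removeFootnotes = (str) => {
  // True while we are between an opening '[' and its closing ']'.
  let inFootnote = false;
  let result = '';

  for (let i = 0; i < str.length; i++) {
    // An opening bracket starts a footnote marker such as "[1]".
    if (str[i] === '[') {
      inFootnote = true;
      continue;
    }

    // A closing bracket ends the marker; the bracket itself is dropped too.
    if (str[i] === ']') {
      inFootnote = false;
      continue;
    }

    // Keep only characters that are outside a footnote marker.
    if (!inFootnote) {
      result += str[i];
    }
  }

  return result;
};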

export default removeFootnotes;
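For comparison, the same idea can be sketched with a regular expression. This is an alternative, not the version used in this post, and the name removeFootnotesRegex is made up for illustration. It gives the same result for well-formed markers like [1], but behaves differently on unbalanced brackets: an opening bracket without a closing one is left in place instead of swallowing the rest of the text.

```javascript
// Hypothetical regex-based alternative to the loop-based footnote remover.
// Matches '[' followed by any run of non-']' characters and a closing ']'.
const removeFootnotesRegex = (str) => str.replace(/\[[^\]]*\]/g, '');

console.log(removeFootnotesRegex('Hello[1] world[2]!'));
console.log(removeFootnotesRegex('No closing [bracket here'));
```

Which behavior is preferable depends on the input: the loop is stricter about dropping anything that looks like the start of a marker, while the regex only removes complete markers.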

And here is the remover for weird characters.

const sanitize = (str) => {
  let result = '';

  for (let i = 0; i < str.length; i++) {
    const char = str[i];
    if (
      (char >= 'A' && char <= 'Z') ||
      (char >= 'a' && char <= 'z') ||
      (char >= '0' && char <= '9') ||
      char === ' ' ||
      '()",.-?!«»'.includes(char)
    ) {
      result += char;
    }
  }

  return result;
};

export default sanitize;
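The allowlist above can also be written as a single negated character class. The sketch below is an alternative formulation under the same allowlist, with the hypothetical name sanitizeRegex. It also makes two consequences of the allowlist easy to see: accented letters are dropped, and so are newlines, which joins the words on either side of a line break.

```javascript
// Hypothetical regex-based variant of the sanitizer: removes every character
// that is NOT a Latin letter, a digit, a space, or one of ()",.-?!«»
const sanitizeRegex = (str) => str.replace(/[^A-Za-z0-9 ()",.\-?!«»]/g, '');

console.log(sanitizeRegex('abc123!@#'));
console.log(sanitizeRegex('caf\u00e9 au lait'));
console.log(sanitizeRegex('first line\nsecond line'));
```

If the text being processed is not plain English, the allowlist would need to be widened before using either version.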

Let's unit-test it with the help of mocha and expect.js.

import expect from 'expect.js'

import sanitize from '../article-extractor/sanitizer.mjs';
import removeFootnotes from '../article-extractor/footnote-remover.mjs';

describe('sanitize', () => {
  it('removes characters outside the allowlist', () => {
    expect(sanitize('abc123!@#')).to.be('abc123!');
  });

  it('keeps spaces', () => {
    expect(sanitize('abc 123')).to.be('abc 123');
  });

  it('keeps punctuation', () => {
    expect(sanitize('abc,.-?!')).to.be('abc,.-?!');
  });

  it('handles empty string', () => {
    expect(sanitize('')).to.be('');
  });

  it('removes unwanted punctuation', () => {
    expect(sanitize('#!@$')).to.be('!');
  });
});

describe('removeFootnotes', () => {
  it('removes footnotes', () => {
    expect(removeFootnotes('Hello[1] world[2]!')).to.be('Hello world!');
  });
});

Now we just have to call our sanitizer functions on the extracted text.

return sanitize(removeFootnotes(article.textContent));