Converting a dynamic site into static HTML documents

Twice now I've been asked to take a website running on a CMS and turn it into a static one.

This is a useful practice if you want to keep a site's content for posterity without having to maintain the underlying CMS. It also makes migrations easier: a site you know you won't add content to anymore becomes simply a bunch of HTML files in a folder.

My end goal was to make an EXACT copy of the site as generated by the CMS, but stored as plain HTML files. When I say EXACT, I mean it, even down to keeping every document at its original location. Each HTML document keeps its links unchanged, yet a matching file exists and the web server finds it. For example, if a link points to /foo, the link in the page remains as-is even though the content now lives in a static file at /foo.html, and the web server serves /foo.html anyway.
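For that last part to work, the web server needs a small rewrite rule. Here is a minimal sketch for nginx (the choice of nginx and the root path are my assumptions, not from the original setup; Apache's mod_rewrite can achieve the same):

```nginx
server {
    listen 80;
    root /var/www/static-site;   # wherever the scraped files end up

    location / {
        # Try the literal path first, then its .html twin, then a directory index.
        try_files $uri $uri.html $uri/index.html =404;
    }
}
```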

Here are the steps I took to achieve just that. Your mileage may vary; these steps worked for me. I've done this once for a WordPress blog and once for the [email protected] website, which was running on ExpressionEngine.


1. Browse and get all pages you think could be lost in scraping

We want a simple file with one web page per line, each as a full address.
This will help the crawler not to miss pages.

  • Use your web browser's developer tools Network inspector and keep it open with “Preserve log” enabled.
  • Once you have browsed the site a bit, list all documents from the Network inspector and export them using the “Save as HAR” feature.
  • Extract the URLs from the HAR file using underscore-cli:

    npm install underscore-cli
    cat site.har | underscore select '.entries .request .url' > workfile.txt

  • Remove the first and last lines (it's a JSON array and we want one document per line)

  • Remove the scheme and hostname from each line so every entry starts with /path; in vim: %s/^"http:\/\/www\.example\.com//g (substitute your own hostname)
  • Remove the trailing ", from each line; in vim: %s/",$//g
  • On the last line, make sure the closing " is removed too, because the previous regex missed it
  • Remove duplicate lines; in vim: :sort u
  • Save this file as list.txt for the next step.
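As an aside, the manual steps above can be collapsed into a single pipeline. This is a sketch using jq instead of underscore-cli (jq and the inline demo HAR are my substitutions, not the original tooling; a real site.har would come from the browser export):

```shell
# A tiny demo HAR stands in for your real site.har export.
cat > site.har <<'EOF'
{"log":{"entries":[
  {"request":{"url":"http://www.example.com/foo"}},
  {"request":{"url":"http://www.example.com/foo"}},
  {"request":{"url":"http://www.example.com/bar/baz"}}
]}}
EOF

# HAR stores each request URL under .log.entries[].request.url:
# extract them, strip scheme and hostname, then dedupe and sort.
jq -r '.log.entries[].request.url' site.har \
  | sed -E 's|^https?://[^/]+||' \
  | sort -u > list.txt

cat list.txt
```

The result is the same one-path-per-line list.txt the vim steps produce.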

2. Let’s scrape it all

We'll do two scrapes. The first pass grabs all the assets it can; then we go again with different options.

The following are the commands I ran on the last successful attempt to replicate the site I was working on.
This is not a statement that this method is the most efficient technique.
Please feel free to improve the document as you see fit.

First a quick TL;DR of wget options

  • -m is the same as --mirror
  • -k is the same as --convert-links
  • -K is the same as --backup-converted, which keeps the original files as .orig
  • -p is the same as --page-requisites, which makes wget fetch ALL of a page's requirements (images, CSS, scripts)
  • -nc (--no-clobber) ensures we don't download the same file twice and end up with duplicates (e.g. file.html AND file.1.html)
  • --cut-dirs would prevent creating directories and mix things up; do not use it.

Notice that we're sending headers as if we were a web browser. That part is up to you.

wget -i list.txt -nc --random-wait --mirror -e robots=off --no-cache -k -E --page-requisites \
     --user-agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.94 Safari/537.36' \
     --header='Accept-Language: fr-FR,fr;q=0.8,fr-CA;q=0.6,en-US;q=0.4,en;q=0.2' \
     --header='Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'

Then, another pass:

wget -i list.txt --mirror -e robots=off -k -K -E --no-cache --no-parent \
     --user-agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.94 Safari/537.36' \
     --header='Accept-Language: fr-FR,fr;q=0.8,fr-CA;q=0.6,en-US;q=0.4,en;q=0.2' \
     --header='Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'

3. Do some cleanup on the fetched files

Here are a few commands I ran to clean the files a bit

  • Strip Windows carriage returns (\r) from every .orig file; they're the files we'll actually use in the end

    find . -type f -regextype posix-egrep -regex '.*\.orig$' -exec sed -i 's/\r//' {} \;
  • Rename the .orig files to .html

    find . -name '*orig' | sed -e "p;s/orig/html/" | xargs -n2 mv
    find . -type f -name '*\.html\.html' | sed -e "p;s/\.html//" | xargs -n2 mv
  • Many folders might contain only an index.html file. Let's replace each such directory with a plain file

    find . -type f -name 'index.html' | sed -e "p;s/\/index\.html/.html/" | xargs -n2 mv
  • Remove files that have a .1 (or any other number) in their name; they are most likely duplicates anyway

    find . -type f -name '*\.1\.*' -exec rm -rf {} \;
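After all this shuffling it's worth checking that every page from step 1 still resolves to a file on disk. A small sketch (the mirror directory name and the fixture lines are assumptions for illustration; point mirror at your actual host folder and drop the fixtures):

```shell
mirror=./www.example.com

# Demo fixtures so the sketch runs standalone (remove for real use):
# /foo is mirrored as foo.html, /bar/baz is deliberately missing.
printf '%s\n' /foo /bar/baz > list.txt
mkdir -p "$mirror"
: > "$mirror/foo.html"

# A page is covered if the raw path, its .html twin, or its index.html exists.
while IFS= read -r path; do
  base="${mirror}${path}"
  [ -e "$base" ] || [ -e "${base}.html" ] || [ -e "${base%/}/index.html" ] \
    || echo "missing: $path"
done < list.txt > missing.txt

cat missing.txt
```

Anything it prints is a page the crawler missed; feed those back into list.txt and re-run wget.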

Encapsulate a LDAP DN string using Arrays in PHP

During a project I had to build privilege-assignment logic based on information coming from an LDAP server. Since the DN string representing a user can differ widely from one user to another (reflecting their affiliations and roles), I had to find ways to interpret sub-parts of that string to figure out which privileges to attach to them.

My snippet's purpose is to extract a subset of that LDAP DN string, assuming that the first component is the most specific and the last the most general, and that concatenated they give the full context.

While trying to find a ready-made function that explodes that string into manageable arrays in PHP, I realized there was none. I then decided to contribute one as a comment on the website.

Continue reading “Encapsulate a LDAP DN string using Arrays in PHP”

Creating and using Javascript events while combining events on two separate behaviors

I discovered something that shocked me: did you know that the ‘click’ event is only a string, and that you can create any event name you want? Here is an experiment as an example.

During web development, it often happens that you want to attach event handlers to something on your page. A common use case: flipping a plus sign to a minus sign when a button is clicked.

<a href="/some/url/324" class="flip-icon" data-target="#generated-324"><i class="icon-plus"></i></a>

Later in a script you may be compelled to do something similar to the following (assuming you are using jQuery):

// Inside $(document).ready()
$('.flip-icon').on('click', function () {
    var clicked = $(this);
    var flipElement = clicked.find('i');
    if (flipElement.hasClass('icon-plus')) {
        flipElement.removeClass('icon-plus').addClass('icon-minus');
    } else {
        flipElement.removeClass('icon-minus').addClass('icon-plus');
    }
});

But what happens if you want to add other events, such as activating an accordion? You may end up duplicating events and getting collisions.

Did you know that the ‘click’ event is only a string and you can create any event name you may want?

To illustrate what I am referring to, I will add another behavior that also, in part, requires the previous example.

Imagine we already have an accordion that grabs the click event on a[data-target] elements.

// Inside $(document).ready()
$('a[data-target]').on('click', function () {
    // do the accordion stuff
});

But what if, for some reason, our page reloads some sections and the event handler managing the a[data-target] click gets lost?

Instead of creating a click-specific handler (what if we want to change it later?) that can lose track of the element it was attached to, you can use jQuery's on method and attach a delegated event to the <body>, a safe element that every document has.

Things to note about the on method:

  • The first parameter is the event name(s), space-separated, and can be ANYTHING (yes, you read that right)
  • The second parameter is a selector filtering which descendants can trigger the event; it can be null
  • The third is a Function object to handle the event

Also, there is a nice thing called bubbling.

When an event happens, it first travels down the DOM from the document to the triggering element (the ‘capture’ phase), then climbs back up toward the <body> (the ‘bubbling’ phase), firing matching event handlers along the way. This is why a handler attached to <body> can react to events on any descendant.

Knowing all of this, instead of attaching a single event-type handler to a specific element, let's take advantage of our new knowledge.

'use strict';
// Rest of your document

    // Look at 'flip-my-icon-event'; we just made that one up. See below.
    $('body').on('click flip-my-icon-event', '.flip-icon', function () {
/* Look here      *************************                                       */
        // Let's also wrap it in a self-executing anonymous function, to isolate scope
        (function (clicked) {
            // Same as earlier.
            var flipElement = clicked.find('i');
            if (flipElement.hasClass('icon-plus')) {
                flipElement.removeClass('icon-plus').addClass('icon-minus');
            } else {
                flipElement.removeClass('icon-minus').addClass('icon-plus');
            }
            // End same as earlier
        })($(this)); // this fires the self-executing function.
    });

    $('body').on('click', 'a[data-target]', function (event) {
        event.preventDefault();

        // do the accordion stuff
        var collapsible = $($(this).attr('data-target'));
        if (typeof collapsible.attr('data-collapsible') === 'undefined') {
            collapsible
                .attr('data-collapsible', 'applied')
                .on('show', function () { /* accordion opening */ })
                .on('hide', function () { /* accordion closing */ });
        }
        // End do the accordion stuff

        $(this).trigger('flip-my-icon-event');
/* Look here                          *******************************        */
    });

The following works because of the trigger HTML pattern shown at the beginning:

<a href="/some/url/324" class="flip-icon" data-target="#generated-324"><i class="icon-plus"></i></a>

And of the following:

  • We have an icon for the .icon-plus and .icon-minus class names
  • The a[data-target] element ALSO has the .flip-icon class name
  • The a[data-target] handler triggers our made-up flip-my-icon-event on an element that also matches the .flip-icon selector (see the two ‘Look here’ comments)


How I validate and convert overloaded HTML documents, or documents coming from Microsoft Word, into valid and simplified HTML


This procedure is meant to optimize the conversion of Word documents to HTML, especially those generated with a lot of “tag soup”, and to simplify them down to their simplest valid HTML expression.

Jump to the Procedure


Here is an example of the kind of tag soup a Word export produces:

<h2 class="Standard" dir="ltr" lang="fr-FR" style="margin-top: 0; margin-bottom: 0; text-align: center;" xml:lang="fr-FR">
  <span lang="en-CA" style="font-weight: bold; font-size: 16.0px;" xml:lang="en-CA">TERMS AND CONDITIONS OF 1</span> <span lang="en-CA" style="font-weight: bold; font-size: 16.0px;" xml:lang="en-CA">‐</span> <span lang="en-CA" style="font-weight: bold; font-size: 16.0px;" xml:lang="en-CA">YEAR OR 30</span> <span lang="en-CA" style="font-weight: bold; font-size: 16.0px;" xml:lang="en-CA">‐</span> <span lang="en-CA" style="font-weight: bold; font-size: 16.0px;" xml:lang="en-CA">DAY ACCESS AND USE</span>
  <span lang="en-CA" style="font-weight: bold; font-size: 16.0px;" xml:lang="en-CA">OF THE SERVICE BY SUBSCRIBERS</span> <span lang="en-CA" style="font-weight: bold; font-size: 16.0px;" xml:lang="en-CA">SECTION 1</span> <span lang="en-CA" style="font-weight: bold; font-size: 16.0px;" xml:lang="en-CA">PURPOSE OF THE SERVICE</span>
</h2>



Inspiration and leads

  1. Converting document formats from the command line, from Word 2000 to HTML: see UNOCONV





Abstract version

  1. Use Open Office (or whatever else) to export the document to HTML
  2. Purify it with HTML Tidy
  3. Clean the useless attributes with the htmLawed class

Use cases

Text-only document

  • No forms, images, etc.
  • Ideal for a legal document, for example.

Concrete steps:

  1. Start from the generated HTML file, for example a file called “1.1.2.en.html”

* Set up a working folder and move the htmLawed class into it

cd ~
mkdir htmlawed
mv htmLawed.php htmlawed/   # source filename assumed; use the file you downloaded
cd htmlawed

Run it through the htmLawed class, with a cleanup.php script like this (the require line is assumed; point it at wherever htmLawed.php lives):

<?php
require 'htmLawed.php';
$config = array('safe'=>1,'elements'=>'a,em,strong,p,ul,li,ol,h1,h2,h3,h4,h5,div,tr,td,table','deny_attribute'=>'* -title -href');
$out = htmLawed(file_get_contents('in.html'), $config);
echo $out;

Run the script: move the file to process into place, then execute the script to generate out.html

cp ~/1.1.2.en.tidy.html in.html
php cleanup.php > out.html

Run Tidy. Normalize the file “1.1.2.en.html”: clean up the tags, lowercase them, and so on

tidy --drop-font-tags 1 --logical-emphasis 1 --clean 1 --merge-spans 1 --show-body-only 1 --output-xhtml 1 --word-2000 1 --indent "auto" --char-encoding "utf8" --indent-spaces "2" --wrap "90" 1.1.2.en.html > 1.1.2.en.tidy.html



Order of execution

I tried running Tidy before htmLawed and realized that htmLawed's cleanup is quite drastic, while Tidy makes the code cleaner. Not to mention that htmLawed can generate empty tags that Tidy will then eliminate.


  1. Tidy options
  2. htmLawed documentation, a PHP HTML purification class
    1. original documentation
    2. example settings

What is cloud computing in the context of web applications

During the discussion, a contributor insisted on knowing what criteria and thresholds would justify some kind of push-button scaling.

Knowing his context, an unzipped CMS install with a bunch of plugins, I felt the urge to explain that a bigger server is not always what's needed. Here is an overview of what I mean when I talk about cloud computing and continuous integration.

The E-Mail

Let’s talk about cloud! 

I mean in the web application hosting realm. Not the storage (Google Drive, Dropbox) or software as a service (Salesforce, Basecamp).

Let's talk about a use case and my own experience first.

My former company, Evocatio Solutions technologiques, manages a pretty large site at the domain

The use-case on my recent experience

This is a complete business-management web application for a union representing French-speaking artists in North America (mostly residents of Canada). We built a complete web application covering many aspects an artist needs to represent themselves and be found. A big part of it is 140 tables' worth of artist-description details, from something as small as hair length through types of musical instruments to voice tones. It also manages renewals, communication with agencies, portfolios, management of contracts with managers, and more.

Not to forget the very heavy database queries we generate to search, for example: “An Asian woman with white hair playing ukulele who can pilot a helicopter AND ride a motorcycle …”

Yes. Database queries get very big, very quickly. Not only in the search engine I described, but through all the features.

That, in my opinion, is heavy. Especially considering that the artists' union has several thousand members.

This is off the top of my head, so please don't take these as exact numbers; I haven't checked the latest deployment needs. But on the server side, it only uses a single virtual machine with 4 GB of RAM, give or take.

That is my point about expanding hosting capacity without first optimizing what's around it.

What your web application has to consider then

Amazon and other cloud services are mostly about automated server deployment.

But the powerful offering of “scale your application” with computing units that scale automatically requires more than just nodes.

It requires the code (here again) to support:

  • multiple database hosts and types (Cassandra, Solr, MySQL), each specialized for the kind of data to store
  • replication of user-uploaded files
  • document/key-value stores (CouchDB, MongoDB)

All of it spannable across multiple hosts by a mere change of one configuration file.

The code itself should:

  • be deployable by a simple phing/ant/nant task
  • be hosted on a NAS mount, so you can spin up another machine and use it when the need for computing power arises

All this (for some parts) is what is called continuous integration (Wikipedia) and deployment strategies (also here, and in this blog post too). It's not just the continuity and automation that matter; the underlying deployment mechanism can also be provided by third parties, like Heroku and many others.

The different versions of the CRON scheduled-task service

Following my article “Comment automatiser une tâche avec CRON en utilisant Vim”, I started asking myself questions about the essential differences between the versions of CRON.

The concept of CRON is a scheduled “command launcher” for UNIX systems. The name is inspired by the Greek god Chronos.

Having used Gentoo Linux before, I had seen that it was possible to use more than one version of CRON, but I had never looked into the differences. I did so today.

Continue reading “Les différentes versions du service de tâches planifié CRON”

A Linux VM for PHP 5.3 development with Eclipse – part III

This post is the third in a series of articles describing how to build a development virtual machine (VM) for a team of developers.

This part covers the installation of Apache and PHP 5.3 (the latest version as of June 2009), which brings many advances. I think it has become the bare minimum because of these new features. See the IBM developerWorks articles “What's new in PHP 5.3” (part 1, part 2, part 3, and part 4).

Continue reading “Une VM Linux qui sert au développement PHP 5.3 avec Eclipse – partie III”

Accessibility and external links

There are several web accessibility standards that require things we don't necessarily take the time to do.

Whether it's lack of time, too many things to think about, or we simply don't think of it.

In this article I give my opinion on the importance (from a usability standpoint) of external-link icons. Later I will show a method to automate them. [EDIT 2009-08-23] I documented how in “Manipulation des liens extérieurs et les popup pour améliorer l'Accessibilité”.

Continue reading “Accessibilité et les liens externes”