Converting a dynamic site into static HTML documents

Twice now, I've been asked to take a website that was running on a CMS and make it static.

This is a useful practice if you want to keep a site's content for posterity without having to maintain the underlying CMS. It also makes migrations easier, since a site you know you won't add content to anymore becomes simply a bunch of HTML files in a folder.

My end goal was to make an EXACT copy of the site as the CMS generated it, but stored as simple HTML files. When I say EXACT, I mean it, down to keeping every document at its original location in the new static files. Each HTML document keeps its links unchanged, yet a matching file exists and the web server finds it. For example, if a link points to /foo, the link in the page remains as-is even though the content now lives in a static file at /foo.html; the web server serves /foo.html anyway.
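
To make that last part work, the web server needs a small rewrite rule. A minimal sketch for Apache, assuming mod_rewrite is enabled (nginx's try_files can do the same):

    # If the requested path is not an existing file or directory,
    # but a .html twin exists, serve that instead (the URL stays /foo)
    RewriteEngine On
    RewriteCond %{REQUEST_FILENAME} !-f
    RewriteCond %{REQUEST_FILENAME} !-d
    RewriteCond %{REQUEST_FILENAME}.html -f
    RewriteRule ^(.*)$ $1.html [L]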

Here are the steps I took to achieve just that. Note that your mileage may vary; these steps worked for me. I've done this once for a WordPress blog and once for the [email protected] website that was running on ExpressionEngine.

Steps

1. Browse and collect all the pages you think could be missed in scraping

We want a simple file with one web page per line, each with its full address.
This will help the crawler not to miss pages.

  • Use your web browser's developer tools Network inspector, and keep it open with "Preserve log" enabled.
  • Once you've browsed the site a bit, list all documents in the Network inspector and export them using the "Save as HAR" feature.
  • Extract the URLs from the HAR file using underscore-cli (a one-liner alternative with jq is sketched after this list):

    npm install underscore-cli
    cat site.har | underscore select '.entries .request .url' > workfile.txt

  • Remove the first and last lines (it's a JSON array and we want one document per line)

  • Remove the hostname from each line (i.e. each line should start with /path); in vim you can do %s/http:\/\/www\.example\.org//g
  • Remove the leading " and trailing ", from each line; in vim you can do %s/^"//g and %s/",$//g
  • On the last line, make sure the trailing " is removed too, since there is no comma there for the previous regex to match
  • Remove duplicate lines; in vim you can do :sort u
  • Save this file as list.txt for the next step.
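
Alternatively, the extraction and cleanup above can be done in one pass. A sketch, assuming jq is installed and the same example hostname:

    jq -r '.log.entries[].request.url' site.har | sed 's|^http://www\.example\.org||' | sort -u > list.txt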

2. Let’s scrape it all

We'll do two scrapes. The first one grabs all the assets it can; then we'll go again with different options.

The following are the commands I ran on my last successful attempt to replicate the site I was working on.
This is not a claim that this method is the most efficient one.
Please feel free to improve the document as you see fit.

First, a quick TL;DR of the wget options:

  • -m is the same as --mirror
  • -k is the same as --convert-links
  • -K is the same as --backup-converted, which creates .orig files
  • -p is the same as --page-requisites, which makes wget fetch ALL the resources a page needs
  • -nc ensures we don't download the same file twice and end up with duplicates (e.g. file.html AND file.1.html)
  • --cut-dirs would prevent creating directories and mix things up; do not use it.

Notice that we're sending headers as if we were a web browser. That part is up to you.

wget -i list.txt -nc --random-wait --mirror -e robots=off --no-cache -k -E --page-requisites \
     --user-agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.94 Safari/537.36' \
     --header='Accept-Language: fr-FR,fr;q=0.8,fr-CA;q=0.6,en-US;q=0.4,en;q=0.2' \
     --header='Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8' \
     http://www.example.org/

Then, another pass:

wget -i list.txt --mirror -e robots=off -k -K -E --no-cache --no-parent \
     --user-agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.94 Safari/537.36' \
     --header='Accept-Language: fr-FR,fr;q=0.8,fr-CA;q=0.6,en-US;q=0.4,en;q=0.2' \
     --header='Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8' \
     http://www.example.org/
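
Once both passes are done, a quick sanity check helps spot leftovers; any file still containing an absolute link to the old hostname may need another look:

    grep -rl 'http://www\.example\.org' . --include='*.html'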

3. Do some cleanup on the fetched files

Here are a few commands I ran to clean the files a bit:

  • Remove carriage returns (Windows line endings) from every .orig file. They're the ones we'll use in the end, after all:

    find . -type f -regextype posix-egrep -regex '.*\.orig$' -exec sed -i 's/\r//' {} \;
    
  • Rename the .orig files to .html:

    find . -name '*.orig' | sed -e "p;s/\.orig$/.html/" | xargs -n2 mv
    find . -type f -name '*\.html\.html' | sed -e "p;s/\.html//" | xargs -n2 mv
    
  • Many folders might contain only an index.html file. Let's turn each of those into a plain file without the directory:

    find . -type f -name 'index.html' | sed -e "p;s/\/index\.html/.html/" | xargs -n2 mv
    
  • Remove files that have a .1 (or similar number) in their name; they are most likely duplicates anyway:

    find . -type f -name '*\.1\.*' -exec rm -rf {} \;
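
To eyeball the result, any static file server will do. A sketch, assuming Python 3 is available (note that a plain file server will not map /foo to /foo.html; that is what the rewrite rule shown earlier is for):

    # Serve the current folder at http://localhost:8000/
    python3 -m http.server 8000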
    

A procedure for an easy-to-set-up local development environment with Apache

I don't know about you, but I can no longer program without having the server environment locally on my machine. Changing or adding a VirtualHost file for every new project is quite repetitive. Surely there is an automatic way to do it?

Yes.

It's called VirtualDocumentRoot.

This tutorial has been sitting in my personal wiki for ages, and it's only now that I'm starting to migrate my projects to NGINX that I've decided to put it online. It's never too late to publish.

This configuration method addresses exactly that need: not having to configure an Apache virtual host for every project.

With this procedure, the only thing you'll have to maintain is your hosts file; the rest follows on its own.

You can apply this technique with any version of the Apache HTTP server. It can even be done if you develop on Windows or Mac OS, with Apache HTTP server distributions such as MAMP, XAMPP, and EasyPHP.

With a local web server, this type of configuration has been possible for a long time; you simply have to know what it's called: VirtualDocumentRoot.

Here is how I have been configuring my LAMP environment for a while now.

Procedure

Establishing the convention

It all starts with a convention. Once it's in place, everything should follow automatically.

The idea is to be able to access the workspace of project A for client B on my local machine. The local address is no longer localhost, but something more explicit.

What I appreciate most about this method is that it keeps everything specific to a given project and client in a single parent folder. Having the code to execute live in a subfolder only makes sense.

For example, a project called projectname for the client client would be filed in a folder under the path /home/renoirb/workspace/client/projectname.

The web project's code would then be served at http://projectname.client.dev/, which points to the local workstation's IP address.
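
The only manual step left for each new project is one line in the hosts file. A sketch, using the hostname convention above:

    # /etc/hosts: point the per-project hostname at the local machine
    127.0.0.1   projectname.client.dev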

The project workspace

IMPORTANT
Folder names must be lowercase, with no spaces or accented characters, otherwise Apache may not find the folder. This is mainly because the address typed into the browser is converted to lowercase, and any self-respecting operating system distinguishes between, for example, 'Allo' and 'allo'.

The suggested convention goes as follows:

  • each project is filed under a predictable path, similar to /home/renoirb/workspace/client/projectname
  • the project has a web/ folder
  • the other folders at the same level as web/ can be anything else.

Ideally, the application logic shouldn't be publicly visible anyway; only the main file calls the "autoloader", which lives outside the DocumentRoot.
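
A sketch of what such a workspace might look like (all folder names besides web/ are hypothetical):

    /home/renoirb/workspace/client/projectname/
        web/        <- DocumentRoot, the only publicly reachable folder
        src/        <- application logic, outside the DocumentRoot
        vendor/     <- dependencies and the autoloader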

This way, you can file all the projects for a given client together, separated by project.

The procedure also takes into account that:
  • the current user can write to their workspace/ folder, with Apache2 running as that user thanks to mpm-itk
  • the domain name used determines which of the user's folders to look in

Steps

  • Make sure the required module is loaded:
     sudo a2enmod vhost_alias
  • Edit Apache's ports configuration:
     sudo vi /etc/apache2/ports.conf
  • Check that it contains this:
    NameVirtualHost *:80 
    Listen 80 
    UseCanonicalName Off
  • Edit the default VirtualHost configuration file, where the magic happens:
    sudo vi /etc/apache2/sites-available/default
  • Check that this block is present inside <VirtualHost ...>:
    <IfModule mpm_itk_module>
        AssignUserId renoirb users
    </IfModule>
  • Replace the DocumentRoot directive with this format (%1 and %2 are the first and second dot-separated parts of the requested hostname, i.e. projectname and client, matching the convention above):
    VirtualDocumentRoot /home/renoirb/workspace/%2/%1/web
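
Putting it together, the resulting default vhost might look like this minimal sketch (assuming the convention above):

    <VirtualHost *:80>
        ServerAlias *.dev
        UseCanonicalName Off
        <IfModule mpm_itk_module>
            AssignUserId renoirb users
        </IfModule>
        # For projectname.client.dev: %2 = client, %1 = projectname
        VirtualDocumentRoot /home/renoirb/workspace/%2/%1/web
    </VirtualHost>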


Who else is using this feature flipping thing on their web applications?

I am currently reading up and collecting ideas on how to present and propose an implementation of the following practices in my projects.

I want to use:

  • Continuous integration
  • Automated builds
  • Feature flipping

And make all of this quick and easy for anybody in the team.

Feature flipping

This is fairly new to me, but I like the idea. The concept is that each component of the code declares which features it contributes to.

This way, we can hide a feature from the users entirely until it is ready.
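
A minimal sketch of the concept, assuming a hypothetical features.conf file with one feature=on|off entry per line:

    #!/bin/sh
    # Succeeds only when the named feature is flipped on in features.conf
    feature_enabled() {
        grep -q "^$1=on$" features.conf
    }

    if feature_enabled new_checkout; then
        echo "render the new checkout flow"
    else
        echo "render the old checkout flow"
    fi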

Source control branching

I am currently researching and preparing ways of working that I want to introduce to my client for a few parts of our project.

The idea is to stop managing a complex branching scheme and trim it down to the essential.

A trend was to use GitFlow; then the project grows, developers do not always have the time to manage everything, and things get out of hand.

It may then look something like this:

Quoting a slide from Zach Holman about branching

It doesn't seem bad at first, but even though Git makes branching easy, it can bring overhead if you want to adapt quickly.

At least, that's what Flickr, GitHub, Twitter, and Facebook (so I heard) do.

I’ll keep you posted on what I find on the idea soon-ish.


What cloud computing means when it relates to web applications

During the discussion, one contributor insisted on knowing what criteria and thresholds would justify some kind of push-button scaling.

Knowing his context, an unzip-to-install CMS with a bunch of plugins, I felt the urge to explain that there is not always a need for bigger server capacity. Here is an overview of what I mean when I talk about cloud computing and continuous integration.

The E-Mail

Let’s talk about cloud! 

I mean in the web application hosting realm, not storage (Google Drive, Dropbox) or software as a service (Salesforce, Basecamp).

Let's first talk about a use case from my own experience.

My former company, Evocatio Solutions technologiques, manages a pretty large site at the domain uda.ca.

The use case from my recent experience

This is a complete business management web application for a union that represents French-speaking artists in North America (mostly residents of Canada). We built a complete web application that covers many of the aspects an artist needs to represent themselves and be found. A big part of it is some 140 tables' worth of artist description details, from hair length and types of musical instruments played to voice tones. It also manages renewals, communication with agencies, portfolios, contracts with managers, and more.

Not to forget the very heavy database queries we generate to search, for example: <example>An Asian woman with white hair playing ukulele who can pilot a helicopter AND ride a motorcycle …</example>

Yes. Database queries get very big, very quickly. Not only in the search engine I just described, but throughout all the features.

That, in my opinion, is heavy. Especially considering that this artists' union has several thousand members.

This information is off the top of my head, so please do not treat these as exact numbers; I have not looked at the latest deployment needs. But on the server side, it runs on a single virtual machine with 4 GB of RAM, give or take.

That is my point about expanding hosting without first optimizing what is already there.

What your web application has to take into account, then

Amazon and the other cloud services are mostly about automated server deployment.

But the powerful offering of "scale your application" with computing cubes that scale automatically requires more than just nodes.

It requires the code (here again) to support:

  • multiple database hosts and types (Cassandra, Solr, MySQL), each specialized for the kind of data to store
  • replication of user-uploaded files
  • document/key-value stores (CouchDB, MongoDB)

All of it spannable across multiple hosts by a mere change to one configuration file.

The code itself should:

  • Be deployable by a simple phing/ant/nant task
  • Be hosted on a NAS mount, so that you can spin up another machine and use it when the need for computing power arises

All of this (in parts) is what is called continuous integration (Wikipedia) and deployment strategies (also here, and this blog post too). It's not just the continuity and automation that matter; the underlying deployment mechanism can also be provided by third parties, like Heroku and many others.

Some steps to look at if you feel your web application is slow

It all started with a discussion thread on a mailing list, from a guy who had developed a shopping cart and payment gateway using a CMS.

My first reflex was to give some pointers on things that can hog a site, before even thinking about scaling solutions.

That was all before the conversation moved on to the cloud, which I answered later in that blog post.

The thread started as follows:

> (…) I have a Magento modified into a e-commerce site (…) that
> me and my client feels slow, my client has asked about Amazon hosting. They
> do everything else, CDN, the works, shouldn’t their hosting
> be superior?
>
> What would be worth for a test drive, I’d say, if you think
> your site’s performance issues can be addressed by
> throwing CPU, memory, storage, etc (…)

My answer

I doubt that you need bigger hosting for an e-commerce site.

Not as the "first thing to improve performance" action point, anyway.

Unless your site has to handle (really) HEAVY traffic, non-stop.

It is most likely something somewhere down the execution path of the web application that needs to be looked at.

Performance slowing factors

Slow execution times commonly boil down to one or many of the following:

  1. Network latency
  2. Process communication problems (connections, zombie processes, etc.)
  3. Application architecture
  4. Hardware/Software performance
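
Before digging into any of these, a quick measurement can show which layer eats the time. A sketch with curl (the URL is a placeholder):

    # Break down where a single page load spends its time
    curl -s -o /dev/null \
         -w 'dns: %{time_namelookup}s  connect: %{time_connect}s  first byte: %{time_starttransfer}s  total: %{time_total}s\n' \
         http://www.example.org/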

Now, about application architecture: this one can be a real can of worms!

Some Application architecture bottlenecks

I seriously doubt the order here matters much, but these are the ones that pop into my mind first.

  1. Web service, database queries, or file access across the network; packet loss could also be a cause
  2. Database query processing that could benefit from some well-picked indexes (see the sketch after this list)
  3. Heavy queries with frequent reads and writes, or some sleep() hidden here and there waiting for another result set
  4. No HTTP/view caching
  5. No code caching or precompiled code at all (config, or whatever else can be precompiled and served as plain arrays of frequently used, precalculated data)
  6. No memcached/key-value store service
  7. No read-only data store
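
For item 2, the first step is usually finding which queries are slow. A sketch for MySQL, assuming a user with sufficient privileges (long_query_time is in seconds):

    # Log every query slower than one second, then review the slow query log
    mysql -u root -p -e "SET GLOBAL slow_query_log = 'ON'; SET GLOBAL long_query_time = 1;"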

An analogy

So. In my opinion, if you are using an "unpack to install" web-based software such as WordPress, and then add plugins without testing whether they all work well together…

…you are likely trying to turn a cheetah kitten into a humanoid android.

As in, you can bolt on a lot of metal patches; that doesn't mean it will have a full AI system and be self-sustaining.

That is to illustrate what it is like; you should look at alternatives, or at least at something closer to a droid :)

Seriously, though.

My professional recommendation would be to work through each item on the "application architecture bottlenecks" list above before "going cloud".