YA Perl Advent Calendar 2005-12-14

With the Boston.PerlMongers tech meeting last night, today's column is by William 'N1VUX' Ricker.

Standards are a good thing. Daily updates are a good thing. But how do you keep a website that's hacked together daily, like an Advent calendar, "standards compliant"? You can generate everything from templates (as that other unofficial Perl Advent Calendar does), or you can test, as we here at YAPAC do. There are public W3C HTML validators, but TIMTOWTDI, and of course there is a Perl solution: HTML::Lint, which comes with a script wrapper, weblint.

$ weblint http://use.perl.org 
http://use.perl.org (1386:6) <IMG> tag has no HEIGHT and WIDTH attributes.
$ weblint http://web.mit.edu/belg4mit/www/5/
http://web.mit.edu/belg4mit/www/5/ (56:71) </p> with no opening <p>
http://web.mit.edu/belg4mit/www/5/ (270:1) <pre> at (143:1) is never closed
$ weblint http://web.mit.edu/belg4mit/www/8/
http://web.mit.edu/belg4mit/www/8/ (7:62) </head> with no opening <head>
http://web.mit.edu/belg4mit/www/8/ (121:1) <head> tag is required
$ weblint http://web.mit.edu/belg4mit/www/12/
http://web.mit.edu/belg4mit/www/12/ (237:1) <a> at (232:1) is never closed
http://web.mit.edu/belg4mit/www/12/ (237:1) <a> at (232:64) is never closed

Oops, I guess we don't test quite as often as perhaps we might.

Editor> That output isn't too self-documenting though.

N1VUX> They say the teat is the only intuitive user interface

Editor> Okay... for instance, I was reading the lines about 12/ and looking at line 237 to correct them, but there was nothing there. Similarly, colons to separate coordinates (in y,x even) strike me as odd. Is it a vi thing?

N1VUX> IDEs expect it this way for compile errors that you can click on to jump to. It's an EMACS thing that was adopted by VIM, IIRC. I don't think old-original vi cared. Most editors display line:col or line,col, not x,y. Colons are standard in compiler errors etc., as in "$ARGV:$.: $_" and filename:lineno:column:message. Since the column number is optional, it's the last prefix field. Also, /root/dir/file.ext:line:col follows the hierarchical order of granularity: most folks think of their source files in terms of lines of columns, not columns of characters.

line:col isn't really even (y,x), it's more like

   my ($line, $col)=(-$y, $x);
since it's in the 4th quadrant -- the origin is at the upper left. (Much of computer graphics is pschizoid between 1st and 4th quadrants. And ASCII art always has the -y,x transform.)

Editor> Okay, so the last set of numbers on a line is the line and column of the offending bit. The first occurrence is where HTML::Lint stopped looking for a matching tag.
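
That reading rule can be sketched in a few lines of core Perl. This is only an illustration: the helper name parse_lint_line and its hash keys are made up for this example, and the regex assumes lines shaped like the weblint output shown above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::Lint or weblint): split one
# weblint-style line into where it was reported and what was reported.
sub parse_lint_line {
    my ($text) = @_;

    # "<file-or-URL> (<line>:<col>) <message>"
    my ($where, $line, $col, $message) =
        $text =~ /^(\S+)\s+\((\d+):(\d+)\)\s+(.*)$/
        or return;

    my %rec = (where => $where, line => $line, col => $col, message => $message);

    # A second "(line:col)" inside the message, if present, points at the
    # tag that actually needs fixing.
    if ($message =~ /\((\d+):(\d+)\)/) {
        @rec{qw(tag_line tag_col)} = ($1, $2);
    }
    return \%rec;
}

my $rec = parse_lint_line(
    'http://web.mit.edu/belg4mit/www/12/ (237:1) <a> at (232:64) is never closed'
);
print "fix the tag at line $rec->{tag_line}, column $rec->{tag_col}\n";
# prints: fix the tag at line 232, column 64
```

Messages without a second pair of numbers simply leave tag_line and tag_col unset, so the first pair is then the place to look.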

Andy Lester's HTML::Lint module has more features than we can demonstrate here.

  1. It can parse files by name.
  2. It can parse HTML you give it piecemeal.
  3. It can be used in an Apache module to give warnings on pages you're serving -- on your test server I hope!
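
Feeding HTML in piecemeal might look roughly like the sketch below. The helper name lint_piecemeal and the sample markup are invented for illustration; the newfile, parse, eof and errors calls are the ones described in the HTML::Lint documentation, but check them against your installed version.

```perl
use strict;
use warnings;
use HTML::Lint;

# Hypothetical helper: lint an HTML string one line at a time.
sub lint_piecemeal {
    my ($name, $html) = @_;

    my $lint = HTML::Lint->new;
    $lint->newfile($name);                        # label used in error messages
    $lint->parse("$_\n") for split /\n/, $html;   # hand over one line per call
    $lint->eof;                                   # tell the parser the input is done

    return $lint->errors;                         # HTML::Lint::Error objects in list context
}

# Made-up sample page with a stray closing tag.
my $html = <<'END_HTML';
<html>
<head><title>demo</title></head>
<body>
stray close</p>
</body>
</html>
END_HTML

print $_->as_string, "\n" for lint_piecemeal('sample.html', $html);
```

For a file already on disk, parse_file with the file name does the same job in one call.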

Now for a recursive demo script, which will read this page and check it before I upload it. (This script is a lame imitation of weblint, so don't use it for anything real. Enjoy!)


   #!/usr/bin/env perl
   use strict;
   use warnings;
   use Carp;
   use HTML::Lint;

   $\ = "\n";    # newline after every print; stands in for the -l switch

   my ($total_count, $n_files) = (0, 0);

   foreach my $file_name (@ARGV) {
       open my $input, '<', $file_name
           or croak "Cannot open $file_name for reading: $!";
       my $data = do { local $/; <$input> };    # slurp!
       close $input;
       $n_files++;

       my $lint = HTML::Lint->new;
       # $lint->only_types( HTML::Lint::Error::STRUCTURE );
       $lint->newfile( $file_name );    # name shown in error messages
       $lint->parse( $data );
       $lint->eof;                      # finish the document
       # $lint->parse_file( $file_name );

       my $error_count = $lint->errors;    # scalar context: number of errors
       $total_count += $error_count;
       carp "Uh oh, $error_count errors found in $file_name."
           if $error_count;

       foreach my $error ( $lint->errors ) {
           print $error->as_string;
       }
   }

   print "$total_count errors found in $n_files processed";

0 errors found in 1 processed