FedStats 508/Accessibility Workshop

Regular Expression and Scripting Techniques for Tables

Laurie Brown
Webmaster
Office of Policy, SSA

When I arrived at the Office of Policy two years ago, I found that their practice was to post the majority of their documents just in PDF format. It was a quick and easy way to get information up on the Web, particularly since there were almost a half dozen different programs being used to produce the source files for those various documents. However, with Section 508 on the horizon, I knew that wasn't going to be sufficient anymore. We needed to start converting our documents to HTML.

I assessed the HTML capabilities of the different source programs being used to produce our documents, and was very concerned about what I found. Some programs had few if any customization options, leaving me stuck with the vendors' idea of HTML. Other programs did have rudimentary mapping capabilities from their own styles to either HTML tags or CSS classes (assuming, of course, that styles were actually being used properly during document preparation).

So, the first challenge I faced was not the actual 508 coding itself but trying to clean up the HTML to the point where I could even start to code it for accessibility.

Overview

Given that background, I'm going to talk a little more about the problems I found with the HTML files, then share with you the solution I developed that uses a combination of regular expressions and JScript in ColdFusion Studio. And, finally, I'll demo the solution using an Excel table as an example.

HTML File Problems

General problems with the HTML files included:

The other two problems with the HTML files were:

As I've mentioned, the solution I developed to deal with those problems combines the use of regular expressions with JScript in ColdFusion Studio.

Generally,

Before I get into the details of my solution, I'm going to first give you a very brief introduction to regular expressions, in case you aren't familiar with them.

Regular Expressions

How many of you have ever used an * or ? wildcard character in looking for files on your hard disk? Good, then you already have an idea of what regular expressions are all about.

They are about matching patterns when you aren't exactly sure what you're looking for or when you want to find multiple similar items. However, regular expressions are far more powerful and flexible than these two little wildcard characters let on. In addition to being able to find patterns, you can store those patterns for future use. Let me give you an example.

Regular Expressions - An Example
(1 of 3)

Before:

<tr>
<td>Number of beneficiaries</td>
<td>1,000,000</td>
<td>2,000,000</td>
</tr>

How many times have you converted a table to HTML only to find that all of the cells have converted as TDs--there is not a TH in sight? If you wanted to code your row stubs with THs, it would be easy enough to do a search-and-replace on the <tr><td> tag combination and replace that with a <tr><th> combination, but what about that closing TD tag? There's no easy way to identify it for replacement. Here's the first instance where regular expressions can come in very handy.

Regular Expressions - An Example
(2 of 3)

Search for:

<tr>
<td>(.+)</td>

Replace with:

<tr>
<th>$1</th>

Using regular expressions I can search for TR, TD, some combination of printable characters, and the closing TD. Let's look a little more closely at the regular expression between the TD tags. The period in that expression represents a single printable character. The plus sign, immediately following the period, indicates that I am searching for one or more of those printable characters. And the parentheses around the expression mean that I want to save the pattern that is found for future use.

In my replace statement, the $1 is the back-reference to the pattern that was found by the expression in the parentheses in the search criteria and saved. If the search criteria contain multiple patterns that are saved, they are simply referenced sequentially by number ($1, $2, and so on).

Regular Expressions - An Example
(3 of 3)

After:

<tr>
<th>Number of beneficiaries</th>
<td>1,000,000</td>
<td>2,000,000</td>
</tr>

And here is the result of my regular expression search-and-replace.
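For readers who want to try this outside ColdFusion Studio, here is a minimal sketch of the same search-and-replace in plain JavaScript. It assumes, as on the slides, that each cell sits on its own line directly after the TR; the variable names are illustrative only and are not taken from the original scripts.

```javascript
// The "Before" markup from the slide, one tag per line.
var before = [
  "<tr>",
  "<td>Number of beneficiaries</td>",
  "<td>1,000,000</td>",
  "<td>2,000,000</td>",
  "</tr>"
].join("\n");

// <tr>, a line break, then the first <td>...</td> pair.
// The parentheses save the cell text so the replacement can reuse it as $1.
var rowStub = /<tr>\r?\n<td>(.+)<\/td>/g;

var after = before.replace(rowStub, "<tr>\n<th>$1</th>");

console.log(after);
```

Because the period does not match a line break, the match cannot run past the end of the first cell's line, so only the row stub is converted.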

Table Example

The exact implementation of scripts such as these is heavily dependent on several factors:

Now let me start to show you, in more detail, the solution I developed.

As I am going through this example, rather than showing you a lot of code snippets, I'm going to emphasize the logic behind how the scripts were written. The exact implementation of scripts such as these is heavily dependent on several factors:

Let's start by taking a look at the original Excel 2000 table.

This is a simplified version of a table that appeared in the SSA Office of Policy's recent publication, Income of the Population 55 or Older. The original table was landscape format with three levels of column headers and fifteen columns of data. However, for the sake of actually being able to fit it on the screen today, I've simplified it down to just five columns of data--but it is certainly enough to give you an idea of the process.

There are two important things to note about how this table is put together. First, things that logically go together are contained in a single cell, and in some cases those cells are merged across multiple rows or columns. Second, blank columns are being used to create the indents on the row stubs. The indent feature of Excel didn't give us as much depth as we wanted for the print publication, and those indents do not convert to HTML in a way that I can then use them for the rest of this process.

Now here is that same table, saved from Excel as HTML. It has lots of extra coding in it that I really don't want. And, if I run the HTML validator on it, I end up with lots of warnings and errors.

So, let's start cleaning it up by getting rid of all the extra tags.

Eliminate Extra Tags

The first two things I am going to do are simple search-and-replaces. First, I want to get rid of all of the BR tags that were included in the document as a result of some of the formatting done for print purposes. I simply search for the BRs and replace them with spaces.

Next, I noticed that there were hard returns included in the middle of some of the tags in the HTML file. That will cause problems with some of the regular expression work later, so let's get rid of those. Again, a simple search-and-replace--looking for hard returns and replacing them with spaces.

And yes, I've just made my HTML file into one enormously long line, but in two steps I'll put some hard returns back in where I actually want them.
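These two steps can be sketched in plain JavaScript as follows. This is an illustration rather than the script used in the demo; the function name `flatten` is hypothetical, and the pattern assumes the BR tags carry no attributes.

```javascript
// Step 1: replace every BR tag with a space.
// Step 2: replace every hard return with a space, leaving one long line.
function flatten(html) {
  var noBreaks = html.replace(/<br\s*\/?>/gi, " ");
  return noBreaks.replace(/\r\n|\r|\n/g, " ");
}

// A tag split across two lines, plus a BR inserted for print formatting.
var sample = "<td class=x\nwidth=64>Total<br>beneficiaries</td>";
console.log(flatten(sample));
```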

First, however, let's get rid of all of the comments that Excel included in the file. And this will include getting rid of all the style codes because they are enclosed inside an HTML comment tag to hide them from older browsers.

In this instance, because I know what the beginning and end of a comment look like but am not sure what is in the middle, I am going to use a regular expression as part of the search criteria. I am going to search for <!.+?> and replace it with nothing. OK, so it looks a little cryptic at first glance, but let's look at the components of the regular expression again. The left bracket and exclamation point are simply the start of an HTML comment. Then we have our old friends period and plus that were used in the previous regular expression example. Again, they are standing in for a pattern of one or more printable characters. Notice, however, that they are not in parentheses this time. The reason is that I don't care what the pattern is that is actually matched because I am not going to be using it as part of the replacement expression. Finally, there is a question mark and right bracket. The right bracket, obviously, is the end of the HTML comment, but what about the question mark?

Well, regular expressions can be what are called greedy and nongreedy. Greedy regular expressions try to match as much of the search string as possible. In this case, since the right bracket is also a printable character, the expression would find all of the text up to the very last right bracket in the document, which in this case would be the right bracket on the closing HTML tag. Not what I want.

Putting the question mark before the right bracket makes this a nongreedy regular expression. Nongreedy expressions try to match as little as possible, so in this case the match would stop at the first occurrence of a right bracket, which happens to be the right bracket at the end of the HTML comment. Exactly what I am looking for.
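The difference is easy to see on a small one-line document. The sketch below uses plain JavaScript and a made-up sample string; it is meant only to show the two behaviors side by side.

```javascript
// A one-line document, as the file is at this point in the process.
var html = "<html><!-- a comment --><body><p>Text</p></body></html>";

// Greedy: .+ runs to the LAST right bracket in the document.
var greedy = html.replace(/<!.+>/, "");

// Nongreedy: .+? stops at the FIRST right bracket after the <! opener.
var nongreedy = html.replace(/<!.+?>/g, "");

console.log(greedy);     // only the opening HTML tag survives
console.log(nongreedy);  // only the comment is removed
```

One caution that follows from the same logic: because the nongreedy match stops at the first right bracket, a comment that itself contains a right bracket would be cut short, so it is worth checking the result on a real file.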

Finally, I'm going to run the CodeSweeper in Studio. Running the CodeSweeper is one of the methods exposed to the script as part of the ColdFusion Studio application object. The CodeSweeper allows you to set rules that apply to tags. For example, you can specify whether you want blank lines before or after the tag, how far you want the tag indented, whether you want the tag to be upper or lower case, how you want the tag's attributes quoted, and, most important, whether you want the tag in your document at all. ColdFusion Studio comes with several predefined CodeSweepers, but you can also create your own. Which is exactly what I did.

Customized CodeSweeper

I created a table CodeSweeper that keeps only the most basic HTML document tags, HTML table tags, and bold, italic, and superscript tags since those may be used in the general notes to the table or to superscript footnote references in the table. All the other tags are outta there! It is also in this step that hard returns are put back into the document where I actually want them.
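The CodeSweeper itself is configured inside Studio, so there is no code to show for it. As a rough stand-in for the "keep only these tags" part of the rule set, the JavaScript sketch below drops every tag whose name is not on an allow-list. The list and the function name are assumptions made for illustration, and the sketch does not reproduce the CodeSweeper's indenting, casing, or line-break rules.

```javascript
// Hypothetical allow-list: basic document tags, table tags, and b/i/sup.
var KEEP = {
  html: 1, head: 1, title: 1, body: 1,
  table: 1, thead: 1, tbody: 1, tr: 1, th: 1, td: 1,
  b: 1, i: 1, sup: 1
};

// Remove any opening or closing tag not on the list; keep the text between tags.
function keepBasicTags(html) {
  return html.replace(/<\/?([a-zA-Z][a-zA-Z0-9:]*)[^>]*>/g, function (tag, name) {
    return Object.prototype.hasOwnProperty.call(KEEP, name.toLowerCase()) ? tag : "";
  });
}

var cell = '<td><font face="Arial"><span class=x>12<sup>a</sup></span></font></td>';
console.log(keepBasicTags(cell));
```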

Let me pop over to Studio and run this portion of the script.

Eliminate Extra Attributes
(1 of 3)

Eliminate Extra Attributes
(2 of 3)

Eliminate Extra Attributes
(3 of 3)

Now I need to address all the extra attributes that are running around. Believe it or not, six regular expression search-and-replaces and all those extra attributes will also be gone. Some tags are going to be consistent in format across most documents, such as the HTML tag or the LINK tag for the CSS. Those are very straightforward search-and-replaces. Other tags, such as the table data tags, are a little more involved, since I may actually want to keep some of the attributes included in them, but they really aren't that hard to deal with using regular expressions.

The steps include:

Again, let me pop over to Studio and run this portion of the script. And what I am left with is just plain vanilla HTML.
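To give a feel for the table data case, here is a JavaScript sketch that strips every attribute from TD and TH tags except COLSPAN and ROWSPAN. Which attributes to keep is an assumption based on their use later in the process, and the function name is hypothetical; the six search-and-replaces from the demo are not reproduced here.

```javascript
// Keep only colspan/rowspan on table cells; drop height, class, style, and so on.
function stripCellAttributes(html) {
  return html.replace(/<(td|th)\b([^>]*)>/gi, function (tag, name, attrs) {
    // Attribute values may be double-quoted, single-quoted, or unquoted.
    var kept = attrs.match(/\b(?:colspan|rowspan)\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)/gi) || [];
    return "<" + name + (kept.length ? " " + kept.join(" ") : "") + ">";
  });
}

var messy = '<td colspan=3 height=17 class=xl24 style="height:12.75pt">Total</td>';
console.log(stripCellAttributes(messy));
```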

Add CSS Classes

Next, I'm going to add in my style sheet class attributes.

I'm going to start by getting the COLSPAN and ROWSPAN attributes from the stub heading (which is the column heading over all of the row stubs). Those attributes give me two very important pieces of information:

That information, combined with the patterns of blank cells, COLSPANs, and ROWSPANs in my table, allows me to identify and properly class the different elements in my table, such as:

The patterns that I am using to identify each of those elements exist in my table for two reasons:

So using a series of regular expression search-and-replaces, I am able to identify and add class attributes to my table.
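As one hypothetical example of such a pattern, the JavaScript sketch below counts the blank cells at the start of a row, treats that count as the indent level, and turns the first non-blank cell into a TH with a matching class. The class names (`stub0`, `stub1`, and so on), the function name, and the exact pattern are all assumptions made for illustration; the real patterns depend on how the table was built.

```javascript
// Leading blank cells mark the indent; the next cell is the row stub.
function classRowStubs(html) {
  var pattern = /<tr>((?:\s*<td>(?:&nbsp;)?<\/td>)*)\s*<td[^>]*>(.+?)<\/td>/g;
  return html.replace(pattern, function (match, blanks, text) {
    var level = (blanks.match(/<td>/g) || []).length;
    return '<tr>\n<th class="stub' + level + '">' + text + "</th>";
  });
}

var row = "<tr>\n<td></td>\n<td colspan=2>Men</td>\n<td>10</td>\n</tr>";
console.log(classRowStubs(row));
```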

Again, popping into Studio I'll run that portion of the script.

Now that I have a completed table, let me run the HTML validator again. This time, there are no errors or warnings.

After all that, I am now finally ready to code the table for 508 by adding in ID and HEADERS attributes.

For lack of a better ID system, I simply use "c" for column and "r" for row, along with numbers that I assign sequentially as I move through the code from top to bottom.

In addition to adding the ID attributes to each of the tags, I'll also be adding and erasing them from a series of arrays that will be used to create the headers for each cell.

Add 508 IDs and HEADERS
(1 of 3)

Add 508 IDs and HEADERS
(2 of 3)

Initial arrays:

[c1, c2, c3, c4, c4, c4]

[c1, c2, c3, c5, c6, c7]

Final array:

[c1, c2, c3, c4 c5, c4 c6, c4 c7]

Let me begin with the column headers.

I know how many rows contain column headings, and I also know the overall width, in columns, of the table. Using those values, I am going to create a series of arrays that are essentially a matrix. Then I'm going to walk through the column headings line-by-line, assigning ID attributes and adding them to the appropriate elements of the arrays. Initially, the arrays will have some duplicate elements since I need placeholders for the cells that span multiple rows. However, as a last step I will collapse these arrays and remove the duplicate elements.
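The matrix logic can be sketched in JavaScript as shown below. The header layout in the example (three cells spanning two rows, then one cell spanning three columns over three single cells) is an assumption chosen because it yields the arrays shown on the slide; the function names are hypothetical.

```javascript
// Lay the header cells out on a grid, assigning c1, c2, ... in source order.
function buildHeaderMatrix(headerRows, width) {
  var matrix = [];
  var r, c;
  for (r = 0; r < headerRows.length; r++) {
    matrix.push([]);
    for (c = 0; c < width; c++) matrix[r].push(null);
  }
  var nextId = 1;
  for (r = 0; r < headerRows.length; r++) {
    var col = 0;
    for (var i = 0; i < headerRows[r].length; i++) {
      var cell = headerRows[r][i];
      // Skip positions already claimed by a cell spanning down from above.
      while (col < width && matrix[r][col] !== null) col++;
      var id = "c" + nextId++;
      for (var dr = 0; dr < (cell.rowspan || 1); dr++) {
        for (var dc = 0; dc < (cell.colspan || 1); dc++) {
          matrix[r + dr][col + dc] = id; // placeholder in every spanned position
        }
      }
      col += cell.colspan || 1;
    }
  }
  return matrix;
}

// Collapse the rows column by column, dropping duplicate placeholders.
function collapseHeaderRows(matrix) {
  var result = [];
  for (var col = 0; col < matrix[0].length; col++) {
    var ids = [];
    for (var r = 0; r < matrix.length; r++) {
      if (ids.indexOf(matrix[r][col]) === -1) ids.push(matrix[r][col]);
    }
    result.push(ids.join(" "));
  }
  return result;
}

var headerRows = [
  [{ rowspan: 2 }, { rowspan: 2 }, { rowspan: 2 }, { colspan: 3 }],
  [{}, {}, {}]
];
var matrix = buildHeaderMatrix(headerRows, 6);
var columnHeaders = collapseHeaderRows(matrix);
console.log(columnHeaders.join(", "));
```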

Add 508 IDs and HEADERS
(3 of 3)

Now I'm going to walk through the body of the table line by line. If the element is a heading, I'm going to add it to an array based on its CSS class. When I do this, it is important that I erase the remainder of the array each time. If the element is table data, then I'm going to insert the appropriate HEADERS attribute--using the current elements in my row ID array as the first part of the attribute value and then the appropriate elements from the column heading array I created in the previous step as the second part of the attribute value.
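A simplified JavaScript sketch of that walk follows. It assumes each body row has already been reduced to an indent level (taken from its CSS class), a stub label, and its data values; the input shape, the class names, and the function name are all hypothetical.

```javascript
// rows: [{ level: 0, stub: "Men", data: [] }, ...]
// dataColumnHeaders: one collapsed column-heading string per data column.
function addBodyHeaders(rows, dataColumnHeaders) {
  var rowIds = [];  // current row heading ID at each indent level
  var nextRow = 1;
  var lines = [];
  for (var i = 0; i < rows.length; i++) {
    var row = rows[i];
    var id = "r" + nextRow++;
    // Erase this level and everything deeper, then record the new heading.
    rowIds.length = Math.min(row.level, rowIds.length);
    rowIds.push(id);
    var html = '<tr><th id="' + id + '" class="stub' + row.level + '">' + row.stub + "</th>";
    for (var c = 0; c < row.data.length; c++) {
      // Row IDs first, then the column heading IDs for this data column.
      var headers = rowIds.concat(dataColumnHeaders[c]).join(" ");
      html += '<td headers="' + headers + '">' + row.data[c] + "</td>";
    }
    lines.push(html + "</tr>");
  }
  return lines;
}

var bodyRows = [
  { level: 0, stub: "Men", data: [] },
  { level: 1, stub: "55-61", data: ["10", "20"] },
  { level: 0, stub: "Women", data: [] },
  { level: 1, stub: "55-61", data: ["30", "40"] }
];
var coded = addBodyHeaders(bodyRows, ["c4 c5", "c4 c6"]);
console.log(coded.join("\n"));
```

Erasing the deeper levels is what keeps the second "55-61" row from inheriting "Men" as a heading: when "Women" arrives at level 0, the array is cut back before the new ID is added.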

And that's all there is to it.

Finishing Touches

I can then run one final regular expression search-and-replace to insert the document title into all of my tables, including the table number. And I can run another customized CodeSweeper to put in line breaks and indents where I want, to make my HTML code easier to read.

So, in conclusion, by using the patterns that are present in a table--either as a result of the table structure itself or the HTML conversion process--I can use regular expressions to greatly ease my workload. I can run regular expression search-and-replaces to get rid of unwanted tags and attributes from the tables and to add in CSS classes. I can then convert the visual hierarchy created by the CSS classes to a logical hierarchy that can be coded into my tags as IDs and HEADERS for 508 compliance.
