Using HTMLTidy to Clean Up HTML with C#

Originally published: 02/2008 by J Wynia

For a while now, I've had a project on the back burner for a different set of tools for RSS reading, writing and publishing. I'd like a single toolchain that lets me keep everything together in one place. I've got piles of notes, a few proof-of-concept projects and the start of several of the components.

Last night, when I couldn't sleep, I decided to check something off the list that I wanted to see as a proof of concept for the Atom Publishing Client part of the toolchain: using HTML Tidy to clean up HTML into XHTML before putting it into an Atom entry document.

I currently do most of my writing for this site in Windows Live Writer. However, that's more of a compromise than an ideal choice. While I could probably hack a plugin together that would make Live Writer a more suitable long-term choice, I really want a very specific set of features that includes getting away from the XML-RPC API that all of the server-side engines Live Writer works with are based on.

So, I've been tinkering with a multi-tabbed Windows app for editing posts. The WYSIWYG tab for quick editing uses the MSHTML engine from Internet Explorer. I've looked around and unless you're willing to pony up $299 for a commercial control, that's the most reasonable choice.

However, the HTML that MSHTML spits out is horrible and really needs to be cleaned up. So, I set out to figure out how to use HTMLTidy in a C# project.

I tried to find a .NET wrapper for HTMLTidy and thought I had scored right away when I found one. However, when I tried to use it, not even the sample code would build without errors on my development machine.

So, I dropped back to trying the COM object version. The last update to it was back in 2000 or so, but it looked like all of the features I needed were in that version, so I decided to give it a shot.

To use the TidyCOM library, you add it to your project as a COM reference and put a "using TidyCOM;" directive at the top of your class file. The actual usage is fairly straightforward.

Example:

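// Configure Tidy to emit strict XHTML and drop <font> tags.
TidyObject TidyObj = new TidyObject();
TidyObj.Options.Doctype = "strict";
TidyObj.Options.DropFontTags = true;
TidyObj.Options.OutputXhtml = true;
// Auto-indent the output; tab stops are 2 columns wide.
TidyObj.Options.Indent = TidyCOM.IndentScheme.AutoIndent;
TidyObj.Options.TabSize = 2;
// HTML holds the messy input; the cleaned XHTML comes back as a string.
string CleanHTML = TidyObj.TidyMemToMem(HTML);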

That code assumes that the "HTML" variable has your messy HTML in it and at the end, "CleanHTML" has your cleaned up XHTML in it.

My little multi-tabbed prototype uses a buffer object to hold the "current" HTML. Whenever you switch tabs, the old tab's content is scrubbed through this code before the new tab gets updated out of the buffer. That means that whether it's the WYSIWYG tab that messes it up or you in the HTML editor, you still get valid XHTML in the eventual output.
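
A minimal sketch of that buffer idea might look like the following. The HtmlBuffer class, its Commit method and the cleanup delegate are hypothetical names chosen for illustration, not the prototype's actual code. The cleanup step is passed in as a delegate so the sketch does not depend on TidyCOM being installed.

```csharp
using System;

// Hypothetical buffer that holds the "current" HTML shared by all editor tabs.
public class HtmlBuffer
{
    // The cleanup step (for example, a method wrapping the Tidy call).
    private readonly Func<string, string> _cleanup;

    public HtmlBuffer(Func<string, string> cleanup)
    {
        if (cleanup == null) throw new ArgumentNullException("cleanup");
        _cleanup = cleanup;
        Current = string.Empty;
    }

    // The most recently cleaned markup; tabs load their content from here.
    public string Current { get; private set; }

    // Call when leaving a tab: the outgoing content is scrubbed
    // before it replaces the buffer's contents.
    public void Commit(string htmlFromTab)
    {
        Current = _cleanup(htmlFromTab ?? string.Empty);
    }
}
```

On a tab switch, the outgoing tab would call Commit with its markup, and the incoming tab would then read Current, so every tab only ever sees cleaned content.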

I also extended my CleanupHTML method (which contains the above code) to scrub out the HTML header tags, body tags, etc. The HTML will end up as one part of the Atom XML file, not as a standalone HTML file, so I only want the content from the editor. Both MSHTML and HTML Tidy will always put that wrapper markup back in unless you strip it out.
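
One way to do that stripping, sketched here as an assumption rather than the exact code in CleanupHTML, is to pull out whatever sits between the body tags. The HtmlFragment class and ExtractBody method are hypothetical names. A regular expression is a reasonable fit only because the markup has already been through Tidy and is predictable; for arbitrary HTML a real parser would be safer.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper: keeps only the markup inside <body>...</body>.
public static class HtmlFragment
{
    // Singleline lets "." match newlines; IgnoreCase covers <BODY> too.
    private static readonly Regex BodyPattern = new Regex(
        @"<body\b[^>]*>(?<content>.*?)</body>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static string ExtractBody(string document)
    {
        if (string.IsNullOrEmpty(document))
        {
            return string.Empty;
        }

        Match match = BodyPattern.Match(document);

        // If there is no body element, assume the input is already a fragment.
        return match.Success
            ? match.Groups["content"].Value.Trim()
            : document.Trim();
    }
}
```

Calling HtmlFragment.ExtractBody on the Tidy output would then leave just the fragment to drop into the Atom entry.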

While I'd still like an assembly that's a little more current, this clearly does the job well enough to check this feature off of my checklist. Now, on to RESTful services on IIS with C#.

Comments

Korayem on 2/28/2008
When I was searching for some C# code to cleanup html (possibly with regex) I came across this awesome post. HTMLTidy is a defacto standard nowadays, so using it in a COM is superior to regex cleaning. This will be used in a ASP.NET webapp btw
Simon Dingley on 9/11/2008
Great post - thanks very much for this as you have saved me a lot of work!
© 2003- 2014 J Wynia. Very Few Rights Reserved. This article is licensed under the terms of the Creative Commons Attribution License. Quoted content or content included from others is not subject to that license and defaults to normal copyright.