Using AtomPub to Export from Wordpress

Mar 31, 2008

Ever since someone gave me an overview of RESTful web development in the same week that someone else gave me an overview of the Atompub protocol, I've been hooked on the idea.

I've tinkered around with starting implementations of both a client and a server on my own ever since. Part of that activity was because there weren't very many tools that supported Atompub. That actually makes doing that kind of development a pain.

That's because you're trying to do both ends of a client-server implementation without having either side ready to work. It's always much easier to work on one end of such a system when the other end is already in place.

There have been tools for testing Atompub servers, some early servers, etc. out there, but most required quite a bit of yak shaving before you could work on the other side.

Fortunately, that's starting to change. While I still am pursuing my own implementations, I now have adequate implementations to work with on both sides: Windows Live Writer on the client side and Wordpress on the server.

One of the things I'm aiming for in working with this whole chain of tools is a central repository of content that I create: notes, bookmarks, articles, documents, images, etc. all in one place. From there, the content can be pushed out to the various sites I want it on.

Anyway, one of the things that I wanted to do as part of this was to get a copy of all of the content from this site as individual Atom documents. This would give me a large test set of posts that reflect my own real usage.

So, I wrote a bare minimum export to get all ~900 posts. One of the secondary reasons I wanted to use this is that this site's installation of Wordpress is chronically messed up.

A real implementation should actually query the service for the list of posts instead of just looping through the post IDs. For whatever reason, that didn't work with this site's setup, hence the for(i) loop. But, that means some 404 errors in the middle from deleted posts.

Regardless, in just a few lines of C# code, I had a nice directory containing all of my posts. The code follows.
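The original export code isn't reproduced here, so what follows is only a rough sketch of the approach described above: loop over post IDs, fetch each Atom entry, skip the 404s from deleted posts and save the rest to a directory. The URL template, the ".atom" file extension and the class and method names are all assumptions made for illustration. The HttpClient passed in is assumed to already carry whatever credentials the server wants.

```csharp
using System;
using System.IO;
using System.Net;
using System.Net.Http;

public static class AtomExportSketch
{
    // Fetches one Atom entry document. Returns null when the server
    // answers 404, which is what a deleted post looks like in the loop.
    // entryUrlTemplate is hypothetical, e.g. "https://example.com/wp-app.php/post/{0}".
    public static string FetchEntry(HttpClient http, string entryUrlTemplate, int id)
    {
        string url = string.Format(entryUrlTemplate, id);
        using (HttpResponseMessage response = http.GetAsync(url).GetAwaiter().GetResult())
        {
            if (response.StatusCode == HttpStatusCode.NotFound)
            {
                return null;
            }
            response.EnsureSuccessStatusCode();
            return response.Content.ReadAsStringAsync().GetAwaiter().GetResult();
        }
    }

    // Walks ids 1..maxId, writes every entry that exists to
    // outputDir/<id>.atom and returns how many files were saved.
    public static int ExportAll(Func<int, string> fetch, int maxId, string outputDir)
    {
        Directory.CreateDirectory(outputDir);
        int saved = 0;
        for (int i = 1; i <= maxId; i++)
        {
            string entryXml = fetch(i);
            if (entryXml == null)
            {
                continue; // deleted post, nothing to save
            }
            File.WriteAllText(Path.Combine(outputDir, i + ".atom"), entryXml);
            saved++;
        }
        return saved;
    }
}
```

Usage would be something along the lines of AtomExportSketch.ExportAll(id => AtomExportSketch.FetchEntry(http, template, id), 1000, "posts"). Passing the fetch step in as a delegate keeps the loop easy to try out without a live server.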


Using HTMLTidy to Clean Up HTML with C#

Feb 14, 2008

For a while now, I've had a project on the back burner for a different set of tools for RSS reading, writing and publishing. I'd like a single toolchain that lets me keep everything together in one place. I've got piles of notes, a few proof of concept projects and the start of several of the components.

Last night, when I couldn't sleep, I decided to check something off of the list that I wanted to see as a proof of concept for the Atom Publishing Client part of the toolchain: using HTML Tidy to clean up HTML into XHTML before putting it into an Atom entry document.

I currently do most of my writing for this site in Windows Live Writer. However, that's more of a compromise than an ideal choice. While I could probably hack a plugin together that would make Live Writer a more suitable long-term choice, I really want a very specific set of features that includes getting away from the XML-RPC API that all of the server-side engines Live Writer works with are based on.

So, I've been tinkering with a multi-tabbed Windows app for editing posts. The WYSIWYG tab for quick editing uses the MSHTML engine from Internet Explorer. I've looked around and unless you're willing to pony up $299 for a commercial control, that's the most reasonable choice.

However, the HTML that MSHTML spits out is horrible and really needs to be cleaned up. So, I set out to figure out how to use HTMLTidy in a C# project.

I tried to find a .NET wrapper for HTMLTidy and thought I had scored right away when I found one here. However, when I tried to use it, not even the sample code would build without errors on my development machine.

So, I dropped back to trying the COM object version. The last update to it was back in 2000 or so, but it looked like all of the features I needed were in that version, so I decided to give it a shot.

To use the TidyCOM library, you add it as a reference and insert your "using TidyCOM" statement in your class. The actual usage is fairly straightforward.

Example:

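// Needs a reference to the TidyCOM library and "using TidyCOM;" in the class file.
TidyObject TidyObj = new TidyObject();
TidyObj.Options.Doctype = "strict";                        // emit a strict DOCTYPE
TidyObj.Options.DropFontTags = true;                       // discard <font> tags
TidyObj.Options.OutputXhtml = true;                        // write XHTML rather than HTML
TidyObj.Options.Indent = TidyCOM.IndentScheme.AutoIndent;  // let Tidy decide what to indent
TidyObj.Options.TabSize = 2;
String CleanHTML = TidyObj.TidyMemToMem(HTML);             // messy HTML in, cleaned XHTML out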

That code assumes that the "HTML" variable has your messy HTML in it and at the end, "CleanHTML" has your cleaned up XHTML in it.

My little multi-tabbed prototype is using a buffer object to keep the "current" HTML in it. Whenever you switch tabs, the old content is scrubbed through this code before the new tab gets updated out of the buffer. That means that whether it's the WYSIWYG tab that messes it up or you in the HTML editor, you still get valid XHTML in the eventual output.

I also extended my CleanupHTML method (that contains the above code) to scrub out the HTML header tags, body tags, etc. Since the HTML will actually end up as one part of the Atom xml file and not as a standalone HTML file, I only want the content from the editor and both MSHTML and HTML Tidy will always put that stuff back in unless you strip it out.
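As a minimal sketch of that stripping step (not the actual CleanupHTML method, which isn't shown here), something like the following would pull out only what sits between the body tags. It uses plain string searching and assumes the Tidy output contains at most one body element. The class and method names are made up for the example.

```csharp
using System;

public static class BodyExtractorSketch
{
    // Returns only the markup between <body ...> and </body>.
    // If no complete body element is found, the input is returned trimmed.
    public static string ExtractBody(string xhtml)
    {
        int open = xhtml.IndexOf("<body", StringComparison.OrdinalIgnoreCase);
        if (open < 0)
        {
            return xhtml.Trim();
        }

        // End of the opening tag, which may carry attributes.
        int contentStart = xhtml.IndexOf('>', open);
        int close = xhtml.LastIndexOf("</body>", StringComparison.OrdinalIgnoreCase);
        if (contentStart < 0 || close < contentStart)
        {
            return xhtml.Trim();
        }

        return xhtml.Substring(contentStart + 1, close - contentStart - 1).Trim();
    }
}
```

Run on the cleaned XHTML right after the Tidy call, this leaves just the fragment that belongs inside the Atom entry's content element.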

While I'd still like an assembly that's a little more current, this clearly does the job well enough to check this feature off of my checklist. Now, on to RESTful services on IIS with C#.

C# DataSets and the Magic of ReadXML

Jan 31, 2008

I've worked on several applications where we used .NET DataSets as the container for passing records between web services and other components. They work pretty well to keep things nice and loosely coupled when you're building lots of separate components that may or may not all be using the same language, etc.

One of the greatest things that they included in the DataSet classes is the ability to read and write them to XML files. That gives you not only an interchange format, but a file-based version of it pretty quickly. You can easily use those files as your "gold standard" for building all of the components at once. As long as each component emits and consumes that sample file, things are golden.
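To make the "gold standard" file idea concrete, here is a small sketch of the round trip: build a DataSet, write it out as XML with its schema inline, then read it back. The "Blog" and "Post" names and the sample rows are invented for the example. Writing the schema along with the data is one choice among several, made here so that column types survive the trip.

```csharp
using System;
using System.Data;
using System.IO;

public static class GoldFileSketch
{
    // Builds a tiny sample DataSet (hypothetical table and column names).
    public static DataSet BuildSample()
    {
        var ds = new DataSet("Blog");
        DataTable posts = ds.Tables.Add("Post");
        posts.Columns.Add("Id", typeof(int));
        posts.Columns.Add("Title", typeof(string));
        posts.Rows.Add(1, "First post");
        posts.Rows.Add(2, "Second post");
        return ds;
    }

    // Serializes the DataSet to XML, including an inline schema.
    public static string ToXml(DataSet ds)
    {
        using (var writer = new StringWriter())
        {
            ds.WriteXml(writer, XmlWriteMode.WriteSchema);
            return writer.ToString();
        }
    }

    // Rebuilds a DataSet from XML produced by ToXml.
    public static DataSet FromXml(string xml)
    {
        var ds = new DataSet();
        using (var reader = new StringReader(xml))
        {
            ds.ReadXml(reader);
        }
        return ds;
    }
}
```

Saving the string from ToXml to disk gives you the sample file that each component can be built and checked against.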

Anyway, one of the side benefits of that ability to read/write those XML files is that it doesn't only handle the DataSets you create via code. The ReadXml() method actually will convert nearly any XML file into a DataSet. That can come in really handy when your entire application is already passing DataSets around.

That's because nearly any application of reasonable size pulls in information from somewhere outside of the control of your code. In many of those cases, that data will be in XML format. You can, therefore, use ReadXml() to get DataTable access to all kinds of useful XML stuff.

When it gets read in, .NET does some pretty cool automatic stuff, like creating identifier columns on your tables, etc. However, if, unlike the "normal" DataSets, your imported XML data is nested 2-3 or more levels deep, it can be kind of hard to predict exactly what the DataTable structure will look like.

I'm not a huge fan of automatic or "magic" methods, because you usually have absolutely no way to see inside the black box. That's not the case here because, while the method does some pretty cool magic, it is still possible to see inside of what it does.

I decided that I needed something to deal with the black box today and after dinner tonight, I wrote a quick console app to take an arbitrary XML file and dump out all of the tables, columns and rows in the DataSet in a way that makes it more clear how you'll need to use the tables to grab the data you're after.

That's the information you're going to need to establish your DataRelation objects to tie things together. It's been fairly illuminating for the few files I've sent through it so far and I'm thinking this will be a permanent part of my utility folder.

I run it from PowerShell and pipe the output to a file with "Out-File", giving me a record of that schema (which I find much easier to read than the output of the WriteXmlSchema() method).

In case you'd like to use it as well, here's a copy of the code.
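The original listing isn't included here, so the following is a stand-in sketch of the same idea rather than the actual utility: read an XML document into a DataSet with ReadXml() and print every table, column, row and relation that the inference created. The class name and the output format are my own choices for the example.

```csharp
using System;
using System.Data;
using System.IO;

public static class DataSetDumper
{
    // Lets ReadXml() infer the tables and relations from arbitrary XML.
    public static DataSet Load(TextReader xml)
    {
        var ds = new DataSet();
        ds.ReadXml(xml);
        return ds;
    }

    // Writes out tables, columns, rows and relations so the
    // inferred structure is visible at a glance.
    public static void Dump(DataSet ds, TextWriter output)
    {
        foreach (DataTable table in ds.Tables)
        {
            output.WriteLine("Table: {0} ({1} rows)", table.TableName, table.Rows.Count);
            foreach (DataColumn column in table.Columns)
            {
                output.WriteLine("  Column: {0} ({1})", column.ColumnName, column.DataType.Name);
            }
            foreach (DataRow row in table.Rows)
            {
                output.WriteLine("  Row: {0}", string.Join(" | ", row.ItemArray));
            }
        }
        foreach (DataRelation relation in ds.Relations)
        {
            output.WriteLine("Relation: {0} ({1} -> {2})",
                relation.RelationName,
                relation.ParentTable.TableName,
                relation.ChildTable.TableName);
        }
    }
}
```

In a console app, the Main method would boil down to opening the file named on the command line and calling DataSetDumper.Dump(DataSetDumper.Load(reader), Console.Out). The generated identifier columns (such as book_Id for a nested book element) show up in the column list, which is what you need to see before wiring up DataRelation lookups.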

HTML as Page Layout Language

Dec 28, 2007

Off and on over the last 6-8 months, I've been working on a project that needs PDF as its final output format. The plan has been to use DocBook and the toolchain attached to it. However, that's been more frustrating than it first looked when it comes to integrating into the whole system I'm designing.

Then, earlier today, someone posted a link to this YouTube video, which demos the functionality of the Prince engine. That revealed a system for really nice page layout using HTML and CSS (with CSS3 handling the page breaks and other stuff like it was designed to, making Prince the only implementation of CSS3 out there that works as far as I know).

Given how my project is web-based, being able to just keep it all HTML from end to end and still get really nice PDFs out the other end would be a huge benefit. And, given how this project will be commercial and how much time I've already spent trying to do all of the conversions back and forth, even the steep price tag for a server license will likely be a net bargain.

Fortunately, the version that puts a little logo in the top, right corner of the PDF (only for display, not printing) is free for development/personal use. So, I messed around with that a bit tonight and got a feel for it. There are versions for pretty much all of the platforms (Windows, Mac, Linux, BSD, etc.) and integration with code for automatic generation is fairly easy.

Really basic conversion using C# only took 3 lines of code. I just grabbed the normal Windows version, also downloaded the DLL and added that DLL to a basic console app.

Then, these 3 lines work to dump out a PDF of the page in question. I just threw together a quick HTML document to test with a few H1, paragraphs, etc.

IPrince pr = new Prince(@"C:\Program Files\Prince\Engine\bin\prince.exe");   // path to the Prince executable
pr.AddStyleSheet(@"C:\Program Files\Prince\Engine\style\xhtml.css");         // stylesheet to apply
pr.Convert("demo.html", "demo.pdf");                                         // HTML in, PDF out

Pretty easy startup as far as I'm concerned. The video is worth watching, despite being somewhat irritating to watch. Like many presentations to a room full of geeks, there's quite a bit of not seeing the forest for the trees. Lots of people shooting it down with "this would be REALLY great if it supported my one pet feature" kind of stuff. They got a bit hung up on those little nit-picking details and I wonder how much of the presentation ended up left out as a result.

Based on what I've seen so far, I definitely think it's worth tinkering with a bit more and doing the math on that license fee as part of my project budget.

Software Development and Alchemy

Dec 17, 2007

Photo: Stian Martinsen

In several recent conversations with other software developers (yep, those are just as exciting as your wildest dreams) about their frustrations with the process, as implemented in modern corporate America, the same analogy kept popping into my head.

More and more, I feel like the things that businesses are after in their software development are similar to medieval alchemy. For 2500 years, the entire field that eventually became chemistry was obsessed with 3 basic questions:

  1. How can we change lead (or other metals) into gold?
  2. How can we create an elixir that will cure all diseases and prolong life indefinitely?
  3. Can we discover a universal solvent?

All of these strike us as goals that weren't even attainable. Yet, the underlying desires often did get met when the focus shifted to what eventually became modern chemistry. By dropping the focus on the single, universal solution and just figuring out how to treat individual diseases or how to dissolve individual compounds or just fundamentally understand chemistry, many advances did happen.

Many/most of the diseases that the alchemists sought to cure or treat are under control today. There's very little in the world of chemistry that we can't tear apart and we can do things like convert coal or corn into one of the most sought after substances on earth: liquid fuel for transportation.

One of the consulting firms I worked with had a project manager who was constantly pushing the developers to find and use "automagical" tools to build our solutions. What he was after was the kind of IDE or tool that, with a few clicks, would just spit out a nearly complete solution.

That would, of course, result in the sales force being able to sell expensive solutions that could be fulfilled in minutes instead of days and weeks. It didn't matter how often I pointed out that, as a consulting company, if our clients' solutions were so simple that a few clicks and config options could solve them, they wouldn't bother coming to us: they'd just buy the software themselves.

This same person wasn't very excited about things like loosely-coupled systems and/or Service Oriented Architecture unless they also came with wizards that let you choose 4 or 5 options and they'd just spit out a fully-realized application. Yet, those approaches keep working for me as a way of looking for patterns in companies' problems and solving them quickly and completely.

Instead of looking for the tool that spits out C#, PHP, ColdFusion and Ruby, I'm looking for repeating problems like managing queues of objects to be processed. Once you have an approach to that general problem, a good developer can probably implement it in whatever language they're most comfortable with.

That's due, in large part, to the fact that the bulk of the work as a software developer is NOT in typing in the text of the programming language in question. Douglas Crockford said in one of his Yahoo video lectures something along the lines of: a developer could probably type up all of their code for an entire year in a day or 2.

Yet, many of these automagical tools really only seem to automate the stuff related to typing code, not for solving problems. And, like I said a couple of days ago, if you're in the consulting game or just looking to stay employed as a developer, the money and jobs are where the problems are.

That's why, when I hear someone looking for that quick and easy tool that will "just" take care of it this afternoon, I tend to interpret it as, "Can't we just change this lead into gold instead of getting real gold?"


J Wynia

For better or worse, I'm the guy who runs things here. I'm a web consultant, software developer, writer and geek from Minneapolis, MN. This site is a fairly wide cross-section of the things I'm interested in and enjoy writing about.

© 2003-2008 J Wynia. All original content is licensed under the terms of the Creative Commons Attribution license unless otherwise noted. Content from other sources is licensed under its original terms.