Using AtomPub to Export from Wordpress

Mar 31, 2008

Ever since someone gave me an overview of RESTful web development in the same week that someone else gave me an overview of the AtomPub protocol, I've been hooked on the idea.

I've tinkered with implementations of both a client and a server on my own ever since. Part of the reason was that there weren't very many tools that supported AtomPub, which makes that kind of development a pain.

That's because you're trying to do both ends of a client-server implementation without having either side ready to work. It's always much easier to work on one end of such a system when the other end is already in place.

There have been tools for testing AtomPub servers, some early servers, etc. out there, but most required quite a bit of yak shaving before you could work on the other side.

Fortunately, that's starting to change. While I still am pursuing my own implementations, I now have adequate implementations to work with on both sides: Windows Live Writer on the client side and Wordpress on the server.

One of the things I'm aiming for in working with this whole chain of tools is a central repository of content that I create: notes, bookmarks, articles, documents, images, etc. all in one place. From there, the content can be pushed out to the various sites I want it on.

Anyway, one of the things that I wanted to do as part of this was to get a copy of all of the content from this site as individual Atom documents. This would give me a large test set of posts that reflect my own real usage.

So, I wrote a bare minimum export to get all ~900 posts. One of the secondary reasons I wanted to use this is that this site's installation of Wordpress is chronically messed up.

A real implementation should query the service for the list of posts instead of just looping through every possible post ID. For whatever reason, with this site's setup, that didn't work, hence the for(i) loop. But that means some 404 errors in the middle from deleted posts.
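To make the looping approach concrete, here is a minimal sketch of that kind of export. It is not the code from the full entry. The `wp-app.php/post/{id}` URL layout, the use of HTTP Basic authentication, the `.atom` file naming, and every name in the block (`AtomExportSketch`, `EntryUrl`, `Export`) are assumptions made for illustration.

```csharp
using System;
using System.IO;
using System.Net;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;

public static class AtomExportSketch
{
    // Assumed URL layout for a single entry; adjust to whatever your server exposes.
    public static string EntryUrl(string baseUrl, int id)
    {
        return baseUrl.TrimEnd('/') + "/wp-app.php/post/" + id;
    }

    // Walks post IDs 1..maxId and saves each entry that exists as its own file.
    public static void Export(string baseUrl, string user, string password,
                              int maxId, string outputDir)
    {
        Directory.CreateDirectory(outputDir);
        using (var client = new HttpClient())
        {
            // Assumption: the server accepts HTTP Basic authentication.
            string token = Convert.ToBase64String(
                Encoding.UTF8.GetBytes(user + ":" + password));
            client.DefaultRequestHeaders.Authorization =
                new AuthenticationHeaderValue("Basic", token);

            for (int i = 1; i <= maxId; i++)
            {
                using (HttpResponseMessage response =
                    client.GetAsync(EntryUrl(baseUrl, i)).GetAwaiter().GetResult())
                {
                    if (response.StatusCode == HttpStatusCode.NotFound)
                    {
                        continue; // a deleted post: skip it and keep going
                    }
                    response.EnsureSuccessStatusCode();
                    string xml = response.Content.ReadAsStringAsync()
                        .GetAwaiter().GetResult();
                    File.WriteAllText(Path.Combine(outputDir, i + ".atom"), xml);
                }
            }
        }
    }
}
```

Treating a 404 as "skip and continue" is what lets a blind loop over IDs survive the gaps left by deleted posts.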

Regardless, in just a few lines of C# code, I had a nice directory containing all of my posts. The code follows.

Read the rest of this entry »

The Midnight Problem

Feb 18, 2008

Somewhere, at this very moment, in an IT department or team of programmers, someone is planning a scheduled task. That task should run in "off" hours, when no one is around. The immediate suggestion that pops up is MIDNIGHT. There is no debate, no questioning of it and the task is scheduled for midnight.

Unfortunately, in many environments, this scenario is repeated a couple of times a month. I say unfortunate because the original reason for running these kinds of jobs overnight in the first place is that batch jobs that consume computing resources shouldn't interfere with regular daytime work activities. However, when *everything* gets scheduled at midnight (and in many shops, that actually ends up happening), all you've done is shift the bottleneck to the middle of the night.

I've worked in environments where the bunched-up batch jobs, all scheduled at midnight, thrashed the hard drive and CPU all night and were often still running at 9 or 10am the next day.

When we changed the scheduling to stagger them out a bit, from 10pm through 6am, the total burden on the server dropped dramatically and we actually dropped the overall time spent running these jobs just by being more careful about scheduling.
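As a rough illustration of staggering, the sketch below spreads a number of jobs evenly across an overnight window. The even spacing and the names (`StaggerSketch`, `StaggeredStarts`) are assumptions for the example. A real schedule would also weigh how long each job runs and what else is on the server.

```csharp
using System;
using System.Collections.Generic;

public static class StaggerSketch
{
    // Returns evenly spaced start times (as time of day) for jobCount jobs,
    // beginning at windowStart and spread across windowLength.
    public static List<TimeSpan> StaggeredStarts(
        TimeSpan windowStart, TimeSpan windowLength, int jobCount)
    {
        var starts = new List<TimeSpan>();
        if (jobCount <= 0)
        {
            return starts;
        }

        long step = windowLength.Ticks / jobCount;
        for (int i = 0; i < jobCount; i++)
        {
            // Wrap past midnight so the result is always a valid time of day.
            long ticks = (windowStart.Ticks + step * i) % TimeSpan.TicksPerDay;
            starts.Add(TimeSpan.FromTicks(ticks));
        }
        return starts;
    }
}
```

For four jobs in a window that opens at 10pm and lasts eight hours, `StaggeredStarts(TimeSpan.FromHours(22), TimeSpan.FromHours(8), 4)` gives 22:00, 00:00, 02:00 and 04:00.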

So, if you happen to be in that meeting today and someone says "midnight", please, just check to see what else is running at midnight and consider a different time slot.

The Power of the Proof of Concept

Feb 17, 2008

If you were to go through my Visual Studio "Projects" directory or my personal development web server, you'd find about 2/3 of the directories are named SomethingPOC or SomethingExperiment. The former has become my convention over the last year or so and stands for Proof of Concept.

I spend a good portion of each day with one or more of these projects open. That's because I consider the use of the Proof of Concept to be integral to software development. Whether the official methodology of the project encourages them, is indifferent or actively discourages them (and I've worked in all of those environments), I will insist on using them.

First, a quick explanation of what exactly I mean by a POC. Basically, it's the simplest possible program that will answer a question that you have about the tasks in front of you.

Say, for instance, that you wanted to retrieve an RSS feed and store the individual entries into a SQL Server database. I'd probably do a quick POC to connect to the database and insert a record. Then I'd do one for fetching an RSS feed. I might also do one that checks a feed for new items vs. ones seen before.

In other words, each POC tests out one concept and proves that you can do what was a question mark in your project approach. If you end up with more than one method in a POC, you're probably doing too much and it should be broken down into more than one POC.
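To give a sense of scale, the "new items vs. ones seen before" POC could be as small as the sketch below. It is only an illustration: the class and method names are made up, it assumes an RSS 2.0 feed with un-namespaced item and link elements, and it uses each item's link as its identity.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class NewItemsPOC
{
    // One method, one question: which items in this feed have we not seen yet?
    public static List<string> NewItemLinks(string rssXml, ISet<string> seenLinks)
    {
        return XDocument.Parse(rssXml)
            .Descendants("item")
            .Select(item => (string)item.Element("link"))
            .Where(link => link != null && !seenLinks.Contains(link))
            .ToList();
    }
}
```

There is no error handling, no database and no HTTP here. Those questions each get their own POC.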

I deliberately name the projects and classes with names that can NOT work in the final project. This helps to hedge against the inclination to do a quick copy and paste into your real project. This is important because POCs should be quick and loose. They don't have error handling, don't do validation (unless that's what you're proving) and generally don't follow many of the rules of good software development. That's a good thing.

That freedom means you can quickly explore the problem and work through some possible solutions in a "sandbox" without worrying about whether you're doing it "right". However, it's also a good thing that you throw the POC away or only use it as a reference.

If you do lots of POCs, name them to encourage disposable coding and then move on to do your "real" development, you'll find that you have often left those crappy early mistakes in the POC and have already run into and overcome many of the typical problems you run into in new solutions.

Once you're into your "real" development, lots of people abandon POCs. However, I keep using them throughout the project (even into the bugfixing and testing phases). Every time I'm asking myself whether an idea will work, I start a POC rather than trying the experimental code in the permanent code.

Over time, this ends up being a constant cycle. You ask yourself a question that can only be answered with code, do a POC to come up with an answer and then move back to the full project to implement it. If you aren't used to this kind of cycle, I'd recommend giving it a shot. I won't work without it.

Using HTMLTidy to Clean Up HTML with C#

Feb 14, 2008

For a while now, I've had a project on the back burner for a different set of tools for RSS reading, writing and publishing. I'd like a single toolchain that lets me keep everything together in one place. I've got piles of notes, a few proof of concept projects and the start of several of the components.

Last night, when I couldn't sleep, I decided to check something off of the list that I wanted to see as a proof of concept for the Atom Publishing Client part of the toolchain: using HTML Tidy to clean up HTML into XHTML before putting it into an Atom entry document.

I currently do most of my writing for this site in Windows Live Writer. However, that's more of a compromise than an ideal choice. While I could probably hack a plugin together that would make Live Writer a more suitable long-term choice, I really want a very specific set of features that includes getting away from the XML-RPC API that all of the server-side engines Live Writer works with are based on.

So, I've been tinkering with a multi-tabbed Windows app for editing posts. The WYSIWYG tab for quick editing uses the MSHTML engine from Internet Explorer. I've looked around and unless you're willing to pony up $299 for a commercial control, that's the most reasonable choice.

However, the HTML that MSHTML spits out is horrible and really needs to be cleaned up. So, I set out to figure out how to use HTMLTidy in a C# project.

I tried to find a .NET wrapper for HTMLTidy and thought I had scored right away when I found one here. However, when I tried to use it, not even the sample code would build without errors on my development machine.

So, I dropped back to trying the COM object version. The last update to it was back in 2000 or so, but it looked like all of the features I needed were in that version, so I decided to give it a shot.

To use the TidyCOM library, you add it as a reference and insert your "using TidyCOM" statement in your class. The actual usage is fairly straightforward.

Example:

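// Create the Tidy COM object and configure how it cleans up the markup.
TidyObject TidyObj = new TidyObject();
TidyObj.Options.Doctype = "strict";       // ask for a strict doctype
TidyObj.Options.DropFontTags = true;      // discard <font> and <center> tags
TidyObj.Options.OutputXhtml = true;       // write XHTML rather than HTML
TidyObj.Options.Indent = TidyCOM.IndentScheme.AutoIndent;
TidyObj.Options.TabSize = 2;
// Clean the markup entirely in memory: string in, string out.
String CleanHTML = TidyObj.TidyMemToMem(HTML);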

That code assumes that the "HTML" variable has your messy HTML in it and at the end, "CleanHTML" has your cleaned up XHTML in it.

My little multi-tabbed prototype is using a buffer object to keep the "current" HTML in it. Whenever you switch tabs, the old content is scrubbed through this code before the new tab gets updated out of the buffer. That means that whether it's the WYSIWYG tab that messes it up or you in the HTML editor, you still get valid XHTML in the eventual output.

I also extended my CleanupHTML method (that contains the above code) to scrub out the HTML header tags, body tags, etc. The HTML will end up as one part of the Atom XML file and not as a standalone HTML file, so I only want the content from the editor. Both MSHTML and HTML Tidy will always put that wrapper back in unless you strip it out.
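One way to do that stripping is to keep only what sits between the body tags. This is a sketch under assumptions, not the actual CleanupHTML method: the regular expression approach and the names (`BodyExtractSketch`, `ExtractBody`) are mine, and it relies on the input being a single well-formed document, which Tidy output should be.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BodyExtractSketch
{
    // Captures everything between <body ...> and </body>, across line breaks.
    private static readonly Regex BodyPattern = new Regex(
        @"<body[^>]*>(.*?)</body\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the body content, or the trimmed input if no body element is found.
    public static string ExtractBody(string document)
    {
        Match match = BodyPattern.Match(document);
        return match.Success ? match.Groups[1].Value.Trim() : document.Trim();
    }
}
```

Falling back to the original text when no body element is present means a fragment that was never wrapped passes through untouched.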

While I'd still like an assembly that's a little more current, this clearly does the job well enough to check this feature off of my checklist. Now, on to RESTful services on IIS with C#.

C# DataSets and the Magic of ReadXML

Jan 31, 2008

I've worked on several applications where we used .NET DataSets as the container for passing records between web services and other components. They work pretty well to keep things nice and loosely coupled when you're building lots of separate components that may or may not all be using the same language, etc.

One of the greatest things that they included in the DataSet classes is the ability to read and write them to XML files. That gives you not only an interchange format, but a file-based version of it pretty quickly. You can easily use those files as your "gold standard" for building all of the components at once. As long as each component emits and consumes that sample file, things are golden.

Anyway, one of the side benefits of that ability to read/write those XML files is that it handles more than just the DataSets you create via code. The ReadXml() method will convert nearly any XML file into a DataSet. That can come in really handy when your entire application is already passing DataSets around.

That's because nearly any application of reasonable size pulls in information from somewhere outside of the control of your code. In many of those cases, that data will be in XML format. You can, therefore, use ReadXml() to get DataTable access to all kinds of useful XML stuff.

When it gets read in, .NET does some pretty cool automatic stuff, like creating identifier columns on your tables, etc. However, if, unlike the "normal" DataSets, your imported XML data is nested 2-3 or more levels deep, it can be kind of hard to predict exactly what the DataTable structure will look like.

I'm not a huge fan of automatic or "magic" methods, because you usually have absolutely no way to see inside the black box. That's not the case here because, while the method does some pretty cool magic, it is still possible to see inside of what it does.

I decided that I needed something to deal with the black box today and after dinner tonight, I wrote a quick console app to take an arbitrary XML file and dump out all of the tables, columns and rows in the DataSet in a way that makes it more clear how you'll need to use the tables to grab the data you're after.

That's the information you're going to need to establish your DataRelation objects to tie things together. It's been fairly illuminating for the few files I've sent through it so far and I'm thinking this will be a permanent part of my utility folder.

I run it in PowerShell and pipe the output to a file with "Out-File", giving me a record of that schema (which I find much easier to read than the output of the WriteXmlSchema() method).
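For readers who want a feel for what such a dump involves, here is a rough sketch of the idea. It is not the linked utility: the output layout and the names (`DataSetDumpSketch`, `Describe`) are assumptions. It lists every table, column and row that ReadXml() inferred, plus the relations it set up between nested tables.

```csharp
using System;
using System.Data;
using System.Text;

public static class DataSetDumpSketch
{
    // Builds a plain-text description of everything in the DataSet.
    public static string Describe(DataSet ds)
    {
        var sb = new StringBuilder();
        foreach (DataTable table in ds.Tables)
        {
            sb.AppendLine("Table: " + table.TableName
                + " (" + table.Rows.Count + " rows)");
            foreach (DataColumn column in table.Columns)
            {
                sb.AppendLine("  Column: " + column.ColumnName
                    + " [" + column.DataType.Name + "]");
            }
            foreach (DataRow row in table.Rows)
            {
                sb.AppendLine("  Row: " + string.Join(" | ", row.ItemArray));
            }
        }
        // Relations show how the generated id columns tie nested tables together.
        foreach (DataRelation relation in ds.Relations)
        {
            sb.AppendLine("Relation: " + relation.ParentTable.TableName
                + " -> " + relation.ChildTable.TableName);
        }
        return sb.ToString();
    }

    public static void Main(string[] args)
    {
        // Usage: pass the path of an XML file as the first argument.
        var ds = new DataSet();
        ds.ReadXml(args[0]);
        Console.Write(Describe(ds));
    }
}
```

Because it writes to standard output, the result can be redirected or piped to a file for later reference.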

In case you'd like to use it as well, here's a copy of the code.


J Wynia

For better or worse, I'm the guy who runs things here. I'm a web consultant, software developer, writer and geek from Minneapolis, MN. This site is a fairly wide cross-section of the things I'm interested in and enjoy writing about.

© 2003-2008 J Wynia. All original content is licensed under the terms of the Creative Commons Attribution license unless otherwise noted. Content from other sources is licensed under its original terms.