The Midnight Problem

Feb 18, 2008

Somewhere, at this very moment, in an IT department or team of programmers, someone is planning a scheduled task. That task should run in "off" hours, when no one is around. The immediate suggestion that pops up is MIDNIGHT. There is no debate, no questioning of it and the task is scheduled for midnight.

Unfortunately, in many environments, this scenario is repeated a couple of times a month. I say unfortunate because the original reason for running these kinds of jobs overnight in the first place is that batch jobs that consume computing resources shouldn't interfere with regular daytime work activities. However, when *everything* gets scheduled at midnight (and in many shops, that actually ends up happening), all you've done is shift the bottleneck to the middle of the night.

I've worked in environments where the bunched-up batch jobs, all scheduled at midnight, thrashed the hard drive and CPU all night and were often still running at 9 or 10am the next day.

When we changed the scheduling to stagger them from 10pm through 6am, the total burden on the server dropped dramatically, and the overall time spent running these jobs went down as well, just from being more careful about scheduling.
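
As a rough illustration of the staggering idea, here is a minimal C# sketch that spreads a list of jobs evenly across the 10pm to 6am window. The StaggerPOC class and the job names are made up for the example, and the even spacing is an assumption: a real schedule would also weigh how long each job runs and which resources it competes for.

```csharp
using System;
using System.Collections.Generic;

public static class StaggerPOC
{
    // Spread jobs evenly across a window that may cross midnight.
    // windowStart and windowEnd are times of day (e.g. 22:00 and 06:00).
    public static List<KeyValuePair<string, TimeSpan>> Stagger(
        IList<string> jobs, TimeSpan windowStart, TimeSpan windowEnd)
    {
        TimeSpan day = TimeSpan.FromHours(24);
        // Window length, wrapping past midnight when the end is not later than the start.
        TimeSpan length = windowEnd > windowStart
            ? windowEnd - windowStart
            : windowEnd + day - windowStart;

        var schedule = new List<KeyValuePair<string, TimeSpan>>();
        if (jobs.Count == 0) return schedule;

        // Each job gets an equal slice of the window and starts at the beginning of its slice.
        long sliceTicks = length.Ticks / jobs.Count;
        for (int i = 0; i < jobs.Count; i++)
        {
            long startTicks = (windowStart.Ticks + sliceTicks * i) % day.Ticks;
            schedule.Add(new KeyValuePair<string, TimeSpan>(jobs[i], new TimeSpan(startTicks)));
        }
        return schedule;
    }

    public static void Main()
    {
        // Hypothetical job names, purely for illustration.
        var jobs = new[] { "backup", "reindex", "reports", "log-rotation" };
        foreach (var slot in Stagger(jobs, TimeSpan.FromHours(22), TimeSpan.FromHours(6)))
        {
            Console.WriteLine("{0:00}:{1:00}  {2}", slot.Value.Hours, slot.Value.Minutes, slot.Key);
        }
    }
}
```

With four jobs and an eight-hour window, each job gets a two-hour slice, so only one of them lands on midnight instead of all four.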

So, if you happen to be in that meeting today and someone says "midnight", please, just check to see what else is running at midnight and consider a different time slot.

The Power of the Proof of Concept

Feb 17, 2008

If you were to go through my Visual Studio "Projects" directory or my personal development web server, you'd find that about 2/3 of the directories are named SomethingPOC or SomethingExperiment. The former has become my convention over the last year or so and stands for Proof of Concept.

I spend a good portion of each day with one or more of these projects open. That's because I consider the use of the Proof of Concept to be integral to software development. Whether the official methodology of the project encourages them, is indifferent or actively discourages them (and I've worked in all of those environments), I will insist on using them.

First, a quick explanation of what exactly I mean by a POC. Basically, it's the simplest possible program that will answer a question that you have about the tasks in front of you.

Say, for instance, that you wanted to retrieve an RSS feed and store the individual entries into a SQL Server database. I'd probably do a quick POC to connect to the database and insert a record. Then I'd do one for fetching an RSS feed. I might also do one that checks a feed for new items vs. ones seen before.

In other words, each POC tests out one concept and proves that you can do what was a question mark in your project approach. If you end up with more than one method in a POC, you're probably doing too much and it should be broken down into more than one POC.
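
To make that concrete, here is roughly what the "new items vs. ones seen before" POC might look like: one method doing the real work, no error handling, and a hard-coded feed instead of a real download. The NewItemsPOC class, the sample feed and the choice of the guid element as an item's identity are all assumptions made for the sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class NewItemsPOC
{
    // Return the titles of feed items whose <guid> is not in the "seen" set.
    // Deliberately quick and loose: no validation, no error handling.
    public static List<string> FindNewTitles(string rssXml, HashSet<string> seenGuids)
    {
        return XDocument.Parse(rssXml)
            .Descendants("item")
            .Where(item => !seenGuids.Contains((string)item.Element("guid")))
            .Select(item => (string)item.Element("title"))
            .ToList();
    }

    public static void Main()
    {
        // Hard-coded sample feed; fetching a real one belongs in a separate POC.
        string rss =
            "<rss version=\"2.0\"><channel><title>Sample</title>" +
            "<item><title>Old post</title><guid>post-1</guid></item>" +
            "<item><title>New post</title><guid>post-2</guid></item>" +
            "</channel></rss>";

        var seen = new HashSet<string> { "post-1" };
        foreach (string title in FindNewTitles(rss, seen))
        {
            Console.WriteLine(title);
        }
    }
}
```

The only question this answers is "can I tell new items from old ones?". Where the seen set is stored and how the feed is downloaded are separate question marks, and each would get its own POC.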

I deliberately name the projects and classes with names that can NOT work in the final project. This helps to hedge against the inclination to do a quick copy and paste into your real project. This is important because POCs should be quick and loose. They don't have error handling, don't do validation (unless that's what you're proving) and generally don't follow many of the rules of good software development. That's a good thing.

That freedom means you can quickly explore the problem and work through some possible solutions in a "sandbox" without worrying about whether you're doing it "right". However, it's also a good thing that you throw the POC away or only use it as a reference.

If you do lots of POCs, name them to encourage disposable coding and then move on to your "real" development, you'll find that you have left most of those crappy early mistakes behind in the POCs and have already run into and overcome many of the typical problems that crop up in new solutions.

Once they're into their "real" development, lots of people abandon POCs. However, I keep using them throughout the project (even into the bugfixing and testing phases). Every time I ask myself whether an idea will work, I answer the question in a POC rather than trying the experimental code in the permanent code.

Over time, this ends up being a constant cycle. You ask yourself a question that can only be answered with code, do a POC to come up with an answer and then move back to the full project to implement it. If you aren't used to this kind of cycle, I'd recommend giving it a shot. I won't work without it.

Using HTMLTidy to Clean Up HTML with C#

Feb 14, 2008

For a while now, I've had a project on the back burner for a different set of tools for RSS reading, writing and publishing. I'd like a single toolchain that lets me keep everything together in one place. I've got piles of notes, a few proof of concept projects and the start of several of the components.

Last night, when I couldn't sleep, I decided to check off a proof of concept for the Atom Publishing Client part of the toolchain: using HTML Tidy to clean HTML up into XHTML before putting it into an Atom entry document.

I currently do most of my writing for this site in Windows Live Writer. However, that's more of a compromise than an ideal choice. While I could probably hack a plugin together that would make Live Writer a more suitable long-term choice, I really want a very specific set of features that includes getting away from the XML-RPC API that all of the server-side engines Live Writer works with are based on.

So, I've been tinkering with a multi-tabbed Windows app for editing posts. The WYSIWYG tab for quick editing uses the MSHTML engine from Internet Explorer. I've looked around and unless you're willing to pony up $299 for a commercial control, that's the most reasonable choice.

However, the HTML that MSHTML spits out is horrible and really needs to be cleaned up. So, I set out to figure out how to use HTMLTidy in a C# project.

I tried to find a .NET wrapper for HTMLTidy and thought I had scored right away when I found one here. However, when I tried to use it, not even the sample code would build without errors on my development machine.

So, I dropped back to trying the COM object version. The last update to it was back in 2000 or so, but it looked like all of the features I needed were in that version, so I decided to give it a shot.

To use the TidyCOM library, you add it as a reference and insert your "using TidyCOM" statement in your class. The actual usage is fairly straightforward.

Example:

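// Requires a reference to the TidyCOM library and "using TidyCOM;" at the top of the file.
TidyObject TidyObj = new TidyObject();
TidyObj.Options.Doctype = "strict";
TidyObj.Options.DropFontTags = true;
TidyObj.Options.OutputXhtml = true;
TidyObj.Options.Indent = TidyCOM.IndentScheme.AutoIndent;
TidyObj.Options.TabSize = 2;
// HTML holds the messy markup; the return value is the cleaned-up XHTML.
String CleanHTML = TidyObj.TidyMemToMem(HTML);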

That code assumes that the "HTML" variable has your messy HTML in it and at the end, "CleanHTML" has your cleaned up XHTML in it.

My little multi-tabbed prototype is using a buffer object to keep the "current" HTML in it. Whenever you switch tabs, the old content is scrubbed through this code before the new tab gets updated out of the buffer. That means that whether it's the WYSIWYG tab that messes it up or you in the HTML editor, you still get valid XHTML in the eventual output.

I also extended my CleanupHTML method (which contains the above code) to scrub out the HTML header tags, body tags, etc. The HTML will end up as one part of the Atom XML file, not as a standalone HTML file, so I only want the content from the editor. Both MSHTML and HTML Tidy will always put that wrapper markup back in unless you strip it out.
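
The stripping step could be sketched along these lines. This is not the actual CleanupHTML method, just one way to do it, under the assumption that Tidy hands back a complete document with a single body element. The BodyExtractPOC class and the ExtractBody name are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BodyExtractPOC
{
    // Captures everything between the opening <body ...> tag and </body>, across line breaks.
    private static readonly Regex BodyPattern = new Regex(
        @"<body[^>]*>(.*?)</body>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Return only the markup inside <body>, trimmed of surrounding whitespace.
    // If no body element is found, return the input unchanged.
    public static string ExtractBody(string xhtml)
    {
        Match match = BodyPattern.Match(xhtml);
        return match.Success ? match.Groups[1].Value.Trim() : xhtml;
    }

    public static void Main()
    {
        // Stand-in for what a tidied document might look like.
        string tidied =
            "<html><head><title></title></head>\n" +
            "<body>\n  <p>Hello, Atom.</p>\n</body></html>";
        Console.WriteLine(ExtractBody(tidied));
    }
}
```

A regular expression is the quick-and-loose choice here. Since the input is supposed to be well-formed XHTML by this point, loading it into an XML parser and pulling out the body's children would probably be the sturdier route for the real project.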

While I'd still like an assembly that's a little more current, this clearly does the job well enough to check this feature off of my checklist. Now, on to RESTful services on IIS with C#.

Keeping Track of Everything You Print

Feb 13, 2008

A few weeks ago, I was staring at my browser which was presenting me the now ubiquitous receipt page after buying something online. That page was, as is so common, recommending that I "print this page for your records".

The thing is that I usually don't want a printed copy, despite really wanting to keep a copy. As I was on my Linux laptop, I just printed it to PDF. That way, I have a copy in a form that matches what I would have gotten if I had printed it. I could have saved the HTML page, but I like the single-document approach of PDF for this.

On my Mac laptop, this is just as easy and on Windows not much harder. Both Ubuntu and Mac OS X make it really easy to have a PDF printer. However, what I noticed as I went to print this particular receipt to a PDF was that on none of my machines was this PDF printer the default printer.

Because of that, I was only getting a PDF when I saw in advance that I might want one instead of printing it for real. That sparked a bit of curiosity in me. What would happen if I made the PDF printer my default and sent everything through there first?

So, for the past few weeks, that's been the setup on all of my workstations. The results make it clear that I want to make this the default setup from here on out for a few reasons.

First is the number of times I printed something to PDF, sent it to the printer, marked up the printout and eventually dropped the paper into the recycling, only to go looking on my desk for that printout a couple of days later. No problem, since the PDF was sitting in my PDF output directory.

It's also become a really decent way to save a web page article or snapshot of a document in an easily retrievable format. When combined with my recent JungleDisk installations on all of those machines and the automatic backups that include those PDF directories on all of the machines, I now have access to anything I've printed or wanted to keep, no matter where I was when I printed it.

While I still use bookmarking engines quite a bit for marking things to find later, it's happened more often than I am comfortable with that the page/article in question has gone away by the time I want it a few months down the road. That's not the case with exported PDFs.

Finally, when you turn off your browser's headers and footers, you can easily use straight HTML or any of the online word processors for document editing and get nice PDFs for sharing by email, etc.

Given how I can quite easily write simple documents in raw HTML faster and make them look more consistent (with standardized CSS) than I can do the same in MS Word or OpenOffice, this is pretty useful.

Overall, pretty slick and handy. If you haven't ever tried setting your computer up this way, I highly recommend giving it a shot.

Beyond Wikipedia: Researching and Exploring Online

Feb 09, 2008

Every few weeks I seem to see clusters of discussions about "young people" and technology. Typically, it starts when I notice someone doing a news story, or just spouting off in a restaurant, about how amazing it is that "kids today" are growing up with computers/cellphones/iPods and how adept and sophisticated those kids are in using those devices.

Nearly always, within a day or two, I see another article or happen to witness an incident that points to just how wrong that generalization is. From computers rife with thousands of viruses and bits of spyware to college professors reporting how poorly students grasp the very concept of citing sources and the simple basics of research, the examples point to a much more complicated picture.

It's clear to me that there is a segment inside EVERY age group that just "gets" technology. Many of the sharpest technologists I know are in their 50s or 60s and some of the most clueless are 16-25. Of course, the plural of anecdote isn't data, but there certainly seems to be enough indication that the full spectrum from tech novice to tech genius exists in nearly all of the age brackets.

One of the criticisms leveled at the non-savvy portion of the younger brackets is how often they will pretty much stop at the first level of Wikipedia when researching a topic. It's so common that many colleges and universities have had to put actual bans on citing Wikipedia in academic papers.

Given that I was told that the encyclopedia stopped being a valid primary source at some time in 8th grade, this troubles me like it does many others. Wikipedia and Google are starting points for exploring or researching a topic.

I've mentioned before how often I've been asked how/why I know something. That's been followed more than a few times by people asking how I manage to learn as much as I do about the topics that sparked the discussion in the first place.

Having just used my "normal" process on a topic, I took note of how I dig in and thought I'd share. This isn't an approach to writing a formal paper/thesis/dissertation. Rather, it's an approach to satisfying curiosity, getting acquainted with a topic, and reaching a dedicated hobbyist's level of knowledge in it.

J Wynia

For better or worse, I'm the guy who runs things here. I'm a web consultant, software developer, writer and geek from Minneapolis, MN. This site is a fairly wide cross-section of the things I'm interested in and enjoy writing about.

© 2007 J Wynia. All original content is licensed under the terms of the Creative Commons Attribution license unless otherwise noted. Content from other sources is licensed under its original terms.