Resources for Converting Microsoft Word Files to HTML
Wednesday, 09 April 2008
Few people would dispute that Microsoft Word is a versatile program. The problem is that it may not be the best tool for converting your Word documents to HTML. This is especially true if you need to watch file sizes, because Word adds code that increases page size. It may also include information you don’t expect. If you need to get your Word content to the web, we’ve got some tips and alternatives.

Like many people, I use Microsoft Word to write everything. I write my articles in Word and convert them for my content management system (CMS). I would prefer not to rewrite the article in another application, nor do I want to write in my CMS editor since it has limited functionality. It can be a trade-off between convenience, functionality and file size.

Microsoft Word Has 2 Different HTML File Types

You might think the best and most convenient way to get your Word document online is to use the Save as type: Web Page. Then you could upload the saved HTML file to a web server. There are two issues you should review with this file type.

This Web Page format adds the information from the File Properties dialog and other descriptive information to the top of the document. These data elements include author, last author, company, document statistics and so on. You can see some of these elements in the thumbnail example below.

[Screenshot: extra Word information in the HTML file]
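If you want to check a saved file before you upload it, a short script can list what it reveals. The following is only a sketch. It assumes the properties are stored as elements such as <o:Author>, <o:LastAuthor> and <o:Company>, so look at the source of your own file to confirm the names. The function name find_document_properties is made up for this example.

```python
import re

# Assumed element names; check the source of your own saved file.
PROPERTY_TAGS = ("Author", "LastAuthor", "Company")


def find_document_properties(html):
    """Return a dict of document properties found in Word-generated HTML."""
    found = {}
    for tag in PROPERTY_TAGS:
        # Look for <o:Tag>value</o:Tag>, ignoring case and line breaks.
        pattern = r"<o:%s>(.*?)</o:%s>" % (tag, tag)
        match = re.search(pattern, html, re.IGNORECASE | re.DOTALL)
        if match:
            found[tag] = match.group(1).strip()
    return found
```

An empty result means none of the assumed elements were found. It does not prove the file is free of personal information.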

The Web Page version is probably fine for company intranets, where users aren’t as concerned about privacy. After all, some of this information could also be seen if you emailed the Word file to a co-worker. In contrast, I wouldn’t use this format to post your resume on the web, especially if you wrote it on a company PC.

The second issue is that this HTML format adds extra tags to the file. One function of these tags is to carry over the style information from your Word document template. This information also preserves Word functionality should you need to edit the document in Word again.

[Screenshot: example of tags added to Word HTML]

This extra code increases the size of your web page. This may not sound like an issue, but it can be, depending on your document size. According to sources such as WebSiteOptimization.com, slow response times are one of the most common reasons people leave a site. One part of response time is the web page size. More recently, Google announced they will be factoring in page load times for their AdWords advertisers.

Microsoft Word’s Filtered Web Page

Word has another HTML file format called Web Page, Filtered. This file type strips most of the document information. It also cuts the amount of style code. Although smaller, files in this format still contain numerous <p class="MsoNormal"> references. As mentioned above, some of this coding allows you to edit your work in Word. This file format is best used for final document versions.
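For a final version that you won’t reopen in Word, the leftover class references can be removed with a short script. This is a minimal sketch, not a full cleaner. It assumes the class names start with Mso and appear alone in the attribute, either quoted or unquoted. The function name strip_mso_classes is made up for this example.

```python
import re

# Matches class="MsoNormal", class='MsoNormal' or class=MsoNormal,
# including the whitespace in front of the attribute.
MSO_CLASS = re.compile(r"""\s+class\s*=\s*(["']?)Mso\w+\1""")


def strip_mso_classes(html):
    """Remove Word's Mso* class attributes and leave everything else alone."""
    return MSO_CLASS.sub("", html)
```

Classes with other names are left untouched, so any styling of your own survives.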

In my small test page, the size was cut from 9.82K to 4.03K with this format. Much of the savings in this example came from the removal of the document information. In my first file, the <h1> heading tag for Example 1 was on line 175. In the Web Page, Filtered format, the same heading is on line 58.

[Screenshot: example of the Word Web Page, Filtered format]
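To put the test page numbers in perspective, the drop from 9.82K to 4.03K is a saving of roughly 59 percent. Here is the arithmetic as a small sketch; the function name percent_saved is made up for this example.

```python
def percent_saved(original_kb, filtered_kb):
    """Return the size reduction as a percentage of the original size."""
    return (original_kb - filtered_kb) / original_kb * 100


# The sizes from the test page above.
print(round(percent_saved(9.82, 4.03), 1))  # prints 59.0
```

Your own savings will vary with how much document information and styling the file carries.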

Use Gmail to Reduce Word HTML File Size

One way to convert a final version of a Word document to HTML is to send it to a Gmail account as an attachment. With this method, you can keep your file in the .doc format. (I’ve not tried this tip with the Word 2007 .docx file format.)

To have Gmail convert a Word doc to HTML,

1. Open your favorite email program.

2. Attach your Word document (the .doc file format) to the email.

3. Send the email to your Gmail account.

4. Open the Gmail item with the attachment.

5. Click the link at the bottom that says “View as HTML”. The document will open in your browser.

6. Right-click in your web browser and select “View Page Source” or “View Source”.

7. Copy and paste the contents into an HTML editor or Notepad. Don’t paste it back into Microsoft Word.

8. Scroll toward the top of your file and look for the code Google adds to download your file. You should remove this link.

<div style="background:#ffffcc;padding:4 8;border-bottom:thin solid #eeeeee;font-family:Arial,sans-serif"><a href="/mail/?attid=0.1&disp=attd&view=att&th=1192fa6dbxxxxxxx">Download the original attachment</a></div></div><div style="margin:1ex">

9. Make any changes in your editor.

10. Save your file with the .htm or .html extension.
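If you use this tip often, step 8 can be scripted. The sketch below removes only the link itself and leaves the surrounding <div> tags alone. It assumes the link text is exactly “Download the original attachment”, as in the code shown in step 8. The function name remove_download_link is made up for this example.

```python
import re

# Matches the whole <a ...>Download the original attachment</a> element.
DOWNLOAD_LINK = re.compile(
    r"<a\b[^>]*>\s*Download the original attachment\s*</a>",
    re.IGNORECASE,
)


def remove_download_link(html):
    """Return the page source without Gmail's download link."""
    return DOWNLOAD_LINK.sub("", html)
```

The empty banner <div> stays in the file. You can delete it by hand in your editor if it gets in the way.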

Use Textism Word HTML Cleaner for Small Word Files

Another free resource is Textism. This web site has a simple interface that can convert files smaller than 20K. The author says the service is designed for “fairly basic styled documents”. The size may be a limiting factor for some documents, but you can convert larger files by getting a subscription.

To use the Textism service,

1. Save your Word document with the Web Page file format.

2. Go to http://textism.com/wordcleaner/

3. Click the Browse… button

4. Find your Word file with the .htm or .html extension

5. Click the Process button.

6. Copy and paste the clean code into your HTML editor of choice.

Using a Word Document in a Content Management System

The promise of many content management systems is that it’s easy to create online content. Ideally, you write your article in their HTML editor. I’ve yet to find an editor that gives me the functionality or space I need, which is why I write in Microsoft Word.

Some HTML editors provide a Paste from Word toolbar button.

[Screenshot: Paste from Word editor feature]

These utilities will remove many tags such as <p class="MsoNormal">. Depending on your document, there might be lots of these tags. Some systems also offer a Paste as Text button, which removes all formatting. It works well for simple documents, but if you have any formatting for lists, tables, paragraphs and so on, you might spend more time reapplying the formatting.

Although these editors do an admirable job of removing extra Microsoft Word code, plenty remains. I’ve seen many examples in content and email management systems where this extra code causes problems when the user makes changes. As an example, they can’t change the color or font size of some text. It’s not until you look at the HTML code that you see different font tags stepping on each other.
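One blunt way out of the font tag tangle is to strip every <font> tag and reapply the formatting afterwards. The sketch below does that. It assumes you are happy to lose all font formatting in the pasted content. The function name strip_font_tags is made up for this example.

```python
import re

# Matches opening <font ...> tags and closing </font> tags.
FONT_TAG = re.compile(r"</?font\b[^>]*>", re.IGNORECASE)


def strip_font_tags(html):
    """Remove font tags but keep the text and other tags inside them."""
    return FONT_TAG.sub("", html)
```

For example, strip_font_tags('<font color="red"><font size="2">Hi</font></font>') returns 'Hi'.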

Special Word Conversion Scenarios

The suggestions above work for short documents such as articles. Sometimes, people need a solution to convert large Word documents with a Table of Contents to an online format. An example might be a report or technical manual. Since these documents usually need a navigation system to link pages, I would consider a commercial product. One Microsoft Word to HTML conversion program I’ve used is called WordToWeb from Solutionsoft. The current version sells for $299, but that may be a bargain if it cuts the amount of time to convert your document.

The other situation some companies run into is how to convert hundreds of Word documents to HTML files. I doubt anyone would want to do the conversions one document at a time. Instead, they could use another commercial program from Zappado called Word Cleaner. The $99 program allows you to batch convert .doc, .docx, .rtf and .txt files to HTML or XHTML.
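To give a feel for what batch processing involves, here is a rough sketch that cleans a folder of files already saved from Word as .htm or .html. It is not how Word Cleaner works and it does not convert .doc files. The function names, the folder arguments and the UTF-8 default are all assumptions for this example, and Word may save with a different encoding such as windows-1252.

```python
import re
from pathlib import Path

# Assumed cleanup rule: drop Word's Mso* class attributes.
MSO_CLASS = re.compile(r"""\s+class\s*=\s*(["']?)Mso\w+\1""")


def clean_html(html):
    """Apply the cleanup rule to one page of HTML."""
    return MSO_CLASS.sub("", html)


def clean_folder(source_dir, output_dir, encoding="utf-8"):
    """Clean every .htm/.html file in source_dir and write it to output_dir."""
    source = Path(source_dir)
    output = Path(output_dir)
    output.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(source.glob("*.htm*")):
        # Undecodable bytes are replaced rather than stopping the batch.
        text = path.read_text(encoding=encoding, errors="replace")
        (output / path.name).write_text(clean_html(text), encoding=encoding)
        written.append(path.name)
    return written
```

The originals are left untouched, so you can compare the results before you upload anything.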

Word Cleaner does more than batch HTML conversion. I use this software to convert my Word documents for this site because it removes all traces of the Word code. One of the nice features is that you can create conversion templates. For example, I might have one template for content on this website where I use CSS files to handle the presentation. On another site, I might choose to embed the style information in the file.

As you can see, even though Microsoft Word doesn’t do the best job in converting your documents to HTML, it doesn’t deter me from using the program. It still remains my starting point. I just use some other tools to supplement the program.


Last Updated ( Sunday, 08 June 2008 )