Few people would dispute that Microsoft Word is a versatile program. The problem is that it may not be the best tool for converting Word documents to HTML. One reason is that Microsoft adds extra code so that you can easily switch from one document format to another. The end result is larger files and code that may cause issues. But there are alternatives.
Like many people, I use Microsoft Word to write content. I sometimes write my articles in Word and then move them into my content management system (CMS). I would prefer not to rewrite the article in another application, nor do I want to write in my CMS editor since it has limited functionality. It can be a trade-off between convenience, functionality and file size.
Document Complexity & Changes
When I first wrote this article in 2008, the conversion options were much different. For example, Gmail had an option to view an attachment as HTML. There were also a couple of commercial vendors who have since disappeared. I’ve updated the article to reflect those changes. The main considerations I see are:
- How complicated is your Word document?
- Does your Word document have images?
- Are you comfortable uploading a file to a 3rd party service?
- Where will the HTML document be seen?
- Will the document be viewed on a mobile phone?
- Are you willing to pay for conversion?
- How frequently do you need to convert Word documents?
Microsoft Word Options
You might think the best and most convenient way to get your Word document to HTML is to use the Save as type: Web Page. Then you could upload the saved HTML file to a web server. There are two issues you should review with this file type.
This Web Page format appends the information from the File Properties dialog and other descriptive information to the top of the document. These data elements include author, last author, company, document stats, and so on. You can see some of these elements in the image below.
The Web Page version is probably fine for company intranets, where users aren’t as concerned about privacy. Some of this information could be seen anyway if you emailed the Word file to a co-worker. In contrast, I wouldn’t use this format to post your resume on the web, especially if you wrote it on a company PC that shows the organization name.
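If you want to check a saved file before publishing it, a short script can list the properties it exposes. This is a minimal sketch, assuming the properties appear as o:-prefixed elements such as <o:Author>, which is how Word's Web Page export typically writes them. The function name and the sample fragment are made up for illustration.

```python
import re

# Assumption: document properties are stored as o:-prefixed elements,
# for example <o:Author>Jane Doe</o:Author>.
PROPERTY_PATTERN = re.compile(r"<o:(\w+)>([^<]*)</o:\1>")


def find_document_properties(html):
    """Return a dict of o:-prefixed property names and their text values."""
    return {name: value.strip() for name, value in PROPERTY_PATTERN.findall(html)}


# Hypothetical fragment for illustration, not output from a real file.
sample = """
<xml>
 <o:DocumentProperties>
  <o:Author>Jane Doe</o:Author>
  <o:LastAuthor>Jane Doe</o:LastAuthor>
  <o:Company>Example Corp</o:Company>
 </o:DocumentProperties>
</xml>
"""

for name, value in find_document_properties(sample).items():
    print(f"{name}: {value}")
```

Anything this prints is visible to anyone who views the page source.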
The second issue is that this HTML format adds tags to the file. One function of these tags is to carry over your Microsoft Word style information. These tags also make it easier to go from one file type to another. For example, you might want to go from .HTML to .RTF or back to .DOCX.
This extra code increases the size of your web page. This may not sound like an issue, but it can be, depending on your document size. And the extra code may cause rendering issues on mobile devices.
Another drawback of this extra code shows up if you need to edit the HTML file. Most HTML documents have a separate CSS file that controls styling. With converted documents, this styling is done inline. Depending on how the initial Microsoft Word document was styled, you might have to make changes to every paragraph or span. With a CSS file, you’d probably make one change.
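To get a feel for how much inline styling a converted file carries, you can count the tags that have a style attribute. The sketch below uses only Python's standard html.parser module. The class name, function name and sample markup are illustrative, not taken from a real Word export.

```python
from collections import Counter
from html.parser import HTMLParser


class InlineStyleCounter(HTMLParser):
    """Count how many start tags of each kind carry a style attribute."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if any(name == "style" for name, _ in attrs):
            self.counts[tag] += 1


def count_inline_styles(html):
    counter = InlineStyleCounter()
    counter.feed(html)
    counter.close()
    return counter.counts


# Illustrative markup only.
sample = (
    '<p style="margin:0in">One <span style="color:red">red</span> word.</p>'
    '<p style="margin:0in">Two</p>'
    "<p>Three</p>"
)
print(count_inline_styles(sample))
```

Each counted tag is a place you would have to edit by hand, where a shared CSS rule would need only one change.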
Microsoft’s Word Filtered Web Page
Microsoft has another HTML file format called Web Page, Filtered. This file type strips most of the document information. It also cuts the number of style codes. Although smaller, this file format still contains numerous references.
In my small test page, the size was cut from 9.82K to 4.03K with this format, a reduction of about 59%. Much of the savings in this example was from the removal of the document information. In my first file, the heading tag for Example 1 was on line 175. In the Web Page, Filtered format, the same heading is at line 58.
The bottom line is the Microsoft Word conversion options are free and offer you convenience. The downside is you may reveal too much info in your document and future HTML editing may be tougher with all the extra info.
Using a Content Management System (CMS)
The promise of many content management systems is that it’s easy to create content. Ideally, you write your article in their HTML editor. I’ve yet to find an editor that gives me the functionality or space I need, which is why I sometimes write in Microsoft Word.
Some CMS editors provide a Paste from Word toolbar button.
These utilities will remove some tags, but not all. Depending on your document, there might be lots of these tags left. Some systems also offer a Paste as Text button, which removes all formatting. It works well for simple documents, but if you have any formatting for lists, tables, paragraphs and so on, you might spend more time reapplying the formatting than you saved.
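The effect of a Paste as Text button can be approximated in a few lines. This is a rough sketch of the idea, not how any particular CMS implements it: it keeps the text and drops every tag. The names and sample markup are made up.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect text content and drop every tag and attribute."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def paste_as_text(html):
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Illustrative markup: a two-item list with one bold word.
sample = "<ul><li><b>First</b> item</li><li>Second item</li></ul>"
print(paste_as_text(sample))  # prints "First itemSecond item"
```

The two list items run together in the output, which is exactly the formatting you would have to reapply by hand.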
Paste Convert Solutions
Another way to convert Word documents is to use an online service. These are free services that best handle simple text. One advantage is these services are not saving your files and no uploads are required. The major drawback is images. One caveat about these systems is most give items after the
One of the simplest to use is TextFixer. You just copy the Word document and then paste the contents into a textbox. The service does a reasonable job of providing HTML but strips out any images. This also means if you used fancy list bullets or wingdings you may see another character.
If you only have a few images, this service is still a good consideration because it’s pretty easy to add the images back if you know a little HTML or use an editor.
This site also has a number of other free text tools and tutorials that might interest you such as converting HTML to text or text sorting.
A similar free service is provided by Word2CleanHTML. Like TextFixer, you copy and paste your Word document into a textbox. The advantage is they provide several checkboxes for additional filtering such as removing blank paragraph tags or converting “smart quotes”.
They also provide tabs so you can see the “Original HTML”, “Clean HTML” and “Preview”. These tabs were useful as they helped me spot that I had a blank table in my original document. The main drawback is they don’t handle images, but as I said above those are easy to add back in if you don’t have too many and know a bit of HTML.
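Two of the filters mentioned above, removing blank paragraph tags and converting smart quotes, are simple enough to sketch yourself. This is not Word2CleanHTML's code. It assumes that "converting" means replacing curly quotes with plain ones and that a blank paragraph holds only whitespace or &nbsp; entities.

```python
import re

# Curly quotes mapped to their plain ASCII equivalents.
SMART_QUOTES = str.maketrans({
    "\u2018": "'",
    "\u2019": "'",
    "\u201c": '"',
    "\u201d": '"',
})

# A paragraph containing only whitespace or &nbsp; entities.
EMPTY_PARAGRAPH = re.compile(r"<p\b[^>]*>(?:\s|&nbsp;)*</p>", re.IGNORECASE)


def clean_html(html):
    """Drop blank paragraphs, then replace curly quotes with plain ones."""
    html = EMPTY_PARAGRAPH.sub("", html)
    return html.translate(SMART_QUOTES)


# Illustrative input with two blank paragraphs and curly quotes.
sample = "<p>\u201cHello\u201d</p><p>&nbsp;</p><p>It\u2019s done.</p><p></p>"
print(clean_html(sample))
```

A regular expression is fine for a narrow pattern like this, but it is no substitute for a real HTML parser on messy input.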
OK, technically you can create a new Google doc and paste in your Word doc. You can then download the file as a zipped HTML file. If you want clean HTML, avoid this solution. It adds a lot of extra formatting. I was disappointed.
Upload File and Convert Solutions
Another group of services allows you to upload the Word .doc or .docx file for conversion. In some cases, the service does other file type conversions aside from Microsoft Word. Typically, these services do a better job of conversion as they handle images. However, some people don’t like uploading files to another site, especially if it’s confidential information that might go on a corporate intranet.
Online-Convert.Com is a service that converts many file types as the image below shows. It’s similar to a service I reviewed some time back called Zamzar. The process is simple in that you choose the end file type you need such as HTML. You then upload your Word source document. One advantage is you can upload files from a URL, Google Drive or Dropbox.
Your file will be converted including any images. The images are converted to PNGs and are included in the zipped file. The resulting file contains more inline styling than the previous solutions, but not nearly as much as using Microsoft Word. There are some additional META tags added in the <head> section. If you prefer, you can also opt to get a download link for the file sent to an email address.
Word to HTML
You’ll also need to copy the code from the HTML editor as the free version doesn’t allow file downloads.
The WordtoHTML Pro version costs $49 a year and includes additional features. If you routinely convert Microsoft Word files or PDF files to HTML this is a good option as you can customize and save your settings in template files. For example, you may want to include head or meta-information or prefer a certain type of formatting.
This is one of the few services I’ve seen that has PDF to HTML conversions. It worked pretty well, but as with complex Word documents, you may see issues. For example, drop caps don’t always come in. And if people have tweaked the kerning to get words to fit, you may see errors. The Find and Replace option is also handy if you need to replace character entities.
Stand Alone Conversion Programs
Some people prefer having a desktop program to do these conversions and bypass online services. There are two programs I found, but they both run on Windows. Apologies to Mac users.
Batch and Complex Conversions
The other situation some companies run into is how to convert hundreds of Word documents to HTML files. I doubt anyone would want to do these file conversions one document at a time. Instead, they could use another commercial program called DocConverterPro. The program allows you to batch convert .doc, .docx, .rtf and PDF files to HTML or XHTML. This product is the new and improved version of the WordCleaner program I used back in 2008. It’s been rebranded.
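Whatever tool does the conversion, the batch part is just a loop over a folder. The sketch below shows that loop and nothing more. The convert argument is a hypothetical callable standing in for your actual converter, and it does not use or represent DocConverterPro's interface.

```python
from pathlib import Path

# Input extensions named above as batch-convertible.
SOURCE_EXTENSIONS = {".doc", ".docx", ".rtf", ".pdf"}


def batch_convert(source_dir, output_dir, convert):
    """Run `convert` on every matching file and write one .html file per input.

    `convert` is a hypothetical callable that takes a Path and returns an
    HTML string; plug in whatever conversion tool you actually use.
    """
    source = Path(source_dir)
    output = Path(output_dir)
    output.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(source.iterdir()):
        if path.is_file() and path.suffix.lower() in SOURCE_EXTENSIONS:
            target = output / (path.stem + ".html")
            target.write_text(convert(path), encoding="utf-8")
            written.append(target)
    return written
```

Note that two inputs with the same name but different extensions, such as report.doc and report.docx, would write to the same output file in this sketch.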
DocConverterPro does more than batch HTML conversion. I used this software years ago to convert my Word documents for this site. At the time I wasn’t familiar enough with HTML. One of the nice features is you can create conversion templates. For example, I might have one template for content on this website where I use CSS files to handle the presentation. On another site, I might choose to embed the style information in the file. It’s very flexible and powerful.
The service also has a Windows program. That was the method I used many years ago, but I think the online version is easier. The pricing does vary between the services and options. For example, the online version is $99.