Many people would agree that Microsoft Word is a versatile program. However, it may not be the best tool to convert Word to HTML. One reason is that Microsoft adds extra code that allows you to switch from one document format to another easily. The result is larger files and code that may cause issues. In this tutorial, I show some alternatives that convert Word to HTML.
Like many people, I use Microsoft Word to write content. I sometimes write my articles in Word and then move them into my content management system (CMS). I would prefer not to rewrite the article in another application, and I don’t write in my CMS editor because it has limited functionality. It can be a trade-off between convenience, functionality, and file size.
When I first wrote this article in 2008, the conversion options were much different. For example, Gmail had an option to view an attachment as HTML. There were also a couple of commercial vendors who have since disappeared. I’ve updated the article to reflect those changes. The main considerations I see are:
- How complicated is your Word document?
- Does your Word document have images?
- Does your HTML document need to have the same formatting?
- Are you comfortable uploading a file to a third-party service?
- Where will the HTML document be seen?
- Will the document be viewed on a mobile phone?
- Are you willing to pay for conversion?
- How often do you need to convert Word documents?
Microsoft Word Web Page
You might think the best and most convenient way to get your Word document to HTML is to use the Save as type: Web Page. Then you could upload the saved HTML file to a web server. However, there are two issues you should review with this file type.
This Web Page format adds the information from the File Properties dialog and other descriptive information to the top of the file. These data elements include author, last author, company, document stats, and so on. You can see some of these elements in the image below.
The Web Page version is probably fine for company intranets, where users aren’t as concerned about privacy. After all, a co-worker could see some of this information if you emailed them the Word file. In contrast, I wouldn’t use this format to post your resume on the web, especially if you wrote it using a company PC that shows the organization’s name.
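If you want to check a saved file before publishing it, a short script can list what is exposed. Here’s a minimal sketch, assuming the properties appear in the markup as elements like <o:Author> and <o:Company>, which is how Word has typically written them. The function name is my own, and the element prefix may differ in your version of Word.

```python
import re

# Assumption: Word writes each property as <o:Name>value</o:Name>.
# [^<]* keeps the match to leaf elements, so the wrapping
# <o:DocumentProperties> container is skipped.
PROPERTY_RE = re.compile(r"<o:(\w+)>([^<]*)</o:\1>")


def find_document_properties(html: str) -> dict[str, str]:
    """Return the property names and values found in the saved markup."""
    return {name: value.strip() for name, value in PROPERTY_RE.findall(html)}


if __name__ == "__main__":
    sample = (
        "<xml><o:DocumentProperties>\n"
        " <o:Author>Jane Doe</o:Author>\n"
        " <o:Company>Acme</o:Company>\n"
        "</o:DocumentProperties></xml>"
    )
    for name, value in find_document_properties(sample).items():
        print(f"{name}: {value}")
```

An empty result doesn’t prove the file is clean, so it’s still worth opening the HTML in a text editor and looking at the top of the file.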
The second issue is that this HTML format adds extra tags to the file. One function of these tags is to carry over your Microsoft Word style information. These tags also make it easier to go from one file type to another, for example from .HTML to .RTF or back to .DOCX.
This extra code increases the size of your web page. This may not sound like an issue, but it can be, depending on your document size. And the extra code may cause rendering issues on mobile devices.
Another drawback of this extra code shows up if you need to edit the HTML file. Most HTML documents have a separate CSS file that controls styling. With converted documents, this styling is done inline. Depending on how the original Microsoft Word document was styled, you might have to change every paragraph or span. With a CSS file, you’d probably make one change.
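To get a feel for how much inline styling a converted file carries, you can count the tags that have a style attribute. This is a rough sketch using Python’s built-in HTML parser. The class and function names are my own, and the count is only an indicator of how much hand editing you might face.

```python
from html.parser import HTMLParser


class InlineStyleCounter(HTMLParser):
    """Counts start tags, and how many of them carry a style attribute."""

    def __init__(self):
        super().__init__()
        self.styled = 0
        self.total = 0

    def handle_starttag(self, tag, attrs):
        self.total += 1
        if any(name == "style" for name, _ in attrs):
            self.styled += 1


def count_inline_styles(html: str) -> tuple[int, int]:
    """Return (tags with inline style, all start tags)."""
    counter = InlineStyleCounter()
    counter.feed(html)
    counter.close()
    return counter.styled, counter.total


if __name__ == "__main__":
    page = '<p style="margin:0"><span style="color:red">Hi</span></p><p>Plain</p>'
    styled, total = count_inline_styles(page)
    print(f"{styled} of {total} tags use inline styles")
```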
Microsoft Word Filtered Web Page
Microsoft has another HTML file format called Web Page, Filtered. This file type strips most of the document information. It also cuts the number of style codes. Although smaller, this file format still contains numerous style references.
On my small test page, this format cut the size from 9.82K to 4.03K, a reduction of about 59%. Much of the savings in this example came from the removal of the document information. For example, in my first file, the heading tag for Example 1 was on line 175. In the Web Page, Filtered format, the same heading is at line 58.
The bottom line is the Microsoft Word conversion options are free and offer you convenience. The downside is you may reveal too much info in your document, and future HTML editing may be tougher with all the extra info.
Content Management Systems (CMS)
Many content management systems promise that it’s easy to create content. Ideally, you write your article in their HTML editor. I’ve yet to find an editor that gives me the functionality or space I need, which is why I sometimes write in Microsoft Word.
Some CMS editors provide a Paste from Word toolbar button.
These utilities will remove some tags, but not all. Depending on your document, there might be lots of these tags. Some systems also offer a Paste as Text button, which removes all formatting. It works well for simple documents, but if you have any formatting for lists, tables, paragraphs, and so on, you might spend more time reapplying it.
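To show what “remove all formatting” means in practice, here’s a small sketch of a Paste as Text style cleanup. It is not how any particular CMS implements the button; it’s my own approximation using Python’s built-in HTML parser, and the list of block tags is an assumption.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and drops every tag and attribute."""

    # Assumption: these tags should start a new line in the plain text.
    BLOCK_TAGS = {"p", "div", "li", "tr", "br", "h1", "h2", "h3", "h4", "h5", "h6"}
    SKIPPED_TAGS = {"style", "script"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skipping = False

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self.skipping = True
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS:
            self.skipping = False

    def handle_data(self, data):
        if not self.skipping:
            self.parts.append(data)


def paste_as_text(html: str) -> str:
    """Return the text content, one non-empty line per block element."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    lines = (line.strip() for line in "".join(extractor.parts).splitlines())
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    pasted = '<p class="MsoNormal"><b>Bold</b> text</p><ul><li>One</li><li>Two</li></ul>'
    print(paste_as_text(pasted))
```

Notice that the bold and the list structure are gone in the result. That’s the formatting you would have to reapply by hand.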
Security Concerns and Risks
Any time you use another program or service to convert your files, please read their Terms of Service (TOS) first. There are many tools that can assist you; however, some tools add code to your page that wasn’t in the original. This could be something like a note indicating the file was converted by XYZ service. In many cases, that’s the “price” you pay for using a free service. I think it’s fine for the vendor to note that.
Sadly, other vendors inject irrelevant links that amount to SEO spam. This is something I have seen with WordPress plugins, but not with Word to HTML conversion services. However, a SaaS bootstrapper has a nice blog post detailing an issue he discovered with link injection. It’s worth a read, and I applaud his desire to find out what was going on.
The good news is that none of the services referenced below were mentioned in the blog post. The author does provide a list of bad actors.
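If you’re worried about injected links, one simple check is to list every link in the converted HTML and compare it against the links you know were in your document. Here’s a minimal sketch. The function names and the example URLs are placeholders of my own, and the comparison is an exact string match, so a converter that rewrites a URL slightly would also be flagged.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every anchor tag."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.add(value)


def collect_links(html: str) -> set[str]:
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


def unexpected_links(converted_html: str, expected_links) -> list[str]:
    """Return links found in the converted HTML that you did not put there."""
    return sorted(collect_links(converted_html) - set(expected_links))


if __name__ == "__main__":
    converted = (
        '<p><a href="https://example.com/my-article">Mine</a> '
        '<a href="https://spam.example/offer">Not mine</a></p>'
    )
    print(unexpected_links(converted, ["https://example.com/my-article"]))
```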
Paste & Convert Solutions
Another way to convert Word documents is to use an online service. These are free services that best handle simple text. One advantage is that these services do not save your files, and no uploads are required. The major drawback is that images are typically ignored.
One of the simplest converters to use is TextFixer. You copy the Word document and then paste the contents into a textbox. The service does a reasonable job of providing HTML but strips out any images. Unfortunately, if you used fancy list bullets or Wingdings, you might see a different character.
If you only have a few images, this service is still a good consideration because it’s pretty easy to add the images back if you know a little HTML.
This site also has a number of other free text tools and tutorials that might interest you such as converting HTML to text or text sorting.
Word2CleanHTML provides a similar free conversion service. Like TextFixer, you copy and paste your Word document into a textbox. The advantage is they provide several checkboxes for additional filtering, such as removing blank paragraph tags or converting “smart quotes.”
They also provide tabs so you can compare the “Original HTML,” “Clean HTML,” and “Preview.” These tabs were useful as they helped me spot a blank table in my original document. The main drawback is they don’t handle images, but as I said above, those are easier to add back if you don’t have too many and know a bit of HTML.
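The two filters mentioned above are easy to picture in code. This is only a sketch of the idea, not how Word2CleanHTML does it. I’m assuming that “converting” smart quotes means swapping them for straight quotes, and that a blank paragraph is one containing only whitespace or &nbsp;.

```python
import re

# Curly quotes mapped to their straight equivalents (an assumption about
# what "converting smart quotes" should produce).
SMART_QUOTES = str.maketrans({
    "\u2018": "'",
    "\u2019": "'",
    "\u201c": '"',
    "\u201d": '"',
})

# A paragraph tag, with or without attributes, holding only whitespace or &nbsp;
EMPTY_P_RE = re.compile(r"<p(?:\s[^>]*)?>(?:\s|&nbsp;)*</p>", re.IGNORECASE)


def clean_html(html: str, strip_empty_paragraphs: bool = True,
               straighten_quotes: bool = True) -> str:
    """Apply the selected filters, much like ticking the checkboxes."""
    if strip_empty_paragraphs:
        html = EMPTY_P_RE.sub("", html)
    if straighten_quotes:
        html = html.translate(SMART_QUOTES)
    return html


if __name__ == "__main__":
    messy = '<p>\u201cHi\u201d</p><p class="MsoNormal">&nbsp;</p><p> </p>'
    print(clean_html(messy))
```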
Upload File and Convert Solutions
Another group of services allows you to upload the Word .doc or .docx file for conversion. In some cases, the service did other file type conversions aside from Microsoft Word. Typically, these services did a better job of conversion as they would retain images. However, some people don’t like uploading files to another site, especially if it’s confidential information that might go on a corporate intranet.
Online-Convert.Com is a service that converts many file types, as the image below shows. It’s similar to a service I reviewed some time ago called Zamzar. The process is simple in that you choose the end file type you need, such as HTML. You then upload your Word source document. One advantage is you can upload files from a URL, Google Drive or Dropbox.
Your file will be converted, including any images. The images are converted to PNGs and are included in the zipped file. The resulting file contains more inline styling than the previous solutions, but not nearly as much as Microsoft Word. You may also want to check to see if further image compression needs to be done. There are some additional META tags added in the <head> section. If you prefer, you can also opt to get a download link for the converted file sent to an email address.
Word to HTML
You’ll also need to copy the code from the HTML editor as the free version doesn’t allow file downloads.
The WordtoHTML Pro version costs $49 a year and includes additional features. If you routinely convert Microsoft Word files or PDF files to HTML, this is a good option because you can customize and save your settings in template files. For example, you may want to include head or meta-information or prefer a certain type of formatting.
This is one of the few services I’ve seen that has PDF to HTML conversions. It worked pretty well, but as with complex Word documents, you may see issues. For example, drop caps don’t always come in. And if people have tweaked the letter kerning to get words to fit, you may see errors. The Find and Replace option is also handy if you need to replace character entities.
Complex and Batch Conversions
The other situation some companies run into is how to convert hundreds of Word documents to HTML files. I doubt anyone would want to do these file conversions one document at a time. Instead, they could use another commercial program called DocConverterPro. The program allows you to batch convert .doc, .docx, .rtf and PDF files to HTML or XHTML. This product is the new and improved version of the WordCleaner program I used back in 2008. It’s been rebranded.
DocConverterPro does more than batch HTML conversion. I used this software years ago to convert my Word documents for this site. At the time, I wasn’t familiar enough with HTML. One of the nice features is you can create conversion templates. For example, I might have one template for content on this website where I use CSS files to handle the presentation. On another site, I might choose to embed the style information in the file. It’s very flexible and powerful.
The service also has a Windows program. That was the method I used many years ago, but I think the online version is easier. The pricing does vary between the services and options. For example, the online version is $99.
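If you’re comfortable with a little scripting, the batch idea itself is simple: loop over a folder and write one HTML file per document. The sketch below is not how DocConverterPro works. The convert argument is a placeholder for whatever converter you trust, and the folder names are made up for the example.

```python
from pathlib import Path
from typing import Callable


def batch_convert(src_dir, out_dir, convert: Callable[[Path], str],
                  pattern: str = "*.docx") -> list[Path]:
    """Convert every matching file in src_dir and write .html files to out_dir.

    `convert` is a hypothetical callable that takes a document path and
    returns an HTML string; plug in the converter of your choice.
    """
    src, out = Path(src_dir), Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for doc in sorted(src.glob(pattern)):
        target = out / (doc.stem + ".html")
        target.write_text(convert(doc), encoding="utf-8")
        written.append(target)
    return written


if __name__ == "__main__":
    # Stand-in converter so the sketch runs without any extra software.
    def fake_convert(path: Path) -> str:
        return f"<p>Converted from {path.name}</p>"

    for html_file in batch_convert("word_docs", "html_out", fake_convert):
        print("Wrote", html_file)
```

Passing the converter in as an argument keeps the loop independent of any one tool, so you can swap converters without touching the rest.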
Clearly, there are a lot of options when it comes to converting a Word file to HTML. A lot of it comes down to time, budget and document structure. If you have a lot of documents, I’d lean toward some of the paid solutions. These tools will reduce the overall conversion time and give you a more consistent result.