Looking for a reliable way to convert Word documents to HTML? Microsoft Word might not be your best bet. It often adds unnecessary code, resulting in large files and potential rendering issues. In this guide, I’ll explore some free and paid alternatives for a cleaner conversion.
Like many people, I use Microsoft Word to write. I sometimes write my articles in Word and then move them into my content management system, WordPress. The conversion can be a trade-off between convenience, functionality, and file size. Many online tools can do this file conversion, but they don’t always deliver clean HTML.
Document Complexity & Your Needs
When I first wrote this article in 2008, the conversion options differed greatly. For example, Gmail had an option to view an attachment as HTML. There were also a couple of commercial vendors who have since disappeared. I’ve updated the article to reflect those changes. The main considerations I see are:
- How complicated is your Word document?
- Does your Word document have images?
- Does your HTML document need to have the same formatting?
- Are you comfortable uploading a doc file to a third-party service?
- Where will the HTML document be seen?
- Will the document be viewed on a mobile phone?
- Are you willing to pay for conversion?
- How often do you need to convert Word documents?
Regardless of complexity, there are common issues I see. None of the systems below are perfect, so you’ll probably have to do some tweaking. The issues I encountered include:
- Most conversion systems will mark your Title as a paragraph.
- Apostrophes and special characters may be displayed on black backgrounds.
- Embedded images may not be shown, or may be placed in a new folder as separate image files.
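The first issue is usually a quick fix once you have the HTML. Here is a minimal Python sketch, assuming the title is the first paragraph in the converted fragment (an assumption on my part, so check your own output). The function name `promote_title` is made up for illustration.

```python
import re

def promote_title(html: str) -> str:
    """Turn the first <p>...</p> in the fragment into an <h1>...</h1>."""
    # \b keeps the pattern from matching tags like <pre> or <param>.
    pattern = re.compile(r"<p\b[^>]*>(.*?)</p>", re.IGNORECASE | re.DOTALL)
    # count=1 means only the first paragraph is rewritten.
    return pattern.sub(lambda m: f"<h1>{m.group(1).strip()}</h1>", html, count=1)

sample = '<p class="MsoTitle">My Article</p>\n<p>First paragraph.</p>'
print(promote_title(sample))
```

Any class or inline style on the original paragraph is dropped, which is usually what you want for a heading that your site’s CSS will style.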
Microsoft Word Options
One logical place to start would be with Microsoft and to see if Word can format the file as HTML. While Word is not an HTML editor like VS Code, it can save your files in different file formats. There are three options, but before you convert your document, make sure you’ve saved the original in the .docx format. This will allow you to test the various options without overwriting your content.
Word – Save As Web Page
You might think the best and most convenient way to get your Word document to HTML is to use the Save as type: Web Page. Then you could upload the saved HTML file to a web server. However, there are two issues you should review with this file type.
This Web Page format appends information from the File Properties dialog and the document template. These data elements include author, last author, company, document stats, and so on. You can see some of these elements in the image below. As you might guess, a lot of this relates to style information from your Microsoft Word template or normal.dot.
The Web Page version is probably fine for company intranets, where users aren’t as concerned about privacy. Some of this information could be seen if you emailed the Word file to a co-worker. In contrast, I wouldn’t use this format to post your resume on the web, especially if you wrote it using a company PC that shows the organization’s name. These files contain too many proprietary tags.
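If you want to check what a saved Web Page file exposes before posting it, a quick scan can help. This is a minimal Python sketch, assuming the properties are stored as `o:`-prefixed elements such as `<o:Author>`. That element naming is an assumption, so confirm it against your own file. The function name is hypothetical.

```python
import re

def find_office_properties(html: str) -> dict:
    """Collect simple <o:Name>value</o:Name> pairs from saved HTML."""
    # Assumption: properties look like <o:Author>Jane Doe</o:Author>.
    pattern = re.compile(r"<o:(\w+)>([^<]*)</o:\1>")
    return {name: value.strip() for name, value in pattern.findall(html)}

sample = (
    "<o:DocumentProperties>"
    "<o:Author>Jane Doe</o:Author>"
    "<o:Company>Acme Corp</o:Company>"
    "</o:DocumentProperties>"
)
print(find_office_properties(sample))
```

Anything this turns up, such as an author or company name, is worth removing before the page goes on a public server.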
The second issue is that this HTML format adds extra tags to the file. One function of these tags is to convert your Microsoft Word style information. These tags also make it easier to go from one file type to another, for example, from .HTML to .RTF or back to .DOCX. However, the <body> tag doesn’t even show up until line 1093.
This extra code increases the size of your web page. This may not sound like an issue, but it can be, depending on your document size. And the extra code may cause rendering issues on some devices.
The extra code is also a drawback if you need to edit the HTML file. Most HTML documents have a separate CSS file that controls styling. With converted documents, this styling is done inline. Depending on how the initial Microsoft Word document was styled, you might have to change every paragraph or span. With a CSS file, you’d probably make one change.
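To get a feel for how much inline styling a converted file carries, you can count the tags that have a style attribute. This is a rough Python sketch using the standard library’s `html.parser`. The class and function names are my own, and the sample markup is invented for illustration.

```python
from html.parser import HTMLParser

class InlineStyleCounter(HTMLParser):
    """Counts start tags and how many of them carry a style attribute."""

    def __init__(self):
        super().__init__()
        self.total = 0
        self.styled = 0

    def handle_starttag(self, tag, attrs):
        self.total += 1
        if any(name == "style" for name, _ in attrs):
            self.styled += 1

def count_inline_styles(html: str):
    """Return (styled_tags, total_tags) for an HTML string."""
    counter = InlineStyleCounter()
    counter.feed(html)
    counter.close()
    return counter.styled, counter.total

sample = '<p style="margin:0in"><span style="font-size:11pt">Hi</span> <b>there</b></p>'
print(count_inline_styles(sample))
```

A high ratio of styled tags to total tags is a hint that later edits will mean touching many elements rather than one CSS rule.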
Lastly, the program will create another folder that contains supporting files such as your images.
Word – Save As Filtered Web Page
Microsoft has another HTML file format called Web Page, Filtered. This file type strips most of the document information and focuses on the content. It also cuts the number of document and template style codes. Although considerably smaller than Save as Web Page, this file format still contains numerous classes and span references.
The size was cut from 92K to 31K with this format on my test page. Much of the savings in this example came from the removal of the document information. For example, in my first file, the heading tag for Example 1 was on line 1093. In the Web Page, Filtered format, the same heading is at line 128.
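You can run the same kind of comparison on your own files. Here is a minimal Python sketch that reports a document’s size and the line where a given tag first appears. The function names and the sample markup are hypothetical. For real files, you would read each saved file into a string first.

```python
def size_in_bytes(html: str) -> int:
    """Size of the document when encoded as UTF-8."""
    return len(html.encode("utf-8"))

def first_line_with(html: str, needle: str):
    """1-based line number of the first line containing needle, or None."""
    for number, line in enumerate(html.splitlines(), start=1):
        if needle.lower() in line.lower():
            return number
    return None

sample = "<html>\n<head>\n<style>p{}</style>\n</head>\n<body>\n<h1>Example 1</h1>"
print(size_in_bytes(sample), first_line_with(sample, "<body"), first_line_with(sample, "<h1"))
```

Running it against both the Web Page and the Web Page, Filtered versions of the same document shows how much of each file is overhead before the content starts.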
The bottom line is the Microsoft Word conversion options are free and offer you convenience. The downside is you may reveal too much info in your document, and future HTML editing may be tougher with all the extra info. If you decide to use Web Page, Filtered, you will lose some Microsoft Office formatting features if you re-open the file in Word.
Content Management Systems (CMS)
Many content management systems promise that it’s easy to create content using WYSIWYG. Ideally, you write your article in their HTML editor. I’ve yet to find an editor that gives me the functionality or space I need, which is why I sometimes write in Microsoft Word.
Some CMS editors provide a Paste from Word toolbar button.
These utilities will remove some tags, but not all. Depending on your document, there might be lots of these tags. Some systems also offer a Paste as Text button. This button will remove all text formatting. This button works well for simple documents, but if you have any formatting for lists, tables, paragraphs, and so on, you might spend more time reapplying the formatting.
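To show what a Paste as Text style of cleanup does to your markup, here is a minimal Python sketch that keeps only the text and throws away every tag. It is my own illustration built on the standard library’s `html.parser`, not how any particular CMS implements its button.

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collects text content and ignores all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def paste_as_text(html: str) -> str:
    """Return the text of an HTML fragment with all formatting removed."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)

sample = "<p><b>Bold</b> and <i>italic</i></p>"
print(paste_as_text(sample))
```

The bold and italic markers are gone from the result, which is why lists and tables need to be rebuilt by hand after this kind of paste.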
WordPress Gutenberg Block Editor
If you haven’t tried the Gutenberg block editor in WordPress, you should. I was pleasantly surprised to see how much styling it retained. The screen snap below shows a recent paste and created blocks. If I go to view the HTML code of the second block, I can see it properly coded the unordered list.
The process isn’t foolproof, but if you’re already using WordPress, it’s probably the best option.
Security Concerns and Risks
Any time you use another program or service to convert your files, please read their Terms of Service (TOS) first. There are many tools that can assist you; however, some tools add code to your page. This could be something like indicating the file was converted by XYZ service. In many cases, that’s the “price” you pay for using a free service. I think it’s fine for the vendor to note that.
Sadly, other vendors inject non-relevant links that amount to SEO spam. This is something I have seen with WordPress plugins, but not with Word to HTML conversion or HTML cleaning services. However, SaaS bootstrapper has a nice blog post detailing an issue he discovered with link injection. It’s worth a read, and I applaud his desire to find out what was going on.
The good news is that none of the services referenced below were mentioned in the blog post. The author does provide a list of bad actors.
Online Doc to HTML Converters
Another way to convert Word documents is to use an online service. These are free services that best handle simple text. One advantage is that these services don’t save your files, and no uploads are required. The major drawback is that images are typically ignored.
Word2CleanHTML provides a free conversion service. You can copy and paste your Word document into a textbox. The advantage is they provide several checkboxes for additional filtering, such as removing blank paragraph tags or converting “smart quotes.”
They also provide tabs so you can compare the “Original HTML,” “Clean HTML,” and “Preview.” These tabs were useful as they helped me spot a blank table in my original document. The main drawback is they don’t handle images, but as I said above, those are easier to add back if you don’t have too many and know a bit of HTML.
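The two filters mentioned above, removing blank paragraphs and converting smart quotes, are simple enough to sketch locally. This is a minimal Python illustration of the idea, not the service’s actual code. The function name and the sample text are my own.

```python
import re

# Curly quotes mapped to their plain ASCII equivalents.
SMART_QUOTES = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
})

def clean_fragment(html: str) -> str:
    """Drop empty paragraphs, then straighten curly quotes."""
    # A paragraph counts as blank if it holds only whitespace or &nbsp;.
    html = re.sub(r"<p\b[^>]*>(?:\s|&nbsp;)*</p>\s*", "", html, flags=re.IGNORECASE)
    return html.translate(SMART_QUOTES)

sample = "<p>It\u2019s \u201cclean\u201d</p>\n<p>&nbsp;</p>\n<p>Done</p>"
print(clean_fragment(sample))
```

Whether you want straight quotes at all is a style choice. If your pages are served as UTF-8, the curly versions normally display fine and you may only need the blank-paragraph step.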
One of the simplest converters to use is TextFixer. You copy the Word document and then paste the contents into a textbox. The service does a reasonable job of providing HTML but strips out any images. Unfortunately, if you used fancy list bullets or Wingdings, you might see a different character.
This site also has a number of other free text tools and tutorials that might interest you, such as converting HTML to text or text sorting.
Upload File and Convert Solutions
Another group of services allows you to upload the Word .doc or .docx file for conversion. In some cases, the service did other file-type conversions aside from Microsoft Word. Typically, these services did a better job of conversion as they would retain images. However, some people don’t like uploading files to another site, especially if it’s confidential information that might go on a corporate intranet.
Online-Convert.Com is a service that converts many file types, as the image below shows. It’s similar to a service I reviewed some time back called Zamzar. The process is simple in that you choose the end file type you need, such as HTML. You then upload your Word source document. One advantage is that you can upload files from a URL, Google Drive, or Dropbox.
Your file will be converted, including any images. The images are converted to PNGs and are included in the zipped file. The resulting file contains more inline styling than the previous solutions, but not nearly as much as Microsoft Word or Google Docs. You may also want to check to see if further image compression needs to be done. There are some additional META tags added in the <head> section. If you prefer, you can also opt to get a download link for the converted file sent to an email address.
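One quick way to decide whether the images need more compression is to list their sizes straight from the zipped download. Here is a minimal Python sketch using the standard `zipfile` module. The function name is hypothetical, and the demo builds a tiny archive in memory with placeholder bytes instead of a real download.

```python
import io
import zipfile

def list_png_images(archive):
    """Return (filename, uncompressed size in bytes) for each PNG in a zip."""
    with zipfile.ZipFile(archive) as bundle:
        return [
            (info.filename, info.file_size)
            for info in bundle.infolist()
            if info.filename.lower().endswith(".png")
        ]

# Demo archive with made-up contents standing in for a converted download.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as bundle:
    bundle.writestr("article.html", "<p>Hi</p>")
    bundle.writestr("image1.png", b"\x89PNG" + b"\x00" * 96)

print(list_png_images(demo))
```

For a real download, pass the path to the zip file instead of the in-memory demo object.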
Word to HTML
You’ll also need to copy the code from the HTML editor as the free version doesn’t allow file downloads.
The WordtoHTML Pro version costs $90 per year or $10 per month and includes additional features. If you routinely convert Microsoft Word files or PDF files to HTML, this is a good option because you can customize and save your settings in template files. For example, you may want to include head or meta-information or prefer a certain type of formatting.
This is one of the few services I’ve seen that has PDF to HTML conversions. It worked pretty well, but like complex Word documents, you may see issues. For example, drop caps don’t always come in. And if people have tweaked the letter kerning to get words to fit, you may see errors. The Find and Replace option is also handy if you need to replace character entities.
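If you end up with character entities in copied code and would rather have the literal characters, Python’s standard `html.unescape` does this kind of replacement in one call. This is a small sketch of the idea with an invented sample string. It is not a description of how the service’s Find and Replace works.

```python
import html

def entities_to_characters(text: str) -> str:
    """Replace HTML character entities with the characters they stand for."""
    return html.unescape(text)

sample = "&ldquo;Hi&rdquo; &amp; bye"
print(entities_to_characters(sample))
```

Be careful running this over a whole HTML file, since entities such as `&lt;` and `&amp;` are often there on purpose to keep the markup valid.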
Complex Word Documents and Batch Conversions
The other situation some companies run into is how to convert hundreds of Word documents to HTML files. I doubt anyone would want to do these file conversions one document at a time. Instead, they could use another commercial program called DocConverterPro. The program allows you to batch convert .doc, .docx, .rtf and PDF files to HTML or XHTML. This product is the new and improved version of the WordCleaner program I used back in 2008. It’s been rebranded.
DocConverterPro does more than batch HTML conversion. I used this software years ago to convert my Word documents for this site when I was using Joomla. At the time, I wasn’t familiar enough with HTML. One of the nice features is you can create conversion templates. For example, I might have one template for content on this website where I use CSS files to handle the presentation. On another site, I might choose to embed the style information in the file. It’s very flexible and powerful.
The service also has a Microsoft Windows program. That was the method I used many years ago, but I think the online version is easier. Pricing does vary between the services and options. For example, the online version is $100 yearly or $10 per month.
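For a sense of what batch conversion involves, the overall loop looks roughly like this. It is a minimal Python sketch: `batch_convert` and the `convert` callable are hypothetical names, and the stand-in converter in the demo only wraps the file name in a paragraph. A real run would plug in whatever conversion tool you use.

```python
import tempfile
from pathlib import Path

def batch_convert(source_dir, output_dir, convert):
    """Write one .html file per .docx file, using convert(path) -> HTML string."""
    source_dir, output_dir = Path(source_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for doc in sorted(source_dir.glob("*.docx")):
        target = output_dir / (doc.stem + ".html")
        target.write_text(convert(doc), encoding="utf-8")
        written.append(target)
    return written

# Demo with empty placeholder files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "docs"
    source.mkdir()
    for name in ("a.docx", "b.docx", "notes.txt"):
        (source / name).write_bytes(b"")
    written = batch_convert(source, Path(tmp) / "html", lambda doc: f"<p>{doc.name}</p>")
    names = [path.name for path in written]
    first_html = written[0].read_text(encoding="utf-8")

print(names)
```

Passing the converter in as a function keeps the loop the same whether you apply one set of settings to every document or switch settings per site.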
Clearly, there are a lot of options when it comes to converting a Word file to HTML. A lot of it comes down to time, budget, and document structure. If you have a lot of documents, I’d lean toward some of the paid solutions. These tools will reduce the overall conversion time and give you a more consistent result.