Using Microsoft Word to write website content

How to convert MS Word to HTML cheaply and effectively

Many people are comfortable writing their website content in a word processor. For most, that means Microsoft Word.

Unfortunately, the HTML that Microsoft Word produces does not comply well with web standards, and it carries a plethora of superfluous code that is very hard to clean properly. If you copy and paste from MS Word into your HTML document, regardless of the system you use, the result will be less than satisfactory and will often break your page.

The reason is that MS Word uses a number of proprietary tags, and even saving your document as an HTML page won't cure the problem.

(Pressed for time? Don't want to read the entire post? Skip the article and use the Checklist)

Here is why. Let's look at a simple paragraph written in MS Word:

Sample paragraph:

I use my Energy and Honesty to Teach and Inspire the People of the World to be Confidently Healthy, and to Joyfully Encourage individuals to follow their Life's Passion to the fullest, while they Encourage Others to do the same with Harmony, Peace, Joy, Respect and Cooperation between People, Animals and the Environment.

And this is what the HTML should look like.

[Image: word-html]

Here is the same sentence in the MS Word format. It's too long for me to insert it into this article and, in fact, it's too long for even a screenshot. But trust me, you don't want that mess in your website.

Clearly we have a problem. Not only does the MS Word version add an inordinate amount of text, increasing the file size, but some of the rules in the code will very likely break your existing page layout, often with disastrous results. In addition, Word replaces apostrophes and other common punctuation marks with typographic ("smart") characters, which often show up as gibberish when the page's character encoding does not match.

Take for instance the apostrophe. How many times have you seen this:
"I don?t want to".
That's MS Word at work. Or this:
"The word is §bumblebee§"?
It can get very frustrating. Most WYSIWYG web tools offer no way to fix those issues, and an inexperienced operator risks making things even worse by applying even more styles.
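
Garbled apostrophes like the ones above can usually be traced to Word's typographic ("smart") punctuation. As a minimal sketch, assuming Python is available and that plain ASCII punctuation is acceptable on your site, the characters can be mapped back before the text is pasted into a page. The function name and the mapping table are illustrative only; they are not part of any tool mentioned in this article, and the table is far from complete.

```python
# Hypothetical mapping from Word's typographic punctuation to plain ASCII.
# Extend it with whatever characters show up in your own documents.
SMART_PUNCTUATION = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark (the usual apostrophe)
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # horizontal ellipsis
    "\u00a0": " ",    # non-breaking space
}

# str.maketrans accepts a dict of single characters to replacement strings.
_TRANSLATION_TABLE = str.maketrans(SMART_PUNCTUATION)


def normalize_punctuation(text: str) -> str:
    """Return text with Word's smart punctuation replaced by ASCII."""
    return text.translate(_TRANSLATION_TABLE)


if __name__ == "__main__":
    print(normalize_punctuation("I don\u2019t want to"))
```

If you would rather keep the curly quotes, the alternative is to make sure the page is saved and served in the same encoding (UTF-8 in most cases) instead of replacing the characters.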

Over the years, however, I have had to find ways to quickly and effectively process MS Word files for inclusion in web sites. As a Joomla developer I have a built-in tool to help me out: the JCE text editor, which, if set up properly, has some very good built-in tools to clean Word's HTML output. But if the document is complicated enough, even JCE will not clean every piece of offending code.
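
For the leftovers that an editor misses, a short script can strip the most common Word-specific markup. The sketch below is neither JCE's cleaner nor one of the solutions this article goes on to describe. It is a rough, regex-based illustration in Python, and the function name `strip_word_markup` is hypothetical. It assumes the usual Word artifacts: conditional comments, namespaced tags such as `<o:p>`, `<span>` wrappers, and `class`, `style` and `lang` attributes.

```python
import re

# Conditional comments such as <!--[if gte mso 9]> ... <![endif]-->
CONDITIONAL_COMMENT = re.compile(r"<!--\[if.*?<!\[endif\]-->", re.DOTALL)
# Namespaced Office tags such as <o:p>, </o:p>, <w:WordDocument>
NAMESPACED_TAG = re.compile(r"</?\w+:[^>]*>")
# class=..., style=... and lang=... attributes (quoted or unquoted values)
PRESENTATION_ATTR = re.compile(
    r"""\s+(?:class|style|lang)=(?:"[^"]*"|'[^']*'|[^\s>]+)"""
)
# Opening and closing <span> tags, which Word uses only to carry styling
SPAN_TAG = re.compile(r"</?span[^>]*>")


def strip_word_markup(html: str) -> str:
    """Remove common Word-only markup, keeping the text and basic tags."""
    html = CONDITIONAL_COMMENT.sub("", html)
    html = NAMESPACED_TAG.sub("", html)
    html = PRESENTATION_ATTR.sub("", html)
    html = SPAN_TAG.sub("", html)
    return html


if __name__ == "__main__":
    word_html = (
        "<p class=MsoNormal style='margin:0in'>"
        "<span lang=EN-US>Hello<o:p></o:p></span></p>"
    )
    print(strip_word_markup(word_html))
```

Regular expressions are a blunt instrument for HTML: this sketch would also remove a `class` attribute you added on purpose, and it can be fooled by unusual markup. Treat it as a first pass and check the output by eye.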

I have found several solutions to the issue. They all involve some setup, but the results are very good, and I know from experience that anyone can work them into their workflow.

First though, here are some guidelines for writing Word documents that will be easy to translate into HTML. It is important to follow these guidelines not only to get the best results, but to limit the amount of time spent on processing the text:

Get the updated Checklist

  1. Keep the formatting as plain as possible.
  2. Do not use underline except for hyperlinks; use bold for emphasis instead.
  3. Limit the use of italics to short sentences. Italics are harder to read on screen.
  4. Do not use font colors, custom fonts, centered titles, indented text or any other formatting of the sort. Let your CSS handle that in your website.
  5. Above all: keep it simple and let your website handle all the formatting.

Remember that your website's CSS file should be the one containing all your important formatting: your titles' color and position, blockquotes, etc. A well-formatted HTML page should have no "inline" formatting. One of the reasons is consistency. Let's say that you wanted a blue title on your site using a specific font. Like this one:

Title

Will you or your employees remember from day to day, week to week, the exact color, font size and positioning of your titles? They won't. Instead, use an H1 or H2 tag for your titles and let the CSS file define how they look. They will stay consistent across your website.

Now, let's look at the two methods I found most expedient for dealing with MS Word cleaning.
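
Before turning to those methods, note that the "no inline formatting" rule can be checked mechanically. The following is a minimal sketch, assuming Python and its standard `html.parser` module. The class name, the function name and the short list of presentational tags are my own illustrative choices, not an established tool.

```python
from html.parser import HTMLParser


class InlineFormattingFinder(HTMLParser):
    """Collect tags that carry their own formatting instead of using CSS."""

    # Hypothetical, deliberately short list of presentational tags.
    PRESENTATIONAL_TAGS = {"font", "center", "u"}

    def __init__(self):
        super().__init__()
        self.problems = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lowercase.
        if tag in self.PRESENTATIONAL_TAGS:
            self.problems.append(f"<{tag}> tag")
        if any(name == "style" for name, _ in attrs):
            self.problems.append(f"style attribute on <{tag}>")


def find_inline_formatting(html: str) -> list:
    """Return a list of inline-formatting problems found in html."""
    finder = InlineFormattingFinder()
    finder.feed(html)
    finder.close()
    return finder.problems


if __name__ == "__main__":
    pasted = '<p style="color:blue"><font face="Arial">Title</font></p>'
    for problem in find_inline_formatting(pasted):
        print(problem)
```

A title marked up simply as `<h1>Title</h1>` produces an empty list, because its color, font and position are left to the stylesheet.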
