Written by Marco Conti Wednesday, 19 November 2008
Many people are comfortable writing their website content with a Word Processor. For most that means using Microsoft Word.
Unfortunately, the HTML that Microsoft Word produces does not follow web standards well, and it adds a plethora of superfluous code that is very hard to clean properly. As a result, if you copy and paste from MS Word into your HTML document, regardless of the system you use, the outcome will be less than satisfactory and will often even break your page.
The reason is that MS Word uses a number of proprietary tags, and even saving your document as an HTML page won't cure the problem.
(Pressed for time? Don't want to read the entire post? Skip the article and use the Checklist)
Here is why. Let's look at a simple paragraph written in MS Word:
Sample paragraph: I use my Energy and Honesty to Teach and Inspire the People of the World to be Confidently Healthy, and to Joyfully Encourage individuals to follow their Life's Passion to the fullest, while they Encourage Others to do the same with Harmony, Peace, Joy, Respect and Cooperation between People, Animals and the Environment.
And this is what the HTML should look like.
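As a minimal sketch, assuming all we want is the text wrapped in a single plain `<p>` element with nothing else attached, the following Python snippet builds that kind of markup from the sample paragraph. The helper name `to_paragraph` is made up for this illustration.

```python
from html import escape


def to_paragraph(text: str) -> str:
    """Wrap plain text in one <p> element, escaping &, < and >."""
    # quote=False leaves apostrophes and quotation marks as they are
    return "<p>" + escape(text.strip(), quote=False) + "</p>"


sample = (
    "I use my Energy and Honesty to Teach and Inspire the People of the "
    "World to be Confidently Healthy, and to Joyfully Encourage individuals "
    "to follow their Life's Passion to the fullest, while they Encourage "
    "Others to do the same with Harmony, Peace, Joy, Respect and "
    "Cooperation between People, Animals and the Environment."
)

print(to_paragraph(sample))
```

The output is one opening tag, the sentence, and one closing tag. That is all a browser needs for this paragraph.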
The same sentence in the MS Word format is too long for me to insert into this article; in fact, it's too long for even a screenshot. But trust me, you don't want that mess in your website.
Clearly we have a problem. Not only does the MS Word version add an inordinate amount of text, increasing the file size, but some of the rules in the code will very likely break your existing page layout, often with disastrous results. In addition, Word replaces apostrophes and other commonly used punctuation marks with typographic ("smart") characters, and these often end up displayed as gibberish on a web page.
Take for instance the apostrophe. How many times have you seen this:
"I don?t want to".
That's MS Word at work. Or this:
"The word is §bumblebee§"?
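To make the apostrophe problem concrete, here is a small Python sketch. It assumes the text came out of Word with "smart" punctuation and that a later step can only handle plain ASCII, which is one common way the question mark appears. The mapping and the function names are my own illustration and the list of characters is not exhaustive.

```python
# Common Word "smart" punctuation mapped to plain equivalents
# (illustrative only; extend the table as needed).
SMART_TO_PLAIN = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark, used as apostrophe
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # horizontal ellipsis
    "\u00a0": " ",    # non-breaking space
}

_TABLE = str.maketrans(SMART_TO_PLAIN)


def plain_punctuation(text: str) -> str:
    """Replace smart punctuation with plain ASCII characters."""
    return text.translate(_TABLE)


def force_ascii(text: str) -> str:
    """Simulate a step that cannot represent non-ASCII characters."""
    return text.encode("ascii", errors="replace").decode("ascii")


from_word = "I don\u2019t want to"

print(force_ascii(from_word))                     # I don?t want to
print(force_ascii(plain_punctuation(from_word)))  # I don't want to
```

Running the replacement before the text reaches the page keeps the punctuation readable whatever encoding the site ends up using.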
It can get very frustrating. With most WYSIWYG web tools there is no way to fix those issues, and the inexperienced operator risks making things even worse by applying even more styles.
Over the years, however, I have had to find ways to quickly and effectively process MS Word files for inclusion in web sites. As a Joomla developer I have a built-in tool to help me out: the JCE text editor which, if set up properly, has some very good built-in tools to clean the Word HTML output. But if the document is complicated enough, even JCE will not clean every piece of offending code.
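To give an idea of what this kind of cleanup involves, here is a rough Python sketch. It is not how JCE works internally; it is only my assumption of a bare-minimum text-based pass, using regular expressions that will miss plenty of real-world cases. The function name `strip_word_markup` is hypothetical.

```python
import re

# HTML comments, including Word's <!--[if gte mso 9]> ... <![endif]--> blocks
COMMENTS = re.compile(r"<!--.*?-->", re.DOTALL)
# Namespaced Office tags such as <o:p> and </o:p>
OFFICE_TAGS = re.compile(r"</?\w+:\w+[^>]*>")
# class, style and lang attributes, quoted or unquoted
ATTRIBUTES = re.compile(
    r'''\s+(?:class|style|lang)\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)''',
    re.IGNORECASE,
)
# <span> and <font> wrappers (their inner text is kept)
WRAPPERS = re.compile(r"</?(?:span|font)\b[^>]*>", re.IGNORECASE)


def strip_word_markup(html: str) -> str:
    """Remove the most common Word-specific clutter from an HTML fragment."""
    html = COMMENTS.sub("", html)
    html = OFFICE_TAGS.sub("", html)
    html = ATTRIBUTES.sub("", html)
    html = WRAPPERS.sub("", html)
    return html.strip()


dirty = (
    "<!--[if gte mso 9]><xml><w:WordDocument/></xml><![endif]-->"
    "<p class=MsoNormal style='margin:0in'>"
    '<span lang=EN-US style="font-family:Arial">Hello<o:p></o:p></span></p>'
)

print(strip_word_markup(dirty))  # <p>Hello</p>
```

A pass like this handles simple documents. Anything with tables, lists or nested styles needs a proper HTML parser or a dedicated cleaning tool.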
I found several solutions to the issue. They all involve some setup, but the results are very good, and I know from experience that anyone can include them in their workflow.
First, though, here are some guidelines for writing Word documents that will be easy to translate into HTML. It is important to follow these guidelines not only to get the best results, but also to limit the amount of time spent processing the text:
Get the updated Checklist