Using Microsoft Word to write website content

Written by Marco Conti Wednesday, 19 November 2008 14:29


How to convert MS Word to HTML cheaply and effectively

Many people are comfortable writing their website content with a Word Processor. For most that means using Microsoft Word.

Unfortunately, the HTML that Microsoft Word produces is not very compliant with most modern browsers, and it adds a plethora of superfluous code that is very hard to clean properly. As a result, if you copy and paste from MS Word into your HTML document, regardless of the system you use, the outcome will be less than satisfactory and will often even break your page.

The reason is that MS Word uses a number of proprietary tags, and even saving your document as an HTML page won't cure the problem.

(Pressed for time? Don't want to read the entire post? Skip the article and use the Checklist)

Here is why. Let's look at a simple paragraph written in MS Word:

Sample paragraph:

I use my Energy and Honesty to Teach and Inspire the People of the World to be Confidently Healthy, and to Joyfully Encourage individuals to follow their Life's Passion to the fullest, while they Encourage Others to do the same with Harmony, Peace, Joy, Respect and Cooperation between People, Animals and the Environment.

And this is what the HTML should look like.

[Image: word-html]

Here is the same sentence in the MS Word format. It's too long for me to insert into this article and, in fact, it's too long even for a screenshot. But trust me, you don't want that mess in your website.

Clearly we have a problem. Not only does the MS Word version add an inordinate amount of text, increasing the file size, but some of the rules in the code will very likely break your existing page layout, often with disastrous results. In addition, apostrophes and other commonly used punctuation marks differ from the plain characters HTML pages normally use, and they are often replaced with gibberish.

Take for instance the apostrophe. How many times have you seen this:
"I don?t want to".
That's MS Word at work. Or this:
"The word is §bumblebee§"?
It can get very frustrating. With most WYSIWYG web tools there is no way to fix those issues, and an inexperienced operator risks making things even worse by applying even more styles.
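The garbled apostrophes above typically come from the curly "smart" punctuation that Word substitutes as you type. As a minimal sketch, assuming the text has already been copied out of Word as a Unicode string, the substitution can be undone in a few lines of Python. The function name and the character list are illustrative and not exhaustive:

```python
# Typographic characters that Word commonly inserts, mapped to plain
# ASCII equivalents. This list is an assumption for illustration only.
SMART_PUNCTUATION = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark (apostrophe)
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # horizontal ellipsis
}


def straighten_punctuation(text: str) -> str:
    """Replace curly quotes, dashes and ellipses with ASCII characters."""
    return text.translate(str.maketrans(SMART_PUNCTUATION))


print(straighten_punctuation("I don\u2019t want to"))  # → I don't want to
```

Running the text through a step like this before pasting it into a web editor avoids the question marks and stray symbols, at the cost of losing the typographic quotes.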

Over the years, however, I have had to find ways to quickly and effectively process MS Word files for inclusion in websites. As a Joomla developer I have a built-in tool to help me out: the JCE text editor which, if set up properly, has some very good tools for cleaning Word's HTML output. But if the document is complicated enough, even JCE will not clean every piece of offending code.

I found several solutions to the issue. They all involve some setup, but the results are very good and I know from experience that anyone can incorporate them into their workflow.

First though, here are some guidelines for writing Word documents that will be easy to translate into HTML. It is important to follow these guidelines not only to get the best results, but to limit the amount of time spent on processing the text:

Get the updated Checklist

  1. Use as plain a formatting as possible.
  2. Do not use underline except for hyperlinks; use bold for emphasis instead.
  3. Limit the use of italics to short sentences. Italics are harder to read on screen.
  4. Do not use font colors, custom fonts, centered titles, indented text or any other formatting of the sort. Let your CSS handle that in your website.
  5. Above all: keep it simple and let your website handle all the formatting.

Remember that your website's CSS file should be the one containing all your important formatting: the color and position of your titles, blockquotes, etc. A well formatted HTML page should have no "inline" formatting. One of the reasons is consistency. Let's say that you wanted a blue title on your site using a specific font. Like this one:

Title

Will you or your employees remember, from day to day and week to week, the exact color, font size and positioning of your titles? They won't. Instead, use an H1 or H2 tag for your titles and let the CSS file define their appearance. They will stay consistent across your website.

Now, let's look at the two methods I found most expedient for cleaning MS Word files.

MS Word text based clean up

The simplest way to clean an MS Word file is to copy directly from the Word document, paste into a plain text editor, and then copy from there into your HTML editor of choice. There is a rub: all formatting will be lost, and you will most likely need to reproduce the original document's formatting by hand.

An alternative is to save the document in MS Word as Plain text. Both methods will retain the paragraph breaks if pasted into a WYSIWYG HTML editor, but not if pasted into the HTML view. You'll need to paste into the Visual edit view. Here is a screenshot of the results in Adobe Dreamweaver:

[Screenshot: ms-word-dw-test]

As you can see, all the formatting (bold, italic, etc.) is gone, but that is usually fairly easy to reproduce and a clean start is often preferable.
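If you need to paste into the HTML view rather than the visual view, the paragraph breaks have to be rebuilt by hand. Here is a rough Python sketch of that step, assuming the plain text file puts each paragraph on its own line (which is how Word's Plain text export usually behaves). The function name is hypothetical:

```python
import html


def text_to_paragraphs(text: str) -> str:
    """Wrap every non-empty line of plain text in <p> tags."""
    paragraphs = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            # Escape &, < and > so stray characters cannot break the page.
            paragraphs.append(f"<p>{html.escape(line, quote=False)}</p>")
    return "\n".join(paragraphs)


print(text_to_paragraphs("First paragraph.\n\nSecond paragraph."))
# → <p>First paragraph.</p>
# → <p>Second paragraph.</p>
```

The output contains only P tags, so bold, italics and headings still have to be added back afterwards.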

I have tried Dreamweaver's built-in "Clean Word HTML" command and, while it works to an extent, it still leaves about half the unnecessary code untouched. It's just not good enough.


Web based cleaning

Another excellent method for larger or more complicated documents is to use an online Word cleaner. The one I prefer is called "Textism" and it's free to use for documents smaller than 20K. For larger documents there is an annual fee of about $28 (20 euros). Believe me, it's worth it.

Textism does an excellent job of cleaning Word HTML, leaving only P, BR and the occasional CENTER tag (which is why it's not a good idea to center text in Word; once again, leave it to the CSS).

Textism is also very easy to use. Just save your MS Word document as HTML and upload it to the Textism website. It will return the cleaned code along with the old code for comparison. It's quite staggering how much junk Word leaves behind, and it's no wonder that it breaks websites so often.
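To give an idea of what this kind of cleaning involves, here is a minimal Python sketch of a whitelist filter built on the standard library's html.parser module. It is not how Textism works internally (I don't know that); it only illustrates the general idea of keeping P and BR tags and discarding everything else. The class and function names, and the choice of which tags to keep or skip, are my own assumptions:

```python
from html import escape
from html.parser import HTMLParser

ALLOWED_TAGS = {"p", "br"}                   # assumption: keep paragraph structure only
SKIPPED_TAGS = {"head", "style", "script"}   # drop these tags and their contents


class WordHtmlCleaner(HTMLParser):
    """Collect text plus whitelisted tags, dropping all attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0   # greater than 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif tag in ALLOWED_TAGS and not self.skip_depth:
            self.parts.append(f"<{tag}>")   # class and style attributes are discarded

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == "p" and not self.skip_depth:
            self.parts.append("</p>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(escape(data, quote=False))


def clean_word_html(source: str) -> str:
    cleaner = WordHtmlCleaner()
    cleaner.feed(source)
    cleaner.close()
    return "".join(cleaner.parts).strip()


sample = (
    "<html><head><style>p.MsoNormal{margin:0}</style></head><body>"
    '<p class=MsoNormal style="margin:0"><span lang=EN-US>Hello '
    "<b>world</b><o:p></o:p></span></p></body></html>"
)
print(clean_word_html(sample))  # → <p>Hello world</p>
```

A sketch like this assumes reasonably balanced markup; an unclosed HEAD or STYLE tag would cause the rest of the document to be skipped. Adding "center" or other tags to the allowed set is a one-line change if you want to keep them.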


MS Word and Joomla

There are no big differences between Joomla and a regular HTML editor, but given that most people will use Joomla to publish articles, press releases and other traditionally formatted documents, it's worth taking a look at how to handle a regular MS Word document with either Textism or the plain text system.

A normal document will often have this formatting:

Title
Date (1,1 2009)
Byline (written by first and last name)
Body Text
Footer (with possible attributions, collaborators, etc.)

[Screenshot: ms-word-dw-test]

It's all too easy to get lazy and paste the entire article into Joomla without giving it a second thought. Depending on your Joomla content settings, you may end up with the title, date and byline repeated, because Joomla is already set to insert them in every article if the preference is turned on. It is important to remove them from the article and add them to the proper Joomla fields, as illustrated in the screenshot at right.
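If you process many articles, separating those pieces can be done before the text ever reaches Joomla. The following Python sketch assumes a plain text export in which the first three non-empty lines are the title, the date and the byline, in that order, with everything after them being body text. The names ArticleParts and split_article are hypothetical and have nothing to do with Joomla's own field names:

```python
from typing import NamedTuple


class ArticleParts(NamedTuple):
    title: str
    date: str
    byline: str
    body: str


def split_article(text: str) -> ArticleParts:
    """Split a plain text article into title, date, byline and body."""
    non_empty = [line.strip() for line in text.splitlines() if line.strip()]
    if len(non_empty) < 4:
        raise ValueError("expected title, date, byline and at least one body line")
    title, date, byline, *body = non_empty
    return ArticleParts(title, date, byline, "\n".join(body))


parts = split_article(
    "My Title\n1 January 2009\nWritten by Jane Doe\n\nFirst paragraph.\nSecond paragraph.\n"
)
print(parts.title)   # → My Title
print(parts.byline)  # → Written by Jane Doe
```

The title, date and byline can then be typed into the matching Joomla fields, and only the body pasted into the article editor. Documents that do not follow the layout above will need to be split by hand.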

Conclusion

I hope you found this article useful. Here is a checklist for you or your employees to print and keep handy. It always pays to take five more minutes when setting up your website content rather than doing things in a hurry and ruining your website in the process. If you like to use Word for your word processing, with a little bit of time and training there is no reason why you should change your habits.

Good luck.
