Written by Marco Conti Wednesday, 19 November 2008 14:29
Many people are comfortable writing their website content with a Word Processor. For most that means using Microsoft Word.
Unfortunately, the HTML that Microsoft Word produces does not play well with most modern browsers: it adds a plethora of superfluous code that is very hard to clean properly. As a result, if you copy and paste from MS Word into your HTML document, whatever system you use, the outcome will be less than satisfactory and will often break your page outright.
The reason is that MS Word uses a number of proprietary tags, and even saving your document as an HTML page won't cure the problem.
(Pressed for time? Don't want to read the entire post? Skip the article and use the Checklist)
Here is why. Let's look at a simple paragraph written in MS Word:
Sample paragraph: I use my Energy and Honesty to Teach and Inspire the People of the World to be Confidently Healthy, and to Joyfully Encourage individuals to follow their Life's Passion to the fullest, while they Encourage Others to do the same with Harmony, Peace, Joy, Respect and Cooperation between People, Animals and the Environment.
And this is what the HTML should look like.
Here is the same sentence in the MS Word format. It's too long for me to insert it into this article and, in fact, it's too long for even a screenshot. But trust me, you don't want that mess in your website.
Clearly we have a problem. Not only does the MS Word version add an inordinate amount of text, increasing the file size, but some of the rules in the code will very likely break your existing page layout, often with disastrous results. In addition, the apostrophes and other common punctuation marks Word inserts are not the plain characters most web pages expect, and they often end up replaced by gibberish.
Take for instance the apostrophe. How many times have you seen this:
"I don?t want to".
That's MS Word at work. Or this:
"The word is §bumblebee§"?
It can get very frustrating. With most WYSIWYG web tools there is no way to fix these issues, and the inexperienced operator risks making things even worse by applying even more styles.
Over the years, however, I have had to find ways to quickly and effectively process MS Word files for inclusion in web sites. As a Joomla developer I have a built-in tool to help me out: the JCE text editor which, if set up properly, has some very good built-in tools for cleaning Word's HTML output. But if the document is complicated enough, even JCE will not clean out every piece of offending code.
I have found several solutions to the issue. They all involve some setup, but the results are very good and I know from experience that anyone can incorporate them into their workflow.
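The garbled apostrophes and quotes shown above usually come from Word's "smart" punctuation. As a rough illustration only, here is a minimal Python sketch that swaps those characters for plain ASCII before the text goes anywhere near a web page. The function name `normalize_punctuation` and the particular list of characters are my own assumptions for this example, not part of any tool mentioned in this article.

```python
# Assumed mapping of Word's "smart" punctuation to plain ASCII equivalents.
# Extend or change it to suit your own documents.
SMART_PUNCTUATION = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark (Word's apostrophe)
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # horizontal ellipsis
    "\u00a0": " ",    # non-breaking space
}

# Build the translation table once; str.translate applies it in a single pass.
_TABLE = str.maketrans(SMART_PUNCTUATION)


def normalize_punctuation(text: str) -> str:
    """Return text with smart punctuation replaced by plain ASCII."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    print(normalize_punctuation("I don\u2019t want to"))
    print(normalize_punctuation("The word is \u201cbumblebee\u201d"))
```

Run on the two examples above, this gives back a normal apostrophe and normal double quotes, which any browser will display correctly regardless of the page encoding.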
First though, here are some guidelines for writing Word documents that will be easy to translate into HTML. It is important to follow these guidelines not only to get the best results, but to limit the amount of time spent on processing the text:
Get the updated Checklist
As you can see, all the formatting (bold, italic, etc.) is gone, but that's usually fairly easy to reproduce, and starting from clean text is often preferable.
I have tried Dreamweaver's built-in "Clean Word HTML" command and, while it works to an extent, it still leaves about half of the unnecessary code untouched. It's just not good enough.
Another excellent method for larger or more complicated documents is to use an online Word cleaner. The one I prefer is called "Textism" and it's free to use for documents smaller than 20K. For larger documents there is an annual fee of about $28 (20 euros). Believe me, it's worth it.
Textism does an excellent job at cleaning Word HTML, leaving only P, BR and the occasional CENTER tag (which is why it's not a good idea to center text in Word. Once again, leave it to the CSS).
Textism is also very easy to use. Just save your MS Word document as HTML and upload it to the Textism web site. It will return the cleaned code alongside the old code for comparison. It's quite staggering how much junk Word leaves behind, and it's no wonder that it breaks websites so often.
There are no big differences between Joomla and a regular HTML editor, but given that most people will use Joomla to publish articles, press releases and other traditionally formatted documents, it's worth taking a look at how to handle a regular MS Word document with either Textism or the plain text system.
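To give a feel for what this kind of cleaning involves, here is a minimal Python sketch that keeps only P and BR tags and throws away everything else, including attributes and style blocks. It is not how Textism works internally (I don't know that); it is a stand-in built on Python's standard `html.parser` module, and the names `WordHtmlCleaner` and `clean_word_html` are made up for this example.

```python
from html import escape
from html.parser import HTMLParser

ALLOWED_TAGS = {"p", "br"}                     # assumption: the only tags we keep
SKIPPED_CONTENT = {"style", "script", "title"}  # tags whose text is discarded too


class WordHtmlCleaner(HTMLParser):
    """Collects text plus bare <p>/<br> tags; drops all other markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_CONTENT:
            self.skip_depth += 1
        elif tag in ALLOWED_TAGS and self.skip_depth == 0:
            self.parts.append(f"<{tag}>")  # attributes are deliberately dropped

    def handle_endtag(self, tag):
        if tag in SKIPPED_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == "p" and self.skip_depth == 0:
            self.parts.append("</p>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Re-escape &, < and > so the output is still valid HTML text.
            self.parts.append(escape(data, quote=False))


def clean_word_html(html: str) -> str:
    """Return html reduced to text, <p> and <br> only."""
    cleaner = WordHtmlCleaner()
    cleaner.feed(html)
    cleaner.close()
    return "".join(cleaner.parts)


if __name__ == "__main__":
    messy = ('<p class=MsoNormal style="margin:0"><span style="font-family:Arial">'
             "Hello<o:p></o:p></span></p>")
    print(clean_word_html(messy))
```

Word's conditional comments are ignored automatically because the parser treats them as comments. A real cleaner has to cope with far messier input than this, so treat the sketch as a starting point rather than a replacement for a proper tool.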
A normal document will often have this formatting:
Title
Date (1,1 2009)
Byline (written by first and last name)
Body Text
Footer (with possible attributions, collaborators, etc.)
It's all too easy to get lazy and paste the entire article into Joomla without giving it a second thought. Depending on your Joomla content settings, you may end up with the title, date and byline repeated, because Joomla is already set to insert them in every article if the preference is turned on. It is important to remove them from the article body and add them to the proper Joomla fields, as illustrated in the screenshot at right.
I hope you found this article useful. Here is a checklist for you or your employees to print and keep handy. It always pays to take 5 more minutes when setting up your website content rather than doing things in a hurry and ruining your website in the process. If you like to use Word for your word processing, with a little bit of time and training there is no reason why you should change your habits.
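If you process many articles, the separation of title, date and byline from the body can be scripted. The following Python sketch assumes the plain-text layout listed above, with the title, date and byline as the first three non-empty lines and everything after them (footer included) as the body. The function name `split_article` and the dictionary keys are hypothetical; they do not correspond to anything in Joomla itself.

```python
def split_article(text: str) -> dict:
    """Split a plain-text article into title, date, byline and body.

    Assumes the first three non-empty lines are the title, the date and
    the byline, in that order. Everything after them is the body.
    """
    lines = text.strip().splitlines()
    header = []
    index = 0
    # Walk forward until three non-empty header lines have been collected.
    while index < len(lines) and len(header) < 3:
        line = lines[index].strip()
        if line:
            header.append(line)
        index += 1
    if len(header) < 3:
        raise ValueError("expected title, date and byline lines")
    title, date, byline = header
    body = "\n".join(lines[index:]).strip()
    return {"title": title, "date": date, "byline": byline, "body": body}


if __name__ == "__main__":
    sample = (
        "My Title\n"
        "1 January 2009\n"
        "Written by Jane Doe\n"
        "\n"
        "First paragraph.\n"
        "\n"
        "Second paragraph.\n"
    )
    fields = split_article(sample)
    print(fields["title"])
    print(fields["body"])
```

Only the `body` value would then be pasted into the Joomla editor; the other three go into their own fields by hand.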
Good luck.
Conticreative offers Individual and Corporate training (in person or online) on Joomla, Wordpress, Zen Cart and other leading Open Source scripts.