html tidy, word 2003 and "smart quotes"

Subject: html tidy, word 2003 and "smart quotes"
Posted by:  Ron (98spo…
Date: 13 Apr 2005

Hello, I'm having an aggravating time getting the "html" spewed by Word
2003 to display correctly in a webpage.

The situation here is that the people creating the documents only know
Word, and aren't very computer savvy.  I created a system where they
can save their Word documents as "html" and upload them to a certain
directory, and the web page dynamically runs them through tidylib using
the tidy extension to php4, thus causing the document to display
correctly.  I also run the files through a couple of sed expressions to
remove xml tags that have no business being there.
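
For illustration only, here is roughly what that cleanup step does, sketched in Python rather than sed.  The two patterns are assumptions about typical Word output (namespaced tags such as <o:p> and "<?xml ...>" instructions), not the exact expressions in use, and the name strip_office_markup is made up for this sketch.

```python
import re

# Hypothetical patterns: the real sed expressions are not reproduced here.
# Matches namespaced Office tags such as <o:p>, </o:p>, <v:shape ...>.
OFFICE_TAG = re.compile(r"</?[A-Za-z][A-Za-z0-9]*:[^>]*>")
# Matches XML-style instructions such as <?xml:namespace prefix = o />.
XML_PI = re.compile(r"<\?xml[^>]*>")


def strip_office_markup(html: str) -> str:
    """Remove Word's XML instructions and namespaced tags, keeping the text."""
    html = XML_PI.sub("", html)
    return OFFICE_TAG.sub("", html)


if __name__ == "__main__":
    print(strip_office_markup("<p>Hello<o:p></o:p></p>"))
```

Ordinary tags like <a href="..."> are left alone because the pattern requires a colon directly after the tag name.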

It alllllmost works.  The resulting document follows the page's css
rules and displays correctly, except for those durned "smart quotes".

As you know, Word defaults to replacing straight quotes with curly
quotes, stored in an encoding that doesn't come through intact on web
pages.  When you "save as html", the resulting code doesn't display
correctly.  You can turn off "smart quotes" (which I have suggested),
but that only affects *new* documents -- existing documents still have
the problem.
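
One fallback, if tidy can't be made to do it, would be to straighten the quotes in a separate pass.  The sketch below is in Python and assumes the saved files are Windows-1252 encoded, which is typical for Word's "save as html" on Western systems but is an assumption here.  The function name straighten is hypothetical.

```python
# Map the four typographic quote characters to their straight equivalents.
SMART_TO_STRAIGHT = str.maketrans({
    "\u2018": "'",   # left single quote
    "\u2019": "'",   # right single quote / apostrophe
    "\u201c": '"',   # left double quote
    "\u201d": '"',   # right double quote
})


def straighten(raw: bytes) -> str:
    """Decode Word's output (assumed Windows-1252) and straighten quotes."""
    # errors="replace" keeps going if a byte is undefined in cp1252.
    text = raw.decode("cp1252", errors="replace")
    return text.translate(SMART_TO_STRAIGHT)


if __name__ == "__main__":
    # 0x93/0x94 are the curly double quotes and 0x92 the curly
    # apostrophe in Windows-1252.
    print(straighten(b"\x93Hi\x94 it\x92s"))
```

This only touches the quote characters; anything else in the file passes through unchanged.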

Now when I use TidyUI on Windows XP, I can SEE the fancy quotes turn
into straight quotes.  But when I use tidy on the command line or
tidylib through the php extension, the substitution does *not* take
place.  (Freshly downloaded version of tidy in every case.)

On the Linux box I have "bare", "clean" and "word-2000" turned on.
(The code looks different if I turn any of them off, so I'm sure
they're getting turned on.)  What it seems to come down to is that
tidy, with the same options, cleans up *different* things on Linux than
it does on Windows.
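
One thing worth ruling out is that the two setups assume different input encodings.  That is a guess, not something confirmed here.  The Python sketch below builds a command-line call with the same three options plus explicit encodings, so both machines can be given identical settings.  The option names are from tidy's configuration list; the helper names tidy_command and run_tidy are hypothetical, and "tidy" is assumed to be on the PATH.

```python
import subprocess
from typing import List


def tidy_command(path: str) -> List[str]:
    """Build a tidy invocation with the encodings spelled out."""
    return [
        "tidy", "-q",
        "--bare", "yes",
        "--clean", "yes",
        "--word-2000", "yes",
        # Assumption: the Word files are Windows-1252 encoded.
        "--input-encoding", "win1252",
        "--output-encoding", "utf8",
        path,
    ]


def run_tidy(path: str) -> str:
    """Run tidy and return the cleaned markup as text."""
    # tidy exits with 1 on warnings and 2 on errors, so a non-zero
    # status is not treated as a failure here.
    result = subprocess.run(tidy_command(path), capture_output=True)
    return result.stdout.decode("utf-8", errors="replace")
```

If the output from this call matches what TidyUI produces, the difference was in the defaults rather than in tidy itself.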

What are my options at this point?  The users will continue to use Word
2003 -- no help there.  My web server is Apache on Linux -- that's not
going to change.  How do I get from here to there, dynamically, with no
user intervention?

Thanks very much for any and all suggestions.  If I can solve this,
I've made it that much less likely that we'll switch to IIS.

