I have hundreds of .doc files with text that I need to put on web pages.
I realize I could convert every .doc file to .txt, then use a server side include to embed the contents of each page into a webpage. This would save a lot of time because I could simply have one .php?txt=... page which will display a different .txt include depending on the link the user pressed to get there. This works perfectly content-wise.
However, all formatting is lost when the files are converted to .txt (titles should be in bold).
When I convert these .doc files to .html using Microsoft Word, the ~20 line documents become bloated .htm files of more than 300 lines (probably because each paragraph is put into textboxes).
Dreamweaver's "Clean up Word HTML" helped a bit but the code was still extremely bloated.
How would you suggest going about this?
edit: I may have solved my own question, trying to embed Google docs into my page.
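The one-page approach described in the question could be sketched roughly as follows. This is only a minimal sketch in Python rather than PHP, and it assumes things the question does not state: that every .txt file sits in one directory, that the first non-blank line of each file is the title, and that the function name `render_text_page` is just a placeholder. It also shows one way to get the bold titles back without going through Word's HTML export.

```python
import html
from pathlib import Path


def render_text_page(name: str, content_dir: Path) -> str:
    """Return an HTML fragment for content_dir/<name>.txt, title in bold."""
    # Accept only simple names so a crafted parameter such as
    # "../secret" cannot escape content_dir.
    if not name or not all(ch.isalnum() or ch in "-_" for ch in name):
        raise ValueError(f"invalid document name: {name!r}")

    path = content_dir / f"{name}.txt"
    lines = path.read_text(encoding="utf-8").splitlines()

    # Drop blank lines; the first remaining line is treated as the title
    # (an assumption about how the .txt files are laid out).
    lines = [line.strip() for line in lines if line.strip()]
    if not lines:
        return ""

    title, body = lines[0], lines[1:]
    parts = [f"<p><strong>{html.escape(title)}</strong></p>"]
    parts += [f"<p>{html.escape(line)}</p>" for line in body]
    return "\n".join(parts)
```

The name check matters more than the formatting: a page that includes whatever file the query string names is an easy target unless the parameter is restricted like this.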
There is a program suite called wv (formerly mswordview). It includes a program, wvWare, which can transform Word documents to HTML.
Furthermore you can use the output from Word and send it through tidy. This corrects markup and usually can handle the mistakes made by Word.
You can try converting the Word documents to a DocBook intermediate format, then you can easily transform the DocBook with existing tools to (X)HTML.
MS Word is bloatware. Its own markup is bloated, and therefore any attempt to automatically convert it to HTML will inherit these problems. You end up with garbage like: <strong><strong></strong></strong> for no good reason.
Dreamweaver can clean it up a lot, but nothing short of strip/remarkup is going to get you clean results.
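To give an idea of what a strip/remarkup pass might look like, here is a minimal sketch using only Python's standard `html.parser`. The whitelist of tags and the names `WordHtmlStripper` and `strip_word_html` are my own choices for illustration, not something from any of the tools mentioned above. It keeps a handful of structural tags, drops every attribute, and discards style and script content entirely.

```python
from html import escape
from html.parser import HTMLParser

# Tags worth keeping; everything else (span, font, o:p, ...) is dropped.
KEEP = {"p", "strong", "b", "em", "i", "h1", "h2", "h3", "ul", "ol", "li"}
# Elements whose text content should be thrown away as well.
SKIP_CONTENT = {"style", "script", "head"}


class WordHtmlStripper(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
        elif tag in KEEP and not self.skip_depth:
            self.out.append(f"<{tag}>")  # all attributes are dropped

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in KEEP and not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def strip_word_html(source: str) -> str:
    """Return source reduced to whitelisted tags and plain text."""
    parser = WordHtmlStripper()
    parser.feed(source)
    parser.close()
    return "".join(parser.out)
```

This does not remove empty or redundantly nested elements such as the doubled strong tags above, so a second pass (or tidy) would still be needed for those.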
That's why most people use PDFs for this type of issue.
My immediate reaction would be to convert the docs to PDFs. That will normally preserve formatting quite well, and users typically have their browsers set up to view PDFs one way or another (and the few who don't are undoubtedly accustomed to being unable to view a lot of documents on a lot of sites).
Alright thanks everyone for your suggestions, but I wanted to make this page accessible to everyone without pdf viewers as well.
Google Docs lets you bulk upload your files (and converts them for you too).
You can then export them into an iframe to embed in any HTML document.
I am trying to find a solution for a website that I made so that the word documents will not download automatically anymore, but that they display directly in a new page in the browser when I click them.
Can someone help me?? How can I do that in HTML?
Thanks a lot in advance
There are two approaches you can take for this:
Ensure that all visitors to your site have a browser extension which can display Word documents. (This isn't something you can control for a typical website but might be an option for a company Intranet)
Convert the Word documents to a format that the browser can display (i.e. HTML). You could do this manually or with code (which could be client-side or server-side). The Word document formats are notoriously complex so you would need to find a third party library that could do this for you.
I'm building a website using WordPress on localhost. I'm learning the structure of the webpage by editing the HTML and CSS using Google Developer Tools. I want to know which file I'm editing and where on the hard drive it is located.
I have edited the height and width of an element inside the marked circle, but when I try to save the file, it asks me for a save location, which I don't know. On the left is the HTML code; how can I locate the file with that HTML code?
how can I locate the file with that HTML code?
You can’t – not really, not from within your browser, because your browser doesn’t see individual “files”, it only sees the complete HTML source code of the one resource it requested, that might have been composed of lots of different files, plus functions that generate HTML code dynamically – so that actual piece of HTML code might not even be written as such within a file.
You might be able to identify different sections of the HTML document though – and with a little knowledge of the template structure and output logic of WordPress, you should be able to find out what the relevant file to look in might most likely be.
Another thing I'd suggest is that you get yourself an IDE that allows you to search across all files in the whole project folder, and then look for certain class names, IDs etc. on the HTML element in question or near/above it. If you search for those, you might get lucky as well. (Although a lot of the time those classes/IDs are output dynamically as well, so you won't find them inside a template file as such.)
Especially with little knowledge of WP template structures, it might take some trial and error to find the piece of code and file you are actually looking for.
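If you don't have an IDE at hand, the project-wide search suggested above can be approximated with a few lines of Python. This is only a sketch: the function name `find_in_project` and the list of file extensions are my own assumptions, and it does a plain substring match, nothing smarter.

```python
from pathlib import Path


def find_in_project(root, needle, suffixes=(".php", ".css", ".js", ".html")):
    """Yield (path, line_number, line) for each line under root containing needle."""
    for path in sorted(Path(root).rglob("*")):
        # Only look at regular files with one of the chosen extensions.
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="replace")
        except OSError:
            continue  # unreadable file: skip it rather than abort the search
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                yield path, number, line.strip()
```

For example, `for hit in find_in_project("wp-content", "site-header"): print(*hit)` would list every line mentioning a (hypothetical) class name `site-header`. As noted above, an empty result may simply mean the class is generated dynamically.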
The Google Developer Tools is not a code editor, so whilst you can try out different options I'm not aware that you can save it, and if you can, I wouldn't say it's a good idea.
WordPress uses PHP, a language whose code is embedded in HTML. For example, in <a href='<?php echo $link1; ?>'>Home</a> the href attribute is filled in from a PHP variable. If you want to find the HTML code, look at the PHP files in your WordPress directory; index.php is the entry point.
One thing to bear in mind is that not all the HTML code will be in one PHP file. It is usually assembled from several files, and much of the content will be in the wp-content directory, so keep an eye out for the PHP include or require statements.
Google Developer Tools is only for trying things out. Once you are done editing, you have to copy your CSS code and paste it into your CSS file.
To find the CSS file, look at the image below.
Hope your question got clarified!!
We have multiple PDFs that contain account tables and balance sheets. We have tried many converters, but the results are not satisfactory. Can anybody suggest a good converter that replicates the contents of a PDF with the exact structure in HTML? If there is a paid converter, please suggest it.
This is the PDF we want to convert and show in HTML: "http://www.marico.com/html/investor/pdf/Quarterly_Updates/Consolidated%20Financial%20Results%20-%20Q3FY11.pdf"
Have you looked into this? http://pdftohtml.sourceforge.net/
It's open source as well, so it's free and can be modified if necessary.
There's even a demo showing the before PDF and the after HTML version. Not bad if you ask me.
If you're having issues specifically with tables in PDFs, perhaps the issue is the tables themselves and whatever program is being used to generate them. Not all PDFs are created equal.
ALSO: Be aware that all PDFs that I've created and come across over the years have had lots of issues when it comes to copy/pasting blocks/lines of text that have other blocks/lines of text at equal or higher height on any given page. I think Acrobat lacks the ability to define a "sequence order" of what block is selected after what (or most programs don't use it properly), so the system sorta moves from a top-down, left-to-right way of selecting content.....even if that means jumping over large blank areas or grabbing lines from multiple columns at once when you wouldn't expect it. This may be part of your tabular data issue. Your weak link here is the PDF format itself and I think perhaps you may be expecting too much from it. Turning anything into a PDF is pretty much a one-way street, especially when you start putting lots of editable text into it.
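To illustrate the selection-order problem described above, here is a toy model in Python. I am not claiming this is how Acrobat or any converter works internally; the `Block` type, the coordinates and the two-column page are all made up for illustration. It just shows how a top-down, left-to-right ordering interleaves two columns, while an ordering that knows where the column boundary is does not.

```python
from typing import NamedTuple


class Block(NamedTuple):
    text: str
    x: float  # left edge of the text block
    y: float  # distance from the top of the page


def naive_reading_order(blocks):
    """Top-down, then left-to-right: the ordering described above."""
    return [b.text for b in sorted(blocks, key=lambda b: (b.y, b.x))]


def column_reading_order(blocks, column_split):
    """Read everything left of column_split first, then everything right of it."""
    left = sorted((b for b in blocks if b.x < column_split), key=lambda b: b.y)
    right = sorted((b for b in blocks if b.x >= column_split), key=lambda b: b.y)
    return [b.text for b in left + right]


# Hypothetical two-column layout, invented for this example.
page = [
    Block("Left 1", 50, 100), Block("Right 1", 300, 100),
    Block("Left 2", 50, 120), Block("Right 2", 300, 120),
]

print(naive_reading_order(page))        # columns interleaved line by line
print(column_reading_order(page, 200))  # left column first, then right
```

A converter has to guess something like `column_split` from the page geometry, which is presumably where tabular layouts go wrong.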
Have you tried http://www.jpedal.org/html_index.php - there is also a free online version
I'm looking to export a page that looks good in print media to Word.
Can this be done automatically, or mostly automatically, with the Office APIs?
The alternative is to create a program that reads all our style metadata and font metadata, converts it to Word and forces a download.
The issue is that our style metadata is already built for CSS; it's a web app, after all. Writing my own CSS parser doesn't sound like a good use of time.
I know this sounds too simple to be true, but I believe you can simply rename a ".html" file to ".doc" to force it to open in Word, and let Office's HTML rendering take care of the rest.
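For a web app, the same trick would amount to serving the existing HTML with a .doc filename. A minimal sketch in Python, assuming your framework lets you set response headers from a dictionary; the function name `html_as_word_download` and the default filename are placeholders of mine, and whether Word renders your particular CSS well is something you would have to test.

```python
def html_as_word_download(html_text, filename="report.doc"):
    """Return (headers, body) that serve html_text as a Word download."""
    # Word is left to interpret the HTML itself; declaring the charset
    # inside the HTML is probably wise, since nothing here enforces it.
    body = html_text.encode("utf-8")
    headers = {
        "Content-Type": "application/msword",
        "Content-Disposition": f'attachment; filename="{filename}"',
        "Content-Length": str(len(body)),
    }
    return headers, body
```

The `Content-Disposition: attachment` header is what forces the download instead of displaying the page in the browser.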
If it's for reporting purposes, and you think you might have use for more of the same in the future, you could look at something like reporting services as a way of creating a report that can be downloaded in various formats. I'm not 100% sure if the newest version allows the creation of .doc files, but you can purchase plugins to permit this.
I need to add a bunch of word documents to a wiki but want to clean up the resulting html so ideally I have text and image tags... Anyone up for a challenge? :o)
It's ok if the solution involves using a text editor and doing some "gymnastics" on it.
There are tools that perform much of this cleaning for you, like here or here and Dreamweaver includes such a tool as well.
I don't know what these tools do with images though... If you choose a more DIY route, this can help you I think.
I would copy the text out of Word and paste it into Notepad and then manually enter my images into the Wiki document.
Hi I've worked a little with Open XML.
You can either cycle through the Word document, checking each paragraph and converting each element into LiteralControl instances, or use LINQ to filter specific node sets. You could also treat your Word file as an XML node set and navigate it with XPath, LINQ to XML, or the DOM.
Just try downloading the Open XML toolset with SDK and start looking inside of your documents.
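The "cycle through each paragraph" idea does not strictly need the SDK. Here is a rough sketch in Python using only the standard library. It assumes the files are .docx (Open XML) rather than legacy binary .doc, it only looks at paragraphs, runs and the bold flag, and it ignores images, tables, lists and styles entirely. The function names are my own.

```python
import xml.etree.ElementTree as ET
import zipfile
from html import escape

# WordprocessingML namespace used inside word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def is_bold(run):
    """True if the run's properties switch bold on (w:b without a false value)."""
    props = run.find(f"{W}rPr")
    bold = props.find(f"{W}b") if props is not None else None
    return bold is not None and bold.get(f"{W}val", "true") not in ("0", "false")


def docx_to_html(path_or_file):
    """Convert the paragraphs of a .docx file into minimal HTML."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(f"{W}p"):
        pieces = []
        for run in para.iter(f"{W}r"):
            text = "".join(t.text or "" for t in run.iter(f"{W}t"))
            if not text:
                continue
            text = escape(text)
            pieces.append(f"<strong>{text}</strong>" if is_bold(run) else text)
        if pieces:  # skip empty paragraphs
            paragraphs.append(f"<p>{''.join(pieces)}</p>")
    return "\n".join(paragraphs)
```

Since a .docx is just a zip archive, unzipping one and reading word/document.xml by hand is also a quick way to start looking inside your documents.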