I can create an extensive Word document using HTML, including a cover page, header & footer, page numbers, etc.
But my problem is: when the document is too long (about 100 pages or more) and I open it with Word 2003:
the document loads and I can see the cover page,
but when I try to scroll down a little to examine the report, Word starts a long-running process (I don't know what it is) and does not respond.
If the document is about 60 pages, the process lasts about 5 minutes, and then I can navigate through the document.
I have tried the following:
Disabled Spelling and Grammar check
Disabled auto-save
Is there anyone with a similar experience? I am creating the document with HTML and a few VML tags embedded in it. What could be the cause of this unresponsive behavior?
Word was not designed to handle large documents. There are several places where its behavior is worse than O(n log n) (with n the length of the document). You need to at least disable page numbering.
If you really want to find out: create some test cases and find out:
start with plain text only, nothing fancy at all, generate 100 pages and see if the problem persists (the sketch below shows one way to generate such a baseline document).
step by step, add features until the problem surfaces (fastest is to bisect the feature difference).
it is likely that more than one feature contributes to these performance problems, so you have to be careful when bisecting: a single halving search assumes there is only one culprit.
And when you know, tell us.
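As a starting point for the plain-text baseline, something like this could generate the test files (a minimal Node.js sketch; the paragraphs-per-page figure is a rough assumption):

// Generate a plain-text-only HTML document of roughly the requested page
// count, to establish a performance baseline before adding features back in.
import { writeFileSync } from "node:fs";

function makeTestDoc(pages: number): string {
  const paragraph = "<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>\n";
  const perPage = 40; // rough guess at paragraphs per printed page
  return `<html><body>\n${paragraph.repeat(pages * perPage)}</body></html>`;
}

writeFileSync("test-100-pages.html", makeTestDoc(100));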
Related
I am converting HTML to PDFs with some page numbering requirements that are standard in book printing.
Page numbers should be hidden on some pages
Front matter, which means everything up to the actual text, should have Roman page numbers (i, ii)
Page numbers should restart from 1 at the beginning of actual text
Table of Contents should show page numbers
Hiding page numbers is very important. The other three are big nice-to-haves.
I am using html-pdf-node, which uses Puppeteer -> Chromium -> PDFium.
Since html-pdf-node only supports very simple page numbering, I am looking at possibilities of doing it at one of the deeper levels.
I tried searching the documentation of all four levels, and I checked the possibilities of removing or editing page numbers with other tools. With Chromium/PDFium the answer might still be in the docs even if I did not see it. I also thought of forking our own version of PDFium, if that is feasible.
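For reference, the layer underneath is Chromium's header/footer templating, exposed through Puppeteer; this is the "very simple page numbering" that html-pdf-node wraps. A minimal sketch (one possible workaround for the requirements above, not something html-pdf-node supports directly, is to render front matter and body as two separate passes with different footer templates and merge the resulting PDFs):

// Render HTML to PDF with Chromium's built-in footer template. Chromium
// substitutes the current page number into elements with class "pageNumber".
import puppeteer from "puppeteer";

async function render(html: string, outPath: string, footerTemplate: string) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.setContent(html, { waitUntil: "networkidle0" });
  await page.pdf({
    path: outPath,
    displayHeaderFooter: true,
    headerTemplate: "<span></span>", // suppress the default header
    footerTemplate,
    margin: { top: "20mm", bottom: "20mm" },
  });
  await browser.close();
}

// Pass 1: front matter with an empty footer (numbers hidden).
// Pass 2: body with a visible number. Roman numerals and restarting at 1
// are exactly the parts that would need the two-pass merge.
const hiddenFooter = "<span></span>";
const numberFooter =
  '<div style="font-size:10px;width:100%;text-align:center;">' +
  '<span class="pageNumber"></span></div>';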
If you view the source of the home page, there are somewhere around 14 newlines before the DOCTYPE. For logged-in pages in Seller Central, I'm seeing more than 400! In between components of the page there will often be 20 newlines of padding, and sometimes much more.
Amazon seems to be fanatical about speed, so I'm incredulous that they'd be doing this unintentionally, especially before the DOCTYPE. (Everything else could be a nicely formatted FOR loop that elects not to render some template on certain iterations, maybe?)
Could they be initiating a connection before the logic of the application is ready to start spitting out code, and "streaming" some whitespace in the meantime?
I recommend that you contact Amazon regarding your query, as only they can provide a valid answer to your question. Below is the link to their 'Contact Us' page:
https://www.amazon.com/gp/help/contact-us/general-questions.html?skip=true
Hope this helps.
My guess is that they are trying to fake a short "first byte" response time.
Many web speed tests (e.g. http://www.webpagetest.org) score it positively when the first byte reaches the browser quickly, and search engines probably value this metric too, since it is taken as a sign of a small "processing time".
Printing some whitespace while the content of the page is being computed gets something to the browser fast and can reduce "first byte" times from, say, 0.5 to 0.0 seconds.
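For illustration, the early-flush idea looks roughly like this (a hypothetical Node.js sketch; renderPage stands in for whatever slow work builds the real page):

// Send padding bytes immediately so time-to-first-byte is measured before
// the expensive page generation finishes. Purely illustrative; no claim
// that Amazon does it this way.
import { createServer } from "node:http";

async function renderPage(): Promise<string> {
  await new Promise((resolve) => setTimeout(resolve, 500)); // simulate 500 ms of work
  return "<!DOCTYPE html><html><body>Hello</body></html>";
}

createServer(async (req, res) => {
  res.writeHead(200, { "Content-Type": "text/html" });
  res.write("\n".repeat(400)); // whitespace flushed before any real work
  res.end(await renderPage());
}).listen(8080);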
Or maybe it is just a bug :)
We want to allow the users of our web application to leave notes formatted with HTML.
On the client side we provide them with CKEditor (http://ckeditor.com/), a WYSIWYG editor that generates HTML, which is then submitted to the server via a form.
We then want to display the notes created by the users with exactly the same formatting as they were submitted.
My concerns are:
Putting attacks and bad intentions aside, how can I encapsulate a note when it is displayed on the site, so that:
a. it doesn't inherit the design from the rest of the page;
b. it doesn't influence the rest of the page, for example by accidentally opening and not closing a tag, or closing one that was never opened.
Malicious code injection attacks.
At the moment the first concern is much more important, as it's an in-house product for our clients and is not open to the wider public. But security comments are very welcome as well.
Possible solutions that I consider are:
Ideally, I'm looking for a way to encapsulate these pieces of user HTML, as in: inside this area I show what you submitted (rendered, not source); you cannot influence, and are not influenced by, the code on other parts of the page.
Specifically, we thought of displaying the notes inside iframes.
The other natural direction is parsing the inserted contents and stripping stuff out.
Any inputs are welcome, and mainly:
How can I "encapsulate" the inserted contents, if I can?
Any comments on the iframe direction
Do I have to parse the contents anyway? What do I absolutely have to strip out?
How can I "encapsulate" the inserted contents, if I can?
The truth is, unless you 'fix' their code (via some kind of check), you will get issues (think broken divs, etc.). I don't see how you can encapsulate HTML from HTML. I would, however, only let them put in content like bold, italics, centering, etc.
Any comments on the iframe direction
Personally I wouldn't go that route: it opens a new can of worms for security and is not a 'clean' way of doing this.
Do I have to parse the contents anyway? What do I absolutely have to strip out?
Yes. Don't be lazy. Some devs always say "well, I don't need it, it's internal," and then it becomes an external thing, and at that point it's so big that only a full rewrite will set it right, and it keeps chugging along until something breaks; then the shit hits the fan and the big boss cries out asking why this wasn't done earlier. Long story short:
Yes, you have to parse / validate / check all your input, whether internal or external. Anything other than that is just lazy.
In closing, I would do it by using an editor like the one here on SO, which only allows certain types of selective formatting. After all, a broken <b> will not kill your whole layout; a <div> will...
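That said, if you do explore the iframe direction from the question, a sandboxed iframe with srcdoc is the standard way to keep user HTML from touching the page's styles or DOM (a minimal sketch; which sandbox tokens to allow back is a decision left to you):

// Render untrusted user HTML inside a sandboxed iframe so it cannot run
// scripts, reach the parent DOM, or inherit the parent page's CSS. Broken
// tags stay contained in the frame's own document.
function renderNote(container: HTMLElement, userHtml: string): void {
  const frame = document.createElement("iframe");
  frame.setAttribute("sandbox", ""); // empty sandbox: no scripts, no same-origin access
  frame.srcdoc = userHtml;
  container.appendChild(frame);
}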
Markdown formatting
You could use exactly the same type of intermediary solution that this site (Stack Overflow) uses for its user-generated content (questions, answers, comments).
It's not a complete replacement for WYSIWYG solutions like the editor mentioned above, but it's just what typical user-generated content requires. It even allows you to include images.
For a complete guide:
https://www.markdownguide.org/cheat-sheet
Minifying HTML is the only section of Google's Page Speed report where my site still has room for improvement.
My site is fully dynamic and the HTML is already deflated, so there is no reason to put any more pressure on the server (I don't want to minify pages in real time before sending them).
What I could do is minify the template files. My template files are a mix of PHP and HTML, so I've come up with some code that I think is pretty safe, but I would like it community-reviewed.
// This runs in a loop over all template files.
// PHP is cleaned first so that line comments do not interfere with the regex.
$original  = file_get_contents($dir.'/'.$file);       // kept for before/after comparison
$php_clean = php_strip_whitespace($dir.'/'.$file);    // strips PHP comments and whitespace
$minimized = preg_replace('/\s+/', ' ', $php_clean);  // collapse all whitespace runs to one space
This turns each of my template files into a single very long line, alternating with the places where DB content is inserted. Google's homepage source looks more or less like what I get, so I wonder if they follow a similar approach.
Question 1: Do you anticipate potential problems?
Question 2: Is there any better (more efficient) way to do this?
And please remember that I'm not trying to validate the HTML, as the templates are not valid HTML on their own (the header and footer are includes, for example).
Edit: Do take into consideration that the template files will be minified on deploy. Just as the CSS and JavaScript files are minified and compressed using YUI Compressor and Closure, the template files would be minified likewise, on deploy, not on client request.
Thank you.
Google's own Closure Templates (Soy) strips whitespace at the end of the line by default, and the template designer explicitly inserts a space using {sp}. This probably isn't a good enough reason to switch away from PHP, but I just wanted to bring it to your attention.
In addition, realize that HTML 4 allows you to exclude some tags, as recommended by the Page Speed documentation on minifying HTML (http://code.google.com/p/page-speed/wiki/MinifyHtml). You can exclude </p>, </td>, </tr>, etc. For a complete list of elements for which you can omit the end tag, search for "- O" in the HTML 4 DTD (http://www.w3.org/TR/REC-html40/sgml/dtd.html). You can even omit the <html>, <head>, <body>, and <tbody> tags entirely, as both start and end tags are optional ("O O" in the DTD).
You can also omit the quotes around attributes (http://www.w3.org/TR/REC-html40/intro/sgmltut.html#h-3.2.2) such as id, class (with a single class name), and type, when the value is simple (i.e., matches /^[-A-Za-z0-9._:]+$/). For attributes that have a single possible value, you can omit the value (e.g., write simply checked rather than checked=checked).
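For example, both of these snippets are equivalent valid HTML 4 Strict; the second just leans on the optional-tag and quoting rules (an illustrative snippet, not taken from the Page Speed docs):

<!-- fully closed and quoted -->
<table><tbody><tr><td class="num">1</td></tr></tbody></table>
<p><input type="checkbox" checked="checked"></p>

<!-- tbody omitted entirely; </td>, </tr> and </p> omitted; quotes and the
     boolean attribute value dropped -->
<table><tr><td class=num>1</table>
<p><input type=checkbox checked>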
Some people may find these tips repulsive because we've been conditioned for so many years to prepare for the upcoming world of simple LALR parsers for XHTML. Thus, tools like Dave Raggett's HTML Tidy generate HTML with proper closing tags and quotes around attribute values. But let's face it, all the browsers already have parsers that understand HTML 4, any new browser will use the HTML 5 parser rather than XHTML, and we should get comfortable writing HTML that is optimized for size.
That being said, aside from a couple of large companies like Google and Facebook, my guess is that page size is a negligible component of latency, so if you're optimizing your own site it's probably because of your own obsessive tendencies rather than performance.
White space can be significant (e.g. in pre elements).
When I had a particularly large page (i.e. large enough that there was a benefit in minifying the HTML) I used HTML Tidy and cached the results.
tidy -c -n -omit -ashtml -utf8 --doctype strict \
--drop-proprietary-attributes yes --output-bom no \
--wrap 0
I think you'll end up running into load-time issues with this approach, as the file_get_contents, php_strip_whitespace, and preg_replace calls are going to take a lot longer than whatever bandwidth the minified HTML is saving you.
I've been running tests on all my sites for a couple of weeks and I can say that this method is pretty consistent. It only affects template content, so there is little risk of messing up an unexpected <pre> or similar.
It is run before deploy, so there is no impact on the server; actually there should be a small speed-up as the files become smaller.
Do remember that content that comes from the database will not be affected, since, as said before, this runs before deploy and on template files only.
The method seems solid enough to put into production.
If anything goes wrong I'll post it here.
What work, if any, has been done to automatically determine the most important data within an HTML document? As an example, think of your standard news/blog/magazine-style website, containing navigation (possibly with submenus), ads, comments, and the prize: our article/blog/news body.
How would you determine what information on a news/blog/magazine site is the primary data, in an automated fashion?
Note: ideally, the method should work with well-formed markup and with terrible markup, whether somebody uses paragraph tags to make paragraphs or a series of breaks.
Readability does a decent job of exactly this.
It's open source and posted on Google Code.
UPDATE: I see (via HN) that someone has used Readability to mangle RSS feeds into a more useful format, automagically.
think of your standard news/blog/magazine-style website, containing navigation (possibly with submenus), ads, comments, and the prize: our article/blog/news body.
How would you determine what information on a news/blog/magazine site is the primary data, in an automated fashion?
I would probably try something like this:
open the URL
read in all links to the same website from that page
follow all links and build a DOM tree for each URL (HTML file)
this should help you come up with redundant contents (included templates and such)
compare the DOM trees of all documents on the same site (tree walking)
strip all redundant nodes (i.e. repeated, navigational markup, ads and such)
try to identify similar nodes and strip them if possible
find the largest unique text blocks that are not found in the other DOMs on that website (i.e. unique content)
add these as candidates for further processing
This approach seems pretty promising because it would be fairly simple to implement, yet still have good potential to be adaptive, even to complex Web 2.0 pages that make excessive use of templates, because it would identify similar HTML nodes across all pages on the same website.
This could probably be further improved by simply using a scoring system to keep track of DOM nodes that were previously identified as containing unique contents, so that these nodes are prioritized for other pages.
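A rough sketch of the core idea (using jsdom in Node, which is an assumption; real template detection would compare tree structure rather than just block text):

// Fetch several pages from the same site, collect the text of their block
// elements, and keep only blocks that appear on exactly one page: repeated
// blocks are almost certainly template, navigation, or ads.
import { JSDOM } from "jsdom";

async function uniqueBlocks(urls: string[]): Promise<Map<string, string[]>> {
  const counts = new Map<string, number>();     // block text -> number of pages it appears on
  const perPage = new Map<string, string[]>();  // url -> its block texts
  for (const url of urls) {
    const html = await (await fetch(url)).text();
    const doc = new JSDOM(html).window.document;
    const blocks = [...doc.querySelectorAll("p, div, td")]
      .map((el) => el.textContent?.trim() ?? "")
      .filter((text) => text.length > 80);      // skip short navigational fragments
    perPage.set(url, [...new Set(blocks)]);
    for (const b of new Set(blocks)) counts.set(b, (counts.get(b) ?? 0) + 1);
  }
  for (const [url, blocks] of perPage)
    perPage.set(url, blocks.filter((b) => counts.get(b) === 1));
  return perPage;
}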
Sometimes there is a CSS media section defined as 'print'. Its intended use is for 'Click here to print this page' links. Usually people use it to strip a lot of the fluff and leave only the meat of the information.
http://www.w3.org/TR/CSS2/media.html
I would try to read this style, and then scrape whatever is left visible.
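A sketch of that approach with a headless browser (Puppeteer's emulateMediaType is a real API; that the target site actually defines useful print CSS is the assumption):

// Apply the site's @media print rules, then read only the text that
// remains visible (innerText already skips display:none elements).
import puppeteer from "puppeteer";

async function printVisibleText(url: string): Promise<string> {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: "networkidle0" });
  await page.emulateMediaType("print");
  const text = await page.evaluate(() => document.body.innerText);
  await browser.close();
  return text;
}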
You can use support vector machines to do text classification. One idea is to break pages into different sections (say, treat each structural element like a div as a document), gather some properties of each, and convert it to a vector. (As other people suggested, these could be the number of words, the number of links, the number of images; the more the better.)
First, start with a large set of documents (100-1000) for which you have already chosen which part is the main part. Then use this set to train your SVM.
Then, for each new document, you just need to convert it to a vector and pass it to the SVM.
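The feature-extraction step might look like this (an illustrative sketch; the classifier itself would come from a library, and the feature set is just the properties suggested above):

// Turn each structural element into a numeric feature vector for a
// classifier (SVM, naive Bayes, ...). The chosen features are illustrative.
function featureVector(el: HTMLElement): number[] {
  const text = el.innerText;
  return [
    text.split(/\s+/).filter(Boolean).length,  // word count
    el.querySelectorAll("a").length,           // link count
    el.querySelectorAll("img").length,         // image count
    (text.match(/[.!?]/g) ?? []).length,       // sentence-ending punctuation
  ];
}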
This vector model is actually quite useful in text classification, and you do not necessarily need to use an SVM; you can use a simpler Bayesian model as well.
If you are interested, you can find more details in Introduction to Information Retrieval (freely available online).
I think the most straightforward way would be to look for the largest block of text without markup. Then, once it's found, figure out its bounds and extract it. You'd probably want to exclude certain tags from "markup", like links and images, depending on what you're targeting. If this will have an interface, maybe include a checkbox list of tags to exclude from the search.
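A sketch of that first option (scoring only direct text-node children, so wrapper elements don't win simply by containing everything):

// Walk the DOM and score each element by the length of its direct text
// content; the highest-scoring element is the candidate content block.
function largestTextBlock(doc: Document): Element | null {
  let best: Element | null = null;
  let bestLen = 0;
  for (const el of doc.querySelectorAll("body *")) {
    let len = 0;
    for (const child of el.childNodes) {
      if (child.nodeType === Node.TEXT_NODE) {
        len += (child.textContent ?? "").trim().length;
      }
    }
    if (len > bestLen) {
      bestLen = len;
      best = el;
    }
  }
  return best;
}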
You might also look for the lowest level in the DOM tree and figure out which of those elements is the largest, but that wouldn't work well on poorly written pages, as the DOM tree is often broken on such pages. If you end up using this, I'd come up with some way to check whether the browser has entered quirks mode before trying it.
You might also try using several of these checks and then coming up with a metric for deciding which result is best. For example, still use my second option above, but give its result a lower "rating" if the browser would normally enter quirks mode. Going this route would obviously impact performance.
I think a very effective algorithm for this might be, "Which DIV has the most text in it that contains few links?"
Seldom do ads have more than two or three sentences of text. Look at the right side of this page, for example.
The content area is almost always the area with the greatest width on the page.
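That question turns into a simple score (a sketch; the link penalty weight is arbitrary):

// Score each div by its text length, penalized by text that sits inside
// links; the best-scoring div is the likely content area.
function bestDiv(doc: Document): HTMLElement | null {
  let best: HTMLElement | null = null;
  let bestScore = -Infinity;
  for (const div of doc.querySelectorAll<HTMLElement>("div")) {
    let linked = 0;
    for (const a of div.querySelectorAll<HTMLElement>("a")) {
      linked += a.innerText.length;
    }
    const score = div.innerText.length - 2 * linked; // weight of 2 is arbitrary
    if (score > bestScore) {
      bestScore = score;
      best = div;
    }
  }
  return best;
}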
I would probably start with the title and anything else in a head tag, then filter down through heading tags in order (i.e. h1, h2, h3, etc.)... beyond that, I guess I would go in order, from top to bottom. Depending on how the page is styled, it may be a safe bet to assume the page title has an ID or a unique class.
I would look for sentences with punctuation. Menus, headers, footers, etc. usually contain separate words, but not sentences containing commas and ending in a period or equivalent punctuation.
You could look for the first and last element containing sentences with punctuation and take everything in between. Headings are a special case, since they usually don't have punctuation either, but you can typically recognize them as Hn elements immediately before sentences.
While this is obviously not the answer, I would assume that the important content is located near the center of the styled page and usually consists of several blocks interrupted by headlines and such. The structure itself may be a give-away in the markup, too.
A diff between articles/posts/threads would be a good filter for finding the content that distinguishes a particular page (obviously this would have to be augmented to filter out random cruft like ads, "quote of the day" boxes, or banners). The structure of the content may be very similar across multiple pages, so don't rely on structural differences too much.
Instapaper does a good job with this. You might want to check Marco Arment's blog for hints about how he did it.
Today most news/blog websites are built on a blogging platform.
So I would create a set of rules to search for content.
For example, two of the most popular blogging platforms are WordPress and Google's Blogspot.
Wordpress posts are marked by:
<div class="entry">
...
</div>
Blogspot posts are marked by:
<div class="post-body">
...
</div>
If the search by CSS classes fails, you could fall back to the other solutions, identifying the biggest chunk of text and so on; a sketch of the rule-based lookup follows.
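A minimal sketch (the selector list covers just the two platforms mentioned above):

// Try known platform selectors first; fall back to a generic heuristic
// such as the largest-text-block approach if none of them match.
const PLATFORM_SELECTORS = ["div.entry", "div.post-body"]; // WordPress, Blogspot

function findPostBody(doc: Document): Element | null {
  for (const selector of PLATFORM_SELECTORS) {
    const el = doc.querySelector(selector);
    if (el) return el;
  }
  return null; // caller falls back to a heuristic
}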
As Readability is not available anymore:
If you're only interested in the outcome, you can use Readability's successor Mercury, a web service.
If you're interested in some code how this can be done and prefer JavaScript, then there is Mozilla's Readability.js, which is used for Firefox's Reader View.
If you prefer Java, you can take a look at Crux, which also does a pretty good job.
Or if Kotlin is more your language, you can take a look at Readability4J, a port of the Readability.js mentioned above.
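For reference, using Readability.js from Node looks roughly like this (a minimal sketch assuming the @mozilla/readability and jsdom packages):

// Parse a fetched page and extract the main article content.
import { Readability } from "@mozilla/readability";
import { JSDOM } from "jsdom";

async function extract(url: string) {
  const html = await (await fetch(url)).text();
  const dom = new JSDOM(html, { url }); // the url option resolves relative links
  return new Readability(dom.window.document).parse(); // { title, content, textContent, ... }
}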