python-docx: correlating tables and paragraphs

I've been successful in modifying a well-structured source document with python-docx by examining styles and looking for text patterns. Thanks to all who have developed and contributed. One vexing problem is understanding where the tables are. One of my needs is to delete an entire section of the doc. Deleting the paragraphs gets the text and graphics but misses the tables. My current table deletion routine relies on knowing where in the document the tables live. Is there a way to correlate the location of a table relative to the paragraphs (or some other element) around it?
Thanks,
-Chris
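For reference, one approach that is often suggested is to walk the children of the document body in document order and check each element's tag; that tells you exactly which paragraphs a given table sits between. A minimal sketch, assuming a reasonably recent python-docx where the underlying body element is reachable via document.element.body (the helper name iter_block_items and the file name are my own):
    from docx import Document
    from docx.oxml.ns import qn
    from docx.table import Table
    from docx.text.paragraph import Paragraph

    def iter_block_items(document):
        # Yield Paragraph and Table objects in the order they appear in the body.
        for child in document.element.body.iterchildren():
            if child.tag == qn('w:p'):
                yield Paragraph(child, document)
            elif child.tag == qn('w:tbl'):
                yield Table(child, document)

    doc = Document('source.docx')
    for index, block in enumerate(iter_block_items(doc)):
        if isinstance(block, Table):
            print(index, 'TABLE with', len(block.rows), 'rows')
        else:
            print(index, 'PARAGRAPH:', block.style.name, '|', block.text[:40])
Once a table is located between the paragraphs that bound the section you want to delete, it can be removed the same way the paragraphs are, e.g. at the lxml level with block._tbl.getparent().remove(block._tbl).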

Related

Storing site data in columns or rows

This is a question about the best practice for storing data from a web page: texts, image URLs, links, etc.
I have a CMS where you can create web pages and edit texts/upload images. In the future it would also be nice to add new elements, add links to a-tags, etc.
I need a robust and flexible solution that also has good performance, both when storing and retrieving this data.
Let's say I have 1000 pages, each with around 25 elements that can be updated and stored in the database.
Alternative 1)
Create a table with one column for each element on these pages, for example columns like:
title_1, title_2, image_1, image_2.
Here we have a set of columns that we can update, these we can use on the web page.
Alternative 2)
Create 1 table with the columns (id, namespace, page_id, data)
For each element on the page I add a namespace, which in combination with the page_id makes the data output unique. In the data column I can store any kind of information: text, links, etc.
What do you suggest as a good solution for this issue? I'm of course also open to other alternatives.
Thanks!
I would recommend option two, with the addition of a column identifying the element id or type, if indeed the element id is somehow comparable. That is to say, if anchor text (say) is always stored as element id = 4, then keeping that id means you can compare anchor texts across multiple documents.
If, on the other hand (and this is the scenario I imagine is more likely), you may have 1-25 elements on a page and each of them could be different (e.g. document one has three anchor texts and four images, document two has one anchor text and no images, etc.), it would make sense to add an element_type_id table that stores a bit of information about the element types. This assumes that you ever have any interest in comparing (say) images across multiple documents, or anchor texts across multiple documents, etc.
Another thing to consider: if you are likely to see the same element over and over again, it actually makes more sense to effectively parameterize those elements by way of a lookup table. So basically store each (say) unique anchor text in one table and reference its id in your actual data table.
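To make that concrete, here is a rough sketch of what the normalized version of alternative 2 could look like; the table and column names (element_type, page_element) are illustrative only, and SQLite is used here just because it needs no setup:
    import sqlite3

    conn = sqlite3.connect('cms.db')
    conn.executescript("""
    CREATE TABLE IF NOT EXISTS element_type (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL                    -- e.g. 'title', 'image', 'anchor_text'
    );
    CREATE TABLE IF NOT EXISTS page_element (
        id              INTEGER PRIMARY KEY,
        page_id         INTEGER NOT NULL,
        namespace       TEXT    NOT NULL,     -- unique per element within a page
        element_type_id INTEGER NOT NULL REFERENCES element_type(id),
        data            TEXT,
        UNIQUE (page_id, namespace)
    );
    """)

    # All elements for one page in a single query.
    rows = conn.execute(
        "SELECT namespace, data FROM page_element WHERE page_id = ?", (42,)
    ).fetchall()
Whether the data column stays free-form text or is further split out (as with the lookup table for repeated anchor texts) depends on how much you need to query inside it.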
If I may add one additional thing: SO may not be the best place for the particular question you are asking. I'm not totally sure of that and maybe I'm wrong, but I would poke around the Stack Exchange network and see if other forums more closely deal with the type of question you are asking. At the very least, I'd observe that your question is fairly vague, and the goal of a "robust and flexible solution that also has good performance" is not likely to be reached simply by asking for advice on SO. There is a LOT that goes into data architecture, and many of the details I would consider important in designing this myself are not present in your question. And if you're not sure what those details are, I am not sure SO is really the best place to set about learning them. I think https://softwareengineering.stackexchange.com/ may be a better fit for this question.
Just my opinion, and I could be wrong. Either way, I would consider learning a bit about database normal forms (http://www.bkent.net/Doc/simple5.htm or Google it) as well as do a little research on the types of design considerations that go into building a database (an old but still good SO article on that is here: What are the most important considerations when designing a database?)

Avoid HTML table cells being cut when printed

I have an HTML document with many tables that I want to print. The problem is that sometimes the end of the page is reached in the middle of a row, so half of the row is printed on one page and the rest on the next, sometimes even cutting a single line of text in two.
Is there any way to avoid this?
NOTE: I have already read this question, but I need a solution that does not involve CSS, because CSS is not working on the target computer and I can't change that.
Even with CSS, the issue is difficult due to limited browser support for CSS pagination (as can be seen from the answers to the question you refer to).
This problem has existed for years, and I don't think anyone has come up with an HTML trick for the purpose. There have been some tricks for trying to prevent page breaks inside a paragraph or list by placing it in a one-cell table, but this has only worked occasionally, and besides, in your case you already have a table.
So I'm afraid there is no solution, apart from using elements that cause extra vertical spacing, like a pre element containing empty lines (to push the entire table to the next page; this may of course make things much worse when the parameters of the situation, like page formatting and paper size, differ from your expectations) or splitting a table into two tables, possibly with extra space between them (even more problematic).
If the target computer doesn't support (enough of) CSS, then you can create a PDF document on the server. If you set the Content-Type correctly, the browser will download the document and start the PDF reader of the system.
If this isn't possible, then there is no solution.
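If the server-side PDF route is open and a scripting language is available on the server, the conversion itself can be quite small. A sketch using Python and WeasyPrint (one HTML-to-PDF library among several; the file names are placeholders):
    from weasyprint import HTML, CSS

    # The PDF renderer applies pagination rules itself, so the target
    # browser's missing CSS support no longer matters.
    no_split_rows = CSS(string="tr { page-break-inside: avoid; }")

    HTML(filename="report.html").write_pdf("report.pdf", stylesheets=[no_split_rows])
Served with Content-Type: application/pdf, the browser hands the result to the system's PDF reader, as described above.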

Matching two tables in HTML

I have a table that is filled with data from a database. When you scroll down, though, the headings are lost.
To solve this problem I put the headings in a separate table above the table that holds the data. The headings are taken from the database along with the data.
The problem is that the content in the data table changes the column widths, so the headings in the other table no longer line up properly.
Does anyone know how I can make the heading table copy the data table's column widths?
I am editing my predecessors code which is a .jsp file which I am told is a form of Java.
This is not possible in pure HTML: You would have to use JavaScript to match the cell sizes.
However, there are better approaches to your problem. See this question for an overview:
HTML table with fixed headers?
Here is a CSS-only approach that looks nice.
I decided to make each of the columns a fixed width. Not the best solution, but since it is an in-house application it will do.

Programmatically detecting "most important content" on a page

What work, if any, has been done to automatically determine the most important data within an HTML document? As an example, think of your standard news/blog/magazine-style website, containing navigation (possibly with submenus), ads, comments, and the prize: our article/blog/news body.
How would you determine what information on a news/blog/magazine is the primary data in an automated fashion?
Note: ideally, the method would work with both well-formed markup and terrible markup, whether somebody uses paragraph tags to make paragraphs or a series of breaks.
Readability does a decent job of exactly this.
It's open source and posted on Google Code.
UPDATE: I see (via HN) that someone has used Readability to mangle RSS feeds into a more useful format, automagically.
think of your standard news/blog/magazine-style website, containing navigation (possibly with submenus), ads, comments, and the prize: our article/blog/news body.
How would you determine what information on a news/blog/magazine is the primary data in an automated fashion?
I would probably try something like this:
open URL
read in all links to the same website from that page
follow all links and build a DOM tree for each URL (HTML file)
this should help you identify redundant content (included templates and such)
compare DOM trees for all documents on the same site (tree walking)
strip all redundant nodes (i.e. repeated navigational markup, ads and such things)
try to identify similar nodes and strip them if possible
find the largest unique text blocks that are not found in other DOMs on that website (i.e. unique content)
add them as candidates for further processing
This approach seems pretty promising because it would be fairly simple to do, but still has good potential to be adaptive, even to complex Web 2.0 pages that make excessive use of templates, because it would identify similar HTML nodes between all pages on the same website.
This could probably be further improved by using a scoring system to keep track of DOM nodes that were previously identified to contain unique content, so that these nodes are prioritized for other pages.
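A rough sketch of the core of that idea in Python, comparing a page against a few sibling pages from the same site and keeping only the blocks that do not repeat; requests and BeautifulSoup are used purely for illustration, and the similarity test (exact text match) is deliberately crude:
    import requests
    from bs4 import BeautifulSoup
    from collections import Counter

    def block_texts(url):
        # Text of every block-level candidate node on a page.
        soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        blocks = soup.find_all(["div", "section", "article", "td", "li"])
        return [b.get_text(" ", strip=True) for b in blocks if b.get_text(strip=True)]

    def unique_content(target_url, sibling_urls):
        # Keep blocks of the target page that never appear on the sibling pages.
        seen = Counter()
        for url in sibling_urls:
            seen.update(set(block_texts(url)))
        candidates = [t for t in block_texts(target_url) if seen[t] == 0]
        # The largest surviving block is the best guess for the main content.
        return max(candidates, key=len) if candidates else None
A real implementation would compare node paths or structure rather than exact text, and would cache the DOM trees instead of re-fetching pages.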
Sometimes there's a CSS media section defined as 'print'. Its intended use is for 'Click here to print this page' links. Usually people use it to strip a lot of the fluff and leave only the meat of the information.
http://www.w3.org/TR/CSS2/media.html
I would try to read this style, and then scrape whatever is left visible.
You can use support vector machines to do text classification. One idea is to break pages into different sections (say, treat each structural element like a div as a document) and gather some properties of it and convert it to a vector. (As other people suggested, this could be the number of words, number of links, number of images; the more the better.)
First start with a large set of documents (100-1000) for which you have already chosen which part is the main part. Then use this set to train your SVM.
For each new document you then just need to convert it to a vector and pass it to the SVM.
This vector model is actually quite useful in text classification, and you do not necessarily need to use an SVM. You can use a simpler Bayesian model as well.
And if you are interested, you can find more details in Introduction to Information Retrieval. (Freely available online)
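A minimal sketch of that pipeline with scikit-learn and BeautifulSoup; the per-block features (word count, link count, image count) follow the suggestion above, and the tiny inline pages only stand in for the 100-1000 hand-labelled documents:
    from bs4 import BeautifulSoup
    from sklearn.svm import LinearSVC

    def block_features(node):
        # Per-block features: word count, link count, image count.
        text = node.get_text(" ", strip=True)
        return [len(text.split()), len(node.find_all("a")), len(node.find_all("img"))]

    def blocks(html):
        return BeautifulSoup(html, "html.parser").find_all(["div", "section", "article"])

    # Toy stand-in for the labelled set: (html, index of the main-content block).
    labelled_pages = [
        ("<div><a href='#'>home</a> <a href='#'>about</a></div>"
         "<div>First article body. " + "word " * 50 + "</div>", 1),
        ("<div><img src='ad.png'><a href='#'>buy</a></div>"
         "<div>Second article body. " + "text " * 60 + "</div>", 1),
    ]

    X, y = [], []
    for html, main_index in labelled_pages:
        for i, block in enumerate(blocks(html)):
            X.append(block_features(block))
            y.append(1 if i == main_index else 0)

    clf = LinearSVC().fit(X, y)

    # For a new page, keep the block the classifier scores highest.
    new_blocks = blocks("<div><a href='#'>nav</a></div><div>" + "content " * 80 + "</div>")
    scores = clf.decision_function([block_features(b) for b in new_blocks])
    print(new_blocks[int(scores.argmax())].get_text()[:60])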
I think the most straightforward way would be to look for the largest block of text without markup. Then, once it's found, figure out the bounds of it and extract it. You'd probably want to exclude certain tags from "not markup" like links and images, depending on what you're targeting. If this will have an interface, maybe include a checkbox list of tags to exclude from the search.
You might also look for the lowest level in the DOM tree and figure out which of those elements is the largest, but that wouldn't work well on poorly written pages, as the DOM tree is often broken on such pages. If you end up using this, I'd come up with some way to see if the browser has entered quirks mode before trying it.
You might also try using several of these checks, then coming up with a metric for deciding which is best. For example, still try to use my second option above, but give its result a lower "rating" if the browser would enter quirks mode normally. Going with this would obviously impact performance.
I think a very effective algorithm for this might be, "Which DIV has the most text in it that contains few links?"
Seldom do ads have more than two or three sentences of text. Look at the right side of this page, for example.
The content area is almost always the area with the greatest width on the page.
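A sketch of that heuristic in Python: score each div by how much text it holds, discounted by how much of that text sits inside links. The tag choice and the scoring formula are arbitrary, and since nested containers inherit their children's text, a real implementation (Readability, for instance) scores paragraphs and credits their parents instead:
    from bs4 import BeautifulSoup

    def main_content_div(html):
        # Pick the div with the most text and the lowest share of link text.
        soup = BeautifulSoup(html, "html.parser")
        best, best_score = None, 0.0
        for div in soup.find_all("div"):
            text = div.get_text(" ", strip=True)
            if not text:
                continue
            link_chars = sum(len(a.get_text(strip=True)) for a in div.find_all("a"))
            score = len(text) * (1.0 - link_chars / len(text))
            if score > best_score:
                best, best_score = div, score
        return best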
I would probably start with the Title and anything else in a Head tag, then filter down through heading tags in order (i.e. h1, h2, h3, etc.)... beyond that, I guess I would go in order, from top to bottom. Depending on how it's styled, it may be a safe bet to assume a page title would have an ID or a unique class.
I would look for sentences with punctuation. Menus, headers, footers, etc. usually contain separate words, but not sentences containing commas and ending in a period or equivalent punctuation.
You could look for the first and last element containing sentences with punctuation and take everything in between. Headers are a special case since they usually don't have punctuation either, but you can typically recognize them as Hn elements immediately before sentences.
While this is obviously not the answer, I would assume that the important content is located near the center of the styled page and usually consists of several blocks interrupted by headlines and such. The structure itself may be a give-away in the markup, too.
A diff between articles / posts / threads would be a good filter to find out what content distinguishes a particular page (obviously this would have to be augmented to filter out random crap like ads, "quote of the day"s or banners). The structure of the content may be very similar for multiple pages, so don't rely on structural differences too much.
Instapaper does a good job with this. You might want to check Marco Arment's blog for hints about how he did it.
Today most news/blog websites are using a blogging platform, so I would create a set of rules by which to search for content.
For example, two of the most popular blogging platforms are WordPress and Google Blogspot.
Wordpress posts are marked by:
<div class="entry">
...
</div>
Blogspot posts are marked by:
<div class="post-body">
...
</div>
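In Python, that rule set could be as small as the following; the class names are just the two mentioned above plus a couple of common variants, so treat the list as a starting point rather than anything authoritative:
    from bs4 import BeautifulSoup

    # div classes used by popular platforms for the post body; extend as needed.
    KNOWN_CONTENT_CLASSES = ["entry", "post-body", "entry-content", "post-content"]

    def content_by_class(html):
        soup = BeautifulSoup(html, "html.parser")
        for cls in KNOWN_CONTENT_CLASSES:
            node = soup.find("div", class_=cls)
            if node:
                return node
        return None  # fall back to the other heuristics discussed here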
If the search by CSS classes fails, you could fall back to the other solutions: identifying the biggest chunk of text, and so on.
As Readability is not available anymore:
If you're only interested in the outcome, you can use Readability's successor Mercury, a web service.
If you're interested in some code showing how this can be done and prefer JavaScript, there is Mozilla's Readability.js, which is used for Firefox's Reader View.
If you prefer Java, you can take a look at Crux, which also does a pretty good job.
Or if Kotlin is more your language, you can take a look at Readability4J, a port of the above Readability.js.

If you were programming a calendar in HTML would you use Table tags or Div tags?

I converted my company's calendar to XSL and changed all the tables to divs. It worked pretty well, but I had a lot of 8 day week bugs to work out initially owing to precarious cross-browser spacing issues. But I was reading another post regarding when to use tables v. divs and the consensus seemed to be that you should only use divs for true divisions between parts of the webpage, and only use tables for tabular data.
I'm not sure I could even have used tables with XSL but I wanted to follow up that discussion of Divs and Tables with a discussion of the ideal way to make a web calendars and maybe a union of the two.
A calendar is the perfect reason to use a table! Calendars inherently present tabular data and HTML tables are good at presenting tabular data. And HTML table markup provides nearly all the CSS hooks you need to associate CSS selectors with various parts of the table to dress it up.
I'm all for using DIVs for layout--but stick with tables for tabular data.
Here is a cool article on how to dress up tables with CSS:
http://www.smashingmagazine.com/2008/08/13/top-10-css-table-designs
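As a side note, if the calendar markup is generated rather than hand-written, Python's standard library already takes the table view: calendar.HTMLCalendar emits a month as a plain HTML table with CSS classes on every cell, which can then be styled along the lines of the article above:
    import calendar

    cal = calendar.HTMLCalendar(firstweekday=calendar.SUNDAY)
    # formatmonth() returns a <table class="month"> with one <tr> per week
    # and a class-tagged <td> for each day.
    print(cal.formatmonth(2008, 9))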
I would say that a calendar is a table, therefore making the table the proper markup for its representation.
Edit: Definition 11 for "table" from answers.com says:
An orderly arrangement of data, especially one in which the data are arranged in columns and rows in an essentially rectangular form.
I think this is definitely a case for using tables. The biggest issue when using divs would be box height for each individual day. If you're styling each box with a border, they could look off if the content for one day is longer than another. The additional markup to make it look right would be more than it would take to create it with a table, so I don't think divs are worth the extra effort in this case.
It makes sense to use tables, but if you were to look at Google Calender, they seem to be using div tags. It is possible that using div tags lowers the file size, so in an enterprise environment it might be worth the 'trouble'.
Do it up in a table.
Also, don't think of it as "divs vs. tables". Think of it as tables vs. a proper semantic tag with meaning. When I author pages I try to use divs as little as possible; in a lot of cases you could be using a paragraph, a list item, etc.
You might also consider an ordered list (weeks) of ordered lists (days), or simply one ordered list (days).
There are others who agree that the list approach is a good one.
Others prefer tables.
Just came across this thread after posing the same question elsewhere. While I completely agree a calendar is more of a tabular representation of data, I think there's truth in the prolific "it depends" answers. For example, I want to show a floating DIV popup when each day in the calendar is moused over. Using a table, the popup flickers as the cursor moves across the calendar since the popup is only active on the cell border and the day number in the cell itself. Using DIVs, the popup is solid (no flicker) the entire time the cursor mouses over the calendar cell.
Tables are for displaying tabular data. So I would say <table> is ideal.