I'm writing a stylesheet to process RSS / Atom feeds on Dreamwidth.org (a LiveJournal-based blogging site).
One of the feeds I've got insists on prefixing paragraphs with a run of non-breaking spaces ("&nbsp;&nbsp;&nbsp;&nbsp;Text ...") to create an indented-paragraph effect. I prefer my paragraphs to align flush with the left of the page. I don't have the ability to rewrite the content or write Javascript to deal with this, and prevailing on arbitrary bloggers to change their habits seems somewhat untenable ...
My understanding of CSS is that HTML entities and characters cannot themselves be used as selectors.
Update: the site I'm styling is here: http://dredmorbius.dreamwidth.org/read/
The specific feed in question is James Howard Kunstler's piquantly named blog, and this entry in particular has the non-breaking-space paragraph indents.
For a comparative rendering on DreamWidth see here.
Your understanding is correct. HTML entities are regular characters, simply expressed in a different HTML syntax, so as far as CSS is concerned they are the same thing and cannot be targeted with a selector.
You're going to have to use text-indent in the paragraphs that you're styling and experiment with different negative values to get as close to a perfect reverse indent (or outdent) as possible, depending on the font that the text is being rendered in.
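For example, a minimal sketch of the idea (the .entry-content p selector is an assumption - substitute whatever selector actually matches the feed's paragraphs in your layout):

/* Hypothetical selector; match it to the feed's paragraphs.  */
/* The negative indent should roughly equal the width of the  */
/* leading non-breaking spaces, so tune the value per font.   */
.entry-content p {
    text-indent: -1.5em;
}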
Why use HTML5 semantic tags like <header>, <section>, <nav>, and <article> instead of simply <div> with the appropriate CSS applied?
I created a webpage using those tags, but they don't render any differently from div. What is their main purpose?
Is it just about having more descriptive names for the tags, or is there more to it?
Please explain. I have gone through many sites, but I could not find an explanation of these basics.
The Oxford Dictionary states:
semantics: the branch of linguistics and logic concerned with meaning.
As their name says, these tags are meant to improve the meaning of your web page. Good semantics plays an important role in the automated processing of documents. This automated processing happens more often than you realize - every search-engine ranking is derived from automated processing of all the websites out there.
If you visit a (well designed) web page, you as the human reader can immediately (visually) distinguish all the page elements and more importantly understand the content. In the top left you see the company logo, next to it is the site navigation, there is a search bar and some text about the company, a link to a product you can buy and a legal disclaimer at the bottom.
However, machines are dumb and cannot do this:
Looking at the same page as you, all the web crawler would see is an image, a list of anchor tags, a text node, an input field and an image with a link on it. At the bottom there is another text node.
Now, how should it know what part of the document you intended to be the navigation, or the main article, or some not-so-important footnote? It can guess by analyzing your document structure for common criteria that hint at a specific element.
E.g. a ul list of internal links is most likely some kind of page navigation, and the text at the end of the document is something necessary but not so important to the everyday viewer (the legal disclaimer).
Now imagine that instead of a plain div, a nav element were used – the machine immediately knows the purpose of this element:
<!-- machine: okay, this structure looks like it might be a navigation element? -->
<div><ul><li><a href="internal_link">...</a></li></ul></div>

<!-- machine: ah, a navigation element! -->
<nav><ul><li><a href="internal_link">...</a></li></ul></nav>
Now the text inside a main tag – this is clearly the most important information of the page! Over there to the left, that text node, the image and the anchor node all belong together, because they are grouped inside a section tag, and down at the bottom there is some text inside a footer element (they still don't know the meaning of that text, but now they can deduce it's some sort of fine print).
Example:
You, as the user (reading a page without seeing the actual markup), don't care whether an element is enclosed in an <i> or an <em> tag. In most browsers both of these tags are rendered identically – as italic text – and as long as it stands out from the surrounding text it serves its purpose.
However, there is a big difference in terms of semantics:
<i> means italic - it's simply a presentational hint telling the browser to render the text in italics, and does not necessarily carry deeper semantic information.
<em> means emphasize - it indicates an important piece of information. The browser is no longer bound to an italic instruction: it could render the text in italic or bold or underlined or in a different color, and for visually impaired persons a screen reader can raise its voice - whatever method seems most suited in the specific situation to emphasise this important information.
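A minimal illustration (the sentences are made up):

<!-- Both typically render in italics, but only <em> carries meaning. -->
<p>The title <i>Moby-Dick</i> is set in italics by convention.</p>
<p>You must <em>never</em> open this door.</p>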
Final thought:
Semantic tags are not the end. There are things like metadata, ontologies, and resource description languages which go a step further, help connect data between different web pages, and can even help create new knowledge!
E.g. Wikipedia is doing a really bad job at presenting data semantically.
https://en.wikipedia.org/wiki/Barack_Obama
https://en.wikipedia.org/wiki/Donald_Trump
https://en.wikipedia.org/wiki/Joe_Biden
All three are persons who at some point in time were president of the USA.
All three articles contain a sidebar that displays this information, and you can compare them (by opening the pages and switching back and forth), but the data is not semantically described.
If instead Wikipedia used an ontology to describe a person, e.g. http://dbpedia.org/ontology/Person:
<!-- President is a subclass of Politician which is a subclass of Person -->
<President>
<birthname>Barack Hussein Obama II</birthname>
<birthdate>1961-08-04</birthdate>
<headOf>country::USA</headOf>
<tenure>2009-01-20 – 2017-01-20</tenure>
</President>
Not only could you (and machines) now compare those three directly (on a dynamically generated page!), but you could even create new knowledge, e.g. show a list of all presidents of the United States - quite boring, but also cool stuff like: who are all the current world leaders, how many female world leaders do we have, who is the youngest leader, how many types of leaders are there (presidents/emperors/queens/dictators), who served the longest, how many of them are taller than 175 cm and have brown eyes, etc.
In conclusion, good semantics is super cool (but also – on a technical level – hard to achieve and maintain).
There's a nice little article on HTML5 semantics on HTML5Doctor.
Semantics have been a part of HTML in some form or another; they help you understand what's happening where on the page.
Earlier, when <div> was used for pretty much everything, we still implemented semantics by giving it a "semantic" class or id name.
These tags help in proper structuring and understanding of the layout.
If you do,
<div class="nav"></div>
as opposed to,
<nav></nav>
OR
<div class="sidebar"></div>
as opposed to,
<aside></aside>
there's nothing wrong with the former, but the latter provides better readability for you as well as for crawlers, screen readers, etc.
With a div tag you have to give an id or class that describes what kind of content it is holding - body, header, footer, etc.
With HTML5's semantic elements, the name itself clearly defines what kind of content the element holds and which part of the website it belongs to.
Semantic elements are <header>, <footer>, <section>, <aside>, etc.
I'm working on an online audio book whose content comes from a Word file. Of course my content is Unicode (Persian script, similar to Arabic). Text copied from the Word document and pasted into my application shows small differences in character/word spacing and/or justification rules. I want to make my HTML match my Word file exactly (I've applied the correct padding in my HTML to simulate the Word margins; the Word document is also on DIN A4 paper, whose size is emulated in my DIV element). What are the best practices for achieving the closest possible similarity between the two?
Rather than reinventing the wheel, there are a large number of online Word >> HTML converters available - many of which are free! I would suggest leaning on these to at least get a baseline for your HTML, and then making any further enhancements on top of that.
Personally I've used https://cloudconvert.com/doc-to-html before, they have an API also so you could automate the conversion.
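If you'd rather convert locally and keep the pipeline in your own code, here's a small sketch using Python's mammoth package (my assumption - any docx-to-HTML library would do; the filename is hypothetical):

# pip install mammoth  (assumed library choice)
import mammoth

with open("book.docx", "rb") as docx_file:   # hypothetical filename
    result = mammoth.convert_to_html(docx_file)

html = result.value             # the generated HTML
for message in result.messages:
    print(message)              # conversion warnings, if any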
Hope this helps!
In the normal case I can separate the text and the style, but how should I do it when the text is dynamic (editable by the admin user)? The user of course wants to use bold, italic, etc., but if I provide a common HTML editor I (think I) break the rule of separation, because there will be HTML elements in the text. (I could use BB codes, but it amounts to the same thing.)
In the long term I think this can cause problems when I want to use the text in a non-HTML environment. Of course I can strip the HTML tags, but that's not the approach I'd like to take (not because it won't work, but because of the original theoretical issue).
In some cases I could break the sentences apart to solve this problem, but I think that's a bad approach, because the parts are meaningless on their own, and it wouldn't be easily editable either.
Is there any good solution for this?
That's perfectly OK.
You give the user the opportunity to set some attributes for the text (BBCodes recommended).
That is content. Interpreting those attributes and styling the result is then part of the design.
For example you may provide the feature to let the user define something like [headline]MyHeadline[/headline]. This is pure content.
How to replace [headline] with HTML and how to style the resulting text is up to the design.
Edit: I recommend BBCodes because they provide a closed set of features, which may be easier to deal with. In another context you could simply interpret them again, instead of stripping out HTML.
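A minimal sketch of the interpretation step in Python (the tag names and the HTML mapping are just examples; real code should also HTML-escape the input first):

import re

# Example mapping from BBCode tags to HTML; extend as needed.
BB_TO_HTML = {
    "headline": "h2",
    "b": "strong",
    "i": "em",
}

def bbcode_to_html(text):
    for bb, html_tag in BB_TO_HTML.items():
        pattern = re.compile(
            r"\[%s\](.*?)\[/%s\]" % (re.escape(bb), re.escape(bb)),
            re.DOTALL,
        )
        text = pattern.sub(r"<%s>\1</%s>" % (html_tag, html_tag), text)
    return text

print(bbcode_to_html("[headline]MyHeadline[/headline]"))
# -> <h2>MyHeadline</h2>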
If the tags entered are semantic, i.e. using an <i> tag for italic rather than style="font-style:italic", then your design and content are still separate.
Separating design and content is about separating a site's presentation from the readable code, rather than removing the markup altogether.
I'd advise you focus on Semantic HTML.
I would like to convert HTML to plain text but retain the minimum structure.
All sections that contain stuff only the browser needs to see, such as <script> and <style>, should be stripped completely.
All block tags should be converted to <div> and all inline ones to <span> (or inline tags removed completely without leaving whitespace), turning anything delineated by block-level elements into paragraphs separated by two line breaks.
The idea is to turn random web pages into something suitable for natural-language text processing, without the artefacts left by naively removing markup, which artificially breaks words up or makes unrelated blocks look like sentences.
Any binary, library, or source in any programming language is OK.
Is there a standard source, preferably machine-readable, with a full list of elements defining which are block-level, which are inline, and which are like <script> and <style> above?
The list of HTML 4 Block-level elements is here: http://htmlhelp.com/reference/html40/block.html
The most popular HTML parsing libraries for Perl are HTML::Parser, which is a SAX-style parser, and HTML::TreeBuilder, which is more DOM-like.
Beyond that, you'll have to decide which elements are important and which are not, based on what you're trying to do.
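Since the question allows any language, here is a minimal sketch of that pipeline using Python's standard html.parser (the block and skip lists are abbreviated - see the HTML 4 reference above for the full set):

import re
from html.parser import HTMLParser

# Abbreviated element lists; see the HTML 4 reference above for the full set.
BLOCK = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6",
         "ul", "ol", "li", "blockquote", "pre", "table", "tr"}
SKIP = {"script", "style"}   # content only the browser needs to see

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip_depth += 1
        elif tag in BLOCK:
            self.parts.append("\n\n")   # block boundary -> paragraph break

    def handle_endtag(self, tag):
        if tag in SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag in BLOCK:
            self.parts.append("\n\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    # Collapse the blank-line runs left by nested block elements.
    return re.sub(r"\n{3,}", "\n\n", "".join(parser.parts)).strip()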
You may want to do some research yourself first. Then, when you run into a problem, ask a question related to that problem. As written, this sounds more like the specification for a project that you want someone to do for you.
For starters, websites use tags for all sorts of things, and the problem is very complex. You would probably want to save information in h# and p tags, but you may also want to save div tag information if it carries an id attribute. In short, you'd have to write rules for each website you encounter, or employ some sort of fuzzy logic.
Instead of doing it on a tag-by-tag basis, why not try detecting sentences and grammar, or things likely to be in headings, and choose tags that include those things while stripping out the rest?
Here's my own tool to solve this problem in Perl using HTML::Parser as a github gist: html2txt.pl
It's unfinished and perhaps slightly Windows-centric but I thought I'd share it since a few people have viewed my question here. Feel free to play with it.
What work, if any, has been done to automatically determine the most important data within an HTML document? As an example, think of your standard news/blog/magazine-style website, containing navigation (possibly with submenus), ads, comments, and the prize - our article/blog/news body.
How would you determine what information on a news/blog/magazine is the primary data in an automated fashion?
Note: ideally, the method would work with well-formed markup and with terrible markup alike - whether somebody uses paragraph tags to make paragraphs, or a series of breaks.
Readability does a decent job of exactly this.
It's open source and posted on Google Code.
UPDATE: I see (via HN) that someone has used Readability to mangle RSS feeds into a more useful format, automagically.
think of your standard news/blog/magazine-style website, containing navigation (possibly with submenus), ads, comments, and the prize - our article/blog/news body.
How would you determine what information on a news/blog/magazine is the primary data in an automated fashion?
I would probably try something like this:
open URL
read in all links to same website from that page
follow all links and build a DOM tree for each URL (HTML file)
this should help you identify redundant content (shared templates and such)
compare DOM trees for all documents on same site (tree walking)
strip all redundant nodes (i.e. repeated, navigational markup, ads and such things)
try to identify similar nodes and strip if possible
find largest unique text blocks that are not to be found in other DOMs on that website (i.e. unique content)
add as candidate for further processing
This approach seems pretty promising because it would be fairly simple to do, yet still has good potential to be adaptive, even to complex Web 2.0 pages that make excessive use of templates, because it identifies similar HTML nodes across all pages of the same website.
This could probably be further improved by simply using a scoring system to keep track of DOM nodes that were previously identified to contain unique content, so that these nodes are prioritized for other pages.
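A rough Python sketch of the "strip what repeats across pages" core of this idea (assumes the requests and beautifulsoup4 packages; the URL handling and block-element list are simplified):

from collections import Counter

import requests
from bs4 import BeautifulSoup

def block_texts(url):
    # Return the text of each block-level element on the page.
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()
    blocks = soup.find_all(["p", "div", "li", "td", "section", "article"])
    return [b.get_text(" ", strip=True) for b in blocks
            if b.get_text(strip=True)]

def unique_content(target_url, sibling_urls):
    # Keep the blocks of the target page that appear on no sibling page.
    seen = Counter()
    for url in sibling_urls:
        seen.update(set(block_texts(url)))
    candidates = [t for t in block_texts(target_url) if seen[t] == 0]
    return max(candidates, key=len, default="")   # largest unique block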
Sometimes there's a CSS media section defined as 'print'. Its intended use is for 'Click here to print this page' links. Usually people use it to strip a lot of the fluff and leave only the meat of the information.
http://www.w3.org/TR/CSS2/media.html
I would try to read this style, and then scrape whatever is left visible.
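A typical print stylesheet looks something like this (the class names are made up; real sites vary):

/* Hypothetical class names. */
@media print {
    nav, .sidebar, .ads, .comments { display: none; }
}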
You can use support vector machines to do text classification. One idea is to break pages into different sections (say, treat each structural element like a div as a document) and gather some properties of each, converting it to a vector. (As others suggested, these could be the number of words, the number of links, the number of images - the more the better.)
First start with a large set of documents (100-1000) for which you have already chosen which part is the main part. Then use this set to train your SVM.
For each new document you then just need to convert it to a vector and pass it to the SVM.
This vector model is actually quite useful in text classification, and you do not necessarily need an SVM. You can use a simpler Bayesian model as well.
And if you are interested, you can find more details in Introduction to Information Retrieval. (Freely available online)
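A toy sketch of the training step with scikit-learn (an assumed library choice; the feature set and numbers are purely illustrative):

# pip install scikit-learn  (assumed; any SVM implementation works)
from sklearn import svm

# Each block becomes a feature vector:
# [word count, link count, image count]  -- illustrative features only.
X_train = [
    [450, 3, 1],    # long text, few links           -> main content
    [12, 10, 0],    # short text, many links         -> navigation
    [30, 8, 4],     # little text, many images/links -> ad block
]
y_train = [1, 0, 0]  # 1 = main content, 0 = everything else

clf = svm.SVC()
clf.fit(X_train, y_train)

# A new, unseen block: 300 words, 2 links, no images.
print(clf.predict([[300, 2, 0]]))   # expect [1], i.e. main content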
I think the most straightforward way would be to look for the largest block of text without markup. Then, once it's found, figure out the bounds of it and extract it. You'd probably want to exclude certain tags from "not markup" like links and images, depending on what you're targeting. If this will have an interface, maybe include a checkbox list of tags to exclude from the search.
You might also look for the lowest level in the DOM tree and figure out which of those elements is the largest, but that wouldn't work well on poorly written pages, as the DOM tree is often broken on such pages. If you end up using this, I'd come up with some way to check whether the browser has entered quirks mode before trying it.
You might also try using several of these checks, then coming up with a metric for deciding which result is best. For example, still try my second option above, but give its result a lower "rating" if the browser would normally enter quirks mode. Going this route would obviously impact performance.
I think a very effective algorithm for this might be, "Which DIV has the most text in it that contains few links?"
Seldom do ads have more than two or three sentences of text. Look at the right side of this page, for example.
The content area is almost always the area with the greatest width on the page.
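A quick sketch of that heuristic (assumes beautifulsoup4; the scoring formula is just one plausible choice):

from bs4 import BeautifulSoup

def main_div(html):
    # Score each div by text length, penalized by its link count.
    # Note: nested divs inherit their children's text, so in practice
    # you would subtract child scores or prefer deeper matches.
    soup = BeautifulSoup(html, "html.parser")
    best, best_score = None, 0.0
    for div in soup.find_all("div"):
        text_len = len(div.get_text(" ", strip=True))
        links = len(div.find_all("a"))
        score = text_len / (1 + links)   # tune the penalty as needed
        if score > best_score:
            best, best_score = div, score
    return best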
I would probably start with the title and anything else in the head tag, then filter down through the heading tags in order (i.e. h1, h2, h3, etc.)... beyond that, I guess I would go in order, from top to bottom. Depending on how the page is styled, it may be a safe bet to assume the page title has an ID or a unique class.
I would look for sentences with punctuation. Menus, headers, footers, etc. usually contain separate words, but not sentences containing commas and ending in a period or equivalent punctuation.
You could look for the first and last elements containing sentences with punctuation and take everything in between. Headings are a special case, since they usually don't have punctuation either, but you can typically recognize them as Hn elements immediately preceding sentences.
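A crude sketch of the detection step (the regex is only a rough approximation of "sentence with punctuation"):

import re

# Very rough: a capitalized run of at least a few words,
# ending in sentence punctuation.
SENTENCE = re.compile(r"[A-Z][^.!?]{15,}[.!?]")

def looks_like_prose(text):
    # True if the text contains at least two sentence-like spans.
    return len(SENTENCE.findall(text)) >= 2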
While this is obviously not the answer, I would assume that the important content is located near the center of the styled page and usually consists of several blocks interrupted by headlines and such. The structure itself may be a give-away in the markup, too.
A diff between articles/posts/threads would be a good filter to find out what content distinguishes a particular page (obviously this would have to be augmented to filter out random crap like ads, "quote of the day" boxes or banners). The structure of the content may be very similar across multiple pages, so don't rely on structural differences too much.
Instapaper does a good job with this. You might want to check Marco Arment's blog for hints about how he did it.
Today most news/blog websites are built on a blogging platform.
So I would create a set of rules by which to search for content.
For example, two of the most popular blogging platforms are WordPress and Google's Blogspot.
WordPress posts are marked by:
<div class="entry">
...
</div>
Blogspot posts are marked by:
<div class="post-body">
...
</div>
If the search by CSS classes fails, you could fall back to the other solutions: identifying the biggest chunk of text, and so on.
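That rule set might look like this (a sketch, assuming beautifulsoup4; the selector list covers just the two platforms above):

from bs4 import BeautifulSoup

# Known content containers per platform; extend as you meet new ones.
CONTENT_SELECTORS = [
    "div.entry",      # WordPress
    "div.post-body",  # Blogspot
]

def extract_post(html):
    soup = BeautifulSoup(html, "html.parser")
    for selector in CONTENT_SELECTORS:
        node = soup.select_one(selector)
        if node is not None:
            return node.get_text(" ", strip=True)
    return None   # fall back to the generic heuristics above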
As Readability is not available anymore:
If you're only interested in the outcome, you can use Readability's successor Mercury, a web service.
If you're interested in some code how this can be done and prefer JavaScript, then there is Mozilla's Readability.js, which is used for Firefox's Reader View.
If you prefer Java, you can take a look at Crux, which also does a pretty good job.
Or, if Kotlin is more your language, you can take a look at Readability4J, a port of the above Readability.js.