Background: I'm writing a DocBook 5 document (incorporating some already-written text), with the intention of generating HTML from it. I would like to get the semantic markup right from the beginning so I don't have to redo it later, but the standard approach does not seem to generate what I'm looking for, so I'm not sure whether I should deviate from it; that depends on what is possible with XSL.
Current setup: My glossary only has abbreviated items. It consists of glossentry elements, each containing a fully spelled-out glossterm and one or more acronym and/or abbrev elements. I suppose they could all just be glossterms instead; it doesn't matter to me. Suppose, for example, I have this:
<glossentry xml:id="ff">
  <glossterm>Firefox</glossterm>
  <acronym>FF</acronym>
  <glossdef>
    <para>The web browser made by Mozilla.</para>
  </glossdef>
</glossentry>
Ideal
Suppose wherever I want to refer to Firefox, I put FF in the text. Ideally, without any additional markup, wherever "FF" (case sensitive) appeared as a plain whole word in a paragraph (or title, but not, for example, in code or programlisting, or inside a URL inside an attribute...) in my DocBook file, it would come out in HTML as the text "FF", but marked up as a link to the glossary entry (though not with the standard link CSS), and with a title attribute whose value is Firefox. That way a reader can hover to get the acronym/abbreviation spelled out for them, and if that is insufficient, they can click through to a fuller definition. Meanwhile I would style it black and underlined, so that readers know the feature is there, but it doesn't pull attention the way a normal link does, especially given how often it occurs in the text.
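In other words, for each occurrence I'd want output along these lines (the class name is just illustrative, and the href assumes single-page HTML output):
<a class="glossref" href="#ff" title="Firefox">FF</a>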
Main question: is such replacement of plaintext, markup-less terms even possible in XSL (without creating something like the Scunthorpe problem)? If so, can it do this for every acronym or abbrev found in the glossary, automatically?
I could not figure out how to do this directly, but that is still my goal. Meanwhile I've tried other things:
Approach 1
Set up a keyboard macro so I can type ff and have that be transformed while I'm typing into <xref linkend="ff"/>.
Pros:
links to the glossary
spells out the abbreviation
Cons:
spells out the abbreviation (it would be nice to keep it short to read, not just short to type)... workaround: make the acronym into a glossterm and put it first in the glossentry (loss of semantics, but maybe that's OK here?)
links to the glossary (I would like it styled differently)...solution: CSS for a.xref
even with the above two worked around, the title comes out as FFFirefox instead of just Firefox (and with entries that have more than one synonym, the mashing-together continues)... solution: put an alternate xml:id on your preferred acronym/abbrev, and then make the links to it refer to that id in their endterm attribute (with linkend still referring to the entry's id); see the snippet after this list
I have to remember to use the keyboard macro rather than just typing and letting the system do the work; any imported text then has to have text replacements done on it for each glossary entry
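Concretely, that last solution means giving the preferred short form its own xml:id inside the entry and pointing endterm at it (the id names here are only illustrative):
<acronym xml:id="ff-short">FF</acronym>
<!-- and, in the body text: -->
<xref linkend="ff" endterm="ff-short"/>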
Approach 2
Using <xsl:param name="glossterm.auto.link" select="1"/> and a keyboard macro to change FF into <glossterm>FF</glossterm>.
Pros:
links to the glossary
Cons:
spells out the abbreviation (it would be nice to keep it short to read, not just short to type)... workaround: make the acronym into a glossterm and put it first in the glossentry (loss of semantics, but maybe that's OK here?)
links to the glossary (I would like it styled differently)...solution: CSS for .glossterm
even with the above two worked around, no title attribute is given
I have to remember to use the keyboard macro rather than just typing and letting the system do the work; any imported text then has to have text replacements done on it for each glossary entry
Approach 1 so far seems better after the workarounds, but is there a way to achieve the ideal I outlined?
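For concreteness, the closest I can picture is a customization layer that rewrites text nodes with xsl:analyze-string (XSLT 2.0), along the lines of this untested sketch. The class, href and title are hard-coded here; the real thing would have to build the regex from the glossary, look up each match, handle whole-word matching (XPath regexes have no \b), and coexist with the stock DocBook templates, which is exactly the part I can't figure out:
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:d="http://docbook.org/ns/docbook">

  <!-- only direct text children of para/title; nested inlines, code, etc. would need more care -->
  <xsl:template match="d:para/text() | d:title/text()">
    <xsl:analyze-string select="." regex="FF">
      <xsl:matching-substring>
        <!-- hard-coded for illustration; should really be looked up from the glossentry -->
        <a class="glossref" href="#ff" title="Firefox">
          <xsl:value-of select="."/>
        </a>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>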
Related
I'm creating a very simple form that has a text area. The text area takes in a formatted block of names separated by newlines. To make the application slightly more usable, it would be nice if I could include a placeholder example that had multiple lines of text. Unfortunately, that doesn't seem to be possible with the HTML5 specification. Does anybody know why?
The placeholder attribute is like <blockquote> to me: it has a specific niche.
In the case of the placeholder attribute, it's mainly used in one-line form fields, not text areas.
How often do you use a carriage return in a one-line form field? Never.
The placeholder attribute represents a short hint (a word or short phrase) intended to aid the user with data entry. A hint could be a sample value or a brief description of the expected format. The attribute, if specified, must have a value that contains no U+000A LINE FEED (LF) or U+000D CARRIAGE RETURN (CR) characters.
Since HTML5 is still fresh and new, and continues to be optimized and tweaked in various browsers, who knows what crazy things would happen cross-browser if the placeholder attribute didn't have such strict guidelines set up?
The web seems to be moving in a direction that helps designers/developers type less code and make fewer mistakes.
I've seen a few posts (by people like Paul Irish and Jeffrey Way) talking about omitting things like closing tags, and many standard elements have been modified in HTML5 to be shorter/easier (e.g. <!doctype html>). Also, attributes that were traditionally required to make a webpage function well can now easily be thrown out altogether. The web is getting simpler, and more complex, at the same time.
All in all, though, if you want something to fix the dilemma (which, by the tone of your question, you seem to be suffering from), just use the title attribute instead. Refer to the accepted answer to the question at the following link:
Can you have multiline HTML5 placeholder text in a <textarea>?
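A minimal sketch of that title-attribute workaround (the field name and hint text are only placeholders, and how browsers render line breaks inside a tooltip varies):
<textarea name="names" rows="5" cols="30"
          placeholder="One name per line"
          title="For example:
Alice Example
Bob Example"></textarea>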
Given the HTML of a webpage, what would be the easiest strategy to get the text that's visible on the corresponding page? I have thought of grabbing everything between <a>...</a> and <p>...</p>, but that is not working very well.
Keep in mind that this is for a school project, so I am not allowed to use any kind of external library (the idea is to do the parsing myself). Also, this will be implemented as the HTML of the page is downloaded; that is, I can't assume I already have the whole HTML page. It has to show the extracted visible words as the HTML is being downloaded.
Also, it doesn't have to work in ALL cases, just be satisfactory most of the time.
I am not allowed to use any kind of external library
This is a poor requirement for a ‘software architecture’ course. Parsing HTML is extremely difficult to do correctly, certainly way outside the bounds of a course exercise. Any naïve approach you come up with involving regex hacks is going to fall over badly on common web pages.
The software-architecturally correct thing to do here is use an external library that has already solved the problem of parsing HTML (such as, for .NET, the HTML Agility Pack), and then iterate over the document objects it generates looking for text nodes that aren't in ‘invisible’ elements like <script>.
If the task of grabbing data from web pages is of your own choosing, to demonstrate some other principle, then I would advise picking a different challenge, one you can usefully solve. For example, just changing the input from HTML to XML might allow you to use the built-in XML parser.
"Literally all the text that is visible" sounds like a big ask for a school project, as it would depend not only on the HTML itself, but also on any in-page or external styling. One solution would be to simply strip the HTML tags from the input, though that wouldn't strictly meet your requirements as you have stated them.
Assuming that near enough is good enough, you could make a first pass to strip out the content of entire elements which you know won't be visible (such as script, style), and a second pass to remove the remaining tags themselves.
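If the input could first be turned into well-formed XHTML (as suggested above), the same two-pass idea could even be expressed as a tiny XSLT instead of hand-written string passes; a sketch only, assuming the elements are not in a namespace and ignoring CSS-based hiding:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <!-- first pass: drop whole subtrees that never produce visible text -->
  <xsl:template match="head | script | style | noscript"/>
  <!-- second pass: the built-in rules keep only the remaining text nodes, discarding the tags -->
</xsl:stylesheet>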
I'd consider writing a regex to remove all HTML tags; you should then be left with your desired text. This can be done in JavaScript and doesn't require anything special.
I know this is not exactly what you asked for, but it can be done using Regular Expressions:
// JavaScript code; should (could) also work in C# (needs escaping for the quotes):
h = h.replace(/<(?:"[^"]*"|'[^']*'|[^'">])*>/g,'');
This regexp will remove the HTML tags; note, however, that you first need to remove the script, link, style, ... tags (along with their contents).
If you decide to go this way, I can help you with the regular expressions needed.
HTML 5 includes a detailed description of how to build a parser. It is probably more complicated than you are looking for, but it is the recommended way.
You'll need to parse every DOM element for text, then detect whether that element is visible (e.g. el.style.display == 'block' or 'inline'), and then detect whether it is positioned outside the viewable area of the page. Then you'll need to check the z-index and background of each element in order to detect whether any overlapping element is hiding some text.
Basically, this is impossible to do within a month's time.
I have seen it a lot in CSS discussions. What does "semantically correct" mean?
Labeling correctly
It means that you're calling something what it actually is. The classic example is that if something is a table, it should contain rows and columns of data. To use that for layout is semantically incorrect - you're saying "this is a table" when it's not.
Another example: a list (<ul> or <ol>) should generally be used to group similar items (<li>). You could use a div for the group and a <span> for each item, and style each span to be on a separate line with a bullet point, and it might look the way you want. But "this is a list" conveys more information.
Fits the ideal behind HTML
HTML stands for "HyperText Markup Language"; its purpose is to mark up, or label, your content. The more accurately you mark it up, the better. New elements are being introduced in HTML5 to more accurately label common web page parts, such as headers and footers.
Makes it more useful
All of this semantic labeling helps machines parse your content, which helps users. For instance:
Knowing what your elements are lets browsers use sensible defaults for how they should look and behave. This means you have less customization work to do and are more likely to get consistent results in different browsers.
Browsers can correctly apply your CSS (Cascading Style Sheets), describing how each type of content should look. You can offer alternative styles, or users can use their own; as long as you've labeled your elements semantically, rules like "I want headlines to be huge" will be usable.
Screen readers for the blind can help them fill out a form more easily if the logical sections are broken into fieldsets with one legend for each one. A blind user can hear the legend text and decide, "oh, I can skip this section," just as a sighted user might do by reading it.
Mobile phones can switch to a numeric keyboard when they see a form input of type="tel" (for telephone numbers); see the sketch after this list.
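A minimal sketch of those last two points (the names and values are only illustrative):
<fieldset>
  <legend>Billing address</legend>
  <!-- the inputs for this logical section go here -->
</fieldset>

<input type="tel" name="phone">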
Semantics basically means "The study of meaning".
Usually when people are talking about code being semantically correct, they're referring to the code that accurately describes something.
In (x)HTML, there are certain tags that give meaning to the content they contain. For example:
An H1 tag describes the data it contains as a level-1 heading. An H2 tag describes the data it contains as a level-2 heading. The implied meaning behind this is that each H2 under an H1 is in some way related (i.e. heading and subheading).
When you code in a semantic way, you basically give meaning to the data you're describing.
Consider the following two samples of semantic vs. non-semantic markup:
<h1>Heading</h1>
<h2>Subheading</h2>
vs. a non-semantic equivalent:
<p><strong>Heading</strong></p>
<p><em>Subheading</em></p>
Sometimes you might hear people in a debate saying "You're just talking semantics now" and this usually refers to the act of saying the same meaning as the other person but using different words.
"Semantically correct usage of elements means that you use them for what they are meant to be used for. It means that you use tables for tabular data but not for layout, it means that you use lists for listing things, strong and em for giving text an emphasis, and the like."
From: http://www.codingforums.com/archive/index.php/t-53165.html
HTML elements have meaning. "Semantically correct" means that your elements mean what they are supposed to.
For instance, your definition lists are represented by <dl> elements in code, your abbreviations are <abbr>s, etc.
It means that HTML elements are used in the right context (not, say, tables used for layout purposes), that CSS classes are named in a human-understandable way, and that the document itself has a structure that can be processed by non-browser clients such as screen readers and automatic parsers trying to extract the information and its structure from the document.
For example, you use lists to build menus. This way a screen reader will know these list items are part of the same menu level, and it will read them in sequence so the person can make a choice.
I've never heard it in a purely CSS context, but when talking about CSS and HTML, it means using the proper tags (for example, avoiding the use of the table tag for non-tabular data), providing proper values for the class and id that identify what the contained data is (and using microformats as appropriate), and so on.
It's all about making sure that your data can be understood by humans (everything is displayed properly) and computers (everything is properly identified and marked up).
What work, if any, has been done to automatically determine the most important data within an HTML document? As an example, think of your standard news/blog/magazine-style website, containing navigation (possibly with submenus), ads, comments, and the prize: our article/blog/news body.
How would you determine what information on a news/blog/magazine is the primary data in an automated fashion?
Note: Ideally, the method would work with both well-formed markup and terrible markup, whether somebody uses paragraph tags to make paragraphs or just a series of breaks.
Readability does a decent job of exactly this.
It's open source and posted on Google Code.
UPDATE: I see (via HN) that someone has used Readability to mangle RSS feeds into a more useful format, automagically.
think of your standard news/blog/magazine-style website, containing navigation (possibly with submenus), ads, comments, and the prize: our article/blog/news body.
How would you determine what information on a news/blog/magazine is the primary data in an automated fashion?
I would probably try something like this:
open URL
read in all links to same website from that page
follow all links and build a DOM tree for each URL (HTML file)
this should help you identify redundant content (included templates and such)
compare DOM trees for all documents on same site (tree walking)
strip all redundant nodes (i.e. repeated, navigational markup, ads and such things)
try to identify similar nodes and strip if possible
find largest unique text blocks that are not to be found in other DOMs on that website (i.e. unique content)
add as candidate for further processing
This approach seems pretty promising because it would be fairly simple to do, but still has good potential to be adaptive, even to complex Web 2.0 pages that make excessive use of templates, because it would identify similar HTML nodes across all the pages on the same website.
This could probably be further improved by simply using a scoring system to keep track of DOM nodes that were previously identified as containing unique content, so that these nodes are prioritized for other pages.
Sometimes there's a CSS media type defined as "print". Its intended use is for "Click here to print this page" links. Usually people use it to strip out a lot of the fluff and leave only the meat of the information.
http://www.w3.org/TR/CSS2/media.html
I would try to read this style, and then scrape whatever is left visible.
You can use support vector machines to do text classification. One idea is to break pages into different sections (say, treat each structural element like a div as a document), gather some properties of each, and convert them into a vector. (As other people have suggested, these could be the number of words, number of links, number of images, and so on; the more the better.)
First, start with a large set of documents (100-1000) for which you have already chosen which part is the main content. Then use this set to train your SVM.
Then, for each new document, you just need to convert it to a vector and pass it to the SVM.
This vector model is actually quite useful in text classification, and you do not necessarily need to use an SVM; you can use a simpler Bayesian model as well.
And if you are interested, you can find more details in Introduction to Information Retrieval. (Freely available online)
I think the most straightforward way would be to look for the largest block of text without markup. Then, once it's found, figure out the bounds of it and extract it. You'd probably want to exclude certain tags from "not markup" like links and images, depending on what you're targeting. If this will have an interface, maybe include a checkbox list of tags to exclude from the search.
You might also look for the lowest level in the DOM tree and figure out which of those elements is the largest, but that wouldn't work well on poorly written pages, as the DOM tree is often broken on such pages. If you end up using this, I'd come up with some way to check whether the browser has entered quirks mode before trying it.
You might also try using several of these checks, then coming up with a metric for deciding which is best. For example, still try to use my second option above, but give its result a lower "rating" if the browser would enter quirks mode normally. Going with this would obviously impact performance.
I think a very effective algorithm for this might be, "Which DIV has the most text in it that contains few links?"
Seldom do ads have more than two or three sentences of text. Look at the right side of this page, for example.
The content area is almost always the area with the greatest width on the page.
I would probably start with the title and anything else in the head tag, then filter down through the heading tags in order (i.e. h1, h2, h3, etc.); beyond that, I guess I would go in order, from top to bottom. Depending on how it's styled, it may be a safe bet to assume a page title has an ID or a unique class.
I would look for sentences with punctuation. Menus, headers, footers, etc. usually contain separate words, but not sentences containing commas and ending in a period or equivalent punctuation.
You could look for the first and last element containing sentences with punctuation, and take everything in between. Headers are a special case since they usually don't have punctuation either, but you can typically recognize them as Hn elements immediately before sentences.
While this is obviously not the answer, I would assume that the important content is located near the center of the styled page and usually consists of several blocks interrupted by headlines and such. The structure itself may be a give-away in the markup, too.
A diff between articles / posts / threads would be a good filter to find out what content distinguishes a particular page (obviously this would have to be augmented to filter out random crap like ads, "quote of the day"s or banners). The structure of the content may be very similar for multiple pages, so don't rely on structural differences too much.
Instapaper does a good job with this. You might want to check Marco Arment's blog for hints about how he did it.
Today most news/blog websites are using a blogging platform.
So I would create a set of rules with which to search for content.
For example, two of the most popular blogging platforms are WordPress and Google's Blogspot.
WordPress posts are marked by:
<div class="entry">
...
</div>
Blogspot posts are marked by:
<div class="post-body">
...
</div>
If the search by CSS classes fails, you could turn to the other solutions: identifying the biggest chunk of text, and so on.
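In XPath terms, such rules could look something like this (assuming the page can be loaded into a DOM; the class names are the ones shown above):
//div[contains(concat(' ', normalize-space(@class), ' '), ' entry ')]
//div[contains(concat(' ', normalize-space(@class), ' '), ' post-body ')]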
As Readability is not available anymore:
If you're only interested in the outcome, you can use Readability's successor Mercury, a web service.
If you're interested in some code how this can be done and prefer JavaScript, then there is Mozilla's Readability.js, which is used for Firefox's Reader View.
If you prefer Java, you can take a look at Crux, which also does a pretty good job.
Or if Kotlin is more your language, then you can take a look at Readability4J, a port of the above-mentioned Readability.js.
My problem is as follows:
I have a column: ProductName. The text entered here comes from TinyMCE, so it has all kinds of tags in it. The user wants to be able to do a find-and-replace on all products, and it has to support coloring.
For example, let's say this is a portion of a ProductName:
other text.. <strong>text text <font color="#ff6600">colortext®</font></strong> ..other text
Now, the user wants to replace this:
<font color="#ff6600">colortext®</font>
The original name has the <strong> tag in it, so it appears bold. So the user makes it bold, and now the text he is searching for is:
<strong><font color="#ff6600">colortext®</font></strong>
Obviously I'm not going to find it. Plus there's the matter of spaces: in one place there's a space, in another there isn't.
Is there a way to overcome this?
Strip the HTML tags from the search text and do a plain text search first. Then, part by part (i.e., text node by text node), take the element path of the search text's parts, and compare these with their counterparts in the found text. If the paths for all parts match, you're done.
Edit: By path, I meant something similar to XPath, or the path notion of the TinyMCE editor. Example: the plain text part of the search text is "colortext®". The path of this text node in the search text is <strong>/<font color="#ff6600">. Search for the same plain text in the text body (trivial), and take its path, which is also <strong>/<font color="#ff6600">. (Compare this with the path of "other text..", which is /, and of "text text", which is <strong>.) The two paths are the same, so this is a real match. If you have a DOM tree representation, determining the path shouldn't be difficult.
You're asking for several related, but discrete, abilities:
Search and Replace content
Search and Replace formatting
Search and Replace similar (i.e., ignoring trivial differences in whitespace)
You should take this in steps; otherwise it becomes overwhelming, and a single search algorithm won't be able to do all three without intense effort, resulting in difficult-to-maintain code.
First, look at the similar problem. Make a search that ignores spaces and case. You might want to get into Lucene or another search engine technology if you also need to deal with "bowl" vs "bowls" and "intelligent" vs "smart" - though I expect this is beyond your current needs.
Once you have that working, it becomes one layer in your stack of searches.
Second, look at the formatting search. This is typically done using tokens or tags - which you already have in the form of HTML. However, you have to be able to deal with things out of sequence - so <b><i>text</i></b> needs to be caught in a search for <i><b>text</b></i>, as does the malformed representation where tags aren't nested properly, such as <b><i>text</b></i>.
One method is to pre-parse the string and apply the formatting styles to each character. So you'd have a "t" that's bold and italic, an "e" that's bold and italic, etc. To make this easier and faster, use a hash to represent each style combination: read the first character, figure out what style it is (keeping track of styles turning on and off as you encounter tags), and if that combination already exists in the hash, assign that hash number to the letter. If it doesn't, get a new hash number and assign that.
Now you can compare the letter and its style hash against your search and get format and content matches. Stack that on top of your similar match and you have what you need.
-Adam
If it's valid XML, an XSLT would be trivial for this kind of exercise.
Use an identity template, and then add an XPath to find the specific node you want:
<xsl:template match="//strong/font">
  <xsl:copy>
    <!-- Insert the replacement text here -->
  </xsl:copy>
</xsl:template>
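For reference, the identity template referred to above is the standard one; it copies everything that no more specific template overrides:
<xsl:template match="@*|node()">
  <xsl:copy>
    <xsl:apply-templates select="@*|node()"/>
  </xsl:copy>
</xsl:template>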
When working with XML, this would be a maintainable, extensible solution.
I'm not sure I understand everything you said, but the use of regular expressions seems like a good way to overcome the problem you're talking about.