xml (TEI P5) visualisation method - html

I've encoded a book in XML according to the TEI P5 guidelines, and I'm trying to display it in an HTML page. The real aim is to render the encoded text with all of its TEI P5 markup (text formatting, internal references, etc.). So the big question is: how do we do this properly?
{For the record: I've tried (and am still trying) to "throw" the whole XML text into the HTML body and control its appearance with a separate CSS stylesheet, which I link to from the HTML <style> element. But something doesn't seem quite right...}
Any thoughts? Any examples?
Here is an XML sample:
<?xml version="1.0" ?>
<TEI version="5.0" xmlns="http://www.tei-c.org/ns/1.0">
<!--here exists the teiHeader-->
<text>
<front>
<div type="preface">
<head rend="align(center)">
<hi rend="bold">title</hi></head>
<p>Here is the prologue.Here is the prologue.</p>
<p>And second para of prologue.</p>
<p><ref target="#memories">Here is first reference to an xml:id, which have created to the teiHeader.</ref> Some more prologue. <ref target="#modernNovel">Here is the second reference to an xml:id, which have created to the teiHeader.</ref> and more prologue.</p>
<closer>
<salute rend="align(right)">Dear readers</salute>
<signed rend="align(right)">signature</signed>
</closer>
</div>
</front>
<body>
<ab type="title">
<title rend="align(center)" type="main">
TITLE
</title>
<lb/>
<title rend="align(center)" type="desc">
Secondary title
</title>
</ab>
<div n="1" type="chapter">
<head rend="align(center)">
<hi rend="bold">Chapter 1 <lb/>name of the chapter.</hi>
</head>
<p>Here is the real text, and text, and more text. Again the text, and text, and more text.Again the text, and text, and more text.An finally the name <placeName xml:id="London"> <hi rend="bold">London</hi></placeName>, appears here.</p>
<p>Here is the real text, and text, and more text. Again the text, and text, and more text.Again the text, and text, and more text.An the name <ref target="#London"> <hi rend="bold">London</hi></ref>, repeat itself.</p>
</div>
</body>
</text>
</TEI>

Your use of the @rend attribute suggests that you're trying to use it to supply CSS styling properties. If that's so, you might find it easier to use the @style or @rendition attributes instead, since these are intended for that purpose. Remember, however, that all three of these attributes are intended to describe the rendition of the source. The TEI doesn't (as yet) provide ways of specifying how particular components of a document should be processed: this was considered out of scope for the original project. However, there is work being done in the area of defining a "processing model" for TEI elements, which should appear in the Guidelines soon. See for example the TEI Simple project.

Yes! I'd strongly recommend writing XSLT to transform your XML into HTML, as this will give you the most control over how your TEI elements are to be rendered in a web browser. It also makes it possible for you to generate multiple different options for structuring and styling your pages.
For example, if you want to extract a list of all the character names you have tagged in the TEI, and count the number of times they appear in each chapter, and do other kinds of analytical processing, XSLT is the tool you want. And you can use it to generate a full reading view of your text.
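To make the element-by-element mapping idea concrete, here is a minimal sketch in Python's standard library. It is not XSLT and not the answerer's code; it only illustrates the same kind of transformation. The TAG_MAP table, the tei_to_html function and the fallback to a span with a class name are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical mapping from TEI element names to HTML element names.
TAG_MAP = {"div": "section", "head": "h2", "p": "p",
           "hi": "b", "ref": "a", "lb": "br"}


def tei_to_html(node):
    """Recursively copy a TEI element into an equivalent HTML element."""
    name = node.tag.replace(TEI_NS, "")
    html = ET.Element(TAG_MAP.get(name, "span"))
    if name not in TAG_MAP:
        html.set("class", name)               # keep the TEI name as a CSS hook
    if node.get(XML_ID):
        html.set("id", node.get(XML_ID))      # xml:id becomes a link target
    if name == "ref" and node.get("target"):
        html.set("href", node.get("target"))  # "#London" links to id="London"
    html.text = node.text
    for child in node:
        converted = tei_to_html(child)
        converted.tail = child.tail           # text that follows the child
        html.append(converted)
    return html


sample = """<div xmlns="http://www.tei-c.org/ns/1.0" type="chapter">
<head><hi rend="bold">Chapter 1</hi></head>
<p>In <placeName xml:id="London">London</placeName>.
See <ref target="#London">London</ref>.</p>
</div>"""

html_root = tei_to_html(ET.fromstring(sample))
html_text = ET.tostring(html_root, encoding="unicode", method="html")
print(html_text)
```

An XSLT stylesheet would express the same mapping as one template per TEI element; the sketch just shows that every TEI element needs an explicit decision about its HTML counterpart.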


What is the difference between html <var> and <p style='font-style: italic'></p>?

I searched the web for the purpose of the HTML <var> tag and didn't find a good explanation, or let's say I'm not satisfied yet. I can read what they say about it, but I don't understand the purpose. I tried two different lines of code and both give me the same result, so now I need to know what exactly <var> is and why we should use it rather than a simple style.
<var>y</var> = <var>m</var><var>x</var> + <var>b</var>
<p style='font-style:italic'>y = mx + b</p>
One reference, to name just one: https://html.com/tags/var/
Funny because I read the explanation but I still don't see what is the use of <var> other than just making the text italic!
Here is how W3Schools defines HTML:
HTML stands for Hyper Text Markup Language
HTML is the standard markup language for creating Web pages
HTML describes the structure of a Web page
HTML consists of a series of elements
HTML elements tell the browser how to display the content
HTML elements label pieces of content such as "this is a heading", "this is a paragraph", "this is a link", etc.
The way I see it, even though <var> and <i> produce the same output in the browser, they mean different things, especially if you are "reading" pages without opening a browser, as search engines do.
Note that this is not particular to the example you mentioned. Look at the example on <b> and <strong> (https://www.w3schools.com/html/html_formatting.asp). They also have the same output but mean different things.
Semantics.
<p> tags are generic paragraph elements, typically used for text.
<var> elements represent the name of a variable in a mathematical expression or a programming context.
If you italicize a paragraph it may resemble the default styling of the <var> element, but that's where the similarities end. Also, they're different to screen readers.
Here's an example using both elements and you can see that semantically, it's a paragraph of text that contains references to variables in a mathematical sense:
<p>The volume of a box is <var>l</var> × <var>w</var> × <var>h</var>, where <var>l</var> represents the length, <var>w</var> the width and <var>h</var> the height of the box.</p>

Are there valid cases in HTML/XML where tags would not be fully contained?

I think in XML and HTML that having cross-scoped tags is not allowed. Maybe SGML allows it. In XML/HTML though, are there any valid and allowed cases where this can occur?
Something like:
<p>This is <i>some <b>example</i> text</b> right here!</p>
Which would likely generate output like: "This is some example text right here!"
(Sidenote: the SO markdown parser apparently can handle it, who knew?)
"This is *some **example* text** right here!"
It's not allowed in HTML or XML. For a survey of approaches to handling non-hierarchic markup, the Wikipedia article is a good place to start:
https://en.wikipedia.org/wiki/Overlapping_markup
I think in XML and HTML that having cross-scoped tags is not allowed.
Correct
Maybe SGML allows it.
It doesn't.
In XML/HTML though, are there any valid and allowed cases where this can occur?
No. The markup just describes a DOM which is a tree of nodes. A node can only have one parent.
"This is *some **example* text** right here!"
That is rendered as:
<p>"This is <em>some <strong>example</strong></em><strong> text</strong> right here!"</p>
Overlapping tags like that are only possible as long as they are just tags in a text. As soon as the text is parsed into elements (HTML or XML), it's not possible to represent such a structure.
The concept of an element is that it is a single entity; it's not a starting and ending point in a text.
As your SO markdown example shows, it's possible to use tags like that as long as they are just tags in a text. As Quentin showed, the SO text parser has to translate that into a non-overlapping structure to be able to create valid HTML code for it.
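A quick way to see this for the XML case is to hand both variants to a parser. This is a small sketch using Python's standard xml.etree.ElementTree; the helper name is_well_formed is made up for the example.

```python
import xml.etree.ElementTree as ET

overlapping = "<p>This is <i>some <b>example</i> text</b> right here!</p>"
nested = "<p>This is <i>some <b>example</b></i><b> text</b> right here!</p>"


def is_well_formed(markup):
    """Return True if the markup parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(overlapping))  # False: </i> closes while <b> is still open
print(is_well_formed(nested))       # True: every element has exactly one parent
```

HTML parsers in browsers are more forgiving and will repair the overlapping input, but the result is still a properly nested tree.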

HTML 5 lang attribute not working as expected

I'm trying to create a website that supports multiple languages with the help of the HTML lang attribute. I've found this example here:
<!DOCTYPE html>
<html>
<body>
<p>This is a paragraph.</p>
<p lang="fr">Ceci est un paragraphe.</p>
</body>
</html>
I've set German as the language in my OS and tried this with different browsers, but I always see the French paragraph as well. That's what I see:
This is a paragraph.
Ceci est un paragraphe.
The lang attribute specifies the language of the element’s content. Everything else is up to the user agent or the webmaster.
Example uses:
a screen reader may use it to choose the appropriate pronunciation
a browser may use it for hyphenation (syllabification)
a search engine may use it to find relevant content
a webmaster may use it to style content accordingly, e.g. using the correct quotation marks for the q element with CSS’s quotes
By no means should user-agents hide content in a different language by default. Think of these examples:
<p lang="en">I met a nice guy there. His name was <span lang="de">Max Mustermann</span>.</p>
<p lang="en">He said to me <q lang="de">Halt! Stopp!</q>.</p>
<p lang="en">The original title is <cite lang="de">Faust. Eine Tragödie.</cite>.</p>
If content in a different language were hidden, these would read:
I met a nice guy there. His name was .
He said to me .
The original title is .
It seems you want to use it to realize a multilingual page. While this is possible with JS/CSS, it’s usually not the best way. Typically you might want to use separate pages for each language and link the translation with the link type alternate and the corresponding hreflang:
<!-- on the page <example.com/en/about-me>, you could link to the German translation -->
<link rel="alternate" hreflang="de" href="/de/ueber-mich" />
The lang attribute is usually set on the html tag to specify the default language of the whole page, although it can be set on any element to mark content in a different language. It's metadata that describes the content. What you're trying to do with it isn't what it does.
You could do what you're trying to do by using jQuery to hide all tags in the page which have a lang attribute not equal to something you specify, but you'd have to find out the user's preferred language in the browser (navigator.language is one place to look). If you don't want to drag jQuery into it, you could just walk the DOM yourself.
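The tree walk itself is short. The following sketch does it in Python on a well-formed snippet instead of in the browser, so it only illustrates the filtering logic, not a ready-made client-side solution. The function name drop_other_languages is an assumption for this example, and elements without a lang attribute are simply kept.

```python
import xml.etree.ElementTree as ET


def drop_other_languages(parent, keep):
    """Remove descendants whose lang attribute is set and differs from keep.

    Note: removing an element also drops any text that directly follows it
    (its tail), which is acceptable for block-level elements like <p>.
    """
    for child in list(parent):
        lang = child.get("lang")
        if lang is not None and lang != keep:
            parent.remove(child)
        else:
            drop_other_languages(child, keep)


page = ET.fromstring(
    "<body>"
    "<p>This is a paragraph.</p>"
    '<p lang="fr">Ceci est un paragraphe.</p>'
    "</body>"
)
drop_other_languages(page, "de")
remaining = [p.text for p in page.iter("p")]
print(remaining)  # ['This is a paragraph.']
```

As the other answer points out, filtering like this breaks inline foreign-language content such as names and quotations, so separate pages per language are usually the better design.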

Any ideas on how to identify the main content of the page?

If you had to identify the main text of a page (e.g. the post's content on a blog page), what would you do? What do you think is the simplest way to do it?
Get the page content with cURL
Maybe use a DOM parser to identify the elements of the page
That's a pretty hard task, but I would start by counting spaces inside of DOM elements. A telltale sign of human-readable content is spaces and periods. Most articles seem to wrap the content in paragraph tags, so you could look at all p tags with n spaces and at least one punctuation mark.
You could also use the number of grouped paragraph tags inside an element. So if a div has N paragraph children, it could very well be the content you want to extract.
There are some frameworks that can achieve this; one of them is http://code.google.com/p/boilerpipe/ which uses some statistics.
Some features that can detect html block with main content:
p, div tags
amount of text inside/outside
amount of links inside/outside (i.e. remove menus)
some CSS class names and ids (frequently those blocks have classes or ids such as main, main_block, content, etc.)
relation between title and text inside content
You might consider:
Boilerpipe: "The boilerpipe library provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page. The library already provides specific strategies for common tasks (for example: news article extraction) and may also be easily extended for individual problem settings."
Ruby Readability: "Ruby Readability is a tool for extracting the primary readable content of a webpage. It is a Ruby port of arc90's readability project."
The Readability API: "If you'd like access to the Readability parser directly, the Content API is available upon request. Contact us if you're interested."
It seems like the best answer is "it depends". As in, it depends on how the site in question is marked up.
If the author uses "common" tags, you could look for a container element ID'd as "content" or "main."
If the author is using HTML5, you should in theory be able to query for the <article> element, if it's a page with only one "story" to tell.
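As a rough sketch of that lookup, the parser below collects text inside the first <article> element or the first element whose id is "content" or "main". It uses Python's standard html.parser; the class name ContainerText and the sample page are invented for the example, and real pages will often need more than this.

```python
from html.parser import HTMLParser


class ContainerText(HTMLParser):
    """Collect text inside <article> or an element with id 'content' or 'main'."""

    def __init__(self):
        super().__init__()
        self.open_tag = None  # tag name of the container we are inside, if any
        self.nesting = 0      # how many tags of that name are currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.open_tag is None:
            if tag == "article" or dict(attrs).get("id") in ("content", "main"):
                self.open_tag, self.nesting = tag, 1
        elif tag == self.open_tag:
            self.nesting += 1  # same-named tag nested inside the container

    def handle_endtag(self, tag):
        if tag == self.open_tag:
            self.nesting -= 1
            if self.nesting == 0:
                self.open_tag = None  # the container itself has closed

    def handle_data(self, data):
        if self.open_tag is not None and data.strip():
            self.chunks.append(data.strip())


page = (
    '<body><div id="nav"><a href="/">Home</a></div>'
    '<div id="content"><div><p>First paragraph.</p></div><p>Second.</p></div>'
    "<div>Footer</div></body>"
)
finder = ContainerText()
finder.feed(page)
finder.close()
print(finder.chunks)  # ['First paragraph.', 'Second.']
```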
Recently I faced the same problem. I developed a news article scraper and I had to detect the main textual content of the article pages. Many news sites display lots of other textual content besides the "main article" (e.g. 'read next', 'you might be interested in'). My first approach was to collect all text between <p> tags. But this didn't work, because there were news sites that used <p> for other elements like navigation, 'read more', etc. too. Some time ago I stumbled on the Boilerpipe library.
The library already provides specific strategies for common tasks (for example: news article extraction) and may also be easily extended for individual problem settings.
That sounded like the perfect solution for my problem, but it wasn't. It failed at many news sites, because it was often not able to parse the whole text of the news article. I don't know why, but think that the boilerpipe algorithm can't deal with badly written html. So in many cases it just returned an empty string and not the main content of the news article.
After this bad experience I tried to develop my own "article text extractor" algorithm. The main idea was to split the html into different depths, for example:
<html>
  <!-- depth: 1 -->
  <nav>
    <!-- depth: 2 -->
    <ul>
      <!-- depth: 3 -->
      <li>Site<!-- depth: 4 --></li>
      <li>Site<!-- depth: 4 --></li>
    </ul>
  </nav>
  <div id='text'>
    <!-- depth: 2 -->
    <p>That's the main content...<!-- depth: 3 --></p>
    <p>main content, bla, bla bla ... <!-- depth: 3 --></p>
    <p>bla bla bla interesting bla bla! <!-- depth: 3 --></p>
    <p>whatever, bla... <!-- depth: 3 --></p>
  </div>
</html>
As you can see, to filter out the surplus "clutter" with this algorithm, things like navigation elements, "you may like" sections, etc. must be at a different depth than the main content. Or in other words: the surplus "clutter" must be described with more (or fewer) html tags than the main textual content.
Calculate the depth of every html element.
Find the depth with the highest amount of textual content.
Select all textual content with this depth
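The three steps above can be sketched as follows. This is a minimal Python illustration built on the standard html.parser, not the Ruby script mentioned in this answer; the class name DepthTextCollector, the function main_text and the short list of void tags are assumptions for the example. It treats script and style content like any other text, which a real extractor would want to skip.

```python
from collections import defaultdict
from html.parser import HTMLParser


class DepthTextCollector(HTMLParser):
    """Group text fragments by the nesting depth of their enclosing element."""

    VOID = {"br", "img", "hr", "input", "meta", "link"}  # tags with no end tag

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.text_by_depth = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in self.VOID:
            self.depth = max(0, self.depth - 1)

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.text_by_depth[self.depth].append(text)


def main_text(html):
    """Return the text found at the depth that holds the most characters."""
    collector = DepthTextCollector()
    collector.feed(html)
    collector.close()
    groups = collector.text_by_depth
    if not groups:
        return ""
    best = max(groups, key=lambda d: sum(len(t) for t in groups[d]))
    return "\n".join(groups[best])


page = """<html><nav><ul><li>Site</li><li>Site</li></ul></nav>
<div id='text'><p>Thats the main content...</p>
<p>main content, bla, bla bla ...</p></div></html>"""

extracted = main_text(page)
print(extracted)
```

In the sample, the navigation text sits at depth 4 and the paragraphs at depth 3, so only the paragraph text is returned.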
To prove this concept I wrote a Ruby script, which works well with most news sites. In addition to the Ruby script I also developed the textracto.com API, which you can use for free.
Greetings,
David
It depends very much on the page. Do you know anything about the page's structure beforehand? If you are in luck, it might provide an RSS feed that you could use or it might be marked up with some of the new HTML5 tags like <article>, <section> etc. (which carry more semantic power than pre-HTML5 tags).
I've ported the original Boilerpipe Java code to a pure Ruby implementation, Ruby Boilerpipe, and also made a JRuby version that wraps the original Java code, JRuby Boilerpipe.

How do I create tab indenting in html

I have a situation as follows
<body>
Test<br />
test<br />
test1<br />
</body>
I need to add a tab after the 2nd test and 3rd test
so that it looks similar to this.
Test
test
test1
Is there a special HTML entity or special character for a tab, e.g. the way a non-breaking space is &nbsp;?
thanks
The simplest way I can think of would be to place the text in nested divs, then add a left margin to each inner div. The margins add up with each level of nesting, giving you indentation.
<div id='testing'>
Test1
<div>
Test2
<div>
Test3
</div>
</div>
</div>
With the CSS:
#testing div {
margin-left: 10px;/*or whatever indentation size you want*/
}
With those, you'll get a nice tree, no matter how many levels deep you want to go.
EDIT: You can also use padding-left if you want.
If you really want to use tabs (== tabulator characters), you have to go with the following solution, which I don't recommend:
<pre>
test
&#9;test
&#9;test
</pre>
or replace the <pre/> with <div style="white-space: pre" /> to achieve the same effect as with the pre element. You can even enter a literal tab character instead of the escaped &#9;.
I don't recommend it for most usages, because it is not really semantic, that is, from viewing the HTML source a program cannot deduce any useful information (like, e.g., "this is a heading" or such). You'd be better off using one of the nice margin-left examples of the other answers. However, if you'd like to display some stuff like source code or the such, the above solution would do it.
Cheers,
Ye gods, tables?
Looks like an obvious use-case for lists, with variable margin and list-style-type: none; seasoned to taste.
See Making a 'Tab' in HTML by Neha Sinha:
Preformatted
You can put tab characters in your HTML directly if you use what’s called “preformatted” text. In HTML, surround text that you want “preformatted” in a pair of “<pre>” and “</pre>” start and end tags.
Tables
You can use a html table so that everything you put within the set of rows(<tr>) and
columns(<td>) shows up the same way. You can very well hide the table borders to show the text alone.
Using the <dd> Tag
The <dd> tag is for formatting definitions. But it
also will create a line break and make
a tab!
&nbsp;, The Non-Breaking Space
One bit of HTML code I used in the table example is the “non-breaking space,” encoded as &nbsp; in HTML. This just gives you some space. Combined with a line break, <br>, you can create some tab-like effects.
Example
Test<br/>
<pre> </pre>test<br/>
<pre> </pre>test1<br/>
this should render as:
Test
test
test1
There have been a variety of good and bad answers so far but it seems no-one has addressed the way that you can choose between the solutions.
The first question to ask is "What is the relationship between the data being displayed?". Once this has been answered, the HTML structure you should use will be obvious.
Please update the question explaining more about the structure of the content you need to display and you should find that you get better answers. At the moment everything from using <pre> to tables might be the best solution.
I think the easiest thing to do is to use UL/LI html tags and then adjust (or remove, if needed) the symbols in front of the list items with CSS.
Then you get something like:
Test
Test2
Test 3
More info + working example you can try out.
If you need to display tabs (tabulator characters), use a PRE element (or any element with the white-space: pre; CSS applied to it).
<!doctype html>
<html>
<head>
<title>Test</title>
<style type="text/css">
pre { white-space: pre; }
</style>
</head>
<body>
<p>This is a normal paragraph, blah blah blah.</p>
<pre>This is preformatted text contained within a <code>PRE</code> element. Oh, and here are some tab characters, each of which are displayed between two arrows: ← → ← → ← → ← →</pre>
</body>
</html>
You can also use the HTML entity &#9; instead of the actual tab character.
I am not a fan of using CSS to simulate a Tab Character.
For Indenting, yes, by all means use CSS - but not for Tab Characters.
For a single Tab, I would replace it with "&nbsp;&nbsp;&nbsp;&nbsp;" (4 Spaces).
This is similar to what was used to format your Question for display.
An added benefit is that, if someone copies your text, it will preserve the spacing when pasted into Word or Notepad.
Example:
Test<br />
&nbsp;&nbsp;&nbsp;&nbsp;test<br />
&nbsp;&nbsp;&nbsp;&nbsp;test1
Note: If your text is in a <pre> tag, then @Boldewyn's answer is the better option.
Keep in mind, the text in the <pre> tag may render differently than expected.
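If the text is generated by a program, the replacement can be done before the HTML is written out. A minimal sketch in Python, assuming tab-indented input lines; the function name tabs_to_nbsp and the width parameter are made up for this example.

```python
import html


def tabs_to_nbsp(line, width=4):
    """Escape a line for HTML and replace each tab with `width` &nbsp; entities."""
    return html.escape(line).replace("\t", "&nbsp;" * width)


lines = ["Test", "\ttest", "\ttest1"]
markup = "<br />\n".join(tabs_to_nbsp(line) for line in lines)
print(markup)
# Test<br />
# &nbsp;&nbsp;&nbsp;&nbsp;test<br />
# &nbsp;&nbsp;&nbsp;&nbsp;test1
```

Escaping first matters: html.escape leaves tab characters alone but would turn the ampersands of already inserted entities into &amp;.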
I realize this is an old post, however someone may want to use the following list in order to create an indented list (by using a description list)
In my opinion, this is a much cleaner way than many of the other answers here and may be the best way to go:
It does not use a bunch of whitespace characters (which gives little control in terms of formatting for styles)
It does not use the <pre> tag, which should only be used for formatting (in my opinion, this should pretty much be a last resort or a special-case use in HTML); <pre> tag is also whitespace-dependent and not CSS dependent when used the way it is intended to be used
w3schools says to use the <pre> element when displaying text with unusual formatting, or some sort of computer code.
description lists allow for more control in terms of formatting and hierarchy
The answer by @geowa4, I would say, is another great way to accomplish this. <div>s allow for style control and, depending on use/objective, his answer may be the best way to go.
<dl>
<dt>Test</dt>
<dd>
<dl>
<dt>test</dt>
<dd>test1</dd>
</dl>
</dd>
</dl>