Can OpenNLP use HTML tags as part of the training?

I'm creating a training set for the TokenNameFinder from HTML documents converted into plain text, but my precision is low, so I want to use the HTML tags as part of the training: for example, words in bold and sentences with different margin sizes.
Will OpenNLP accept and use those tags to create rules?
Is there another way to make use of those tags to improve precision?

It is not clear what you mean by using HTML tags to train OpenNLP.
The training input consists of annotated, tokenized sentences:
<START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Mr . <START:person> Vinken <END> is chairman of <START:company> Elsevier N.V. <END> , the Dutch publishing group .
To train an OpenNLP model using the standard tooling, your annotations need to follow this convention. Note that the annotations do not follow the XML standard.
You can embed the annotations directly in the HTML documents you will use for training. The extra context might even help the classifier, but I've never read any experimental results about it.
Keep in mind that the training data must be tokenized. That means you should put whitespace between words and punctuation, as well as between text elements and HTML tags:
<p> <i> Mr . <START:person> Vinken <END> </i> is chairman of <b> <START:company> Elsevier N.V. <END> </b> , the Dutch publishing group .
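A minimal sketch of that whitespace rule in Python, assuming HTML tags and the annotation markers are the only things that still need separating. The helper name space_out_tags is made up for this example, and it does not split punctuation from words, which a real tokenizer would still have to do.

```python
import re

# Matches anything in angle brackets: HTML tags as well as the
# <START:type> / <END> annotation markers.
TAG = re.compile(r"<[^<>]+>")


def space_out_tags(line: str) -> str:
    """Put exactly one space around every tag and marker in the line."""
    spaced = TAG.sub(lambda match: f" {match.group(0)} ", line)
    # Collapse the runs of whitespace created by the substitution.
    return " ".join(spaced.split())


if __name__ == "__main__":
    raw = "<p><i>Mr . <START:person> Vinken <END></i> is chairman of <b><START:company> Elsevier N.V. <END></b> , the Dutch publishing group ."
    print(space_out_tags(raw))
```

Whether keeping the tags as tokens actually improves precision is something you would have to measure on your own data.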

Does text from a rich text editor not inherit styles when rendered in an HTML document?

Just to make things clear: I used an RTE in the backend to store a description. Later, I receive that description through an API, along with other details, as a response. Up to this point the styles are intact, for example bold headings. But when I render it in the HTML document using the innerHTML property, all I see is unformatted text. The headings are not bold anymore.
Here's a part of response:
</p>\r\n\n\r\n<p><span style=\"font-weight: bold;\">Features</span> \n </p>\r\n\n\r\n<p>Gives even skin tone, smoother complexion and sculpted facial features.
Clearly, font-weight: bold can be seen here. But the rendered version does not contain those styles.
Here's the full response:
{
"cart_count":2,
"images":[
],
"success":true,
"message":"Sucessfully",
"data":{
"product_id":1,
"name":"Dr G Butterfly Gua Sha",
"category_id":1,
"category":"Skin Tool",
"description":"<p>Dr G Butterfly Rose Quartz Gua Sha is a beauty and wellness tool designed to heal and enhance natural beauty. It lifts and sculpts your face, drains the lymph node, which reduces puffy eyes and face. By scraping with repeated strokes on the surface of the skin, this tool helps stimulate muscles and increases the blood flow. \n </p>\r\n\n\r\n<p><span style=\"font-weight: bold;\">Features</span> \n </p>\r\n\n\r\n<p>Gives even skin tone, smoother complexion and sculpted facial features. Reduces the signs of ageing and gives younger-looking skin. Increases lymphatic function. Stimulates blood circulation. Improves the appearance of dark circles and reduces under-eye puffiness. </p>\r\n\n\r\n<p><span style=\"font-weight: bold;\">How To Use \n</span></p>\r\n\n\r\n<p>Apply Dr G oil or Dr G gel as per your skin type covering the face and neck. </p>\r\n<p>Hold the butterfly gua sha tool firmly and sweep across gently up and out, starting with the neck, cheeks, jawline, chin, around the mouth, and slowly glide under the eyes, across your eyebrows and from your forehead up to your hairline. </p>\r\n<p>You can sweep it 3-5 times per area. </p>\r\n<p>Recommended at least a few times a week for best results. </p>\r\n\n\r\n<p><span style=\"font-weight: bold;\">About Dr G</span> \n </p>\r\n\n\r\n<p>Dr G offers luxury skincare products, backed by over a decade of dermatology expertise and on-ground practice. Made for Indian weather conditions, with variants for different skin types, including sensitive skin, and to address specific skin concerns - these innovative products are a perfect balance of nature and science. Drawing from ancient Ayurveda and combining natural extracts with skin-safe science, Dr G's range of products bridge modern skincare with holistic science.</p>",
"short_description":"Sculpts, Tones, Reduces Puffiness, Lifts",
"max_quantity":500,
"status":1,
"in_stock":1,
"measurement":[
{
"is_cart":true,
"ordered_quantity":2,
"is_wish":false,
"discounted_price":1400.0,
"weight":"200 Gram",
"price":1400.0,
"prod_id":1,
"percentage":100,
"max_quantity":500
}
]
}
}
The HTML from your response isn't valid. You can easily test this by copying the HTML string from your response into a text file with an .html extension (index.html, for example) and opening it in your browser. Or use a validator like this one: https://www.freeformatter.com/html-validator.html
Let's pick one part from the HTML string which has wrong characters and gets displayed unformatted:
<span style=\"font-weight: bold;\">Features</span> \n
If you remove the backslashes \ here, this piece gets rendered correctly:
<span style="font-weight: bold;">Features</span> \n
I would recommend encoding the HTML before sending it to the frontend. You could use Base64, which is easy to encode in the backend and to decode on the frontend before displaying it.
If these "wrong" characters are already there when you receive this HTML (on your backend), you have to parse it first to clean it.
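Before changing the encoding, it may be worth checking whether the backslashes are simply JSON string escapes. A minimal sketch in Python, assuming the response body is JSON text shaped like the excerpt above (the variable names are illustrative): a JSON parser turns \" back into " and \n into a real line break, so the parsed description should contain no backslashes.

```python
import json

# Raw response text as it travels over the wire: the inner quotes and
# line breaks are escaped because they sit inside a JSON string.
raw_body = r'{"data": {"description": "<p><span style=\"font-weight: bold;\">Features</span> \n </p>"}}'

# Parsing the JSON undoes the escaping.
response = json.loads(raw_body)
description_html = response["data"]["description"]

print(description_html)
```

The same applies to JSON.parse in the browser. If the backslashes are still present after parsing, the string was probably escaped a second time somewhere upstream, and that is the place to fix it.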

Any conventional standards for storing OCR data/metadata in JPEG images?

I want to organize a collection of scanned documents (receipts, bank statements, etc.) by adding their metadata and text content (OCR'ed) into the same jpeg files. Is there any more or less commonly accepted way of storing such data? Any commonly used schemas?
For metadata, for example, I found the Dublin Core schema, but most of the fields I want are not there, and I'm not sure what a good way to add custom fields is. Can I just use them as if they existed in the DC or XMP schema (i.e. <dc:myfield>myvalue</dc:myfield> or <xmp:myfield>myvalue</xmp:myfield>), or do I have to define my own schema by adding xmlns:myScheme="http://myScheme.uri" and then use it as <myScheme:myfield>myvalue</myScheme:myfield>?
Also, in all the examples I found, this data is stored inside <rdf:Description> which is inside <rdf:RDF> which is inside <x:xmpmeta> - is it a standard requirement? I don't see it in the XMP specification for storage in files...
For now, based on the examples, I plan to embed something like this:
<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>
<x:xmpmeta xmlns:x='adobe:ns:meta/' x:xmptk='MyTool v 0.0.1'>
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
<rdf:Description rdf:about=''
xmlns:dc='http://purl.org/dc/elements/1.1/'
xmlns:myDoc='http://some.custom.uri/'>
<dc:format>image/jpeg</dc:format>
<myDoc:doctype>scan</myDoc:doctype>
<myDoc:originalfilename>20190519121225_003.jpg</myDoc:originalfilename>
<myDoc:originalimagewidth>1684</myDoc:originalimagewidth>
<myDoc:originalimageheight>2788</myDoc:originalimageheight>
<myDoc:langOCR>EN-US</myDoc:langOCR>
<myDoc:acquisitiondatetime>2019-05-19T12:12:25Z</myDoc:acquisitiondatetime>
<myDoc:documentdate>2019-01-02</myDoc:documentdate>
<myDoc:pagesindocument>6</myDoc:pagesindocument>
<myDoc:page>2</myDoc:page>
<myDoc:textcontent>
Bank
statement
02/01/2019
Page 2 of 6
( Here goes raw OCR content
as multiline text )
</myDoc:textcontent>
<dc:subject>
<rdf:Bag>
<rdf:li>bank</rdf:li>
<rdf:li>statement</rdf:li>
</rdf:Bag>
</dc:subject>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
<?xpacket end='w'?>
Does it make sense at all? I'm sure many people already worked on similar tasks, I don't want to reinvent the wheel...
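For illustration, here is a minimal sketch of generating a packet like the one planned above with Python's standard xml.etree.ElementTree. The myDoc namespace URI and the field names are taken from the plan; the helper names qname and build_xmp_packet are hypothetical. It only builds the XML text and does not embed it into a JPEG file.

```python
import xml.etree.ElementTree as ET

# Prefix -> namespace URI. The myDoc URI is the placeholder from the plan above.
NAMESPACES = {
    "x": "adobe:ns:meta/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "myDoc": "http://some.custom.uri/",
}
for prefix, uri in NAMESPACES.items():
    ET.register_namespace(prefix, uri)


def qname(prefix: str, local: str) -> str:
    """Return the {uri}local form ElementTree uses for qualified names."""
    return f"{{{NAMESPACES[prefix]}}}{local}"


def build_xmp_packet(custom_fields: dict, subjects: list) -> str:
    """Build the packet text: custom myDoc fields plus a dc:subject bag."""
    xmpmeta = ET.Element(qname("x", "xmpmeta"))
    rdf = ET.SubElement(xmpmeta, qname("rdf", "RDF"))
    description = ET.SubElement(
        rdf, qname("rdf", "Description"), {qname("rdf", "about"): ""}
    )
    for name, value in custom_fields.items():
        ET.SubElement(description, qname("myDoc", name)).text = value
    subject = ET.SubElement(description, qname("dc", "subject"))
    bag = ET.SubElement(subject, qname("rdf", "Bag"))
    for item in subjects:
        ET.SubElement(bag, qname("rdf", "li")).text = item
    body = ET.tostring(xmpmeta, encoding="unicode")
    # Wrapper lines copied from the plan above.
    return (
        "<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>\n"
        + body
        + "\n<?xpacket end='w'?>"
    )


if __name__ == "__main__":
    print(build_xmp_packet({"doctype": "scan", "page": "2"}, ["bank", "statement"]))
```

Keeping the custom fields under their own prefix, as the plan already does, at least avoids name clashes with existing dc: or xmp: properties.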

What's the difference between `<seg>` and `<span>`

What's the difference between a <seg> in XML and <span> in HTML? Here are two passages from Bibles, one from the English Bible in Christodouloupoulos' and Steedman's massively parallel Bible corpus,
<?xml version="1.0" ?>
<cesDoc version="4">
…
<text>
<body id="Bible" lang="en">
<div id="b.GEN" type="book">
<div id="b.GEN.1" type="chapter">
<seg id="b.GEN.1.1" type="verse">
In the beginning God created the heaven and the earth.
</seg>
<seg id="b.GEN.1.2" type="verse">
And the earth was without form, and void; and darkness was upon the face of the deep. And the Spirit of God moved upon the face of the waters.
</seg>
…
and the other from the NIV English Bible at Bible Gateway, which is where they got most of their texts from:
<p class="chapter-1">
<span id="en-NIV-27932" class="text Rom-1-1">
<span class="chapternum">1 </span>
Paul, a servant of Christ Jesus, called to be an apostle and set apart for the gospel of God—
</span>
<span id="en-NIV-27933" class="text Rom-1-2">
<sup class="versenum">2 </sup>the gospel he promised beforehand through his prophets in the Holy Scriptures
</span>
…
In the HTML, it seems a <span> can replace a <seg>, except that the HTML has added verse numbers in <span>. Oh, and the chapters are in <div>. So it's not one-to-one.
Of course, I realize that HTML and XML are different, and this is only one juxtaposition; I'm sure there are others out there. But I'm going to need to be able to display XML as HTML, and I don't want to anger the doctype gods. So, conceptually, how is <seg> different from <span> in purpose, meaning and usage?
Update: @jim-garrison says I'm going to need to read the schema to understand the XML, but I'm a neophyte at that, too. In particular, I did find some official-looking documentation for <seg> by TEI that makes me think its use is a little more than arbitrary, but I have no idea how to interpret this documentation. Should it give us a more specific answer than what Jim has already written?
The difference between XML and HTML generally is that the list of tags that can be present in XML is defined by a DTD or XML Schema, and tags represent document semantics and not presentation. So tags can be named anything. In HTML the set of tags is generally predefined, as if there was a pre-existing HTML DTD or schema, but HTML is not XML and doesn't follow all the rules of XML. While HTML was in some sense derived from the same parent as XML (SGML), and the two are superficially very similar, they are most definitely NOT the same thing.
The answer to your specific question is that the writers of the XML chose to use a tag named <seg> ("segment"?) to represent generalized strings of text, with attributes providing additional semantic information. For more details you'll need to find the DTD or XML schema that governs the content of the XML and read the documentation that goes with it.
But I'm going to need to be able to display XML as HTML, and I don't want to anger the doctype gods. So, conceptually, how is <seg> different from <span> in purpose, meaning and usage?
This is where you will use XSLT to transform the input XML into valid HTML. To figure out how to do that transformation you will need to know the full semantics of all the tags that can appear (again, go to the documentation for the DTD/Schema) and decide on a visual representation for the data. There's no one answer to "how should a <seg> be transformed?". That's up to your requirements regarding presentation. One possible transformation converts <seg> tags to <span>, but that may depend on the value of certain attributes (type="verse" vs some other type). It might even differ depending on the output medium (desktop vs tablet vs phone vs watch vs ...?).
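To make the idea of "one possible transformation" concrete, here is a small sketch. It is not XSLT; it expresses the same kind of mapping with Python's standard xml.etree.ElementTree, under the assumption that every <seg> becomes a <span> and every type attribute becomes a class. The function name seg_to_span and the trimmed sample are invented for the example.

```python
import xml.etree.ElementTree as ET

# A trimmed fragment in the shape of the corpus excerpt quoted in the question.
SAMPLE = """<div id="b.GEN.1" type="chapter">
<seg id="b.GEN.1.1" type="verse">In the beginning God created the heaven and the earth.</seg>
<seg id="b.GEN.1.2" type="verse">And the earth was without form, and void; and darkness was upon the face of the deep. And the Spirit of God moved upon the face of the waters.</seg>
</div>"""


def seg_to_span(xml_text: str) -> str:
    """Rename <seg> to <span> and turn each type attribute into a class."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        kind = element.attrib.pop("type", None)
        if kind is not None:
            # Carry the semantic hint over so CSS can style verses and chapters.
            element.set("class", kind)
        if element.tag == "seg":
            element.tag = "span"
    return ET.tostring(root, encoding="unicode", method="html")


if __name__ == "__main__":
    print(seg_to_span(SAMPLE))
```

A different requirement (say, verses as paragraphs, or verse numbers generated from the id) would simply be a different mapping, which is the point made above.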
Once you convert from XML to HTML you have left the realm of the Doctype gods and they have no interest in what you do :-) There's a whole different set of deities such as CSS-Cthulhu, Javascript-Janai'ngo (look it up), et al who will take great pleasure making your life miserable.

How to get content of an HTML element using XPath without element id?

I am trying to find an element using xpath and get the elements text value. Kindly bear with me and help me in resolving the issue.
1. In <div class="medium-8 columns">, I need to extract the paragraph text only up to "Further History" (i.e. stop at "Further History", not including it).
2. In the same <div class="medium-8 columns">, I need to extract the paragraph text after "Further History" (again not including "Further History").
I am using the XPath expression below, which is not returning anything.
(//STRONG[not(contains(text(), 'Further History'))]/following-sibling::text() | //STRONG[not(contains(text(), 'Further History'))]/../following-sibling::p/text()) | //div[contains(@class, 'articlecontent')]
HTML might not be case-sensitive, but XML (and, consequently, XPath) is: "STRONG" is not the same as "strong", and in the HTML you linked to, there is only "strong".
A useful XPath expression to retrieve the text you are interested in might be
//div[@class="medium-8 columns"]/p[following-sibling::p/strong]/text()
which means
//div                          select all `div` elements, anywhere in the document
[@class="medium-8 columns"]    but only if they have a `class` attribute whose value is
                               equal to "medium-8 columns"
/p                             of those `div` elements, select all `p` child elements
[following-sibling::p/strong]  but only if they have a following sibling `p` which has a
                               `strong` element as a child
/text()                        of the remaining `p` elements, select the text content
and which would return (individual results separated by ------):
Tim Bajarin is recognized as one of the leading industry
consultants, analysts and futurists, covering the field of
personal computers and consumer technology. Mr. Bajarin has
been with Creative Strategies since 1981 and has served as a
consultant to most of the leading hardware and software
vendors in the industry including IBM, Apple, Xerox, Hewlett
Packard/Compaq, Dell, AT&T, Microsoft, Polaroid, Lotus,
Epson, Toshiba and numerous others.
-----------------------
His articles and/or analysis have appeared in USA Today, Wall
Street Journal, The New York Times, Time and Newsweek
magazines, BusinessWeek and most of the leading business and
trade publications. He has appeared as a business analyst
commenting on the computer industry on all of the major
television networks and was a frequent guest on PBS’ The
Computer Chronicles.
-----------------------
Mr. Bajarin has been a columnist for US computer industry
publications such as PC Week and Computer Reseller News and
wrote for ABCNEWS.COM for two years and Mobile Computing for
10 years. His columns currently appear in Asia Computer
Weekly, Personal Computer World (UK), and Microscope (UK) as
well as Mobile Enterprise Magazine. His various columns and
analyses are syndicated in over 30 countries.
For your second case:
Here I need to extract paragraphs text after “Further History” (not including “Further History”)
just replace following-sibling with preceding-sibling in the path expression.
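If you want to try the before/after split without an XPath engine that supports sibling axes, here is a rough sketch using only Python's standard xml.etree.ElementTree, which does not implement following-sibling or preceding-sibling. It walks the p children in order instead. The sample markup and the function name split_at_marker are made up, the input is assumed to be well-formed, and only the leading text of each p is collected.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the page in the question.
SAMPLE = """<html><body>
<div class="medium-8 columns">
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
  <p><strong>Further History</strong></p>
  <p>Later paragraph.</p>
</div>
</body></html>"""


def split_at_marker(markup: str, marker: str = "Further History"):
    """Return (texts before the marker paragraph, texts after it)."""
    root = ET.fromstring(markup)
    div = root.find(".//div[@class='medium-8 columns']")
    before, after = [], []
    seen_marker = False
    for paragraph in div.findall("p"):
        strong = paragraph.find("strong")
        if strong is not None and marker in (strong.text or ""):
            # The marker paragraph itself is excluded from both lists.
            seen_marker = True
            continue
        text = (paragraph.text or "").strip()
        (after if seen_marker else before).append(text)
    return before, after


if __name__ == "__main__":
    print(split_at_marker(SAMPLE))
```

Unlike the XPath expression, this keys on the marker text rather than on the mere presence of a strong element, so it behaves differently if other paragraphs also contain strong.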

Text-friendly file format for structured data

I'm looking for a file format that lets me encode structured data like dictionaries and arrays, but also allows me to easily edit text blocks, including line breaks.
Candidates so far:
xml: (+) good for text editing and structured data, (-) ignores line breaks, closing tags is cumbersome
html: (+) has tags for line breaks, (-) no structured data
json: (+) good for structured data, (-) bad for editing multiline text
yaml: (+) good for structured data, (-) bad for editing multiline text if text contains special characters like colon etc [edit: see accepted answer, literal style works]
My favorite so far: xml with self-defined tags for line breaks. Better ideas?
YAML is a perfect fit, and your "con" that it's "bad for editing multiline text if text contains special characters like colon etc" is entirely unfounded. YAML is by far the most featureful format for multi-line text:
---
# Plain multi-line scalars are folded and stripped by default
preamble:
  We the People of the United States, in Order to form a more
  perfect Union, establish Justice, insure domestic Tranquility,
  provide for the common defence, promote the general Welfare,
  and secure the Blessings of Liberty to ourselves and our
  Posterity, do ordain and establish this Constitution for the
  United States of America.

# Chomping indicators (+ and -) allow explicit control over how
# leading/trailing whitespace will be preserved or stripped
chomp: >+


  Hello: Is it me you're looking for?



# Literal style preserves formatting
homepage: |
  <html>
    <head>
      <title>My kewl web site</title>
    </head>
    <body>
      <h1>Hello world!</h1>
    </body>
  </html>

# The indentation indicator lets you explicitly control indentation if it
# can't be inferred
indentation: |4

        I'll be indented eight spaces
      I'll be indented six

# And colons (or other special characters) are not a problem
emoji: |
  😀: Grinning face {U+1F600}
  😬: Grimacing face {U+1F62C}
  😞: Disappointed face {U+1F61E}
...and, of course, you can use any of these formats within a mapping (dictionary) or sequence (array). You can even use complex strings (or any YAML structure, as it happens) for mapping keys.
If you have an example of a use case that you think YAML is a poor fit for, feel free to leave a comment. YAML isn't perfect for everything, but it's great for a lot of things.
For comparison, here's the same thing in JSON:
{ "preamble": "We the People of the United States, in Order to form a more perfect Union, establish Justice, insure domestic Tranquility, provide for the common defence, promote the general Welfare, and secure the Blessings of Liberty to ourselves and our Posterity, do ordain and establish this Constitution for the United States of America.",
  "chomp": "\n\nHello: Is it me you're looking for?\n\n\n\n",
  "homepage": "<html>\n  <head>\n    <title>My kewl web site</title>\n  </head>\n  <body>\n    <h1>Hello world!</h1>\n  </body>\n</html>\n",
  "indentation": "\n    I'll be indented eight spaces\n  I'll be indented six\n",
  "emoji": "😀: Grinning face {U+1F600}\n😬: Grimacing face {U+1F62C}\n😞: Disappointed face {U+1F61E}\n"
}