Any conventional standards for storing OCR data/metadata in JPEG images? - ocr

I want to organize a collection of scanned documents (receipts, bank statements, etc.) by adding their metadata and text content (OCR'ed) into the same jpeg files. Is there any more or less commonly accepted way of storing such data? Any commonly used schemas?
For metadata, for example, I found the Dublin Core schema, but most of the fields I want are not there, and I'm not sure what a good way to add custom fields is. Can I just use them as if they existed in the DC or XMP schema (i.e. <dc:myfield>myvalue</dc:myfield> or <xmp:myfield>myvalue</xmp:myfield>), or do I have to define my own schema by adding xmlns:myScheme="http://myScheme.uri" and then use it as <myScheme:myfield>myvalue</myScheme:myfield>?
Also, in all the examples I found, this data is stored inside <rdf:Description> which is inside <rdf:RDF> which is inside <x:xmpmeta> - is it a standard requirement? I don't see it in the XMP specification for storage in files...
For now, based on the examples, I plan to embed something like this:
<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>
<x:xmpmeta xmlns:x='adobe:ns:meta/' x:xmptk='MyTool v 0.0.1'>
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
<rdf:Description rdf:about=''
xmlns:dc='http://purl.org/dc/elements/1.1/'
xmlns:myDoc='http://some.custom.uri/'>
<dc:format>image/jpeg</dc:format>
<myDoc:doctype>scan</myDoc:doctype>
<myDoc:originalfilename>20190519121225_003.jpg</myDoc:originalfilename>
<myDoc:originalimagewidth>1684</myDoc:originalimagewidth>
<myDoc:originalimageheight>2788</myDoc:originalimageheight>
<myDoc:langOCR>EN-US</myDoc:langOCR>
<myDoc:acquisitiondatetime>2019-05-19T12:12:25Z</myDoc:acquisitiondatetime>
<myDoc:documentdate>2019-01-02</myDoc:documentdate>
<myDoc:pagesindocument>6</myDoc:pagesindocument>
<myDoc:page>2</myDoc:page>
<myDoc:textcontent>
Bank
statement
02/01/2019
Page 2 of 6
( Here goes raw OCR content
as multiline text )
</myDoc:textcontent>
<dc:subject>
<rdf:Bag>
<rdf:li>bank</rdf:li>
<rdf:li>statement</rdf:li>
</rdf:Bag>
</dc:subject>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
<?xpacket end='w'?>
Does this make sense at all? I'm sure many people have already worked on similar tasks, and I don't want to reinvent the wheel...
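For reference, a packet like the one above could be generated rather than hand-written. The following is a minimal Python sketch using only the standard library. The myDoc namespace URI and field names are the placeholders from my example, and build_xmp_packet is a hypothetical helper name. It only builds the packet text; embedding it into the JPEG file is a separate step that is not shown.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes and URIs taken from the example packet above.
NS = {
    "x": "adobe:ns:meta/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "myDoc": "http://some.custom.uri/",  # placeholder custom namespace
}
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)


def q(prefix, name):
    """Return the ElementTree qualified name '{uri}name' for a prefix."""
    return f"{{{NS[prefix]}}}{name}"


def build_xmp_packet(custom_fields, subjects):
    """Build an XMP packet string (hypothetical helper, not a library API)."""
    xmpmeta = ET.Element(q("x", "xmpmeta"))
    rdf = ET.SubElement(xmpmeta, q("rdf", "RDF"))
    desc = ET.SubElement(rdf, q("rdf", "Description"), {q("rdf", "about"): ""})
    ET.SubElement(desc, q("dc", "format")).text = "image/jpeg"
    # Custom fields go into the custom namespace, not into dc: or xmp:.
    for name, value in custom_fields.items():
        ET.SubElement(desc, q("myDoc", name)).text = str(value)
    # dc:subject is an unordered list, serialized as an rdf:Bag.
    bag = ET.SubElement(ET.SubElement(desc, q("dc", "subject")), q("rdf", "Bag"))
    for subject in subjects:
        ET.SubElement(bag, q("rdf", "li")).text = subject
    body = ET.tostring(xmpmeta, encoding="unicode")
    return ("<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>\n"
            + body
            + "\n<?xpacket end='w'?>")
```

Calling build_xmp_packet({"doctype": "scan", "page": 2}, ["bank", "statement"]) returns the wrapper, the RDF body and the trailer as one string. ElementTree places all xmlns declarations on the root element, which differs cosmetically from my hand-written layout.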

Parser user page information from Wikipedia. How to remove redundant information?

I'm trying to fetch public user information from Wikipedia using the API (with the script get_pages_revisions.py). After I got the revisions, I used BeautifulSoup to strip all the HTML tags. However, I found that the remaining text is still quite messy.
For example, when I fetched the textual data from the User:(aeropagitica), the results showed the following:
(A small part of it)
{{administrator}}
{{divbox|gray||Wikipedia is currently working on {{NUMBEROFARTICLES}} articles. The local time at the Wikipedia servers is '''{{CURRENTTIME}}''' on {{CURRENTDAYNAME}} {{CURRENTDAY}} {{CURRENTMONTHNAME}}, {{CURRENTYEAR}}.}}
• '''[[:WP:AIV|AIV]]''' •
'''[[Wikipedia:Articles for deletion/Log/{{CURRENTYEAR}} {{CURRENTMONTHNAME}} {{CURRENTDAY}}|AfD]]''' • '''[[User:(aeropagitica)/RFA summary|RfA]]''' • '''[[:Category:Candidates for speedy deletion|CSD]]''' • '''[[Wikipedia:Template messages|tpl]]''' • '''[[Wikipedia:Template_messages/User_talk_namespace|user talk tpl]]''' • '''[[Special:Newpages|new]]''' • '''[[Wikipedia:Stubs|stubs]]''' • '''[[Wikipedia:Copyright problems|(c)]]''' • '''[[Wikipedia:Manual of Style|MoS]]''' • '''[[User:Interiot/Tool2|edits (interiot)]]''' • '''[[Wikipedia:Proposed_deletion|prod]]''' • '''[[Special:Log/Newusers|newusers]]''' • '''[http://tools.wikimedia.de/~essjay/edit_count/Count.php? PHP interiot's tool]''' • '''[http://tools.wikimedia.de/~interiot/cgi-bin/Tool1/wannabe_kate Interiot's tool 1]''' • '''[[:Wikipedia:Article Creation and Improvement Drive|Article Improvement]]'''
{{purge|Purge server cache}}
I was [[Wikipedia:Requests_for_adminship/%28aeropagitica%29|nominated for adminship]] by [[User:King of Hearts|King of Hearts]] on February 27th 2006. The vote achieved consensus and I was accepted for the role with a score of '''40/10/5''' on March 7th 2006.
When I am not working on Wikipedia pages, I enjoy learning to play acoustic fingerstyle guitar, photography, learning languages (Spanish and French) and travel.
''Userboxes''
{| style="text-align:center; border: 1px solid #000000; background-color:#00cc99; width:100%; -moz-border-radius: 15px;"
|- padding:5em;padding-top:0.5em;"
|{{user en}}
May I ask:
How can I remove strings like style="....", cellpadding="...." and so on? Can I remove all such formatting strings at once?
There are many blocks like this:
{{Userbox|#77E0E8|#D0F8FF|{{CURRENTDAY}}|It is currently a [[{{CURRENTDAYNAME}}]]. I don't like {{CURRENTDAYNAME}}s.}}
The information after "It is .." is what we need, but the text before it (Userbox|#77E0E8|...) only defines the page layout and should be removed. Is there any way to remove the first half of this line?
(Userbox is just one kind of template; there are many other types like User: and Category:, so it will be quite hard to remove them with custom regex rules.)
(I'm a beginner with BeautifulSoup and web parsing, so any suggestions or hints will be valuable. Thank you in advance for your help!)
You're using the Revisions API which only allows you to get the page content as Wikitext. That's the "messy" text you're seeing.
You can instead use the Parse API to get the rendered HTML content of the page, which you can then put into a local DOM parser of your choosing or just strip HTML tags if that works for you.
See the MediaWiki API documentation for details, including examples on how to request the parsed contents of a page.
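As a rough illustration of that approach, here is a minimal Python sketch using only the standard library. The endpoint URL, the User-Agent string and the helper names (parse_api_url, html_to_text, fetch_rendered_text) are assumptions for illustration. The request uses action=parse with prop=text, and the returned HTML is reduced to plain text with html.parser.

```python
import json
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"  # assumed endpoint


def parse_api_url(page_title):
    """Build a Parse API request URL for the rendered HTML of a page."""
    params = {
        "action": "parse",
        "page": page_title,
        "prop": "text",
        "format": "json",
        "formatversion": "2",
    }
    return API_URL + "?" + urlencode(params)


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def fetch_rendered_text(page_title):
    """Fetch and strip a page (performs a network request)."""
    # Hypothetical User-Agent; a descriptive one may be required by the site.
    req = Request(parse_api_url(page_title),
                  headers={"User-Agent": "example-script/0.1"})
    with urlopen(req) as resp:
        data = json.load(resp)
    return html_to_text(data["parse"]["text"])
```

Because the templates and table styling are already rendered into HTML attributes at this point, strings like style="..." disappear together with the tags instead of lingering in the text.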

Meta tags in skin from MediaWiki template

Let's say I have a template in my MediaWiki like this:
<includeonly>
<div id="custom-person">
* <span>Birthday:</span> {{#if: {{{birth date|}}} | <b>{{#ol-time:|{{{birth date}}}}}</b> | — }}
{{#if: {{{full name|}}} | * <span>full name:</span> <b>{{{full name}}}</b>}}
{{#if: {{{birth place|}}} | * <span>birth place:</span> <b>{{{birth place}}}</b>}}
{{#if: {{{age|}}} | * <span> age:</span> <b>{{{age}}}</b>}}
{{#if: {{{nationality|}}} | * <span> nationality:</span> <b>{{{nationality}}}</b>}}
<div class="clear"></div>
</div>
[[Category:Person]]
__NOTOC__
</includeonly>
All these pages are in one Namespace (0).
I need to generate head meta tags with data from this template.
I figured out how to filter such pages and add title tags in my SkinPerson.php:
if ( $out->getTitle()->getNamespace() == 0 ) {
$out->addMeta( "description", $out->getPageTitle());
$out->addHeadItem( 'og:description', '<meta property="og:description" content="' . $out->getPageTitle() . '">');
}
But I'm really stuck on how to insert something like {{{full name}}} + {{{age}}} into, say, the 'og:description' tag.
That's not possible directly, and I wonder what your use case is and why you want to do that. First, an explanation of why it can't work the way you are trying to do it:
The template is evaluated by a piece of software we call the Parser. The parser is generating a html representation of your wikitext, including all the templates and so on. The result of that is then saved in the ParserOutput and probably cached in ParserCache (so that not every time it needs to be parsed again).
However, the skin, where you want to add the head item, is using the output of the parser directly, so it does not really know about the wikitext (including template parameters) anymore, and really shouldn't.
One possible solution is to extend the wikitext markup language with a tag extension, parse that tag while the wikitext is being parsed, and save the values for the head items in the database. When the page is output, you can then retrieve these values from the database and add them to the head items as you want. See the documentation for more information about that.
There might be other ways, apart from the database, to get information from the parsing time into the output time, which I'm not aware of.

What's the difference between `<seg>` and `<span>`

What's the difference between a <seg> in XML and <span> in HTML? Here are two passages from Bibles, one from the English Bible in Christodouloupoulos' and Steedman's massively parallel Bible corpus,
<?xml version="1.0" ?>
<cesDoc version="4">
…
<text>
<body id="Bible" lang="en">
<div id="b.GEN" type="book">
<div id="b.GEN.1" type="chapter">
<seg id="b.GEN.1.1" type="verse">
In the beginning God created the heaven and the earth.
</seg>
<seg id="b.GEN.1.2" type="verse">
And the earth was without form, and void; and darkness was upon the face of the deep. And the Spirit of God moved upon the face of the waters.
</seg>
…
and the other from the NIV English Bible at Bible Gateway, which is where they got most of their texts from:
<p class="chapter-1">
<span id="en-NIV-27932" class="text Rom-1-1">
<span class="chapternum">1 </span>
Paul, a servant of Christ Jesus, called to be an apostle and set apart for the gospel of God—
</span>
<span id="en-NIV-27933" class="text Rom-1-2">
<sup class="versenum">2 </sup>the gospel he promised beforehand through his prophets in the Holy Scriptures
</span>
…
In the HTML, it seems a <span> can replace a <seg>, except that the HTML has added verse numbers in <span>. Oh, and the chapters are in <div>. So it's not one-to-one.
Of course, I realize that HTML and XML are different, and this is only one juxtaposition; I'm sure there are others out there. But I'm going to need to be able to display XML as HTML, and I don't want to anger the doctype gods. So, conceptually, how is <seg> different from <span> in purpose, meaning and usage?
Update: @jim-garrison says I'm going to need to read the schema to understand the XML, but I'm a neophyte at that, too. In particular, I did find some official-looking documentation for <seg> by TEI that makes me think its use is a little more than arbitrary, but I have no idea how to interpret this documentation. Should it give us a more specific answer than what Jim has already written?
The difference between XML and HTML generally is that the list of tags that can be present in XML is defined by a DTD or XML Schema, and tags represent document semantics and not presentation. So tags can be named anything. In HTML the set of tags is generally predefined, as if there was a pre-existing HTML DTD or schema, but HTML is not XML and doesn't follow all the rules of XML. While HTML was in some sense derived from the same parent as XML (SGML), and the two are superficially very similar, they are most definitely NOT the same thing.
The answer to your specific question is that the writers of the XML chose to use a tag named <seg> ("segment"?) to represent generalized strings of text, with attributes providing additional semantic information. For more details you'll need to find the DTD or XML schema that governs the content of the XML and read the documentation that goes with it.
But I'm going to need to be able to display XML as HTML, and I don't want to anger the doctype gods. So, conceptually, how is <seg> different from <span> in purpose, meaning and usage?
This is where you will use XSLT to transform the input XML into valid HTML. To figure out how to do that transformation you will need to know the full semantics of all the tags that can appear (again, go to the documentation for the DTD/Schema) and decide on a visual representation for the data. There's no one answer to "how should a <seg> be transformed". That's up to your requirements regarding presentation. One possible transformation converts <seg> tags to <span>, but that may depend on the value of certain attributes (type="verse" vs some other type). It might even differ depending on the output medium (desktop vs tablet vs phone vs watch vs ...?)
Once you convert from XML to HTML you have left the realm of the Doctype gods and they have no interest in what you do :-) There's a whole different set of deities such as CSS-Cthulhu, Javascript-Janai'ngo (look it up), et al who will take great pleasure making your life miserable.
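To make the idea of such a transformation concrete, here is a minimal sketch of one possible mapping, written in Python with ElementTree rather than XSLT. The sample input is abbreviated from the question, and the choices made here (chapter <div> becomes <p>, <seg> becomes <span>, the type attribute becomes a class) are assumptions for illustration, not something the schema prescribes.

```python
import xml.etree.ElementTree as ET

# Abbreviated sample in the shape of the corpus excerpt from the question.
SAMPLE = """<body id="Bible" lang="en">
<div id="b.GEN.1" type="chapter">
<seg id="b.GEN.1.1" type="verse">In the beginning God created the heaven and the earth.</seg>
<seg id="b.GEN.1.2" type="verse">And the earth was without form, and void.</seg>
</div>
</body>"""


def seg_to_html(xml_text):
    """Map chapter <div>s to <p> and <seg>s to <span> (one possible choice)."""
    root = ET.fromstring(xml_text)
    out = ET.Element("div", {"class": "bible"})
    for chapter in root.iter("div"):
        if chapter.get("type") != "chapter":
            continue
        p = ET.SubElement(out, "p",
                          {"class": "chapter", "id": chapter.get("id", "")})
        for seg in chapter.findall("seg"):
            # The semantic 'type' attribute is carried over as a CSS class.
            span = ET.SubElement(p, "span", {
                "id": seg.get("id", ""),
                "class": seg.get("type", "seg"),
            })
            span.text = (seg.text or "").strip()
    return ET.tostring(out, encoding="unicode", method="html")
```

Running seg_to_html(SAMPLE) yields HTML in which each verse is a <span class="verse"> with its original id. A different requirement (say, one <li> per verse) would only change the mapping, which is the point made above.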

How to generate hash from ~200k text/html that would match/compare to similar text?

I would like to make a sort of hash key out of a text (in my case html) that would match/compare to the hash of other similar text
ex of matching texts:
"2012/10/01 This is my webpage #1"+ 100k_of_same_text + random_words_1 + ..
"2012/10/02 This is my webpage #2"+ 100k_of_same_text + random_words_2 + ..
...
"2012/10/02 This is my webpage #2"+ 100k_of_same_text + random_words_3 + ..
So far I've thought of removing numbers and tags, but that would still leave the random words.
Is there anything out there that does this?
I have root access to the server, so I can add any UDF that is necessary, and if needed I can do the processing in C or other languages.
The ideal would be a function like generateSimilarHash(text) and another function compareSimilarHashes(hash1,hash2) that would return the percentage of matching text.
A function like compare(text1,text2) would not work in my case, as I have many pages to compare (~20 million at the moment).
Any advice is welcomed!
UPDATE:
I'm referring to a hash function as it is described on Wikipedia:
A hash function is any algorithm or subroutine that maps large data
sets of variable length to smaller data sets of a fixed length.
the fixed length part is not necessary in my case.
It sounds like you need to utilize a program like diff.
If you are just trying to compare text, a hash is not the way to go, because slight differences in input cause completely different output. (That is why cryptographic hashes are used for storing passwords and securing text.) Character difference programs are pretty complicated; unless you are really interested in how they work and want to write your own, I would just use a solution like the one shown here, which uses sdiff to get a percentage.
Percentage value with GNU Diff
You could use some sort of Levenshtein distance algorithm. This works for small pieces of text, but I'm fairly sure that something similar can be applied to large chunks of text.
Ref: http://en.m.wikibooks.org/wiki/Algorithm_implementation/Strings/Levenshtein_distance
I've found out that tag order in webpages can create a very distinctive pattern, that remains the same even if portions of text / css / script change. So I've made a string generated by the tag order (ex: html head meta title body div table tr td span bold... => "hhmtbdttsb...") and then I just do exact matches between these strings. I can even apply the Levenshtein distance algorithm and get accurate results.
If I didn't have html, I would have used the punctuation/end-lines for splitting, or something similar.
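The tag-order idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names are made up, the signature uses the first letter of each start tag as in my example, and the similarity score is simply one minus the Levenshtein distance divided by the longer signature's length.

```python
from html.parser import HTMLParser


class TagSignature(HTMLParser):
    """Record the first letter of every start tag, in document order."""

    def __init__(self):
        super().__init__()
        self.letters = []

    def handle_starttag(self, tag, attrs):
        self.letters.append(tag[0])


def tag_signature(html):
    parser = TagSignature()
    parser.feed(html)
    parser.close()
    return "".join(parser.letters)


def levenshtein(a, b):
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def similarity(sig1, sig2):
    """Return a score in [0, 1]; 1.0 means identical signatures."""
    longest = max(len(sig1), len(sig2))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(sig1, sig2) / longest
```

The signature is far shorter than the page, so it can be stored next to each row and compared by exact match first, falling back to similarity() only for near misses. Note that the distance computation is quadratic in the signature lengths.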

Regex: Extracting readable (non-code) text and URLs from HTML documents

I am creating an application that will take a URL as input, retrieve the page's HTML content off the web and extract everything that isn't contained in a tag. In other words, the textual content of the page, as seen by the visitor to that page. That includes 'masking' out everything enclosed in <script></script>, <style></style> and <!-- -->, since these portions contain text that is not inside a tag (but is best left alone).
I have constructed this regex:
(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>)
It correctly selects all the content that I want to ignore, and only leaves the page's text contents. However, that means that what I want to extract won't show up in the match collection (I am using VB.Net in Visual Studio 2010).
Is there a way to "invert" the matching of a whole document like this, so that I'd get matches on all the text strings that are left out by the matching in the above regex?
So far, what I did was to add another alternative at the end, that selects "any sequence that doesn't contain < or >", which then means the leftover text. I named that last bit in a capture group, and when I iterate over the matches, I check for the presence of text in the "text" group. This works, but I was wondering if it was possible to do it all through regex and just end up with matches on the plain text.
This is supposed to work generically, without knowing any specific tags in the HTML. It's supposed to extract all text. Additionally, I need to preserve the original HTML so the page retains all its links and scripts. I only need to be able to extract the text so that I can perform searches and replacements within it, without fear of "renaming" any tags, attributes, script variables etc. (so I can't just do a "replace with nothing" on all the matches I get, because even though I am then left with what I need, it's a hassle to reinsert that back into the correct places of the fully functional document).
I want to know if this is at all possible using regex (I know about HTML Agility Pack and XPath, but I don't feel like using them).
Any suggestions?
Update:
Here is the (regex-based) solution I ended up with: http://www.martinwardener.com/regex/, implemented in a demo web application that will show both the active regex strings along with a test engine which lets you run the parsing on any online html page, giving you parse times and extracted results (for link, url and text portions individually - as well as views where all the regex matches are highlighted in place in the complete HTML document).
what I did was to add another alternative at the end, that selects "any sequence that doesn't contain < or >", which then means the leftover text. I named that last bit in a capture group, and when I iterate over the matches, I check for the presence of text in the "text" group.
That's what one would normally do. Or even simpler, replace every match of the markup pattern with an empty string, and what you've got left is the stuff you're looking for.
It kind of works, but there seems to be a string here and there that gets picked up that shouldn't be.
Well yeah, that's because your expression—and regex in general—is inadequate to parse even valid HTML, let alone the horrors that are out there on the real web. First tip to look at, if you really want to chase this futile approach: attribute values (as well as text content in general) may contain an unescaped > character.
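To illustrate that first tip, here is a small Python sketch that runs a pattern equivalent to yours (with [^<>]+ instead of [^<>]* to avoid empty matches) against markup whose attribute value contains a >, and compares the result with a real HTML parser from the standard library. The sample markup and function names are made up for the demonstration.

```python
import re
from html.parser import HTMLParser

# Same alternatives as the pattern in the question, in Python syntax.
MARKUP_OR_TEXT = re.compile(
    r"(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)"
    r"|(?:<!--[\s\S]*?-->)"
    r"|(?:<[\s\S]*?>)"
    r"|(?P<text>[^<>]+)",
    re.IGNORECASE,
)


def regex_text(html):
    """Text runs as seen by the regex approach."""
    return [m.group("text") for m in MARKUP_OR_TEXT.finditer(html)
            if m.group("text")]


class TextCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def parser_text(html):
    """Text runs as seen by an actual HTML parser."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return collector.parts


# Valid markup: the title attribute contains an unescaped '>'.
SAMPLE = '<a title="a > b" href="x">link</a>'
```

Here regex_text(SAMPLE) returns [' b" href="x"', 'link'], so part of the tag leaks out as "text", while parser_text(SAMPLE) returns ['link'].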
I would like to once again suggest the benefits of HTML Agility Pack.
ETA: since you seem to want it, here's some examples of markup that looks like it'll trip up your expression.
<a href=link></a> - unquoted
<a href= link></a> - unquoted, space at front matched but then required at back
- very common URL char missing in group
- more URL chars missing in group
<a href=lïnk></a> - IRI
<a href
="link"> - newline (or tab)
<div style="background-image: url(link);"> - unquoted
<div style="background-image: url( 'link' );"> - spaced
<div style="background-image: url('link');"> - html escape
<div style="background-image: ur\l('link');"> - css escape
<div style="background-image: url('link\')link');"> - css escape
<div style="background-image: url(\
'link')"> - CSS folding
<div style="background-image: url
('link')"> - newline (or tab)
and that's just completely valid markup that won't match the right link, not any of the possible invalid markup, markup that shouldn't but does match a link, or any of the many problems with your other technique of splitting markup from text. This is the tip of the iceberg.
Regex is not reliable for retrieving the textual contents of HTML documents. Regex cannot handle nested tags. Even supposing a document doesn't contain any nested tags, regex still requires every tag to be properly closed.
If you are using PHP, for simplicity, I strongly recommend that you use the DOM (Document Object Model) to parse/extract HTML documents. A DOM library exists for nearly every programming language.
If you're looking to extract parts of a string not matched by a regex, you could simply replace the parts that are matched with an empty string for the same effect.
Note that the only reason this might work is because the tags you're interested in removing, <script> and <style> tags, cannot be nested.
However, it's not uncommon for one <script> tag to contain code to programmatically append another <script> tag, in which case your regex will fail. It will also fail in the case where any tag isn't properly closed.
You cannot parse HTML with regular expressions.
Parsing HTML with regular expressions leads to sadness.
I know you're just doing it for fun, but there are so many packages out there that actually do the parsing the right way, AND do it reliably, AND have been tested.
Don't go reinventing the wheel, and doing it in a way that is all but guaranteed to frustrate you down the road.
OK, so here's how I'm doing it:
Using my original regex (with the added search pattern for the plain text, which happens to be any text that's left over after the tag searches are done):
(?:(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>))|(?P<text>[^<>]*)
Then in VB.Net:
Dim regexText As New Regex("(?:(?:<(?<tag>script|style)[\s\S]*?</\k<tag>>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>))|(?<text>[^<>]*)", RegexOptions.IgnoreCase)
Dim source As String = File.ReadAllText("html.txt")
Dim evaluator As New MatchEvaluator(AddressOf MatchEvalFunction)
Dim newHtml As String = regexText.Replace(source, evaluator)
The actual replacing of text happens here:
Private Function MatchEvalFunction(ByVal match As Match) As String
Dim plainText As String = match.Groups("text").Value
If plainText IsNot Nothing AndAlso plainText <> "" Then
MatchEvalFunction = match.Value.Replace(plainText, plainText.Replace("Original word", "Replacement word"))
Else
MatchEvalFunction = match.Value
End If
End Function
Voila. newHtml now contains an exact copy of the original, except every occurrence of "Original word" in the page (as it's presented in a browser) is switched with "Replacement word", and all html and script code is preserved untouched. Of course, one could / would put in a more elaborate replacement routine, but this shows the basic principle. This is 12 lines of code, including function declaration and loading of html code etc. I'd be very interested in seeing a parallel solution, done in DOM etc for comparison (yes, I know this approach can be thrown off balance by certain occurrences of some nested tags quirks - in SCRIPT rewriting - but the damage from that will still be very limited, if any (see some of the comments above), and in general this will do the job pretty darn well).
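For comparison, a parser-based version might look roughly like the following sketch. It is in Python rather than VB.Net and uses the standard-library html.parser (an event-based parser, not a full DOM). The class and function names are hypothetical. It re-emits every piece of markup as it is encountered and applies the replacement only to text outside <script> and <style>. The output is not guaranteed to be byte-for-byte identical to the input: for example, end tags are written in lowercase.

```python
from html.parser import HTMLParser


class TextRewriter(HTMLParser):
    """Re-emit markup as parsed; replace a word only in visible text nodes."""

    RAW_TEXT_TAGS = ("script", "style")

    def __init__(self, old, new):
        # Keep entities such as &amp; as separate events so they round-trip.
        super().__init__(convert_charrefs=False)
        self.old, self.new = old, new
        self.out = []
        self._raw_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # original tag source
        if tag in self.RAW_TEXT_TAGS:
            self._raw_depth += 1

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")
        if tag in self.RAW_TEXT_TAGS and self._raw_depth:
            self._raw_depth -= 1

    def handle_data(self, data):
        if self._raw_depth:
            self.out.append(data)  # script/style content stays untouched
        else:
            self.out.append(data.replace(self.old, self.new))

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    def handle_pi(self, data):
        self.out.append(f"<?{data}>")


def replace_in_text(html, old, new):
    rewriter = TextRewriter(old, new)
    rewriter.feed(html)
    rewriter.close()
    return "".join(rewriter.out)
```

With this, replace_in_text(source, "Original word", "Replacement word") leaves attribute values and script contents alone, including the case where "Original word" appears inside an attribute.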
FYI: instead of regex, it's possible to extract just the text from HTML markup with jQuery. You can use the following pattern, where htmlString holds the markup:
$("<div/>").html(htmlString).text()
You can refer this JSFIDDLE