MusicXML units (position of elements in pixels) - units-of-measurement

I use the Audiveris application to convert PDFs and images to MusicXML.
It gives me some results. For example, this element produced by OMR:
<credit-words font-family="serif" font-size="23" default-x="407" default-y="1489">
Polonaise in F major
</credit-words>
contains the attributes default-x and default-y. The problem is that these values are not in pixels. What unit are they in, and how can I convert them to pixels?

Identifying exactly where on the page a musical element occurs can be extremely difficult in MusicXML. The layout.py module of my music21 Python toolkit (shameless plug) can do it up to the measure level -- getting to the note/credit level will not be too hard after that. The code is LGPL, so you could use it to hack together a parser in another language.
See http://web.mit.edu/music21/doc/moduleReference/moduleLayout.html#music21.layout.divideByPages

Most MusicXML graphical units, including default-x and default-y, are in tenths of a staff space. There's more documentation in the MusicXML DTDs and XSD, for instance at http://www.musicxml.com/for-developers/musicxml-dtd/common-elements/.
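To turn tenths into pixels you need the size of a tenth, which a MusicXML file normally declares in its <scaling> element (<millimeters> per <tenths>), plus the resolution of the target image. The following is a minimal sketch of that arithmetic, not Audiveris or music21 code; the function names, the sample scaling values (7.0 mm per 40 tenths) and the 300 dpi default are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

MM_PER_INCH = 25.4


def read_scaling(musicxml_text):
    """Return (millimeters, tenths) from the <defaults><scaling> element."""
    root = ET.fromstring(musicxml_text)
    scaling = root.find("defaults/scaling")
    if scaling is None:
        raise ValueError("no <scaling> element; the tenth size is unknown")
    return float(scaling.findtext("millimeters")), float(scaling.findtext("tenths"))


def tenths_to_pixels(value, millimeters, tenths, dpi=300):
    """Convert a length in tenths to pixels at the given resolution (assumed dpi)."""
    mm = value * millimeters / tenths   # tenths -> millimeters
    return mm / MM_PER_INCH * dpi       # millimeters -> inches -> pixels


# Hypothetical document fragment, used only to exercise the functions.
SAMPLE = """<score-partwise>
  <defaults>
    <scaling><millimeters>7.0</millimeters><tenths>40</tenths></scaling>
  </defaults>
</score-partwise>"""

if __name__ == "__main__":
    mm, t = read_scaling(SAMPLE)
    print(tenths_to_pixels(407, mm, t), tenths_to_pixels(1489, mm, t))
```

Check which origin the values are measured from before mapping them onto an image: image coordinates usually start at the top-left corner, so a y value measured from the bottom of the page would have to be subtracted from the page height.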

Any conventional standards for storing OCR data/metadata in JPEG images?

I want to organize a collection of scanned documents (receipts, bank statements, etc.) by adding their metadata and text content (OCR'ed) into the same jpeg files. Is there any more or less commonly accepted way of storing such data? Any commonly used schemas?
For metadata, for example, I found the Dublin Core schema, but most of the fields I want are not there, and I'm not sure what a good way to add custom fields is. Can I just use them as if they existed in the DC or XMP schema (i.e. <dc:myfield>myvalue</dc:myfield> or <xmp:myfield>myvalue</xmp:myfield>), or do I have to define my own schema by adding xmlns:myScheme="http://myScheme.uri" and then use it as <myScheme:myfield>myvalue</myScheme:myfield>?
Also, in all the examples I found, this data is stored inside <rdf:Description> which is inside <rdf:RDF> which is inside <x:xmpmeta> - is it a standard requirement? I don't see it in the XMP specification for storage in files...
For now, based on the examples, I plan to embed something like this:
<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>
<x:xmpmeta xmlns:x='adobe:ns:meta/' x:xmptk='MyTool v 0.0.1'>
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
<rdf:Description rdf:about=''
xmlns:dc='http://purl.org/dc/elements/1.1/'
xmlns:myDoc='http://some.custom.uri/'>
<dc:format>image/jpeg</dc:format>
<myDoc:doctype>scan</myDoc:doctype>
<myDoc:originalfilename>20190519121225_003.jpg</myDoc:originalfilename>
<myDoc:originalimagewidth>1684</myDoc:originalimagewidth>
<myDoc:originalimageheight>2788</myDoc:originalimageheight>
<myDoc:langOCR>EN-US</myDoc:langOCR>
<myDoc:acquisitiondatetime>2019-05-19T12:12:25Z</myDoc:acquisitiondatetime>
<myDoc:documentdate>2019-01-02</myDoc:documentdate>
<myDoc:pagesindocument>6</myDoc:pagesindocument>
<myDoc:page>2</myDoc:page>
<myDoc:textcontent>
Bank
statement
02/01/2019
Page 2 of 6
( Here goes raw OCR content
as multiline text )
</myDoc:textcontent>
<dc:subject>
<rdf:Bag>
<rdf:li>bank</rdf:li>
<rdf:li>statement</rdf:li>
</rdf:Bag>
</dc:subject>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
<?xpacket end='w'?>
Does it make sense at all? I'm sure many people already worked on similar tasks, I don't want to reinvent the wheel...

What's the difference between `<seg>` and `<span>`

What's the difference between a <seg> in XML and <span> in HTML? Here are two passages from Bibles, one from the English Bible in Christodouloupoulos' and Steedman's massively parallel Bible corpus,
<?xml version="1.0" ?>
<cesDoc version="4">
…
<text>
<body id="Bible" lang="en">
<div id="b.GEN" type="book">
<div id="b.GEN.1" type="chapter">
<seg id="b.GEN.1.1" type="verse">
In the beginning God created the heaven and the earth.
</seg>
<seg id="b.GEN.1.2" type="verse">
And the earth was without form, and void; and darkness was upon the face of the deep. And the Spirit of God moved upon the face of the waters.
</seg>
…
and the other from the NIV English Bible at Bible Gateway, which is where they got most of their texts from:
<p class="chapter-1">
<span id="en-NIV-27932" class="text Rom-1-1">
<span class="chapternum">1 </span>
Paul, a servant of Christ Jesus, called to be an apostle and set apart for the gospel of God—
</span>
<span id="en-NIV-27933" class="text Rom-1-2">
<sup class="versenum">2 </sup>the gospel he promised beforehand through his prophets in the Holy Scriptures
</span>
…
In the HTML, it seems a <span> can replace a <seg>, except that the HTML has added verse numbers in <span>. Oh, and the chapters are in <div>. So it's not one-to-one.
Of course, I realize that HTML and XML are different, and this is only one juxtaposition; I'm sure there are others out there. But I'm going to need to be able to display XML as HTML, and I don't want to anger the doctype gods. So, conceptually, how is <seg> different from <span> in purpose, meaning and usage?
Update: @jim-garrison says I'm going to need to read the schema to understand the XML, but I'm a neophyte at that, too. In particular, I did find some official-looking documentation for <seg> by TEI that makes me think its use is a little more than arbitrary, but I have no idea how to interpret this documentation. Should it give us a more specific answer than what Jim has already written?
The difference between XML and HTML generally is that the list of tags that can be present in XML is defined by a DTD or XML Schema, and tags represent document semantics and not presentation. So tags can be named anything. In HTML the set of tags is generally predefined, as if there was a pre-existing HTML DTD or schema, but HTML is not XML and doesn't follow all the rules of XML. While HTML was in some sense derived from the same parent as XML (SGML), and the two are superficially very similar, they are most definitely NOT the same thing.
The answer to your specific question is that the writers of the XML chose to use a tag named <seg> ("segment"?) to represent generalized strings of text, with attributes providing additional semantic information. For more details you'll need to find the DTD or XML schema that governs the content of the XML and read the documentation that goes with it.
"But I'm going to need to be able to display XML as HTML, and I don't want to anger the doctype gods. So, conceptually, how is <seg> different from <span> in purpose, meaning and usage?"
This is where you will use XSLT to transform the input XML into valid HTML. To figure out how to do that transformation you will need to know the full semantics of all the tags that can appear (again, go to the documentation for the DTD/Schema) and decide on a visual representation for the data. There's no one answer to "how should a <seg> be transformed?" That's up to your requirements regarding presentation. One possible transformation converts <seg> tags to <span>, but that may depend on the value of certain attributes (type="verse" vs. some other type). It might even differ depending on the output medium (desktop vs. tablet vs. phone vs. watch vs. ...?)
Once you convert from XML to HTML you have left the realm of the Doctype gods and they have no interest in what you do :-) There's a whole different set of deities such as CSS-Cthulhu, Javascript-Janai'ngo (look it up), et al who will take great pleasure making your life miserable.
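The answer above recommends XSLT; as an illustration of the same idea in a general-purpose language, here is a rough sketch using Python's standard xml.etree.ElementTree instead. It is only one possible mapping: the function name seg_to_span, the choice to copy type into class, and the wrapping <div> are assumptions, not anything prescribed by the corpus schema.

```python
import xml.etree.ElementTree as ET


def seg_to_span(xml_text):
    """Map every <seg> in the input to an HTML <span> (one possible choice)."""
    root = ET.fromstring(xml_text)
    out = ET.Element("div")  # container for the generated spans
    for seg in root.iter("seg"):
        span = ET.SubElement(out, "span")
        if seg.get("id"):
            span.set("id", seg.get("id"))
        # Assumption: the semantic type ("verse") becomes a CSS class.
        span.set("class", seg.get("type", "seg"))
        span.text = (seg.text or "").strip()
    return ET.tostring(out, encoding="unicode", method="html")


# Fragment modelled on the corpus excerpt in the question.
SAMPLE = """<div id="b.GEN.1" type="chapter">
  <seg id="b.GEN.1.1" type="verse">
    In the beginning God created the heaven and the earth.
  </seg>
</div>"""

if __name__ == "__main__":
    print(seg_to_span(SAMPLE))
```

A different type value could just as well be mapped to a different HTML element; that decision belongs to the presentation requirements, not to the XML.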

How to translate an HTML structure to XML?

Suppose I have HTML structured like this:
<div class="veggie">carrot</div>
<div class="veggie">cucumber</div>
<div class="fruit">
<div class="citrus">orange</div>
<div class="citrus">lemon</div>
<div class="berry">grape</div>
</div>
<div class="veggie">lettuce</div>
<div class="dairy">milk</div>
But it's all on a single line like this:
<div class="veggie">carrot</div><div class="veggie">cucumber</div><div class="fruit"><div class="citrus">orange</div><div class="citrus">lemon</div><div class="berry">grape</div></div><div class="veggie">lettuce</div><div class="dairy">milk</div>
How can I translate it to XML like this:
<veggie>carrot</veggie>
<veggie>cucumber</veggie>
<fruit>
<citrus>orange</citrus>
<citrus>lemon</citrus>
<berry>grape</berry>
</fruit>
<veggie>lettuce</veggie>
<dairy>milk</dairy>
It sounds straightforward, but I have no clue where to start!
Doing this with regexes will be ugly and likely unreliable. First, regexes don't handle languages with nested structures, which HTML has. Second, HTML is not a clean language; it is full of errors that the browser builders in their wisdom decided to accept, ensuring further sloppy programming by HTML writers.
A clean way to do this is to parse the HTML (just like a compiler, using a "dirty" HTML-capable parser) and build an abstract syntax tree. (You might get away with using a browser DOM.) Then you apply transformations to the HTML AST to incrementally convert it into XML fragments. You can do the latter by writing ad hoc procedural code to do a recursive tree walk, check for special cases, and spit out XML. The procedural code is likely to look pretty ugly, because it is climbing up and down tree nodes, testing this, spitting out that, all over the place, and the more special cases you have, the messier this gets.
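As a rough illustration of the ad hoc procedural route, here is a minimal sketch using the event-driven html.parser module from Python's standard library rather than a full AST. It handles only the OP's case (each div becomes an element named after its class); the names DivToXml and divs_to_xml are made up for the example, and it assumes every class value is a single word that is also a valid XML element name.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape


class DivToXml(HTMLParser):
    """Rename each <div class="x"> to <x>, keeping text and nesting."""

    def __init__(self):
        super().__init__()
        self.stack = []   # XML names of the divs that are currently open
        self.parts = []   # pieces of the output document

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return  # other tags are simply dropped in this sketch
        # Assumption: the class attribute is a single valid XML name.
        name = dict(attrs).get("class") or "div"
        self.stack.append(name)
        self.parts.append(f"<{name}>")

    def handle_endtag(self, tag):
        if tag == "div" and self.stack:
            self.parts.append(f"</{self.stack.pop()}>")

    def handle_data(self, data):
        if data.strip():
            self.parts.append(escape(data.strip()))


def divs_to_xml(html):
    parser = DivToXml()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


if __name__ == "__main__":
    print(divs_to_xml('<div class="veggie">carrot</div>'
                      '<div class="fruit"><div class="citrus">orange</div></div>'))
```

Because the parser works on the tag stream rather than on lines, it does not matter that the input is all on one line. Every additional HTML construct would need its own handling here, which is the messiness the paragraph above warns about.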
A nice way to do this is with a program transformation system (PTS). A good PTS will let you define parsers and prettyprinters for (dirty) HTML and XML; you can then parse the HTML and the PTS will build an AST as suggested in the previous paragraph. The value in the PTS is that you can usually define transformation rules using the "surface syntax" of the source and target languages, e.g., you can say "if you see this HTML pattern, then replace it by that XML pattern". Here are a few examples:
rule replace div_class(a: attribute, t: text_content): HTMLnode -> XMLNode
= " <div class=\a>\t</div> "
-> "< \a > \t </\a> ";
This rule matches the HTML AST (not text) for a div with only text content, and maps it to an XML tag named after the class attribute, with the same text content, matching part of what OP wants. The double quotes are meta quotes, to distinguish rewrite-rule syntax from source or target language syntax. The match-to part is written in HTML syntax with metavariable escapes \a and \t corresponding to the values found that satisfy the match. Note that this rule can only match HTML tags that contain only text as their body, because of the constraint on t. The replacement part generates the desired XML tag and content, substituting the values of the matched metavariables.
For the more complicated part of OP's example, where the HTML content is not just text, we need this rule:
rule replace div_class(a: attribute, c: content): HTMLnode -> XMLNode
= " <div class=\a>\c</div> "
-> "< \a > \c </\a> "
if ~ match(c,text_content);
The \c will match anything, so that's too general, but the extra "if" constraint checks that \c is NOT text_content. This rule will run where the previous rule will not, and vice-versa.
I think that covers all the OP's example, for the basics.
Without any other constraints, both rules will run wherever they can on the AST, and the order of application of these rules doesn't matter. Conceptually, each rule converts "yellow" HTML nodes to "blue" XML nodes; collectively, the rules convert all the yellow patches to blue patches.
OP probably needs additional rules to translate the other parts of the HTML document to XML in whatever way he desires; HTML being a fairly big language, he may have to write a bunch of rules to fill this out properly. The point is that he can write the rules largely in this same surface syntax style [as a practical matter, you often have to add some procedural code to the rules to make it all glue together right, just a lot less than the pure ad hoc way]. (Writing this as ad hoc code won't save any effort; OP will still have to handle all the HTML tag types.)
Different PTS's express rules differently. I am using the rewrite rule syntax from my own PTS [DMS Software Reengineering Toolkit]. Ideally, the PTS already has available definitions of HTML and XML; DMS does.

Long division symbol (HTML and/or CSS)

I'm looking for a way to display the traditional long division symbol using HTML/CSS (kinda like what's shown here: http://barronstestprep.com/blog/wp-content/uploads/2013/04/longdiv1.png).
This (http://www.fileformat.info/info/unicode/char/27cc/index.htm) is basically what I need, but I don't think many people would have the proper font installed on their computer to see it (I don't, at least).
I've also tried this (below), but it doesn't display consistently on Chrome and FF...
4<span style="text-decoration: overline;"><span style="font-size: 14px">)</span>84</span>
This should be displaying 84 ÷ 4 with the long division box.
Ideas?
<span style="border-right: 1px black solid; border-radius: 0px 0px 10px 0px">
4
</span>
<span style="border-top: 1px black solid; ">
84
</span>
The concept and notation of “long division” belong to some traditions of teaching arithmetic at school, and the notation is used in contexts where the steps of integer division are explained graphically. There is no reliable way to do this in HTML and CSS except by using images, either one large image containing an entire long division as a process, or piecewise images, e.g. one piece containing just a number, the long division operator, and another number (as in the image referred to in the question). This is how e.g. http://www.mathsisfun.com/long_division.html does this. The page http://en.wikipedia.org/wiki/Long_division uses preformatted text, constructing symbols from ASCII characters like “)” and “_”, but the result is primitive-looking and is not robust (e.g., turns to gibberish in a screen reader).
When using an image, you should write an alt text that expresses the idea verbally. This somewhat depends on context, but I’m afraid it would need to be longish, like alt="long division with divisor 4, dividend 84".
Using just HTML and CSS to construct long divisions is rather hopeless, since HTML and CSS are rather powerless with anything involving essential two-dimensionality in math notations (i.e., mathematical expressions that are not simple linear sequences of characters). Even constructing a square root expression, with a vinculum extending over the radicand, requires trickery that easily fails, more or less, and showing such an expression is similar to, but essentially simpler than a long division expression.
The character U+27CC LONG DIVISION would theoretically let you write a long division expression, even in plain text, since it is defined in the Unicode standard so that it “graphically extends over the dividend”. This is however largely theoretical, for several reasons. In addition to limited font coverage (which could be dealt with using a downloadable font with @font-face), the approach suffers from lack of software support. The idea “graphically extends over the dividend” is not easily implemented. While browsers may (when using a suitable font) render 84⟌4 properly, they fail with 84⟌42 (the symbol extends over the “4” after it but not over the “2”). The reason seems to be that in fonts that contain U+27CC, it might be implemented with advance rules that make the operator appear to extend over the next digit, but to make it extend over the next number (digit sequence), software support above the simple font level would be needed.
In HTML5, you can directly use MathML. MathML 3 supports the <mlongdiv> element:
<figure>
<math>
<mlongdiv>
<mn>4</mn>
<mn></mn>
<mn>84</mn>
</mlongdiv>
</math>
<figcaption>
This will display as a long division in browsers that support MathML 3.
</figcaption>
</figure>
For MathML 2 you can use a Javascript solution based on LaTeX, such as MathJax. Here is a long division example which uses the MathJax TeX parser to parse input of the form \longdiv{84}{4}.

How to generate hash from ~200k text/html that would match/compare to similar text?

I would like to make a sort of hash key out of a text (in my case html) that would match/compare to the hash of other similar text
ex of matching texts:
"2012/10/01 This is my webpage #1"+ 100k_of_same_text + random_words_1 + ..
"2012/10/02 This is my webpage #2"+ 100k_of_same_text + random_words_2 + ..
...
"2012/10/02 This is my webpage #2"+ 100k_of_same_text + random_words_3 + ..
So far I've thought of removing numbers and tags, but that would still leave the random words.
Is there anything out there that does this?
I have root access to the server, so I can add any UDF that is necessary, and if needed I can do the processing in C or other languages.
The ideal would be a function like generateSimilarHash(text) and another function compareSimilarHashes(hash1,hash2) that would return the percentage of matching text.
A function like compare(text1,text2) would not work in my case, as I have many pages to compare (~20 million at the moment).
Any advice is welcomed!
UPDATE:
I'm referring to a hash function as it is described on Wikipedia:
A hash function is any algorithm or subroutine that maps large data
sets of variable length to smaller data sets of a fixed length.
the fixed length part is not necessary in my case.
It sounds like you need to utilize a program like diff.
If you are just trying to compare text, a hash is not the way to go, because slight differences in input cause total and complete differences in output. (That is the reason why they are used to encode passwords and secure text.) Character difference programs are pretty complicated; unless you really are interested in how they work and are trying to write your own, I would just use a solution like the one shown here, which uses sdiff to get a percentage.
Percentage value with GNU Diff
You could use some sort of Levenshtein distance algorithm. This works for small pieces of text, but I'm rather sure that something similar can be applied to large chunks of text.
Ref: http://en.m.wikibooks.org/wiki/Algorithm_implementation/Strings/Levenshtein_distance
I've found out that tag order in webpages can create a very distinctive pattern, that remains the same even if portions of text / css / script change. So I've made a string generated by the tag order (ex: html head meta title body div table tr td span bold... => "hhmtbdttsb...") and then I just do exact matches between these strings. I can even apply the Levenshtein distance algorithm and get accurate results.
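The tag-order idea can be sketched roughly as follows. This is not the poster's actual code: the names tag_signature, levenshtein and similarity are made up for illustration, the signature simply takes the first letter of every opening tag, and the edit distance is the textbook dynamic-programming version, which is quadratic in the signature length.

```python
from html.parser import HTMLParser


class TagOrder(HTMLParser):
    """Collect the first letter of every opening tag, in document order."""

    def __init__(self):
        super().__init__()
        self.letters = []

    def handle_starttag(self, tag, attrs):
        self.letters.append(tag[0])


def tag_signature(html):
    parser = TagOrder()
    parser.feed(html)
    parser.close()
    return "".join(parser.letters)


def levenshtein(a, b):
    """Classic edit distance, keeping only one previous row in memory."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def similarity(sig1, sig2):
    """Share of matching structure, between 0.0 and 1.0."""
    longest = max(len(sig1), len(sig2))
    return 1.0 if longest == 0 else 1 - levenshtein(sig1, sig2) / longest


if __name__ == "__main__":
    a = tag_signature("<html><body><div><p>2012/10/01 page #1</p></div></body></html>")
    b = tag_signature("<html><body><div><p>2012/10/02 page #2</p><p>extra</p></div></body></html>")
    print(a, b, similarity(a, b))
```

Exact matches can be found by storing the signature (or an ordinary hash of it) in an indexed column; the edit-distance comparison is then needed only for candidate pairs rather than for all pairs of pages.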
If I didn't have html, I would have used the punctuation/end-lines for splitting, or something similar.