Azure Translator Text API: What is the definition of a character? - microsoft-translator

The pricing of the Translator Text API belonging to the Azure Cognitive Services family is based on characters.
But what is the definition of a character?
Some examples:
Do spaces, punctuation and line breaks count as characters?
This is , a
test.
When translating HTML, does every character count, including angle brackets, tags, slashes, etc.?
<p>This is<br>
a
test.</p>
For the sake of completeness: I suppose only the text that is being sent to the API for translation counts (request characters) and not what comes back (response), right?

This is answered here under character counts. All of the above examples count as text. Responses do not count.
Copying from there:
What counts is:
Text passed to the Translator Text API in the body of the request
Text when using the Translate, Transliterate, and Dictionary Lookup methods
Text and Translation when using the Dictionary Examples method
All markup: HTML, XML tags, etc. within the text field of the request body. JSON notation used to build the request (for instance "Text:") is not counted.
An individual letter
Punctuation
A space, tab, markup, and any kind of white space character
Every code point defined in Unicode
A repeated translation, even if you have translated the same text previously
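To make those rules concrete, here is a minimal sketch of my own (not an official billing tool) applied to the HTML example from the question: every code point inside the Text values of the request body counts, markup and line breaks included, while the surrounding JSON notation does not.
body = [{"Text": "<p>This is<br>\na\ntest.</p>"}]

# len() counts Unicode code points, matching the "every code point" rule above.
billed = sum(len(item["Text"]) for item in body)
print(billed)  # 26: the spaces, line breaks and HTML tags are all included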

Related

Entering ascii into html text box

I'm doing a cybersecurity capture-the-flag challenge and attempting a buffer overflow on a server. It has an HTML text box that I'm trying to overflow with particular values. How can I enter ASCII characters into this text box? The characters entered after a certain buffer length seem to be converted into their ASCII values, so I'm trying to enter characters like NUL, EOT, etc. into the text box.
You can use the hex values; see this or this.
For example, from Python you could use something like:
param = "\x00\x04\x03\x03"
And then send it as a GET request (see urllib2, requests, or httplib2).
In the URL, you must add the % character before each hex code:
yourpage.html?param=%00%04%03%04
Also look at this link.
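Putting the pieces together, here is a hedged sketch using the requests library mentioned above; the URL, parameter name and filler length are placeholders of mine, not values from the challenge:
import requests

# Filler up to the suspected buffer length, then the raw control bytes (NUL, EOT, ...).
payload = "A" * 64 + "\x00\x04\x03\x03"

# requests percent-encodes the parameter value, so the query string ends in ...%00%04%03%03
r = requests.get("http://target.example/form", params={"field": payload})
print(r.url)
print(r.status_code)
If the form actually submits via POST, the same payload can go in data={"field": payload} instead.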

{{=XML(some thing)}} can't be parsed in html with web2py

I am using magicsuggest as an auto-complete plugin in a web application built with web2py. I define a list variable dt=['张','李'] in model/db.py; the elements of the list are Chinese. However, when I embed the variable in the HTML as {{=XML(dt)}}, following the magicsuggest manual, the Chinese characters are garbled. After several days of searching, I found that the Chinese characters of the list variable are encoded as hex escapes in the HTML. I know something is wrong with the encoding/decoding. Could someone help me display the Chinese characters correctly in the HTML?
XML() is meant to take a string, not a list of strings. If you pass it something other than a string, it will first be converted to a string, so your code is equivalent to {{=XML(str(dt))}}, and you'll notice that in Python 2, str(['张','李']) yields "['\\xe5\\xbc\\xa0', '\\xe6\\x9d\\x8e']".
Instead, you can do {{=XML(dt[0])}}, and you will see the first character in the list displayed properly.
If you want to display a comma-separated list surrounded by brackets, you can do:
{{=json.dumps(dt, encoding="UTF-8", ensure_ascii=False)}}
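For reference, here is roughly what that call produces. Note that the encoding="UTF-8" argument only exists on Python 2 (which web2py used at the time); on Python 3 it should be dropped:
import json

dt = ['张', '李']
print(json.dumps(dt, ensure_ascii=False))  # ["张", "李"], ready to embed as a JavaScript array
print(json.dumps(dt))                      # ["\u5f20", "\u674e"], escaped but still valid JSON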

How to extract pdf text from BT and ET section line by line and how to decode Tj block to unicode in pdfbox

As the title says, I have a special requirement to extract text line by line, or block by block, between BT and ET.
Below is the PDF content. I tried the PDFTextStripper class, but it is not what I want,
so does anyone have a solution to this problem?
I want to parse this: [ (=\324Z\016) ] TJ.......
this is my pdf: https://dl.dropboxusercontent.com/u/63353043/docu.pdf
Below is my code:
List<Object> tokens = pages.get(0).getContents().getStream().getStreamTokens();
tokens.forEach(s -> {
    if (s instanceof COSString) {
        System.out.print(s.toString());
    }
});
But I get this:
COSString{&Ð}COSString{O6}COSString{&³}COSString{p»}COSString{6±}COSString{˛¨}COSString{+^}COSString{+·}COSString{˚©}COSString{9}COSString{O©}COSString{en}COSString{˛¨}COSString{Fœ}COSString{0ł}COSString{Q¯}COSString{#”}COSString{˛¨}COSString{+^}COSString{(Ï}COSString{˚©}COSString{9}COSString{O©}COSString{en}COSString{Zo}COSString{#°}COSString{˜}COSString{p»}COSString{#Š}COSString{5×}COSString{,
}COSString{:É}COSString{(Ù}COSString{4ÿ}COSString{ä}COSString{_Á}COSString{˛¨}COSString{:É}COSString{p»}COSString{O©}COSString{en}COSString{#p}COSString{/F}COSString{O©}COSString{en}COSString{F,}COSString{_N}COSString{!}COSString{9»}COSString{]˘}COSString{!¢}COSString{˜.}COSString{p»}COSString{#°}COSString{˜}COSString{#p}COSString{<:}COSString{Zo}COSString{1¸}COSString{ä}COSString{˚~}COSString{F³}COSString{!Ø}COSString{]Š}COSString{2}COSString{6±}COSString{˛¨}COSString{gî}COSString{+·}COSString{9á}COSString{XS}COSString{hP}COSString{h[}COSString{˜º}COSString{˚.}COSString{p»}COSString{5d}COSString{5×}COSString{]˘}COSString{_ö}COSString{#c}COSString{2˚}COSString{]˜}COSString{+·}COSString{9á}COSString{p»}COSString{#b}COSString{˚.}COSString{eÚ}COSString{;
}COSString{!5}COSString{:É}COSString{XS}COSString{hP}COSString{h[}COSString{˜º}COSString{p»}COSString{B!}COSString{&Ø}COSString{,}COSString{/F}COSString{^r}COSString{˛²}COSString{2&}COSString{˜.}COSString{N}COSString{*ø}COSString{˜.}COSString{&¢}COSString{+B}COSString{+·}COSString{9á}COSString{ä}COSString{ZX}COSString{˚}COSString{˛µ}COSString{6«}COSString{0ł}COSString{!Ø}COSString{p»}COSString{O©}COSString{en}COSString{F,}COSString{Q±}COSString{"}}COSString{XS}COSString{&Ð}COSString{_N}COSString{!}COSString{9»}COSString{˛²}COSString{F,}COSString{(Ï}COSString{O©}COSString{en}COSString{F³}COSString{/?}COSString{˛¨}COSString{=­}COSString{˚4}COSString{9}COSString{p»}COSString{;}COSString{5×}COSString{_Á}COSString{˜³}COSString{#É}COSString{0·}COSString{F,}COSString{Q±}COSString{"}}COSString{p»}COSString{1û}COSString{"}}COSString{˚.}COSString{O©}COSString{en}COSString{p»}COSString{#°}COSString{˜}COSString{Lê}COSString{5d}COSString{Zo}COSString{1¸}COSString{"G}COSString{˚.}COSString{ä}COSString{&Ð}COSString{Kå}COSString{Yª}COSString{#°}COSString{#´}COSString{5ê}COSString{p»}COSString{(Ï}COSString{O©}COSString{en}COSString{ZR}COSString{pÉ}COSString{p„}COSString{˜}COSString{G“}COSString{_û}COSString{%v}COSString{pÎ}COSString{1û}COSString{"}}COSString{1¹}COSString{F,}COSString{˛µ}COSString{5×}COSString{˜}COSString{2˚}COSString{]˜}COSString{+·}COSString{9á}COSString{p»}COSString{O´}COSString{5×}COSString{#b}COSString{˚.}COSString{+·}COSString{9á}COSString{p»}COSString{˜}COSString{]y}COSString{/0}COSString{}COSString{2§}COSString{˚.}COSString{˛¨}COSString{8E}COSString{N}COSString{*ø}COSString{22}COSString{++}COSString{&¢}COSString{+B}COSString{)%}COSString{ä}COSString{&Ð}COSString{!Í}COSString{˚b}COSString{f¨}COSString{Y)}COSString{.}COSString{"Q}COSString{5ê}COSString{p»}COSString{)*}COSString{7D}COSString{˛¨}COSString{˜³}COSString{˚b}COSString{P¥}COSString{&Ð}COSString{!Í}COSString{˚b}COSString{˛µ}COSString{G“}COSString{0m}COSString{F,}COSString{0m}COSString{<i}COSString{˛³}COSString{p»}COSString{;fi}COSString{˛µ}COSString{BÞ}COSString{\}COSString{F,}COSString{BO}COSString{Bˆ}COSString{Q™}COSString{-Ž}COSString{F,}COSString{!Ñ}COSString{Fr}COSString{p»}COSString{$™}COSString{/½}COSString{BO}COSString{Bˆ}COSString{F,}COSString{#™}COSString{5×}COSString{˛¨}COSString{nƒ}COSString{nƒ}COSString{p»}COSString{˚}COSString{5×}COSString{f‰}COSString{P¥}COSString{#Š}COSString{\\}COSString{F,}COSString{p°}COSString{1¹}COSString{<:}COSString{6±}COSString{C®}COSString{DÙ}COSString{˛µ}COSString{Q¯}COSString{_Á}COSString{9Ë}COSString{F,}COSString{˚b}COSString{#°}COSString{˜}COSString{p»}COSString{_Á}COSString{9Ë}COSString{F,}COSString{˚b}COSString{˚}COSString{<:}COSString{6±}COSString{C®}COSString{DÙ}COSString{˛µ}COSString{Cˆ}COSString{/?}COSString{1¸}COSString{"G}COSString{p°}COSString{p“}COSString{/4}COSString{˜.}COSString{p»}COSString{+·}COSString{O©}COSString{en}COSString{F,}COSString{˚3}COSString{9}COSString{7D}COSString{#Þ}COSString{DÙ}COSString{;}COSString{T}COSString{T`}COSString{5“}COSString{˛²}COSString{p»}COSString{]2}COSString{ }COSString{]2}COSString{(Ï}COSString{p°}COSString{:}COSString{6«}COSString{Må}COSString{&Ð}COSString{˛µ}COSString{M;}COSString{0·}COSString{e;}COSString{O«}COSString{aw}COSString{˚b}COSString{F,}COSString{FÇ}COSString{ZH}
(If this wasn't so long, it would have been more appropriate as a comment to the question instead of as an answer. But comments are too limited.)
The OP in his question shows that he essentially wants to parse the content stream of e.g. a page and extract the strings drawn in a legible form. He attempts this by simply taking the tokens in the content stream and looking at the COSString instances in there:
List<Object> tokens = pages.get(0).getContents().getStream().getStreamTokens();
tokens.forEach(s -> {
    if (s instanceof COSString) {
        System.out.print(s.toString());
    }
});
Unfortunately the output looks like a mess.
Why do the string values in the content stream look so messy?
The reason for this is that those COSString instances represent the PDF string objects as they are, and that there is no single encoding of PDF string objects in content streams, not even a limitation to a few standardized ones.
The encoding of a string completely depends on the definition of the font currently active when the string drawing instruction in question is executed.
Fonts in PDFs can be defined to use either some standard encoding or a custom one, and it is very common, in particular in the case of embedded font subsets, to use a custom encoding constructed similar to this:
code 1 maps to the glyph of this font which is used first on the page,
code 2 to the second glyph of this font used on the page which is not identical to the first,
code 3 to the third glyph of this font used on the page not identical to either of the first two,
etc...
Obviously there is no good way to conjecture the meaning of string bytes for such encodings.
Thus, when parsing the content stream, you have to keep track of the current font and look up the meaning of each byte (or multi-byte sequence!) of a COSString in the definition of that current font in the resource dictionary of the current page.
How to map those messy bytes using the current font definition?
The encoding of a PDF font might have to be determined in different ways.
There may be a ToUnicode map in the font definition which shows you which bytes to map to which character.
Otherwise the encoding may be a standard encoding like MacRomanEncoding or WinAnsiEncoding in which case one has to bring along the mapping table oneself (they are printed in the PDF specification).
Otherwise the encoding might be based on such a standard encoding but deviations are given by a mapping from codes to names of glyphs. If certain standard names are used, the character can be derived from that name. These names are listed in another document.
Otherwise some CIDSystemInfo entry may point to yet another standard Registry and Ordering from which to derive a mapping table specified in other documents.
Otherwise the font program itself may include usable mappings to Unicode.
Otherwise ???
Any pitfalls to evade?
The current font is a PDF graphics state attribute. Thus, one does not only have to remember the most recently set font but also consider the effects of operations changing the whole graphics state, in particular the save-graphics-state and restore-graphics-state operations which push the current graphics state onto a stack or pop it there-from.
How can PDF libraries help you?
PDF libraries which support you in text extraction can do so by doing all the heavy lifting for you,
parsing the content stream,
keeping track of the graphics state,
determining the encoding of the current font,
translating any drawn PDF strings using that encoding,
and forwarding only the resulting characters and some extra data (current position on the page, text drawing orientation, font and font size, colors, and other effects) to you.
In the case of PDFBox 2.0.x, this is what the PDFTextStreamEngine class does for you: you only have to override its processTextPosition method, to which PDFBox forwards that enriched character information in the TextPosition parameter. As you also want to know the starts and ends of the BT ... ET text object envelopes, you also have to override beginText and endText.
The class PDFTextStripper is based on that class and collects and sorts those character information bits to build a string containing the page text which it eventually returns.
In PDFBox 1.8.x there was a very similar PDFTextStripper class, but the base functionality was not as cleanly separated into a base class; everything was somewhat more intermingled, and it was harder to implement one's own extraction logic.
(In other PDF libraries there are similar constructs, sometimes event-based like in PDFBox, sometimes as a collected sequence of TextPosition-like objects.)

Why is "&reg" being rendered as "®" without the bounding semicolon

I've been running into a problem that was revealed through our Google adwords-driven marketing campaign. One of the standard parameters used is "region". When a user searches and clicks on a sponsored link, Google generates a long URL to track the click and sends a bunch of stuff along in the referrer. We capture this for our records, and we've noticed that the "Region" parameter is coming through incorrectly. What should be
http://ravercats.com/meow?foo=bar&region=catnip
is instead coming through as:
http://ravercats.com/meow?foo=bar®ion=catnip
I've verified that this occurs in all browsers. It's my understanding that HTML entity syntax is defined as follows:
&VALUE;
where the leading boundary is the ampersand and the closing boundary is the semicolon. Seems straightforward enough. The problem is that this isn't being respected for the ® entity, and it's wreaking all kinds of havoc throughout our system.
Does anyone know why this is occurring? Is it a bug in the DTD? (I'm looking for the current HTML DTD to see if I can make sense of it) I'm trying to figure out what would be common across browsers to make this happen, thus my looking for the DTD.
Here is a proof you can use. Take this code, make an HTML file out of it and render it in a browser:
<html>
http://foo.com/bar?foo=bar&region=US&register=lowpass&reg_test=fail&trademark=correct
</html>
EDIT: To everyone who's suggesting that I need to escape the entire URL, the example URLs above are exactly that, examples. The real URL is coming directly from Google and I have no control over how it is constructed. These suggestions, while valid, don't answer the question: "Why is this happening".
Although valid character references always have a semicolon at the end, some invalid named character references without a semicolon are, for backward compatibility reasons, recognized by modern browsers' HTML parsers.
Either you know what that entire list is, or you follow the HTML5 rules for when & is valid without being escaped (e.g. when followed by a space), or otherwise always escape & as &amp; whenever in doubt.
For reference, the full list of named character references that are recognized without a semicolon is:
AElig, AMP, Aacute, Acirc, Agrave, Aring, Atilde, Auml, COPY, Ccedil,
ETH, Eacute, Ecirc, Egrave, Euml, GT, Iacute, Icirc, Igrave, Iuml, LT,
Ntilde, Oacute, Ocirc, Ograve, Oslash, Otilde, Ouml, QUOT, REG, THORN,
Uacute, Ucirc, Ugrave, Uuml, Yacute, aacute, acirc, acute, aelig,
agrave, amp, aring, atilde, auml, brvbar, ccedil, cedil, cent, copy,
curren, deg, divide, eacute, ecirc, egrave, eth, euml, frac12, frac14,
frac34, gt, iacute, icirc, iexcl, igrave, iquest, iuml, laquo, lt,
macr, micro, middot, nbsp, not, ntilde, oacute, ocirc, ograve, ordf,
ordm, oslash, otilde, ouml, para, plusmn, pound, quot, raquo, reg,
sect, shy, sup1, sup2, sup3, szlig, thorn, times, uacute, ucirc,
ugrave, uml, uuml, yacute, yen, yuml
However, it should be noted that named character references from the above list are not processed as such by conforming HTML5 parsers when they appear in an attribute value and the next character is a = or an alphanumeric ASCII character.
For the full list of named character references with or without ending semicolons, see here.
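As a quick way to see this behaviour outside a browser, Python's html.unescape() follows the same HTML5 text-content rules, including the legacy names above that work without a semicolon (a small sketch of mine, not part of the spec):
import html

url = "http://ravercats.com/meow?foo=bar&region=catnip"
print(html.unescape(url))  # http://ravercats.com/meow?foo=bar®ion=catnip ("&reg" was consumed)
print(html.escape(url))    # http://ravercats.com/meow?foo=bar&amp;region=catnip (safe to put in HTML)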
This is a very messy business and depends on context (text content vs. attribute value).
Formally, by HTML specs up to and including HTML 4.01, an entity reference may appear without trailing semicolon, if the next character is not a name character. So e.g. &region= would be syntactically correct but undefined, as entity region has not been defined. XHTML makes the trailing semicolon required.
Browsers have traditionally played by other rules, though. Due to the common syntax of query URLs, they parse e.g. href="http://ravercats.com/meow?foo=bar&region=catnip" so that &region is not treated as an entity reference but as just text data. And authors mostly used such constructs, even though they are formally incorrect.
Contrary to what the question seems to be saying, href="http://ravercats.com/meow?foo=bar&region=catnip" actually works well. Problems arise when the string is not in an attribute value but inside text content, which is rather uncommon: we don’t normally write URLs in text. In text, &region= gets processed so that &reg is recognized as an entity reference (for “®”) and the rest is just character data. Such odd behavior is being made official in HTML5 CR, where clause 8.2.4.69 Tokenizing character references describes the “double standard”:
If the character reference is being consumed as part of an attribute,
and the last character matched is not a ";" (U+003B) character, and
the next character is either a "=" (U+003D) character or in the range
ASCII digits, uppercase ASCII letters, or lowercase ASCII letters,
then, for historical reasons, all the characters that were matched
after the U+0026 AMPERSAND character (&) must be unconsumed, and
nothing is returned.
Thus, in an attribute value, even &reg= would not be treated as containing a character reference, and still less &region=. (But reg_test= is a different case, due to the underscore character.)
In text content, other rules apply. The construct &region= causes then a parse error (by HTML5 CR rules), but with well-defined error handling: &reg is recognized as a character reference.
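To illustrate the quoted tokenizer clause, here is a toy sketch of mine (not a real HTML parser, and the entity table is a tiny hand-picked subset) of the decision it describes:
import string

LEGACY_NO_SEMICOLON = {"reg": "\u00ae", "amp": "&"}  # tiny subset for the demo

def resolve(name, next_char, in_attribute):
    """Replacement for '&<name>' without a trailing ';', or None to leave it unconsumed."""
    if name not in LEGACY_NO_SEMICOLON:
        return None
    # "for historical reasons ... unconsumed": inside an attribute value, bail out
    # when the next character is '=' or an ASCII letter or digit.
    if in_attribute and (next_char == "=" or next_char in string.ascii_letters + string.digits):
        return None
    return LEGACY_NO_SEMICOLON[name]

print(resolve("reg", "i", in_attribute=True))   # None: an href keeps "&region=catnip" intact
print(resolve("reg", "i", in_attribute=False))  # '®': text content shows "®ion=catnip"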
Maybe try replacing your & with &amp;? Ampersands are characters that must be escaped in HTML as well, because they are reserved for use as parts of entities.
1: The following markup is invalid in the first place (use the W3C Markup Validation Service to verify):
In the above example, the & character should be encoded as &amp;, like so:
2: Browsers are tolerant; they try to make sense out of broken HTML. In your case, anything that looks like a valid HTML entity is converted to the corresponding character.
Here is a simple solution, though it may not work in all instances: reorder the query string so that the parameter beginning with reg does not directly follow an ampersand.
So from this:
http://ravercats.com/meow?status=Online&region=Atlantis
To This:
http://ravercats.com/meow?region=Atlantis&status=Online
Because, as we know, &reg triggers the special character ®.
Caveat: If you have no control over the order of your URL query string parameters then you'll have to change your variable name to something else.
Escape your output!
Simply enough, you need to encode the URL format into HTML format for accurate representation (ideally you would do so with a template engine's variable-escaping function, but barring that, with htmlspecialchars($url) or htmlentities($url) in PHP).
See your test case and then the correctly encoded html at this jsfiddle:
http://jsfiddle.net/tchalvakspam/Fp3W6/
The (non-running) code from that fiddle:
<div>
Unescaped:
<br>
http://foo.com/bar?foo=bar&region=US&register=lowpass&reg_test=fail&trademark=correct
</div>
<div>
Correctly escaped:
<br>
http://foo.com/bar?foo=bar&amp;region=US&amp;register=lowpass&amp;reg_test=fail&amp;trademark=correct
</div>
It seems to me that what you have received from Google is not an actual URL but a variable which refers to a URL (a query string). That's why it's being parsed as a registered trademark sign when rendered.
I would say you ought to URL-encode it and decode it whenever processing it, like any other variable containing special entities.
To prevent this from happening you should encode the URLs, which replaces characters like the ampersand with a % followed by a hexadecimal number in the URL.

What characters are allowed in the HTML Name attribute inside input tag?

I have a PHP script that will generate <input>s dynamically, so I was wondering if I needed to filter any characters in the name attribute.
I know that the name has to start with a letter, but I don't know any other rules. I figure square brackets must be allowed, since PHP uses these to create arrays from form data. How about parentheses? Spaces?
Note that not all characters are submitted in the name attributes of form fields (even when using POST)!
Leading and trailing white-space characters are trimmed, and inner white-space characters as well as the character . are replaced by _.
(Tested in Chrome 23, Firefox 13 and Internet Explorer 9, all Win7.)
Any character you can include in an [X]HTML file is fine to put in an <input name>. As Allain's comment says, <input name> is defined as containing CDATA, so the only things you can't put in there are the control codes and invalid codepoints that the underlying standard (SGML or XML) disallows.
Allain quoted W3 from the HTML4 spec:
Note. The "get" method restricts form data set values to ASCII characters. Only the "post" method (with enctype="multipart/form-data") is specified to cover the entire ISO10646 character set.
However this isn't really true in practice.
The theory is that application/x-www-form-urlencoded data doesn't have a mechanism to specify an encoding for the form's names or values, so using non-ASCII characters in either is “not specified” as working and you should use POSTed multipart/form-data instead.
Unfortunately, in the real world, no browser specifies an encoding for fields even when it theoretically could, in the subpart headers of a multipart/form-data POST request body. (I believe Mozilla tried to implement it once, but backed out as it broke servers.)
And no browser implements the astonishingly complex and ugly RFC2231 standard that would be necessary to insert encoded non-ASCII field names into the multipart's subpart headers. In any case, the HTML spec that defines multipart/form-data doesn't directly say that RFC2231 should be used, and, again, it would break servers if you tried.
So the reality of the situation is that there is no way to know what encoding is being used for the names and values in a form submission, no matter what type of form it is. What browsers do with field names and values that contain non-ASCII characters is the same for GET and both types of POST form: they encode them using the encoding of the page containing the form. Non-ASCII GET form names are no more broken than everything else.
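A rough illustration of that ambiguity (the field name prénom and the encodings are just examples I picked): the same name serializes to different bytes depending on the page's encoding, and nothing in the request tells the server which one was used.
from urllib.parse import quote

name = "prénom"
print(quote(name, encoding="utf-8"))   # pr%C3%A9nom  (page served as UTF-8)
print(quote(name, encoding="cp1252"))  # pr%E9nom     (page served as Windows-1252)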
DLH:
So name has a different data type for <input> than it does for other elements?
Actually the only element whose name attribute is not CDATA is <meta>. See the HTML4 spec's attribute list for all the different uses of name; it's an overloaded attribute name, having many different meanings on the different elements. This is generally considered a bad thing.
However, typically these days you would avoid name except on form fields (where it's a control name) and param (where it's a plugin-specific parameter identifier). That's only two meanings to grapple with. The old-school use of name for identifying elements like <form> or <a> on the page should be avoided (use id instead).
The only real restriction on what characters can appear in form control names is when a form is submitted with GET
"The "get" method restricts form data set values to ASCII characters." reference
There's a good thread on it here.
While Allain's comment did answer the OP's direct question and bobince provided some brilliant in-depth information, I believe many people come here seeking the answer to a more specific question: "Can I use a dot character in a form input's name attribute?"
As this thread came up as the first result when I searched for this, I guessed I might as well share what I found.
Firstly, Matthias claimed that:
character . are replaced by _
This is untrue. I don't know if browsers actually did this kind of operation back in 2013, though I doubt it. Browsers send dot characters as they are (talking about POST data)! You can check it in the developer tools of any decent browser.
Please notice that tiny little comment by abluejelly, which is probably missed by many:
I'd like to note that this is a server-specific thing, not a browser thing. Tested on Win7 FF3/3.5/31, IE5/7/8/9/10/Edge, Chrome39, and Safari Windows 5, and all of them sent " test this.stuff" (four leading spaces) as the name in POST to the ASP.NET dev server bundled with VS2012.
I checked it with the Apache HTTP server (v2.4.25), and indeed an input name like "foo.bar" is changed to "foo_bar". But in a name like "foo[foo.bar]" that dot is not replaced by _!
My conclusion: you can use dots, but I wouldn't, as this may lead to some unexpected behaviour depending on the HTTP server used.
Do you mean the id and name attributes of the HTML input tag?
If so, I'd be very tempted to restrict (or convert) allowed "input" name characters into only a-z (A-Z), 0-9 and a limited range of punctuation (".", ",", etc.), if only to limit the potential for XSS exploits, etc.
Additionally, why let the user control any aspect of the input tag? (Might it not ultimately be easier, from a validation perspective, to keep the input tag names as 'custom_1', 'custom_2', etc. and then map these as required?)