Converting an "HTML entity" emoticon code in UTF16 (in c++) - html

I'm currently writing my own DrawTextEx() function that supports emoticons. Using this function, a callback is called every time an emoticon is found in the text, giving the caller the opportunity to replace the text segment containing the emoticon with an image. For example, the Unicode chars 0x3DD8 0x00DE found in a text will be replaced by a smiling face image while the text is drawn. This function currently works fine.
Now I want to implement an image library on the caller side. I receive a text segment like 0x3DD8 0x00DE in my callback function, and my idea is to use this code as the key in a map containing all the Unicode combinations, each linked to a structure containing the image to draw. I found a good package on the http://emojione.com/developers/ website. All the packages available on this site contain several files, whose names are hexadecimal codes. So I can iterate through the files contained in the package and create my map in an automatic way.
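To illustrate, the kind of lookup table I have in mind is roughly this (EmoticonImage and its fields are just placeholders, not an existing API):

#include <map>
#include <string>

// placeholder structure; in the real library it would contain the image data
struct EmoticonImage
{
    std::wstring m_FileName;
};

// key = the UTF-16 text segment received by the DrawTextEx() callback
std::map<std::wstring, EmoticonImage> g_Emoticons;

// called from the callback: returns the image to draw, or NULL if none
const EmoticonImage* FindEmoticon(const std::wstring& segment)
{
    std::map<std::wstring, EmoticonImage>::const_iterator it = g_Emoticons.find(segment);
    return (it != g_Emoticons.end()) ? &it->second : NULL;
}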
However, I found that these codes are part of another standard, and are in fact a set of items named "HTML entities", apparently used in web development, as can be seen on the http://graphemica.com/%F0%9F%98%80 website. So, to be able to use these files, I need a solution to convert the HTML entity values contained in their names into UTF-16 codes. For example, in the case of the above-mentioned smiling face, I need to convert the 0x1f600 HTML entity code to the 0x3DD8 0x00DE UTF-16 code.
A brute force approach would consist of writing a map that converts these codes, adding each of them to my code one by one. But as the Unicode standard contains, in the most optimistic scenario, more than 1800 combinations for the emoticons, I want to know if there is an existing solution, such as a known API or function, that I may use to do the job. Or is there a known trick to do that? (like e.g. "character + ('a' - 'A')" to convert an uppercase char to lower)
Regards

For example, the Unicode chars 0x3DD8 0x00DE found in a text will be replaced by a smiling face image
The character U+1F600 Grinning Face 😀 is represented by the UTF-16 code unit sequence 0xD83D, 0xDE00.
(Graphemica swapping the order of the bytes for each code unit is super misleading; ignore that.)
I found that these codes are part of another standard, and are in fact a set of items named "HTML entities", apparently used in web development
HTML has nothing to do with it. They're plain Unicode characters—just ones outside the Basic Multilingual Plane, above U+FFFF, which is why it takes more than one UTF-16 code unit to represent them.
HTML numeric character references like 😀 (often incorrectly referred to as entities) are a way of referring to characters by code point number, but the escape string is only effective in an HTML (or XML) document, and we're not in one of those.
So:
I need to convert the 0x1f600 HTML entity code to the 0x3DD8 0x00DE UTF-16 code.
sounds more like:
I need to convert representations of U+1F600 Grinning Face: from the code point number 0x1F600 to the UTF-16 code unit sequence 0xD83D, 0xDE00
Which in C# would be:
string face = Char.ConvertFromUtf32(0x1F600); // "😀" aka "\uD83D\uDE00"
or in the other direction:
int codepoint = Char.ConvertToUtf32("\uD83D\uDE00", 0); // 0x1F600
(the name ‘UTF-32’ is poorly-chosen here; we are talking about an integer code point number, not a sequence of four-bytes-per-character.)
Or is there a known trick to do that? (like e.g. "character + ('a' - 'A')" to convert an uppercase char to lower)
In C++ things are more annoying; there's nothing (that I can think of) that directly converts between code points and UTF-16 code units. You could use various encoding functions/libraries to convert between UTF-32-encoded byte sequences and UTF-16 code units, but that can end up more faff than just writing the conversion logic yourself, e.g. in its most basic form for a single character:
std::wstring fromCodePoint(int codePoint) {
    // BMP character: a single UTF-16 code unit
    if (codePoint < 0x10000) {
        return std::wstring(1, (wchar_t)codePoint);
    }
    // otherwise encode as a surrogate pair (the casts avoid narrowing warnings)
    wchar_t codeUnits[2] = {
        (wchar_t)(0xD800 + ((codePoint - 0x10000) >> 10)),
        (wchar_t)(0xDC00 + ((codePoint - 0x10000) & 0x3FF))
    };
    return std::wstring(codeUnits, 2);
}
This is assuming the wchar_t type is based on UTF-16 code units, same as C#'s string type is. On Windows this is probably true. Elsewhere it is probably not, but on platforms where wchar_t is based on code points, you can just pull each code point out of the string as a character with no further processing.
(Optimisation and error handling left as an exercise for the reader.)
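For example, feeding it the code point parsed from one of the emoji package's file names ("1f600.png" → 0x1F600):

// "1f600.png" → code point 0x1F600 → UTF-16 key for the image map
std::wstring key = fromCodePoint(0x1F600); // L"\xD83D\xDE00", i.e. 😀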

I'm using the RAD Studio compiler, and fortunately it provides an implementation for the ConvertFromUtf32 and ConvertToUtf32 functions mentioned by bobince. I tested them and they do exactly what I needed.
For those who don't use Embarcadero products, the fromCodePoint() implementation provided by bobince also works well. For information, here is the ConvertFromUtf32() function as implemented in RAD Studio, translated into C++:
std::wstring ConvertFromUtf32(unsigned c)
{
    const unsigned unicodeLastChar   = 1114111; // 0x10FFFF
    const wchar_t  minHighSurrogate  = 0xD800;
    const wchar_t  minLowSurrogate   = 0xDC00;
    const wchar_t  maxLowSurrogate   = 0xDFFF;

    // is the UTF-32 value out of bounds?
    if (c > unicodeLastChar || (c >= minHighSurrogate && c <= maxLowSurrogate))
        throw "Argument out of range - invalid UTF32 value";

    std::wstring result;

    // is the UTF-32 value a 16 bit value that can fit inside a wchar_t?
    if (c < 0x10000)
        result = wchar_t(c);
    else
    {
        // no, divide it into 2 surrogate code units
        c -= 0x10000;

        // convert the code point value to a UTF-16 surrogate pair
        result  = wchar_t((c / 0x400) + minHighSurrogate);
        result += wchar_t((c % 0x400) + minLowSurrogate);
    }

    return result;
}
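For completeness, here is also a minimal sketch of the reverse direction (the equivalent of ConvertToUtf32()), written in the same style and again assuming that wchar_t holds UTF-16 code units:

unsigned ConvertToUtf32(const std::wstring& str, std::size_t index)
{
    const wchar_t hi = str.at(index);

    // not a surrogate: the code unit is the code point
    if (hi < 0xD800 || hi > 0xDFFF)
        return hi;

    // otherwise a high surrogate must be followed by a low surrogate
    if (hi > 0xDBFF || index + 1 >= str.length())
        throw "Argument out of range - invalid surrogate pair";

    const wchar_t lo = str.at(index + 1);

    if (lo < 0xDC00 || lo > 0xDFFF)
        throw "Argument out of range - invalid surrogate pair";

    // recombine the two code units into a code point
    return 0x10000 + ((unsigned(hi) - 0xD800) << 10) + (unsigned(lo) - 0xDC00);
}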
Thanks to bobince for his response, which pointed me in the right direction and helped me to solve this problem.
Regards

Related

Why is "Warning: Implicit string type conversion from AnsiString to UnicodeString" here while both are Strings?

Here I get a warning: Warning: Implicit string type conversion from "AnsiString" to "UnicodeString"
....
{$mode DelphiUnicode}
{$H+}
....
Function THeader.ToHtml(Constref input: String): String;
Begin
  Result := Format('<h%d>%s</h%d>', [FLevel, Chunk(input), FLevel]); // <--- HERE !
End;
My project settings include -MDelphiUnicode. My Lazarus version is 2.2.2.
As I understand it, this means that if Chunk() returns symbols outside of ASCII (Unicode), then the Result will be problematic. Right? What should I do about this warning? Sure, I can cast the Format() result to String. But why is it required? I see that Format's prototype is:
// somewhere in the sysstrh.inc ...
Function Format (Const Fmt : String; const Args : Array of const) : String;
so it already returns a String (which, as I understand it, is magically UnicodeString in my case). What is the problem actually here? And how do I work in the correct way with library functions like Format() (for instance, GetOptionValue() of TCustomApplication)?
P.S. I read the FreePascal Wiki about Unicode and String types, but I still cannot understand the reason for this warning :)
There are multiple reasons for this warning.
The exact codepage of an AnsiString is under the control of the RTL, which can query the OS for it without the compiler knowing the details. In Lazarus applications it is generally set to UTF-8, but the compiler doesn't know that.
So calling an AnsiString Format() could corrupt strings, and repeated conversions are of course also not ideal for performance.
DelphiUnicode mode is a work in progress, and I would not recommend using it (yet) out of habit, only if you really know what you are doing (and by that I mean knowing the state of it in FPC, not just that it works in Delphi).
The original plan was to migrate fully to UnicodeString, but since Windows now allows UTF-8 as the native 1-byte codepage (see the tick box on the Application tab of the project options), progress on that migration is glacial.
In short, consider arranging your code as much as possible so that the string type doesn't matter, and then use UTF-8 AnsiStrings in Lazarus for Unicode.
Or ignore the warnings, or disable them with some -vn parameter that allows you to disable specific hints/warnings.

How to check whether a numeric encoded entity is a valid ISO8859-1 encoding?

Let's say I was given a random character reference like &#12345;. I need a solution to check whether this is a valid encoding or not.
I think I can use the Charset lib, but I can't fully wrap my mind around how to come up with a solution.
[This answer has been rewritten after further research.]
There's no simple answer to this using Charsets; see below for a complicated one.
There are simple answers using the character code, but it turns out to depend on exactly what you mean by ISO8859-1!
According to the Wikipedia page on ISO/IEC 8859-1, the character set ISO8859-1 defines only characters 32–126 and 160–255. So you could simply check for those ranges, e.g.:
fun Char.isISO8859_1() = this.toInt() in 32..126 || this.toInt() in 160..255
However, that same page also mentions the character set ISO-8859-1 (note the extra hyphen), which defines all 8-bit characters (0–255), assigning control characters to the extra ones. You could check for that with e.g.:
fun Char.isISO_8859_1() = this.toInt() in 0..255
ISO8859-1 includes all the printable characters, so if you only want to know whether a character has a defined glyph, you could use the former. However, these days most people tend to mean ISO-8859-1: that's what many web pages use (those which haven't yet moved on to UTF-8), and that's what the first 256 Unicode characters are defined as. So the latter will probably be more generally useful.
Both of the above methods are of course very short, simple, and efficient; but they only work for the one character set; and it's awkward hard-coding details of a character set, when library classes already have that information.
It seems that Charset objects are mainly aimed at encoding and decoding, so they don't provide a simple way to tell which characters are defined as such. But you can find out whether they can encode a given character. Here's the simplest way I found:
fun Char.isIn(charset: Charset) =
    try {
        charset.newEncoder()
            .onUnmappableCharacter(CodingErrorAction.REPORT)
            .encode(CharBuffer.wrap(toString()))
        true
    } catch (x: CharacterCodingException) {
        false
    }
That's really inefficient, but will work for all Charsets.
If you try this for ISO_8859_1, you'll find that it can encode all 8-bit values, i.e. 0–255. So it's clearly using the full ISO-8859-1 definition.

Smarter way to isolate a value in an unformated string?

I'm using xpdf in an AIR app to convert PDFs to PNGs on the fly. Before conversion I want to get a page count, and am using xpdf's pdfinfo utility to print to stdout and then parsing that string to get the page count.
My first-pass solution: split the string on line breaks, test the resulting array for the "Pages:" string, etc.
My solution works but it feels clunky and fragile. I thought about replacing all the double spaces, doing a split on ":" and building a hash table – but there are timestamps with colons in the string which would screw that up.
Is there a better or smarter way to do this?
protected function processPDFinfo(data:String):void
{
    var pageCount:Number = 0;
    var tmp:Array = data.split("\n");
    for (var i:int = 0; i < tmp.length; i++) {
        var tmpStr:String = tmp[i];
        if (tmpStr.indexOf("Pages:") != -1) {
            var tmpSub:Array = tmpStr.split(":");
            if (tmpSub.length) {
                pageCount = Number(tmpSub[tmpSub.length - 1]);
            }
            break;
        }
    }
    trace("pageCount", pageCount);
}
Here's a sample of the pdfinfo output I'm parsing:

Title: Developing Native Extensions
Subject: Adobe Flash Platform
Author: Adobe Systems Incorporated
Creator: FrameMaker 8.0
Producer: Acrobat Distiller Server 8.1.0
CreationDate: Mon Dec 7 05:45:39 2015
ModDate: Mon Dec 7 05:45:39 2015
Tagged: yes
Form: none
Pages: 140
Encrypted: no
Page size: 612 x 783 pts (rotated 0 degrees)
File size: 2505564 bytes
Optimized: yes
PDF version: 1.4
Use a regular expression, like this one for example:
/Pages:\s*(\d+)/g
The first (and only) capturing group is the string of digits you are looking for.
var pattern:RegExp = /Pages:\s*(\d+)/g;
var pageCount:int = parseInt(pattern.exec(data)[1]);
I understand about 2% of that (/Pages:\s*(\d+)/g). It is looking for the string literal Pages: and then something with a whitespace wildcard and an escaped d+??
I know, regex can be hard. What really helps when creating them is an IDE that supports them. There are also online tools like regexr (my first time using version 2 here, and it's even better than version 1, very nice!). In general, you want a tool that gives you immediate visual feedback on what's being matched.
Below is a screenshot with your text and my pattern in regexr.
You can hover over things and get all kinds of information.
The sidebar to the left is a full fledged documentation on regex.
The optional explain tab goes through the given pattern step by step.
\s* is any amount of whitespace characters and \d+ is at least one numeric digit character.
and returning an array??
This is the AS3 part of the story. Once you create a RegExp object with the pattern, you can use exec() to execute it on some String. (not sure why they picked such a cryptic abbreviation for the method name)
The return value is a little funky:
Returns
Object — If there is no match, null; otherwise, an object with the following properties:
An array, in which element 0 contains the complete matching substring, and other elements of the array (1 through n) contain substrings that match parenthetical groups in the regular expression
index — The character position of the matched substring within the string
input — The string (str)
You have to check the documentation of exec() to really understand this. It's kind of JS style, returning a bunch of variables held together in a generic object that also acts as an array.
This is where the [1] in my example code comes from.
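For what it's worth, the same extraction looks much the same in other languages; here is a rough C++ sketch using <regex>, assuming the pdfinfo output has been captured into a std::string:

#include <regex>
#include <string>

int pageCount(const std::string& pdfinfoOutput)
{
    // capture the digits that follow "Pages:", however much whitespace is between
    static const std::regex pattern(R"(Pages:\s*(\d+))");

    std::smatch match;
    if (std::regex_search(pdfinfoOutput, match, pattern))
        return std::stoi(match[1].str()); // first capturing group

    return 0; // no "Pages:" line found
}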

PHP4: json_encode method which accepts multi-byte chars

In my company we have a webservice to send data from very old projects to pretty new ones. The old projects run PHP 4.4, which natively has no json_encode method, so we used the PEAR class Service_JSON instead: http://www.abeautifulsite.net/using-json-encode-and-json-decode-in-php4/
Today I found out that this class cannot deal with multi-byte chars, because it extensively uses ord() to get char codes from the string and replace the chars. There is no mb_ord() implementation, not even in newer PHP versions. It also uses $string{$index} to access the char at an index; I'm not completely sure whether this supports multi-byte chars.
// Excerpt from the encode() method
// STRINGS ARE EXPECTED TO BE IN ASCII OR UTF-8 FORMAT
$ascii = '';
$strlen_var = $this->strlen8($var);

/*
 * Iterate over every character in the string,
 * escaping with a slash or encoding to UTF-8 where necessary
 */
for ($c = 0; $c < $strlen_var; ++$c) {
    $ord_var_c = ord($var{$c});
    // here comes a switch which replaces chars according to their hex code and writes them to $ascii
We call:

$Service_Json = new Service_JSON();
$data = $Service_Json->encode('Marktplatz, Hauptstraße, Endingen');

echo $data; // prints "Marktplatz, Hauptstra\u00dfe, Endinge". The n is missing
We solved this problem by setting up another webservice which receives serialized arrays and returns a json_encoded string. This service runs on a modern machine, so it uses PHP 5.4. But this "solution" is pretty awkward, and I should look for a better one. Does anyone have an idea?
Problem description
German umlauts are replaced properly, BUT then the string is cut off at the end, because ord() returns the wrong chars. mb_strlen() does not change anything; it gives the same length as strlen() in this case.
The input string was "Marktplatz, Hauptstraße, Endingen"; the n at the end was cut off. The ß was correctly encoded to \u00df. For every umlaut, it cuts off one more char at the end.
It's also possible that the reason is our old database encoding, but the replacement itself works correctly, so I guess it's the ord() method.
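For illustration, the byte-versus-character mismatch is easy to reproduce in any language; for example this little C++ program (the string literal is assumed to be UTF-8 encoded):

#include <iostream>
#include <string>

int main()
{
    // "Straße" in UTF-8: the 'ß' occupies two bytes (0xC3 0x9F)
    std::string s = "Stra\xC3\x9F" "e";

    // prints 7 (bytes), although there are only 6 characters: a byte-wise
    // loop bounded by the character count stops early and drops the final 'e'
    std::cout << s.size() << std::endl;
}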
A colleague found out that
mb_strlen($var, 'ASCII');
solves the problem. We had an older lib version in use which used plain mb_strlen(). This fix seems to do the same as your mb_convert_encoding().
The problem is solved now. Thank you very much for your help!

A StringToken Parser which gives Google Search style "Did you mean:" Suggestions

Seeking a method to:
Take whitespace-separated tokens in a String; return a suggested word.
i.e.:
Google Search can take "fonetic wrd nterpreterr",
and at the top of the results page it shows "Did you mean: phonetic word interpreter".
A solution in any of the C* languages or Java would be preferred.
Are there any existing open libraries which perform such functionality?
Or is there a way to utilise a Google API to request a suggested word?
In his article How to Write a Spelling Corrector, Peter Norvig discusses how a Google-like spellchecker could be implemented. The article contains a 20-line implementation in Python, as well as links to several reimplementations in C, C++, C# and Java. Here is an excerpt:
The full details of an industrial-strength spell corrector like Google's would be more confusing than enlightening, but I figured that, on the plane flight home, in less than a page of code, I could write a toy spelling corrector that achieves 80 or 90% accuracy at a processing speed of at least 10 words per second.
Using Norvig's code and this text as a training set, I get the following results:
>>> import spellch
>>> [spellch.correct(w) for w in 'fonetic wrd nterpreterr'.split()]
['phonetic', 'word', 'interpreters']
You can use the Yahoo web service here:
http://developer.yahoo.com/search/web/V1/spellingSuggestion.html
However, it's only a web service (i.e. there are no APIs for other languages etc.), but it outputs JSON or XML, so it's pretty easy to adapt to any language.
You can also use the Google APIs to spell check. There is an ASP implementation here (though I can't take credit for it).
First off:
Java
C++
C#
Use the one of your choice. I suspect it runs the query against a spell-checking engine with a word suggestion limit of exactly one; it then does nothing if the entire query is valid, otherwise it replaces each word with that word's best match. In other words, the following algorithm (an empty return string means that the query had no problems):
startup()
{
    set the spelling engine's word suggestion limit to 1
}

option1()
{
    int currentPosition = engine.NextWord(start the search at word 0, queryString);
    if (currentPosition == -1)
        return empty string; // query is a-ok
    while (currentPosition != -1)
    {
        queryString = engine.ReplaceWord(engine.CurrentWord, queryString, the suggestion with index 0);
        currentPosition = engine.NextWord(currentPosition, queryString);
    }
    return queryString;
}
Since no one has yet mentioned it, I'll give one more phrase to search for: "edit distance".
That can be used to find the closest matches, assuming the errors are typos where letters are transposed, missing, or added.
But usually this is also coupled with some sort of relevancy information: either simple popularity (assuming the most commonly used close-enough match is the most likely correct word), or contextual likelihood (words that follow the preceding correct word, or come before one). This gets into information retrieval; one way to start is to look at bigrams and trigrams (sequences of words seen together). Google has very extensive, freely available data sets for these.
For a simple initial solution, though, a dictionary coupled with Levenshtein-based matchers works surprisingly well.
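As a reference point, the textbook dynamic-programming formulation of edit distance fits in a few lines; a C++ sketch:

#include <algorithm>
#include <string>
#include <vector>

// classic Wagner-Fischer dynamic programming, O(a.size() * b.size())
int editDistance(const std::string& a, const std::string& b)
{
    std::vector<int> prev(b.size() + 1), curr(b.size() + 1);

    // distance from the empty string: j insertions
    for (std::size_t j = 0; j <= b.size(); ++j)
        prev[j] = int(j);

    for (std::size_t i = 1; i <= a.size(); ++i) {
        curr[0] = int(i);
        for (std::size_t j = 1; j <= b.size(); ++j) {
            int subst = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            curr[j] = std::min({ prev[j] + 1,     // delete from a
                                 curr[j - 1] + 1, // insert into a
                                 subst });        // substitute (or match)
        }
        std::swap(prev, curr);
    }
    return prev[b.size()];
}

For example, editDistance("sevanty", "seventy") returns 1 (one substitution), so "seventy" would rank as a very close match.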
You could plug in Lucene, which has a dictionary facility implementing the Levenshtein distance method.
Here's an example from the Wiki, where 2 is the distance.
String[] l=spellChecker.suggestSimilar("sevanty", 2);
//l[0] = "seventy"
http://wiki.apache.org/lucene-java/SpellChecker
An older link http://today.java.net/pub/a/today/2005/08/09/didyoumean.html
The Google SOAP Search APIs do that.
If you have a dictionary stored as a trie, there is a fairly straightforward way to find best-matching entries, where characters can be inserted, deleted, or replaced.
// minimal trie node: one child per letter, flag marks the end of a word
struct Trie { bool isWord = false; std::map<char, Trie> next; };

void match(const Trie& t, const char* w, const std::string& s, int budget)
{
    if (budget < 0) return;
    if (*w == '\0' && t.isWord) std::cout << s << '\n';
    for (const auto& entry : t.next)
    {
        char c = entry.first; const Trie& t1 = entry.second;
        /* try matching or replacing *w with c */
        if (*w != '\0') match(t1, w + 1, s + c, (*w == c ? budget : budget - 1));
        /* try inserting c (the input is missing this letter) */
        match(t1, w, s + c, budget - 1);
    }
    /* try deleting *w (the input has an extra letter) */
    if (*w != '\0') match(t, w + 1, s, budget - 1);
}
The idea is that first you call it with a budget of zero, and see if it prints anything out. Then try a budget of 1, and so on, until it prints out some matches. The bigger the budget the longer it takes. You might want to only go up to a budget of 2.
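So usage is a small iterative-deepening loop (with dictionary being the root Trie node from the sketch above):

// try exact matches first, then allow progressively more edits
for (int budget = 0; budget <= 2; ++budget)
    match(dictionary, "sevanty", "", budget);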
Added: It's not too hard to extend this to handle common prefixes and suffixes. For example, English prefixes like "un", "anti" and "dis" can be in the dictionary, and can then link back to the top of the dictionary. For suffixes like "ism", "'s", and "ed" there can be a separate trie containing just the suffixes, and most words can link to that suffix trie. Then it can handle strange words like "antinationalizationalization".