British English to American English (and vice versa) Converter - language-agnostic

Does anyone know of a library or bit of code that converts British English to American English and vice versa?
I don't imagine there are too many differences (some examples that come to mind are doughnut/donut, colour/color, grey/gray, localised/localized), but it would be nice to be able to provide localised site content.

I've been working on one to convert US English to UK English. As I've discovered, it's actually a lot harder to write something that converts the other way, but I hope to get around to providing a reverse conversion one day.
This isn't perfect, but it's not a bad effort (even if I do say so myself). It'll convert most US spellings to UK ones but there are some words where UK English retains the US spelling (e.g. "program" where this refers to computer software). It won't convert words like pants to trousers because my main goal was simply to make the spelling uniform across the whole document.
There are also words such as practice and license where UK English uses either those or practise & licence, depending on whether the word is being used as a verb or a noun. For those two examples the conversion tool will highlight them, and an explanatory note pops up on the lower left of your screen when you hover your mouse over them. All word patterns which are converted are underlined in red, and the output is shown in a side-by-side comparison with your original input.
It'll handle quite large blocks of text quite quickly, but I prefer to use it on just a couple of paragraphs at a time, copying them in from a Word doc.
It's still a work in progress so if anyone has any comments or suggestions then I'd appreciate feedback I can use to improve it.
http://www.us2uk.eu/
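For anyone who'd like to roll a minimal version of this kind of converter themselves, the core is just a word map applied at word boundaries, with a little care for capitalisation. Here's a sketch in Python; the word list is a tiny illustrative sample (and has nothing to do with the actual us2uk.eu implementation):

```python
import re

# Tiny illustrative sample of US -> UK spellings; a real tool
# would load thousands of pairs (e.g. from a list like VarCon's).
US_TO_UK = {
    "color": "colour",
    "gray": "grey",
    "localized": "localised",
    "donut": "doughnut",
}

def us_to_uk(text):
    """Swap US spellings for UK ones, matching on word boundaries
    only and preserving an initial capital letter."""
    pattern = re.compile(
        r"\b(" + "|".join(US_TO_UK) + r")\b", re.IGNORECASE
    )
    def repl(match):
        word = match.group(0)
        uk = US_TO_UK[word.lower()]
        return uk.capitalize() if word[0].isupper() else uk
    return pattern.sub(repl, text)

print(us_to_uk("The gray donut had a nice color."))
# The grey doughnut had a nice colour.
```

The word-boundary anchors matter: without them "color" would also fire inside "colorful" and produce a hybrid spelling.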

The difference between UK and US English is far greater than just a difference in spelling. There is also the hood/bonnet, sidewalk/pavement, pants/trousers idea.
I guess it depends on how far you need to take it.

I looked forever to find a solution to this but couldn't find one, so I wrote my own bit of code for it, using a master list of ~20,000 different spellings that were freely available from the VarCon project and the language experts at Wordsworldwide:
https://github.com/HoldOffHunger/convert-british-to-american-spellings
Since I had two source lists, I used each to cross-check the other, and I found numerous errors and typos (VarCon lists the British equivalent of "preexistent" as "preaexistent"). It's possible I accidentally introduced typos of my own, but since I didn't do any wordsmithing here, I don't believe that's the case.
Example:
require('AmericanBritishSpellings.php');
$american_british_spellings = new AmericanBritishSpellings();
$text = "Axiomatically ax that door, would you, my neighbour?";
$text = $american_british_spellings->SwapBritishSpellingsForAmericanSpellings(['text'=>$text]);
print($text); // output: Axiomatically axe that door, would you, my neighbor?

I think if you're considering converting from American English to British English, I personally wouldn't bother. Britain is very Americanised anyway; we accept silly Yank spellings on the net :)

I had a similar problem recently. I discovered the following tool, called VarCon. I haven't tested it out, but I needed a rough converter for some text data. Here's an example.
echo "I apologise for my colourful tongue ." | ./translate british american
# >> I apologize for my colorful tongue .
It looks like it works for various dialects. Be sure to read the README and proceed with caution.
*note: This will only correct spelling variations.
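If you'd rather script the swap yourself than shell out to the translate tool, the same idea is easy to reproduce in Python once you've exported the pairs to a simple format. Note the two-column layout below is an assumption for illustration, not VarCon's actual file format:

```python
# Build a British -> American mapping from a simple two-column
# word list (one "british american" pair per line). This layout
# is an assumed export format, not VarCon's native one.
def load_pairs(lines):
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        british, american = line.split()
        mapping[british] = american
    return mapping

sample = """
# british american
apologise apologize
colourful colorful
"""
MAPPING = load_pairs(sample.splitlines())

def to_american(text):
    # Token-wise swap; punctuation handling is left out for brevity.
    return " ".join(MAPPING.get(tok, tok) for tok in text.split())

print(to_american("I apologise for my colourful tongue ."))
# I apologize for my colorful tongue .
```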

Related

creating a common embedding for two languages

My task deals with multiple languages (English and Hindi). For that I need a common embedding to represent both languages.
I know there are methods for learning multilingual embeddings, like MUSE, but these map the two embeddings into a common vector space; the resulting vectors are obviously similar, but not the same.
So I wanted to know whether there is any method or approach that can learn a single embedding that represents both languages.
Any lead is much appreciated!
I think a good lead would be to look at past work that has been done in the field. A good overview to start with is Sebastian Ruder's talk, which gives you a multitude of approaches, depending on the level of information you have about your source/target language. This is basically what MUSE is doing, and I'm relatively sure that it is considered state-of-the-art.
The basic idea in most approaches is to map embedding spaces such that you minimize some (usually Euclidean) distance between the two (see p. 16 of the link). This obviously works best if you have a known dictionary and can precisely map the different translations, and works even better if the two languages have similar linguistic properties (not so sure about Hindi and English, to be honest).
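The "minimize Euclidean distance over a seed dictionary" step has a neat closed form when the map is constrained to be orthogonal (the Procrustes solution, which MUSE also uses in its refinement step). A minimal numpy sketch on toy data, where the target space really is just a rotation of the source:

```python
import numpy as np

def procrustes_align(X, Y):
    """Orthogonal Procrustes: find the orthogonal W minimizing
    ||X W - Y||_F, where row i of X and row i of Y are the
    embeddings of a known translation pair (the seed dictionary).
    Closed form via the SVD of Y^T X."""
    U, _, Vt = np.linalg.svd(Y.T @ X)
    return (U @ Vt).T  # maps source space -> target space

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 8))                     # "source" vectors
true_rot, _ = np.linalg.qr(rng.standard_normal((8, 8)))
Y = X @ true_rot                                     # target = rotated source

W = procrustes_align(X, Y)
print(np.allclose(X @ W, Y))  # True: the rotation is recovered exactly
```

On real embeddings the fit is of course not exact, but the same one-line SVD gives the best orthogonal map for whatever dictionary you have.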
Another recent approach is the one by Multilingual-BERT (mBERT), or similarly, XLM-RoBERTa, but those learn embeddings based on a shared vocabulary. This might again be less desirable if you have morphologically dissimilar languages, and also has the drawback that they incorporate a bunch of other, unrelated, languages.
Otherwise, I'm unclear on what exactly you are expecting from a "common embedding", but happy to extend the answer once clarified.

How to translate text/HTML that has stylistic line breaks?

The general question here is: how do you mark up text for translation on an HTML page when the position of the line breaks has to look eye-pleasing (as opposed to the line break always happening after a specific word)?
I have a web page I want to translate into 5 different languages. In some places, I have text like "Enjoyed by 10,000 happy users" under a small icon that needs to be displayed in an eye-pleasing way. This looks good when the noun phrase is on its own line and each line has about the same number of letters:
<icon>
Enjoyed by
10,000 happy users
Do I send this text to be translated as this?
Enjoyed by <br> 10,000 happy users
Problems:
By adding markup to the text, I make it unlikely I can reuse the string elsewhere, but I can't see any other options.
How do I cope with placing the <br> in the translated text, given that the translation will have a different number of letters (e.g. "Genossen von 10.000 glücklichen Benutzern" in German)? Just review how each one renders on the page manually and adjust the <br> myself after the translations come back?
I can't see any clean way to do this. I could remove the markup and try to write some server code that adds the break in a nice place, but I can't see how to automate that (e.g. putting noun phrases on their own line when the previous line has enough letters). CSS has even fewer options for this.
Your question is somewhat subjective, but I think your choices are to either trust your translators to format the HTML, or trust them to come up with copy that fits your design. Trying to engineer your way to a "clean" solution with server code sounds like it will achieve the exact opposite.
Make sure your design is good enough to cope with a reasonable range of word lengths. If your layout lives and dies by the text being exactly X characters long, then it isn't well designed. You can always ask your translators to try and write a translation in less than a maximum number of characters. This is why we still have human translators - they are also copywriters :)

Thai line breaking: how to break Thai text effectively

Situation
with Thai text on a client site is that we can't control where exactly particular words/sentences will break between lines (how the web browser will handle it). Local reviewers often flag the content's appearance as incorrect.
Workaround
to this is that the copywriter needs to deliver Thai content with breaking (&#8203;, zero-width space) and non-breaking (&#8288;, word joiner) zero-width chars included.
In practice, rather than:
ของเพื่อนๆ ที่ออนไลน์อยู่
we should use something as ugly as:
ของเพื่อนๆ&#8203;ที่&#8203;ออนไลน์อยู่
The above is just an example, I don't really know where exactly the breakpoints are allowed.
In fact, non-breaking zero-width chars alone would do the trick; it's just stricter and more accurate to use breaking ones as well.
And while it definitely is doable like this, it also is a time consuming and not very effective solution for a large site content management. Simply said, the effort put into it doesn't match the effect needed.
Research
so far has led to the workaround mentioned above while looking for a better way to handle this. Even the W3C doesn't have a solution yet and is still discussing whether it should be part of the CSS3 specification.
Thai language utilizes spaces very rarely, mostly to distinguish between sentences etc. Therefore, common appearance of a Thai sentence is one looong string.
Where to break such a string when more lines of text are put together is determined by identifying particular words. For word identification, local dictionaries are used, which are most probably part of the operating system or web browser; I'm not entirely sure about these.
Apparently, the more web browsers / operating systems you check on the more results you get! Moreover, there's not much you can do about this as it's system driven and there are no "where to break Thai" settings available.
Using <wbr/>, &#8203; (zero-width space) or &shy; (soft hyphen) to indicate where the breakpoints really are won't prevent the web browser from thinking (wrongly) that breaks are also possible in places where you haven't defined them, e.g. in the middle of a word, which might be grammatically incorrect.
If such a word is placed at the end of a line (depending on screen resolution, copy length and the CSS rules defined) and the browser applies its wrong line-breaking rule to it, then you end up with a Thai line-breaking issue, no matter that you have defined other breakpoints before, after or elsewhere in the word - the browser will always use the breakpoint it thinks is closest to the EOL, not just the ones you have gently suggested by inserting one of the mentioned chars in your markup.
That's why you actually need to focus on where not to break your text (non-breaking zero-width chars), not on where breaks are allowed. And that's what led us back to the ugly and long markup example in the "Workaround" section above. That way a line break can only occur where you have allowed it, but it's messy.
Any other solution
how to handle this more effectively would be appreciated ... and who knows, it might even help W3C in their implementation?
THANK YOU!
I know this thread is quite old, but I have something to say as a native Thai. I read lots of Thai web pages every day, and I feel the quality of Thai line breaking in modern web browsers is perfectly acceptable nowadays.
As far as I know, Google Chrome uses ICU4C, Internet Explorer uses the Uniscribe API, and Firefox uses libthai to break Thai sentences into words. For the Thai people I know, how these browsers handle line breaks in Thai is perfectly acceptable. (We actually used to have this problem with very early versions of Firefox (1.x), but that has since been resolved.)
Thai line breaking and word breaking, unlike in Western languages, is still considered an unsolved problem and is still actively tackled by many linguistics researchers. Currently there is no implementation that can perfectly break a sentence into Thai words. The IBM ICU Boundary Analysis page contains some analysis of this problem.
Many times it has something to do with context. For example, the phrase "ตากลม" can be correctly broken into "ตา"/"กลม" or "ตาก"/"ลม". Each says a totally different thing, but Thai readers can still perfectly understand the intended meaning, given the context.
Given that your local reviewers are already familiar with reading Thai websites, I think maybe they are being too pushy in asking you to resolve this. It is a common, unsolvable problem for all Thai websites, web browsers, and even Microsoft Word.
It is best to wait (or contribute to IBM ICU) until the Thai sentence-breaking implementations get better. Let the web browsers handle this. I don't think trying to work around this problem is worth your valuable time. As far as I know, even Thai website publishers just don't care to get this one right.
Should you need to publish a document with a perfect line/word breaking, you may consider other medium, such as PDF document in which you should have more control over the line breaks.
Hope this helps :)
The ICU and ICU4J libraries have a dictionary based word break iterator for Thai that you could use on the server side to inject breaking zero width spaces where appropriate.
Or, you could use this to build a utility that could run at build time or on delivery of translations, if you knew the spacing requirements that far in advance.
See ICU Boundary Analysis for more info. These libraries are available for C, C++, and Java.
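To get a feel for what a dictionary-based break iterator does under the hood, here's a toy greedy longest-match segmenter in pure Python (ICU's real iterator is far more sophisticated; the six-word dictionary here is just for illustration):

```python
ZWSP = "\u200b"  # zero-width space: marks a legal break point

def segment(text, dictionary):
    """Greedy longest-match segmentation, a much-simplified version
    of what a dictionary-based break iterator does. Runs that match
    no dictionary word are emitted character by character."""
    words, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try longest match first
            if text[i:j] in dictionary:
                words.append(text[i:j])
                i = j
                break
        else:
            words.append(text[i])  # no match: fall back to one char
            i += 1
    return words

# Toy dictionary; real segmenters load tens of thousands of words.
dic = {"ของ", "เพื่อน", "ๆ", "ที่", "ออนไลน์", "อยู่"}
expected = ["ของ", "เพื่อน", "ๆ", "ที่", "ออนไลน์", "อยู่"]
words = segment("".join(expected), dic)
print(words == expected)  # True; ZWSP.join(words) gives breakable text
```

Joining the resulting words with `ZWSP` is exactly the "inject breaking zero-width spaces server-side" idea from the answer above.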
There is a W3C working group working on exactly this (for Thai and other Southeast Asian languages). Their layout requirements draft is quite recent, from last month:
Thai Layout Requirements (Draft) (10 Jan 2023)
https://www.w3.org/International/sealreq/thai/
Thai Gap Analysis (19 Jan 2022) https://www.w3.org/TR/thai-gap/
I hope this info can feed into the fruitful discussion here.
You can also follow/join the Southeast Asia Language Enablement (sealreq) activity on GitHub: https://github.com/w3c/sealreq

Which tools do you use to analyze text?

I'm in need of some inspiration. For a hobby project I am playing with content analysis. I am basically trying to analyze input to match it to a topic map.
For example:
"The way on Iraq" > History, Middle East
"Halloumni" > Food, Middle East
"BMW" > Germany, Cars
"Obama" > USA
"Impala" > USA, Cars
"The Berlin Wall" > History, Germany
"Bratwurst" > Food, Germany
"Cheeseburger" > Food, USA
...
I've been reading a lot about taxonomy, and in the end, whatever I read concludes that all people tag differently and therefore the system is bound to fail.
I thought about tokenized input and stop-word lists, but they are of course a lot of work to come up with and build. Building the relevant links between words and topics seems exhausting and also never-ending, because whatever language you deal with is very rich, and most languages also rely heavily on context. Let alone maintaining it.
I guess I need to come up with something smart and train it with topics I want it to be able to guess. Kind of like an Eliza bot.
Anyway, I don't believe there is something that does that out of the box, but does anyone have any leads or examples for technology to use in order to analyze input in order to extract meaning?
Hiya. I'd first look to OpenCalais for finding entities within texts or input. It's great, and I've used it plenty myself (from the Reuters guys).
After that you can analyze the text further, creating associations between entities and words. I'd probably look them up in something like WordNet and try to typify them, or even auto-generate some ontology that matches the domain you're trying to map.
As to how to pull it all together, there are many things you can do: the above, or two- or three-pass models of trying to figure out what words are and mean. Or, if you control the input, make up a format that is easier to parse, or go down the murky path of NLP (which is a lot of fun).
Or you could look to something like Jena for parsing arbitrary RDF snippets, although I don't like the RDF premise myself (I'm a Topic Mapper). I've written stuff that looks up words, phrases or names in Wikipedia and rates their hit rate based on the semantics found in the Wikipedia pages (I could tell you the details if requested, but isn't it more fun to work it out yourself and come up with something better than mine? :) - e.g. number of links, number of SeeAlso entries, amount of text, size of the discussion page, etc.
I've written tons of stuff over the years (even in PHP and Perl; look to Robert Barta's Topic Maps stuff on CPAN, especially the TM modules, for some kick-ass stuff), from engines to parsers to something weird in the middle: associative arrays that break words and phrases apart, creating cumulative histograms to sort their components out, and so forth. It's all fun stuff, but as for shrink-wrapped tools, I'm not so sure. Everyone's goals and needs seem to be different. It depends on how complex and sophisticated you want to become.
Anyway, hope this helps a little. Cheers! :)
SemanticHacker does exactly what you want, out-of-the-box, and has a friendly API. It's somewhat inaccurate on short phrases, but just perfect for long texts.
“The way on Iraq” > Society/Issues/Warfare and Conflict/Specific Conflicts
“Halloumni” > N/A
“BMW” > Recreation/Motorcycles/Makes and Models
“Obama” > Society/Politics/Conservatism
“Impala” > Recreation/Autos/Makes and Models/Chevrolet
“The Berlin Wall” > Regional/Europe/Germany/States
“Bratwurst” > Home/Cooking/Meat
“Cheeseburger” > Home/Cooking/Recipe Collections; Regional/North America/United States/Maryland/Localities
Sounds like you're looking for a Bayesian Network implementation. You may get by using something like Solr.
Also check out CI-Bayes. Joseph Ottinger wrote an article about it on theserverside.net earlier this year.
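To make the Bayesian suggestion concrete, here's a minimal naive Bayes classifier in plain Python, trained on toy phrases in the spirit of the question's examples (the training data is invented for illustration; a real project would use a proper library and far more data):

```python
from collections import Counter, defaultdict
import math

# Toy training data in the spirit of the question's examples.
TRAIN = [
    ("berlin wall history", "Germany"),
    ("bratwurst sausage food", "Germany"),
    ("bmw cars engine", "Germany"),
    ("cheeseburger food fries", "USA"),
    ("impala cars chevrolet", "USA"),
    ("obama president", "USA"),
]

def train(examples):
    word_counts = defaultdict(Counter)  # topic -> word frequencies
    topic_counts = Counter()            # topic -> document count
    for text, topic in examples:
        topic_counts[topic] += 1
        word_counts[topic].update(text.split())
    return word_counts, topic_counts

def classify(text, word_counts, topic_counts):
    """Naive Bayes with add-one smoothing over the shared vocabulary."""
    vocab = {w for counts in word_counts.values() for w in counts}
    best_topic, best_score = None, -math.inf
    for topic, n in topic_counts.items():
        score = math.log(n / sum(topic_counts.values()))  # prior
        total = sum(word_counts[topic].values())
        for w in text.split():
            score += math.log(
                (word_counts[topic][w] + 1) / (total + len(vocab))
            )
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic

wc, tc = train(TRAIN)
print(classify("bratwurst food", wc, tc))  # Germany
```

With only a handful of topics this is surprisingly serviceable, and it degrades gracefully on unseen words thanks to the smoothing.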

How do you handle translation of text with markup?

I'm developing multi-language support for our web app. We're using Django's helpers around the gettext library. Everything has been surprisingly easy, except for the question of how to handle sentences that include significant HTML markup. Here's a simple example:
Please log in to continue.
Here are the approaches I can think of:
Change the link to include the whole sentence. Regardless of whether the change is a good idea in this case, the problem with this solution is that the UI becomes dependent on the needs of i18n when the two are ideally independent.
Mark the whole string above for translation (formatting included). The translation strings would then also include the HTML directly. The problem with this is that changing the HTML formatting requires changing all the translations.
Tightly couple multiple translations, then use string interpolation to combine them. For the example, the phrase "Please %s to continue" and "log in" could be marked separately for translation, then combined. The "log in" is localized, then wrapped in the HREF, then inserted into the translated phrase, which keeps the %s in translation to mark where the link should go. This approach complicates the code and breaks the independence of translation strings.
Are there any other options? How have others solved this problem?
Solution 2 is what you want. Send them the whole sentence, with the HTML markup embedded.
Reasons:
The predominant translation tool, Trados, can preserve the markup from inadvertent corruption by a translator.
Trados can also auto-translate text that it has seen before, even if the content of the tags has changed (as long as the number of tags and their positions in the sentence are the same). At the very least, the translator will give you a good discount.
Styling is locale-specific. In some cases, bold will be inappropriate in Chinese or Japanese, and italics are less commonly used in East Asian languages, for example. The translator should have the freedom to either keep or remove the styles.
Word order is language-specific. If you were to segment the above sentence into fragments, it might work for English and French, but in Chinese or Japanese the word order would not be correct when you concatenate. For this reason, it is best i18n practice to externalize entire sentences, not sentence fragments.
2, with a potential twist.
You certainly could localize the whole string, like:
loginLink=Please log in to continue
However, depending on your tooling and your localization group, they might prefer for you to do something like:
// tokens in this string add html links
loginLink=Please {0}log in{1} to continue
That would be my preferred method. You could use a different substitution pattern if you have localization tooling that ignores certain characters. E.g.
loginLink=Please %startlink%log in%endlink% to continue
Then perform the substitution in your jsp, servlet, or equivalent for whatever language you're using ...
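In Python, the `{0}`/`{1}` token scheme above is just `str.format`. A small sketch; note the `/login` URL and the German translation are my own illustrative stand-ins, not strings from the question:

```python
# The translated string keeps whatever word order the translator
# chose; the code supplies the markup. "/login" is a placeholder URL,
# and the German text is an illustrative translation.
translations = {
    "en": "Please {0}log in{1} to continue",
    "de": "Bitte {0}melden Sie sich an{1}, um fortzufahren",
}

def login_sentence(lang, url="/login"):
    open_tag = '<a href="{}">'.format(url)
    return translations[lang].format(open_tag, "</a>")

print(login_sentence("en"))
# Please <a href="/login">log in</a> to continue
```

Notice how the German version moves and rewords the linked phrase; because the tokens travel with the translation, the markup lands in the right place without the code knowing anything about German grammar.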
Disclaimer: I am not experienced in internationalization of software myself.
I don't think this would be good in any case - it just introduces too much coupling …
As long as you keep formatting sparse in the parts which need to be translated, this could be okay. Giving translators the ability to give special words importance (by either making them a link or perhaps using <strong /> emphasis) sounds like a good idea. However, those translations with (X)HTML possibly cannot easily be reused anywhere else.
This sounds like unnecessary work to me …
If it were me, I think I would go with the second approach, but I would put the URI into a formatting parameter, so that this can be changed without having to change all those translations.
Please log in to continue.
You should keep in mind that you may need to teach your translators a basic knowledge of (X)HTML if you go with this approach, so that they do not screw up your markup and so that they know what to expect from that text they write. Anyhow, this additional knowledge might lead to a better semantic markup, because, as mentioned above, texts could be translated and annotated with (X)HTML to reflect local writing style.
Whatever you do, keep the whole sentence as one string. You need to understand the whole sentence to translate it correctly.
Not all words should be translated in all languages: e.g. in Norwegian one doesn't use "please" (we can say "vær så snill", literally "be so kind", but when used as a command it sounds too forceful), so the correct Norwegian would be:
"Logg inn for å fortsette" lit.: "Log in to continue", or
"Fortsett ved å logge inn" lit.: "Continue by to log in", etc.
You must allow completely changing the order, e.g. in a fictional demo language:
"Für kontinuer Loggen bitte ins" (if it were real) lit.: "To continue log please in"
Some languages may even have one single word for (most of) this sentence...
I'd recommend solution 1, or possibly "Please %{startlink}log in%{endlink} to continue"; this way the translator can make the whole sentence a link if that's more natural, and it can be completely restructured.
Interesting question; I'll be having this problem very soon. I think I'll go for 2, without any kind of tricky stuff. HTML markup is simple, URLs won't move anytime soon, and if anything is changed a new entry will be created in django.po, so we get a chance to review the translation (e.g. a script should check for empty translations after makemessages).
So, in template :
{% load i18n %}
{% trans 'hello world' %}
... then, after python manage.py makemessages I get in my django.po
#: templates/out.html:3
msgid "hello world"
msgstr ""
I change it to my needs
#: templates/out.html:3
msgid "hello world"
msgstr "bonjour monde"
... and in the simple yet frequent cases I'll encounter, it won't be worth any further trouble. The other solutions here seem quite smart, but I don't think the solution to markup problems is more markup. Plus, I want to avoid too much confusing stuff inside templates.
Your templates should be quite stable after a while, I guess, but I don't know what other trouble you expect. If the content changes over and over, perhaps that content's place is not inside the template but inside a model.
Edit: I just checked it out in the documentation, if you ever need variables inside a translation, there is blocktrans.
Makes no sense, how would you translate "log in"?
I don't think many translators have experience with HTML (the regular non-HTML-aware translators would be cheaper)
I would go with option 3, or use "Please %slog in%s to continue" and replace the %s with parts of the link.