Strange Issues with Norwegian -> English Translations - microsoft-translator

When you translate completely separate sentences from Norwegian to English, they come out exactly the same. Examples are:
"Likte dem veldig godt" and "Greit produkt, fikk ganske fine vipper" (and many others).
The first of these should be "Liked them very much" and the second "Fine product, got pretty nice lashes".
This also happens on the online version of Bing Translator.

Seems to be fixed now, but the translation is weird.
They will improve in time. Compare Google Translate's output.


Text breaking out of Divs - only in Wordpress

I hope I am not breaking any rules of this website, but posting a link to the issue is a necessity. I have copied the HTML from the source and tested it in a local HTML file, and it does not break. I cannot work this out for the life of me.
If you look at the demo web page you can see the text breaking throughout the whole site (it is a WordPress site, if that helps).
Online Demo
Here is the html:
<h2>Our Core Values:</h2>
<strong>Relationship with God: </strong> This is our primary relationship. We were created to serve and give praise to our Creator, through our thoughts, words, and actions. When we do this, we experience the presence of God as our Heavenly Father and live in a joyful, intimate relationship with Him.
<strong>Relationship with Self:</strong> People are uniquely created in the image of God and thus have inherent worth and dignity. While we must remember that we are not God, we have the high calling of reflecting God’s being, making us superior to the rest of creation.
<strong>Relationship with Others:</strong> God created us to live in loving relationship with one another, and to encourage one another to use the gifts God has given to each of us to fulfill our calling.
<strong>Relationship with the rest of Creation:</strong> The cultural mandate of Gen 1:28-30 teaches that God created us to be stewards, people who understand, subdue and manage the world that God created in order to produce bounty. While God made the World ‘perfect’ He left it incomplete. God called humans to interact with creation to make possibilities into realities and to be able to sustain ourselves via the fruit of our stewardship. The economically poor are singled out in the Scriptures as being in a particularly desperate category and as needing very specific attention (Acts 1:6-7)
<ul>
<li>Faith – God is our provider and equips in all He calls us to do.</li>
<li>The Great Commissions - We are called to make disciples of all nations (Matthew 28:19-20)</li>
<li>Relationship - The body of Christ is held together in relationship with the Lord and each other and self.</li>
<li>Partnership – The Lord never calls one person to work alone. A biblical, effective model of missionary involvement. Ministry partnerships should promote interdependence, not dependence.</li>
<li>Leadership – The five-fold gifts are meant to operate in the establishment and leading of the church.</li>
<li>Faithful stewardship and accountability are essential for successful ministry.</li>
</ul>
Like I said, this works fine if you copy the source into a local HTML page and test using WAMP.
I hope someone can help me with this. Again, if posting a link is against the rules I am sorry, but as the issue is localised to this one instance I have no other choice.
Add this CSS
li { word-wrap: break-word; }
Your page http://kenyaaustraliamission.com/statement-of-faith/
is breaking out of the container because the spaces in the text have been replaced with non-breaking space entity references (&nbsp;).
Check the actual text in your WordPress Dashboard:
We believe the Bible to be the inspired, only, infallible and authoritative Word of God
The problem is in your HTML: remove the wrapping <p> tags from your code, then check the page again.
Your page http://kenyaaustraliamission.com/statement-of-faith/ is breaking because the browser treats each whole line as a single word, which overflows the div. This is because every space in the text has been replaced with a non-breaking space.
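A minimal sketch of a server-side fix in PHP, assuming the stored post content really does contain literal &nbsp; entity references (the function name is mine; the_content is WordPress's standard content filter):

function replace_nbsp_with_spaces($content) {
    // Swap non-breaking space entities (and the raw UTF-8 NBSP character) for ordinary spaces
    return str_replace(array('&nbsp;', "\xC2\xA0"), ' ', $content);
}
add_filter('the_content', 'replace_nbsp_with_spaces');

Alternatively, fix the text once in the WordPress editor; the filter only stops the problem from recurring if content is pasted in the same way again.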

Thai line breaking: how to break Thai text effectively

Situation
The situation with Thai text on a client site is that we can't control exactly where particular words/sentences will break between lines (i.e. how the web browser handles it). Local reviewers often flag the content's appearance as incorrect.
Workaround
The workaround is for the copywriter to deliver Thai content with both breaking (ZERO WIDTH SPACE, U+200B) and non-breaking (ZERO WIDTH NO-BREAK SPACE, U+FEFF) zero-width characters included.
In practice, rather than:
ของเพื่อนๆ ที่ออนไลน์อยู่
we should use something as ugly as:
ของเพื่อนๆ&#8203;ที่&#8203;ออนไลน์อยู่
The above is just an example; I don't really know where exactly the breakpoints are allowed.
In fact, non-breaking zero-width spaces alone would do the trick ... it's just stricter and more correct to use breaking ones as well, for better accuracy.
And while it definitely is doable like this, it is also a time-consuming and not very effective solution for managing a large site's content. Simply said, the effort put into it doesn't match the effect needed.
Research
Research so far has led to the workaround mentioned above; I'm looking for a better way to handle this. Even the W3C doesn't have a solution yet and is still discussing whether it should be part of the CSS3 specification.
The Thai language uses spaces very rarely, mostly to separate sentences. The common appearance of a Thai sentence is therefore one looong string.
Where to break such a string when it spans multiple lines is determined by identifying the individual words. For word identification, local dictionaries are used, which are most probably part of the operating system or web browser; I'm not entirely sure about these.
Apparently, the more web browsers / operating systems you check, the more different results you get! Moreover, there's not much you can do about it, as it's system-driven and there are no "where to break Thai" settings available.
Using <wbr/>, &#8203; or &shy; to indicate where the breakpoints really are won't prevent the web browser from thinking (however wrongly) that breaks are also possible in places you haven't defined, e.g. in the middle of a word, which might be grammatically incorrect.
If such a word lands at the end of a line (depending on screen resolution, copy length, and the CSS rules defined) and the browser applies its wrong line-breaking rule to it, you end up with a Thai line-breaking issue, no matter what breakpoints you have defined before, after, or elsewhere in the word - the browser will always use the breakpoint it thinks is closest to the end of the line, not just the ones you have gently suggested by inserting one of the characters mentioned above in your markup.
That's why you actually need to focus on where not to break your text (with non-breaking zero-width characters), not just on where breaks are allowed. And that leads us back to the ugly, long markup example in the "Workaround" section above. That way a line break can only occur where you have explicitly allowed it, but it's messy.
Any other solution
Any more effective way to handle this would be appreciated ... and who knows, it might even help the W3C with their implementation?
THANK YOU!
I know this thread is quite old, but I have something to say as a native Thai. I read lots of Thai web pages every day, and I feel the quality of Thai line breaking in modern web browsers nowadays is perfectly acceptable.
As far as I know, Google Chrome uses ICU4C, Internet Explorer uses the Uniscribe API, and Firefox uses libthai to break Thai sentences into words. For the Thai people I know, the way these web browsers handle Thai line breaks is perfectly acceptable. (We actually used to have this problem with very early versions of Firefox (1.x), but that has since been resolved.)
Thai line breaking and word breaking, unlike in Western languages, is still considered an unsolved problem and is still being actively tackled by many linguistics researchers. Currently there is no implementation that can perfectly break a sentence into Thai words. IBM's ICU Boundary Analysis page contains some analysis of this problem.
Many times it comes down to context. For example, the phrase "ตากลม" can be correctly broken into "ตา","กลม" or "ตาก","ลม". Each version says something totally different, but Thai readers can still perfectly understand the intended meaning, given the context.
Given that your local reviewers are already familiar with reading Thai websites, I think maybe they are pushing you too hard to resolve this. It is a common, unsolvable problem for all Thai websites, all web browsers, and even Microsoft Word.
It is best to wait (or contribute to IBM ICU) until Thai sentence-breaking implementations get better. Let the web browsers handle this; I don't think trying to work around the problem is worth your valuable time. As far as I know, even the Thai website publishers here don't bother getting this one right.
Should you need to publish a document with perfect line/word breaking, consider another medium, such as a PDF document, in which you have more control over the line breaks.
Hope this helps :)
The ICU and ICU4J libraries have a dictionary-based word-break iterator for Thai that you could use on the server side to inject breaking zero-width spaces where appropriate.
Or you could use it to build a utility that runs at build time, or on delivery of translations, if you know the spacing requirements that far in advance.
See ICU Boundary Analysis for more info. These libraries are available for C, C++, and Java.
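A minimal sketch of that server-side idea in PHP, whose intl extension wraps the same ICU machinery (the function name is mine; it assumes the intl extension is installed and that IntlBreakIterator reports byte offsets into the UTF-8 string):

function insert_thai_break_hints(string $text): string {
    // Dictionary-based Thai word boundaries, courtesy of ICU
    $it = IntlBreakIterator::createWordInstance('th');
    $it->setText($text);
    $out = '';
    $prev = $it->first();
    while (($pos = $it->next()) !== IntlBreakIterator::DONE) {
        $word = substr($text, $prev, $pos - $prev);
        $out .= $word;
        // Append a breaking zero-width space (U+200B) after each detected word
        if (trim($word) !== '') {
            $out .= "\u{200B}";
        }
        $prev = $pos;
    }
    return $out;
}

Run over the raw copy before publishing, this gives the browser explicit break opportunities at dictionary-approved boundaries instead of leaving it to its own guesses.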
There is a W3C working group working on exactly this (for Thai and other Southeast Asian languages). Their layout requirements draft is quite recent, from last month:
Thai Layout Requirements (Draft) (10 Jan 2023)
https://www.w3.org/International/sealreq/thai/
Thai Gap Analysis (19 Jan 2022) https://www.w3.org/TR/thai-gap/
I hope this info can feed into the fruitful discussion here.
You can also follow/join the Southeast Asia Language Enablement (sealreq) activity on GitHub: https://github.com/w3c/sealreq

What are situations with western languages where you'd use HTML 5's Ruby element?

HTML 5 is introducing a new element: <ruby>; here's the W3C's description:
The ruby element allows one or more spans of phrasing content to be marked with ruby annotations. Ruby annotations are short runs of text presented alongside base text, primarily used in East Asian typography as a guide for pronunciation or to include other annotations. In Japanese, this form of typography is also known as furigana.
They then go on to give a few examples of ruby annotations in use for Chinese and Japanese text. I'm wondering, though: is this element going to be useful only for East Asian HTML documents, or are there good semantic applications for the <ruby> element in Western languages like English, German, Spanish, etc.?
id-ee-oh-SINK-ruh-sees
Could be useful for people learning English, as our writing system has many idiosyncrasies that make it somewhat less than phonetic.
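For instance, marking that word up with its respelling might look like this (a sketch; <rt> holds the annotation shown above the base text):

<ruby>idiosyncrasies<rt>id-ee-oh-SINK-ruh-sees</rt></ruby>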
As a linguist, I can see the benefits of using <ruby> to mark up linguistic examples with various theoretical notational conventions. One example that comes to mind is indicating tonal levels in autosegmental phonology. Here's a quick example I threw together that can be seen in the latest WebKit/Chromium (at least):
http://miketaylr.com/code/western_ruby.html
Currently, this type of notation is left to LaTeX and friends, and if on the web, is generally a non-accessible image.
As I understand it, ruby annotations are not really relevant to Western languages because Western alphabets are (more or less) phonetic. In Japanese they are used to give a pronunciation guide for logographic characters that don't have obvious pronunciations (unless you've memorized them). I suppose the Western analog would be IPA notation in brackets following a word, but that is rarely used, and I don't know whether ruby annotations would be appropriate for it.
My list:
theoretical notational conventions (miketaylr's answer): http://miketaylr.com/code/western_ruby.html
language learning (Adam Bellaire's answer): id-ee-oh-SINK-ruh-sees rendered above "foo idiosyncrasies bar" - made with ASCII 'nbsp' art
abbreviations, acronyms, initialisms (possibly - why make readers hover for a tooltip?)
learning technical terms of English origin that have been (often awkwardly) translated into your non-English native language
I'm often forced to do the latter at uni. While the translated terminology is often consistent, very often it is not at all self-explanatory, or not as much as the original English term.
Also, the same term may have been translated under several different translation conventions by different authors/groups.
Another problem group is when, for example, queue, row, and series (and sometimes tuple) are all translated to the very same word in your language.
For a Western language with fewer speakers, and a low percentage of technical people in the population, this actually makes it much easier to learn the topic directly from English and then learn the translations in a second step.
Ruby could be a tool to turn this into a one-step process, providing either the translations or the original as a kind of "furigana"; a sketch of the markup follows.
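For example, annotating a translated technical term with the original English might look like this (a sketch; the German/English pair is my own illustration):

<ruby>Warteschlange<rt>queue</rt></ruby>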

Which tools do you use to analyze text?

I'm in need of some inspiration. For a hobby project I am playing with content analysis. I am basically trying to analyze input and match it to a topic map.
For example:
"The war on Iraq" > History, Middle East
"Halloumni" > Food, Middle East
"BMW" > Germany, Cars
"Obama" > USA
"Impala" > USA, Cars
"The Berlin Wall" > History, Germany
"Bratwurst" > Food, Germany
"Cheeseburger" > Food, USA
...
I've been reading a lot about taxonomy, and in the end, whatever I read concludes that all people tag differently and therefore the system is bound to fail.
I thought about tokenizing input and using stop-word lists, but they are of course a lot of work to come up with and build. Building the relevant links between words and topics seems exhausting and also never-ending, because whatever language you deal with is very rich, and most languages also rely heavily on context. Let alone maintaining it.
I guess I need to come up with something smart and train it with topics I want it to be able to guess. Kind of like an Eliza bot.
Anyway, I don't believe there is something that does this out of the box, but does anyone have any leads or examples of technology to use to analyze input and extract meaning?
Hiya. I'd first look at OpenCalais for finding entities within texts or input. It's great, and I've used it plenty myself (it's from the Reuters guys).
After that you can analyze the text further, creating associations between entities and words. I'd probably look them up in something like WordNet and try to typify them, or even auto-generate an ontology that matches the domain you're trying to map.
As for how to pull it all together, there are many things you can do: the above, or two- or three-pass models of trying to figure out what words are and what they mean. Or, if you control the input, make up a format that is easier to parse, or go down the murky path of NLP (which is a lot of fun).
Or you could look at something like Jena for parsing arbitrary RDF snippets, although I don't like the RDF premise myself (I'm a Topic Mapper). I've written stuff that looks up words, phrases, or names in Wikipedia and rates their hit rate based on the semantics found in the Wikipedia pages: the number of links, the number of SeeAlso entries, the amount of text, how big the discussion page is, and so on. (I could tell you more details if requested, but isn't it more fun to work it out yourself and come up with something better than mine? :)
I've written tons of stuff over the years (even in PHP and Perl; look at Robert Barta's Topic Maps stuff on CPAN, especially the TM modules, for some kick-ass stuff), from engines to parsers to something weird in the middle: associative arrays that break words and phrases apart, building cumulative histograms to sort their components out, and so forth. It's all fun stuff, but as for shrink-wrapped tools, I'm not so sure. Everyone's goals and needs seem to be different. It depends on how complex and sophisticated you want to become.
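A minimal PHP sketch of that histogram idea (the function name and keyword lists are my own illustration, not a shrink-wrapped tool):

function topic_scores(string $input, array $topic_keywords): array {
    // Break the input into lowercase words and build a frequency histogram
    $words = preg_split('/\W+/u', mb_strtolower($input), -1, PREG_SPLIT_NO_EMPTY);
    $histogram = array_count_values($words);
    $scores = [];
    foreach ($topic_keywords as $topic => $keywords) {
        $scores[$topic] = 0;
        foreach ($keywords as $keyword) {
            $scores[$topic] += $histogram[$keyword] ?? 0;
        }
    }
    arsort($scores); // highest-scoring topics first
    return $scores;
}

$topics = [
    'Cars'    => ['bmw', 'impala'],
    'Germany' => ['bmw', 'berlin', 'bratwurst'],
    'Food'    => ['bratwurst', 'cheeseburger'],
];
print_r(topic_scores('The Berlin Wall', $topics)); // Germany scores highest

The real work, as noted above, is in curating those keyword-to-topic links; the scoring itself is trivial.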
Anyway, hope this helps a little. Cheers! :)
SemanticHacker does exactly what you want, out-of-the-box, and has a friendly API. It's somewhat inaccurate on short phrases, but just perfect for long texts.
“The war on Iraq” > Society/Issues/Warfare and Conflict/Specific Conflicts
“Halloumni” > N/A
“BMW” > Recreation/Motorcycles/Makes and Models
“Obama” > Society/Politics/Conservatism
“Impala” > Recreation/Autos/Makes and Models/Chevrolet
“The Berlin Wall” > Regional/Europe/Germany/States
“Bratwurst” > Home/Cooking/Meat
“Cheeseburger” > Home/Cooking/Recipe Collections; Regional/North America/United States/Maryland/Localities
Sounds like you're looking for a Bayesian network implementation. You may get by with something like Solr.
Also check out CI-Bayes. Joseph Ottinger wrote an article about it on theserverside.net earlier this year.

British English to American English (and vice versa) Converter

Does anyone know of a library or bit of code that converts British English to American English and vice versa?
I don't imagine there are too many differences (some examples that come to mind are doughnut/donut, colour/color, grey/gray, localised/localized), but it would be nice to be able to provide localised site content.
I've been working on one to convert US English to UK English. As I've discovered, it's actually a lot harder to write something to convert the other way, but I hope to get around to providing a reverse conversion one day.
This isn't perfect, but it's not a bad effort (even if I do say so myself). It'll convert most US spellings to UK ones, but there are some words where UK English retains the US spelling (e.g. "program" where it refers to computer software). It won't convert words like pants to trousers, because my main goal was simply to make the spelling uniform across a whole document.
There are also words such as practice and license, where UK English uses either those or practise and licence depending on whether the word is being used as a verb or a noun. For those two examples the conversion tool highlights them, and an explanatory note pops up at the lower left of the screen when you hover your mouse over them. All converted word patterns are underlined in red, and the output is shown in a side-by-side comparison with your original input.
It'll do quite large blocks of text quite quickly, but I prefer to use it on just a couple of paragraphs at a time, copying them in from a Word doc.
It's still a work in progress so if anyone has any comments or suggestions then I'd appreciate feedback I can use to improve it.
http://www.us2uk.eu/
The difference between UK and US English is far greater than just spelling. There are also vocabulary differences: hood/bonnet, sidewalk/pavement, pants/trousers.
I guess it depends how far you need to take it.
I looked forever for a solution to this but couldn't find one, so I wrote my own bit of code for it, using a master list of ~20,000 different spellings that were freely available from the VarCon project and the language experts at Wordsworldwide:
https://github.com/HoldOffHunger/convert-british-to-american-spellings
Since I had two source lists, I used each to cross-check the other, and I found numerous errors and typos (VarCon lists "preexistent"'s British equivalent as "preaexistent"). It's possible I accidentally introduced typos of my own, but since I didn't do any wordsmithing here, I don't believe that to be the case. (A sketch of this cross-check follows the usage example below.)
Example:
require('AmericanBritishSpellings.php');
$american_british_spellings = new AmericanBritishSpellings();
$text = "Axiomatically ax that door, would you, my neighbour?";
$text = $american_british_spellings->SwapBritishSpellingsForAmericanSpellings(['text'=>$text]);
print($text); // output: Axiomatically axe that door, would you, my neighbor?
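The cross-check itself amounts to diffing two spelling maps; here's a sketch of the idea (the array shapes and names are my own illustration, not the project's code):

// $varcon, $wordsworldwide: maps of American => British spellings
function find_disagreements(array $varcon, array $wordsworldwide): array {
    $disagreements = [];
    foreach ($varcon as $american => $british) {
        if (isset($wordsworldwide[$american]) && $wordsworldwide[$american] !== $british) {
            // The two sources disagree - one of them likely contains a typo
            $disagreements[$american] = [$british, $wordsworldwide[$american]];
        }
    }
    return $disagreements;
}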
If you're thinking of converting from American English to British English, I personally wouldn't bother. Britain is very Americanised anyway; we accept silly Yank spellings on the net :)
I had a similar problem recently. I discovered the following tool, called VarCon. I haven't tested it out, but I needed a rough converter for some text data. Here's an example.
echo "I apologise for my colourful tongue ." | ./translate british american
# >> I apologize for my colorful tongue .
It looks like it works for various dialects. Be sure to read the README and proceed with caution.
*Note: this will only correct spelling variations.