Multilingual spell checking with language detection - language-agnostic

I'm working on spell checking of mixed language webpages, and haven't been able to find any existing research on the subject.
The aim is to automatically detect language at a sentence level within mixed language webpages and spell check each against their appropriate language automatically. Assume that we can ignore sentences which mix multiple languages together (e.g. "He has a certain je ne sais quoi"), and assume webpages can't contain more than 2 or 3 languages.
Trivial example (Welsh + English): http://wales.gov.uk/
I'm currently using a mix of:
Character distribution (e.g. U+0600-U+06FF = Arabic, etc.)
n-grams to discern languages with similar characters
Dictionary lookup to discern locale, i.e. en-US vs. en-GB
I have working code but am concerned it may be naive or needlessly re-inventing a wheel. Has anyone else done this before?
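For what it's worth, the three passes combine naturally into a per-sentence pipeline. Below is a minimal sketch of that idea in Python; the script ranges and the n-gram profiling are illustrative assumptions, not a reference to any particular library:

import re

# Illustrative Unicode script ranges; extend as needed.
SCRIPT_RANGES = {
    "Arabic": (0x0600, 0x06FF),
    "Cyrillic": (0x0400, 0x04FF),
}

def detect_script(sentence):
    """First pass: decide by character distribution alone."""
    counts = {}
    for ch in sentence:
        cp = ord(ch)
        for script, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= cp <= hi:
                counts[script] = counts.get(script, 0) + 1
    # An unambiguous script wins outright; otherwise fall through
    # to the n-gram and dictionary passes.
    return max(counts, key=counts.get) if counts else "Latin"

def ngram_profile(text, n=3):
    """Second pass: character n-gram counts, to be compared
    (e.g. by cosine similarity) against per-language profiles."""
    text = re.sub(r"\s+", " ", text.lower())
    profile = {}
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        profile[gram] = profile.get(gram, 0) + 1
    return profile

A dictionary lookup against per-locale wordlists (en-US vs. en-GB) would then break the ties the n-gram pass can't.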

You can use APIs (Google & Yandex) for spell checking and language detection, but I don't think that option is very scalable.
Another option is to use the free Lucene tools for spell checking http://wiki.apache.org/lucene-java/SpellChecker, but you have to index some corpora first; Wikipedia is a good choice.
Language detection can be achieved with http://textcat.sourceforge.net/

With the LanguageTool library http://www.languagetool.org you can select the languages you need and have the content checked against each of them. E.g. for a French/English website you'd check the text against both English and French. Obviously there will be more errors when you check against the wrong language.
Example:
If you check, say, the French text from http://fr.wikipedia.org/wiki/Charte_de_la_langue_fran%C3%A7aise:
La Charte de la langue française (communément appelée la loi 101) est
une loi définissant les droits linguistiques de tous les citoyens du
Québec et faisant du français la langue officielle du Québec.
on http://www.languagetool.org it will show no errors for French and more than 20 errors for English/GB.
The corresponding English text:
The Charter of the French Language (French: La charte de la langue française), also
known as Bill 101 (Law 101 or French: Loi 101), is a law in the province of Quebec
in Canada defining French, the language of the majority of the population, as the
official language of Quebec and framing fundamental language rights. It is the central
legislative piece in Quebec's language policy.
will show 4 errors for English/GB (due to the French citation) and more than 20 errors when you check it against French.
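LanguageTool also exposes a public HTTP API, so the check-against-each-language idea can be scripted. A rough sketch in Python, assuming the v2 /check endpoint of the public server (which is rate-limited):

import requests

def error_count(text, language):
    """Ask the LanguageTool server to check `text` against one
    language and return the number of reported matches."""
    resp = requests.post(
        "https://api.languagetool.org/v2/check",
        data={"text": text, "language": language},
    )
    resp.raise_for_status()
    return len(resp.json()["matches"])

text = "La Charte de la langue française est une loi du Québec."
# The language yielding the fewest errors is the likely language.
for lang in ("fr", "en-GB"):
    print(lang, error_count(text, lang))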

Related

Use nearest neighbors to predict text classification with fasttext

I may have misunderstood how fasttext/deep learning works for classification: I would like nearest neighbors to be taken into consideration when predicting labels. The aim of my work is to predict labels via synonyms.
I trained on a big dataset with fasttext:
fasttext supervised -input data/spam_status.txt -output models/sem -lr 1.0 -wordNgrams 1 -epoch 25
where spam_status.txt uses a regexp to label messages containing the word "skype":
__label__skype i dont have skype __NUMBER__ sorry
__label__skype skype
__label__skype si ta un skype si
__label__skype i will give u my skype
__label__skype pv ici no skype
__label__skype skype
And plenty of other messages, with other labels, or "ok" if nothing is found.
Nearest neighbors of "skype" are (with fasttext nn models/sem.bin):
email
viber
emaill
skp
This is excellent: fasttext gives me good similar words. But if I ask for a prediction:
fasttext predict-prob ./models/sem.bin -
donne moi ton skype
__label__skype 1.00001
donne moi ton viber
__label__ok 1.00001
donne moi ton emaill
__label__ok 1.00001
Why are the nearest neighbors not taken into consideration here?
Because you trained the model with examples where ONLY messages with the word "skype" have the label Skype. Therefore messages with words like "email" and "Viber" are labelled "ok."
Your first pass taught you that you should re-label. Using a regex to label data is always going to cause problems like this. You can now at least re-label any of the messages with "email" or "Viber" as "__label__skype" so it will learn that pattern. However, it will probably not get you anything better than just using a regex as a classifier, because the model will learn the pattern going in: if it has one of the words from this short list, label it "Skype" otherwise label it "ok".
You'd get better results by spending a few hours manually labelling data instead of using a regex.
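To make the relabel-and-retrain loop concrete, here is a sketch using the official fasttext Python bindings; the file names follow the question, and the synonym list is an illustrative assumption:

import fasttext

# Hypothetical relabelling pass: extend the regex-derived labels
# to known synonyms so the model can actually learn the pattern.
SYNONYMS = {"skype", "viber", "email", "emaill", "skp"}

with open("data/spam_status.txt") as src, \
     open("data/spam_status_relabelled.txt", "w") as dst:
    for line in src:
        label, _, message = line.partition(" ")
        if label == "__label__ok" and SYNONYMS & set(message.split()):
            label = "__label__skype"
        dst.write(f"{label} {message}")

model = fasttext.train_supervised(
    input="data/spam_status_relabelled.txt",
    lr=1.0, wordNgrams=1, epoch=25,
)
print(model.predict("donne moi ton viber"))  # now __label__skype, hopefully
print(model.get_nearest_neighbors("skype"))

As the answer notes, though, this mostly re-encodes the synonym list into the model; hand-labelled data is what actually adds new information.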

Where is an official code for the European Union defined?

There is no code for the European Union in ISO 3166 or UN M.49. CLDR states that its own list does not contain a code for the European Union. I've seen the code "EU" used, but I can't find any official list that contains it. Is it in any official list of codes?
As it turns out, it is not in a list per se, but the code EU was officially "reserved" by the ISO 3166 Maintenance Agency to represent the European Union. This is discussed in an old version of the Maintenance Agency FAQ:
You can use EU for the name European Union. Please note that this is
not an official ISO 3166-1 country code. The European Union is not a
country but rather an organization. As such it is not eligible to be
formally included in ISO 3166-1. Recognizing, however, that many users
of ISO 3166-1 have a practical need to encode that name the ISO
3166/MA reserved the two-letter combination EU for the purpose of
identifying the European Union within the framework of ISO 3166-1.
This document is apparently no longer available, although parts of the statement are widely quoted on other Web sites (e.g. on Wikipedia).

Is it significantly better to use ISO-8859-1 rather than UTF-8 wherever possible?

For the globalization of scripts, it is very common to use UTF-8 as the default charset, for example in HTML or as the default charset of MySQL. This is also the case for Latin-script websites whose characters all fall within ISO-8859-1. Isn't it advantageous to use ISO-8859-1 when characters outside it are not needed? By advantageous, I mean critically beneficial.
My point is that only characters 0-127 are 1 byte in UTF-8, while 128-255 take 2 bytes, whereas ISO-8859-1 is a 1-byte-per-character system. Doesn't that play a critical role in database storage?
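For a concrete check of those byte counts, a quick Python measurement (the sample strings are arbitrary):

# Every ISO-8859-1 character encodes to exactly 1 byte; in UTF-8,
# the code points U+0080-U+00FF need 2 bytes each.
for s in ("cafe", "café"):
    print(s, len(s.encode("iso-8859-1")), len(s.encode("utf-8")))
# cafe 4 4   <- pure ASCII: identical in both encodings
# café 4 5   <- one accented character: one extra byte in UTF-8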
If everything you need now and forever is ISO-8859-1, you'll save space by using it, though likely not much if most of the characters used are < 128. If you ever need to use anything outside of ISO-8859-1, you'll be in a world of hurt. From an overall perspective, the cost in storage for UTF-8 is way lower than the cost of implementing multiple encodings.
Most of the characters you use when working with ISO-8859-1 are among those 1-byte UTF-8 characters anyway. Let's have a look here. If you use UTF-8, you need 1 extra byte only when you use one of the 128-255 characters (not so common, I bet).
My opinion? Use UTF-8 if you can and if you have no problem handling it. The time you'll save the day you need some extra characters (or the day you have to translate your content) is really worth a few extra bytes here and there in the DB...
Short answer: It doesn't matter.
Long(er) answer: Think of it this way. You have a message table that contains the messages of a forum. You have a lot of messages (let's say 1 million). Assume every message takes 10 extra bytes due to UTF-8. That's 10 million extra bytes, which is not even 10 MB (not counting indexes).
For such a "popular" forum, you will lose no more than about 15 MB of storage. That's nothing. You should definitely not worry about the extra bytes; UTF-8 will provide benefits that are much more important than 10 MB.
Does size matter?
As you know, the characters in the range U+0080 to U+00FF take up twice as much space in UTF-8 as they do in ISO-8859-1. But how often do these characters actually get used?
In a typical Spanish text I got from the front page of Wikipedia:
Artículo bueno
La séptima temporada de la serie de televisión de dibujos animados Los
Simpson fue emitida originalmente por la cadena Fox entre el 17 de
septiembre de 1995 y el 19 de mayo de 1996. Los productores ejecutivos
de la séptima temporada fueron Bill Oakley y Josh Weinstein, quienes
producirían 21 episodios de la temporada. David Mirkin fue el show
runner de los cuatro restantes, incluyendo dos vestigios que habían
sido producidos para la temporada anterior. La séptima temporada
estuvo nominada para dos Premios Primetime Emmy, incluyendo la
categoría "Mejor programa animado (de duración menor a una hora)" y
obtuvo un Premio Annie por "Mejor programa animado de televisión". La
versión en DVD fue lanzada a la venta en la Región 1 el 13 de
diciembre de 2005, en la Región 2 el 30 de enero de 2006 y en la
Región 4 el 29 de marzo del mismo año. La caja recopilatoria fue
puesta a la venta en dos formatos diferentes: una caja con la forma de
la cabeza de Marge y otra rectangular clásica, en la cual el dibujo
muestra el estreno de una película.
There are 17 non-ASCII characters in a sea of 1044 ASCII characters. That means an expansion of only 1.6% when encoding in UTF-8. Hardly worth worrying about, especially once the all-ASCII HTML markup is taken into account.
(However, the difference may be significant for a more heavily-accented language like Sango.)
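As a quick check on those numbers, the overhead can be measured directly; a sketch (the short sample string is arbitrary):

def utf8_overhead_pct(text):
    """Extra bytes UTF-8 needs over ISO-8859-1, as a percentage."""
    latin1 = len(text.encode("iso-8859-1"))
    utf8 = len(text.encode("utf-8"))
    return 100.0 * (utf8 - latin1) / latin1

# Run on the full Spanish paragraph above, this comes out around 1.6%;
# a short, heavily accented sample is naturally higher:
print(utf8_overhead_pct("Artículo bueno"))  # ~7.1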
How would your idea work, anyway?
Are you going to encode all your data in windows-1252? That doesn't give you globalization; the globe does not stop at the Oder River. True ISO-8859-1 (lacking €) is even worse; the globe does not stop at the English Channel.
Tag text with its encoding? That works for XML, HTML, and SMTP. But you asked:
Doesn't it play a critical role in database storage?
How do you intend to store mixed Latin-1 and UTF-8 strings in a database?
Have two columns EncodedText BLOB, IsUtf8 BOOLEAN? How are you gonna query that? Surely you won't just look at EncodedText and ignore IsUtf8; that approach leads to mojibake.
You could write a view with a column CASE WHEN IsUtf8 THEN EncodedText ELSE Latin1ToUtf8(EncodedText) END, and a proper INSTEAD OF INSERT trigger, but that's likely to cost you more bytes than it saves.

S -> NP VP, do these sentences follow this format?

I am parsing some sentences (from the inaugural speech in the nltk corpus) with the format S -> NP VP, and I want to make sure I parsed them correctly. Do these sentences follow the aforementioned format? Sorry if this question seems trivial; English is not my first language. If anyone has any questions about whether a given sentence follows NP VP, ask me and I will give you my reasons for why I picked it, along with its parse tree.
god bless you
our capacity remains undiminished
their memories are short
they are serious
these things are true
the capital was abandoned
they are many
god bless the united states of america
the enemy was advancing
all this we can do
all this we will do
Thanks in advance.
The first 9 are NP VP. In the last two, "all this" is the direct object, which is part of the VP.
god bless you
NP- VP-------
our capacity remains undiminished
NP---------- VP------------------
their memories are short
NP------------ VP-------
they are serious
NP-- VP---------
these things are true
NP---------- VP------
the capital was abandoned
NP--------- VP-----------
they are many
NP-- VP------
god bless the united states of america
NP- VP--------------------------------
the enemy was advancing
NP------- VP-----------
all this we can do
VP------ NP VP----
all this we will do
VP------ NP VP-----
Note that the last two sentences are semantically equivalent to "We can do all this" and "We will do all this", an order which makes the subject/predicate breakdown easier.
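If you want to verify splits like these mechanically, NLTK can parse a sentence against the grammar. A minimal sketch, where everything below the S -> NP VP rule is a deliberately tiny hand-written lexicon covering one sentence:

import nltk

grammar = nltk.CFG.fromstring("""
    S   -> NP VP
    NP  -> Det N
    VP  -> Aux V
    Det -> 'the'
    N   -> 'enemy'
    Aux -> 'was'
    V   -> 'advancing'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the enemy was advancing".split()):
    print(tree)
# (S (NP (Det the) (N enemy)) (VP (Aux was) (V advancing)))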

Are there any open source/free LOGO implementations that support dynaturtles? [closed]

I'm looking for an implementation of the LOGO programming language that supports 'dynaturtles' - animated turtles that can programmatically change shape, speed and direction as well as detect collisions with each other or other objects in the environment.
Back in the mists of time, when the earth was new and 8-bit micros ruled supreme, Atari LOGO did this famously well. One could very easily create all sorts of small games and simulated environments using this technique, as that implementation of the language had a very well thought out, elegant syntax.
I know about LCSI's Microworlds but I'm looking for something I can use to get some friends and their kids involved in programming without breaking my budget.
Digging around a bit online, I've found OpenStarLogo. Though they don't specifically mention "dynaturtles", the docs do mention collision detection. The site has code and documentation downloads.
From this Wikipedia article, under the Implementations section, there is a PDF listing known current and antique implementations. Some of these, such as StarLogo TNG and Elica, have support for 3D objects. These are definitely not like the LOGO programs I wrote as a kid...
I use MicroWorlds for my LOGO... I also know of KTurtle for KDE.
I also found a few links that could be interesting:
Python turtle
FMSLogo
MSWLogo
Check out the turtle Python package. It is in the standard Python distribution and provides a graphical turtle interface.
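In the spirit of dynaturtles, here is a minimal sketch of two animated turtles with collision detection, using only the standard-library turtle module (the speeds and the 10-pixel collision radius are arbitrary choices):

import turtle

screen = turtle.Screen()

def make_turtle(color, heading):
    t = turtle.Turtle(shape="turtle")
    t.color(color)
    t.penup()
    t.setheading(heading)
    return t

a = make_turtle("red", 0)     # heads east
b = make_turtle("blue", 180)  # heads west
b.goto(200, 0)

def step():
    a.forward(5)
    b.forward(5)
    if a.distance(b) < 10:    # crude collision test
        a.write("Bang!")
        return                # stop re-scheduling: animation ends
    screen.ontimer(step, 50)  # ~20 frames per second

step()
screen.mainloop()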
If you use win-logo (www.win-logo.de/eng/e_index.htm; you must register, and then you can try it for 30 days), you can try out this code (German version no. 2):
PR test
;* ##### start file ######
SETZE "sprung.x" 0                             ; SETZE = MAKE: initialize the jump offsets
SETZE "sprung.y" 0
flug                                           ; enter the flight loop
ENDE
PR flug
sprung                                         ; move one step along the current heading
tasten                                         ; poll the keyboard
flug                                           ; recurse: the main animation loop
ENDE
PR sprung
SETZE "sprung.x" :sprung.x + (SIN KURS)/2      ; KURS = HEADING
SETZE "sprung.y" :sprung.y + (COS KURS)/2
AUFXY (XKO + :sprung.x) (YKO + :sprung.y)      ; move to (XCOR + dx, YCOR + dy)
ENDE
PR tasten
SETZE "t" TASTE                                ; TASTE = READCHAR
WENN :t = "d" DANN LI 30                       ; WENN/DANN = IF/THEN; LI = LEFT 30
WENN :t = "e" DANN DZ "Abbruch!" AUSSTIEG      ; print "Abort!" and exit
WENN :t = "f" DANN RE 30                       ; RE = RIGHT 30
WENN :t = "h" DANN sprung                      ; extra jump
tasten                                         ; keep polling
ENDE
OK?
Greetings. Michael Kraus
Two additions to my post of yesterday, concerning the LOGO procedures with dynaturtles:
1.) the key "d" is NUM 4
the key "e" is NUM 5
the key "f" is NUM 6
the key "h" is NUM 8
2.) After hitting "e" = NUM 5 to stop the recursive procedures, you also have to click the exit button. I have tried to find out why, but I have no idea.
Michael Kraus