To emoji or not to emoji? - mysql

I finally found a way to overcome issue with emojis in user inputs in my Rails 4 app. It was causing 'incorrect string value' errors.
The solution was to set utf8mb4 (using MySQL) not only in DB encoding*collation but also in database.yml.
So now it works. But the problem is that rendering is not consistent across the browsers as always :) And it feels like I have no control over user inputs anymore.
Is there an easy way to strip 4-byte characters, or possible just emojis from the user's input before saving the records and storing them in DB?
Thanks!

You can override the attribute setters in your models to do whatever you want to the value before it's stored. Then, borrowing the regex from this stack overflow post (looks Greek to me!), try:
# Do this for each attribute in your model:
def my_attr=(value)
value = value.to_s.gsub(/\360\237/, '')
super(value.presence)
end

I found one more solution for Emoji here: https://dev.firmafon.dk/blog/quick-no-hack-emoji-support-with-mysql-rails/
class Rating < ActiveRecord::Base
serialize :comment
end

Related

Django order_by() is not working for me on PythonAnywhere

For some reason order_by() is not working for me on a queryset. I've tried everything I can think of, but my Django/MySQL installation doesn't seem to be doing anything with order_by() method. The list appears to just remain in a fairly unordered state, or is ordered on some basis I cannot see.
My Django installation is 1.8.
An example of one of my models is as follows:
class PositiveTinyIntegerField(models.PositiveSmallIntegerField):
def db_type(self, connection):
if connection.settings_dict['ENGINE'] == 'django.db.backends.mysql':
return "tinyint unsigned"
else:
return super(PositiveTinyIntegerField, self).db_type(connection)
class School(models.Model):
school_type = models.CharField(max_length=40)
order = PositiveTinyIntegerField(default=1)
# Make the identity of db rows clear in admin
def __str__(self):
return self.school_type
And here is the the relevant line from my view:
schools = School.objects.order_by('order')
At first I thought the problem was related to having used the non-standard PositiveTinyIntegerField() defined by a class I found on a website somewhere which allows me to use the MySQL Tiny Integer field. However, when I ordered by 'id', or 'school_type' the list still remained in an order that appeared fairly random to my eye.
I could put in my own loop which orders the queryset after it has been retrieved, but I'd really rather solve this issue so I can use the standard Django way of doing it.
I hope someone can see where the issue may be coming from.
I managed to resolved it with some help from the comments here. I tried writing the school object to stdout using sys.write.stdout(str(school)). The logs then showed me that in fact the data was being ordered correctly, so the problem had to be with how the data was being packaged before being rendered by the template.
I wrote the view some time ago before I decided I wanted it ordered, so it turned out the problem was caused by each school object (with an attached tree of related data) being read into a dictionary. Once I changed the data type to the list, the schools then rendered in my intended order.

Scala Play 2.4.x handling extended characters through anorm (MySQL) to Java Mail

I was under the impression that UTF-8 was the answer to everything :0
Problem: Using Play's idiomatic form handling to go from a web page (basic HTML Text Area Input field) to a MySQL database through the Anorm abstraction layer (so all properly escaped) and then reading the database to gather that data and create an email using the JavaMail API's to send HTML email with alternate characters (accented characters like é for example. (I'd post more but I suspect we might get strange artifacts here as well -- I'll try that in a comment below perhaps)
I can use a moderate set of characters and create a TEXT email (edited via Atom and placed into the stream directly at the code level) and it comes through as an email with all the characters I've chosen in tact.
I have not yet systematically worked through the characters I was just using a relatively random sampling as an initial test.
I place the same set of characters into a text field and try to save them to the database and I can only save about 1 in 5 or less of them.
The errors look like this:
SQLException: Incorrect string value: '\xC4\x93\x0D\x0A\x0D\x0A...' for column 'content' at row 1
I suspect I'm about to learn a ton of new information about either Play and/or UTF-8 or HTML or some part of the chain where this is going off the rails.
My question then is this: Is there an idiomatic Play example of how to handle UTF-8 end to end through Anorm and into Java Mail?
(I think I kinda expected it to be "built-in" but then I expected a LOT more to be baked into the core product as well...)
I want/need both a TEXT and and HTML path for the email portion. (I can write BOTH and they work fine -- the problem is moving alternate characters though the channels as indicated above).
I'm currently seeing if this might be an answer:
https://objectpartners.com/2013/04/24/html-encoding-utf-8-characters/
However presently hitting this roadblock...
How to turn off specific Implicit's in Scala that prevent code from compiling due to overloaded methods?
This appears to be a hopeful candidate -- I am researching it now end to end.
import org.apache.commons.lang3._
def htmlEncode(input: String) = htmlEncode_sb(input).toString
def htmlEncode_sb(input: String, stringBuilder: StringBuilder = new StringBuilder()) = {
stringBuilder.synchronized {
for ((c, i) <- input.zipWithIndex) {
if (CharUtils.isAscii(c)) {
// Encode common HTML equivalent characters
stringBuilder.append(StringEscapeUtils.escapeHtml4(c.toString()))
} else {
// Why isn't this done in escapeHtml4()?
stringBuilder.append(s"""&#${Character.codePointAt(input, i)};""")
}
}
stringBuilder
}
}
In order to get it to work inside Play you'll need this in your build.sbt file
"org.apache.commons" % "commons-lang3" % "3.4",
This blog post lead me to write that code: https://objectpartners.com/2013/04/24/html-encoding-utf-8-characters/
Update: Confirmed that it does work end to end.
Web Page Input as TextArea inside a Form saved to MySQL database escaped by Anorm, reread from database and displayed inside a TextArea on a web page with extended characters (visually) appearing precisely as input.
You'll need to call #Html(htmlContentString) inside the Twirl template to re-render this as the original HTML but the browser (Safari 8.0.7) displayed exactly what I gave it after a round trip to and from the database.
One caveat -- it creates machine readable HTML not human readable HTML. It would be nice if it didn't encode angle brackets and such so it looks more like HTML that we expect. I'm sure a pattern match block will be added next to exclude just that :)

NLTK letter 'u' in front of text result?

I'm learning NLTK with a tutorial and whenever I try to print some text contents, it returns with 'u' in front of it.
In the tutorial it looks like this,
firefox.txt Cookie Manager: "Don't allow sites that set removed cookies to se...
But in my result, it looks like this
(u'firefox.txt', u'Cookie Manager: "Don\'t allow sites that set removed cookies to se', '...')
I am not sure why. I followed exact way the tutorial is explaining. Can someone help me understand this problem? Thank you!
That leading u just means that that string is Unicode. All strings are Unicode in Python 3. The parentheses means that you are dealing with a tuple. Both will go away if you print the individual elements of the tuple, as with t[0], t[1], and so on (assuming that t is your tuple).
If you want to print the whole tuple as a whole, removing u's and parentheses, try the following:
print " ".join (t)
As mentioned in other answer the leading u just means that string is Unicode. str() can be used to convert unicode to str but there doesnt seem to be a direct way to convert all the values in a tuple from unicode to string.
Simple function as below and using it when ever you are referring to any tuple in nltk.
>>> def str_tuple(t, encoding="ascii"):
... return tuple([i.encode(encoding) for i in t])
>>> str_tuple(nltk.corpus.gutenberg.fileids())
('austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt')
I guess you are using Python2.6 or any version before 3.0.
Python allows its users to do the same operation on 'str()' and 'unicode' in the early version. They tried to make conversion between 'str()' and 'unicode' directly in some case rely on default encoding, which on most platform is ASCII. That's probably the reason cause your problem. Here are two ways may solve it:
First, manually assign decoding method. For example:
>> for name in nltk.corpus.gutenberg.fileids():
>> name.decode('utf-8')
>> print(name)
The other way is to UPDATE your Python to version 3.0+ (Recommended). They fix this problem in Python3.0. Here is the link to update detail description:
https://docs.python.org/release/3.0.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit
Hope this helps you.

Getting MySQL to properly distinguish Japanese characters in SELECT calls

I'm setting up a database to do some linguistic analysis, and Japanese Kana are giving me just a bit of trouble.
Unlike other questions on this so far, I don't know that it's an encoding issue, per se. I've set the coallation to utf8_unicode_ci, and on the surface it's saving and recalling most things all right.
The problem, however, is when I get into related kana, such as キ (ki) and ギ (gi). For sorting purposes, Japanese doesn't distinguish between the two unless they are in direct conflict. So for example:
ぎ (gi) comes before きかい (kikai)
きる (kiru) comes before ぎわく (giwaku)
き (ki) comes before ぎ (gi)
It's this behavior that I think is at the root of my problem. When loading my data set from an external file, I had it do a SELECT call to verify that specific readings in Japanese had not already been logged. If it was already there, it would fetch the ID so it could be paired to a headword; otherwise a new entry was added and paired thereafter.
What I noticed after I put everything in is that wherever two such similar readings occurred, the first one encountered would be logged and would then show up as a false positive for the other if it showed up. For example:
キョウ (kyou) appeared first, so characters with ギョウ (gyou) got paired with kyou instead
ズ (zu) appeared before ス (su), so likewise even more characters got incorrectly matched.
I can go through and manually sort it out if need be, but what I would really like to do is set the database up to take a stricter view regarding differentiating between characters (e.g. if the characters have two different UTF-8 code points, treat them as different characters). Is there any way to get this behavior?
You can use utf8_bin to get a collation that compares characters by their Unicode code points.
The utf8_general_ci collation also distinguishes キョウ and ギョウ.
when saving to database
save it as binary
and when calling back change it to Japanese
same problem accorded with me with Arabic language

Characters "ي" and "ی" and the difference in persian - Mysql

I'm working on a UTF-8 Persian website with integrated mysql database. All the content in the website are imported through an admin panel and it's all persian.
As you might know arabic language has the same letters as persian except some.
The problem is when a person tries to type on a keyboard with arabic layout it writes "ي" as an character and if he tries to type by a keyboard with persian layout it types "ی" as character.
So if a person searches for 'بازی' the mysql won't find 'بازي' as the result.
Important Note: 'ی' is not the only character with this property, there are lots of them and they are very similar.
How can I fix this issue?
One simple naive solution seems to be replace all "ي" with "ی" before importing the data into database, but i'm searching for a better robust solution than this.
Dear EBAG, We have a single Arabic block in Unicode which contains both Arabic & Persian characters.
06CC is Persian ی and 064A is Arabic ي
Default windows keyboard uses code page 1256 for arabic characters which put 064A as default ي for bothPersian and Arab users because Arab users are much more than Persian.
ISIRI make an standard keyboard ISIRI 9147 and put both Arabic and Persian Yeh on it but Perisan ی is the default characters. Persian users which are using standard keyboard will put ( and use ) standard Persian ی‍ while the rest of them use arabicي`.
As you told usually while we are saving a data to database we change arabic ي to Persian ‍ی and when we are reading from it we just go for Persian so everything is true.
the second approach is to use a JavaScript file in web application to control user input. most of the persian websites use this approach to save characters to database. In this method user don't need to install any Keyboard layout for Persian or Arabic keyboard. He/she just put the keyboard on English and then in JavaScript file developer check that which character is equevalent for him. Here you can find ISIRI 9147 javascript for web application and a Persian Guid to use it.
the Third approach is to use a On-Screen Keyboard that work just like the previous one with a user interface and is usually good for thise who are not familiar with Persian keyboard.
The forth approach is to search both dialect. As you know when you install MySql or SQL Server you can set the collation and also you have an option to support dialect ( and case sensivity). if you enable arabic collation with dialect you can get result for both of them and usually this works fine in sql server I don't test it in MySql. This is the best solution yet.
but if I were you, I implement a simple sql function which get nvarchar and return nvarchar. then I call it when I wanted to write data. and whenever you want to read, you can go for the standard one.
Sorry for the long tail.
update TABLENAME set COLUMNNAME=REPLACE(COLUMNNAME,NCHAR(1610),NCHAR(1740))
or
update TABLENAME set COLUMNNAME=REPLACE(COLUMNNAME,'ي',N'ی')
This is called a collation. It's what MySQL uses to compare two different characters. I'm afraid I don't know anything about persian or arabic, but the concept is the same. Essentially you've got two characters which map to the same base value. You need to find a collation which maps ي to ی. I'm afraid that's as helpful as I can be without knowing more about the language.
The first letter (ي) is Yāʾ in the arabic alphabet.
The second letter (ی) is ye in the perso-arabic alphabet.
More on the perso-arabic alphabet here:
http://en.wikipedia.org/wiki/Perso-Arabic_alphabet
"Two dots are removed in the final ye (ی). Arabic differentiates the final yāʾ with the two dots and the alif maqsura (except in Egyptian Arabic), which is written like a final yāʾ without two dots.
Because Persian drops the two dots in the final ye, the alif maqsura cannot be differentiated from the normal final ye. For example, the name Musâ (Moses) is written موسی. In the final letter in Musâ, Persian does not differentiate between ye or an alif maqsura."
Seems to be an interesting problem...
I was struggling with the similar situation 5-6 years ago, when Lucene was not an option for MySQL and there were no Sphinx (Never tried Sphinx result on this), but what I did was I found pretty much most of the possible alternations and put them in an array in PHP.
So if the input keyword contained any of those characters, I generated all the possible alternates of that.
So for the input of 'بازی' I would have generated {'بازي' , 'بازی' } and then I would query the MySQL for both, like the simplest query below :
SELECT title,Describtion FROM Games WHERE Description LIKE '%بازي%' OR Description LIKE '%بازی%'
The primary list of alternatives is not very long though.
If you've the possibility to switch DB engine, you might want to look into the full text search functionality of PostgreSQL:
http://www.postgresql.org/docs/9.0/static/textsearch.html
Among other things, you can configure it so that it indexes/searches unaccented characters, and you can define all sorts of additional dictionaries (e.g. stop words, thesaurus, synonyms, etc.).
If not, consider using Sphinx or Lucene instead of like statements for your searches.
I know answering this topic is like digging a corpse from its grave since it's really old but I'd like to share my experience IMHO, the best way is to wrap your request and apply your replacement . it's more portable than other ways. here is a java sample
public class FarsiRequestWrapper extends HttpServletRequestWrapper{
#Override
public String getParameter(String name) {
String parameterValue = super.getParameter(name);
parameterValue.replace("ی", "ي");
parameterValue.replace("\\s+", " ");
parameterValue.replace("ک","ک");
return parameter.trim();
}
}
then you only need to setup a filter servlet
public class FarsiFilter implements Filter{
public void doFilter(ServletRequest request, ServletResponse response,
FilterChain chain) throws IOException, ServletException {
HttpServletRequest req = (HttpServletRequest) request;
FarsiRequestWrapper rw = new FarsiRequestWrapper(req);
chain.doFilter(rw, response);
}
}
although this approach only works in Java, I found it simpler and better.
You must use N (meaning uNicode) before non-English characters, for example:
REPLACE(COLUMNNAME, N'ي', N'ی')