Hidden character in Pages and MySql - mysql

I have some text that seems to contain a hidden character.
The original text was written in Apple Pages, the word processor, and copy-pasted into a MySQL database. The lines are h2 headings written in Markdown. I noticed the hidden character when I ran a SELECT against the database to match ## (.*) and convert each match to an h2 tag. Some headings work and some do not. For instance, if I use the regex /## / (with a space after the ##), it only finds ## Brand out of these two:
## Brand
## New Tech
I tested this in several regex tools, for instance http://regexr.com/3f660. With /## /, they all find only ## Brand.
I can work around the problem if I use ##\s, or if I delete that space and type a new one. I have many cases like this in a big database, and I would like to understand the cause first and clean it up later. If I go to Apple Pages > Show Invisibles, it shows : between # and N in ## New Tech. What is that character, and how can I find it in order to delete it in a MySQL database?
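One way to find out what the character actually is would be to dump the code points of an affected heading. The Python sketch below is only an illustration: it assumes the hidden character is a no-break space (U+00A0), which the question does not confirm, and describe_chars is a hypothetical helper name.

```python
import re
import unicodedata


def describe_chars(text):
    """Return (code point, Unicode name) for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in text]


# Assumption: the broken heading contains U+00A0 instead of a plain space.
suspect = "##\u00a0New Tech"
plain = "## Brand"

# Show what sits right after the two # signs.
for code, name in describe_chars(suspect[:3]):
    print(code, name)

# A literal space in the pattern only matches U+0020 ...
print(bool(re.search(r"## ", plain)), bool(re.search(r"## ", suspect)))
# ... while \s in a str pattern also matches other Unicode whitespace.
print(bool(re.search(r"##\s", suspect)))
```

If the character does turn out to be U+00A0, its UTF-8 encoding is the byte pair C2 A0, which is what a cleanup in the database would have to target.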

Related

Sublime Text - find all instances of an html class name project-wide

I want to find all instances of a class named "validation" in all of my html files project wide. It's a very large project and a search for the word "validation" gives me hundreds of irrelevant results (js functions, css, js/css minified, other classes, functions and html page content containing the word validation, etc). It can sometimes be the second, third, or fourth class declared so searching for "class='validation" doesn't work.
Is there a way to specify that I only want results where validation is a class declared on an html block?
Yes. In the sublime menu go to Find --> Find in Files...
Then match what is in the following image.
The first thing you will want to do is consider other possibilities with how you can solve this problem. Currently, it sounds like you are only using sublime text. Have you considered trying to use a command-line tool like grep?
Here is an example of how it could be used.
I have a project called enfold-child with a bunch of frontend assets for a WordPress project. Let's say I want to find all of my scss files with the class "home" listed in them somewhere, but I do NOT want to pull in built css files or anything in my node_modules folder. The way I would do that is as follows:
Folder structure:
..
|build
|scss_files
|node_modules
|css_files
|style.css
grep -rnw build --exclude='*.css' --exclude-dir=node_modules -e home
grep = handy search utility.
-r = recursive search.
-n = provide line numbers for each match
-w = Select only those lines containing matches that form whole words.
-e = the pattern (a regular expression) to match against.
home = the expression I want to search for.
In general, the command line has most anything one could want/need to do most of the nifty operations offered by most text-editors -- such as Sublime. Becoming familiar with the command line will save you a bunch of time and headaches in the future.
In SublimeText, right-click on the folder you want to start the search from and click on Find in Folder. Make sure regex search is enabled (the .* button in the search panel) and use this regex as the search string:
class="([^"]+ )?validation[ "]
That regex will handle cases where "validation" is the only classname as well as cases where it's one of several classnames (in which case it can be anywhere in the list).
If you didn't stick to double quotes, this version will work with single or double quotes:
class=['"]([^'"]+ )?validation[ '"]
If you want to use these regexes from the command line with grep, you'll need to include a -E argument for "extended regular expressions".
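Before running the search across a large project, the pattern can be checked against a few hand-written lines. A minimal Python sketch, assuming the sample markup below (which is made up for illustration) is representative of the project's HTML:

```python
import re

# The single-or-double-quote regex from the answer above, unchanged.
CLASS_RE = re.compile(r"""class=['"]([^'"]+ )?validation[ '"]""")

# Hypothetical sample lines mapped to whether they should match.
samples = {
    '<div class="validation">': True,                  # only class
    '<div class="form-row validation hidden">': True,  # middle of the list
    "<span class='a b validation'>": True,             # last, single quotes
    '<div class="validation-error">': False,           # different class name
    "function validation() {}": False,                 # not a class attribute
}

for line, expected in samples.items():
    matched = bool(CLASS_RE.search(line))
    print(f"{matched!s:5} (expected {expected!s:5}) {line}")
```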

Perl-Application and queries with accented characters using postgres

I have been working with Postgres and Perl for a decade.
One of my oldest applications still in operation, a dictionary of government addresses and departmental responsibilities, has issues handling query terms containing accented characters, for example köln. In other words, whenever a query term contains an accented character (mainly umlauts), 0 results are returned.
I have to mention that this behavior only occurs when the application uses Postgres as the database. If I switch to MySQL 5 (same data), the same queries work correctly.
Trying to track the cause of this problem I have checked the following:
Postgres database is UTF-8 (using the command show server_encoding;)
Postgres client encoding is also UTF8 (using show client_encoding;)
If I use the Postgres monitor and execute the same SQL query as the application does, using accented characters in the query term, I get correct results
The Perl application itself is handling UTF-8, the HTML-Header is set correctly, contents of the output display correct and not garbled
All Perl code files, scripts, .pm package files and templates are UTF-8 encoded (I verified that with file --mime perl_file_name)
I fiddled with the database connection, setting $self->{dbh}->{pg_enable_utf8} = 1; and/or $self->{dbh}->do("SET CLIENT_ENCODING TO 'UTF8';"); and/or $self->{dbh}->do("SET NAMES 'UTF8';"); with no change
I've updated the DBD::Pg module to version 3.6.2, no change.
So I am pretty much out of ideas about what else to check or try to get Postgres fully working. As mentioned in my intro, the same application works flawlessly with MySQL as the database.
Two years ago the application was changed to handle UTF-8 data. I did not make the changes myself, but as far as I can see in the code (compared to the code in my Git repo) it's just the HTML UTF-8 header print "Content-type: text/html; charset=utf-8\n\n"; and a few unrelated template parts. Perhaps that change is the origin of all the problems, but I don't know what specifically to adjust for Postgres.
The current Perl version is 5.22.1, using Apache/2.2.22 (Ubuntu). The vhost configuration is simple:
AddHandler cgi-script .cgi .pl
ScriptAlias /...abs-path-to-app.../cgi-bin/
<Directory "/...abs-path-to-app.../cgi-bin/">
AllowOverride None
Options +Indexes +ExecCGI +MultiViews +SymLinksIfOwnerMatch
<IfVersion < 2.4>
Allow from all
</IfVersion>
<IfVersion >= 2.4>
Require all granted
</IfVersion>
Allow from all
</Directory>
Postgres is version 9.1.24.
Edit:
Collate and Ctype is set to en_US.UTF-8, Encoding is set to UTF-8 for the database in question.
Looking at the tables, all character varying columns use the pg_catalog."default" collation. Executing show lc_collate; shows the already mentioned en_US.UTF-8.
Edit2:
Setting the DBD::Pg flag pg_enable_utf8 to 0 seems to work, and I get the expected results. Using a value other than 0, for example -1 or 1, does not work. I tried that flag (once again) right after the database connect. I still have to verify this, as I do not really understand what's going on.
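One possible explanation for the pg_enable_utf8 observation, which the question does not confirm, is double encoding: if the application hands the driver raw UTF-8 bytes rather than decoded character strings, and a layer in between treats those bytes as Latin-1 characters and encodes them again, the query term no longer matches the stored data. The Python sketch below only illustrates that byte-level effect; it does not use DBD::Pg, and the variable names are hypothetical.

```python
# The query term as the user typed it.
term = "köln"

# What a correctly encoded query term looks like on the wire.
raw_utf8 = term.encode("utf-8")

# Assumed failure mode: the UTF-8 bytes are mistaken for Latin-1
# characters and encoded to UTF-8 a second time.
double_encoded = raw_utf8.decode("latin-1").encode("utf-8")

print(raw_utf8)
print(double_encoded)

# The two byte strings differ, so an equality or LIKE comparison against
# correctly stored data would find nothing.
print(raw_utf8 == double_encoded)
```

If this is what happens, turning the flag off would let the bytes pass through untouched, which would be consistent with the behaviour described in Edit2.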

MySQL on remote machine accessed via chromebook terminal returns nonsense unicode which persists after I leave MySQL

I am using the terminal in a chromebook to ssh into a remote server. When I run a MySQL (5.6) select query, sometimes one of the fields will return nonsense unicode (when the field should return an email address) and change the MySQL prompt from:
mysql>
to
└≤⎽─┌>
and whatever text I type is converted into weird unicode. The problem persists even after I exit MySQL.
One of the values in your database happened to contain the byte sequence 0x1B, 0x28, 0x30 (ESC ( 0). When you ran the query, MySQL printed this byte sequence directly to your console. You can reproduce the effect from Python:
>>> print('\x1B\x28\x30')
Consoles use control characters (in particular 0x1B, ESC) as a way to allow applications to control aspects of the console other than pure text, such as colours and cursor movements. This behaviour is inherited from the old dumb-terminal devices that they are pretending to be (which is why they are also known as terminal emulators), along with some weirder tricks that we probably don't need any more. One of those is to switch permanently between different character sets (considered encodings, now, but this long predates Unicode).
One of those alternative character sets is the DEC Special Graphics Character Set which it looks like you have here. In this character set the byte 0x6D, usually used in ASCII for m, comes out as the graphical character └.
You could in principle reset your terminal to normal ASCII by printing the byte sequence 0x1B, 0x28, 0x42 (ESC ( B), but this tends to be a pain to arrange when your console is displaying rubbish.
There are potentially other ways your console can become confused; it is not, in general, safe to print arbitrary binary data to the console. There even used to be nastier things you could do with the console by faking keyboard input, which made this a security problem, but today it's just an annoyance.
However, one wouldn't normally expect to have any control codes in an e-mail address field. I suggest the application using the database should be doing some validation on the input it receives, and dropping or blocking all control codes (other than potentially newlines where necessary).
As a quick hack to clean this field for the specific case of the ESC character, you could do something like:
UPDATE things SET email=REPLACE(email, CHAR(0x1B), '');
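On the application side, the validation suggested above could be as simple as dropping control characters before the value is stored. A minimal Python sketch, assuming the input is already a decoded string; strip_control_chars and its keep parameter are hypothetical names:

```python
import unicodedata


def strip_control_chars(value, keep="\n"):
    """Drop every control character (Unicode category Cc) except those in keep."""
    return "".join(
        ch for ch in value
        if ch in keep or unicodedata.category(ch) != "Cc"
    )


# ESC ( 0 embedded in an address: only the ESC byte itself is a control
# character, so "(0" is left behind, just as with the REPLACE above.
dirty = "user\x1b(0@example.com"
print(repr(strip_control_chars(dirty)))
```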

Creating a diff which ignores differences between sentinel lines

I'm looking for a possible way of getting around some merge conflicts when working through different branches.
It's not unlikely that some information in some files (especially version numbers) is NOT meant to be propagated across branches, so I'm looking for a way to output a diff that ignores the text between well-defined sentinel lines, and I'd like to know if something like that already exists before coding my own solution.
This is what I'd like: suppose two source files that look like
some text
DIFF_IGNORE_START
foo bar
DIFF_IGNORE_END
some other text
one
and
some text
DIFF_IGNORE_START
different text
DIFF_IGNORE_END
some other text
two
I want the diff to be
--- original 2011-04-04 15:34:06.000000000 +0200
+++ modified 2011-04-04 15:35:13.000000000 +0200
@@ -3,4 +3,4 @@
foo bar
DIFF_IGNORE_END
some other text
-one
+two
I'd need a solution that allows the ignored blocks to be of a different size as well.
One way to implement this would be through a custom diff driver, declaring a special diff script in a .gitattributes file, which would:
remove every DIFF_IGNORE_xxx section from the root, source and destination versions, replacing each with dummy content (always identical across the three versions)
perform the diff with the modified versions
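The masking idea can be tried outside of git first. The sketch below is a hypothetical Python illustration using the standard difflib module, not a ready-made git diff driver: it replaces the content of every DIFF_IGNORE block with the same placeholder line and then diffs the masked texts, so ignored blocks of different sizes compare equal. The names mask_ignored and diff_ignoring_blocks are assumptions.

```python
import difflib

START = "DIFF_IGNORE_START"
END = "DIFF_IGNORE_END"
PLACEHOLDER = "<ignored>"


def mask_ignored(lines):
    """Replace everything between the sentinel lines with one placeholder line."""
    masked, inside = [], False
    for line in lines:
        marker = line.strip()
        if marker == START:
            inside = True
            masked.extend([line, PLACEHOLDER])
        elif marker == END:
            inside = False
            masked.append(line)
        elif not inside:
            masked.append(line)
    return masked


def diff_ignoring_blocks(old_text, new_text):
    """Unified diff of two texts with the ignored blocks masked out."""
    return list(difflib.unified_diff(
        mask_ignored(old_text.splitlines()),
        mask_ignored(new_text.splitlines()),
        "original", "modified", lineterm="",
    ))


ORIGINAL = "some text\nDIFF_IGNORE_START\nfoo bar\nDIFF_IGNORE_END\nsome other text\none\n"
MODIFIED = "some text\nDIFF_IGNORE_START\ndifferent text\nextra line\nDIFF_IGNORE_END\nsome other text\ntwo\n"

result = diff_ignoring_blocks(ORIGINAL, MODIFIED)
print("\n".join(result))
```

Note that the context lines show the placeholder rather than the original block content, which differs slightly from the desired output in the question.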

How can I check if a binary string is UTF-8 in mysql?

I've found a Perl regexp that can check if a string is UTF-8 (the regexp is from w3c site).
$field =~
m/\A(
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*\z/x;
But I'm not sure how to port it to MySQL, as it seems that MySQL doesn't support hex representations of characters (see this question).
Any thoughts how to port the regexp to MySQL?
Or maybe you know any other way to check if the string is valid UTF-8?
UPDATE:
I need this check working on the MySQL as I need to run it on the server to correct broken tables. I can't pass the data through a script as the database is around 1TB.
I've managed to repair my database using a test that works only if your data can be represented in a one-byte encoding; in my case it was latin1.
I've used the fact that MySQL changes the bytes that aren't UTF-8 to '?' when converting to latin1.
Here is what the check looks like:
SELECT (
CONVERT(
CONVERT(
potentially_broken_column
USING latin1)
USING utf8)
!=
potentially_broken_column) AS INVALID ....
If you are in control of both the input and output side of this DB then you should be able to verify that your data is UTF-8 on whichever side you like and implement constraints as necessary. If you are dealing with a system where you don't control the input side then you are going to have to check it after you pull it out and possibly convert in your language of choice (Perl it sounds like).
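If checking outside the database is an option, the W3C regex from the question ports almost verbatim to a bytes pattern. A minimal Python sketch, assuming the column value is available as bytes; is_valid_utf8 is a hypothetical helper, and Python's \Z plays the role of Perl's \z:

```python
import re

# Port of the W3C pattern from the question to a Python bytes regex.
UTF8_RE = re.compile(rb"""
    \A(?:
        [\x09\x0A\x0D\x20-\x7E]             # ASCII
      | [\xC2-\xDF][\x80-\xBF]              # non-overlong 2-byte
      | \xE0[\xA0-\xBF][\x80-\xBF]          # excluding overlongs
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}   # straight 3-byte
      | \xED[\x80-\x9F][\x80-\xBF]          # excluding surrogates
      | \xF0[\x90-\xBF][\x80-\xBF]{2}       # planes 1-3
      | [\xF1-\xF3][\x80-\xBF]{3}           # planes 4-15
      | \xF4[\x80-\x8F][\x80-\xBF]{2}       # plane 16
    )*\Z
""", re.VERBOSE)


def is_valid_utf8(data):
    """True if data (bytes) matches the W3C UTF-8 pattern."""
    return UTF8_RE.match(data) is not None


print(is_valid_utf8("köln".encode("utf-8")))    # well-formed UTF-8
print(is_valid_utf8("köln".encode("latin-1")))  # lone 0xF6 byte
print(is_valid_utf8(b"\xc0\xaf"))               # overlong encoding
```

Like the original, this pattern also rejects ASCII control characters other than tab, line feed and carriage return, so it is stricter than a plain UTF-8 decode.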
The database is a REALLY good storage facility but should not be used aggressively for other purposes. I think this is one spot where you should just let MySQL hold the data until you need to do something further with it.
If you want to continue on the path you are on then check out this MySQL Manual Page: http://dev.mysql.com/doc/refman/5.0/en/regexp.html
REGEX is normally VERY similar between languages (in fact I can almost always copy between JavaScript, PHP, and Perl with only minor adjustments for their wrapping functions) so if that is working REGEX then you should be able to port it easily.
GL!
EDIT: Look at this Stack article. You might want to use Stored Procedures, considering you cannot use scripting to handle the data: Regular expressions in stored procedures
With Stored Procedures you can loop through the data and do a lot of handling without ever leaving MySQL. That second article is going to refer you right back to the one I listed though so I think you need to first prove out your REGEX and get it working, then look into Stored Procedures.