Decoding old MySQL UTF-8 storage... how do I do it?

The following query, executed against an old MySQL database, should return the single UTF-8 character 'yama' (mountain).
select convert(sc_cardname using binary) as cn
from mtg.mtg_cdb_set_cards where setcardid = 214400
Instead it yields the following 15-byte array:
[195, 165, 194, 177, 194, 177, 195, 168, 226, 128, 158, 226, 128, 176, 32]
What are these values and how do I get from there to a character identity?
For reference, the expected binary array would be the following:
[229, 177, 177]
Update: the following code fixes the yama problem, but I don't know why:
var iconv = new Iconv('utf8','ISO-8859-1');
shortBuffer = buffer.slice(0,-9);
result = iconv.convert(shortBuffer).toString('utf8');

The answer was this: everything was actually encoded in LATIN1... changing the connection properties to reflect that solved the problem.
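For the record, those 15 bytes appear to be the UTF-8 text 山脉 plus a trailing space that was decoded as Latin-1 (strictly Windows-1252, which is what MySQL calls latin1) and then re-encoded as UTF-8, i.e. classic double encoding. That also explains the slice(0, -9) hack: the last nine bytes are the mangled 脉 and the space, which leaves exactly the six double-encoded bytes of 山. A minimal sketch of the round trip for 山 alone (Rust used here purely as a neutral illustration):
fn main() {
    // UTF-8 bytes of 山 (U+5C71, "yama"): [229, 177, 177]
    let utf8_bytes = "山".as_bytes();

    // Misread each byte as a Latin-1 character (byte value == code point)...
    let misread: String = utf8_bytes.iter().map(|&b| b as char).collect();
    println!("{misread}"); // å±±

    // ...then re-encode that string as UTF-8: the "double encoded" bytes.
    println!("{:?}", misread.as_bytes()); // [195, 165, 194, 177, 194, 177]
}
Reading over a latin1 connection (or converting back with Iconv as in the update) simply undoes that extra Latin-1-to-UTF-8 step, which is why both fixes work.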

Related

How do I convert MySQL BINARY columns I created with Node Buffers to strings in Rust?

What I did
Stored a UUID as BINARY(16) in NodeJS using
const uuid = Buffer.from('myEditedUuid');
(A followup to How do I fetch binary columns from MySQL in Rust?)
What I want to do
I want to fetch said UUID using Rust https://docs.rs/mysql/20.0.0/mysql/.
I am currently using Vec<u8> to receive said UUID:
#[derive(Debug, PartialEq, Eq, Serialize)]
pub struct Policy {
    sub: String,
    contents: Option<String>,
}

#[derive(Debug, PartialEq, Eq, Serialize)]
pub struct RawPolicy {
    sub: Option<Vec<u8>>,
    contents: Option<String>,
}

// fetch policies themselves
let policies: Vec<RawPolicy> = connection.query_map("SELECT sub, contents FROM policy", |(sub, contents)| {
    RawPolicy { sub, contents }
})?;
// convert uuid to string
let processed = policies.into_iter().map(|policy| {
    let sub = policy.sub.unwrap();
    let sub_string = String::from_utf8(sub).unwrap().to_string();
    Policy {
        sub: sub_string,
        contents: policy.contents,
    }
}).collect();
What my problem is
In Node, I would receive a Buffer from said database and use something like uuidBuffer.toString('utf8');
So in Rust, I try to use String::from_utf8(), but said Vec does not seem to be a valid utf8-vec:
panicked at 'called `Result::unwrap()` on an `Err` value: FromUtf8Error { bytes: [17, 234, 79, 61, 99, 181, 10, 240, 164, 224, 103, 175, 134, 6, 72, 71], error: Utf8Error { valid_up_to: 1, error_len: Some(1) } }'
My question is
Is using Vec<u8> the correct way of fetching BINARY columns, and if so, how do I convert them back to a string?
Edit1:
Node displays the Buffer contents in base 16 (Buffer.from('abcd') => <Buffer 61 62 63 64>, i.e. the UTF-8/ASCII bytes of the string shown in hex).
Fetching my UUID (stored with Buffer.from()) in Rust gives me the Vec<u8> [17, 234, 79, 61, 99, 181, 10, 240, 164, 224, 103, 175, 134, 6, 72, 71], which throws said UTF-8 error.
Vec does not seem to be allowed by MySQL in Rust.
The solution is simple:
You need to convert the BINARY to hex, either in your database query or in your code. So either use the hex crate https://docs.rs/hex/0.4.2/hex/ or rewrite your query:
Rewriting the query
let policies: Vec<RawPolicy> = connection.query_map("SELECT hex(sub), contents FROM policy", |(sub, contents)| {
    RawPolicy { sub, contents }
})?;
This converts sub to a hex string. The resulting Vec<u8> can now be converted using
let sub = policy.sub.unwrap();
let sub_string = String::from_utf8(sub).unwrap();
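If you would rather keep SELECT sub and do the conversion in Rust instead, a minimal sketch of the hex-crate route (assuming hex = "0.4" in Cargo.toml; the raw value below is hard-coded purely for illustration):
fn main() {
    // The 16 raw bytes that came back from the BINARY(16) column.
    let raw: Vec<u8> = vec![17, 234, 79, 61, 99, 181, 10, 240, 164, 224, 103, 175, 134, 6, 72, 71];

    // hex::encode yields the same text that HEX(sub) produces on the MySQL side,
    // except in lowercase: "11ea4f3d63b50af0a4e067af86064847".
    let sub_string = hex::encode(&raw);
    println!("{sub_string}");
}
If you need the canonical hyphenated UUID form, you would still have to insert the dashes yourself (or parse the bytes with something like the uuid crate).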
from_utf8_lossy can be used
let input = [17, 234, 79, 61, 99, 181, 10, 240, 164, 224, 103, 175, 134, 6, 72, 71];
let output = String::from_utf8_lossy(&input); // "\u{11}�O=c�\n��g��\u{6}HG"
Invalid characters will be replaced by �
The output "\u{11}�O=c�\n��g��\u{6}HG" is the same as the nodejs output "\u0011�O=c�\n��g��\u0006HG".
Unless this string is to be sent to a JavaScript runtime, it can be kept that way.
But if this string is to be sent to a JavaScript runtime (browser or Node.js), then the Unicode code point notations \u{x} should be replaced with their equivalent JavaScript notation.
from_utf16_lossy can be used as well
If some of the previous � are not UTF-8 but UTF-16 encoded, they will be converted; if not, the same � will be used to render them.
let input: &[u16] = &[17, 234, 79, 61, 99, 181, 10, 240, 164, 224, 103, 175, 134, 6, 72, 71];
println!("{}", String::from_utf16_lossy(input));

How to read a CSV file into a list of lists in SWI-Prolog, where each inner list represents a line of the CSV?

I have a CSV file that looks something like the following, i.e. not in Prolog format:
james,facebook,intel,samsung
rebecca,intel,samsung,facebook
Ian,samsung,facebook,intel
I am trying to write a Prolog predicate that reads the file and returns a list that looks like
[[james,facebook,intel,samsung],[rebecca,intel,samsung,facebook],[Ian,samsung,facebook,intel]]
to be used further in other predicates.
I am still a beginner and have found some good information on SO and modified it to see if I can get it working, but I'm stuck because I only generate a list that looks like this:
[[(james,facebook,intel,samsung)],[(rebecca,intel,samsung,facebook)],[(Ian,samsung,facebook,intel)]]
which means when I take the head of the inner lists I get (james,facebook,intel,samsung) and not james.
Here is the code being used (seen on SO and modified):
stream_representations(Input, Lines) :-
    read_line_to_codes(Input, Line),
    (   Line == end_of_file
    ->  Lines = []
    ;   atom_codes(FinalLine, Line),
        term_to_atom(LineTerm, FinalLine),
        Lines = [[LineTerm] | FurtherLines],
        stream_representations(Input, FurtherLines)
    ).
main(Lines) :-
    open('file.txt', read, Input),
    stream_representations(Input, Lines),
    close(Input).
The problem lies with term_to_atom(LineTerm,FinalLine).
First we read a line of the CSV file into a list of character codes in
read_line_to_codes(Input,Line).
Let's simulate input with atom_codes/2:
?- atom_codes('james,facebook,intel,samsung',Line).
Line = [106, 97, 109, 101, 115, 44, 102, 97, 99|...].
Then we recompose the original atom read in into FinalLine (this seems wasteful; there must be a way to hoover up a line into an atom directly):
?- atom_codes('james,facebook,intel,samsung',Line),
atom_codes(FinalLine, Line).
Line = [106, 97, 109, 101, 115, 44, 102, 97, 99|...],
FinalLine = 'james,facebook,intel,samsung'.
Then we try to map this atom in FinalLine into a term, LineTerm, using term_to_atom/2:
?- atom_codes('james,facebook,intel,samsung',Line),
atom_codes(FinalLine, Line),
term_to_atom(LineTerm,FinalLine).
Line = [106, 97, 109, 101, 115, 44, 102, 97, 99|...],
FinalLine = 'james,facebook,intel,samsung',
LineTerm = (james, facebook, intel, samsung).
You see the problem here: LineTerm is not quite a list, but a nested term using the functor ',' to separate elements:
?- atom_codes('james,facebook,intel,samsung',Line),
atom_codes(FinalLine, Line),
term_to_atom(LineTerm,FinalLine),
write_canonical(LineTerm).
','(james,','(facebook,','(intel,samsung)))
Line = [106, 97, 109, 101, 115, 44, 102, 97, 99|...],
FinalLine = 'james,facebook,intel,samsung',
LineTerm = (james, facebook, intel, samsung).
This ','(james,','(facebook,','(intel,samsung))) term will thus also be in the final result, just written differently: (james,facebook,intel,samsung) and packed into a list:
[(james,facebook,intel,samsung)]
You do not want this term, you want a list. You could use atomic_list_concat/2 to create a new atom that can be read as a list:
?- atom_codes('james,facebook,intel,samsung',Line),
atom_codes(FinalLine, Line),
atomic_list_concat(['[',FinalLine,']'],ListyAtom),
term_to_atom(LineTerm,ListyAtom),
LineTerm = [V1,V2,V3,V4].
Line = [106, 97, 109, 101, 115, 44, 102, 97, 99|...],
FinalLine = 'james,facebook,intel,samsung',
ListyAtom = '[james,facebook,intel,samsung]',
LineTerm = [james, facebook, intel, samsung],
V1 = james,
V2 = facebook,
V3 = intel,
V4 = samsung.
But that's rather barbaric.
We must do this whole processing in fewer steps:
Read a line of comma-separated strings on input.
Transform this into a list of either atoms or strings directly.
DCGs seem like the correct solution. Maybe someone can add a two-liner.

I have a CSV with NO line delimiter. How can I efficiently add one?

I have a CSV file (4.7 million characters) that I am struggling to import into a spreadsheet.
It seems the line delimiter is just a space...and yet there are also spaces after every comma.
What can I do to correctly organize this data in a spreadsheet?
I have tried using Google sheets import and Microsoft Excel import.
Example of current CSV
73, 5/11/2018,Vet Check,Result:Pregnant Multiple, , 73, 5/19/2018,Move To String/Pen,Move To:16, , 73, 5/22/2018,Mastitis,Treat. Name:Spectramast, Treat. Type:Intramammary, Comments:4 Times, Move To:1673, 5/25/2018,Move To String/Pen,Move To:10, , 73, 5/28/2018,Move To String/Pen,Move To:11, , 73, 7/20/2018,Vet Check,Result:OK - Confirmed PG, ,
Here is where the line breaks should be:
73, 5/11/2018,Vet Check,Result:Pregnant Multiple, ,
73, 5/19/2018,Move To String/Pen,Move To:16, ,
73, 5/22/2018,Mastitis,Treat. Name:Spectramast, Treat. Type:Intramammary, Comments:4 Times, Move To:16
73, 5/25/2018,Move To String/Pen,Move To:10, ,
73, 5/28/2018,Move To String/Pen,Move To:11, ,
73, 7/20/2018,Vet Check,Result:OK - Confirmed PG, ,
It seems that you could apply this kind of regex https://regex101.com/r/HU13Um/2
Then using sed and tail, if you run
<input sed -r 's/([0-9]{2}, *[0-9]+\/)/\n\1/g' | tail -n +2 >output
you will have
73, 5/11/2018,Vet Check,Result:Pregnant Multiple, ,
73, 5/19/2018,Move To String/Pen,Move To:16, ,
73, 5/22/2018,Mastitis,Treat. Name:Spectramast, Treat. Type:Intramammary, Comments:4 Times, Move To:16
73, 5/25/2018,Move To String/Pen,Move To:10, ,
73, 5/28/2018,Move To String/Pen,Move To:11, ,
73, 7/20/2018,Vet Check,Result:OK - Confirmed PG, ,

gdal_merge on a three band .tif - remove the 'no data' value

I have a large set of .tif files and I need to merge/mosaic them all into one .tif with the no-data value removed (i.e. value 230, 245, 255).
However, when I put this in... pixel '230, 245, 255' becomes '0, 245, 255'.
I am trying to get NO PIXEL returned for 230, 245, 255. Is that possible?
I:\TFS_6\trial_merge>gdal_merge.py -o test.tif -n 230 245 255 file1.tif file2.tif
ERROR 4: `245' does not exist in the file system,
and is not recognised as a supported dataset name.
ERROR 4: `255' does not exist in the file system,
and is not recognised as a supported dataset name.
0...10...20...30...40...50...60...70...80...90...100 - done.
gdalbuildvrt -addalpha -hidenodata -srcnodata "230 245 255" merged_tif.vrt *.tif
This turned the 'NoData' values into '230 245 255', so I was able to filter out BOTH 'NoData' and '230 245 255' accordingly.

Rails - Strange characters pass through validation and break query

I copy-pasted a string into a form field and a strange character broke my MySQL query.
I could force the error on the console this way (the weird character is in the middle of the two words "Invalid" and "Character", you can also copy-paste it):
> dog.name = "Invalid ​Character"
> dog.save # -> false
Which returns the following error:
ActiveRecord::StatementInvalid: Mysql2::Error: Incorrect string value: '\xE2\x80\x8BCha...' for column 'name' at row 1: UPDATE `dogs` SET `name` = 'Invalid ​Character' WHERE `dogs`.`id` = 2227
The error shows the character as the byte sequence '\xE2\x80\x8B'.
Is there any validation that I could use to remove this kind of weird character?
Obs: I also saw that
> "Invalid ​Character".unpack('U*')
Returns
[73, 110, 118, 97, 108, 105, 100, 32, 8203, 67, 104, 97, 114, 97, 99, 116, 101, 114]
The weird character must be the 8203 one.
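Indeed, code point 8203 is U+200B ZERO WIDTH SPACE, and its three-byte UTF-8 encoding is exactly the \xE2\x80\x8B from the MySQL error above. A quick check (Rust here purely as a neutral scratchpad, since any language would do):
fn main() {
    let zwsp = '\u{200B}'; // ZERO WIDTH SPACE

    // Its code point matches the value reported by unpack('U*')...
    println!("{}", zwsp as u32); // 8203

    // ...and its UTF-8 bytes match the ones in the MySQL error message.
    println!("{:02X?}", zwsp.to_string().as_bytes()); // [E2, 80, 8B]
}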
Obs2: In my application.rb, I have: config.encoding = "utf-8"
EDIT
On my console, I got:
> ActiveRecord::Base.connection.charset # -> "utf8"
> ActiveRecord::Base.connection.collation # -> "utf8_unicode_ci"
I also ran (on the rails db MySQL console):
> SELECT table_collation FROM INFORMATION_SCHEMA.TABLES where table_name = 'dogs';
and got "utf8_unicode_ci"
EDIT2
If I change the table's character set to utf8mb4 I don't get the error. But still, I have to filter those characters.
On the rails db MySQL console, I used:
SHOW CREATE TABLE dogs;
To find out that the charset for the table was latin1.
I just added a migration with this query:
ALTER TABLE dogs CONVERT TO CHARACTER SET utf8mb4;
And it started to work fine.