How to change a string delimiter in Ab Initio? - ab-initio

In an Ab Initio graph I have an input file with pipe-delimited values in each row. I parse it with a DML file similar to this:
record
decimal("|",0, maximum_length=19, sign_reserved) v1 = NULL("");
utf8 string("|", maximum_length=10) v2 = "";
utf8 string("|", maximum_length=10) v3 = "";
utf8 string("|", maximum_length=40) v4 = "";
utf8 string("|", maximum_length=255) v5 = "";
utf8 string("\n", maximum_length=40) v6 = "";
end
For later equality comparisons with other data I want all of those strings to be pipe-delimited, so I need to change the delimiter of v6.
I tried to do it with a simple Reformat by changing the output DML to the one below and leaving the transform function empty:
record
decimal("|",0, maximum_length=19, sign_reserved) v1 = NULL("");
utf8 string("|", maximum_length=10) v2 = "";
utf8 string("|", maximum_length=10) v3 = "";
utf8 string("|", maximum_length=40) v4 = "";
utf8 string("|", maximum_length=255) v5 = "";
utf8 string("|", maximum_length=40) v6 = "";
string(1) newline = "\n";
end
However, this left a trash character inside v6, and later I had to filter the v6 value so that it contained only proper characters. This solution doesn't seem neat.
To avoid the trash left inside v6 I tried reinterpret_as, string_concat and other functions, but none of them gave a clean solution.
How can I change the delimiter of v6 in a simple way?

A == B compares the value of A to the value of B. The comparison returns the same result regardless of whether A and B have the same delimiter or not. If you actually do need to change the delimiter of a field, the Reformat method you suggest is the correct one. If you're seeing garbage in the v6 values coming out, it means there was garbage in v6 going in.
More broadly speaking, Stack Overflow isn't the right venue for discussing things Ab Initio. You'd be better off posing your question to Ab Initio support or on the dedicated Ab Initio Forum accessible through the GDE. The Forum is monitored by numerous Ab Initio users and employees, and you're pretty much guaranteed to get a prompt response.

Related

How can I limit string characters to utf8mb3

I wrote a conversion in Go for decoding inputs.
It works like a charm (thanks to "golang.org/x/net/html/charset"), but now I have to limit the output to characters contained in utf8mb3.
As far as I know, Go's built-in strings are full UTF-8.
The problem is that the underlying database setting is locked by vendor rules and set to utf8mb3 (yes, MySQL), and we can't change it.
So far I'm using this to limit characters and rewrite disallowed ones to "*":
//compile our regexp. if fails, return undecoded
allowedCharsREGEX = `[^ěščřžýáíéúůťňĺľŕĚŠČŘŽÝÁÍÉÚŮŤŇĹĽŔ!?§©®±%¼½¾¿ß÷£¥¢~¡#&_\"\\/:;a-zA-Z_0-9\t\n\r\ ]`
reg := regexp.MustCompile(allowedCharsREGEX)
procString := outStr
// replace not allowed chars
procString = reg.ReplaceAllString(outStr,"*")
This limits the output characters, but I want to expand it to the whole utf8mb3 character list.
From the documentation it seems that unicode IsValid covers full UTF-8.
Is there any possible "quick solution"?
Go v.1.13, ubuntu 20.04
Not everything should be done with a regexp.
utf8mb3 contains all runes from the BMP which can be encoded with 3 Bytes in UTF-8.
// Copy runes that fit in three UTF-8 bytes (U+0000 to U+FFFF); mask the rest.
sb := &strings.Builder{}
for _, r := range input {
    if r <= 0xFFFF {
        sb.WriteRune(r)
    } else {
        sb.WriteByte('*')
    }
}
return sb.String()
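A self-contained version of the loop above, as a minimal sketch. The function name limitToUTF8MB3 and the '*' placeholder are illustrative choices, not part of any library; the only assumption is that utf8mb3 accepts exactly the code points up to U+FFFF.

```go
package main

import (
	"fmt"
	"strings"
)

// limitToUTF8MB3 keeps every rune that fits in at most three UTF-8 bytes
// (U+0000 to U+FFFF) and replaces anything above that range with '*'.
// The name and the placeholder character are hypothetical.
func limitToUTF8MB3(input string) string {
	var sb strings.Builder
	sb.Grow(len(input)) // the output is never longer than the input
	for _, r := range input {
		if r <= 0xFFFF {
			sb.WriteRune(r)
		} else {
			sb.WriteByte('*')
		}
	}
	return sb.String()
}

func main() {
	fmt.Println(limitToUTF8MB3("žluťoučký kůň 😊")) // prints "žluťoučký kůň *"
}
```

Ranging over a string decodes it rune by rune, so multi-byte characters inside the BMP pass through unchanged and only four-byte sequences are masked.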

Converting delimited values into a comma separated list

My Webapp (PHP/jQuery/MySQL) has features which enable me to send out nicely formatted html email notifications to my customers based on certain events. The code works nicely and merges data from my Database into form fields although I need to enhance it to be able to provide enriched/localised/reformatted data in some circumstances.
For example:
- Provide date/time values in a user's own timezone
- Provide monetary values formatted to a user's locale
This requires me to do another pass of the email content to detect whether any fields remain unmerged before sending the email off to the user and if so, to format those field values appropriately before sending the email. Therefore what I want to do is extract a list of all delimited fieldnames from a table field value and return that list in comma delimited form.
I can already count how many times a delimiter appears.
I can also find the position of the first delimiter.
It looks like it would be easy to split the values if I were using the same opening and closing delimiters, but because I have many email templates already in use, this isn't currently viable.
I don't have any code for this yet. I'm just trying to avoid writing my own MySQL function to do this, by using existing MySQL functions if they are capable of doing this.
I've tried using various combinations of SUBSTRING, SUBSTRING_INDEX, LOCATE.
So what I need to be able to do is something like this:
SELECT msg_id, values_found_between(msg_content,"<",">") AS comma_delimited_list;
So for example, with source data of...
msg_id | msg_content
-------+------------
1 | The quick brown <fox> jumps over the lazy <dog>
2 | The quick brown fox jumps over the lazy dog
I can get a resulting recordset such as this:
msg_id | comma_delimited_list
-------+------------
1 | fox,dog
Alright, I had a crack and this seems to work well:
CREATE FUNCTION db.`FN_find_values_between`(`in_haystack` VARCHAR(10000), `in_opening_delimiter` VARCHAR(1),`in_closing_delimiter` VARCHAR(1)) RETURNS varchar(1000) CHARSET utf8
BEGIN
DECLARE numFoundOpen INT DEFAULT 0;
DECLARE numFoundClose INT DEFAULT 0;
DECLARE numFoundTarget INT DEFAULT 0;
DECLARE numCurrentIndex INT DEFAULT 0;
DECLARE strOutput VARCHAR(1000) DEFAULT "";
DECLARE numSearchFromPos INT DEFAULT 1;
DECLARE numCurrentCharPosStart INT DEFAULT 1;
DECLARE numCurrentCharPosEnd INT DEFAULT 1;
DECLARE strCurrentFieldname VARCHAR(50) DEFAULT "";
DECLARE numLength INT DEFAULT 0;
SET numFoundOpen=
(SELECT
ROUND ((LENGTH(in_haystack)- LENGTH( REPLACE (in_haystack, in_opening_delimiter, ""))) / LENGTH(in_opening_delimiter)));
SET numFoundClose=
(SELECT
ROUND ((LENGTH(in_haystack)- LENGTH( REPLACE (in_haystack, in_closing_delimiter, ""))) / LENGTH(in_closing_delimiter)));
IF (numFoundOpen=numFoundClose) THEN
SET numFoundTarget=numFoundOpen;
END IF;
WHILE numCurrentIndex < numFoundTarget DO
SET numCurrentIndex=numCurrentIndex+1;
SET numCurrentCharPosStart = LOCATE(in_opening_delimiter, in_haystack, numSearchFromPos);
SET numCurrentCharPosEnd = LOCATE(in_closing_delimiter, in_haystack, numSearchFromPos);
SET numLength=1+(numCurrentCharPosEnd-numCurrentCharPosStart);
SET strCurrentFieldname=SUBSTRING(in_haystack,numCurrentCharPosStart,numLength);
SET strOutput=CONCAT(strOutput,strCurrentFieldname,",");
SET strCurrentFieldname="";
SET numSearchFromPos=numCurrentCharPosEnd+1;
END WHILE;
IF (strOutput <> "") THEN
SET strOutput=LEFT(strOutput,LENGTH(strOutput)-1);
END IF;
RETURN strOutput;
END;
As per the code above, I managed to write my own MySQL function to do this.
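For comparison, the same scan (find an opening delimiter, find the next closing delimiter, collect what lies between) can be sketched outside the database. This is a hedged illustration in Go, not a replacement for the MySQL function; the name valuesBetween is made up, and unlike the function above it returns the names without their delimiters, matching the fox,dog output in the question.

```go
package main

import (
	"fmt"
	"strings"
)

// valuesBetween collects every substring enclosed by opening and closing
// and joins the results with commas. Hypothetical helper for illustration.
func valuesBetween(haystack, opening, closing string) string {
	var found []string
	rest := haystack
	for {
		start := strings.Index(rest, opening)
		if start < 0 {
			break // no more opening delimiters
		}
		rest = rest[start+len(opening):]
		end := strings.Index(rest, closing)
		if end < 0 {
			break // unmatched opening delimiter, stop scanning
		}
		found = append(found, rest[:end])
		rest = rest[end+len(closing):]
	}
	return strings.Join(found, ",")
}

func main() {
	fmt.Println(valuesBetween("The quick brown <fox> jumps over the lazy <dog>", "<", ">")) // prints "fox,dog"
}
```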

unescape diacritics in \u0 format (json) in ms sql (SQL Server)

I'm getting a JSON file, which I load into an Azure SQL database. This JSON is the direct output of an API, so there is nothing I can do with it before loading it into the DB.
In that file, all Polish diacritics are escaped in the "C/C++/Java source code" form (based on: http://www.fileformat.info/info/unicode/char/0142/index.htm).
So for example:
ł is \u0142
I was trying to find a method to convert (unescape) those into proper Polish letters.
In the worst case, I can write a function that replaces all combinations:
Replace(Replace(string,'\u0142',N'ł'),'\u0144',N'ń')
And so on, making one big, terrible function...
I was looking for a ready-made function like the ones for URL decoding, which have been answered here on Stack Overflow in many topics, and here: https://www.codeproject.com/Articles/1005508/URL-Decode-in-T-SQL
Using that solution would be possible, but I cannot figure out the cast/convert with the proper collation and types to get the result I'm looking for.
So if anyone knows or has a function that unescapes those \u sequences in a string, that would be great, but I can write something on my own if I get the conversion right. For example I tried:
select convert(nvarchar(1), convert(varbinary, 0x0142, 1))
I assumed that changing \u to 0x would be the answer, but it gives some Chinese characters. So this is the wrong direction...
Edit:
After googling more I found exactly the same question here on Stack Overflow from @Pasetchnik: Json escape unicode in SQL Server
It looks like this is the best solution available in MS SQL.
The only thing I needed to change was using NVARCHAR instead of the VARCHAR in the linked solution:
CREATE FUNCTION dbo.Json_Unicode_Decode(@escapedString nVARCHAR(MAX))
RETURNS nVARCHAR(MAX)
AS
BEGIN
    DECLARE @pos INT = 0,
            @char nvarCHAR,
            @escapeLen TINYINT = 2,
            @hexDigits TINYINT = 4
    SET @pos = CHARINDEX('\u', @escapedString, @pos)
    WHILE @pos > 0
    BEGIN
        SET @char = NCHAR(CONVERT(varbinary(8), '0x' + SUBSTRING(@escapedString, @pos + @escapeLen, @hexDigits), 1))
        SET @escapedString = STUFF(@escapedString, @pos, @escapeLen + @hexDigits, @char)
        SET @pos = CHARINDEX('\u', @escapedString, @pos)
    END
    RETURN @escapedString
END
Instead of nested REPLACE you could use:
DECLARE @string NVARCHAR(MAX)= N'\u0142 \u0144\u0142';
SELECT @string = REPLACE(@string,u, ch)
FROM (VALUES ('\u0142',N'ł'),('\u0144', N'ń')) s(u, ch);
SELECT @string;

Node.js encode ISO8859-1 to UTF-8

I have an application that allows users to persist strings to a database and those strings may contain emojis. The problem I have is that an emoji such as 😊 gets stored in MySQL as ðŸ˜Š.
When I retrieve this string using a PHP MySQL client and render it in a web browser, it renders fine, probably because the Content-Type is set to UTF-8. When I try to read the string in node.js, I get back what I think is the ISO8859-1 encoding, a literal ðŸ˜Š. The charset on the table is set to latin1 and that's where I'm getting ISO8859-1 from.
What's the right way to encode the string in node.js so that I can see the emoji and not the encoding set by MySQL when I console.log the string?
ðŸ˜Š is Mojibake for 😊. Interpreting the former as latin1, you get hex F09F988A, which is the UTF-8 hex for that Emoji.
(Note: UTF-8 outside MySQL is equivalent to utf8mb4 inside MySQL.)
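The round trip described above can be reproduced in a few lines. This is a minimal sketch in Go with hypothetical helper names (latin1Decode, latin1Encode). It uses plain ISO-8859-1, where every byte maps to the code point of the same value; MySQL's latin1 is in practice Windows-1252, which is why some of the garbled characters appear as printable letters there, but the principle is the same.

```go
package main

import "fmt"

// latin1Decode reads raw bytes as ISO-8859-1: each byte becomes the code
// point with the same numeric value. Hypothetical helper.
func latin1Decode(b []byte) string {
	runes := make([]rune, len(b))
	for i, c := range b {
		runes[i] = rune(c)
	}
	return string(runes)
}

// latin1Encode is the inverse; it reports false if the text contains a
// character above U+00FF, which ISO-8859-1 cannot represent.
func latin1Encode(s string) ([]byte, bool) {
	out := make([]byte, 0, len(s))
	for _, r := range s {
		if r > 0xFF {
			return nil, false
		}
		out = append(out, byte(r))
	}
	return out, true
}

func main() {
	raw := []byte("😊")
	fmt.Printf("% X\n", raw) // prints "F0 9F 98 8A"

	garbled := latin1Decode(raw) // four separate characters: the Mojibake
	fmt.Println(len([]rune(garbled))) // prints 4

	fixed, ok := latin1Encode(garbled) // back to the original four bytes
	fmt.Println(string(fixed), ok)     // prints "😊 true"
}
```

The bytes never change in this round trip; only the label attached to them does, which is why the fix on the MySQL side is to relabel the column rather than to convert the data.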
In MySQL, you must have the column/table declared with CHARACTER SET utf8mb4. You must also state that the data being stored/fetched is encoded utf8mb4. Note: utf8 will not suffice.
Do a SELECT HEX(col) FROM ... to see if you get that hex for that Emoji. If that is the case and the column is currently latin1, then part of the fix is to carefully convert the column to utf8mb4. That is, you have CHARACTER SET latin1, but have UTF-8 bytes in it; this will leave bytes alone while fixing charset. Assuming the column is already VARCHAR(111) CHARACTER SET latin1 NOT NULL, then do this 2-step ALTER:
ALTER TABLE tbl MODIFY COLUMN col VARBINARY(111) NOT NULL;
ALTER TABLE tbl MODIFY COLUMN col VARCHAR(111) CHARACTER SET utf8mb4 NOT NULL;
Virtually any other conversion mechanism will make a worse mess.
As for establishing the connection correctly, it goes something like this for node.js:
var connection = mysql.createConnection({ ... , charset : 'utf8mb4'});
You do not need to convert the encoding, and you should not. Just use the right protocols. If you send the HTML page in UTF-8, the browser will send the data back to your server in UTF-8.
Storing that data in a database that is in latin1 won't work at all. You must convert your database to UTF-8 as well. That includes the database, the tables, and eventually the columns themselves. Also make sure that your database client is configured to connect in UTF-8, because the client itself has to declare its encoding.
Once the whole data flow is in UTF-8, everything will work flawlessly.
Server -> GET HTML -> POST -> Server -> SQL Client -> Database -> Table -> Column
You can use iconv for a simple ISO-8859-1 to UTF-8 conversion.
From this gist:
var iconv = require('iconv');
function toUTF8(body) {
// convert from iso-8859-1 to utf-8
var ic = new iconv.Iconv('iso-8859-1', 'utf-8');
var buf = ic.convert(body);
return buf.toString('utf-8');
}
If you pass it anything in ISO-8859-1, it returns the UTF-8 equivalent.
For example,
toUTF8("ðŸ˜Š");
will return 😊
I have found a super dirty way to convert it back:
const isoToUtfTable = {
'ð': 0xf0,
'Ÿ': 0x9f,
'˜': 0x98,
'Š': 0x8a
};
function convertISO8859ToUtf8(s) {
const buf = new Uint8Array([...s].map(c => isoToUtfTable[c]));
return String.fromCharCode(...buf)
}
function decode_utf8(s) {
return decodeURIComponent(escape(s));
}
console.log(decode_utf8(convertISO8859ToUtf8('ðŸ˜Š')))
Now you simply need to complete the isoToUtfTable table (it's small, see https://en.wikipedia.org/wiki/ISO/IEC_8859-1).
Maybe try to look at node-iconv.
const { Iconv } = require('iconv');
const iconv = new Iconv('ISO-8859-2', 'UTF-8');
const buffer = iconv.convert(something);
console.log(buffer);
console.log(buffer.toString('UTF8'));
This is a full answer based on @zerkms's solution:
const isoToUtfTable = {
'€':0x80,
'na':0x81,
'‚':0x82,
'ƒ':0x83,
'„':0x84,
'…':0x85,
'†':0x86,
'‡':0x87,
'ˆ':0x88,
'‰':0x89,
'Š':0x8a,
'‹':0x8b,
'Œ':0x8c,
'na':0x8d,
'Ž':0x8e,
'na':0x8f,
'na':0x90,
'‘':0x91,
'’':0x92,
'“':0x93,
'”':0x94,
'•':0x95,
'–':0x96,
'—':0x97,
'˜':0x98,
'™':0x99,
'š':0x9a,
'›':0x9b,
'œ':0x9c,
'na':0x9d,
'ž':0x9e,
'Ÿ':0x9f,
'NSBP':0xa0,
'¡':0xa1,
'¢':0xa2,
'£':0xa3,
'¤':0xa4,
'¥':0xa5,
'¦':0xa6,
'§':0xa7,
'¨':0xa8,
'©':0xa9,
'ª':0xaa,
'«':0xab,
'¬':0xac,
'SHY':0xad,
'®':0xae,
'¯':0xaf,
'°':0xb0,
'±':0xb1,
'²':0xb2,
'³':0xb3,
'´':0xb4,
'µ':0xb5,
'¶':0xb6,
'·':0xb7,
'¸':0xb8,
'¹':0xb9,
'º':0xba,
'»':0xbb,
'¼':0xbc,
'½':0xbd,
'¾':0xbe,
'¿':0xbf,
'À':0xc0,
'Á':0xc1,
'Â':0xc2,
'Ã':0xc3,
'Ä':0xc4,
'Å':0xc5,
'Æ':0xc6,
'Ç':0xc7,
'È':0xc8,
'É':0xc9,
'Ê':0xca,
'Ë':0xcb,
'Ì':0xcc,
'Í':0xcd,
'Î':0xce,
'Ï':0xcf,
'Ð':0xd0,
'Ñ':0xd1,
'Ò':0xd2,
'Ó':0xd3,
'Ô':0xd4,
'Õ':0xd5,
'Ö':0xd6,
'×':0xd7,
'Ø':0xd8,
'Ù':0xd9,
'Ú':0xda,
'Û':0xdb,
'Ü':0xdc,
'Ý':0xdd,
'Þ':0xde,
'ß':0xdf,
'à':0xe0,
'á':0xe1,
'â':0xe2,
'ã':0xe3,
'ä':0xe4,
'å':0xe5,
'æ':0xe6,
'ç':0xe7,
'è':0xe8,
'é':0xe9,
'ê':0xea,
'ë':0xeb,
'ì':0xec,
'í':0xed,
'î':0xee,
'ï':0xef,
'ð':0xf0,
'ñ':0xf1,
'ò':0xf2,
'ó':0xf3,
'ô':0xf4,
'õ':0xf5,
'ö':0xf6,
'÷':0xf7,
'ø':0xf8,
'ù':0xf9,
'ú':0xfa,
'û':0xfb,
'ü':0xfc,
'ý':0xfd,
'þ':0xfe,
'ÿ':0xff }
let offsetArray = [];
function convertISO8859ToUtf8Simple(s) {
offsetArray = [];
const buf = new Uint8Array([...s].map((c, index) =>
{
if(isoToUtfTable[c]) {
if(offsetArray.length > 0 && offsetArray[offsetArray.length -1]+3 < index) {
offsetArray.push(index);
}
if(offsetArray.length == 0) {
offsetArray.push(index);
}
}
return isoToUtfTable[c];
}
));
return String.fromCharCode(...buf);
}
function decode_utf8(s) {
return decodeURIComponent(escape(s));
}
function emojiStringToArray(str) {
split = str.split(/([\uD800-\uDBFF][\uDC00-\uDFFF])/);
arr = [];
for (var i=0; i<split.length; i++) {
char = split[i]
if (char !== "" && !char.includes('\u0000')) {
arr.push(char);
}
}
return arr;
};
const string = 'hello 😌😌 with some emojis 😊 🐝';
function finalString(s) {
const emojis = emojiStringToArray(decode_utf8(convertISO8859ToUtf8Simple(s)));
for(let i = 0; i<offsetArray.length; i++){
let position = 0;
if (i == 0) {
position = offsetArray[i];
} else {
position = (i * -3) + offsetArray[i] + (i);
}
s = [s.slice(0, position), emojis[i], s.slice(position+4)].join('');
}
return s;
}
console.log(finalString(string));

CF8 and AES decrypting MySQL AES: encodings are not same

This has become more of an exercise in what am I doing wrong than mission critical, but I'd still like to see what (simple probably) mistake I'm making.
I'm using MySQL (5.1.x) AES_ENCRYPT to encrypt a string. I'm using CF's generateSecretKey('AES') to make a key (I've tried it at the default and at 128 and 256 bit lengths).
So let's say my code looks like this:
<cfset key = 'qLHVTZL9zF81kiTnNnK0Vg=='/>
<cfset strToEncrypt = '4111111111111111'/>
<cfquery name="i" datasource="#dsn#">
INSERT INTO table(str)
VALUES (AES_ENCRYPT('#strToEncrypt#','#key#'));
</cfquery>
That works fine as expected and I can select it using SELECT AES_DECRYPT(str,'#key#') AS... with no problems at all.
What I can't seem to do though is get CF to decrypt it using something like:
<cfquery name="s" datasource="#dsn#">
SELECT str
FROM table
</cfquery>
<cfoutput>#Decrypt(s.str,key,'AES')#</cfoutput>
or
<cfoutput>#Decrypt(toString(s.str),key,'AES')#</cfoutput>
I keep getting "The input and output encodings are not same" (including the toString() - without that I get a binary data error). The field type for the encrypted string in the db is blob.
This entry explains that mySQL handles AES-128 keys a bit differently than you might expect:
.. the MySQL algorithm just or’s the bytes of a given passphrase
against the previous bytes if the password is longer than 16 chars and
just leaves them 0 when the password is shorter than 16 chars.
Not highly tested, but this seems to yield the same results (in hex).
<cfscript>
function getMySQLAES128Key( key ) {
var keyBytes = charsetDecode( arguments.key, "utf-8" );
var finalBytes = listToArray( repeatString("0,", 16) );
for (var i = 1; i <= arrayLen(keyBytes); i++) {
// adjust for base 0 vs 1 index
var pos = ((i-1) % 16) + 1;
finalBytes[ pos ] = bitXOR(finalBytes[ pos ], keyBytes[ i ]);
}
return binaryEncode( javacast("byte[]", finalBytes ), "base64" );
}
key = "qLHVTZL9zF81kiTnNnK0Vg==";
input = "4111111111111111";
encrypted = encrypt(input, getMySQLAES128Key(key), "AES", "hex");
WriteDump("encrypted="& encrypted);
// note: assumes input is in "hex". either convert the bytes
// to hex in mySQL first or use binaryEncode
decrypted = decrypt(encrypted, getMySQLAES128Key(key), "AES", "hex");
WriteDump("decrypted="& decrypted);
</cfscript>
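The key folding in getMySQLAES128Key is small enough to restate in another language. A minimal Go sketch, assuming MySQL's default 128-bit mode and the XOR folding used in the CFML code above; the function name foldMySQLAESKey is made up for this example.

```go
package main

import (
	"encoding/hex"
	"fmt"
)

// foldMySQLAESKey XORs the passphrase bytes into a 16-byte key, wrapping
// around when the passphrase is longer than 16 bytes. Bytes that are never
// touched by a shorter passphrase stay zero. Hypothetical helper name.
func foldMySQLAESKey(passphrase []byte) [16]byte {
	var key [16]byte
	for i, b := range passphrase {
		key[i%16] ^= b
	}
	return key
}

func main() {
	key := foldMySQLAESKey([]byte("qLHVTZL9zF81kiTnNnK0Vg=="))
	fmt.Println(hex.EncodeToString(key[:]))
}
```

Note that the Base64 text is folded as a plain string of 24 characters, not decoded first, which mirrors how the key is passed to AES_ENCRYPT in the question.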
Note: If you are using MySQL for encryption be sure to see its documentation, which mentions the plain text may end up in various logs (replication, history, et cetera) and "may be read by anyone having read access to that information".
Update: Things may have changed, but according to this 2004 bug report the .mysql_history file is only on Unix. (Keep in mind there may be other log files) Detailed instructions for clearing .mysql_history can be found in the manual, but in summary:
Set the MYSQL_HISTFILE variable to /dev/null (on each log in)
Create .mysql_history as a symbolic link to /dev/null (only once)