SSIS Derived Column - Parse Text between break returns

I have a text field from a SQL Server Source. It is a phone number field that typically has this format:
Home: 555-555-1212
Work: 555-555-1212
Cell: 555-555-1212
Emergency: 555-555-1212
I'm trying to split this among fields so that only 555-555-1212 is displayed.
I am then taking this field and converting it to a string. There are literal break returns (\r\n) between the labels here. The goal is to have this data split among multiple fields (home, work, cell, emergency, etc.). I was researching how to split text among fields and made some progress. In the case of home numbers, I used this logic:
SUBSTRING(Phone_converted,FINDSTRING(Phone_converted,"Home:",1) + 5,FINDSTRING(Phone_converted,"\n",1) - FINDSTRING(Phone_converted,"Home:",1) - 5)
This works great as it parses up to the line break and I get 555-555-1212.
Now I experience an issue when searching for a text between break returns. I tried the same logic for Work numbers:
SUBSTRING(Phone_converted,FINDSTRING(Phone_converted,"Work:",1) + 5,FINDSTRING(Phone_converted,"\n",1) - FINDSTRING(Phone_converted,"Work:",1) - 5)
But that won't work and results in rows written to my error redirection file. I then tried to include a break return to find the text at the beginning:
SUBSTRING(Phone_converted,FINDSTRING(Phone_converted,"\nWork:",1) + 5,FINDSTRING(Phone_converted,"\n",1) - FINDSTRING(Phone_converted,"\nWork:",1) - 5)
No luck there either. Any ideas on how I can address this? Also, I would appreciate an idea of how I can handle the Emergency label at the end. There won't be a break return in that situation, but I still want to parse the text.

I look at your data and I see
Home:|555-555-1212|Work:|555-555-1212|Cell:|555-555-1212|Emergency:|555-555-1212
I'm using the pipe character, |, as a placeholder for where I would segment that string, which is basically wherever you have whitespace (space, tab, newline, etc).
There are two approaches to this. I'll start with the easy one.
Script Component
String.Split is your friend here. Split on whitespace and that source data becomes an array of labels, numbers, and empty entries (the empty entries come from each \r\n pair being two whitespace characters).
I added a new Script Component, acting as a Transformation, and created 4 output columns, all strings of length 12, codepage 1252: Home, Work, Cell, and Emergency. I populate them like so:
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // Split() with no arguments splits on every whitespace character.
    // Since \r\n is two whitespace characters, an empty entry lands between
    // the lines, which is why the numbers sit at indices 1, 4, 7 and 10.
    string[] split = Row.PhoneData.Split();
    Row.Home = split[1];
    Row.Work = split[4];
    Row.Cell = split[7];
    Row.Emergency = split[10];
}
Derived Column
I'm not going to build out a full-blown implementation of this. The above is much simpler, but I run into situations where ETL devs say they aren't allowed to use Script Tasks/Components, and that's usually because people reached for them first instead of last.
The approach here is to have lots of Derived Column components on your Data Flow. It won't hurt performance, and it can in fact make things easier. It will definitely make your debugging easier, as you'll have lots of that to do.
DER Find Colons
This would add 4 columns into the data flow - HomeColonPosition, WorkColonPosition, etc. You've already started down this path; just build it out in the actual data flow, as you'll need to reference these positions, and again, it's easier to fix the calculation that populates a column than a calculation that's wrong and used everywhere. You're likely to find that 4 derived columns are useful here. One caution: the third argument to FINDSTRING is the occurrence to locate, not a starting position, so there is no need to chain the previous colon's position.
Thus, Home's colon is found with
FINDSTRING(PhoneData, ":", 1)
and Work's is just
FINDSTRING(PhoneData, ":", 2)
Just knowing the position of the 4 colons in that string, I can figure out where the phone numbers are (maybe). The position of the colon + 2 (colon and the space) is the starting point and then go out 12 characters.
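For example, a sketch assuming the position columns above exist: the Work number would be pulled with
SUBSTRING(PhoneData, WorkColonPosition + 2, 12)
where the + 2 steps over the colon and the space.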
Where this approach gets ugly, much as it did with the script approach, is when the data isn't consistent.

How to add back comments/whitespaces in a translator using ANTLR4's visitor model

I'm currently writing a TSQL (Sybase/Microsoft SQL) to MySQL translator using the ANTLR4 visitor approach.
I'm able to push comments and whitespaces to different channels so that I can use that information later.
What's not super clear is:
how do I get the data back?
and more importantly how do I plug the comments and whitespaces back into my translated MySQL code?
Re: #1, this seems to work to get the list of all tokens including the comments/whitespaces:
public static List<Token> getHiddenTokensFromString(String sqlIn, int hiddenChannel) {
    CharStream charStream = CharStreams.fromString(sqlIn);
    CaseChangingCharStream upper = new CaseChangingCharStream(charStream, true);
    TSqlLexer lexer = new TSqlLexer(upper);
    CommonTokenStream commonTokenStream = new CommonTokenStream(lexer, hiddenChannel);
    commonTokenStream.fill();
    List<Token> hiddenTokens = commonTokenStream.getTokens();
    return hiddenTokens;
}
Re #2, what makes it particularly challenging is that as part of the translation, lines of SQL have to be moved around, some lines removed and some lines added.
Any help will be greatly appreciated.
Thanks.
The ANTLR4 lexer creates a number of tokens, each with an index (a running number). Provided you didn't just skip a token, all tokens are available for later inspection, once the parsing step is done, regardless of their channels (the channel is actually just a number property on a token).
So, given you have a token you want to translate, get its index and then ask the token stream for the tokens with the next smaller index or next higher index. These are usually the hidden whitespaces.
Once you have the whitespace token use its start and stop index to get the original text from the char stream. And since you know where you are in the translation process when you do that, it should be easy to know where to insert the original text.
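As a Java sketch of that idea (assuming the standard ANTLR4 runtime; hiddenTextToLeft is a helper name invented here):

import org.antlr.v4.runtime.BufferedTokenStream;
import org.antlr.v4.runtime.Token;
import org.antlr.v4.runtime.misc.Interval;
import java.util.List;

// Given a filled token stream and the index of the token currently being
// translated, return the original text of the hidden tokens (whitespace,
// comments) immediately to its left.
static String hiddenTextToLeft(BufferedTokenStream tokens, int tokenIndex, int channel) {
    List<Token> hidden = tokens.getHiddenTokensToLeft(tokenIndex, channel);
    if (hidden == null) return "";
    StringBuilder sb = new StringBuilder();
    for (Token h : hidden) {
        // the start/stop indices point back into the original char stream
        sb.append(h.getInputStream()
                   .getText(Interval.of(h.getStartIndex(), h.getStopIndex())));
    }
    return sb.toString();
}

getHiddenTokensToRight works the same way for the trivia that follows a token.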

Extract tokens from grammar

I have been working through the Advent of Code problems in Perl 6 this year and was attempting to use a grammar to parse Day 3's input.
Given input in this form: #1 @ 1,3: 4x4 and this grammar that I created:
grammar Claim {
    token TOP {
        '#' <id> \s* '@' \s* <coordinates> ':' \s* <dimensions>
    }
    token digits {
        <digit>+
    }
    token id {
        <digits>
    }
    token coordinates {
        <digits> ',' <digits>
    }
    token dimensions {
        <digits> 'x' <digits>
    }
}
say Claim.parse('#1 @ 1,3: 4x4');
I am interested in extracting the actual tokens that were matched i.e. id, x + y from coordinates, and height + width from the dimensions from the resulting parse. I understand that I can pull them from the resulting Match object of Claim.parse(<input>), but I have to dig down through each grammar production to get the value I need e.g.
say $match<id>.hash<digits>.<digit>;
this seems a little messy, is there a better way?
For the particular challenge you're solving, using a grammar is like using a sledgehammer to crack a nut.
Like @Scimon says, a single regex would be fine. You can keep it nicely readable by laying it out appropriately. You can name the captures and keep them all at the top level:
/ ^
    '#' $<id>=(\d+) ' '
    '@ ' $<x>=(\d+) ',' $<y>=(\d+)
    ': ' $<w>=(\d+) x $<d>=(\d+)
  $
/;
say ~$<id x y w d>; # 1 1 3 4 4
(The prefix ~ calls .Str on the value on its right hand side. Called on a Match object it stringifies to the matched strings.)
With that out of the way, your question remains perfectly cromulent as it is, because it's important to know how P6 scales in this regard, from simple regexes like the one above to the largest and most complex parsing tasks. So that's what the rest of this answer covers, using your example as the starting point.
Digging less messily
say $match<id>.hash<digits>.<digit>; # [「1」]
this seems a little messy, is there a better way?
Your say includes unnecessary code and output nesting. You could just simplify to something like:
say ~$match<id> # 1
Digging a little deeper less messily
I am interested in extracting the actual tokens that were matched i.e. id, x + y from coordinates, and height + width from the dimensions from the resulting parse.
For matches of multiple tokens you no longer have the luxury of relying on Perl 6 guessing which one you mean. (When there's only one, guess which one it guesses you mean. :))
One way to write your say to get the y coordinate:
say ~$match<coordinates><digits>[1] # 3
If you want to drop the <digits> you can mark which parts of a pattern should be stored in a list of numbered captures. One way to do so is to put parentheses around those parts:
token coordinates { (<digits>) ',' (<digits>) }
Now you've eliminated the need to mention <digits>:
say ~$match<coordinates>[1] # 3
You could also name the new parenthesized captures:
token coordinates { $<x>=(<digits>) ',' $<y>=(<digits>) }
say ~$match<coordinates><y> # 3
Pre-digging
I have to dig down through each grammar production to get the value I need
The above techniques still all dig down into the automatically generated parse tree which by default precisely corresponds to the tree implicit in the grammar's hierarchy of rule calls. The above techniques just make the way you dig into it seem a little shallower.
Another step is to do the digging work as part of the parsing process so that the say is simple.
You could inline some code right into the TOP token to store just the interesting data you've made. Just insert a {...} block in the appropriate spot (for this sort of thing that means the end of the token given that you need the token pattern to have already done its matching work):
my $made;
grammar Claim {
    token TOP {
        '#' <id> \s* '@' \s* <coordinates> ':' \s* <dimensions>
        { $made = ~($<id>, $<coordinates><x y>, $<dimensions><digits>[0,1]) }
    }
    ...
Now you can write just:
say $made # 1 1 3 4 4
This illustrates that you can just write arbitrary code at any point in any rule -- something that's not possible with most parsing formalisms and their related tools -- and the code can access the parse state as it is at that point.
Pre-digging less messily
Inlining code is quick and dirty. So is using a variable.
The normal thing to do for storing data is to instead use the make function. This hangs data off the match object that's being constructed corresponding to a given rule. It can then be retrieved using the .made method. So instead of $made = you'd have:
{ make ~($<id>, $<coordinates><x y>, $<dimensions><digits>[0,1]) }
And now you can write:
say $match.made # 1 1 3 4 4
That's much tidier. But there's more.
A sparse subtree of a parse tree
.oO ( 🎶 On the first day of an imagined 2019 Perl 6 Christmas Advent calendar 🎶 a StackOverflow title said to me ... )
In the above example I constructed a .made payload for just the TOP node. For larger grammars it's common to form a sparse subtree (a term I coined for this because I couldn't find a standard existing term).
This sparse subtree consists of the .made payload for the TOP that's a data structure referring to .made payloads of lower level rules which in turn refer to lower level rules and so on, skipping uninteresting intermediate rules.
The canonical use case for this is to form an Abstract Syntax Tree after parsing some programming code.
In fact there's an alias for .made, namely .ast:
say $match.ast # 1 1 3 4 4
While this is trivial to use, it's also fully general. P6 uses a P6 grammar to parse P6 code -- and then builds an AST using this mechanism.
Making it all elegant
For maintainability and reusability, you can and typically should avoid inserting code inline at the end of rules, and should instead use Action objects.
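For example, a sketch reusing the grammar above (ClaimActions is a name invented here, and it assumes the <x>/<y> captures added to coordinates earlier):

class ClaimActions {
    method TOP ($/) {
        make ~($<id>, $<coordinates><x y>, $<dimensions><digits>[0,1])
    }
}

say Claim.parse('#1 @ 1,3: 4x4', :actions(ClaimActions)).made; # 1 1 3 4 4

The grammar stays reusable; pass a different actions class to .parse to get different behavior from the same parse.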
In summary
There are a range of general mechanisms that scale from simple to complex scenarios and can be combined as best fits any given use case.
Add parentheses as I explained above, naming the capture that those parentheses zero in on, if that is a nice simplification for digging into the parse tree.
Inline any action you wish to take during parsing of a rule. You get full access to the parse state at that point. This is great for making it easy to extract just the data you want from a parse because you can use the make convenience function. And you can abstract all actions that are to be taken at the end of successfully matching rules out of a grammar, ensuring this is a clean solution code-wise and that a single grammar remains reusable for multiple actions.
One final thing. You may wish to prune the parse tree to omit unnecessary leaf detail (to reduce memory consumption and/or simplify parse tree displays). To do so, write <.foo>, with a dot preceding the rule name, to switch the default automatic capturing off for that rule.
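For example, writing
token id { <.digits> }
matches exactly the same text, but stores no <digits> node under <id> in the parse tree.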
You can refer to each of your named portions directly. So to get the coordinates you can access:
say $match.<coordinates>.<digits>
This will return the Array of digits matches. If you just want the values, the easiest way is probably:
say $match.<coordinates>.<digits>.map( *.Int )
or
say $match.<coordinates>.<digits>>>.Int
or even
say $match.<coordinates>.<digits>».Int
to cast them to Ints.
For the id field it's even easier: you can just cast the <id> match to an Int:
say $match.<id>.Int

How to remove string that is no longer needed in MySQL database?

This seemed like it should be very simple to do yet I've not been able to find an answer after weeks of looking.
I'm trying to remove strings that are no longer needed. Regex_replace sounds perfect but is not available in MySQL.
In MySQL how would I accomplish changing this:
[quote=ABC;xxxxxx]
to this:
[quote=ABC]
The issues are:
- this can appear anywhere in a text blob
- the xxxxxx can only be numeric but may be 6, 7 or 8 characters long
- not adding/removing any rows, just rewriting the contents of one column on one row at a time.
Thanks.
I don't think you really need REGEX_Replace (though it would make things easier of course).
Assuming that the example you presented is a real reflection of what you have:
Your starting point is the string [quote=<something>;, meaning that you can start by searching for [quote=,
Once you found it, you need to search for ; and after that for ],
Once you found them both, you know what to extract and where to start the next search (if the pattern you mentioned can appear more than once within a single blob).
Did I get you correctly?
EDIT
This paradigm is aimed at converting all instances of [quote=ABC;xxxxxx] to [quote=ABC] under the following assumptions:
The pattern can appear any number of times within the input string,
The length of xxxxxx is not fixed,
The resulting string (after removing all the appearances of ;xxxxxx) should replace the value in the table,
Performance is not an issue since either this is going to be a one-time job (through the whole table) or it will run every time on a single string (e.g. before INSERTing a new record).
Some MySQL functions that will be used:
INSTR: Searches within a string for the first appearance of a sub-string and returns the position (offset) where the sub-string was found,
SUBSTR: Returns a substring from a string (several ways to use it),
CONCAT: Concatenates two or more strings.
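For a quick feel of these (a sketch; outputs shown as comments):

SELECT INSTR('[quote=ABC;123456]', ';');    -- 11
SELECT SUBSTR('[quote=ABC;123456]', 8, 3);  -- 'ABC'
SELECT CONCAT('[quote=', 'ABC', ']');       -- '[quote=ABC]'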
The guidelines presented here apply for the manipulation of a single INPUT string. If this needs to be used over, say, a whole table, simply get the strings into a CURSOR and loop.
Here are the steps:
Declare five INT local variables to serve as indices and the total input string length, say l_Start, l_UpTo, l_Total_Length, l_temp_1 and l_temp_2, setting the initial values l_Start = 1 and l_Total_Length = LENGTH(INPUT_String),
Declare a string variable into which you will copy the "cleaned" result and initiate it as '', say l_Output_str; also declare a temporary string to hold the value of 'ABC', say l_Quote,
Start an infinite loop (you will set the exit condition within it; see below),
Exit loop if l_Start >= l_Total_Length (here is one of the two exit points from the loop),
Find the first location of '[quote=' within the input string starting from l_Start,
If the returned value is 0 (i.e. substring not found), concatenate the current contents of l_Output_str with whatever remains of the input string from position l_Start (e.g. SET l_Output_str = CONCAT(l_Output_str, SUBSTR(INPUT_String, l_Start));) and exit loop (second exit position),
Search the input string for the ; symbol starting from l_Start + 7 (i.e. the length of [quote=) and save the value in l_temp_1,
Search the input string for the ] symbol starting from l_temp_1 + 1 and save the value in l_temp_2,
Add the found result to the output string as SET l_Output_str = CONCAT(l_Output_str, '[quote=', SUBSTR(INPUT_String, l_Start + 7, l_temp_2 - l_temp_1), ']');,
Set l_Start = l_Start + 7 + l_temp_2 + 1;
End of loop.
Notes:
As I neither made the code nor tested it, it is possible that I'm not setting indices correctly; you will need to perform detailed tests to get it working as needed;
The above IS the method I suggested;
If the input string is very long (many MBs), you might observe poor performance (i.e. it might take a few seconds to complete) because of the concatenations. There are some steps that can be taken to improve performance, but let's have this working first and then, if needed, tackle the performance issues.
Hope that the above is clear and comprehensive.
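To make the above concrete, here is a rough, untested sketch of the whole loop as a MySQL stored function, with the index arithmetic adjusted (function and variable names invented here; LOCATE is used rather than INSTR because it accepts a start offset):

DELIMITER //

CREATE FUNCTION clean_quotes(input_str MEDIUMTEXT)
RETURNS MEDIUMTEXT
DETERMINISTIC
BEGIN
    DECLARE l_start  INT DEFAULT 1;
    DECLARE l_open   INT;  -- position of '[quote='
    DECLARE l_semi   INT;  -- position of ';'
    DECLARE l_close  INT;  -- position of ']'
    DECLARE l_output MEDIUMTEXT DEFAULT '';

    main_loop: LOOP
        SET l_open = LOCATE('[quote=', input_str, l_start);
        IF l_open = 0 THEN
            -- no more patterns: copy whatever remains and stop
            SET l_output = CONCAT(l_output, SUBSTR(input_str, l_start));
            LEAVE main_loop;
        END IF;
        SET l_semi  = LOCATE(';', input_str, l_open + 7);
        SET l_close = LOCATE(']', input_str, l_open + 7);
        IF l_close = 0 THEN
            -- malformed (no closing bracket): copy the rest and stop
            SET l_output = CONCAT(l_output, SUBSTR(input_str, l_start));
            LEAVE main_loop;
        END IF;
        IF l_semi = 0 OR l_semi > l_close THEN
            -- no ';xxxxxx' inside this [quote=...]: copy it through ']' as is
            SET l_output = CONCAT(l_output, SUBSTR(input_str, l_start, l_close - l_start + 1));
        ELSE
            -- copy up to the ';', close the bracket, and skip ';xxxxxx'
            SET l_output = CONCAT(l_output, SUBSTR(input_str, l_start, l_semi - l_start), ']');
        END IF;
        SET l_start = l_close + 1;
    END LOOP;

    RETURN l_output;
END //

DELIMITER ;

It could then be applied in place with something like UPDATE posts SET body = clean_quotes(body); (table and column names invented here).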

Get Unique String from a longer string when the unique string is at 2 different locations

In a web application that I am creating tests for, there are 2 sets of strings from which I wish to get a substring (which is unique) to use for identifying that element on the Web Page:
Parent Form:
InputText-eLeType-AQAAAAAAAAAAAAAAAAAAAVWZ-bMs-bms_9999999_3512-bMs-obj-bMsDot-com-bMsDot-bmssolutions-bMsDot-COMPONENT-bMsDot-bms_9999999_109-bMs-textField-bMs-ABNylGGXXu8IPwjI4jMM5y1K
SubForm:
InputText-eLeType-AQAAAAAAAAAAAAAAAAAAAVXJ-bMs-bms_FK_9999999_406_ID-bMs-obj-bMsDot-com-bMsDot-bmssolutions-bMsDot-COMPONENT-bMsDot-bms_9999999_177-bMs-searchLookupField-bMs-ABNylGGXXu8IPwjI4jMM5y1K-bMs-AQAAAAAAAAAAAAAAAAAAAVWZ-bMs-PRIMARY9999999_480-bMs-obj-bMsDot-com-bMsDot-bmssolutions-bMsDot-COMPONENT-bMsDot-bms_9999999_109
I wish to get the substring from both of these using a single function, so that I don't have to create a different function for each type I encounter.
The substring in the above 2 provided strings is:
ABNylGGXXu8IPwjI4jMM5y1K
This substring can change for each element on the web page, but is unique for each element of the page, and so is useful for identifying it.
I cannot use the full string, as it changes for each environment or if I generate a new environment to host the web pages (the complete string depends on the Meta Data).
We tried doing it for the Parent Form, by using the "-" as the delimiter and identifying the last -bMs- and then taking the string, but that does not work for the SubForm.
So, my main question is: is there some RegEx that can be created to extract only that string (composed of letters [upper & lower case] and numbers) from the full string? Or some other simpler way to identify that string?
You could try a combination of positive Lookbehind, [A-Z] and [a-z]. Try this code:
(?<=-bMs-)[A-Z]{3}[a-z]\w+
Demo: https://regex101.com/r/YUZiFa/1
It seems to work even without the positive Lookbehind:
[A-Z]{3}[a-z]\w+
Demo: https://regex101.com/r/YUZiFa/2
If you're happy to base the selection of the elements on the previous one, then this might work for you:
(?<=searchLookupField-bMs-|textField-bMs-)\w+
Example
And if you wanted to be extra certain, you could append a second lookahead to the end.
(?<=searchLookupField-bMs-|textField-bMs-)\w+(?=-bMs-|$)
Example
If these don't work, or if the whole string varies greatly, then some more examples would help us narrow it down and come up with a great answer!
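If your tests happen to be in Java (an assumption; the question doesn't name a language), applying that last pattern would look something like this:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExtractUniqueId {
    public static void main(String[] args) {
        String parentForm = "InputText-eLeType-AQAAAAAAAAAAAAAAAAAAAVWZ-bMs-bms_9999999_3512-bMs-obj-bMsDot-com-bMsDot-bmssolutions-bMsDot-COMPONENT-bMsDot-bms_9999999_109-bMs-textField-bMs-ABNylGGXXu8IPwjI4jMM5y1K";
        // The lookbehind anchors on the field-type label; \w+ then grabs
        // the unique identifier that follows it.
        Pattern p = Pattern.compile("(?<=searchLookupField-bMs-|textField-bMs-)\\w+");
        Matcher m = p.matcher(parentForm);
        if (m.find()) {
            System.out.println(m.group()); // ABNylGGXXu8IPwjI4jMM5y1K
        }
    }
}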

COBOL code to replace characters by html entities

I want to replace the characters '<' and '>' by &lt; and &gt; with COBOL. I was wondering about the INSPECT statement, but it looks like that statement can only translate one char to another. My intention is to replace all HTML characters by their HTML entities.
Can anyone figure out some way to do it? Maybe looping over the string and testing each char is the only way?
GnuCOBOL or IBM COBOL examples are welcome.
My best code is something like this (http://ideone.com/MKiAc6):
IDENTIFICATION DIVISION.
PROGRAM-ID. HTMLSECURE.
ENVIRONMENT DIVISION.
DATA DIVISION.
WORKING-STORAGE SECTION.
77 INPTXT PIC X(50).
77 OUTTXT PIC X(500).
77 I      PIC 9(4) COMP VALUE 1.
77 P      PIC 9(4) COMP VALUE 1.
PROCEDURE DIVISION.
    MOVE 1 TO P
    MOVE '<SCRIPT> TEST TEST </SCRIPT>' TO INPTXT
    PERFORM VARYING I FROM 1 BY 1
            UNTIL I > LENGTH OF INPTXT
        EVALUATE INPTXT(I:1)
            WHEN '<'
                MOVE "&lt;" TO OUTTXT(P:4)
                ADD 4 TO P
            WHEN '>'
                MOVE "&gt;" TO OUTTXT(P:4)
                ADD 4 TO P
            WHEN OTHER
                MOVE INPTXT(I:1) TO OUTTXT(P:1)
                ADD 1 TO P
        END-EVALUATE
    END-PERFORM
    DISPLAY OUTTXT
    STOP RUN
    .
GnuCOBOL (yes, another name branding change) has an intrinsic function extension, FUNCTION SUBSTITUTE.
move function substitute(inptxt, ">", "&gt;", "<", "&lt;") to where-ever-including-inptxt
Takes a subject string, and pairs of patterns and replacements. (These are not regex patterns, just straight-up text matching.) See http://opencobol.add1tocobol.com/gnucobol/#function-substitute for some more details. The patterns and replacements can all be different lengths.
As intrinsic functions return anonymous COBOL fields, the result of the function can be used to overwrite the subject field, without worry of sliding overlap or other "change while reading" problems.
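For a self-contained feel, a minimal GnuCOBOL sketch along the lines of the question's program (untested; the program name is invented here):

IDENTIFICATION DIVISION.
PROGRAM-ID. HTMLSUB.
DATA DIVISION.
WORKING-STORAGE SECTION.
77 INPTXT PIC X(50) VALUE '<SCRIPT> TEST TEST </SCRIPT>'.
77 OUTTXT PIC X(500).
PROCEDURE DIVISION.
*> pattern/replacement pairs may differ in length, which is
*> exactly what INSPECT ... REPLACING cannot handle
    MOVE FUNCTION SUBSTITUTE(FUNCTION TRIM(INPTXT TRAILING),
         "<", "&lt;", ">", "&gt;") TO OUTTXT
    DISPLAY FUNCTION TRIM(OUTTXT TRAILING)
    STOP RUN
    .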
COBOL is a language of fixed-length fields. So no, INSPECT is not going to be able to do what you want.
If you need this for an IBM Mainframe, your SORT product (assuming sufficiently up-to-date) can do this using FINDREP.
If you look at the XML processing possibilities in Enterprise COBOL, you will see that they do exactly what you want (I'd guess). GnuCOBOL can also readily interface with lots of other things. If you are writing GnuCOBOL for running on a non-Mainframe, I'd suggest you ask on the GnuCOBOL part of SourceForge.
Otherwise, yes, it would come down to looping through the data. Once you clarify what you want a bit more, you may get examples of that if you still need them.