How to add back comments/whitespaces in translator using the Antlr4's visitor model - mysql

I'm currently writing a TSQL (Sybase/Microsoft SQL) to MySQL translator using the ANTLR4 visitor approach.
I'm able to push comments and whitespaces to different channels so that I can use that information later.
What's not super clear is:
1. How do I get the data back?
2. And, more importantly, how do I plug the comments and whitespaces back into my translated MySQL code?
Re: #1, this seems to work to get the list of all tokens including the comments/whitespaces:
public static List<Token> getHiddenTokensFromString(String sqlIn, int hiddenChannel) {
    CharStream charStream = CharStreams.fromString(sqlIn);
    CaseChangingCharStream upper = new CaseChangingCharStream(charStream, true);
    TSqlLexer lexer = new TSqlLexer(upper);
    CommonTokenStream commonTokenStream = new CommonTokenStream(lexer, hiddenChannel);
    commonTokenStream.fill();
    List<Token> hiddenTokens = commonTokenStream.getTokens();
    return hiddenTokens;
}
Re #2, what makes it particularly challenging is that as part of the translation, lines of SQL have to be moved around, some lines removed and some lines added.
Any help will be greatly appreciated.
Thanks.

The ANTLR4 lexer creates a number of tokens, each with an index (a running number). Provided you didn't just skip a token, all tokens are available for later inspection, once the parsing step is done, regardless of their channels (the channel is actually just a number property on a token).
So, given you have a token you want to translate, get its index and then ask the token stream for the tokens with the next smaller index or next higher index. These are usually the hidden whitespaces.
Once you have the whitespace token use its start and stop index to get the original text from the char stream. And since you know where you are in the translation process when you do that, it should be easy to know where to insert the original text.
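The lookup described above can be sketched without ANTLR at all. The following is only an illustration in Python: the regex tokenizer, the Token class, and the helper names hidden_to_left and text_of are made up for this sketch and are not part of the ANTLR runtime. In the Java runtime, BufferedTokenStream offers getHiddenTokensToLeft and getHiddenTokensToRight for the same purpose.

```python
import re
from dataclasses import dataclass

DEFAULT_CHANNEL = 0
HIDDEN_CHANNEL = 1

# Hypothetical toy lexer: whitespace and SQL comments go to the hidden
# channel, words and single punctuation characters to the default channel.
TOKEN_RE = re.compile(
    r"(?P<hidden>\s+|--[^\n]*|/\*.*?\*/)|(?P<visible>\w+|[^\w\s])",
    re.DOTALL,
)

@dataclass
class Token:
    index: int    # running number, as in ANTLR
    channel: int  # just a number property on the token
    start: int    # first character offset (inclusive)
    stop: int     # last character offset (inclusive)

def tokenize(sql):
    tokens = []
    for index, m in enumerate(TOKEN_RE.finditer(sql)):
        channel = HIDDEN_CHANNEL if m.lastgroup == "hidden" else DEFAULT_CHANNEL
        tokens.append(Token(index, channel, m.start(), m.end() - 1))
    return tokens

def hidden_to_left(tokens, index):
    """Return the run of hidden tokens directly before tokens[index]."""
    i = index - 1
    while i >= 0 and tokens[i].channel != DEFAULT_CHANNEL:
        i -= 1
    return tokens[i + 1:index]

def text_of(sql, token):
    """Recover the original text from the token's start and stop offsets."""
    return sql[token.start:token.stop + 1]

sql = "-- pick rows\nSELECT a /* col */ FROM t"
tokens = tokenize(sql)
# Comments and whitespace that preceded the token at index 8 in the source:
leading = [text_of(sql, t) for t in hidden_to_left(tokens, 8)]
```

While emitting the translated statement for a token, the visitor can look up this leading text and write it out first, which keeps comments attached to the construct they preceded even when lines are reordered.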

Related

Univocity parser - false delimiter autodetection when too little information given

I set the parser to detect the delimiters automatically
CsvParserSettings settings = new CsvParserSettings();
settings.detectFormatAutomatically();
I have only 1 single record : 47W2E2qxPs, http://usda.gov/mattis.html
What I got :
code: 47W2E2qxPshttp url: //usda.gov/mattis.html
I expected the delimiter to be , and not :
so my expected result would be 47W2E2qxPs and http://usda.gov/mattis.html .
Could I fix it in an elegant way?
Author of the library here. The detection process is a heuristic that uses statistics collected from multiple rows of part of your input. Therefore it depends a lot on the size of the input.
Its purpose is to handle situations where you can't easily determine what is the CSV format - such as when users upload random files to you. Don't use the detection process if you already know what is the correct delimiter.
In your case, one row of data is absolutely not enough to reliably detect the delimiter, especially if there are multiple symbols present. There is little you can do about it except for testing what was the detected delimiter before continuing:
parser.beginParsing(new File("/path/to/your.csv"));
CsvFormat format = parser.getDetectedFormat();
//check if the format is sane.
The next version (2.6.0) will include more options to assist the heuristic such as providing a set of allowed characters to be used as delimiters - which will probably help in your case.
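The "check if the format is sane" step can be sketched as follows. This is not univocity code: it is a Python illustration using the standard csv module, and the allowed set, the default, and the function names are assumptions made for the example. The detected delimiter is passed in as a plain string standing in for whatever the detector reported.

```python
import csv
import io

# Assumption: the delimiters we are willing to accept from the detector.
ALLOWED_DELIMITERS = {",", ";", "\t", "|"}

def choose_delimiter(detected, allowed=ALLOWED_DELIMITERS, default=","):
    """Keep the detected delimiter only if it is one we consider sane."""
    return detected if detected in allowed else default

def parse_line(line, detected):
    """Parse one record, overriding an implausible detected delimiter."""
    delimiter = choose_delimiter(detected)
    reader = csv.reader(io.StringIO(line), delimiter=delimiter,
                        skipinitialspace=True)
    return next(reader)

# The detector guessed ":" from a single row; the check falls back to ",".
fields = parse_line("47W2E2qxPs, http://usda.gov/mattis.html", ":")
```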

SSIS Derived Column - Parse Text between break returns

I have a text field from a SQL Server Source. It is a phone number field that typically has this format:
Home: 555-555-1212
Work: 555-555-1212
Cell: 555-555-1212
Emergency: 555-555-1212
I'm trying to split among fields so that only 555-555-1212 is displayed
I am then taking this field and converting to a string. There are literally break returns (\r\n) between the labels here. The goal here is to have this data split among multiple fields (home,work,cell,emergency,etc.) I was researching how to split text among fields and I made some progress. In the case of home numbers, I used this logic:
SUBSTRING(Phone_converted,FINDSTRING(Phone_converted,"Home:",1) + 5,FINDSTRING(Phone_converted,"\n",1) - FINDSTRING(Phone_converted,"Home:",1) - 5)
This works great as it parses up to the text return and I get 555-555-1212.
Now I experience an issue when searching for a text between break returns. I tried the same logic for Work numbers:
SUBSTRING(Phone_converted,FINDSTRING(Phone_converted,"Work:",1) + 5,FINDSTRING(Phone_converted,"\n",1) - FINDSTRING(Phone_converted,"Work:",1) - 5)
But that won't work and results in writing to my error redirection file. I then tried to insert a break return to find the text at the beginning
SUBSTRING(Phone_converted,FINDSTRING(Phone_converted,"\nWork:",1) + 5,FINDSTRING(Phone_converted,"\n",1) - FINDSTRING(Phone_converted,"\nWork:",1) - 5)
No luck there either. Any ideas on how I can address this? Also, I would appreciate an idea of how I can handle the emergency title at the end. There won't be a break return in that situation, but I still want to parse the text.
I look at your data and I see
Home:|555-555-1212|Work:|555-555-1212|Cell:|555-555-1212|Emergency:|555-555-1212
I'm using the pipe character, |, as a placeholder for where I would segment that string, which is basically wherever you have whitespace (space, tab, newline, etc).
There are two approaches to this. I'll start with the easy one.
Script Component
String.Split is your friend here. Look at what it did with that source data
I added a new Script Component, acting as a Transformation and created 4 output columns, all string of length 12 codepage 1252: Home, Work, Cell, and Emergency. I populate them like so
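public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // Split() with no arguments splits on every whitespace character,
    // so each \r\n pair leaves one empty entry between the lines.
    string[] split = Row.PhoneData.Split();
    Row.Home = split[1];
    Row.Work = split[4];
    Row.Cell = split[7];
    Row.Emergency = split[10];
}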
Derived Column
I'm not going to build out a full-blown implementation of this. The above is much simpler, but I run into situations where ETL devs say they aren't allowed to use Script tasks/components, and that's usually because people reached for them first instead of last.
The approach here is to have lots of Derived Columns Components on your Data Flow. It won't hurt performance and in fact can make it easier. It definitely will make your debugging easier as you'll have lots of it to do.
DER Find Colons
This would add 4 columns into the data flow: HomeColonPosition, WorkColonPosition, etc. You've already started down this path, but build it out into the actual data flow, as you'll need to reference these positions, and it's easier to fix the calculation that populates one column than a calculation that's wrong and used everywhere. Note that the third argument to FINDSTRING is the occurrence to find, not a starting position, which is also why FINDSTRING(Phone_converted,"\n",1) always lands on the first line break.
Thus, with Home's colon being
FINDSTRING(PhoneData, ":", 1)
Work's would just be
FINDSTRING(PhoneData, ":", 2)
Just knowing the position of the 4 colons in that string, I can figure out where the phone numbers are (maybe). The position of the colon + 2 (colon and the space) is the starting point and then go out 12 characters.
Where this approach gets ugly, much as it did with the script approach is when that data isn't consistent.
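To make the colon arithmetic concrete, here is a rough Python sketch of the same idea, not SSIS expression syntax. Python's str.find takes a start offset, so each search resumes just past the previous colon. The label order, the 12-character number length, and the function name are assumptions taken from the sample data.

```python
# Assumption: labels always appear in this order, one colon per label.
LABELS = ["Home", "Work", "Cell", "Emergency"]
NUMBER_LENGTH = 12  # length of "555-555-1212"

def split_phones(phone_data):
    """Locate each colon in turn and take the 12 characters after ': '."""
    result = {}
    search_from = 0
    for label in LABELS:
        colon = phone_data.find(":", search_from)
        if colon == -1:
            result[label] = None  # fewer colons than labels
            continue
        start = colon + 2  # skip the colon and the space after it
        result[label] = phone_data[start:start + NUMBER_LENGTH]
        search_from = colon + 1
    return result

sample = ("Home: 555-555-1212\r\nWork: 555-555-1313\r\n"
          "Cell: 555-555-1414\r\nEmergency: 555-555-1515")
phones = split_phones(sample)
```

Because the number is taken by fixed length rather than by searching for the next line break, the last entry needs no special handling. The same fixed-length assumption is what breaks first when the data is inconsistent.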

Redacted comments in MS's source code for .NET [duplicate]

The Reference Source page for stringbuilder.cs has this comment in the ToString method:
if (chunk.m_ChunkLength > 0)
{
// Copy these into local variables so that they
// are stable even in the presence of ----s (hackers might do this)
char[] sourceArray = chunk.m_ChunkChars;
int chunkOffset = chunk.m_ChunkOffset;
int chunkLength = chunk.m_ChunkLength;
What does this mean? Is ----s something a malicious user might insert into a string to be formatted?
The source code for the published Reference Source is pushed through a filter that removes objectionable content. Verboten words are one category, since Microsoft programmers do use profanity in their comments. Names of devs are another, since Microsoft wants to hide their identity. Such a word or name is replaced by dashes.
In this case you can tell what used to be there from the CoreCLR, the open-sourced version of the .NET Framework. It is a verboten word:
// Copy these into local variables so that they are stable even in the presence of race conditions
That comment was hand-edited from the original you looked at before being submitted to GitHub; Microsoft also doesn't want to accuse its customers of being hackers. It originally said races, which the filter turned into ----s :)
In the CoreCLR repository you have a fuller quote:
Copy these into local variables so that they are stable even in the presence of race conditions
Github
Basically: it's a threading consideration.
In addition to the great answer by @Jeroen, this is more than just a threading consideration. It's to prevent someone from intentionally creating a race condition and causing a buffer overflow in that manner. Later in the code, the length of that local variable is checked. If the code were to check the length of the accessible variable instead, it could have changed on a different thread between the time length was checked and wstrcpy was called:
// Check that we will not overrun our boundaries.
if ((uint)(chunkLength + chunkOffset) <= ret.Length && (uint)chunkLength <= (uint)sourceArray.Length)
{
///
/// imagine that another thread has changed the chunk.m_ChunkChars array here!
/// we're now in big trouble, our attempt to prevent a buffer overflow has been thwarted!
/// oh wait, we're ok, because we're using a local variable that the other thread can't access anyway.
fixed (char* sourcePtr = sourceArray)
string.wstrcpy(destinationPtr + chunkOffset, sourcePtr, chunkLength);
}
else
{
throw new ArgumentOutOfRangeException("chunkLength", Environment.GetResourceString("ArgumentOutOfRange_Index"));
}
}
chunk = chunk.m_ChunkPrevious;
} while (chunk != null);
Really interesting question though.
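The copy-to-locals pattern can be shown in miniature. The sketch below is in Python, which is memory-safe, so nothing can actually overflow; it only illustrates why the bounds check and the copy must use the same snapshot. The Chunk class and the interfere callback (standing in for another thread) are hypothetical.

```python
class Chunk:
    """Hypothetical stand-in for a StringBuilder chunk."""
    def __init__(self, chars, length):
        self.chars = chars
        self.length = length

def copy_chunk(chunk, dest, interfere=None):
    # Snapshot the shared fields once; everything below uses only these.
    source = chunk.chars
    length = chunk.length

    # Bounds check against the snapshot.
    if length > len(source) or length > len(dest):
        raise IndexError("chunk does not fit")

    # Simulate another thread mutating the chunk between check and use.
    if interfere is not None:
        interfere(chunk)

    # The copy uses the values that were checked, so the mutation above
    # cannot invalidate the check.
    dest[:length] = source[:length]
    return dest

def shrink(chunk):
    chunk.chars = []
    chunk.length = 100

result = copy_chunk(Chunk(list("abcd"), 4), [None] * 4, interfere=shrink)
```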
Don't think that this is the case - the code in question copies to local variables to prevent bad things happening if the string builder instance is mutated on another thread.
I think the ---- may relate to a four letter swear word...

How to enumerate the keys and values of a record in AppleScript

When I use AppleScript to get the properties of an object, a record is returned.
tell application "iPhoto"
properties of album 1
end tell
==> {id:6.442450942E+9, url:"", name:"Events", class:album, type:smart album, parent:missing value, children:{}}
How can I iterate over the key/value pairs of the returned record so that I don't have to know exactly what keys are in the record?
To clarify the question, I need to enumerate the keys and values because I'd like to write a generic AppleScript routine to convert records and lists into JSON which can then be output by the script.
I know it's an old Q but there are possibilities to access the keys and the values now (10.9+). In 10.9 you need to use Scripting libraries to make this run, in 10.10 you can use the code right inside the Script Editor:
use framework "Foundation"
set testRecord to {a:"aaa", b:"bbb", c:"ccc"}
set objCDictionary to current application's NSDictionary's dictionaryWithDictionary:testRecord
set allKeys to objCDictionary's allKeys()
repeat with theKey in allKeys
log theKey as text
log (objCDictionary's valueForKey:theKey) as text
end repeat
This is no hack or workaround. It just uses the "new" ability to access Objective-C-Objects from AppleScript.
Found this Q while searching for other topics and couldn't resist answering ;-)
Update to deliver JSON functionality:
Of course we can dive deeper into the Foundation classes and use the NSJSONSerialization object:
use framework "Foundation"
set testRecord to {a:"aaa", b:"bbb", c:"ccc"}
set objCDictionary to current application's NSDictionary's dictionaryWithDictionary:testRecord
set {jsonDictionary, anError} to current application's NSJSONSerialization's dataWithJSONObject:objCDictionary options:(current application's NSJSONWritingPrettyPrinted) |error|:(reference)
if jsonDictionary is missing value then
log "An error occurred: " & anError as text
else
log (current application's NSString's alloc()'s initWithData:jsonDictionary encoding:(current application's NSUTF8StringEncoding)) as text
end if
Have fun, Michael / Hamburg
If you just want to iterate through the values of the record, you could do something like this:
tell application "iPhoto"
repeat with value in (properties of album 1) as list
log value
end repeat
end tell
But it's not very clear to me what you really want to achieve.
Basically, what AtomicToothbrush and foo said. AppleScript records are more like C structs, with a known list of labels, than like an associative array, with arbitrary keys, and there is no (decent) in-language way to introspect the labels on a record. (And even if there were, you’d still have the problem of applying them to get values.)
In most cases, the answer is “use an associative array library instead.” However, you’re specifically interested in the labels from a properties value, which means we need a hack. The usual one is to force an error using the record, and then parse the error message, something like this:
set myRecord to {a:1, b:2}
try
myRecord as string
on error e
-- e will be the string “Can’t make {a:1, b:2} into type string”
end try
Parsing this, and especially parsing this while allowing for non-English locales, is left as an exercise for the reader.
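For the simplest case, the parsing step might look like the sketch below. It is written in Python rather than AppleScript, assumes the message contains exactly one record literal in braces, and does not handle escaped quotes or piped labels such as |my label|. The function name is made up for the example. Taking everything between the first "{" and the last "}" avoids depending on the localized wording around the record.

```python
def record_labels(error_message):
    """Extract top-level labels from a message like
    "Can't make {a:1, b:2} into type string"."""
    start = error_message.index("{")
    end = error_message.rindex("}")
    body = error_message[start + 1:end]

    fields = []
    depth = 0          # nesting level of inner { } lists or records
    in_string = False  # inside a "quoted" value
    field_start = 0
    for i, ch in enumerate(body):
        if ch == '"':
            in_string = not in_string
        elif not in_string:
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
            elif ch == "," and depth == 0:
                # A top-level comma ends the current field.
                fields.append(body[field_start:i])
                field_start = i + 1
    fields.append(body[field_start:])

    # The label is whatever precedes the first colon of each field.
    return [f.split(":", 1)[0].strip() for f in fields if ":" in f]

labels = record_labels("Can’t make {a:1, b:2} into type string")
```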
ShooTerKo's answer is incredibly helpful to me.
I'll bring up another possibility I'm surprised I didn't see anyone else mention, though. I have to go between AppleScript and JSON a lot in my scripts, and if you can install software on the computers that need to run the script, then I highly recommend JSONHelper to basically make the whole problem go away:
https://github.com/isair/JSONHelper

A StringToken Parser which gives Google Search style "Did you mean:" Suggestions

Seeking a method to:
Take whitespace separated tokens in a String; return a suggested Word
ie:
Google Search can take "fonetic wrd nterpreterr",
and atop of the result page it shows "Did you mean: phonetic word interpreter"
A solution in any of the C* languages or Java would be preferred.
Are there any existing Open Libraries which perform such functionality?
Or is there a way to Utilise a Google API to request a suggested word?
In his article How to Write a Spelling Corrector, Peter Norvig discusses how a Google-like spellchecker could be implemented. The article contains a 20-line implementation in Python, as well as links to several reimplementations in C, C++, C# and Java. Here is an excerpt:
The full details of an
industrial-strength spell corrector
like Google's would be more confusing
than enlightening, but I figured that
on the plane flight home, in less than
a page of code, I could write a toy
spelling corrector that achieves 80 or
90% accuracy at a processing speed of
at least 10 words per second.
Using Norvig's code and this text as training set, I get the following results:
>>> import spellch
>>> [spellch.correct(w) for w in 'fonetic wrd nterpreterr'.split()]
['phonetic', 'word', 'interpreters']
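For readers who want to see the shape of the approach without the training text, here is a condensed sketch in the spirit of Norvig's article. It is not his exact code: the word counts come from a tiny hard-coded vocabulary chosen for this example, so the suggestions depend entirely on that toy list.

```python
from collections import Counter

LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings one delete, transpose, replace, or insert away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in LETTERS]
    inserts = [l + c + r for l, r in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def correct(word, counts):
    """Prefer the word itself, then known words at distance 1, then 2."""
    def known(words):
        return {w for w in words if w in counts}
    candidates = (
        known([word])
        or known(edits1(word))
        or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
        or {word}  # nothing known nearby: leave the word unchanged
    )
    # Among the candidates, pick the most frequent word.
    return max(candidates, key=lambda w: counts[w])

# Toy "training set"; a real one would be counted from a large corpus.
COUNTS = Counter("the phonetic word interpreter reads the word".split())

suggestion = [correct(w, COUNTS) for w in "fonetic wrd nterpreterr".split()]
```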
You can use the yahoo web service here:
http://developer.yahoo.com/search/web/V1/spellingSuggestion.html
However it's only a web service... (i.e. there are no APIs for other language etc..) but it outputs JSON or XML, so... pretty easy to adapt to any language...
You can also use the Google API's to spell check. There is an ASP implementation here (I'm not to credit for this, though).
First off:
Java
C++
C#
Use the one of your choice. I suspect it runs the query against a spell-checking engine with a word limit of exactly one, it then does nothing if the entire query is valid, otherwise it replaces each word with that word's best match. In other words, the following algorithm (an empty return string means that the query had no problems):
startup()
{
set the spelling engines word suggestion limit to 1
}
option 1()
{
int currentPosition = engine.NextWord(start the search at word 0, querystring);
if(currentPosition == -1)
return empty string; // Query is a-ok.
while(currentPosition != -1)
{
queryString = engine.ReplaceWord(engine.CurrentWord, queryString, the suggestion with index 0);
currentPosition = engine.NextWord(currentPosition, querystring);
}
return queryString;
}
Since no one has yet mentioned it, I'll give one more phrase to search for: "edit distance".
That can be used to find closest matches, assuming it's typos where letters are transposed, missing or added.
But usually this is also coupled with some sort of relevancy information; either by simple popularity (to assume most commonly used close-enough match is most likely correct word), or by contextual likelihood (words that follow preceding correct word, or come before one). This gets into information retrieval; one way to start is to look at bigram and trigrams (sequences of words seen together). Google has very extensive freely available data sets for these.
For a simple initial solution, though, a dictionary coupled with Levenshtein-based matchers works surprisingly well.
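As a rough sketch of such a matcher, the following computes plain Levenshtein distance with the usual two-row dynamic programme and picks the nearest dictionary entry. The function names and the sample dictionary are made up for the example. Note that plain Levenshtein counts a transposition as two edits.

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes, or replaces."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # match or replace
            ))
        prev = cur
    return prev[-1]

def closest(word, dictionary):
    """Dictionary entry with the smallest distance to word."""
    return min(dictionary, key=lambda w: levenshtein(word, w))

best = closest("wrd", ["phonetic", "word", "interpreter"])
```

This scans the whole dictionary for every query, which is fine for a few thousand words. Popularity or context, as described above, can then be used to break ties between equally close candidates.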
You could plug in Lucene, which has a dictionary facility implementing the Levenshtein distance method.
Here's an example from the Wiki, where 2 is the maximum number of suggestions to return.
String[] l=spellChecker.suggestSimilar("sevanty", 2);
//l[0] = "seventy"
http://wiki.apache.org/lucene-java/SpellChecker
An older link http://today.java.net/pub/a/today/2005/08/09/didyoumean.html
The Google SOAP Search APIs do that.
If you have a dictionary stored as a trie, there is a fairly straightforward way to find best-matching entries, where characters can be inserted, deleted, or replaced.
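/* pseudocode; t.isWord is an assumed flag marking the end of a dictionary word */
void match(trie t, char* w, string s, int budget){
    if (budget < 0) return;
    if (*w=='\0'){
        /* input used up: s is a match if a dictionary word ends here */
        if (t.isWord) print s;
    } else {
        foreach (char c, subtrie t1 in t){
            /* try matching *w, or replacing it with c */
            match(t1, w+1, s+c, (*w==c ? budget : budget-1));
        }
        /* try dropping *w, an extra character in the input */
        match(t, w+1, s, budget-1);
    }
    foreach (char c, subtrie t1 in t){
        /* try c as a character missing from the input */
        match(t1, w, s+c, budget-1);
    }
}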
The idea is that first you call it with a budget of zero, and see if it prints anything out. Then try a budget of 1, and so on, until it prints out some matches. The bigger the budget the longer it takes. You might want to only go up to a budget of 2.
Added: It's not too hard to extend this to handle common prefixes and suffixes. For example, English prefixes like "un", "anti" and "dis" can be in the dictionary, and can then link back to the top of the dictionary. For suffixes like "ism", "'s", and "ed" there can be a separate trie containing just the suffixes, and most words can link to that suffix trie. Then it can handle strange words like "antinationalizationalization".
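A runnable take on the budgeted trie walk, including the "start at zero and raise the budget" loop, might look like this in Python. It is a sketch under a few assumptions: the trie is a nested dict with an empty-string key marking the end of a word, matches are collected in a set rather than printed, and the names build_trie, match, and suggest are chosen for the example. Prefix and suffix linking is not included.

```python
END = ""  # assumed end-of-word marker key; never collides with a character

def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def match(node, w, s, budget, found):
    """Collect dictionary words reachable from node that match w within budget."""
    if budget < 0:
        return
    if not w:
        if END in node:
            found.add(s)  # input used up and a word ends here
    else:
        for c, child in node.items():
            if c != END:
                # match w[0], or replace it with c at a cost of 1
                match(child, w[1:], s + c, budget if w[0] == c else budget - 1, found)
        # drop w[0], an extra character in the input
        match(node, w[1:], s, budget - 1, found)
    for c, child in node.items():
        if c != END:
            # take c as a character missing from the input
            match(child, w, s + c, budget - 1, found)

def suggest(trie, word, max_budget=2):
    """Try budgets 0, 1, ... and return the first non-empty set of matches."""
    for budget in range(max_budget + 1):
        found = set()
        match(trie, word, "", budget, found)
        if found:
            return sorted(found)
    return []

trie = build_trie(["phonetic", "word", "ward", "interpreter"])
suggestions = suggest(trie, "wrd")
```

Collecting into a set matters because the same word can be reached through more than one sequence of edits.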