Drools: Merged cells in CSV

I have a set of Drools rules stored in an Excel document that for various reasons needs to be replaced with a .csv file. The problem is that .csv files don't support merged cells, making it difficult if not impossible to properly convert the rules.
After a lot of googling, I found references to using "..." to indicate merged cells, but no explicit examples on how to use it. Documentation found in the source code gives a few more hints, but is still too ambiguous; I've tried countless different interpretations of it without any success.
Any help would be appreciated.

We had the same issue. After reviewing the Drools source code (CsvParser and DefaultRuleSheetListener), I found the solution; hopefully this post saves you some time.
Specify "..." only in the object-type matching row, i.e. the row below the CONDITION/ACTION row, from the first cell of the merged region to the last. Note that for the continuation cells of a merged region you cannot use a bare "...": after normalizing and trimming, the parser treats such a cell as empty and silently ignores it. Put anything before the dots, such as a..., b..., etc.
Please also note that Drools uses a plain buffered reader, not a CSV reader, so it cannot handle a cell value that spans multiple lines, unless you plug in your own CsvParser backed by a real CSV reader.
Here is a simplified example.
CONDITION,CONDITION,CONDITION,ACTION,ACTION
$Client:Client(),$Product:Product()...,anythingButNotJust3Dots...,,
"clientType == ""$param""","planType == ""$param""","accountType == ""$param""","documents.add(""$param"");","documents.add(""$param"");"
INDIVIDUAL,RRSP,CASH,document1,document2
INDIVIDUAL,RESP,CASH,document2,
INDIVIDUAL,RIF,CASH,document3,
INDIVIDUAL,,MARGIN,document4,document6
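The continuation-cell rule above is easy to get wrong by hand. Here is a small Python sketch (expand_merged is my own helper, not part of Drools) that generates the object-type row cells from (value, span) pairs, emitting "value..." for the first column of a merged region and non-bare "..." fillers for the rest:

```python
def expand_merged(cells):
    """Expand (value, span) pairs into CSV cells for the Drools
    object-type row.  A cell spanning n > 1 columns becomes
    'value...' followed by n - 1 filler cells that must end in
    '...' but must NOT be a bare '...' (Drools trims those to
    empty and silently ignores them)."""
    out = []
    for value, span in cells:
        if span == 1:
            out.append(value)
        else:
            out.append(value + "...")
            # The filler text before the dots is arbitrary.
            out.extend("x%d..." % i for i in range(1, span))
    return out

row = expand_merged([("$Client:Client()", 1), ("$Product:Product()", 2)])
print(",".join(row))
# $Client:Client(),$Product:Product()...,x1...
```

Running this over every merged region of the header rows when exporting from Excel avoids the silent-empty-cell trap.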

Related

Univocity parser - false delimiter autodetection when too little information given

I set the parser to detect the delimiters automatically:
CsvParserSettings settings = new CsvParserSettings();
settings.detectFormatAutomatically();
I have only a single record: 47W2E2qxPs, http://usda.gov/mattis.html
What I got:
code: 47W2E2qxPshttp url: //usda.gov/mattis.html
I expected the delimiter to be "," and not ":", so my expected result would be 47W2E2qxPs and http://usda.gov/mattis.html.
Could I fix it in an elegant way?
Author of the library here. The detection process is a heuristic that uses statistics collected from multiple rows of part of your input. Therefore it depends a lot on the size of the input.
Its purpose is to handle situations where you can't easily determine what is the CSV format - such as when users upload random files to you. Don't use the detection process if you already know what is the correct delimiter.
In your case, one row of data is absolutely not enough to reliably detect the delimiter, especially when multiple candidate symbols are present. There is little you can do about it except test what the detected delimiter was before continuing:
parser.beginParsing(new File("/path/to/your.csv"));
CsvFormat format = parser.getDetectedFormat();
//check if the format is sane.
The next version (2.6.0) will include more options to assist the heuristic such as providing a set of allowed characters to be used as delimiters - which will probably help in your case.
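As an aside, the "allowed characters" idea can be illustrated with Python's stdlib csv.Sniffer, which uses a similar frequency-based heuristic: restricting the candidate delimiters lets even a one-record sample be detected reliably. This is an analogy only, not the univocity API:

```python
import csv

sample = "47W2E2qxPs, http://usda.gov/mattis.html"

# Restrict the candidates to characters we consider plausible
# delimiters; the heuristic then settles on "," despite the
# single record and the ":" inside the URL.
dialect = csv.Sniffer().sniff(sample, delimiters=",;")
print(dialect.delimiter)  # ","
```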

Extract aligned sections of FASTA to new file

I've already looked here and in other forums, but couldn't find the answer to my question. I want to design baits for a target enrichment sequencing approach, and I have the output of a MarkerMiner search for orthologous loci from four different genomes, with A. thaliana as a reference. These output alignments are separate FASTA files for each annotated A. thaliana gene, with the sequences from my datasets aligned to it.
I have already run a script to filter out those loci supported to be orthologous by at least two of my four input datasets.
However, now I'm stumped.
My alignments are gappy, since the input data is mostly RNA-seq whereas the reference contains the introns as well. So it looks like this:
AT01G1234567
ATCGATCGATGCGCGCTAGCTGAATCGATCGGATCGCGGTAGCTGGAGCTAGSTCGGATCGC
MyData1
CGATGCGCGC-----------CGGATCGCGG---------------CGGATCGC
MyData2
CGCTGCGCGC------------GGATAGCGG---------------CGGATCCC
To effectively design baits I now need to extract all the aligned parts from each file, so that I end up with separate files (or separate alignments within one file) for the parts that are aligned between MyData and the reference sequence, with all the gappy parts excluded. There are about 1300 of these FASTA files, so doing it manually is not an option.
I have a bit of programming experience in Python and with Linux command-line tools, but I am completely lost on how to go about this. I would appreciate a hint on what kinds of tools are out there that I could use, or what kind of algorithm I need to come up with.
Thank you.
Cheers
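In case it helps to get started: a stdlib-only Python sketch of one possible approach. Treat any column in which some sequence has a gap as unaligned, and cut the alignment at those columns. The helper names (read_fasta, aligned_blocks) are my own, not from an existing tool; for 1300 files, Biopython's AlignIO would be a more robust starting point.

```python
def read_fasta(text):
    """Parse FASTA text into an ordered list of (header, sequence)."""
    records, header, seq = [], None, []
    for line in text.strip().splitlines():
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(seq)))
            header, seq = line[1:].strip(), []
        else:
            seq.append(line.strip())
    if header is not None:
        records.append((header, "".join(seq)))
    return records

def aligned_blocks(records, min_len=1):
    """Yield (start, end) column ranges in which no sequence has a gap."""
    length = min(len(s) for _, s in records)
    start = None
    for i in range(length):
        gap = any(s[i] == "-" for _, s in records)
        if not gap and start is None:
            start = i                      # a gap-free block opens here
        elif gap and start is not None:
            if i - start >= min_len:
                yield (start, i)           # block closed by a gap column
            start = None
    if start is not None and length - start >= min_len:
        yield (start, length)              # block runs to the end
```

Each (start, end) range can then be sliced out of every sequence and written to its own FASTA file.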

Applying "Automatic" number formatting

Is it possible to apply the 'Automatic' number format programmatically through GAS? My issue is that as I write columns of numbers, Sheets seems to attempt to apply appropriate formatting, but gets it wrong sometimes. That is, particular small integers (1 sometimes) will be formatted as dates. The range is being written in one myRange.setValues() method and I can't see any pattern to the mistakes and therefore don't see any way to prevent the surprise mis-formatting.
But, when I select the range in sheets and just click "Automatic" on the number format menu all returns to normal. It doesn't help to click that upfront as the writing of data somehow resets the format.
Despite the long-winded intro, my question is very simple: how do I programmatically apply "Automatic" number formatting? I'm thinking this is very basic, especially since Google and searches here have been no help.
My current fallback solution is to use myRange.setNumberFormat("0") as the format for the whole range. This is not ideal as some numbers are very large and are easier to read in scientific notation. There are also some text strings in the range, but these format properly regardless of format applied. I also would prefer to avoid having to iterate through the data and test for values to determine the best format, when it's just a couple clicks in the user interface.
We can use .setNumberFormat('General');
Here is an example:
var spreadsheet = SpreadsheetApp.getActive();
spreadsheet.getRange("B:B").setNumberFormat("General");
I use copyFormatToRange to copy/apply Automatic format:
var ss = SpreadsheetApp.getActiveSpreadsheet();
var sheet = ss.getActiveSheet();
var source_cell = sheet.getRange("A1");//A1: cell having automatic format
source_cell.copyFormatToRange(sheet,1,1,2,2);//copy format of cell A1 to cell A2
You can also write a script that opens another spreadsheet and reads any cell that has the automatic format:
var ss = SpreadsheetApp.openById(SpreadsheetId);//Id of another spreadsheet
Then use copyFormatToRange on the cell you want.
I was having trouble finding anything documented, and tried pretty much everything suggested previously (null, 'General', the "magic" format of '0.###############', etc., etc.).
In my particular case, I had ranges previously set to strict plain text, which then got replaced with a checkbox data validation. Anytime the box was checked it was converted to the text "TRUE" instead of remaining a checkbox. 'General' and the "magic" format functionally worked fine, but did not actually set the format back explicitly to "Automatic".
I finally decided, why not just try this:
range.setNumberFormat('Automatic');
And it worked. This really should be documented, but at least a little bit of common sense led me to the answer regardless.
If you don't have dates in the range, the below solution appears to be the best available option (without resorting to an API-based solution):
myRange.setNumberFormat('0.###############');
A zero-point-15x'#' seems to be a 'magic' number format that will allow very large numbers to show as scientific notation and smaller integers and decimals to show in the 'standard' format pre-application of number formatting. This is also the format that is returned for cells that contain non-dates formatted with the 'Automatic' selection in the user interface.
Adding or removing even one # will 'break the spell' and cause very large numbers to display in non-scientific notation. I also tested changes before the decimal place, but leaving the 15x#:
Also functional: myRange.setNumberFormat('#,##0.###############');
So there is some flexibility for prefixes.
Non-functional: myRange.setNumberFormat('#.###############');
The 0 is evidently required.
And finally,
Non-functional: myRange.setNumberFormat('0.##############[red]');
This turns numbers red, but breaks the 'magic' formatting. So, it appears, no suffixes either.
Again, if you have dates in the range, this will not work as they will, not surprisingly, display as the underlying number. And potentially more problematic (but totally understandable), the only way to return them to date form is manually applying a date format, assuming you know which cells 'were' dates.
Complete replication of 'Automatic' number formatting requires traversing the range to find dates and apply desired date format, but otherwise applying the 'magic' format. (My original dataset was a mix of numbers and strings, so the simple approach given above works.)

F# csv provider with different column order

If I define a type
type MyType = CsvProvider<"schema.csv",
Schema="A->MyA=int, B->MyB=int">
And if i load csv's like
let csv1 = MyType.Load("file1.csv")
Suppose "file1.csv" contains all the columns that "schema.csv" has, but in a different order, and also has extra columns which do not appear in "schema.csv". Can I still load it, given that I am only interested in the columns specified in "schema.csv"?
Either you have a locked schema for the CSV files and use CsvProvider, or you don't.
You always have the option of "reverting" to CsvFile (CsvParser): http://fsharp.github.io/FSharp.Data/library/CsvFile.html
With the latter you can easily parse any CSV-file, confirm that it has the columns you want, and then read them as wanted.
I usually revert to CsvFile, since CSV files are often created in a somewhat unstructured and apparently ad-hoc way (at least in the cases I have encountered), and then CsvFile is a good solution, with somewhat more flexibility than CsvProvider. Yes, somewhat more code too, but still...
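For what it's worth, the access-by-header-name that CsvFile gives you is the general fallback in any language when column order may vary. A Python stdlib sketch, purely for illustration (the file contents below are made up):

```python
import csv
import io

# Columns arrive in a different order and with an extra column,
# but we only care about "A" and "B".
text = "B,Extra,A\n2,x,1\n4,y,3\n"

# DictReader keys each cell by header name, so column order and
# extra columns don't matter.
rows = list(csv.DictReader(io.StringIO(text)))
pairs = [(int(r["A"]), int(r["B"])) for r in rows]
# pairs == [(1, 2), (3, 4)]
```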
That use case is not supported. If the column order is different, things won't work. The whole CsvProvider is built on the assumption that the data you give it has the same structure as the sample you provided. You can always submit an issue here: https://github.com/fsharp/FSharp.Data/issues/

How to remove first row and last row without having identifier in Fixed length Flat file?

I have a fixed-length flat file with a header and footer, but those rows do not have any identifier like H for header, T for trailer, or D for data, or anything along those lines.
All my data starts at position 1, but the header and footer start at different positions in the row. I tried to use a conditional split, but I could not get the result I wanted.
Please help me find some logic to separate the header and footer from the data. I want to store the data in a SQL Server table, and the header/footer in a flat file for future reference.
As you have not included the exact file format in your question, let us assume the file looks like this:
This is the header line with some spaces at the beginning.
This, is, real, line 1
This, is, real, line 2
End of file. It has some spaces too at the beginning.
If this is a correct representation of your situation, you can use a flat file source, read each record in its entirety into one column, say EntireRow, and then use a conditional split to identify the header/footer using the following condition:
LEN(EntireRow) > LEN(TRIM(EntireRow))
The issue then is how to split out the meaningful rows (the real lines 1 and 2 in the example above). You can dump them into another file; this new file will be clean, with the header and footer removed. This is a simple, but not the most efficient, solution.
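The same test is easy to prototype outside SSIS before wiring up the package. A minimal Python sketch (split_header_footer is my own name; it checks leading whitespace only, per the description that real data starts at position 1):

```python
def split_header_footer(lines):
    """Mirror the SSIS conditional-split test: rows whose first
    character is whitespace are header/footer, the rest are data."""
    data = [ln for ln in lines if ln and not ln[0].isspace()]
    skipped = [ln for ln in lines if not ln or ln[0].isspace()]
    return data, skipped

sample = [
    "   This is the header line with some spaces at the beginning.",
    "This, is, real, line 1",
    "This, is, real, line 2",
    "   End of file. It has some spaces too at the beginning.",
]
data, skipped = split_header_footer(sample)
# data keeps the two real lines; skipped holds header and footer.
```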
Solution 2:
A better solution would be to use a Script Component. Learn about how to add a row dynamically. One such example: http://www.sqlis.com/post/The-Script-Component-as-a-Transformation.aspx
Now, to get you started, here is sample code that eliminates the header/footer using your condition that they do not start at position 1, with no error handling. The link above will tell you how to add the output and the columns within that output. Make sure the data types of your columns match what your source file has.
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // Keep only rows that start at position 1, i.e. rows with no
    // leading (or trailing) whitespace.
    if (Row.EntireRow.Length == Row.EntireRow.Trim().Length)
    {
        string[] sArrFields = Row.EntireRow.Split(',');
        MeaningfulBuffer.AddRow();
        MeaningfulBuffer.Col1 = sArrFields[0];
        MeaningfulBuffer.Col2 = sArrFields[1];
        MeaningfulBuffer.Col3 = sArrFields[2];
        MeaningfulBuffer.Col4 = sArrFields[3];
    }
}
I hope this much guidance is sufficient for you to resolve your issue, and that one of the two solutions I have offered works. Please respond back.