What does 'Multiline strings are different' mean in RIDE (Robot Framework) output? - csv

I am trying to compare the data in two CSV files and followed the process below in RIDE:
${csvA} =    Get File    ${filePathA}
${csvB} =    Get File    ${filePathB}
Should Be Equal As Strings    ${csvA}    ${csvB}
Here are the contents of my two CSV files:
csvA data
Harshil,45,8.03,DMJ
Divy,55,8,VVN
Parth,1,9,vvn
kjhjmb,44,0.5,bugg
csvB data
Harshil,45,8.03,DMJ
Divy,55,78,VVN
Parth,1,9,vvnbcb
acc,5,6,afafa
Since some of the data does not match, the result is FAIL when I run the code in RIDE. The log shows the following:
Multiline strings are different:
--- first
+++ second
@@ -1,4 +1,4 @@
 Harshil,45,8.03,DMJ
-Divy,55,8,VVN
-Parth,1,9,vvn
-kjhjmb,44,0.5,bugg
+Divy,55,78,VVN
+Parth,1,9,vvnbcb
+acc,5,6,afafa
I would like to know the meaning of the --- first, +++ second, and @@ -1,4 +1,4 @@ lines.
Thanks in advance!

When Robot compares multiline strings (data that has newlines in it), it reports the differences in the format used by the standard unix tool diff. Those characters are all part of what's called a unified diff. Even though you pass in raw data, it treats the data as two files and shows the differences between the two in a format familiar to most programmers.
Here are two references to read more about the format:
What does "@@ -1 +1 @@" mean in Git's diff output? (stackoverflow)
the diff man page (gnu.org)
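In short, the @@ line tells you which line ranges are being compared: -1,4 means four lines starting at line 1 of the first string, and +1,4 means four lines starting at line 1 of the second. Lines prefixed with - appear only in the first string, and lines prefixed with + appear only in the second.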
In your specific example it's telling you that three lines were different between the two strings: the line beginning with Divy, the line beginning with Parth, and the last line (kjhjmb in the first string, acc in the second). Since the line beginning with Harshil does not show a + or -, that means it was identical between the two strings.
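To see where such a report comes from, here is a minimal sketch that builds a unified diff of the two CSV strings from the question with Python's difflib. Whether Robot Framework uses this exact module internally is an assumption on my part; the variable names and the "first"/"second" labels are only chosen to mirror the log.

```python
import difflib

# The two CSV contents from the question, as multiline strings.
csv_a = "Harshil,45,8.03,DMJ\nDivy,55,8,VVN\nParth,1,9,vvn\nkjhjmb,44,0.5,bugg\n"
csv_b = "Harshil,45,8.03,DMJ\nDivy,55,78,VVN\nParth,1,9,vvnbcb\nacc,5,6,afafa\n"

# unified_diff compares two lists of lines; lineterm="" avoids
# adding newlines because splitlines() already removed them.
diff_lines = list(difflib.unified_diff(
    csv_a.splitlines(),
    csv_b.splitlines(),
    fromfile="first",
    tofile="second",
    lineterm="",
))

print("\n".join(diff_lines))
```

The third line of the output is the @@ -1,4 +1,4 @@ hunk header, the unchanged Harshil line is prefixed with a space, and the three differing lines appear once with - and once with +.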

Related

NIFI - Using one ReplaceText Processor how to add brackets at the beginning and end of each line

I receive a log file of 10000 rows like the following every 5 seconds.
log_datetime1 host_name1 log_message1
log_datetime2 host_name2 log_message2
log_datetime3 host_name3 log_message3
I want to send them to a Kudu or Parquet table as the following JSON:
{"current_datetime":"datetime", "log_data":"log_datetime1 host_name1 log_message1"}
{"current_datetime":"datetime", "log_data":"log_datetime2 host_name2 log_message2"}
{"current_datetime":"datetime", "log_data":"log_datetime3 host_name3 log_message3"}
Currently I'm using two ReplaceText processors: one to add
{"current_datetime":"datetime", "log_data":" at the beginning of each line of the 10000-row log file, and a second one to add "} at the end of each line.
I was wondering if I could do both steps in one ReplaceText processor.
Using the search pattern (.+)(?=\n) and the replacement pattern {"current_datetime":"datetime", "log_data":"$1"} will result in the desired output. The search pattern looks for text which is followed by a newline, and the replacement includes the capture group inside the templated JSON structure.
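The pattern can be tried outside NiFi. The sketch below applies the same search pattern with Python's re module; it only illustrates the regular expression and is not NiFi configuration. Python writes the capture group as \1 where the replacement above uses $1, and the sample log text is made up to match the question.

```python
import re

# Sample input: every line ends with a newline.
log = (
    "log_datetime1 host_name1 log_message1\n"
    "log_datetime2 host_name2 log_message2\n"
)

# Same search pattern as above: a run of text followed by a newline.
pattern = re.compile(r"(.+)(?=\n)")

# \1 is Python's spelling of the $1 capture-group reference.
replacement = r'{"current_datetime":"datetime", "log_data":"\1"}'

result = pattern.sub(replacement, log)
print(result, end="")

# The lookahead requires a newline, so a last line without one is left alone.
unterminated = pattern.sub(replacement, "no trailing newline")
```

One thing the lookahead implies: if the final line of the file has no trailing newline, it will not be wrapped, so it is worth checking how the log files end.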

line feed within a column in csv

I have a CSV like the one below. Some columns contain line breaks, like column B below. When I run wc -l file.csv, unix returns 4, but there are actually 3 records. I don't want to replace the line breaks with spaces; I am going to load the data into a database using SQL*Loader and want to load it as it is. What should I do so that unix treats a field with a line break as part of one record?
A,B,C,D
1,"hello
world",sds,sds
2,sdsd,sdds,sdds
Unless you're dealing with trivial cases (No quoted fields, no embedded commas, no embedded newlines, etc.), CSV data is best processed with tools that understand the format. Languages like perl and python have CSV parsing libraries available, there are packages like csvkit that provide useful utilities, and more.
Using csvstat from csvkit on your example:
$ csvstat -H --count foo.csv
Row count: 3
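If csvkit is not available, the same count can be sketched with Python's standard csv module, which also understands quoted fields with embedded newlines. The data below is the example from the question held in a string; for a real file you would open it with newline="" and pass the file object to csv.reader instead.

```python
import csv
import io

# The example from the question: the second record has a newline inside quotes.
data = 'A,B,C,D\n1,"hello\nworld",sds,sds\n2,sdsd,sdds,sdds\n'

# What wc -l sees: the number of newline characters.
physical_lines = data.count("\n")

# What a CSV parser sees: one row per record, newline kept inside the field.
rows = list(csv.reader(io.StringIO(data, newline="")))

print(physical_lines)  # 4
print(len(rows))       # 3 (header included, like csvstat -H)
print(repr(rows[1][1]))
```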

Unix diff with custom line separator

Looking to compare two CSV files. Suppose the field separator is $, each record has two fields, and the file can be formatted something like:
a$simple line$
b$run-on-
line$
c$simple line$
Is there some switch or variety of Unix diff command that will let me run the comparison where the record separator (line separator) is the $ sign immediately followed by a new line?
Ideally I want to be guaranteed that diff outputs the entire record when any change is detected.
With the default behavior, I could potentially get a partial record as diff output (in scenarios where the record runs over several lines).
Is there some smarter way to do this that I'm not considering?
--
Edited to add: sample of expected output
If I compared the CSV file above with:
a$simple line$
b$run-on-changed-
line$
c$simple line$
... I would want to see the entire record b reported as a difference. Something like
2c2
< b$run-on-\nline$
---
> b$run-on-changed-\nline$
Peter, GNU diff has no direct support for a custom line separator: http://man7.org/linux/man-pages/man1/diff.1.html (gnu diffutils)
You may try to use sed twice: sed to convert your format to one record per line for diffing; diff the converted files; then sed back to the multiline record format.
The first sed keeps every $\n as a real \n (the record separator) and converts each \n that has no $ before it to some unique special sequence, like #%#$%#$%#$#.
Then do the diff.
The second sed converts #%#$%#$%#$# back to \n (or to \\n for easier viewing of the diff output).
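The same idea (normalise to one record per line, then diff) can be sketched in Python instead of sed. This is a swapped-in illustration and not the sed commands themselves; the helper name to_records and the choice of a literal \n as the visible marker are my own assumptions.

```python
import difflib

def to_records(text, marker="\\n"):
    """Turn '$'+newline terminated records into one string per record.

    Newlines inside a record are replaced by a visible marker so that
    every record occupies exactly one line for the diff.
    """
    records = text.split("$\n")
    if records and records[-1] == "":
        records.pop()  # drop the empty piece after the final "$\n"
    return [r.replace("\n", marker) + "$" for r in records]

old = "a$simple line$\nb$run-on-\nline$\nc$simple line$\n"
new = "a$simple line$\nb$run-on-changed-\nline$\nc$simple line$\n"

# Diff the normalised records; each reported line is a whole record.
diff = list(difflib.unified_diff(to_records(old), to_records(new), lineterm=""))
print("\n".join(diff))
```

Because each record is a single element of the compared lists, a change anywhere in record b is reported as the whole record, which is the guarantee asked for in the question.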
There are diff variants which support working with csv. Some of them may handle csv with line breaks inside fields:
https://pypi.python.org/pypi/csvdiff (python)
csvdiff allows you to compare the semantic contents of two CSV files, ignoring things like row and column ordering in order to get to what’s actually changed. This is useful if you’re comparing the output of an automatic system from one day to the next, so that you can look at just what’s changed.
https://github.com/agardiner/csv-diff (ruby)
Unlike a standard diff that compares line by line, and is sensitive to the ordering of records, CSV-Diff identifies common lines by key field(s), and then compares the contents of the fields in each line.
http://csvdiff.sourceforge.net/ (perl)
csvdiff is a perl script to compare/diff two (comma) separated files with each other. The part that is different to standard diff is, that you'll get the number of the record where the difference occurs and the field/column which is different. The separator can be set to the value you want it to, not just comma. Also you can provide a third file which contains the column names in one(!) line separated by your separator.

Apache Nifi - store lines into 1 file

Using Apache NiFi, I created a flow that reads a JSON file and splits it line by line in order to verify whether the content is correct. After that I have 2 outputs, one for successful lines and one for unsuccessful ones, and each output is a JSON file.
For the moment, all the lines are stored in separate files, but what I want is to store all the "good" lines in one file and all the "bad" ones in another.
What processor should I use?
The RouteText processor was designed for exactly what you are trying to do. It allows you to route lines of text to different relationships based on expressions you create. It bundles the lines from each FlowFile together for each relationship.
You can see the documentation for it here: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.RouteText/index.html
You can get an example template (doing almost exactly what you would like to do) using RouteText here: https://github.com/hortonworks-gallery/nifi-templates/blob/master/templates/SplitRouteMergeVsRouteText.xml
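As a rough picture of the routing behaviour described above (lines grouped per relationship instead of one file per line), here is a plain Python sketch. It is not NiFi code or configuration; the function name route_lines and the "valid JSON" test for a good line are assumptions made for the illustration.

```python
import json

def route_lines(text):
    """Group lines into 'good' and 'bad' bundles, keeping their order."""
    good, bad = [], []
    for line in text.splitlines():
        try:
            json.loads(line)  # assumed check: a good line is valid JSON
            good.append(line)
        except json.JSONDecodeError:
            bad.append(line)
    # Each bundle is one block of text, i.e. one output file per outcome.
    return "\n".join(good), "\n".join(bad)

sample = '{"a": 1}\nnot json\n{"b": 2}'
good_text, bad_text = route_lines(sample)
print(good_text)
print(bad_text)
```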

Entry delimiter of JSON files for Hive table

We are collecting JSON data (public social media posts in particular) via REST API invocations, which we plan to dump into HDFS and then abstract with a Hive table on top of it using a SerDe. I wonder, though, what the appropriate delimiter per JSON entry in a file would be. Is it a new line ("\n")? So it would look like this:
{ id: entry1 ... post: }
{ id: entry2 ... post: }
...
{ id: entryn ... post: }
How about if we encounter a new line character within the JSON data itself, for example in post?
The best way would be one record per line, separated by "\n" exactly as you guessed.
This also means that you should be careful to escape "\n" that may be inside the JSON elements.
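A minimal sketch of writing one record per line: a standard JSON serializer already escapes newlines inside string values, so each record stays on a single physical line. The sample posts are invented for the illustration.

```python
import json

posts = [
    {"id": "entry1", "post": "first line\nsecond line"},
    {"id": "entry2", "post": "single line"},
]

# json.dumps without indent emits no raw newlines; the newline inside
# the post text is written as the two characters backslash and n.
lines = [json.dumps(p) for p in posts]
payload = "\n".join(lines) + "\n"

# Reading back line by line recovers the original records.
decoded = [json.loads(line) for line in payload.splitlines()]
print(payload, end="")
```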
Indented JSON won't work well with Hadoop/Hive: to distribute processing, Hadoop must be able to tell where a record ends, so that it can split the processing of a file of N bytes among W workers into W chunks of roughly N/W bytes each.
The splitting is done by the particular InputFormat that's been used, in case of text, TextInputFormat.
TextInputFormat will basically split the file at the first instance of "\n" found after byte i*N/W (for i from 1 to W-1).
For this reason, having other "\n" around would confuse Hadoop and it will give you incomplete records.
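The splitting rule described above can be modelled in a few lines of Python. This is a simplified model of the description, not Hadoop's actual TextInputFormat code; the function names and sample records are assumptions for the illustration.

```python
import json

def split_at_newlines(data: bytes, workers: int) -> list:
    """Cut data after the first newline found at or after byte i*N/W."""
    n = len(data)
    cuts = [0]
    for i in range(1, workers):
        newline = data.find(b"\n", i * n // workers)
        cut = n if newline == -1 else newline + 1
        if cut > cuts[-1]:
            cuts.append(cut)
    if cuts[-1] != n:
        cuts.append(n)
    return [data[a:b] for a, b in zip(cuts, cuts[1:])]

def all_lines_parse(chunks) -> bool:
    """True if every line of every chunk is a complete JSON value."""
    for chunk in chunks:
        for line in chunk.decode("utf-8").splitlines():
            try:
                json.loads(line)
            except json.JSONDecodeError:
                return False
    return True

records = [{"id": 1, "post": "a"}, {"id": 2, "post": "b"}, {"id": 3, "post": "c"}]
one_per_line = "".join(json.dumps(r) + "\n" for r in records).encode("utf-8")
indented = "".join(json.dumps(r, indent=2) + "\n" for r in records).encode("utf-8")

# One record per line: every chunk holds only whole records.
print(all_lines_parse(split_at_newlines(one_per_line, 2)))
# Indented JSON: lines are only fragments of records.
print(all_lines_parse(split_at_newlines(indented, 2)))
```

With one record per line, any newline is a safe place to cut. With indented JSON, a line-oriented reader sees fragments such as a lone opening brace, and a chunk boundary can fall in the middle of a record.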
As an alternative, though I wouldn't recommend it, you could use a character other than "\n" by configuring the property "textinputformat.record.delimiter" when reading the file through Hadoop/Hive, choosing a character that won't appear in the JSON (for instance \001, or CTRL-A, which Hive commonly uses as a field delimiter). That can be tricky, since the delimiter also has to be supported by the SerDe.
Also, if you change the record delimiter, anybody who copies or uses the file on HDFS must be aware of the delimiter, or they won't be able to parse it correctly and will need special code to do so. If you keep "\n" as the delimiter, the files are still normal text files and can be used by other tools.
As for the SerDe, I'd recommend this one, with the disclaimer that I wrote it :)
https://github.com/rcongiu/Hive-JSON-Serde