Using Apache NiFi, I created a flow that reads a JSON file and splits it line by line in order to verify that the content is correct. After that I have 2 outputs, 1 for successful lines and 2 for unsuccessful ones, and the output is a JSON file.
For the moment, all the lines are stored in separate files, but what I want is to store all the "good" lines in one file and all the "bad" ones in another.
What processor should I use?
The RouteText processor was designed for exactly what you are trying to do. It allows you to route lines of text to different relationships based on expressions you create. It bundles the lines from each FlowFile together for each relationship.
You can see the documentation for it here: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.RouteText/index.html
You can get an example template (doing almost exactly what you would like to do) using RouteText here: https://github.com/hortonworks-gallery/nifi-templates/blob/master/templates/SplitRouteMergeVsRouteText.xml
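To illustrate the idea outside NiFi, here is a minimal Python sketch of the same routing logic: each line is checked, sent to a "good" or "bad" bucket, and the lines in each bucket are bundled back together. This is not NiFi code. The names is_good and route_lines are hypothetical, and treating a line as "good" when it parses as a JSON object is only an assumption about what "correct" means in your flow.

```python
import json


def is_good(line: str) -> bool:
    """Hypothetical check: a line is "good" if it parses as a JSON object."""
    try:
        return isinstance(json.loads(line), dict)
    except json.JSONDecodeError:
        return False


def route_lines(text: str) -> dict:
    """Route each non-empty line to "good" or "bad" and bundle lines per route."""
    routed = {"good": [], "bad": []}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        routed["good" if is_good(line) else "bad"].append(line)
    # One bundled block of text per route, like one output file per relationship.
    return {name: "\n".join(lines) for name, lines in routed.items()}


sample = '{"id": 1}\nnot json\n{"id": 2}\n'
result = route_lines(sample)
```

Here result["good"] holds the two valid lines together and result["bad"] holds the invalid one, which mirrors having one file per outcome instead of one file per line.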
Related
I receive a log file of 10000 rows like the following every 5 seconds.
log_datetime1 host_name1 log_message1
log_datetime2 host_name2 log_message2
log_datetime3 host_name3 log_message3
I want to send them to a Kudu or Parquet table as the following JSON:
{"cureent_datetime":"datetime", "log_data":"log_datetime1 host_name1 log_message1"}
{"cureent_datetime":"datetime", "log_data":"log_datetime2 host_name2 log_message2"}
{"cureent_datetime":"datetime", "log_data":"log_datetime3 host_name3 log_message3"}
Currently I'm using two ReplaceText processors: one to add
{"cureent_datetime":"datetime", "log_data":" at the beginning of each line of the 10000-row log file, and a second one to add "} at the end of each line.
I was wondering if I could do both steps in one ReplaceText processor.
Using the search pattern (.+)(?=\n) and the replacement pattern {"current_datetime":"datetime", "log_data":"$1"} will result in the desired output. The search pattern matches the text of each line that is followed by a newline, and the replacement inserts the capture group ($1) into the templated JSON structure.
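If you want to try the pattern outside NiFi first, the following is a small Python sketch of the same search and replace. It only approximates ReplaceText: Python's re module is used instead of NiFi's regex engine, a function stands in for the $1 back-reference, and the name wrap_lines is hypothetical.

```python
import re

# Same search pattern as above: one line of text followed by a newline.
SEARCH = re.compile(r'(.+)(?=\n)')
# Same JSON template, with %s standing in for the $1 capture group.
TEMPLATE = '{"current_datetime":"datetime", "log_data":"%s"}'


def wrap_lines(text: str) -> str:
    """Wrap every newline-terminated line in the JSON template."""
    return SEARCH.sub(lambda m: TEMPLATE % m.group(1), text)


log = (
    "log_datetime1 host_name1 log_message1\n"
    "log_datetime2 host_name2 log_message2\n"
)
wrapped = wrap_lines(log)
```

Because of the (?=\n) lookahead, a final line that has no trailing newline is left unchanged by this pattern, so it is worth checking how your log files end.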
I have a CSV like the one below. Some of the columns contain line breaks, like column B below. When I run wc -l file.csv, unix returns 4, but there are actually 3 records. I don't want to replace the line breaks with spaces; I am going to load the data into a database using SQL*Loader and want to load it as it is. What should I do so that unix treats a field with a line break as part of one record?
A,B,C,D
1,"hello
world",sds,sds
2,sdsd,sdds,sdds
Unless you're dealing with trivial cases (no quoted fields, no embedded commas, no embedded newlines, etc.), CSV data is best processed with tools that understand the format. Languages like Perl and Python have CSV parsing libraries available, and there are packages like csvkit that provide useful utilities.
Using csvstat from csvkit on your example:
$ csvstat -H --count foo.csv
Row count: 3
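As an alternative sketch using Python's standard csv module, the following counts records rather than physical lines. The sample data is copied from the question; count_records is a hypothetical helper name.

```python
import csv
import io

# The example from the question, including the quoted field with a line break.
data = 'A,B,C,D\n1,"hello\nworld",sds,sds\n2,sdsd,sdds,sdds\n'


def count_records(stream) -> int:
    """Count CSV records; quoted fields may span several physical lines."""
    return sum(1 for _ in csv.reader(stream))


records = count_records(io.StringIO(data))
physical_lines = data.count("\n")  # what wc -l counts
```

For a real file, open it with open(path, newline="") before passing it to count_records, so that the csv module sees the embedded line breaks unchanged.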
I'm using v5.1.1 of JMeter and attempting to use the "CSV Data Set Config". The file is read correctly as I can tell from the Debug Sampler/Results Tree, but the file is not being read line by line. In other words, it reads the first line and never proceeds to the next line for processing.
I would like to use the data inside the CSV to iterate over a series of HTTP Requests to an external API. I currently have a single thread with only the "CSV Data Set Config" and "HTTP Request".
Do I need to wrap this with a ForEach controller or another looping construct? Perhaps I'm missing it, but I do not see anything in the documentation that would indicate it's necessary.
Thanks
You don't need to wrap this in a ForEach loop. The first line of the CSV file is used as the variable names:
Let's say your csv file looks like
foo, bar
1, John
2, George
3, Laura
and you use an HTTP Request sampler,
then ${foo} and ${bar} will be iterated sequentially. However, please be mindful of the CSV Data Set Config options. The following options work OK for me:
By default CSV Data Set Config doesn't trigger any "looping"; it reads the next line from the CSV file for each thread (virtual user) on each iteration.
So if you want to see more values from the CSV file - either add more users or loops or both.
Given
This CSV file:
line1
line2
line3
Following CSV Data Set Config setup:
And the following Thread Group setup:
You will get the following values (using the __threadNum() function to show the current virtual user number and the ${__jm__Thread Group__idx} pre-defined variable to show the current Thread Group iteration):
Check out the JMeter Parameterization - The Complete Guide article for more information on various approaches to parameterizing JMeter tests using external data sources.
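The "next line for each thread for each iteration" behaviour can be pictured with a small Python model. This is only a simplified sketch, not JMeter code: it assumes all threads share one file, that reading restarts from the top at end of file, and that threads take turns in a fixed order. Real JMeter threads run concurrently, so the order between threads is not guaranteed. The function name simulate_csv_data_set is hypothetical.

```python
from itertools import cycle


def simulate_csv_data_set(lines, threads, loops):
    """Model one shared CSV reader: every (thread, iteration) takes the next line."""
    source = cycle(lines)  # assumption: start again from the top at end of file
    reads = []
    for iteration in range(1, loops + 1):
        for thread in range(1, threads + 1):
            reads.append((thread, iteration, next(source)))
    return reads


reads = simulate_csv_data_set(["line1", "line2", "line3"], threads=2, loops=2)
for thread, iteration, value in reads:
    print(f"thread {thread}, iteration {iteration}: {value}")
```

With 1 thread and 1 loop the model reads only the first line, which matches the symptom in the question; more threads or more loops consume more lines.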
I am trying to compare the data of two CSV files and followed the process below in RIDE:
${csvA} =    Get File    ${filePathA}
${csvB} =    Get File    ${filePathB}
Should Be Equal As Strings    ${csvA}    ${csvB}
Here are my two csv contents -
csvA data
Harshil,45,8.03,DMJ
Divy,55,8,VVN
Parth,1,9,vvn
kjhjmb,44,0.5,bugg
csvB data
Harshil,45,8.03,DMJ
Divy,55,78,VVN
Parth,1,9,vvnbcb
acc,5,6,afafa
As some of the data does not match, the result is FAIL when I run the code in RIDE. But the log shows the data below:
Multiline strings are different:
--- first
+++ second
@@ -1,4 +1,4 @@
Harshil,45,8.03,DMJ
-Divy,55,8,VVN
-Parth,1,9,vvn
-kjhjmb,44,0.5,bugg
+Divy,55,78,VVN
+Parth,1,9,vvnbcb
+acc,5,6,afafa
I would like to know the meaning of the --- first, +++ second and @@ -1,4 +1,4 @@ content.
Thanks in advance!
When Robot Framework compares multiline strings (data that has newlines in it), it reports the differences in the same format as the standard unix diff tool. Those characters are all part of what's called a unified diff. Even though you pass in raw data, it treats the data as two files (named first and second) and shows the differences between the two in a format familiar to most programmers.
Here are two references to read more about the format:
What does "@@ -1 +1 @@" mean in Git's diff output? (stackoverflow)
the diff man page (gnu.org)
In short, the @@ line tells you which line numbers are affected (-1,4 means 4 lines starting at line 1 of the first string, +1,4 means 4 lines starting at line 1 of the second), and the + and - show you which lines are different.
In your specific example it's telling you that three lines were different between the two strings: the line beginning with Divy, the line beginning with Parth, and the last line (kjhjmb in the first string, acc in the second). Since the line beginning with Harshil does not show a + or -, that means it was identical between the two strings.
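You can reproduce this kind of output yourself with Python's difflib module, which produces unified diffs. The sketch below uses the two data sets from the question; the labels first and second are passed in explicitly, on the assumption that they are simply the names given to the two strings being compared.

```python
import difflib

csv_a = ["Harshil,45,8.03,DMJ", "Divy,55,8,VVN", "Parth,1,9,vvn", "kjhjmb,44,0.5,bugg"]
csv_b = ["Harshil,45,8.03,DMJ", "Divy,55,78,VVN", "Parth,1,9,vvnbcb", "acc,5,6,afafa"]

# Build a unified diff of the two line lists.
diff = list(
    difflib.unified_diff(csv_a, csv_b, fromfile="first", tofile="second", lineterm="")
)
print("\n".join(diff))
```

The first two lines of the result name the inputs, the @@ line gives the line ranges, lines starting with a space are unchanged, and lines starting with - or + exist only in the first or second input.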
I have a set of csv files that are very simple to load into Stata using the -insheet- command. But they have very uninformative variable names. For each of these files, I also have a file of metadata consisting of two columns: the original (uninformative) variable names, and a description of what the variables actually mean. I'd like to use these metadata files to create variable labels, preferably without going through and typing up all the separate label commands or turning the metadata file into a dictionary for each file. It seems like there must be a quick way of loading the metadata file into Stata and looping through it to generate the label commands, but I don't know what it is. Any thoughts?
Ideally each line of the metadata is something like
varname1 "more interesting description"
in which case you can prefix each line with
label var
and then run the file as if it were a do-file using do. See the help for label. That is easy in a decent text editor, for example by searching for the start of each line and replacing it with "label var " (note the need for the trailing space).
What could bite here includes:
You don't have double quotes " " as delimiters, in which case you need to insert them.
The extra information does not qualify as a variable label because it is more than 80 characters long. See help limits.
There are other ways to do this with Stata. You could write a program to read in the metadata and write out a do-file using file, but if this were my problem I would reach first for my text editor. (Most experienced Stata programmers use something else as well as doedit.)
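If you would rather script the text-editor step, here is a minimal sketch in Python (not Stata) that turns metadata lines into label var commands. It assumes each line holds a variable name, whitespace, and then the description; it adds the double quotes when they are missing and truncates descriptions to 80 characters. The function name make_label_commands is hypothetical, and descriptions that themselves contain double quotes are not handled.

```python
def make_label_commands(metadata_lines):
    """Build one Stata 'label var' command per metadata line."""
    commands = []
    for line in metadata_lines:
        parts = line.strip().split(None, 1)  # variable name, then the rest
        if len(parts) < 2:
            continue  # skip blank lines and names without a description
        varname, description = parts
        description = description.strip().strip('"')[:80]  # 80-character limit
        commands.append(f'label var {varname} "{description}"')
    return commands


commands = make_label_commands(['v1 "Household income"', "v2 Age in years", ""])
print("\n".join(commands))
```

Writing the resulting lines to a file such as labels.do (a hypothetical name) and running it with do gives the same result as the search-and-replace approach.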