How can I compare JSON files from two different sources (same schema) in Boomi? If there is a difference, we reject the payload with an error stating that fields cannot be changed. Note that there may be upwards of 200 attributes.
Related
There is a nested JSON file with a very deep structure. The file is in json.gz format and is 3.5 GB; once uncompressed it is about 100 GB.
The JSON is multiline, i.e. Multiline = True (only if this option is used to read the file via spark.read.json do we get to see the proper JSON schema).
Also, this file has a single record, which holds the entire data in two columns of struct-array type, with multilevel nesting.
How should I read this file and extract the information? What kind of cluster / technique should I use to extract the relevant data from this file?
Structure of the JSON (multiline)
This is a single record, and the entire data is present in 2 columns: in_netxxxx and provider_xxxxx.
I was able to achieve this in a slightly different way.
Used the utility Big Text File Splitter (BigTextFileSplitter, by Withdata Software: https://www.withdata.com/big-text-file-splitter), as the file was huge and nested across multiple levels. I kept the split record size at 500, which generated around 24 split files of roughly 3 GB each. The entire process took 30-40 minutes.
Processed the _corrupt_record separately and populated the required information.
Read each split file using the option below; it drops the _corrupt_record rows and also removes the null rows.
spark.read.option("mode", "DROPMALFORMED").json(file_path)
Once the information is fetched from each file, we can merge all the files into a single file, as per the standard process.
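As a rough sketch of that read-and-merge step in Spark's Java API (the paths, the glob pattern, and the single-file coalesce are assumptions for illustration, not the exact setup used above):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadJsonSplits {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("read-json-splits").getOrCreate();

        // Read every split file at once with a glob; DROPMALFORMED discards the
        // _corrupt_record rows and the null rows mentioned above.
        Dataset<Row> splits = spark.read()
                .option("mode", "DROPMALFORMED")
                .json("/mnt/data/json_splits/*.json");

        // Pull out whatever nested fields are needed, then write one merged output.
        // coalesce(1) forces a single output file at the cost of parallelism.
        splits.coalesce(1)
                .write()
                .mode("overwrite")
                .json("/mnt/data/json_merged");
    }
}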
CSV ("comma separated values") files, like many data sources, can have aberrations:
More or fewer fields than there are columns.
Field values that might present challenges (e.g., containing the field separator).
Is there some way to configure the Jackson CsvMapper so that it operates more liberally, i.e., is less restrictive with regard to parsing the data records in CSV files?
I suggest looking into the configuration options for com.fasterxml.jackson.dataformat.csv.CsvMapper. The setup below helped me deal with trailing, unmatched columns (in my case, one or more commas with no field content between them):
CsvMapper csvMapper = (new CsvMapper()).configure(Feature.IGNORE_TRAILING_UNMAPPABLE, true);
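For context, here is a minimal end-to-end sketch of that lenient setup for a header-based CSV; the Record bean and the records.csv file name are placeholders for illustration, not part of the original question:

import com.fasterxml.jackson.databind.MappingIterator;
import com.fasterxml.jackson.dataformat.csv.CsvMapper;
import com.fasterxml.jackson.dataformat.csv.CsvParser;
import com.fasterxml.jackson.dataformat.csv.CsvSchema;

import java.io.File;
import java.util.List;

public class LenientCsvRead {

    // Placeholder bean; field names must match the CSV header columns.
    public static class Record {
        public String id;
        public String name;
        public String email;
    }

    public static void main(String[] args) throws Exception {
        CsvMapper csvMapper = new CsvMapper();
        // Silently ignore trailing columns that have no matching property
        // (e.g. rows that end with one or more extra commas).
        csvMapper.configure(CsvParser.Feature.IGNORE_TRAILING_UNMAPPABLE, true);

        // Use the first line of the file as the column names.
        CsvSchema schema = CsvSchema.emptySchema().withHeader();

        MappingIterator<Record> it = csvMapper
                .readerFor(Record.class)
                .with(schema)
                .readValues(new File("records.csv"));

        List<Record> rows = it.readAll();
        System.out.println("Read " + rows.size() + " rows");
    }
}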
I am sending the following CSV file to MarkLogic
id,first_name,last_name,email,country,ip_address
5,Shawn,Grant,sgrant0#51.la,Liberia,37.194.161.124
5,Joshua,Fields,jfields1#godaddy.com,Colombia,54.224.238.176
5,Johnny,Bell,jbell2#t.co,Finland,159.38.61.122
Through MLCP, using the following command:
C:\mlcp-9.0.3\bin>mlcp.bat import -host localhost -port 9636 -username admin -password admin -input_file_path D:\test.csv -input_file_type delimited_text -document_type json
What happened?
When I looked in Query Console, I had one JSON document with the following information:
id,first_name,last_name,email,country,ip_address
5,Shawn,Grant,sgrant0#51.la,Liberia,37.194.161.124
What am I expecting?
By default, the first column of the CSV is used (as the document URI) when creating the JSON/XML document. Since I am sending 3 rows with the same id, the document should end up with the latest information (i.e., the 3rd row), right?
My assumption
Since I am sending all three rows at once with MLCP, we can't say which one reaches the MarkLogic database first.
Let me know whether my assumption is right or wrong.
Thanks
MLCP wants to be as fast as possible. In the case of CSV files it will process the rows using many threads (and even split the input file if you pass the split option). Because of this, there is no guarantee that the rows will be processed in any particular order. You may be able to tune some of the MLCP settings to use a single thread and not split the file so that you get the result you want, but in that case you are losing some of the power of MLCP.
Secondly, an observation: from how I interpret your problem statement, you are adding quite a bit of overhead by inserting and then overwriting unneeded documents. Why not sort and filter your initial CSV file down to one record per ID and save your computer from doing the extra work?
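If you do want last-row-wins behavior anyway, the tuning could look roughly like the command below; -thread_count and -split_input are the options I would check first, but I have not tested this, and duplicate URIs within a single batch can still behave unexpectedly, so verify against the MLCP documentation for your version:

mlcp.bat import -host localhost -port 9636 -username admin -password admin -input_file_path D:\test.csv -input_file_type delimited_text -document_type json -thread_count 1 -split_input false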
When converting CSV to AVRO I would like to output all the rejections to a file (let's say error.csv).
A rejection is usually caused by a wrong data type - e.g. when a "string" value appears in a "long" field.
I am trying to do it using the incompatible output; however, instead of saving only the rows that failed to convert (2 in the example below), it saves the whole CSV file. Is it possible to somehow filter out only those records that failed to convert? (Does NiFi add some markers to these records, etc.?)
Both processors, RouteOnAttribute and RouteOnContent, route whole files. Does the "incompatible" leg of the flow somehow mark individual records with something like an "error" attribute that would be available after splitting the file into rows? I cannot find this in any documentation.
I recommend using a SplitText processor upstream of ConvertCSVToAvro, if you can, so you are only converting one record at a time. You will also have a clear context for what the errors attribute refers to on any flowfiles sent to the incompatible output.
Sending the entire failed file to the incompatible relationship appears to be a purposeful choice. I assume it may be necessary if the CSV file is not well formed, especially with respect to records being neatly contained on one line (or properly escaped). If your data violates this assumption, SplitText might make things worse by creating a fragmented set of failed lines.
I am using SuperCsv to process contact csv files from different sources.
The number of columns is the same and there is a header in the file, so I want to use the CsvBeanReader.
As different sources have different columns and header titles, I am dynamically building the CellProcessor array based on the number of columns identified in the header.
I was struggling for a few hours with a SuperCsvException telling me there was a mismatch with the number of processors for some particular files, which all happen to be CSV exports from the Google Mail contacts application, before I noticed that these files had data rows ending with a useless trailing comma whereas the header row did not.
I solved the problem by catching the first SuperCsvException and adding the extra cell processor at that point, but I was wondering whether this last comma is present in other types of CSV files and whether Super CSV has any option that keeps the power of CsvBeanReader while allowing for this trailing-comma flexibility.
I would consider using CsvListReader.read() to get a list of string values. If the length of that list tells you what to do, you can apply an array of processors using Util.executeCellProcessors(), which takes as input the list of strings and the cell processors.
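Roughly, and assuming Super CSV 2.x (the contacts.csv name and the all-Optional processor array are placeholders, and the exact Util.executeCellProcessors signature should be checked against your version):

import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;

import org.supercsv.cellprocessor.Optional;
import org.supercsv.cellprocessor.ift.CellProcessor;
import org.supercsv.io.CsvListReader;
import org.supercsv.prefs.CsvPreference;
import org.supercsv.util.Util;

public class FlexibleColumnRead {

    public static void main(String[] args) throws Exception {
        try (CsvListReader reader = new CsvListReader(
                new FileReader("contacts.csv"), CsvPreference.STANDARD_PREFERENCE)) {

            String[] header = reader.getHeader(true); // advance past the header row

            List<String> row;
            while ((row = reader.read()) != null) {
                // Size the processor array to the row actually read, so a data row
                // with one extra trailing comma simply gets one extra processor.
                CellProcessor[] processors = new CellProcessor[row.size()];
                for (int i = 0; i < processors.length; i++) {
                    processors[i] = new Optional();
                }

                List<Object> processed = new ArrayList<Object>();
                Util.executeCellProcessors(processed, row, processors,
                        reader.getLineNumber(), reader.getRowNumber());

                System.out.println(header.length + " header columns, row: " + processed);
            }
        }
    }
}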