I have to develop a MapReduce program that performs a join on two different datasets.
One of them is a CSV file and the other is an Avro file.
I am using MultipleInputs to process both sources. However, to process both datasets in a single reducer, I am converting the Avro data to Text using
new Text(key.datum().toString())
My challenge is to convert the JSON string generated above back into an Avro record in the reducer, as the final output needs to be in Avro format.
Is there a particular function or class that can be used to do this?
If yes, can you please quote an example as well?
I have an ADF pipeline exporting from an XML dataset (ADLS) to a JSON dataset (ADLS) with a Copy Data activity. Due to the complex XML structure, I need to parse the nested XML to nested JSON, then use T-SQL to parse the nested JSON into a Synapse table.
However, the nested JSON output has a double backslash (it looks like an escape character) at nodes that contain a comma. You can check a sample of the XML input and JSON output below:
xml input
<Address2>test, test</Address2>
json output
"Address2":"test\\, test"
How can I remove the double backslash from the output JSON with the Copy Data activity in Azure Data Factory?
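One side note on what those characters actually are: in raw JSON text, \\ is the escape sequence for a single backslash, so a JSON parser reading that file sees `test\, test`, not two backslashes. If the sink file really contains the sequence, a post-copy cleanup step (sketched here in plain Python, outside ADF, with a made-up payload) could strip the stray backslash:

```python
import json

# What the sink file literally contains (doubled backslash before the comma).
raw = '{"Address2": "test\\\\, test"}'

# A JSON parser decodes \\ into a single backslash character.
parsed = json.loads(raw)
assert parsed["Address2"] == "test\\, test"   # one real backslash

# Post-copy cleanup: drop the backslash that precedes the comma,
# then re-serialize without it.
cleaned = {k: v.replace("\\,", ",") for k, v in parsed.items()}
assert cleaned["Address2"] == "test, test"
print(json.dumps(cleaned))
```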
Unfortunately there is no such provision in the Copy Data activity.
However, I just tried with the lines you provided as sample source and sink with a Copy Data activity, and it copies them as-is; I don't see any \\. Perhaps you could share the exact pipeline you have, with details of the nested XML, the JSON, and the T-SQL you are using.
Repro: (with all default settings and properties)
Suppose we are developing an application that pulls Avro records from a source
stream (e.g. Kafka, Kinesis, etc.), parses them into JSON, then further processes that
JSON with additional transformations. Further assume these records can have a
varying schema (which we can look up and fetch from a registry).
We would like to use Spark's built-in from_avro function, but it is pretty clear that
from_avro wants you to hard-code a *fixed* schema into your code. It doesn't seem
to allow the schema to vary from incoming row to row.
That sort of makes sense if you are parsing the Avro into Spark's internal row format: one would need
a consistent structure for the DataFrame. But what if we wanted something like
from_avro that grabbed the bytes from one column in the row, grabbed the string
representation of the Avro schema from another column in the row, and then parsed that Avro
into a JSON string?
Does such a built-in method exist? Or is such functionality available in a third-party library?
Thanks!
Snowflake supports multiple file types via FILE_FORMAT creation (Avro, JSON, CSV, etc.).
Now I have tested SELECTing from a Snowflake stage (S3) with both:
*.avro files (generated by a NiFi processor batching 10k records from a source Oracle table).
*.json files (one JSON object per line).
When I SELECT $1 FROM @myStg, Snowflake expands as many rows as there are records in the Avro or JSON files (cool), but the $1 VARIANT is in JSON format in both cases, and now I wonder: whichever Snowflake FILE_FORMAT we use, do records always arrive as JSON in the VARIANT $1?
I haven't tested CSV or other Snowflake FILE_FORMATs.
Or I wonder if I get JSON from the Avro files (from the Oracle table) because maybe the NiFi processor creates Avro files that internally use JSON format.
Maybe I'm making some confusion here. I know Avro files contain both:
an Avro schema - a language similar to JSON key/value pairs.
compressed data (binary).
Thanks,
Emanuel O.
I tried with CSV; when it comes to CSV, it parses each record in the file like below.
When it comes to JSON, it treats one complete JSON object as one record, so it displays in JSON format.
I have a case where the flowfile content is always in JSON format and the data inside the JSON always changes (both keys and values). Is it possible to convert this flowfile content to CSV?
Please note that the keys in the JSON always change.
Many thanks,
To achieve this use case we need to generate an Avro schema dynamically for each JSON record first, then convert to Avro, and finally convert the Avro to CSV.
Flow:
1. SplitJson // split the array of JSON records into individual records
2. InferAvroSchema // infer the Avro schema based on the JSON record and store it in an attribute
3. ConvertJSONToAvro // convert each JSON record into an Avro data file
4. ConvertRecord // read the Avro data file dynamically and convert it into CSV format
5. MergeContent (or) MergeRecord processor // merge the split flowfiles into one flowfile based on the Defragment strategy.
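Outside NiFi, the same varying-keys transformation can be sketched in plain Python with the standard library (no schema-inference step is needed here, since csv.DictWriter can simply take the union of all keys as its header; the sample records are made up):

```python
import csv
import io
import json

# Records whose keys differ from row to row, as in the question.
raw = '[{"name": "a", "age": 1}, {"name": "b", "city": "NY"}]'
records = json.loads(raw)

# Header = union of all keys, in first-seen order.
fieldnames = []
for rec in records:
    for key in rec:
        if key not in fieldnames:
            fieldnames.append(key)

out = io.StringIO()
# restval="" fills in columns that a given record is missing.
writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
writer.writeheader()
writer.writerows(records)
print(out.getvalue())
```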
Save this XML template and upload it to your NiFi instance, then change it as per your requirements.
When using crossfilter (for example with dc.js), do I always need to transform my data into flat JSON for input?
Flat JSON data read from AJAX requests tends to be a lot larger than it needs to be (in comparison to, for example, nested JSON, value-to-array, or CSV data).
Is there an API available that can read formats other than flat JSON? Are there plans to add one?
I would like to avoid having the client transform the data before using it.
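Since crossfilter only takes a flat array of records, the transformation has to happen somewhere; to keep it off the client, it can be done server-side before the response is sent. A minimal sketch in Python (the nested payload shape is hypothetical) that flattens nested objects into the flat records crossfilter expects:

```python
import json


def flatten(obj, prefix=""):
    """Flatten nested dicts into dot-separated keys, e.g. "user.name"."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat


# Hypothetical nested payload as it might arrive from the server's data layer.
nested = json.loads('[{"user": {"name": "a", "age": 1}, "hits": 3}]')
flat_records = [flatten(rec) for rec in nested]
assert flat_records == [{"user.name": "a", "user.age": 1, "hits": 3}]
```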