How to create a BQ schema from XSD - JSON

I need some guidance on how to proceed with a problem.
Our integration team receives XML files which are converted to JSON and sent to Pub/Sub. We then ingest the JSON files (or are supposed to) into BigQuery.
The problem is that the XML files do not include all possible objects or values every time, so I can't create a correct schema in BigQuery to receive the JSON files. I have the XSD file, plus an extension file which gives me all possible objects, but I don't know how to convert this to a correct BigQuery schema.
Do you have any suggestions on how to create a BigQuery schema from XSD files? I was thinking that if I create an XML file with dummy data (including all objects, and more than one object for repeated elements) with the help of the XSD, that XML file could be converted to JSON and then run through BigQuery's schema auto-detection.
Any suggestions?
Thanks,
Cris

If you have the XSD schema files, you can convert them to a valid JSON schema. There are a few tools that can help you accomplish this.
Keep in mind that these tools are general-purpose and not built for the particular case of BigQuery, so you'll have to tune the result to get a valid BigQuery schema. For this, check the components of a BigQuery schema, and for a quick reference, the sample provided in the documentation.
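For the BigQuery side, once you have the full set of fields from the XSD, it can be easier to declare the schema explicitly so that missing optional elements simply load as NULL. Here is a minimal sketch using the google-cloud-bigquery Python client; the dataset, table, and field names are placeholders, not taken from your XSD:

```python
# Minimal sketch with the google-cloud-bigquery client
# (pip install google-cloud-bigquery). All names below are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

# Schema derived from the XSD: optional elements become NULLABLE,
# maxOccurs="unbounded" elements become REPEATED, complex types become RECORD.
schema = [
    bigquery.SchemaField("order_id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("created_at", "TIMESTAMP", mode="NULLABLE"),
    bigquery.SchemaField(
        "items", "RECORD", mode="REPEATED",
        fields=[
            bigquery.SchemaField("sku", "STRING", mode="NULLABLE"),
            bigquery.SchemaField("quantity", "INTEGER", mode="NULLABLE"),
        ],
    ),
]

job_config = bigquery.LoadJobConfig(
    schema=schema,
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
)

# Load newline-delimited JSON (one converted XML document per line).
with open("orders.json", "rb") as f:
    job = client.load_table_from_file(f, "my_dataset.orders", job_config=job_config)
job.result()  # block until the load job finishes
```

With an explicit schema, files that omit optional elements still load cleanly, which avoids the main pitfall of auto-detection on incomplete samples.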

Related

Convert JSON to CSV in NiFi

I want to convert JSON files to CSV in NiFi. We can achieve this in Python and other programming languages, and there are multiple articles on it. I have multiple JSON files and each file has a different schema (one specific file has one schema only). I can see there are templates to convert CSV to JSON and other conversions, but I didn't see any template to convert JSON data to CSV. I have gone through the article https://community.hortonworks.com/articles/64069/converting-a-large-json-file-into-csv.html, however there the schema is hard-coded. As I have multiple files and each file has a different schema, I can't hardcode the schema. Any suggestions, please?
Conversion between formats is typically done through ConvertRecord by plugging in the appropriate record reader and record writer, in this case a JSON reader and a CSV writer.
To make use of the record processors you need to define Avro schemas for your data and put them in a schema registry; NiFi provides a local one.
There are lots of examples and posts out there about record processing; this slide deck shows an example of CSV to JSON, but it would be easy to reverse for your scenario:
https://www.slideshare.net/BryanBende/apache-nifi-record-processing
This post has some other info:
https://bryanbende.com/development/2017/06/20/apache-nifi-records-and-schema-registries
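The Avro schema you register is plain JSON, so you can sanity-check it before pasting it into NiFi's AvroSchemaRegistry. A minimal sketch using the fastavro Python library; the record and field names are illustrative placeholders, not taken from the question:

```python
# Sanity-check an Avro schema before registering it in NiFi's AvroSchemaRegistry.
# Requires: pip install fastavro. Names below are illustrative placeholders.
import json
from fastavro import parse_schema

schema = {
    "type": "record",
    "name": "InputRecord",
    "namespace": "nifi",
    "fields": [
        {"name": "id", "type": "string"},
        # A union with "null" models an optional field.
        {"name": "amount", "type": ["null", "double"], "default": None},
        {"name": "tags", "type": {"type": "array", "items": "string"}, "default": []},
    ],
}

parse_schema(schema)  # raises SchemaParseException if the schema is invalid
print(json.dumps(schema, indent=2))  # this JSON is what goes into the registry
```

Since each of your files has its own schema, you would register one schema per file type and point the JSON reader at the matching schema name.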

AWS Glue Crawler classifies JSON file as UNKNOWN

I'm working on an ETL job that will ingest JSON files into an RDS staging table. The crawler I've configured classifies JSON files without issue as long as they are under 1 MB in size. If I minify a file (instead of pretty-printing it), the crawler will also classify it without issue as long as the result is under 1 MB.
I'm having trouble coming up with a workaround. I tried converting the JSON to BSON or GZIPing the JSON file but it is still classified as UNKNOWN.
Has anyone else run into this issue? Is there a better way to do this?
I have two JSON files, 42 MB and 16 MB, partitioned on S3 at these paths:
s3://bucket/stg/year/month/_0.json
s3://bucket/stg/year/month/_1.json
I had the same problem as you, crawler classification as UNKNOWN.
I was able to solve it:
You must create a custom classifier with the JSON path "$[*]", then create a new crawler with that classifier.
Run your new crawler against the data on S3 and the proper schema will be created.
DO NOT update your current crawler with the classifier, as it won't apply the change. I don't know why; maybe because of the classifier versioning AWS mentions in their documentation. Creating a new crawler makes it work.
As mentioned in
https://docs.aws.amazon.com/glue/latest/dg/custom-classifier.html#custom-classifier-json
When you run a crawler using the built-in JSON classifier, the entire file is used to define the schema. Because you don’t specify a JSON path, the crawler treats the data as one object, that is, just an array.
That is something which Dung also pointed out in his answer.
Please also note that file encoding can lead to JSON being classified as UNKNOWN. Try re-encoding the file as UTF-8.
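For reference, the same two steps (custom classifier, then a brand-new crawler) can be scripted with boto3. A minimal sketch; the classifier name, crawler name, role ARN, database name, and S3 path are all placeholders:

```python
# Sketch: create the custom JSON classifier and a NEW crawler with boto3.
# Role ARN, database, bucket path, and names below are placeholders.
import boto3

glue = boto3.client("glue")

# "$[*]" tells the crawler to treat each element of the top-level array
# as a row, instead of classifying the whole file as a single object.
glue.create_classifier(
    JsonClassifier={"Name": "json-array-classifier", "JsonPath": "$[*]"}
)

# Create a brand-new crawler that uses the classifier (do not reuse the old one).
glue.create_crawler(
    Name="stg-json-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="staging",
    Targets={"S3Targets": [{"Path": "s3://bucket/stg/"}]},
    Classifiers=["json-array-classifier"],
)

glue.start_crawler(Name="stg-json-crawler")
```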

Can the Avro JSON be extended with additional information?

The Avro format is used in Hadoop as a header to describe the contents of the binary file that follows. My question is whether the JSON part of the Avro file can be extended to include information that is not necessary for Hadoop. The typical use case would be to attach metadata, like the originator of the file and a date, without it needing to be part of the record data itself.
Yes. Avro files can be annotated with additional information in the JSON schema or with specific additional name:value pairs. Additionally, we have been able to read these Avro files with Pentaho and Google BigQuery. One caveat is that the schema annotations and name:value pairs are discarded during the import process, so if you think you will need them later, you should extract and store local copies of them.
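As a concrete illustration, the fastavro Python library exposes this via its metadata parameter, which writes extra name:value pairs into the file header. A minimal sketch; the keys and values are made-up examples:

```python
# Sketch: attaching file-level metadata to an Avro file with fastavro
# (pip install fastavro). The keys "originator" and "created" are examples.
from fastavro import writer, reader, parse_schema

schema = parse_schema({
    "type": "record",
    "name": "Event",
    "doc": "demo record",  # "doc" is a standard Avro schema annotation
    "fields": [{"name": "id", "type": "string"}],
})

with open("events.avro", "wb") as out:
    writer(out, schema, [{"id": "a"}, {"id": "b"}],
           metadata={"originator": "integration-team", "created": "2020-01-01"})

with open("events.avro", "rb") as inp:
    avro_reader = reader(inp)
    print(avro_reader.metadata.get("originator"))  # -> integration-team
```

As noted above, tools that import the file may drop these pairs, so keep a copy if you need them downstream.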

Converting JSON txt file to XML

We're constructing a network of data, and part of that includes modifying a search query from a public website to pull all of the data we want. That data, however, when pulled, is stored in a JSON text file.
Ultimately we want this data to be stored in an Access database, so the next step, we thought, was to convert it to XML so we can have an Excel sheet to import. We found a formatting tool (http://jsonformatter.org). When running the tool we received the following error:
“Microsoft Access has encountered an error processing the XML schema in file ‘Data.xml’,
A document must contain exactly one root element”
I've no idea what this entails or where to start debugging. Are there alternatives we might consider?
The error says that there is more than one root element. Have you validated the generated XML? I looked at the website. I tried to ask via comment but I don't have enough rep; you should post some of your JSON and XML.
If I am reading your issue correctly, you are converting JSON to XML format and then to Excel?
I would suggest writing some code to consume the JSON and export the XML files to import.
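If it helps, the single-root fix is straightforward to script. A minimal sketch in Python using only the standard library, assuming the JSON file holds a list of flat objects; file names and element names are placeholders:

```python
# Sketch: convert a JSON text file (assumed to be a list of flat objects)
# into XML with exactly ONE root element, which is what Access requires.
import json
import xml.etree.ElementTree as ET

with open("Data.json", encoding="utf-8") as f:
    records = json.load(f)

root = ET.Element("dataroot")  # the single root element
for item in records:
    row = ET.SubElement(root, "record")
    for key, value in item.items():
        # Note: each key must be a valid XML tag name (no spaces, etc.).
        ET.SubElement(row, key).text = str(value)

ET.ElementTree(root).write("Data.xml", encoding="utf-8", xml_declaration=True)
```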

JSON Schema Builder Program

Is there an existing program that helps forming a JSON Schema?
There's this great tool to get you started on generating a JSON Schema: http://www.jsonschema.net/ . Just feed it a sample JSON file and out comes a JSON Schema that you can then tweak.
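Once you've tweaked the generated schema, you can also check documents against it programmatically. A minimal sketch with the Python jsonschema package; the schema and document shown are illustrative, not output from that site:

```python
# Sketch: validating a document against a (tweaked) generated schema
# with the jsonschema package (pip install jsonschema).
from jsonschema import validate, ValidationError

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer", "minimum": 0},
    },
    "required": ["name"],
}

try:
    validate(instance={"name": "Alice", "age": 30}, schema=schema)
    print("document is valid")
except ValidationError as e:
    print("invalid:", e.message)
```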
Check this demo one. It is at an early stage, but you can already edit JSON documents with a schema constraint as well as design a schema itself.
And here is an official thread about a schema-based JSON editor:
You can use Orderly. They have DSL for schema description and you can try it online.
You could try this one (XML ValidatorBuddy) which is actually an XML editor but it also supports JSON and especially JSON Schema editing. The editor is a Windows desktop application and can do auto-completion and syntax-coloring for JSON schema files. You can also validate your JSON files against JSON schema.
You can do this in Liquid XML Studio, but it's a commercial product.
This isn't exactly something that'll help you with a 'schema', per se, but it's a visual way to navigate and manage JSON data.
http://braincast.nl/samples/jsoneditor/