Indexing JSON in GeoMesa - geomesa

Assume I want to persist JSON files in GeoMesa (on Accumulo). These JSON files have geometries and time. Can I use an XZ3 index? If yes, then how?
NB: By JSON I am not referring to GeoJSON.

You can write a GeoMesa converter (a configuration file) to extract the values you want out of your JSON and into a GeoTools SimpleFeature, and then ingest those features into GeoMesa. Download the Accumulo distribution from GitHub and look at the example under examples/ingest/json/.
Full documentation for converters is available here.
You also have the option of storing JSON strings as attributes, and querying them using JSON-Path. There is more information on that here.
The indices created for your data will depend on the attributes present. If you have a non-point geometry and a date defined, then you will automatically build an XZ3 index. More information on indices is available here and here.
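To make the rule above concrete, here is a minimal Python sketch of how the spatial index choice follows from the attributes. It is not GeoMesa code: the function name and the lowercase index names are assumptions for illustration, and it reflects the default behaviour as I understand it (Z indices for points, XZ indices for geometries with extent, the 3D variant only when a date is present). Check the index documentation for your version.

```python
def default_spatial_indices(geometry_type, has_date):
    """Return the spatial index names one would expect by default.

    Hypothetical helper for illustration only; GeoMesa decides this
    internally from the SimpleFeatureType.
    """
    is_point = geometry_type == "Point"
    # Points use the Z curves; lines and polygons use the XZ curves.
    indices = ["z2"] if is_point else ["xz2"]
    if has_date:
        # A date attribute adds the spatio-temporal variant.
        indices.insert(0, "z3" if is_point else "xz3")
    return indices


print(default_spatial_indices("Polygon", True))
print(default_spatial_indices("Point", False))
```

So for the question above, a converter that produces a polygon or line geometry plus a date attribute should be enough to get XZ3 without extra configuration.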

Related

How do we name the files that are streamed via firehose?

I'm building an architecture using boto3, and I want to dump data from an API to S3 in JSON format. Two things are blocking me. First, Firehose does NOT seem to support JSON; my current workaround is to leave the output uncompressed, but the result is still not the same as a JSON file. I would like a better option that makes the files more compatible.
Second, the file names can't be customized. All the data I collect will eventually be queried through Athena, so can boto3 do the naming?
Answering a couple of the questions you have. Firstly if you stream JSON into Firehose it will write JSON to S3. JSON is the file data structure and compression is the file type. Compressing JSON doesn't make it something else. You'll just need to decompress it before consuming it.
RE: file naming, you shouldn't care about that. Let the system name it whatever. If you define the Athena table with the location, you'll be able to query it. When new files are added, you'll be able to query them immediately.
Here is an AWS tutorial that walks you through this process. JSON stream to S3 with Athena query.
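For what it's worth, here is a rough sketch of sending JSON through Firehose with boto3. Firehose concatenates records as they arrive, so each record is terminated with a newline to give Athena one JSON document per line. The helper names and the stream name in the usage comment are made up; `put_record` with `DeliveryStreamName` and `Record={"Data": ...}` is the real boto3 call.

```python
import json


def to_firehose_record(event):
    """Serialize one event as a newline-terminated JSON document."""
    return {"Data": (json.dumps(event) + "\n").encode("utf-8")}


def send_event(client, stream_name, event):
    """Send one event; `client` is expected to be boto3.client("firehose")."""
    return client.put_record(
        DeliveryStreamName=stream_name,
        Record=to_firehose_record(event),
    )


# Usage (requires AWS credentials; "my-stream" is a placeholder):
#   import boto3
#   send_event(boto3.client("firehose"), "my-stream", {"id": 1})
```

The object keys in S3 are still chosen by Firehose; as the answer says, Athena only needs the table location.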

How to create a BQ-schema from XSD

I need some guidance on how to proceed with a problem.
Our integration team receives xml files which are converted to json and sent to pub/sub. We then ingest the json files (or are supposed to) into bigquery.
The problem is that the xml files do not include all possible objects or values all the time. So, I can't create a correct schema in bq to receive the json files. I got the xsd file with an extension file which gives me all possible objects, but I don't know how to convert this to a correct bq schema.
Do you have any suggestions on how to create a bq schema from xsd files? I was thinking that if I create an xml file with dummy data (including all objects and more than one object when creating repeated objects) with help of the xsd maybe that xml file may be converted to json and then use the auto-schema detection of bq.
Any suggestions?
Thanks,
Cris
If you have the XSD schema files, you can convert these to a valid JSON schema. There are a few tools that can help you to accomplish this.
Keep in mind that the tools are for general purposes and not for the particular case of BigQuery, so you'll have to tune the result to get a valid JSON schema. For this check the components of a BigQuery schema, and for quick reference the sample provided in the documentation.
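As an illustration of the kind of tuning involved, below is a small Python sketch that walks an XSD and emits BigQuery-style field definitions directly. It is not one of the tools mentioned above and makes several simplifying assumptions: only inline `xs:complexType`/`xs:sequence` definitions are handled (no named types, `ref`, attributes or choices), the type map is partial, and unknown types fall back to STRING. The function names are made up.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Partial, assumed mapping from XSD simple types to BigQuery types.
TYPE_MAP = {
    "string": "STRING", "int": "INTEGER", "integer": "INTEGER",
    "long": "INTEGER", "decimal": "NUMERIC", "double": "FLOAT",
    "float": "FLOAT", "boolean": "BOOLEAN", "date": "DATE",
    "dateTime": "TIMESTAMP",
}


def element_to_field(element):
    """Convert one xs:element into a BigQuery field dict."""
    field = {"name": element.get("name")}
    if element.get("maxOccurs", "1") != "1":
        field["mode"] = "REPEATED"      # e.g. maxOccurs="unbounded"
    elif element.get("minOccurs", "1") == "0":
        field["mode"] = "NULLABLE"      # optional element
    else:
        field["mode"] = "REQUIRED"
    complex_type = element.find(XS + "complexType")
    if complex_type is not None:
        # Nested elements become a RECORD with sub-fields.
        children = complex_type.findall(XS + "sequence/" + XS + "element")
        field["type"] = "RECORD"
        field["fields"] = [element_to_field(child) for child in children]
    else:
        xsd_type = element.get("type", "xs:string").split(":")[-1]
        field["type"] = TYPE_MAP.get(xsd_type, "STRING")
    return field


def xsd_to_bq_schema(xsd_text):
    """Return BigQuery field dicts for the top-level elements of an XSD."""
    root = ET.fromstring(xsd_text)
    return [element_to_field(e) for e in root.findall(XS + "element")]
```

Dumping the result with `json.dumps` gives a starting point in the JSON schema-file layout that BigQuery accepts; it will still need review against the real data.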

Convert JSON to CSV in nifi

I want to convert JSON files to CSV in nifi. We can achieve this in Python and other programming languages, and there are multiple articles on it. I have multiple JSON files and each file has a different schema (one specific file will have one schema only). I can see there are templates to convert CSV to JSON and other conversions, but I didn't see any template to convert JSON data to CSV. I have gone through the article https://community.hortonworks.com/articles/64069/converting-a-large-json-file-into-csv.html , however there the schema is hard-coded. As I have multiple files and each file has a different schema, I can't hard-code the schema. Any suggestions please.
Conversion between formats is typically done through ConvertRecord by plugging in the appropriate record reader and record writer, in this case a JSON reader and CSV writer.
To make use of the record processors you need to define Avro schemas for your data and put them in a schema registry; NiFi provides a local one.
There are lots of examples and posts out there about the record stuff. This slide deck shows an example of CSV to JSON, but it would be easy to reverse the situation for your scenario:
https://www.slideshare.net/BryanBende/apache-nifi-record-processing
This post has some other info:
https://bryanbende.com/development/2017/06/20/apache-nifi-records-and-schema-registries
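Since the concern is not wanting to hand-write a schema per file, here is a rough Python sketch (outside NiFi) of deriving a simple Avro record schema from one sample JSON record. It is only an assumption about how you might bootstrap the schemas: it handles flat records, treats nested or unknown values as strings, and makes every field nullable. The function names are made up.

```python
import json


def avro_type(value):
    """Map a Python value parsed from JSON to an Avro primitive type name."""
    if isinstance(value, bool):   # bool must be checked before int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    return "string"               # fallback for strings and nested values


def infer_avro_schema(record, name):
    """Build a flat Avro record schema from one sample record."""
    fields = [
        {"name": key, "type": ["null", avro_type(value)]}
        for key, value in record.items()
    ]
    return {"type": "record", "name": name, "fields": fields}


sample = json.loads('{"id": 1, "city": "Oslo", "temp": 3.5}')
print(json.dumps(infer_avro_schema(sample, "Sample"), indent=2))
```

The printed JSON could then be pasted into the schema registry entry used by the JSON reader and CSV writer.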

AWS Glue Crawler Classifies json file as UNKNOWN

I'm working on an ETL job that will ingest JSON files into a RDS staging table. The crawler I've configured classifies JSON files without issue as long as they are under 1MB in size. If I minify a file (instead of pretty print) it will classify the file without issue if the result is under 1MB.
I'm having trouble coming up with a workaround. I tried converting the JSON to BSON or GZIPing the JSON file but it is still classified as UNKNOWN.
Has anyone else run into this issue? Is there a better way to do this?
I have two json files which are 42mb and 16mb, partitioned on S3 as path:
s3://bucket/stg/year/month/_0.json
s3://bucket/stg/year/month/_1.json
I had the same problem as you, crawler classification as UNKNOWN.
I was able to solve it:
You must create a custom classifier with jsonPath set to "$[*]", then create a new crawler that uses the classifier.
Run your new crawler against the data on S3 and the proper schema will be created.
DO NOT update your current crawler with the classifier, as it won't apply the change. I don't know why; maybe it is because of the classifier versioning AWS mentions in their documentation. Creating a new crawler made it work.
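The same steps can be scripted with boto3. This is a sketch, not a tested deployment: the builder and wrapper function names and all argument values are placeholders, while `create_classifier` (with a `JsonClassifier` block) and `create_crawler` (with `Classifiers` and `S3Targets`) are the real Glue client calls.

```python
def classifier_request(name, json_path="$[*]"):
    """Arguments for glue.create_classifier with a custom JSON path."""
    return {"JsonClassifier": {"Name": name, "JsonPath": json_path}}


def crawler_request(name, role, database, s3_path, classifier_name):
    """Arguments for glue.create_crawler that attach the custom classifier."""
    return {
        "Name": name,
        "Role": role,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        "Classifiers": [classifier_name],
    }


def create_classifier_and_crawler(glue, classifier_name, crawler_name,
                                  role, database, s3_path):
    """Create the classifier first, then a NEW crawler that uses it."""
    glue.create_classifier(**classifier_request(classifier_name))
    glue.create_crawler(**crawler_request(
        crawler_name, role, database, s3_path, classifier_name))


# Usage (requires AWS credentials; all names are placeholders):
#   import boto3
#   create_classifier_and_crawler(boto3.client("glue"), "json-array",
#       "stg-crawler", "MyGlueRole", "stg_db", "s3://bucket/stg/")
```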
As mentioned in
https://docs.aws.amazon.com/glue/latest/dg/custom-classifier.html#custom-classifier-json
When you run a crawler using the built-in JSON classifier, the entire file is used to define the schema. Because you don’t specify a JSON path, the crawler treats the data as one object, that is, just an array.
That is something which Dung also pointed out in his answer.
Please also note that file encoding can lead to JSON being classified as UNKNOWN. Please try and re-encode the file as UTF-8.
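If encoding is the suspect, a quick way to re-encode before uploading is sketched below. The source encoding is something you have to know or detect yourself; "utf-16" in the example is just an assumption, and the function name is made up.

```python
def reencode_to_utf8(raw, source_encoding):
    """Decode bytes from a known encoding and return them as UTF-8 bytes."""
    text = raw.decode(source_encoding)
    return text.encode("utf-8")


# Example: a JSON document that was saved as UTF-16 (with a byte order mark).
utf16_bytes = '{"name": "café"}'.encode("utf-16")
utf8_bytes = reencode_to_utf8(utf16_bytes, "utf-16")
print(utf8_bytes.decode("utf-8"))
```

For files that are already UTF-8 but carry a byte order mark, passing "utf-8-sig" as the source encoding strips it.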

Can the avro json be extended with additional information?

The avro format is used in hadoop as a header to describe the contents of the binary file that follows. My question is whether the json part of the avro file can be extended to include information that is not necessary for hadoop? The typical use case would be to attach meta-data like the originator of the file and a date to the file without it needing to be data and part of the file.
Yes. Avro files can be annotated with additional information in the json schema or with specific additional name:value pairs. Additionally, we have been able to read these avro files with Pentaho and Google BigQuery. One caveat is that the schema and name:value pairs are discarded during the import process, so if you feel you will need them later, you should extract and store local copies of them.
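A small sketch of the first option, annotating the JSON schema itself. The Avro specification permits attributes it does not define as metadata, provided they do not affect how the data is serialized. The attribute names `originator` and `created` and the helper below are made up for illustration; writing the separate name:value pairs into the file header requires an Avro library and is not shown here.

```python
import json

# Attribute names the Avro specification already defines for records.
RESERVED = {"type", "name", "namespace", "doc", "aliases", "fields"}


def annotate_schema(schema, **extra):
    """Return a copy of a record schema with extra metadata attributes."""
    annotated = dict(schema)
    for key, value in extra.items():
        if key in RESERVED:
            raise ValueError(f"{key!r} is a reserved Avro attribute")
        annotated[key] = value
    return annotated


base = {
    "type": "record",
    "name": "Reading",
    "fields": [{"name": "value", "type": "double"}],
}
schema = annotate_schema(base, originator="sensor-team", created="2020-01-01")
print(json.dumps(schema, indent=2))
```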