Apache Spark - Get All Field Names From Nested Arbitrary JSON Files

I have run into a somewhat perplexing issue that has plagued me for several months now. I am trying to create an Avro schema (as I understand it, basically a schema-enforced format for serializing arbitrary data) to eventually convert some complex, arbitrarily nested JSON files to Parquet in a pipeline.
Is there a reasonable way to get the superset of field names I need for this use case while staying in Apache Spark, rather than dropping down to Hadoop MapReduce?
I think Apache Arrow, which is still under development, might eventually help avoid this by treating JSON as a first-class citizen, but it is still a ways off.
Any guidance would be sincerely appreciated!
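For what it's worth, one approach that stays entirely in Spark is to let spark.read.json infer a schema over all the files (the inferred schema is the union of the fields it sees across records) and then walk the resulting StructType recursively. A minimal sketch in PySpark, with a hypothetical input path:

    # Sketch: collect the superset of (flattened) field names from arbitrary
    # nested JSON using Spark's own schema inference. Assumes PySpark; the
    # input path is hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, ArrayType

    spark = SparkSession.builder.appName("json-field-names").getOrCreate()

    # Spark merges the schemas of all records while inferring, so the result
    # already covers every field seen across the files.
    df = spark.read.json("hdfs:///data/complex/*.json")

    def field_names(schema, prefix=""):
        """Recursively walk a StructType and yield dotted field paths."""
        for field in schema.fields:
            name = prefix + field.name
            dtype = field.dataType
            # Unwrap arrays so array-of-struct elements are traversed too.
            while isinstance(dtype, ArrayType):
                dtype = dtype.elementType
            if isinstance(dtype, StructType):
                yield from field_names(dtype, name + ".")
            else:
                yield name

    for path in sorted(field_names(df.schema)):
        print(path)

Each dotted path printed here is one entry in the superset of field names, which could then serve as the starting point for an Avro or Parquet schema.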

Related

Is HDF5 an Appropriate Technology to Store JSON Data?

I've inherited some code which makes calls to a web API and gets back a deeply nested (up to eight levels) response.
I've written some code to flatten the structure so that it can be written to .csv files, and a SQL database, for people to consume more easily.
What I'd really like to do though is keep a version of the original response, so that there's a reference of the original structure if I ever want/need it.
I understand that HDF5 is primarily meant to store numerical data. Is there any reason not to use it to dump JSON blobs? It seems a lot easier than setting up a NoSQL database.
It should be fine. It sounds like you'd be storing each JSON response as an HDF5 variable-length string, which is fine; it's just a string to the library.
Do you plan to store each response as a separate dataset? That may be inefficient if you are talking about thousands of responses.
Alternatively, you can create a 1-d extensible dataset, and just append to it with each response.
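A minimal sketch of that extensible-dataset idea, assuming the h5py library (the file name, dataset name, and sample response are illustrative):

    # Append each JSON response to a single resizable HDF5 dataset of
    # variable-length strings.
    import json
    import h5py

    with h5py.File("responses.h5", "a") as f:
        if "responses" not in f:
            dt = h5py.string_dtype(encoding="utf-8")  # variable-length UTF-8 strings
            dset = f.create_dataset("responses", shape=(0,), maxshape=(None,), dtype=dt)
        else:
            dset = f["responses"]

        response = {"status": "ok", "payload": {"nested": [1, 2, 3]}}  # example blob
        dset.resize(dset.shape[0] + 1, axis=0)   # grow the 1-d dataset by one slot
        dset[-1] = json.dumps(response)          # store the raw JSON text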
Decided it was easier to set up a Mongo database.

storing complex perl data structures in MySQL

I work on a large Perl website that currently stores all of its configuration in a Perl module. I need to move these settings into MySQL. The problem is that the settings are defined in lots of variables, and most of them are complex structures (like hashes of hashes and arrays of hashes).
My first idea was to use XML, YAML, or the Storable Perl module to easily write and read the variables from a simple file, but my boss doesn't want any of those solutions. He wants the settings stored in MySQL, so other approaches are not an option.
My question is: does anybody know of any CPAN module(s) that will help me with this task? What I basically need is a way to map all the complex Perl structures I have to MySQL tables.
Any suggestion would be really appreciated. Thanks!
Option 1: Store the data in serialized form (Data::Dumper, Storable, JSON, etc.) in a MySQL TEXT/MEDIUMTEXT/LONGTEXT field (roughly 64 KB / 16 MB / 4 GB maximum sizes, respectively).
Option 2: Use an ORM (object-relational mapping) such as DBIx::Class, which is the way to automatically map Perl data to database tables (similar to Java's Hibernate). As far as I'm aware you'll need to convert your data structures to objects, though there may be a DBIx module that can deal with non-blessed data structures.
Frankly, unless you need to manipulate the config data in detail within MySQL, piece by piece, option #1 is dramatically simpler. However, if your boss's goal is to be able to query configuration details or manipulate individual elements one by one, you will have to go with #2.
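To illustrate option #1 (the question is about Perl, but the idea is language-neutral), here is a rough sketch in Python using the mysql-connector-python driver; the table and column names are made up for the example:

    # Serialize a nested structure to JSON and store it in a MEDIUMTEXT column.
    import json
    import mysql.connector

    config = {
        "hosts": {"primary": "a.example.com", "fallbacks": ["b.example.com"]},
        "flags": {"debug": False, "retries": 3},
    }

    conn = mysql.connector.connect(host="localhost", user="app",
                                   password="secret", database="site")
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS config "
                "(name VARCHAR(64) PRIMARY KEY, body MEDIUMTEXT)")
    cur.execute("REPLACE INTO config (name, body) VALUES (%s, %s)",
                ("main", json.dumps(config)))
    conn.commit()

    # Reading it back: the TEXT blob deserializes into the original structure.
    cur.execute("SELECT body FROM config WHERE name = %s", ("main",))
    restored = json.loads(cur.fetchone()[0])
    cur.close()
    conn.close()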
Why don't you want to use Storable's freeze/thaw with MySQL?

Dynamic JSON file vs API

I am designing a system with 30,000 objects or so and can't decide between two options: either have a JSON file precomputed for each object and get the data by pointing to the file's URL (I think Twitter does something similar), or have a PHP/Perl/whatever script that produces the JSON object on the fly when requested, say from a database, and sends it back. Is one approach more suitable than the other? I suppose that if generating the JSON data takes a long time, pre-built JSON files are better. But what if generating is as quick as accessing the database? Although I suppose one would have a dedicated table in the database specifically for that. The data doesn't change very often, so updating is not a constant thing; in that respect the data is static for all intents and purposes.
Anyway, any thoughts would be much appreciated!
Alex
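One middle ground, sketched below under assumed names, is to generate the JSON from the database on the first request and cache it as a static file, so rarely-changing objects behave like precomputed files without all 30,000 having to be built up front (the paths and the load_object_from_db helper are hypothetical):

    import json
    import os

    CACHE_DIR = "cache/objects"

    def get_object_json(object_id, load_object_from_db):
        path = os.path.join(CACHE_DIR, f"{object_id}.json")
        if os.path.exists(path):
            # Cache hit: serve the precomputed file as-is.
            with open(path, "r", encoding="utf-8") as fh:
                return fh.read()
        # Cache miss: build from the database and persist for next time.
        body = json.dumps(load_object_from_db(object_id))
        os.makedirs(CACHE_DIR, exist_ok=True)
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(body)
        return body

When an object does change, deleting its cached file is enough to force regeneration on the next request.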
You might want to try MongoDB, which returns objects as JSON and is highly scalable and easy to set up.

Importing JSON strings via CSV into MySQL. Good or bad?

I'm working on a CSV import into a database and trying to think of ways to add more data without having to change the API (i.e., add new fields).
Since I work with JSON quite a bit on the client, I'm thinking of storing data in MySQL as a JSON string. So if I have a field
image_filenames
Where I'm currently storing data like this:
img123.jpg
Would it make sense to store multiple images in a JSON structure like so:
{"img_base":"img123.jpg", "img_alt_1":"img123-1.jpg", "img_alt_2":"img123-2.jpg"}
I can deserialize server-side, so it wouldn't be much of a problem to grab the image I need from the JSON, and it doesn't bloat the API.
Question:
I can't find anything at all on importing CSV files containing JSON strings. So, what's good and bad about doing this? Are there security concerns (SQL injection)?
Thanks!
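Regarding the SQL-injection worry: injection comes from string-interpolating values into the SQL text, not from the values happening to be JSON, so a parameterized import is safe either way. A rough sketch in Python with the mysql-connector-python driver (the file, table, and column names are illustrative):

    import csv
    import json
    import mysql.connector

    conn = mysql.connector.connect(host="localhost", user="app",
                                   password="secret", database="shop")
    cur = conn.cursor()

    with open("products.csv", newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            # Validate that the column really is JSON before storing it.
            json.loads(row["image_filenames"])
            # Bound parameters (%s) keep any quotes/braces in the JSON from
            # being interpreted as SQL.
            cur.execute(
                "INSERT INTO products (sku, image_filenames) VALUES (%s, %s)",
                (row["sku"], row["image_filenames"]),
            )

    conn.commit()
    cur.close()
    conn.close()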
Transferred from comments to here:
If you have a data model or schema that changes or is inconsistent, then relational database storage isn't the best choice. Sure, you can serialize it and store it as a binary string, but why? I'm not a fan of NoSQL databases, but IMO MongoDB looks like something you could make use of: its document format is JSON-based, so it would be familiar to you if you work with JSON on a daily basis. I'd use that to store the data rather than a relational DB.
Non-relational databases do less work, so they are faster in some scenarios. They also don't have a fixed schema, so there is no ALTER TABLE statement as such, and you can add "columns" as much as you like. If you don't have relations and need something that stores data in JSON format but is still searchable, MongoDB would be great.

Multiple JSON files, parse, and load into tables

I'm a real beginner when it comes time for this, so I apologize in advance.
The long and short of what I am looking for is a fairly simple concept - I want to pull JSON data off a server, parse it, and load it into Excel, Access, or some other type of tables. Basically, I want to be able to store the data so I can filter, sort, and query it.
To make matters a little more complicated, the server only returns truncated results with each JSON response, so it will be necessary to make multiple requests to the server.
Are there tools out there or code available which will help me do what I am looking for? I am completely lost, and I have no idea where to start.
(please be gentle)
I'm glad to see this question because I'm doing very similar things! Based on what I've gone through, it has a lot to do with how the tables are designed and linked together in the first place, and then with the mapping between those tables and the JSON objects at different depths or positions in the original JSON file. Once the mapping rules are made clear, the code can be done by simply hard-coding the mapping (I mean: if you encounter a JSON object under a certain parent, then you save its data into certain table(s)), provided you're using a high-level JSON parsing library.
OK, as I have to dash home from the office now:
Assuming that you are going to use Excel to parse the data, you are going to need:
1. A JSON parser (e.g. a JSON Parser for VBA)
2. Some code to download the JSON
3. A loop of VBA code that goes through each file and parses it into a sheet.
Is this ok for a starter? If you are struggling let me know and I will try and knock something up a little better over the weekend.
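If Excel/VBA turns out to be awkward, the same three steps (download, parse, load into a table) can also be sketched in Python; the endpoint URL, the "page"/"items" pagination scheme, and the flattening rules below are assumptions about a hypothetical API, and the output is a CSV that Excel or Access can open:

    import csv
    import json
    from urllib.request import urlopen

    def flatten(obj, prefix=""):
        """Flatten nested dicts/lists into dotted column names."""
        flat = {}
        if isinstance(obj, dict):
            for key, value in obj.items():
                flat.update(flatten(value, f"{prefix}{key}."))
        elif isinstance(obj, list):
            for i, value in enumerate(obj):
                flat.update(flatten(value, f"{prefix}{i}."))
        else:
            flat[prefix.rstrip(".")] = obj
        return flat

    records = []
    page = 1
    while True:
        # The server only returns a truncated chunk per request, so keep paging.
        with urlopen(f"https://api.example.com/data?page={page}") as resp:
            payload = json.load(resp)
        items = payload.get("items", [])
        if not items:
            break
        records.extend(flatten(item) for item in items)
        page += 1

    # Union of all column names seen across records becomes the CSV header.
    columns = sorted({col for rec in records for col in rec})
    with open("output.csv", "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=columns)
        writer.writeheader()
        writer.writerows(records)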