Loading JSON format data into Google BigQuery: performance issue

I have loaded JSON-format data into a Google BigQuery "nested" table (with two levels of nested "repeated" records); the average length of a JSON line is 5000 characters.
The load time is much slower than loading a flat file (of the same total size) into Google BigQuery.
What are the rules of thumb when loading JSON into nested records?
How can I improve my performance?
In terms of query performance, is it also much slower to retrieve data from a nested table than from a flat table?
Please help; I have found it difficult to reach an experienced DBA in this area.
Regards

I don't know of any reason JSON imports should be slower, but we haven't benchmarked them.
If performance is slow, you may be better off breaking the import into chunks and passing multiple source files to the load job.
It shouldn't be any slower retrieving the data from the nested table (and might be faster). The columnar storage format should store your nested data more efficiently than a corresponding flat table.
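For the chunking suggestion above, a rough sketch using the google-cloud-bigquery Python client; the bucket, dataset, and table names are placeholders, not from the question:

# Load several newline-delimited JSON chunks in a single load job.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # or pass an explicit schema for the nested/repeated fields
)

source_uris = [
    "gs://my-bucket/events-part-000.json",
    "gs://my-bucket/events-part-001.json",
    "gs://my-bucket/events-part-002.json",
]

load_job = client.load_table_from_uri(
    source_uris, "my_project.my_dataset.events", job_config=job_config
)
load_job.result()  # wait for the load to finish
print(client.get_table("my_project.my_dataset.events").num_rows)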

Related

Storing large JSON data in Postgres is infeasible, so what are the alternatives?

I have large JSON data, greater than 2 kB per record of my table, and currently it is being stored in a JSONB field.
My tech stack is Django and Postgres.
I don't perform any updates/modifications on this JSON data, but I do need to read it frequently and fast. However, because the JSON data is larger than 2 kB, Postgres splits it into chunks and puts it into the TOAST table, and hence the read process has become very slow.
So what are the alternatives? Should I use another database like MongoDB to store these large JSON data fields?
Note: I don't want to pull the keys out of this JSON and turn them into columns. This data comes from an API.
It is hard to answer specifically without knowing the details of your situation, but here are some things you may try:
1. Use Postgres 12 (stored) generated columns to maintain the fields or smaller JSON blobs that are commonly needed. This adds storage overhead, but frees you from having to maintain the duplication yourself.
2. Create indexes for any JSON fields you are querying (PostgreSQL allows you to create indexes on JSON expressions).
3. Use a composite index, where the first field in the index is the field you are querying on and the second field (/JSON expression) is the value you wish to retrieve. In this case PostgreSQL should retrieve the value from the index itself.
4. Similar to 1, create a materialised view which extracts the fields you need and allows you to query them quickly. You can add indexes to the materialised view too. This may be a good solution, as materialised views can be slow to update, but in your case the data doesn't update anyway.
5. Investigate why the TOAST tables are being slow. I'm not sure what performance you are seeing, but if you really do need to pull back a lot of data then you are going to need fast data access whatever database you choose to go with.
Your mileage may vary with all of the above suggestions, especially as each will depend on your particular use case (see the questions in my comment).
However, the overall idea is to use the tools that PostgreSQL provides to make your data quickly accessible. Yes, this may involve pulling the data out of its original JSON blob, but this doesn't need to be done manually. PostgreSQL provides some great tools for this.
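For example, suggestions 1 and 2 might look roughly like this with psycopg2; the table name, column names, and JSON keys are placeholders, not taken from the question:

# Sketch of suggestions 1 and 2: a stored generated column for a frequently
# read key, plus an expression index on another key.
import psycopg2

conn = psycopg2.connect("dbname=mydb user=myuser")
conn.autocommit = True

with conn.cursor() as cur:
    # Postgres 12+: materialise a commonly needed key as its own column.
    cur.execute("""
        ALTER TABLE api_payloads
        ADD COLUMN status text
        GENERATED ALWAYS AS (payload ->> 'status') STORED
    """)

    # Expression index, so filters on this key can use an index lookup
    # instead of reading every JSON blob.
    cur.execute("""
        CREATE INDEX idx_api_payloads_customer
        ON api_payloads ((payload ->> 'customer_id'))
    """)

conn.close()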
If you just need to store and read this JSON object in full, without using the JSON structure in your WHERE clauses, what about simply storing the data as binary in a bytea column? https://www.postgresql.org/docs/current/datatype-binary.html
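A minimal sketch of that idea with psycopg2 (the table and column names are made up):

# Store the raw JSON bytes in a bytea column and decode on the application side.
import json
import psycopg2

conn = psycopg2.connect("dbname=mydb user=myuser")

api_response = {"id": 1, "items": ["a", "b", "c"]}
raw = json.dumps(api_response).encode("utf-8")

with conn, conn.cursor() as cur:
    cur.execute(
        "INSERT INTO api_blobs (id, body) VALUES (%s, %s)",
        (1, psycopg2.Binary(raw)),
    )
    cur.execute("SELECT body FROM api_blobs WHERE id = %s", (1,))
    stored = bytes(cur.fetchone()[0])
    assert json.loads(stored.decode("utf-8")) == api_response

conn.close()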

Efficient way to load a large amount of data to a Vanilla JS frontend with a Flask-MySQL-SQLAlchemy Backend?

I have a Flask-based web server setup that ingests a lot of unique data points. While insertion into the DB is done asynchronously, getting data out of it is the harder part. For each request, I am looking at an average of 250,000 rows of raw data that need to be displayed in graph form using plotly.js. Executing the raw query on the MySQL command line takes about 10 seconds to return the data.
However, since I am using SQLAlchemy as my ORM, there seems to be a significant overhead. The extracted data then needs to be dumped into a JSON format to be sent to the front-end to display.
I understand that this situation has a lot of variables that can be changed but I am asking this question after about a week of trying to find solutions to this problem. Is the solution to throw hardware at it?
TL;DR: Large amount of data in the backend (Flask, SQLAlchemy, MySQL); need to display it on the frontend after querying 250,000 records and converting them to JSON. Any good solutions?
EDIT: I forgot to mention the size of the data. The JSON object that is sent is about 22.6 MB for 250,000 rows of SQL. The table this problem deals with has about 15 columns of floats, timestamps, varchars and integers. I'm willing to provide any further information that might help.
It turns out that launching an async process that does this in the background and hoping that it works is a perfectly good solution.
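In case it helps anyone, a rough sketch of that idea (the connection URL, table, and column names are placeholders, not from the post): the heavy query runs in a background thread with SQLAlchemy Core, so no ORM objects are built, and the result is cached as a pre-serialized JSON string that the frontend can poll.

# Run the 250k-row query in the background and cache the serialized JSON.
import json
import threading

from flask import Flask, Response
from sqlalchemy import create_engine, text

app = Flask(__name__)
engine = create_engine("mysql+pymysql://user:password@localhost/sensordb")

_cache = {"payload": None}  # latest serialized result
_lock = threading.Lock()

def refresh_cache():
    # SQLAlchemy Core (no ORM objects) keeps the per-row overhead small.
    with engine.connect() as conn:
        result = conn.execute(text("SELECT ts, value FROM readings ORDER BY ts"))
        rows = [[str(ts), value] for ts, value in result]
    payload = json.dumps({"columns": ["ts", "value"], "rows": rows})
    with _lock:
        _cache["payload"] = payload

@app.route("/refresh", methods=["POST"])
def refresh():
    # Kick off the expensive work and return immediately.
    threading.Thread(target=refresh_cache, daemon=True).start()
    return Response('{"status": "started"}', status=202, mimetype="application/json")

@app.route("/data")
def data():
    with _lock:
        payload = _cache["payload"]
    if payload is None:
        return Response('{"status": "not ready"}', status=202, mimetype="application/json")
    return Response(payload, mimetype="application/json")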

Store input data for Apache Hive as CSV or JSON?

I am currently building a system that uses the Hadoop ecosystem on a cluster to store our data and make it easily accessible and queryable. I have a few questions concerning how to store the input data that we want to make accessible with Hive.
Our data looks like this:
Sensor data from heterogeneous sensors
Data is saved in time-based order.
Data size: Several hundred GBs to TBs, each file is ~20-40 GB in size
Current data format: Proprietary data format
Binary files storing data of C++ Structs
Each struct has a timestamp and several different fields (some of them nested) for each sensor message
I have written a data parser that is able to convert any of our data files to a new output format. We have thought about storing our data in CSV or JSON data format on the HDFS and then use the data with Hive.
My questions:
Is CSV a better fit than JSON when using Hive?
We are concerned about worse performance when choosing JSON. The file size overhead of JSON should also be considered. Are there other benefits when using CSV over JSON?
If we choose JSON, should we avoid having nested objects in our files?
Because we have yet to define the structure of the JSON files, we can still choose not to have these nested objects and instead put all fields in the root object, appending the name of the nested object to the field names, e.g. a field nestedObjectName.nestedField1 instead of a real nested object (see the flattening sketch after these questions). I am aware that Hive does not like dots in its field names.
If we choose JSON, which JSON SerDe should we use?
I've read that rcongiu's Hive-JSON SerDe might be the best one. The blog post is quite old (07/2013) and things might have changed.
If we choose CSV, should we create one big table or several "smaller" tables that allow different views on the data?
What I mean is that we have our C++ structs, which have about 200 unique field names at the moment. I don't think that having one big "compound" table is a good idea. Should we rather split our binary data files into several CSV files, each corresponding to one logical data group (i.e. all data of one sensor type, for example the rain intensity sensor)? One possible downside: this might make querying "all" data (as in all sensor-type tables) in a time range more complicated.
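As an illustration of the flattening mentioned in the second question, here is a minimal sketch; the record layout and the underscore separator (chosen because Hive dislikes dots) are assumptions:

# Recursively merge nested objects into the root object, joining key names.
import json

def flatten(record, parent_key="", sep="_"):
    flat = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat

nested = {"timestamp": 1487155200, "rainSensor": {"intensity": 0.4, "unit": "mm/h"}}
print(json.dumps(flatten(nested)))
# {"timestamp": 1487155200, "rainSensor_intensity": 0.4, "rainSensor_unit": "mm/h"}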
Some more information about our Cluster setup:
We run a Hortonworks HDP Cluster with HDFS 2.7.3, Hive 1.2.1000, Spark 2.0.0, Zeppelin Notebook 0.6.0. We plan to keep our Stack updated.
If somebody thinks that there is a better data format than CSV and JSON, please also mention your idea. Ideally, the data format should be reasonably storage-efficient, mature, and have APIs in several programming languages (at least C++, Java, Python). We are open to input, as we can still decide which data format to use. Thanks!

Is JSON a good solution for data transfer between client and server?

I am trying to understand why JSON is widely used for data transfer between client and server. I understand that it offers a simple design which is easy to understand. However, consider the following:
A JSON string includes repeated data; e.g., in the case of a table, column names (keys) are repeated in each object. Would it not be wise to send the columns as the first object, and the rest of the objects as just the data (without column/key information) from the table?
Once we have a JSON object, searching based on keys is expensive (in time) compared to indexes. Imagine a table with 20-30 columns: doing this search for each key of each object would cost a lot more time compared to directly using indexes.
There may be many more drawbacks and advantages; add them here if you know of one.
I think if you want data transfer then you want a table-based format. The JSON format is not a table-based format like standard databases or Excel. This can complicate analyzing the data if there is a problem, because someone will usually use Excel for that (sorting, filtering, formulas). Also, building test files will be more difficult because you can't simply use Excel to export to JSON.
But if you wanted to use JSON for data transfer, you could basically build a JSON version of a CSV file. You would only use arrays:
Columns: ["First_Name", "Last_Name"]
Rows: [
["Joe", "Master"],
["Alice", "Gooberg"]
.... etc
]
Seems messy to me though.
If you wanted to use objects, then you would have to embed column names with every bit of data, which in my opinion indicates a wrong approach.
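For what it's worth, a small sketch of the two encodings side by side (the sample records are made up):

# Compare object-per-row JSON with the "columns once, rows as arrays" form.
import json

records = [
    {"First_Name": "Joe", "Last_Name": "Master"},
    {"First_Name": "Alice", "Last_Name": "Gooberg"},
]

# Object-per-row: keys are repeated in every object.
as_objects = json.dumps(records)

# Table-style: column names sent once, rows as plain arrays.
columns = ["First_Name", "Last_Name"]
as_table = json.dumps({
    "columns": columns,
    "rows": [[rec[c] for c in columns] for rec in records],
})

print(len(as_objects), len(as_table))  # the table form shrinks as rows grow

# Rebuilding objects on the receiving side is a one-liner.
decoded = json.loads(as_table)
rebuilt = [dict(zip(decoded["columns"], row)) for row in decoded["rows"]]
assert rebuilt == records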

Performance issue with JSON input

I am loading a MySQL table from a MongoDB source through Kettle.
The MongoDB collection has more than 4 million records, and when I run the Kettle job it takes 17 hours to finish the first full load.
Even an incremental load takes more than an hour. I tried increasing the commit size and also giving more memory to the job, but performance is still not improving. I think the JSON Input step takes a very long time to parse the data and hence it is very slow.
I have these steps in my transformation
MongoDB Input step
JSON Input
Strings cut
If field value is null
Concat fields
Select values
Table output.
Extracting the same 4 million records from Postgres was much faster than from MongoDB.
Is there a way I can improve the performance?
Please help me.
Thanks,
Deepthi
Run multiple copies of the step. It sounds like you have a MongoDB input step and then a JSON Input step to parse the JSON results, right? So use 4 or 8 copies of the JSON Input step (or more, depending on CPUs) and it will speed up.
Alternatively, do you really need to parse the full JSON? Maybe you can extract the data via a regex or something.
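Outside Kettle, a quick illustration of the regex idea (the field name and document shape are made-up examples; the shortcut is only safe when the field is a simple scalar that appears once per document):

# Pull one field out of each JSON document without parsing the whole structure.
import json
import re

doc = '{"_id": "abc123", "customer": {"name": "Jane", "city": "Oslo"}, "total": 42.5}'

# Full parse: correct for arbitrary JSON, but does work for every key.
total_parsed = json.loads(doc)["total"]

# Regex shortcut: cheaper, but only valid for plain numeric fields like this one.
match = re.search(r'"total"\s*:\s*([0-9.]+)', doc)
total_regex = float(match.group(1)) if match else None

assert total_parsed == total_regex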