Load thousands of JSON files into BigQuery

I have around 10,000 JSON files, and I want to load them into BigQuery. As BQ only accepts NDJSON, I spent hours searching for a solution, but I can't find an easy and clean way to convert all the files to NDJSON.
I tested cat test.json | jq -c '.[]' > testNDJSON.json and it works well for converting a single file, but how can I convert all the files at once?
Right now, my ~10k files are in a GCS bucket and weigh about 5 GB in total.
Thanks!

Did you come across Dataprep in your search? Dataprep can read data from Cloud Storage, help you format the data, and insert it into BigQuery for you.
Alternatively, you can use a Cloud Dataflow I/O transform to handle this automatically.
Hope this helps.

My suggestion is to use a Google-provided Cloud Dataflow template to load your files into BQ; the one called Cloud Storage Text to BigQuery covers this case. It's important to consider the UDF that the template uses to transform your JSON files.
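Alternatively, for a simple one-off conversion without Dataflow, the jq command from the question can just be run over every file in a small shell loop. A minimal sketch, assuming jq, gsutil and the bq CLI are installed locally, that each file contains a top-level JSON array, and that the bucket, dataset and table names are placeholders:

# Sketch: pull the files down, convert each one to NDJSON, push them back, load into BQ.
# Bucket, dataset and table names are placeholders; adjust to your project.
mkdir -p json ndjson
gsutil -m cp "gs://my-source-bucket/*.json" json/
for f in json/*.json; do
  jq -c '.[]' "$f" > "ndjson/$(basename "$f")"
done
gsutil -m cp ndjson/*.json gs://my-source-bucket/ndjson/
bq load --autodetect --source_format=NEWLINE_DELIMITED_JSON \
  mydataset.mytable "gs://my-source-bucket/ndjson/*.json"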

Related

How do we name the files that are streamed via firehose?

I'm building an architecture using boto3, and I want to dump data from an API to S3 in JSON format. What is blocking me right now is, first, that Firehose does NOT support JSON; my current workaround is to not compress the records, but the result is still different from a plain JSON file. I would still like a better option to make the files more compatible.
And second, the file names can't be customized. All the data I collect will eventually be queried through Athena, so can boto3 do the naming?
Answering a couple of the questions you have. Firstly, if you stream JSON into Firehose, it will write JSON to S3. JSON is the data structure and compression is just the file encoding; compressing JSON doesn't make it something else. You'll just need to decompress it before consuming it.
RE: file naming, you shouldn't care about that. Let the system name the files whatever it wants. If you define the Athena table on that S3 location, you'll be able to query it, and when new files are added you'll be able to query them immediately.
Here is an AWS tutorial that walks you through this process. JSON stream to S3 with Athena query.
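To illustrate the point about defining the Athena table on the S3 location, here is a sketch of the DDL submitted through the AWS CLI. The table name, columns, bucket paths and the flat record layout are assumptions; adjust them to whatever your Firehose stream actually writes:

# Sketch: create an Athena table over the Firehose output prefix so new files are queryable immediately.
# Bucket names, columns and the results location are placeholders.
aws athena start-query-execution \
  --query-string "CREATE EXTERNAL TABLE IF NOT EXISTS firehose_events (
                    id string,
                    payload string,
                    created_at string
                  )
                  ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
                  LOCATION 's3://my-firehose-bucket/events/'" \
  --result-configuration "OutputLocation=s3://my-athena-query-results/"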

Dump dictionary as JSON file to Google Cloud Storage from Jupyter Notebook on Dataproc

I am using Spark on a Google Dataproc cluster. I have created a dictionary in a Jupyter notebook which I want to dump into my GCS bucket. However, it seems the usual way of dumping to JSON using fopen() does not work in the case of GCP. So how can I write my dictionary as a .json file to GCS? Or is there any other way to get the dictionary out?
It's funny: I could write a Spark dataframe to GCS without any hassle, but apparently I can't put JSON on GCS unless I have it on my local system!
Please help!
Thank you.
The file in GCS is not in your local file system, so that's why you cannot call "fopen" on it. You can either save to GCS directly using a GCS client (for example, this tutorial), or treat the GCS location as an HDFS destination (for example, saveAsTextFile("gs://...")).
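If the notebook is running on the Dataproc master node, where gsutil is already available, another simple option is to dump the dictionary to a local file from Python and then copy that file up to GCS from a shell cell. A minimal sketch; the local path and bucket name are placeholders:

# Sketch: after writing the dictionary to a local file in the notebook
# (e.g. json.dump to /tmp/mydict.json), copy it to the bucket with gsutil.
gsutil cp /tmp/mydict.json gs://my-bucket/exports/mydict.json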

How to import CSV files into Firebase

I see we can import JSON files into Firebase.
What I would like to know is if there is a way to import CSV files (I have files that could have about 50K or even more records, with about 10 columns).
Does it even make sense to have such files in Firebase?
I can't answer whether it makes sense to have such files in Firebase; you should answer that.
I also had to upload CSV files to Firebase, and I finally transformed my CSV into JSON and used firebase-import to add the JSON to Firebase.
There are a lot of CSV to JSON converters (even online ones). You can pick the one you like the most (I personally used node-csvtojson).
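For example, with the csvtojson command-line tool and firebase-import, the whole flow can look roughly like this. The file names, database path and database URL are placeholders, and the firebase-import flags are taken from its README, so double-check them against the version you install (newer versions may also require a service account key):

# Sketch: convert a CSV to JSON, then push it into the Realtime Database.
# File names, path and database URL are placeholders; verify the
# firebase-import flags against the version you install.
npm install -g csvtojson firebase-import
csvtojson records.csv > records.json
firebase-import --database_url https://my-project.firebaseio.com \
  --path /records --json records.json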
I've uploaded many tab-separated files (40 MB each) into Firebase.
Here are the steps:
I wrote Java code to translate TSV into JSON files.
I used firebase-import to upload them. To install it, just type in cmd:
npm install firebase-import
One trick I used on top of all the ones already mentioned is to synchronize a Google spreadsheet with Firebase.
You create a script that uploads directly to the Firebase DB based on rows/columns. It worked quite well and can be more visual for fine-tuning the raw data compared to working with the CSV/JSON format directly.
Ref: https://www.sohamkamani.com/blog/2017/03/09/sync-data-between-google-sheets-and-firebase/
Here is the fastest way to import your CSV to Firestore:
Create an account in Jet Admin
Connect Firebase as a DataSource
Import CSV to Firestore
Ref: https://blog.jetadmin.io/how-to-import-csv-to-firestore-database-without-code/

BigQuery table to be extracted as JSON to local machine

I have an idea of how to extract table data to Cloud Storage using the bq extract command, but I would rather like to know if there are any options to extract a BigQuery table as newline-delimited JSON to my local machine.
I can extract table data to GCS via the CLI and also download JSON data from the web UI, but I am looking for a solution using the bq CLI to download table data as JSON to my local machine. I am wondering, is that even possible?
You need to use Google Cloud Storage for your export job. Exporting data from BigQuery is explained here; check also the variants for different path syntaxes.
Then you can download the files from GCS to your local storage.
The gsutil tool can help you further to download the files from GCS to your local machine.
In short: you first need to export to GCS, then transfer to your local machine.
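Concretely, the two steps can look like this; dataset, table and bucket names are placeholders:

# Sketch: export the table to GCS as newline-delimited JSON, then pull the shards down.
# Dataset, table and bucket names are placeholders.
bq extract --destination_format NEWLINE_DELIMITED_JSON \
  'mydataset.mytable' gs://my-export-bucket/mytable-*.json
gsutil -m cp "gs://my-export-bucket/mytable-*.json" ./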
If you use the bq CLI tool, you can set the output format to JSON and redirect it to a file. This way you can achieve some export locally, but it has certain other limits.
This exports the first 1000 rows as JSON:
bq --format=prettyjson query --n=1000 "SELECT * from publicdata:samples.shakespeare" > export.json
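If the goal is specifically newline-delimited JSON, one variation on the same command (a sketch, assuming jq is available and keeping the query and row limit from above) is to ask for compact JSON and reshape it with jq:

# Sketch: same query as above, but emit compact JSON and convert the array to NDJSON.
bq --format=json query --n=1000 "SELECT * from publicdata:samples.shakespeare" | jq -c '.[]' > export.ndjson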
It's possible to extract data directly to your local machine without using GCS, using the bq CLI.
Please see my other answer for details: BigQuery Table Data Export

Export from Couchbase to CSV file

I have a Couchbase cluster with only one node (let's call it localhost) and I need to export all the data from a very big bucket (let's call it XXX) into a CSV file.
Now this seems to be a pretty easy task, but I can't find a way to make it work.
According to the (really bad) documentation on the cbtransfer tool from Couchbase http://docs.couchbase.com/admin/admin/CLI/cbtransfer_tool.html they say this is possible, but they don't explain it clearly. They just mention a flag if you want the transfer to occur in CSV format (?), but it is not working. Maybe someone who has already done this can give me a hand?
Using the documentation, I've been able to get close to the result I want to obtain (a clean CSV file with all the documents in the XXX bucket) using this command:
/opt/couchbase/bin/cbtransfer http://localhost:8091 /path/to/export/output.csv -b XXX
But what I get is that /path/to/export/output.csv is actually a folder with a lot of folders inside, and it stores some kind of JSON metadata that can be used to restore the XXX bucket in another instance of Couchbase.
Has anyone been able to export data from a Couchbase bucket (JSON documents) into a CSV file?
From looking at the documentation, you have to use a slightly different syntax to export to CSV. http://docs.couchbase.com/admin/admin/CLI/cbtransfer_tool.html
It needs to look like so:
cbtransfer http://[localhost]:8091 csv:./data.csv -b default -u Administrator -p password
Notice the "csv:" before the name of the csv file.
I tested this and it does export a CSV. Just be forwarned that you need a relatively flat document structure for this to work really well, as JSON can represent far more complex data structures than CSV obviously, e.g. arrays, sub-documents, etc. cbtransfer will not unravel those. For example, if there is a subdocument, cbtransfer will represent it as a JSON doc in the line of each CSV.
So depending on what your document structure is, exporting to CSV is not an ideal format. It is a step backwards.