Facing issue with mongoexport JSON file "_id" column

I am exporting a Mongo collection to JSON format and then loading that data into a BigQuery table using the bq load command.
mongoexport --uri mongo_uri --collection coll_1 --type json --fields id,createdAt,updatedAt --out data1.json
A JSON row looks like this:
{"_id":{"$oid":"6234234345345234234sdfsf"},"id":1,"createdAt":"2021-05-11 04:15:15","updatedAt":null}
but when I run the bq load command in BigQuery it gives the below error:
Invalid field name "$oid". Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 300 characters long.
I think if the mongoexport JSON contained {"_id": ObjectId("6234234345345234234sdfsf")}, my issue would be solved.
Is there any way to export JSON like this?
Or any other way to achieve this?
Note: I can't use CSV format because the Mongo documents contain commas.

By default, _id holds an ObjectId value, so it would be better to have the data in the {"_id": ObjectId("6234234345345234234sdfsf")} form instead of "_id":{"$oid":"6234234345345234234sdfsf"}.
As you mentioned, if the JSON contained {"_id": ObjectId("6234234345345234234sdfsf")} your problem would be solved.

Replace $oid with oid. I'm using Python, so the code below worked for me:
import fileinput

# Rewrite the export file in place, renaming the "$oid" key to "oid"
with fileinput.FileInput("mongoexport_json.txt", inplace=True, encoding="utf8") as file:
    for line in file:
        print(line.replace('"$oid":', '"oid":'), end='')
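Alternatively, a sketch using jq (the dataset and table names below are placeholders): flatten _id to a plain string so the column loads as a STRING, then load the newline-delimited JSON:
# Replace the {"$oid": "..."} wrapper with the plain hex string
jq -c '._id = ._id."$oid"' data1.json > data1_fixed.json
# Load into BigQuery (my_dataset.my_table is a placeholder)
bq load --source_format=NEWLINE_DELIMITED_JSON my_dataset.my_table data1_fixed.json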

Related

How to remove specific data from the beginning of a json/avro schema file and the last bracket from the end of the file?

I am extracting the schema of a table from an Oracle DB using Apache NiFi, which I need to use to create a table in BigQuery. The ExecuteSQL processor in NiFi is giving me a schema file which I am saving in my home directory. Now, to use this schema file in BigQuery, I need to remove a certain part of the schema file from the beginning and the end. How do I do this in Unix using sed/awk?
Here is the content of the output file:
Obj^A^D^Vavro.schema<88>^L{"type":"record","name":"NiFi_ExecuteSQL_Record","namespace":"any.data","fields":[{"name":"FEED_SEQUENCE","type":["null",{"type":"bytes","logicalType":"decimal","precision":38,"scale":0}]},{"name":"FEED_UNIQUE_NAME","type":["null","string"]},{"name":"COUNTRY_CODE","type":["null","string"]},{"name":"EXTRACTION_TYPE","type":["null","string"]},{"name":"PROJECT_SEQUENCE","type":["null",{"type":"bytes","logicalType":"decimal","precision":38,"scale":0}]},{"name":"CREATED_BY","type":["null","string"]},{"name":"CREATED_DATE","type":["null",{"type":"long","logicalType":"timestamp-millis"}]},{"name":"UPDATED_BY","type":["null","string"]},{"name":"UPDATED_DATE","type":["null",{"type":"long","logicalType":"timestamp-millis"}]},{"name":"FEED_DESC","type":["null","string"]}]}^Tavro.codec^Hnull^#àÂ<87>)[ù<8b><97><90>"õ^S<98>[<98>±
I want to remove the Initial part Obj^A^D^Vavro.schema<88>^L{"type":"record","name":"NiFi_ExecuteSQL_Record","namespace":"any.data","fields":
and the ending part }^Tavro.codec^Hnull^#àÂ<87>)[ù<8b><97><90>"õ^S<98>[<98>± from the above.
Considering that you want to remove everything outside the first [ and the last ]:
sed 's/^[^[]*//;s/[^]]*$//'
Test:
$ cat out.file
Obj^A^D^Vavro.schema<88>^L{"type":"record","name":"NiFi_ExecuteSQL_Record","namespace":"any.data","fields":[{"name":"FEED_SEQUENCE","type":["null",{"type":"bytes","logicalType":"decimal","precision":38,"scale":0}]},{"name":"FEED_UNIQUE_NAME","type":["null","string"]},{"name":"COUNTRY_CODE","type":["null","string"]},{"name":"EXTRACTION_TYPE","type":["null","string"]},{"name":"PROJECT_SEQUENCE","type":["null",{"type":"bytes","logicalType":"decimal","precision":38,"scale":0}]},{"name":"CREATED_BY","type":["null","string"]},{"name":"CREATED_DATE","type":["null",{"type":"long","logicalType":"timestamp-millis"}]},{"name":"UPDATED_BY","type":["null","string"]},{"name":"UPDATED_DATE","type":["null",{"type":"long","logicalType":"timestamp-millis"}]},{"name":"FEED_DESC","type":["null","string"]}]}^Tavro.codec^Hnull^#àÂ<87>)[ù<8b><97><90>"õ^S<98>[<98>±
$ sed 's/^[^[]*//;s/[^]]*$//' out.file
[{"name":"FEED_SEQUENCE","type":["null",{"type":"bytes","logicalType":"decimal","precision":38,"scale":0}]},{"name":"FEED_UNIQUE_NAME","type":["null","string"]},{"name":"COUNTRY_CODE","type":["null","string"]},{"name":"EXTRACTION_TYPE","type":["null","string"]},{"name":"PROJECT_SEQUENCE","type":["null",{"type":"bytes","logicalType":"decimal","precision":38,"scale":0}]},{"name":"CREATED_BY","type":["null","string"]},{"name":"CREATED_DATE","type":["null",{"type":"long","logicalType":"timestamp-millis"}]},{"name":"UPDATED_BY","type":["null","string"]},{"name":"UPDATED_DATE","type":["null",{"type":"long","logicalType":"timestamp-millis"}]},{"name":"FEED_DESC","type":["null","string"]}]
You can use the ExtractAvroMetadata processor to extract only the avro.schema from the Avro flowfile.
In the processor, specify avro.schema as the value of the Metadata Keys property; the processor then extracts the Avro metadata and keeps it as a flowfile attribute.
Use the attribute value (${avro.schema}) in a ReplaceText processor to overwrite the content of the flowfile and create the table.
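(As a sketch of the ReplaceText configuration, assuming the standard processor properties: set Replacement Strategy to Always Replace and Replacement Value to ${avro.schema}.)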
With the data in file d, using GNU sed:
sed -E 's/^[^\[]+(\[\{.+\})[^\}]+/\1/' d
Consider using regexes in Perl if you need to do more JSON string manipulation.
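For example, a direct Perl translation of the sed command from the first answer (a sketch, same assumptions about the file layout):
# Strip everything before the first [ and after the last ] on each line
perl -pe 's/^[^\[]*//; s/[^\]]*$//' out.file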

Oracle SQLcl: Spool to json, only include content in items array?

I'm making a query via Oracle SQLcl. I am spooling into a .json file.
The correct data is presented from the query, but the format is strange.
Starting off as:
SET ENCODING UTF-8
SET SQLFORMAT JSON
SPOOL content.json
Followed by a query, this produces a JSON file as requested.
However, how do I remove the outer structure, meaning this part:
{"results":[{"columns":[{"name":"ID","type":"NUMBER"},
{"name":"LANGUAGE","type":"VARCHAR2"},{"name":"LOCATION","type":"VARCHAR2"},{"name":"NAME","type":"VARCHAR2"}],"items": [
// Here is the actual data I want to see in the file exclusively
]
I only want to spool everything in the items array, not including that key itself.
Is this possible to set as a parameter before querying? Reading the Oracle docs have not yielded any answers, hence asking here.
That's how I handle this.
After output to some file, I use the jq command to recreate the file with only the items:
cat file.json | jq --compact-output --raw-output '.results[0].items' > items.json
Using this library: https://stedolan.github.io/jq/
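Given the columns in the question, items.json then contains just the array of row objects; with hypothetical row values, something like:
[{"ID":1,"LANGUAGE":"EN","LOCATION":"OSLO","NAME":"Example"}]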

How to perform mongoexport without "numberLong" objects

I did a mongoexport dump of a database:
$ mongoexport -d my_db -c articles
and I'm seeing some of our ids are wrapped in "$numberLong" objects. Unfortunately, it isn't consistent. Some filemakerIds are just plain ints:
{"_id":{"$oid":"52126317036480948dc2abf2"},"filemakerId":4129,
and some are not:
{"_id":{"$oid":"52126317036480948dc2abf1"},"filemakerId":{"$numberLong":"4073"},
These ids will always be a 3 or 4 digit number. It would be easier for me if the dump consistently displayed them as an Int (e.g. "filemakerId":4129). Can mongoexport force this?
An alternative to the mongoexport tool is MongoDB Compass, where you can export your data as JSON or CSV.
First, go to the aggregation console and create an aggregation pipeline.
In the $project stage, convert your data to double or int (I used double).
add this:
{$project:{"filemakerId":{$toDouble:"$filemakerId"}}}
By converting your data to double/int you will no longer get "filemakerId":{"$numberLong":"4073"};
instead you will get "filemakerId":4073.
Finally, click Export and choose the output format, either JSON or CSV.
Hope you'll get your expected result.
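If you would rather stick with mongoexport, a post-processing sketch with jq (the file names are placeholders) that unwraps the $numberLong values:
# Convert {"$numberLong":"4073"} to the plain number 4073, leaving plain ints untouched
jq -c '.filemakerId |= (if type == "object" then .["$numberLong"] | tonumber else . end)' dump.json > fixed.json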

Sqoop HDFS to Couchbase: json file format

I'm trying to export data from HDFS to Couchbase and I have a problem with my file format.
My configuration:
Couchbase server 2.0
Stack hadoop cdh4.1.2
sqoop 1.4.2 (compiled with hadoop2.0.0)
couchbase/hadoop connector (compiled with hadoop2.0.0)
When I run the export command, I can easily export files with this kind of format:
id,"value"
or
id,42
or
id,{"key":"value"}
But when I try a JSON object with more than one field, it doesn't work:
id,{"key1":"value1","key2":"value2"}
The content is truncated at the first comma and displayed in Base64 by Couchbase, because the truncated content is no longer valid JSON...
So my question is: how must the file be formatted to be stored as a JSON document?
Can we only export a key/value file?
I want to export JSON files from HDFS the way cbdocloader does with files from the local file system...
I'm afraid this is expected behavior, as Sqoop parses your input file as CSV with a comma as the separator. You might need to tweak your file format to either escape the separator or enclose the entire JSON string. I would recommend reading about how exactly Sqoop deals with escaping separators and enclosing strings in the user guide [1].
Links:
http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#id387098
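For example, using Sqoop's --input-enclosed-by and --input-escaped-by options (a sketch; whether the Couchbase connector round-trips this correctly is an assumption to verify), you would enclose each JSON value and escape its inner quotes:
id,"{\"key1\":\"value1\",\"key2\":\"value2\"}"
and pass --input-enclosed-by '"' --input-escaped-by '\\' to sqoop export.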
I think your best bet is to convert the files to tab-delimited, if you're still working on this. If you look at the Sqoop documentation (http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html#_large_objects), there's an option --fields-terminated-by which allows you to specify which characters Sqoop splits fields on.
If you passed it --fields-terminated-by '\t', and a tab-delimited file, it would leave the commas in place in your JSON.
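A hypothetical invocation along those lines (the connect URL, table name, and export directory are placeholders for your Couchbase connector setup):
sqoop export --connect http://localhost:8091/pools --table DUMP \
  --export-dir /user/me/json_data --fields-terminated-by '\t'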
@mpiffaretti can you post your sqoop export command? I think each JSON object should have its own key:
key1 {"dataOne":"ValueOne"}
key2 {"dataTwo":"ValueTwo"}
In your case, changing the data as below may help you solve the issue:
id,{"key":"value"}
id2,{"key2":"value2"}
Let me know if you have further questions on it.

Proper way to import json file to mongo

I've been trying to use Mongo with some imported data, but I'm not able to query it properly given my document structure.
This is an example of the .json I import using mongoimport: https://gist.github.com/2917854
mongoimport -d test -c example data.json
I noticed that my whole document is imported as a single object, instead of one object being created for each shop.
That's why when I try to find a shop or anything else I query for, the whole document is returned.
db.example.find({"shops.name":"x"})
I want to be able to query the DB to obtain products by id using dot notation, something similar to:
db.example.find({"shops.name":"x","categories.type":"shirts","clothes.id":"1"})
The problem is that the whole document is imported as a single object. The question is: how do I need to import the object to obtain my desired result?
Docs note that:
This utility takes a single file that contains 1 JSON/CSV/TSV string per line and inserts it.
In the structure you are using (assuming the errors on the gist are fixed) you are essentially importing one document with only a shops field.
After breaking the data into separate shop docs, import using something like (shops being the collection name, which makes more sense than using example):
mongoimport -d test -c shops data.json
and then you can query like:
db.shops.find({"name":"x","categories.type":"shirts"})
There is a parameter --jsonArray:
Accept import of data expressed with multiple MongoDB document within a single JSON array
Using this option you can feed it an array, so you only need to strip the outer object syntax, i.e. everything at the beginning up to and including "shops":, and the } at the end.
Myself, I use a little tool called jq that can extract the array from the command line:
./jq '.shops' shops.json
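You can pipe that straight into mongoimport (a sketch, assuming the same test database and a shops collection; mongoimport reads from stdin when no file is given):
./jq '.shops' shops.json | mongoimport -d test -c shops --jsonArray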
IMPORT FROM JSON
mongoimport --db "databaseName" --collection "collectionName" --type json --file "fileName.json" --jsonArray
The JSON should be in this format (an array of objects with valid, quoted keys):
[
  { "name": "Name1", "msg": "This is msg 1" },
  { "name": "Name2", "msg": "This is msg 2" },
  { "name": "Name3", "msg": "This is msg 3" }
]
IMPORT FROM CSV
mongoimport --db "databaseName" --collection "collectionName" --type csv --file "fileName.csv" --headerline
More Info
https://docs.mongodb.com/getting-started/shell/import-data/
Importing JSON
The mongoimport command allows us to import human-readable JSON into a specific database and collection. To import a JSON file into a specific database and collection, type mongoimport -d databaseName -c collectionName jsonFileName.json