JSON files downloaded by aws s3 sync / aws s3 cp are incomplete

I downloaded a .json file from Amazon S3, but its content is just the value of the first key/value pair.
The original JSON file looks like this:
{
"_1": [
{
"Name": "name",
"Type": "type"
}
]
}
but the downloaded file is not even valid JSON; it only contains the list inside:
[
{
"Name": "name",
"Type": "type"
}
]
I tried aws s3 sync, aws s3 cp, and aws s3api get-object, and all of them give the same result.
I only want to download the original file from the S3 bucket.
Is there any solution?
Updates
I just copied the original content from the S3 Select preview and saved it as a file.
I found out that its MD5 checksum and file size are totally different from the object overview.
It seems the original file in the S3 bucket is corrupted, but I'm not sure how its preview still matches the original content.
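A minimal sketch of one way to compare the two from the CLI (this assumes the object was uploaded in a single part without SSE-KMS, so the ETag is the plain MD5 of the content; downloaded.json is a placeholder for the local copy):
# Stored checksum and size, as S3 reports them
aws s3api head-object --bucket "$BUCKET" --key "$KEY" --query '[ETag,ContentLength]' --output text
# Checksum and size of the downloaded copy
md5sum downloaded.json
wc -c < downloaded.json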

I found that aws s3api select-object-content gives me the same result as the S3 Select preview, but without indentation.
For indentation, I decided to re-indent after receiving the uncorrupted results.
I used the command below to retrieve my JSON files:
aws s3api select-object-content \
--bucket $BUCKET \
--key $KEY --expression "select * from s3object" \
--expression-type 'SQL' \
--input-serialization '{"JSON": {"Type": "LINES"}, "CompressionType": "NONE"}' \
--output-serialization '{"JSON": {}}' \
/dev/stdout | python -mjson.tool > $KEY

Related

Is there a way to use AWS Step Functions input to assemble a command string in a Systems Manager block?

I am creating a Step Functions state machine so that every time an instance starts, it copies a file from S3 to a specific folder inside that instance. The source folder inside the S3 bucket is named with the instance ID. I am passing the instance ID as input to the Systems Manager block, but I need to use it to build the command string that will be executed inside the EC2 instance.
For example:
My input is: $.detail.instance-id (let's assume the following ID: i-11223344556677889)
The Systems Manager API parameters are:
"CloudWatchOutputConfig": {
"CloudWatchLogGroupName": "myloggroup",
"CloudWatchOutputEnabled": true
},
"DocumentName": "AWS-RunShellScript",
"DocumentVersion": "$DEFAULT",
"InstanceIds.$": "States.Array($)",
"MaxConcurrency": "50",
"MaxErrors": "10",
"Parameters": {
"commands": [
"runuser -l ec2-user -c \"aws s3 cp s3://my-bucket/**MY_INSTANCEID**/myfile.xyz /home/ec2-user/myfolder/myfile.xyz\""
]
},
"TimeoutSeconds": 6000
}
Summing up, I want to rewrite the command line, replacing MY_INSTANCEID with my input $.detail.instance-id, so that the following command is executed:
"runuser -l ec2-user -c "aws s3 cp s3://my-bucket/i-11223344556677889/myfile.xyz /home/ec2-user/myfolder/myfile.xyz""
Is there a way? I already tried to use Fn::join without success.
Thank you in advance,
kind regards,
Winner
It was necessary to use States.Format inside States.Array so that it worked, and the States.Format string inside States.Array cannot have quotes:
"CloudWatchOutputConfig": {
"CloudWatchLogGroupName": "myloggroup",
"CloudWatchOutputEnabled": true
},
"DocumentName": "AWS-RunShellScript",
"DocumentVersion": "$DEFAULT",
"InstanceIds.$": "States.Array($)",
"MaxConcurrency": "50",
"MaxErrors": "10",
"Parameters": {
"commands.$": "States.Array(States.Format('runuser -l ec2-user -c \"aws s3 cp s3://my-bucket/{}/myfile.xyz /home/ec2-user/myfolder/myfile.xyz\"', $))"
},
"TimeoutSeconds": 6000
}
It was also necessary to use .$ after commands.

How can I convert JSON to an environment variable file format? [duplicate]

This question already has answers here:
Convert JSON from AWS SSM to environment variables using jq
I am fetching an AWS Parameter Store JSON response using the AWS CLI:
echo $(aws ssm get-parameters-by-path --with-decryption --region eu-central-1 \
--path "${PARAMETER_STORE_PREFIX}") >/home/ubuntu/.env.json
This outputs something like:
{
"Parameters": [
{
"Name": "/EXAMPLE/CORS_ORIGIN",
"Value": "https://example.com"
},
{
"Name": "/EXAMPLE/DATABASE_URL",
"Value": "db://user:pass@host:3306/example"
}
]
}
Instead of writing to a JSON file, I'd like to write to a .env file instead with the following format:
CORS_ORIGIN="https://example.com"
DATABASE_URL="db://user:pass@host:3306/example"
How could I achieve this? I found similar questions like Exporting JSON to environment variables, but they do not deal with nested JSON / arrays.
Easy with jq, some string interpolation, and @sh to properly quote the Value strings for shells:
$ jq -r '.Parameters[] | "\(.Name | .[rindex("/")+1:])=\(.Value | @sh)"' input.json
CORS_ORIGIN='https://example.com'
DATABASE_URL='db://user:pass@host:3306/example'
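As a follow-up, a small sketch of writing that output to the .env file and loading it into the current shell (paths taken from the question; set -a exports every variable the file defines):
jq -r '.Parameters[] | "\(.Name | .[rindex("/")+1:])=\(.Value | @sh)"' /home/ubuntu/.env.json > /home/ubuntu/.env
# export everything defined in the file into the current shell
set -a; . /home/ubuntu/.env; set +a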

CURL Get download link from request and download file

I'm using the conversocial API:
https://api-docs.conversocial.com/1.1/reports/
Using the sample from the documentation, after all my tweaks I receive this output:
{
"report": {
"name": "dump", "generation_start_date": "2012-05-30T17:09:40",
"url": "https://api.conversocial.com/v1.1/reports/5067",
"date_from": "2012-05-21",
"generated_by": {
"url": "https://api.conversocial.com/v1.1/moderators/11599",
"id": "11599"
},
"generated_date": "2012-05-30T17:09:41",
"channel": {
"url": "https://api.conversocial.com/v1.1/channels/387",
"id": "387"
},
"date_to": "2012-05-28",
"download": "https://s3.amazonaws.com/conversocial/reports/70c68360-1234/#twitter-from-may-21-2012-to-may-28-2012.zip",
"id": "5067"
}
}
Currently, I can filter this JSON output down to download only and receive this output:
{
"report" : {
"download" : "https://s3.amazonaws.com/conversocial/reports/70c68360-1234/#twitter-from-may-21-2012-to-may-28-2012.zip"
}
}
Is there any way of automating this process with curl, so that curl downloads the file?
To download, I'm planning to use something simple like:
curl URL_LINK > FILEPATH/EXAMPLE.ZIP
Is there a way to replace URL_LINK with the download link, or any other way or method around this?
Give this a try:
curl $(curl -s https://httpbin.org/get | jq ".url" -r) > file
Just replace your URL and the jq parameters; based on your JSON, that may be:
jq ".report.download" -r
The -r flag removes the double quotes.
It works by using a command substitution $():
$(curl -s https://httpbin.org/get | jq ".url" -r)
This fetches your URL and extracts the download URL from the returned JSON using jq; the result is then passed to curl as an argument.
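Applied to the report JSON above, a hedged version might look like this (REPORT_URL and report.zip are placeholders, and any authentication the conversocial API needs is omitted):
# Fetch the report metadata, extract .report.download, and hand the result to a second curl
curl -o report.zip "$(curl -s "$REPORT_URL" | jq -r '.report.download')"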

Import data from Mongodb on GCE to Bigquery

My task is to import data from a MongoDB collection hosted on GCE into BigQuery. I tried the following. Since BigQuery does not accept the '$' symbol in field names, I ran the following to remove the $oid field:
mongo test --quiet \
--eval "db.trial.find({}, {_id: 0})
.forEach(function(doc) {
print(JSON.stringify(doc)); });" \
> trial_noid.json
But, while importing the result file, I get an error that says
parse error: premature EOF (error code: invalid)
Is there a way to avoid these steps and directly transfer the data to BigQuery from MongoDB hosted on GCE?
In my opinion, the best practice is to build your own extractor. That can be done in the language of your choice, and you can extract to CSV or JSON.
But if you are looking for a fast way, and your data is not huge and can fit on one server, then I recommend using mongoexport to extract to JSON. Let's assume you have a simple document structure such as the one below:
{
"_id" : "tdfMXH0En5of2rZXSQ2wpzVhZ",
"statuses" : [
{
"status" : "dc9e5511-466c-4146-888a-574918cc2534",
"score" : 53.24388894
}
],
"stored_at" : ISODate("2017-04-12T07:04:23.545Z")
}
Then you need to define your BigQuery Schema (mongodb_schema.json) such as:
$ cat > mongodb_schema.json <<EOF
[
{ "name":"_id", "type": "STRING" },
{ "name":"stored_at", "type": "record", "fields": [
{ "name":"date", "type": "STRING" }
]},
{ "name":"statuses", "type": "record", "mode": "repeated", "fields": [
{ "name":"status", "type": "STRING" },
{ "name":"score", "type": "FLOAT" }
]}
]
EOF
Now, the fun part starts :-) Extracting data as JSON from your MongoDB. Let's assume you have a cluster with replica set name statuses, your db is sample, and your collection is status.
mongoexport \
--host statuses/db-01:27017,db-02:27017,db-03:27017 \
-vv \
--db "sample" \
--collection "status" \
--type "json" \
--limit 100000 \
--out ~/sample.json
As you can see above, I limit the output to 100k records because I recommend you run a sample and load it into BigQuery before doing it for all your data. After running the above command you should have your sample data in sample.json, BUT there is a field $date which will cause an error when loading into BigQuery. To fix that, we can use sed to replace it with a simple field name:
# Fix Date field to make it compatible with BQ
sed -i 's/"\$date"/"date"/g' sample.json
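For illustration, a hedged one-liner showing what that sed rewrite does to a constructed sample line (not actual mongoexport output):
echo '{"stored_at":{"$date":"2017-04-12T07:04:23.545Z"}}' | sed 's/"\$date"/"date"/g'
# prints: {"stored_at":{"date":"2017-04-12T07:04:23.545Z"}}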
Now you can compress, upload to Google Cloud Storage (GCS), and then load into BigQuery using the following commands:
# Compress for faster load
gzip sample.json
# Move to GCloud
gsutil mv ./sample.json.gz gs://your-bucket/sample/sample.json.gz
# Load to BQ
bq load \
--source_format=NEWLINE_DELIMITED_JSON \
--max_bad_records=999999 \
--ignore_unknown_values=true \
--encoding=UTF-8 \
--replace \
"YOUR_DATASET.mongodb_sample" \
"gs://your-bucket/sample/*.json.gz" \
"mongodb_schema.json"
If everything went okay, go back, remove --limit 100000 from the mongoexport command, and re-run the commands above to load everything instead of the 100k sample.
With this solution you can import your data into BigQuery with the same hierarchy, but if you want to flatten your data, the alternative solution below will work better.
ALTERNATIVE SOLUTION:
If you want more flexibility and performance is not your concern, then you can use the mongo CLI tool as well. This way you can write your extraction logic in JavaScript, execute it against your data, and then send the output to BigQuery. Here is what I did for the same process, but using JavaScript to output CSV so I could load it into BigQuery more easily:
# Export Logic in JavaScript
cat > export-csv.js <<EOF
var size = 100000;
var maxCount = 1;
for (x = 0; x < maxCount; x = x + 1) {
var recToSkip = x * size;
db.entities.find().skip(recToSkip).limit(size).forEach(function(record) {
var row = record._id + "," + record.stored_at.toISOString();
record.statuses.forEach(function (l) {
print(row + "," + l.status + "," + l.score)
});
});
}
EOF
# Execute on Mongo CLI
_MONGO_HOSTS="db-01:27017,db-02:27017,db-03:27017/sample?replicaSet=statuses"
mongo --quiet \
"${_MONGO_HOSTS}" \
export-csv.js \
| split -l 500000 --filter='gzip > $FILE.csv.gz' - sample_
# Load all Splitted Files to Google Cloud Storage
gsutil -m mv ./sample_* gs://your-bucket/sample/
# Load files to BigQuery
bq load \
--source_format=CSV \
--max_bad_records=999999 \
--ignore_unknown_values=true \
--encoding=UTF-8 \
--replace \
"YOUR_DATASET.mongodb_sample" \
"gs://your-bucket/sample/sample_*.csv.gz" \
"ID,StoredDate:DATETIME,Status,Score:FLOAT"
TIP: In the script above I used a small trick: piping the output through split so that it is broken into multiple files with the sample_ prefix. During the split it also gzips the output, so you can upload it to GCS more easily.
When using NEWLINE_DELIMITED_JSON to import data into BigQuery, one JSON object, including any nested/repeated fields, must appear on each line.
The issue with your input file appears to be that the JSON object is split across many lines; if you collapse it to a single line, it will resolve this error.
Requiring this format allows BigQuery to split the file and process it in parallel without being concerned that splitting the file will put one part of a JSON object in one split, and another part in the next split.
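If your file is pretty-printed, one hedged way to collapse it into newline-delimited JSON before loading is jq's compact output (trial_noid.json is the file from the question; the second form assumes the documents sit in a top-level array):
# One compact object per line, ready for NEWLINE_DELIMITED_JSON
jq -c '.' trial_noid.json > trial_nd.json
# If the documents are wrapped in a top-level array, unwrap them first
jq -c '.[]' trial_noid.json > trial_nd.json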

Is there any way to import a JSON file (containing 100 documents) into an Elasticsearch server?

Is there any way to import a JSON file (containing 100 documents) into an Elasticsearch server? I want to import a big JSON file into the ES server.
As dadoonet already mentioned, the bulk API is probably the way to go. To transform your file for the bulk protocol, you can use jq.
Assuming the file contains just the documents themselves:
$ echo '{"foo":"bar"}{"baz":"qux"}' |
jq -c '
{ index: { _index: "myindex", _type: "mytype" } },
. '
{"index":{"_index":"myindex","_type":"mytype"}}
{"foo":"bar"}
{"index":{"_index":"myindex","_type":"mytype"}}
{"baz":"qux"}
And if the file contains the documents in a top-level list, they have to be unwrapped first:
$ echo '[{"foo":"bar"},{"baz":"qux"}]' |
jq -c '
.[] |
{ index: { _index: "myindex", _type: "mytype" } },
. '
{"index":{"_index":"myindex","_type":"mytype"}}
{"foo":"bar"}
{"index":{"_index":"myindex","_type":"mytype"}}
{"baz":"qux"}
jq's -c flag makes sure that each document is on a line by itself.
If you want to pipe straight to curl, you'll want to use --data-binary @-, and not just -d, otherwise curl will strip the newlines again.
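Putting the transform and the upload together, a minimal sketch of the whole pipeline (docs.json, myindex, and mytype are assumptions; the Content-Type header is needed on newer Elasticsearch versions):
# Emit an action line plus a source line per document, then POST the stream in bulk
jq -c '{ index: { _index: "myindex", _type: "mytype" } }, .' docs.json |
curl -s -XPOST localhost:9200/_bulk -H 'Content-Type: application/x-ndjson' --data-binary @-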
You should use the Bulk API. Note that you will need to add a header line before each JSON document.
$ cat requests
{ "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }
{ "field1" : "value1" }
$ curl -s -XPOST localhost:9200/_bulk --data-binary @requests; echo
{"took":7,"items":[{"create":{"_index":"test","_type":"type1","_id":"1","_version":1,"ok":true}}]}
I'm sure someone wants this so I'll make it easy to find.
FYI - This is using Node.js (essentially as a batch script) on the same server as the brand new ES instance. Ran it on 2 files with 4000 items each and it only took about 12 seconds on my shared virtual server. YMMV
var elasticsearch = require('elasticsearch'),
fs = require('fs'),
pubs = JSON.parse(fs.readFileSync(__dirname + '/pubs.json')), // name of my first file to parse
forms = JSON.parse(fs.readFileSync(__dirname + '/forms.json')); // and the second set
var client = new elasticsearch.Client({ // default is fine for me, change as you see fit
host: 'localhost:9200',
log: 'trace'
});
for (var i = 0; i < pubs.length; i++ ) {
client.create({
index: "epubs", // name your index
type: "pub", // describe the data thats getting created
id: i, // increment ID every iteration - I already sorted mine but not a requirement
body: pubs[i] // *** THIS ASSUMES YOUR DATA FILE IS FORMATTED LIKE SO: [{prop: val, prop2: val2}, {prop:...}, {prop:...}] - I converted mine from a CSV so pubs[i] is the current object {prop:..., prop2:...}
}, function(error, response) {
if (error) {
console.error(error);
return;
}
else {
console.log(response); // I don't recommend this but I like having my console flooded with stuff. It looks cool. Like I'm compiling a kernel really fast.
}
});
}
for (var a = 0; a < forms.length; a++ ) { // Same stuff here, just slight changes in type and variables
client.create({
index: "epubs",
type: "form",
id: a,
body: forms[a]
}, function(error, response) {
if (error) {
console.error(error);
return;
}
else {
console.log(response);
}
});
}
Hope I can help more than just myself with this. Not rocket science but may save someone 10 minutes.
Cheers
jq is a lightweight and flexible command-line JSON processor.
Usage:
cat file.json | jq -c '.[] | {"index": {"_index": "bookmarks", "_type": "bookmark", "_id": .id}}, .' | curl -XPOST localhost:9200/_bulk --data-binary @-
We’re taking the file file.json and piping its contents to jq first with the -c flag to construct compact output. Here’s the nugget: We’re taking advantage of the fact that jq can construct not only one but multiple objects per line of input. For each line, we’re creating the control JSON Elasticsearch needs (with the ID from our original object) and creating a second line that is just our original JSON object (.).
At this point we have our JSON formatted the way Elasticsearch’s bulk API expects it, so we just pipe it to curl which POSTs it to Elasticsearch!
Credit goes to Kevin Marsh
Import no, but you can index the documents by using the ES API.
You can use the index api to load each line (using some kind of code to read the file and make the curl calls) or the index bulk api to load them all. Assuming your data file can be formatted to work with it.
Read more here : ES API
A simple shell script would do the trick if you are comfortable with the shell; something like this maybe (not tested):
while read line
do
curl -XPOST 'http://localhost:9200/<indexname>/<typeofdoc>/' -d "$line"
done <myfile.json
Personally, I would probably use Python, either pyes or the elasticsearch client.
pyes on github
elastic search python client
Stream2es is also very useful for quickly loading data into ES, and it may have a way to simply stream a file in. (I have not tested it with a file, but I have used it to load Wikipedia docs for ES performance testing.)
Stream2es is the easiest way IMO.
e.g. assuming a file "some.json" containing a list of JSON documents, one per line:
curl -O download.elasticsearch.org/stream2es/stream2es; chmod +x stream2es
cat some.json | ./stream2es stdin --target "http://localhost:9200/my_index/my_type"
You can use esbulk, a fast and simple bulk indexer:
$ esbulk -index myindex file.ldj
Here's an asciicast showing it loading Project Gutenberg data into Elasticsearch in about 11s.
Disclaimer: I'm the author.
You can use the Elasticsearch Gatherer plugin.
The gatherer plugin for Elasticsearch is a framework for scalable data fetching and indexing. Content adapters are implemented in gatherer zip archives, which are a special kind of plugin distributable over Elasticsearch nodes. They can receive job requests and execute them in local queues. Job states are maintained in a special index.
This plugin is under development.
Milestone 1 - deploy gatherer zips to nodes
Milestone 2 - job specification and execution
Milestone 3 - porting JDBC river to JDBC gatherer
Milestone 4 - gatherer job distribution by load/queue length/node name, cron jobs
Milestone 5 - more gatherers, more content adapters
reference https://github.com/jprante/elasticsearch-gatherer
One way is to create a bash script that does a bulk insert:
curl -XPOST http://127.0.0.1:9200/myindexname/type/_bulk?pretty=true --data-binary @myjsonfile.json
After you run the insert, run this command to get the count:
curl http://127.0.0.1:9200/myindexname/type/_count