While there are several posts about this topic on Stack Overflow, none match my exact use case. I am using a Linux shell script to run SnowSQL to generate a JSON file.
My json file needs to have a comma between json objects.
This:
{
"CAMPAIGN": "Welcome_New",
"UUID": "fe881781-bdc2-41b2-95f2-e0e8c19dc597"
}
{
"CAMPAIGN": "Welcome_Existing",
"UUID": "77a41c02-beb9-48bf-ada4-b2074c1a78cb"
}
...needs to look like this:
{
"CAMPAIGN": "Welcome_New",
"UUID": "fe881781-bdc2-41b2-95f2-e0e8c19dc597"
},
{
"CAMPAIGN": "Welcome_Existing",
"UUID": "77a41c02-beb9-48bf-ada4-b2074c1a78cb"
}
Here is my complete ksh script:
#!/usr/bin/ksh
. /appl/.snf_logon
export SNOW_PKEY_FILE=$(mktemp ./pkey-XXXXXX)
trap "rm -f ${SNOW_PKEY_FILE}" EXIT
LibGetSnowCred
{
outFile=JSON_FILE_TYPE_TEST.json
inDir=/testing
outFileNm=@my_db.my_schema.my_file_stage/${outFile}
snowsql \
--private-key-path $SNOW_PKEY_FILE \
-o exit_on_error=true \
-o friendly=false \
-o timing=false \
-o log_level=ERROR \
-o echo=true <<!
COPY INTO ${outFileNm}
FROM (SELECT object_construct(
'UUID',UUID
,'CAMPAIGN',CAMPAIGN)
FROM my_db.my_schema.JSON_Test_Table
LIMIT 2)
FILE_FORMAT=(
TYPE=JSON
COMPRESSION=NONE
)
OVERWRITE=True
HEADER=False
SINGLE=True
MAX_FILE_SIZE=4900000000
;
get ${outFileNm} file://${inDir}/;
rm ${outFileNm};
!
if [ $? -eq 0 ]; then
echo "Export successful"
else
echo "ERROR in export"
fi
}
Is the best practice to add the comma during the SELECT, or after the file is generated, and how?
With or without that comma, the text is still not valid JSON, just text that looks like JSON. You are exporting several rows, each row as an independent object. You need to gather all these objects into an array to produce valid JSON.
A JSON that encodes an array of rows looks like this:
[
{
"CAMPAIGN": "Welcome_New",
"UUID": "fe881781-bdc2-41b2-95f2-e0e8c19dc597"
},
{
"CAMPAIGN": "Welcome_Existing",
"UUID": "77a41c02-beb9-48bf-ada4-b2074c1a78cb"
}
]
The easiest way to produce this output would be to ask the database, if it supports this option, to wrap all the records into a single array before generating the JSON, rather than exporting each record as a separate JSON object.
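In Snowflake specifically, aggregating the per-row objects with array_agg before the COPY INTO should produce a single array per file. This is an untested sketch based on the query in the question (note that a single VARIANT/array value in Snowflake is limited to 16 MB, so it only works for modest result sets):
COPY INTO ${outFileNm}
FROM (SELECT array_agg(object_construct(
        'UUID',UUID
       ,'CAMPAIGN',CAMPAIGN))
      FROM my_db.my_schema.JSON_Test_Table)
FILE_FORMAT=(
TYPE=JSON
COMPRESSION=NONE
)
OVERWRITE=True
SINGLE=True
;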
If this is not possible then you have a file that contains multiple JSONs. You can use jq to convert these individual JSONs into a JSON similar to the one described above (encoding an array of objects).
It is as simple as that:
jq --slurp '.' input_file > output_file
The option --slurp tells jq to read all the JSONs from the file input_file in memory, to parse them and to put them into an array. That is the program input.
'.' is the jq program. It says "dump the current object". It does not do any processing to the input data. The current object is the array.
After it executes the program (which, in this case, doesn't do anything), jq dumps the modified value (as JSON, of course) to the standard output (by default, on screen).
The > output_file part redirects this output to a file (named output_file) instead of showing it on screen.
You can see how it works on the jq playground.
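As a quick illustration with made-up input:
$ echo '{"a": 1} {"b": 2}' | jq --slurp '.'
[
  {
    "a": 1
  },
  {
    "b": 2
  }
]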
I downloaded a .json file from Amazon S3, but its content is just the value of the first key/value pair.
The original JSON file is like this:
{
"_1": [
{
"Name": "name",
"Type": "type"
}
]
}
but the downloaded JSON file is not even a JSON object; it only has the list inside.
[
{
"Name": "name",
"Type": "type"
}
]
I tried aws s3 sync, aws s3 cp, and aws s3api get-object, and the result is the same for all of them.
I only want to download the original file from the S3 bucket.
Is there any solution?
Updates
I just copied the original content from the S3 Select preview and saved it as a file.
I found out its md5 checksum and file size are totally different from those shown in the object overview.
It seems the original file in the S3 bucket is corrupted, but I'm not sure how its preview is still the same as the original content.
I found out that aws s3api select-object-content gives me the same result as the S3 Select preview, but without indentation.
For the indentation, I decided to re-indent after receiving the uncorrupted results.
I used the command below to retrieve my JSON files.
aws s3api select-object-content \
--bucket $BUCKET \
--key $KEY --expression "select * from s3object" \
--expression-type 'SQL' \
--input-serialization '{"JSON": {"Type": "LINES"}, "CompressionType": "NONE"}' \
--output-serialization '{"JSON": {}}' \
/dev/stdout | python -mjson.tool > $KEY
I'm new to ElasticSearch and Kibana and am having trouble getting Kibana to recognise my timestamps.
I have a JSON file with lots of data that I wish to insert into Elasticsearch using Curl. Here is an example of one of the JSON entries.
{"index":{"_id":"63"}}
{"account_number":63,"firstname":"Hughes","lastname":"Owens", "email":"hughesowens#valpreal.com", "_timestamp":"2013-07-05T08:49:30.123"}
I have tried to create an index in Elasticsearch using the command:
curl -XPUT 'http://localhost:9200/test/'
I have then tried to set up an appropriate mapping for the timestamp:
curl -XPUT 'http://localhost:9200/test/container/_mapping' -d'
{
"container" : {
"_timestamp" : {
"_timestamp" : {"enabled: true, "type":"date", "format": "date_hour_minute_second_fraction", "store":true}
}
}
}'
// format of timestamp from http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-date-format.html
I then have tried to bulk insert my data:
curl -XPOST 'localhost:9200/test/container/_bulk?pretty' --data-binary @myfile.json
All of these commands run without fault; however, when the data is viewed in Kibana the _timestamp field is not being recognised. Sorting via the timestamp does not work, and trying to filter the data using different periods does not work. Any ideas on why this problem is occurring are appreciated.
Managed to solve the problem. So for anyone else having this problem:
The format we had our date saved in was incorrect, needed to be :
"_timestamp":"2013-07-05 08:49:30.123"
then our mapping needed to be:
curl -XPUT 'http://localhost:9200/test/container/_mapping' -d'
{
"container" : {
"_timestamp" : {"enabled": true, "type":"date", "format": "yyyy-MM-dd HH:mm:ss.SSS", "store":true, "path" : "_timestamp"}
}
}'
Hope this helps someone.
There is no need to make an ISO8601 date in case you have an epoch timestamp. To make Kibana recognize the field as a date, though, it has to be a date field.
Please note that you have to set the field as date type BEFORE you input any data into the /index/type. Otherwise it will be stored as long and unchangeable.
Simple example that can be pasted into the marvel/sense plugin:
# Make sure the index isn't there
DELETE /logger
# Create the index
PUT /logger
# Add the mapping of properties to the document type `mem`
PUT /logger/_mapping/mem
{
"mem": {
"properties": {
"timestamp": {
"type": "date"
},
"free": {
"type": "long"
}
}
}
}
# Inspect the newly created mapping
GET /logger/_mapping/mem
Run each of these commands in sequence.
Generate free mem logs
Here is a simple script that echo to your terminal and logs to your local elasticsearch:
while (( 1==1 )); do memfree=`free -b|tail -n 1|tr -s ' ' ' '|cut -d ' ' -f4`; echo $memfree; curl -XPOST "localhost:9200/logger/mem" -d "{ \"timestamp\": `date +%s%3N`, \"free\": $memfree }"; sleep 1; done
Inspect data in elastic search
Paste this in your marvel/sense
GET /logger/mem/_search
Now you can move to Kibana and do some graphs. Kibana will autodetect your date field.
This solution works for older ES versions (< 2.4).
For newer versions of ES, you can use the "date" field type along with the parameters documented here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/date.html
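For instance, a mapping along these lines should accept the epoch-millisecond timestamps produced by the script above (the index and field names are placeholders, and the type-less mapping syntax is for ES 7+, so adjust for your version):
curl -XPUT 'localhost:9200/logger' -H 'Content-Type: application/json' -d '
{
  "mappings": {
    "properties": {
      "timestamp": {
        "type": "date",
        "format": "epoch_millis"
      }
    }
  }
}'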
I have a shell script with a curl -s http://ifconfig.me/all.json command which prints the output below in the terminal window.
{
"version" : {
"ip_addr": "201.73.103.12",
"lang": "java",
"remote_host": "OpenSSL/0.9.8w zlib/1.2.3 libidn/1.23 libssh2/1.2.2",",
"user_agent": "curl/7.23.1 (i386-sun-solaris2.11) libcurl/7.23.1
"charset": "",
"port": "63713"}
}
I need to display the JSON values in table format.
Could someone please help me implement this in a UNIX shell script? Thanks!
Look into the cool command-line JSON query parser / formatter tool:
http://stedolan.github.io/jq/
You can pipe your JSON input into this tool and extract out / format the keys you need.
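For example, an untested one-liner along these lines, using the field layout from the output above (and assuming a bash-like shell for the $'\t' separator), would print each key and value as a two-column table:
curl -s http://ifconfig.me/all.json | jq -r '.version | to_entries[] | "\(.key)\t\(.value)"' | column -t -s $'\t'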
Or, just run multiple curl -s http://ifconfig.me/ calls, one for each "row/column" you need, and output them in some sane format, e.g.:
#!/bin/bash
ip=`curl -s http://ifconfig.me/ip`
host=`curl -s http://ifconfig.me/host`
echo -e "ip\thost"
echo -e "${ip}\t${host}"
Is there any way to import a JSON file (containing 100 documents) into an Elasticsearch server? I want to import a big JSON file into the ES server.
As dadoonet already mentioned, the bulk API is probably the way to go. To transform your file for the bulk protocol, you can use jq.
Assuming the file contains just the documents themselves:
$ echo '{"foo":"bar"}{"baz":"qux"}' |
jq -c '
{ index: { _index: "myindex", _type: "mytype" } },
. '
{"index":{"_index":"myindex","_type":"mytype"}}
{"foo":"bar"}
{"index":{"_index":"myindex","_type":"mytype"}}
{"baz":"qux"}
And if the file contains the documents in a top level list they have to be unwrapped first:
$ echo '[{"foo":"bar"},{"baz":"qux"}]' |
jq -c '
.[] |
{ index: { _index: "myindex", _type: "mytype" } },
. '
{"index":{"_index":"myindex","_type":"mytype"}}
{"foo":"bar"}
{"index":{"_index":"myindex","_type":"mytype"}}
{"baz":"qux"}
jq's -c flag makes sure that each document is on a line by itself.
If you want to pipe straight to curl, you'll want to use --data-binary @-, and not just -d, otherwise curl will strip the newlines again.
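Putting it together, a complete pipeline could look something like this (untested sketch; it assumes your documents are in a top-level array in a file, here called file.json, and reuses the index and type names from above):
jq -c '.[] | { index: { _index: "myindex", _type: "mytype" } }, .' file.json |
  curl -s -XPOST localhost:9200/_bulk --data-binary @-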
You should use the Bulk API. Note that you will need to add a header line before each JSON document.
$ cat requests
{ "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }
{ "field1" : "value1" }
$ curl -s -XPOST localhost:9200/_bulk --data-binary @requests; echo
{"took":7,"items":[{"create":{"_index":"test","_type":"type1","_id":"1","_version":1,"ok":true}}]}
I'm sure someone wants this so I'll make it easy to find.
FYI - This is using Node.js (essentially as a batch script) on the same server as the brand new ES instance. Ran it on 2 files with 4000 items each and it only took about 12 seconds on my shared virtual server. YMMV
var elasticsearch = require('elasticsearch'),
fs = require('fs'),
pubs = JSON.parse(fs.readFileSync(__dirname + '/pubs.json')), // name of my first file to parse
forms = JSON.parse(fs.readFileSync(__dirname + '/forms.json')); // and the second set
var client = new elasticsearch.Client({ // default is fine for me, change as you see fit
host: 'localhost:9200',
log: 'trace'
});
for (var i = 0; i < pubs.length; i++ ) {
client.create({
index: "epubs", // name your index
type: "pub", // describe the data thats getting created
id: i, // increment ID every iteration - I already sorted mine but not a requirement
body: pubs[i] // *** THIS ASSUMES YOUR DATA FILE IS FORMATTED LIKE SO: [{prop: val, prop2: val2}, {prop:...}, {prop:...}] - I converted mine from a CSV so pubs[i] is the current object {prop:..., prop2:...}
}, function(error, response) {
if (error) {
console.error(error);
return;
}
else {
console.log(response); // I don't recommend this but I like having my console flooded with stuff. It looks cool. Like I'm compiling a kernel really fast.
}
});
}
for (var a = 0; a < forms.length; a++ ) { // Same stuff here, just slight changes in type and variables
client.create({
index: "epubs",
type: "form",
id: a,
body: forms[a]
}, function(error, response) {
if (error) {
console.error(error);
return;
}
else {
console.log(response);
}
});
}
Hope I can help more than just myself with this. Not rocket science but may save someone 10 minutes.
Cheers
jq is a lightweight and flexible command-line JSON processor.
Usage:
cat file.json | jq -c '.[] | {"index": {"_index": "bookmarks", "_type": "bookmark", "_id": .id}}, .' | curl -XPOST localhost:9200/_bulk --data-binary @-
We’re taking the file file.json and piping its contents to jq first with the -c flag to construct compact output. Here’s the nugget: We’re taking advantage of the fact that jq can construct not only one but multiple objects per line of input. For each line, we’re creating the control JSON Elasticsearch needs (with the ID from our original object) and creating a second line that is just our original JSON object (.).
At this point we have our JSON formatted the way Elasticsearch’s bulk API expects it, so we just pipe it to curl which POSTs it to Elasticsearch!
Credit goes to Kevin Marsh
Import no, but you can index the documents by using the ES API.
You can use the index api to load each line (using some kind of code to read the file and make the curl calls) or the index bulk api to load them all. Assuming your data file can be formatted to work with it.
Read more here : ES API
A simple shell script would do the trick if you are comfortable with shell, something like this maybe (not tested):
while read -r line
do
  curl -XPOST 'http://localhost:9200/<indexname>/<typeofdoc>/' -d "$line"
done < myfile.json
Personally, I would probably use Python, either pyes or the elasticsearch client.
pyes on github
elastic search python client
Stream2es is also very useful for quickly loading data into ES and may have a way to simply stream a file in. (I have not tested a file, but I have used it to load Wikipedia docs for ES perf testing.)
Stream2es is the easiest way IMO.
e.g. assuming a file "some.json" containing a list of JSON documents, one per line:
curl -O download.elasticsearch.org/stream2es/stream2es; chmod +x stream2es
cat some.json | ./stream2es stdin --target "http://localhost:9200/my_index/my_type"
You can use esbulk, a fast and simple bulk indexer:
$ esbulk -index myindex file.ldj
Here's an asciicast showing it loading Project Gutenberg data into Elasticsearch in about 11s.
Disclaimer: I'm the author.
You can use the Elasticsearch Gatherer Plugin.
The gatherer plugin for Elasticsearch is a framework for scalable data fetching and indexing. Content adapters are implemented in gatherer zip archives which are a special kind of plugins distributable over Elasticsearch nodes. They can receive job requests and execute them in local queues. Job states are maintained in a special index.
This plugin is under development.
Milestone 1 - deploy gatherer zips to nodes
Milestone 2 - job specification and execution
Milestone 3 - porting JDBC river to JDBC gatherer
Milestone 4 - gatherer job distribution by load/queue length/node name, cron jobs
Milestone 5 - more gatherers, more content adapters
Reference: https://github.com/jprante/elasticsearch-gatherer
One way is to create a bash script that does a bulk insert:
curl -XPOST http://127.0.0.1:9200/myindexname/type/_bulk?pretty=true --data-binary @myjsonfile.json
After you run the insert, run this command to get the count:
curl http://127.0.0.1:9200/myindexname/type/_count