Is there any way in Elasticsearch to get results as a CSV file via the curl API?

I am using Elasticsearch.
I need the results from Elasticsearch as a CSV file.
Is there any curl URL or any plugin to achieve this?

I've done just this using cURL and jq ("like sed, but for JSON"). For example, you can do the following to get CSV output for the top 20 values of a given facet:
$ curl -X GET 'http://localhost:9200/myindex/item/_search?from=0&size=0' -d '
{
  "from": 0,
  "size": 0,
  "facets": {
    "sourceResource.subject.name": {
      "global": true,
      "terms": {
        "order": "count",
        "size": 20,
        "all_terms": true,
        "field": "sourceResource.subject.name.not_analyzed"
      }
    }
  },
  "sort": [
    {
      "_score": "desc"
    }
  ],
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      }
    }
  }
}' | jq -r '.facets["sourceResource.subject.name"].terms[] | [.term, .count] | @csv'
"United States",33755
"Charities--Massachusetts",8304
"Almshouses--Massachusetts--Tewksbury",8304
"Shields",4232
"Coat of arms",4214
"Springfield College",3422
"Men",3136
"Trees",3086
"Session Laws--Massachusetts",2668
"Baseball players",2543
"Animals",2527
"Books",2119
"Women",2004
"Landscape",1940
"Floral",1821
"Architecture, Domestic--Lowell (Mass)--History",1785
"Parks",1745
"Buildings",1730
"Houses",1611
"Snow",1579

I've used Python successfully, and the scripting approach is intuitive and concise. The ES client for Python makes life easy. First grab the latest Elasticsearch client for Python here:
http://www.elasticsearch.org/blog/unleash-the-clients-ruby-python-php-perl/#python
Then your Python script can include calls like:
import csv

import elasticsearch

es = elasticsearch.Elasticsearch(["10.1.1.1:9200"])
# this returns up to 500 rows, adjust to your needs
res = es.search(index="YourIndexName", body={"query": {"match": {"title": "elasticsearch"}}}, size=500)
sample = res['hits']['hits']

# then open a csv file, and loop through the results, writing to the csv
with open('outputfile.tsv', 'wb') as csvfile:
    # we use TAB delimited, to handle cases where freeform text may have a comma
    filewriter = csv.writer(csvfile, delimiter='\t', quotechar='|', quoting=csv.QUOTE_MINIMAL)
    # create column header row
    filewriter.writerow(["column1", "column2", "column3"])  # change the column labels here
    for hit in sample:
        # fill columns 1, 2, 3 with your data; replace these nested key names with your own
        col1 = hit["_source"]["some"]["deeply"]["nested"]["field"].decode('utf-8')
        col1 = col1.replace('\n', ' ')
        # col2 = ..., col3 = ..., etc.
        filewriter.writerow([col1, col2, col3])
You may want to wrap the nested key lookups in try/except error handling, since documents are unstructured and may not always contain a given field (it depends on your index).
I have a complete Python sample script using the latest ES python client available here:
https://github.com/jeffsteinmetz/pyes2csv

You can use the Elasticsearch head plugin.
Once you have the plugin installed, it is available at:
http://localhost:9200/_plugin/head/
Navigate to the structured query tab, provide the query details, and select the 'csv' format from the 'Output Results' dropdown.

I don't think there is a plugin that will give you CSV results directly from the search engine, so you will have to query ElasticSearch to retrieve results and then write them to a CSV file.
Command line
If you're on a Unix-like OS, then you might be able to make some headway with es2unix, which gives you search results back in raw text format on the command line and so should be scriptable.
You could then dump those results to a text file, or pipe them to awk or similar to format them as CSV, as in the sketch below. There is a -o flag available, but it only gives 'raw' format at the moment.
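For example, assuming the raw output is whitespace-separated columns and no field value itself contains whitespace, an awk one-liner like this would reshape it into CSV (a rough sketch; the es2unix invocation is left as a placeholder, so check its README for the exact syntax):
$ es search ... | awk -v OFS=',' '{ $1 = $1; print }' > results.csv
The $1 = $1 assignment forces awk to rebuild each record using the comma output field separator.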
Java
I found an example using Java - but haven't tested it.
Python
You could query ElasticSearch with something like pyes and write the results set to a file with the standard csv writer library.
Perl
With Perl, you could use Clinton Gormley's gist linked by Rakesh - https://gist.github.com/clintongormley/2049562

Shameless plug. I wrote estab - a command-line program to export Elasticsearch documents to tab-separated values.
Example:
$ export MYINDEX=localhost:9200/test/default/
$ curl -XPOST $MYINDEX -d '{"name": "Tim", "color": {"fav": "red"}}'
$ curl -XPOST $MYINDEX -d '{"name": "Alice", "color": {"fav": "yellow"}}'
$ curl -XPOST $MYINDEX -d '{"name": "Brian", "color": {"fav": "green"}}'
$ estab -indices "test" -f "name color.fav"
Brian green
Tim red
Alice yellow
estab can handle exports from multiple indices, custom queries, missing values, lists of values, and nested fields, and it's reasonably fast.

If you are using Kibana (app/discover in general), you can build your query in the UI, then save it and use Share -> CSV Reports. This creates a CSV with a line for each record, with comma-separated columns.

I have been using stash-query (https://github.com/robbydyer/stash-query) for this.
I find it quite convenient and it works well, though I struggle with the install every time I redo it (this is due to me not being very fluent with gems and Ruby).
On Ubuntu 16.04, though, what seemed to work was:
apt install ruby                                                      # installs Ruby
sudo apt-get install libcurl3 libcurl3-gnutls libcurl4-openssl-dev    # curl dependencies for Ruby, since stash-query works via the Elasticsearch REST API
gem install stash-query                                               # installs stash-query
and then you should be good to go.
This blog post describes how to build it as well:
https://robbydyer.wordpress.com/2014/08/25/exporting-from-kibana/

You can use elasticsearch2csv, a small and effective Python 3 script that uses the Elasticsearch scroll API and handles big query responses.
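If you would rather stay in plain curl, the same scroll-based idea looks roughly like this (a sketch against an older Elasticsearch API; myindex, field1, and field2 are placeholders to adapt to your data):
# open a scroll context, then keep paging until a request returns no hits
resp=$(curl -s 'http://localhost:9200/myindex/_search?scroll=1m' -d '{"size": 1000}')
while :; do
  rows=$(echo "$resp" | jq -r '.hits.hits[]._source | [.field1, .field2] | @csv')
  [ -z "$rows" ] && break
  echo "$rows"
  # older ES versions accept the scroll id as the raw request body
  id=$(echo "$resp" | jq -r '._scroll_id')
  resp=$(curl -s 'http://localhost:9200/_search/scroll?scroll=1m' -d "$id")
done > outputfile.csv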

You can use Clinton Gormley's gist. It's simple, it's in Perl, and you can get some help from it.
Please download it and see the usage on GitHub. Here is the link:
https://gist.github.com/clintongormley/2049562
Or, if you want Java, then go for elasticsearch-river-csv.

Related

How to write a correct mongodb query for mongodump?

I'm trying to backup 3 articles from my database, I have their IDs but when I try to use mongodump I just can't seem to be able to write the proper json query. I get either a JSON error message, or some cryptic cannot decode objectID into a slice message.
Here's the command that I'm trying to run at the moment:
mongodump -d 'data' -c 'articles' -q '{"$oid": "5fa0bd32f7d5870029c7d421" }'
This is returning the ObjectID into a slice error, which I don't really understand. I also tried with ObjectId, like this:
mongodump -d 'data' -c 'articles' -q '{"_id": ObjectId("5fa0bd32f7d5870029c7d421") }'
But this one gives me a invalid JSON error.
I've tried all forms of escaping, escaping the double quotes, escaping the dollar, but nothing NOTHING seems to work. I'm desperate, and I hate mongodb. The closest I've been able to get to a working solution was this:
mongodump -d 'nikkei' -c 'articles' -q '{"_id": "ObjectId(5fa0bd32f7d5870029c7d421)" }'
And I say closest because this didn't fail, the command ran but it returned done dumping data.articles (0 documents) which means, if I understood correctly, that no articles were saved.
What would be the correct format for the query? I'm using mongodump version r4.2.2 by the way.
I have a collection with these 4 documents:
> db.test.find()
{ "_id" : ObjectId("5fab80615397db06f00503c3") }
{ "_id" : ObjectId("5fab80635397db06f00503c4") }
{ "_id" : ObjectId("5fab80645397db06f00503c5") }
{ "_id" : ObjectId("5fab80645397db06f00503c6") }
I make the binary export using mongodump. This is on MongoDB v4.2 on Windows.
>> mongodump --db=test --collection=test --query="{ \"_id\": { \"$eq\" : { \"$oid\": \"5fab80615397db06f00503c3\" } } }"
2020-11-11T11:42:13.705+0530 writing test.test to dump\test\test.bson
2020-11-11T11:42:13.737+0530 done dumping test.test (1 document)
Here's an answer for those using Python.
Note: you must have the MongoDB database tools installed on your system.
import json
import os

# insert your query here; mongodump expects extended JSON, so ObjectIds use the $oid form
query = {"_id": {"$oid": "5fa0bd32f7d5870029c7d421"}}
# cast the query to a string
query = json.dumps(query)
# run the mongodump
command = f"mongodump --db my_database --collection my_collection --query '{query}'"
os.system(command)
If your query is plain JSON, then try this format:
mongodump -d=nikkei -c=articles -q '{"_id": "ObjectId(5fa0bd32f7d5870029c7d421)" }'
Is there nothing else you could query on, though, like a title? That might make things a little simpler.
I pulled this from the MongoDB docs. It was pretty far down the page, but here is the link.
https://docs.mongodb.com/database-tools/mongodump/#usage-in-backup-strategy

How can I format a json file into a bash environment variable?

I'm trying to take the contents of a config file (JSON format), strip out extraneous new lines and spaces to be concise and then assign it to an environment variable before starting my application.
This is where I've got so far:
pwr_config=`echo "console.log(JSON.stringify(JSON.parse(require('fs').readFileSync(process.argv[2], 'utf-8'))));" | node - config.json | xargs -0 printf '%q\n'` npm run start
This pipes a short node.js app into the node runtime taking an argument of the file name and it parses and stringifies the JSON file to validate it and remove any unnecessary whitespace. So far so good.
The result of this is then piped to printf, or at least it would be, but printf doesn't support input in this way, apparently, so I'm using xargs to pass it in a way that it supports.
I'm using the %q formatter to format the string, escaping any characters that would be a problem as part of a command, but when calling printf through xargs, printf claims it doesn't support %q. I think this is perhaps because there is more than one version of printf, but I'm not exactly sure how to resolve that.
Any help would be appreciated, even if the solution is completely different from what I've started :) Thanks!
Update
Here's the output I get on MacOS:
$ cat config.json | xargs -0 printf %q
printf: illegal format character q
My JSON file looks like this:
{
  "hue_host": "192.168.1.2",
  "hue_username": "myUsername",
  "port": 12000,
  "player_group_config": [
    {
      "name": "Family Room",
      "player_uuid": "ATVUID",
      "hue_group": "3",
      "on_events": ["media.play", "media.resume"],
      "off_events": ["media.stop", "media.pause"]
    },
    {
      "name": "Lounge",
      "player_uuid": "STVUID",
      "hue_group": "1",
      "on_events": ["media.play", "media.resume"],
      "off_events": ["media.stop", "media.pause"]
    }
  ]
}
Two ways:
Use xargs to invoke bash so that printf resolves to bash's builtin (which supports %q) instead of the printf(1) executable, probably /usr/bin/printf (thanks to @GordonDavisson):
pwr_config=`echo "console.log(JSON.stringify(JSON.parse(require('fs').readFileSync(process.argv[2], 'utf-8'))));" | node - config.json | xargs -0 bash -c 'printf "%q\n" "$0"'` npm run start
Simpler: you don't have to escape the output of a command if you quote it. In the same way that echo "<|>" is OK in bash, this should also work:
pwr_config="$(echo "console.log(JSON.stringify(JSON.parse(require('fs').readFileSync(process.argv[2], 'utf-8'))));" | node - config.json )" npm run start
This uses the newer $(...) form instead of `...`, and so the result of the command is a single word stored as-is into the pwr_config variable.*
Even simpler: if your npm run start script cares about the whitespace in your JSON, it's fundamentally broken :). Just do:
pwr_config="$(< config.json)" npm run start
The $(<...) returns the contents of config.json. They are all stored as a single word ("") into pwr_config, newlines and all.* If something breaks, either config.json has an error and should be fixed, or the code you're running has an error and needs to be fixed.
* You actually don't need the "" around $(). E.g., foo=$(echo a b c) and foo="$(echo a b c)" have the same effect. However, I like to include the "" to remind myself that I am specifically asking for all the text to be kept together.

Importing JSON dataset into Cassandra

I have a large dataset consisting of about 80,000 records. I want to import this into Cassandra. I only see documentation for CSV format. Is this possible for JSON?
In 2020, you can use the DataStax Bulk Loader utility (DSBulk) for loading and unloading Cassandra/DSE data in CSV and JSON formats. It's very flexible: it allows you to load only part of the data, flexibly map JSON fields onto table fields, etc. It supports Cassandra 2.1+ and is very fast.
In the simplest case, the data loading command would look as follows:
dsbulk load -k keyspace -t table -c json -url your_file.json
The DataStax blog has a six-part series of articles about DSBulk.
To insert JSON data, add JSON to the INSERT command;
refer to this link for details:
https://docs.datastax.com/en/cql/3.3/cql/cql_using/useInsertJSON.html
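For example, a minimal sketch of such an INSERT (the labdata.clients table is the hypothetical one from the answer below, and INSERT ... JSON requires Cassandra 2.2 or later):
cqlsh -e "INSERT INTO labdata.clients JSON '{\"uid\": \"d50192e5-c44e-4ae8-ae7a-7cfe67c8b777\", \"gender\": \"F\", \"age\": 19}';"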
See the dsbulk solution as the ultimate one; however, you may consider this trick, which converts JSON-formatted messages (one per line) to CSV on the fly (no separate conversion necessary) and loads them into Cassandra using cqlsh:
cat file.json | jq -r '[.uid,.gender,.age] | @csv' | cqlsh -e 'COPY labdata.clients(uid,gender,age) from STDIN;'
Explanations:
This requires the jq utility, installed e.g. on Ubuntu with apt install jq.
Here I have a file with the following messages:
{"uid": "d50192e5-c44e-4ae8-ae7a-7cfe67c8b777", "gender": "F", "age": 19}
{"uid": "d502331d-621e-4721-ada2-5d30b2c3801f", "gender": "M", "age": 32}
This is how I convert it to csv on the fly:
cat file.json | jq -r '[.uid,.gender,.age] | @csv'
where -r (raw output) removes the extra \" that jq would otherwise wrap around each output line, but you still end up with quoted strings:
"d50192e5-c44e-4ae8-ae7a-7cfe67c8b777","F",19
"d502331d-621e-4721-ada2-5d30b2c3801f","M",32
Now, if you create a table clients in keyspace labdata for this data using cqlsh:
CREATE TABLE clients ( uid ascii PRIMARY KEY, gender ascii, age int);
then you should be able to run the COPY ... FROM STDIN command above

Force mongodb to output strict JSON

I want to consume the raw output of some MongoDB commands in other programs that speak JSON. When I run commands in the mongo shell, the output is Extended JSON with fields in "shell mode", using special types like NumberLong, Date, and Timestamp. I see references in the documentation to "strict mode", but I see no way to turn it on for the shell, or a way to run commands like db.serverStatus() through tools that do output strict JSON, like mongodump. How can I force Mongo to output standards-compliant JSON?
There are several other questions on this topic, but I don't find any of their answers particularly satisfactory.
The MongoDB shell speaks Javascript, so the answer is simple: use JSON.stringify(). If your command is db.serverStatus(), then you can simply do this:
JSON.stringify(db.serverStatus())
This won't output the proper "strict mode" representation of each of the fields ({ "floatApprox": <number> } instead of { "$numberLong": "<number>" }), but if what you care about is getting standards-compliant JSON out, this'll do the trick.
I have not found a way to do this in the mongo shell, but as a workaround, mongoexport can run queries and its output uses strict mode and can be piped into other commands that expect JSON input (such as json_pp or jq). For example, suppose you have the following mongo shell command to run a query, and you want to create a pipeline using that data:
db.myItemsCollection.find({creationDate: {$gte: ISODate("2016-09-29")}}).pretty()
Convert that mongo shell command into this shell command, piping (for the sake of example) to json_pp:
mongoexport --jsonArray -d myDbName -c myItemsCollection -q '{"creationDate": {"$gte": {"$date": "2016-09-29T00:00Z"}}}' | json_pp
You will need to convert the query into strict mode format, and pass the database name and collection name as arguments, as well as quote properly for your shell, as shown here.
In the case of findOne:
JSON.stringify(db.Bill.findOne({'a': '123'}))
In the case of a cursor:
db.Bill.find({'a': '123'}).forEach(r=>print(JSON.stringify(r)))
or
print('['); db.Bill.find().limit(2).forEach(r=>print(JSON.stringify(r) + ',')); print(']')
will output
[{a:123},{a:234},]
The last one will have a ',' after the last item... remove it.
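A variant that sidesteps the trailing comma altogether is to collect the cursor into an array first, so JSON.stringify emits a well-formed array in one go (a sketch, assuming the result set fits in memory; myDbName and db.Bill are the same illustrative names used above):
mongo myDbName --quiet --eval 'print(JSON.stringify(db.Bill.find().limit(2).toArray()))'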
To build on the answer from @jbyler, you can strip out the numberLongs using sed after you get your data - that is, if you're using Linux.
mongoexport --jsonArray -d dbName -c collection -q '{fieldName: {$regex: ".*turkey.*"}}' | sed -r 's/\{ "[$]numberLong" : "([0-9]+)" }/"\1"/g' | json_pp
EDIT: This will transform a given document, but will not work on a list of documents. Changed find to findOne.
Adding
.forEach(function(results){results._id=results._id.toString();printjson(results)})
to a findOne() will output valid JSON.
Example:
db
  .users
  .findOne()
  .forEach(function (results) {
    results._id = results._id.toString();
    printjson(results)
  })
Source: https://www.mydbaworld.com/mongodb-shell-output-valid-json/

Unix command-line JSON parser? [closed]

Can anyone recommend a Unix (choose your flavor) JSON parser that could be used to introspect values from a JSON response in a pipeline?
I prefer python -m json.tool, which seems to be available by default on most *nix operating systems.
$ echo '{"foo":1, "bar":2}' | python -m json.tool
{
"bar": 2,
"foo": 1
}
Note: Depending on your version of Python, all keys might get sorted alphabetically, which may or may not be a good thing. With Python 2 sorting the keys was the default, while in Python 3.5+ they are no longer sorted automatically, but you have the option to sort by key explicitly:
$ echo '{"foo":1, "bar":2}' | python3 -m json.tool --sort-keys
{
"bar": 2,
"foo": 1
}
If you're looking for a portable, compiled C tool:
https://stedolan.github.io/jq/
From the website:
jq is like sed for JSON data - you can use it to slice and filter and map and transform structured data with the same ease that sed, awk, grep and friends let you play with text.
jq can mangle the data format that you have into the one that you want with very little effort, and the program to do so is often shorter and simpler than you’d expect.
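For example, extracting a single field from a response is a one-liner (the field names here are arbitrary):
$ echo '{"user": {"name": "stedolan"}}' | jq '.user.name'
"stedolan"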
Tutorial: https://stedolan.github.io/jq/tutorial/
Manual: https://stedolan.github.io/jq/manual/
Download: https://stedolan.github.io/jq/download/
I have created a module specifically designed for command-line JSON manipulation:
https://github.com/ddopson/underscore-cli
FLEXIBLE - THE "swiss-army-knife" tool for processing JSON data - can be used as a simple pretty-printer, or as a full-powered Javascript command-line
POWERFUL - Exposes the full power and functionality of underscore.js (plus underscore.string)
SIMPLE - Makes it simple to write JS one-liners similar to using "perl -pe"
CHAINED - Multiple command invocations can be chained together to create a data processing pipeline
MULTI-FORMAT - Rich support for input / output formats - pretty-printing, strict JSON, etc [coming soon]
DOCUMENTED - Excellent command-line documentation with multiple examples for every command
It allows you to do powerful things really easily:
cat earthporn.json | underscore select '.data .title'
# [ 'Fjaðrárgljúfur canyon, Iceland [OC] [683x1024]',
# 'New town, Edinburgh, Scotland [4320 x 3240]',
# 'Sunrise in Bryce Canyon, UT [1120x700] [OC]',
# ...
# 'Kariega Game Reserve, South Africa [3584x2688]',
# 'Valle de la Luna, Chile [OS] [1024x683]',
# 'Frosted trees after a snowstorm in Laax, Switzerland [OC] [1072x712]' ]
cat earthporn.json | underscore select '.data .title' | underscore count
# 25
underscore map --data '[1, 2, 3, 4]' 'value+1'
# prints: [ 2, 3, 4, 5 ]
underscore map --data '{"a": [1, 4], "b": [2, 8]}' '_.max(value)'
# [ 4, 8 ]
echo '{"foo":1, "bar":2}' | underscore map -q 'console.log("key = ", key)'
# key = foo
# key = bar
underscore pluck --data "[{name : 'moe', age : 40}, {name : 'larry', age : 50}, {name : 'curly', age : 60}]" name
# [ 'moe', 'larry', 'curly' ]
underscore keys --data '{name : "larry", age : 50}'
# [ 'name', 'age' ]
underscore reduce --data '[1, 2, 3, 4]' 'total+value'
# 10
And it has one of the best "smart-whitespace" JSON formatters available.
If you have any feature requests, comment on this post or add an issue in github. I'd be glad to prioritize features that are needed by members of the community.
You can use this command-line parser (which you could put into a bash alias if you like), using modules built into the Perl core:
perl -MData::Dumper -MJSON::PP=from_json -ne'print Dumper(from_json($_))'
Checkout TickTick.
It's a true Bash JSON parser.
#!/bin/bash
. /path/to/ticktick.sh
# File
DATA=`cat data.json`
# cURL
#DATA=`curl http://foobar3000.com/echo/request.json`
tickParse "$DATA"
echo ``pathname``
echo ``headers["user-agent"]``
There is also the JSON command line processing toolkit, if you happen to have Node.js and npm in your stack.
And another "json" command for massaging JSON on your Unix command line.
And here are the other alternatives:
jq: http://stedolan.github.io/jq/
fx: https://github.com/antonmedv/fx
json:select: https://github.com/dominictarr/json-select
json-command: https://github.com/zpoley/json-command
JSONPath: http://goessner.net/articles/JsonPath/, http://code.google.com/p/jsonpath/wiki/Javascript
jsawk: https://github.com/micha/jsawk
jshon: http://kmkeen.com/jshon/
json2: https://github.com/vi/json2
Related: Command line tool for parsing JSON input for Unix?
Has anyone mentioned Jshon or JSON.sh?
https://github.com/keenerd/jshon
Pipe JSON to it, and it traverses the JSON objects and prints out the path to the current object (as a JSON array) and then the object, without whitespace.
http://kmkeen.com/jshon/
Jshon loads JSON text from stdin, performs actions, then displays the last action on stdout; it was also made to be part of the usual text-processing pipeline.
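For example (a small sketch; as I understand it, -e extracts the value under a key and -u unquotes string results):
$ echo '{"name": "jshon", "type": "parser"}' | jshon -e name
"jshon"
$ echo '{"name": "jshon", "type": "parser"}' | jshon -e name -u
jshon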
You could try jsawk, as suggested in this answer.
Really, you could whip up a quick Python script to do this, though.
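For instance, a quick one-liner in that spirit, picking a single key out of a JSON document on stdin (the key name is arbitrary):
$ echo '{"foo": 1, "bar": 2}' | python -c 'import json, sys; print(json.load(sys.stdin)["foo"])'
1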
For Bash/Python, here is a basic wrapper around python's simplejson:
json_parser() {
    local jsonfile="my_json_file.json"
    local tc="import simplejson,sys; myjsonstr=sys.stdin.read(); "`
        `"myjson=simplejson.loads(myjsonstr);"
    # Build python print command based on $#
    local printcmd="print myjson"
    for (( argn=1; argn<=$#; argn++ )); do
        printcmd="$printcmd['${!argn}']"
    done
    local result=$(python -c "$tc $printcmd.keys()" <$jsonfile 2>/dev/null \
        || python -c "$tc $printcmd" <$jsonfile 2>/dev/null)
    # For returning space-separated values
    echo $result | sed -e "s/[]|[|,|']//g"
    #echo $result
}
It really only handles the nested-dictionary style of data, but it works for what I needed, and is useful for walking through the json. It could probably be adapted to taste.
Anyway, something homegrown for those not wanting to source in yet another external dependency. Except for python, of course.
Ex. json_parser {field1} {field2} would run print myjson['{field1}']['{field2}'], yielding either the keys or the values associated with {field2}, space-separated.
I just made jkid, a small command-line JSON explorer that makes it easy to explore big JSON objects. Objects can be explored "transversally", and a "preview" option is there to avoid console overflow.
$ echo '{"john":{"size":20, "eyes":"green"}, "bob":{"size":30, "eyes":"brown"}}' > test3.json
$ jkid . eyes test3.json
object[.]["eyes"]
{
"bob": "brown",
"john": "green"
}