I have a JSON file and I need to index it on an Elasticsearch server.
The JSON file looks like this:
{
"sku": "1",
"vbid": "1",
"created": "Sun, 05 Oct 2014 03:35:58 +0000",
"updated": "Sun, 06 Mar 2016 12:44:48 +0000",
"type": "Single",
"downloadable-duration": "perpetual",
"online-duration": "365 days",
"book-format": "ePub",
"build-status": "In Inventory",
"description": "On 7 August 1914, a week before the Battle of Tannenburg and two weeks before the Battle of the Marne, the French army attacked the Germans at Mulhouse in Alsace. Their objective was to recapture territory which had been lost after the Franco-Prussian War of 1870-71, which made it a matter of pride for the French. However, after initial success in capturing Mulhouse, the Germans were able to reinforce more quickly, and drove them back within three days. After forty-three years of peace, this was the first test of strength between France and Germany. In 1929 Karl Deuringer wrote the official history of the battle for the Bavarian Army, an immensely detailed work of 890 pages; First World War expert and former army officer Terence Zuber has translated this study and edited it down to more accessible length, to produce the first account in English of the first major battle of the First World War.",
"publication-date": "07/2014",
"author": "Deuringer, Karl",
"title": "The First Battle of the First World War: Alsace-Lorraine",
"sort-title": "First Battle of the First World War: Alsace-Lorraine",
"edition": "0",
"sampleable": "false",
"page-count": "0",
"print-drm-text": "This title will only allow printing of 2 consecutive pages at a time.",
"copy-drm-text": "This title will only allow copying of 2 consecutive pages at a time.",
"kind": "book",
"fro": "false",
"distributable": "true",
"subjects": {
"subject": [
{
"-schema": "bisac",
"-code": "HIS027090",
"#text": "World War I"
},
{
"-schema": "coursesmart",
"-code": "cs.soc_sci.hist.milit_hist",
"#text": "Social Sciences -> History -> Military History"
}
]
},
"pricelist": {
"publisher-list-price": "0.0",
"digital-list-price": "7.28"
},
"publisher": {
"publisher-name": "The History Press",
"imprint-name": "The History Press Ireland"
},
"aliases": {
"eisbn-canonical": "1",
"isbn-canonical": "1",
"print-isbn-canonical": "9780752460864",
"isbn13": "1",
"isbn10": "0750951796",
"additional-isbns": {
"isbn": [
{
"-type": "print-isbn-10",
"#text": "0752460862"
},
{
"-type": "print-isbn-13",
"#text": "97807524608"
}
]
}
},
"owner": {
"company": {
"id": "1893",
"name": "The History Press"
}
},
"distributor": {
"company": {
"id": "3658",
"name": "asc"
}
}
}
But when I try to index this JSON file using the command
curl -XPOST 'http://localhost:9200/_bulk' -d @1.json
I get this error:
{"error":{"root_cause":[{"type":"action_request_validation_exception","reason":"Validation Failed: 1: no requests added;"}],"type":"action_request_validation_exception","reason":"Validation Failed: 1: no requests added;"},"status":400}
I don't know where I am making a mistake.
The bulk API of Elasticsearch uses a special syntax, which is actually made of JSON documents written on single lines. Take a look at the documentation.
The syntax is pretty simple. For indexing, creating, and updating you need two single-line JSON documents: the first line tells the action, the second gives the document to index/create/update. To delete a document, only the action line is needed. For example (from the documentation):
{ "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }
{ "field1" : "value1" }
{ "create" : { "_index" : "test", "_type" : "type1", "_id" : "3" } }
{ "field1" : "value3" }
{ "update" : {"_id" : "1", "_type" : "type1", "_index" : "index1"} }
{ "doc" : {"field2" : "value2"} }
{ "delete" : { "_index" : "test", "_type" : "type1", "_id" : "2" } }
Don't forget to end your file with a new line.
Then, to call the bulk API, use the command:
curl -s -XPOST localhost:9200/_bulk --data-binary "@requests"
From the documentation:
If you’re providing text file input to curl, you must use the --data-binary flag instead of plain -d
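Applied to the file from the question, a minimal sketch of a working bulk upload could look like this (the index name "books" and the type "book" are assumptions, and the document has to be collapsed onto a single line; only a few of its fields are shown here for brevity):
cat > bulk.json <<'EOF'
{ "index" : { "_index" : "books", "_type" : "book", "_id" : "1" } }
{ "sku" : "1", "vbid" : "1", "author" : "Deuringer, Karl", "title" : "The First Battle of the First World War: Alsace-Lorraine" }
EOF
curl -s -XPOST localhost:9200/_bulk --data-binary @bulk.json
Note that the heredoc leaves a trailing newline at the end of the file, which the bulk endpoint requires.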
Adding a trailing newline (press Enter in Postman, or append "\n" if you are sending the JSON body from a client API) did the trick for me.
I had a similar issue in that I wanted to delete a specific document of a specific type and via the above answer I managed to get my simple bash script working finally!
I have a file that has a document_id per line (document_id.txt) and using the below bash script I can delete documents of a certain type with the mentioned document_id's.
This is what the file looks like:
c476ce18803d7ed3708f6340fdfa34525b20ee90
5131a30a6316f221fe420d2d3c0017a76643bccd
08ebca52025ad1c81581a018febbe57b1e3ca3cd
496ff829c736aa311e2e749cec0df49b5a37f796
87c4101cb10d3404028f83af1ce470a58744b75c
37f0daf7be27cf081e491dd445558719e4dedba1
The bash script looks like this:
#!/bin/bash
es_cluster="http://localhost:9200"
index="some-index"
doc_type="some-document-type"

# Issue one bulk delete request per document id listed in document_id.txt
for doc_id in `cat document_id.txt`
do
    request_string="{\"delete\" : { \"_type\" : \"${doc_type}\", \"_id\" : \"${doc_id}\" } }"
    echo -e "${request_string}\r\n\r\n" | curl -s -XPOST "${es_cluster}/${index}/${doc_type}/_bulk" --data-binary @-
    echo
done
The trick, after lots of frustration, was to use the -e option of echo and append \n\n to its output before piping it into curl.
In curl, the --data-binary option stops it from stripping out the \n\n needed by the _bulk endpoint, and @- makes it read the body from stdin!
It was a weird mistake in my case. I was creating the bulkRequest object and clearing it before inserting into Elasticsearch.
This is the line that was causing the issue:
bulkRequest.requests().clear();
My issue was also the missing \n. Note that if you print the string, the \n will be parsed and rendered as a newline, so it can look like the \n is missing. In case that helps anyone:
pseudocode:
document = '{"index": {"_index": "users", "_id": "1"}} \n {"first_name": "Bob"}'
print(document)
will print
{"index": {"_index": "users", "_id": "1"}}
{"first_name": "Bob"}
but that's ok -- as long as the string contains the \n separator then it should work
I'm currently trying to do a basic JSON file import into my ELK stack. I tried importing it directly via a POST request like this:
curl -XPOST http://localhost:9200/kwd_results/TS_Cart -d @/home/local/TS_Cart.json
ES says ok for the import, but when I try to view the logs in Kibana, they are not indexed by the nodes of the JSON file. I'm guessing I need something like a template mapping to view them properly.
My JSON file looks like this:
{
"testResults": {
"FitNesseVersion": "v20160618",
"rootPath": "K1System.CountryDe.DriverFirefox.TestCases.MainFolder.TestVariants.SmokeTests_B2C.TS_Cart",
"result": [
{
"counts": {
"right": "16",
"wrong": "2",
"ignores": "3",
"exceptions": "1"
},
"date": "2017-05-10T00:01:11+02:00",
"runTimeInMillis": "117242",
"relativePageName": "TestCase_1",
"pageHistoryLink": "K1System.CountryDe.DriverFirefox.TestCases.MainFolder.TestVariants.SmokeTests_B2C.TS_Cart.B2CFreeCatalogueOrder?pageHistory&resultDate=20170510000111",
"tags": "de, at"
},
{
"counts": {
"right": "16",
"wrong": "0",
"ignores": "0",
"exceptions": "0"
},
"date": "2017-05-10T00:03:08+02:00",
"runTimeInMillis": "85680",
"relativePageName": "TestCase_2",
"pageHistoryLink": "K1System.CountryDe.DriverFirefox.TestCases.MainFolder.TestVariants.SmokeTests_B2C.TS_Cart.B2CGiftCardOrderWithAdvancePayment?pageHistory&resultDate=20170510000308",
"tags": "at, de"
}
],
"finalCounts": {
"right": "4",
"wrong": "1",
"ignores": "0",
"exceptions": "0"
},
"totalRunTimeInMillis": "482346"
}
}
Basically I would need rootPath to be used as the index, with the following children: counts, relativePageName, date and tags. Notice that I have two nodes that are children of the result[] array.
Any help would be greatly appreciated!
Thank you.
Well, it's one JSON document, so Elasticsearch treats it as such.
You'll need to (programmatically) split up the document into the right documents and then you can store them (potentially with one _bulk request); see the sketch after this list.
For the index name:
It must be lowercase, so you'll need to convert that value.
Will you have many different root paths with just a few docs each? Then you shouldn't make each of them an index, since there is an overhead for every one of them (actually for the underlying shards).
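As a sketch (untested against your cluster), the splitting could be done with jq: derive the index name from rootPath, then emit an action line plus a trimmed document line per entry of testResults.result. The type name "ts_cart" and the selection of fields are assumptions:
#!/bin/bash
file=/home/local/TS_Cart.json
# Index names must be lowercase; you may also want to replace the dots.
index=$(jq -r '.testResults.rootPath' "$file" | tr '[:upper:]' '[:lower:]')
# One action line plus one document line per result entry, in bulk format.
jq -c --arg idx "$index" '
  .testResults.result[]
  | {index: {_index: $idx, _type: "ts_cart"}},
    {counts, date, relativePageName, tags}
' "$file" > bulk.ndjson
curl -s -XPOST 'http://localhost:9200/_bulk' --data-binary @bulk.ndjson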
I have a multidimensional array that I want to index with CouchDB (really using Cloudant). I have users which have a list of the teams that they belong to. I want to search to find every member of that team. So, get me all the User objects that have a team object with id 79d25d41d991890350af672e0b76faed. I tried to make a json index on "Teams.id", but it didn't work because it isn't a straight array but a multidimensional array.
User
{
"_id": "683be6c086381d3edc8905dc9e948da8",
"_rev": "238-963e54ab838935f82f54e834f501dd99",
"type": "Feature",
"Kind": "Profile",
"Email": "gc#gmail.com",
"FirstName": "George",
"LastName": "Castanza",
"Teams": [
{
"id": "79d25d41d991890350af672e0b76faed",
"name": "First Team",
"level": "123"
},
{
"id": "e500c1bf691b9cfc99f05634da80b6d1",
"name": "Second Team Name",
"level": ""
},
{
"id": "4645e8a4958421f7d843d9b34c4cd9fe",
"name": "Third Team Name",
"level": "123"
}
],
"LastTeam": "79d25d41d991890350af672e0b76faed"
}
This is a lot like my response at Cloudant Selector Query but here's the deal, applied to your question:
The easiest way to run this query is using "Cloudant Query" (or "Mango", as it's called in the forthcoming CouchDB 2.0 release) -- and not the traditional MapReduce view indexing system in CouchDB. (This blog covers the differences: https://cloudant.com/blog/mango-json-vs-text-indexes/ and this one is an overview: https://developer.ibm.com/clouddataservices/2015/11/24/cloudant-query-json-index-arrays/).
Here's what your CQ index should look like:
{
"index": {
"fields": [
{"name": "Teams.[].id", "type": "string"}
]
},
"type": "text"
}
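If you prefer the command line over the dashboard for creating it, the index definition above can be posted to the database's _index endpoint; something like this should work (replace $ACCOUNT with your own account, and "teams_test" with your database name):
curl -H "Content-Type: application/json" -X POST "https://$ACCOUNT.cloudant.com/teams_test/_index" -d '{"index": {"fields": [{"name": "Teams.[].id", "type": "string"}]}, "type": "text"}'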
And what the subsequent query looks like:
{
"selector": {
"Teams": {"$elemMatch": {"id": "79d25d41d991890350af672e0b76faed"}}
},
"fields": [
"_id",
"FirstName",
"LastName"
]
}
You can try it yourself in the "Query" section of the Cloudant dashboard or via curl with something like this:
curl -H "Content-Type: application/json" -X POST -d '{"selector":{"Teams":{"$elemMatch":{"id":"79d25d41d991890350af672e0b76faed"}}},"fields":["_id","FirstName","LastName"]}' https://broberg.cloudant.com/teams_test/_find
That database is world-readable, so you can see the sample documents I created in there here: https://broberg.cloudant.com/teams_test/_all_docs?include_docs=true
Dig the Seinfeld theme :D
You simply need to loop through the Teams array and emit a view entry for each of the teams.
function (doc) {
if(doc.Kind === "Profile"){
for (var i=0; i<doc.Teams.length; i++) {
var team = doc.Teams[i];
emit(team.id, [doc.FirstName, doc.LastName]);
}
}
}
You can then query for all profiles with a specific team id by keying on the team id like this
.../view?key="79d25d41d991890350af672e0b76faed"
giving
{"total_rows":7,"offset":2,"rows":[
{"id":"0d15041f43b43ae07e8faa737f00032c","key":"79d25d41d991890350af672e0b76faed","value":["Adam","Alpha"]},
{"id":"68779729be3610fd8b52b22574000ae8","key":"79d25d41d991890350af672e0b76faed","value":["Bob","Bravo"]},
{"id":"9f97f1565f03aebae9ca73e207001ee1","key":"79d25d41d991890350af672e0b76faed","value":["Chuck","Charlie"]}
]}
or you can include the actual profiles in the result by adding &include_docs=true to the query.
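For example, a sketch of storing the map function above in a design document and then querying it via curl; the database name "users" and the design/view names "teams"/"by_team" are assumptions:
curl -X PUT "https://$ACCOUNT.cloudant.com/users/_design/teams" -H 'Content-Type: application/json' -d '{"views": {"by_team": {"map": "function (doc) { if (doc.Kind === \"Profile\") { for (var i = 0; i < doc.Teams.length; i++) { emit(doc.Teams[i].id, [doc.FirstName, doc.LastName]); } } }"}}}'
curl "https://$ACCOUNT.cloudant.com/users/_design/teams/_view/by_team?key=\"79d25d41d991890350af672e0b76faed\"&include_docs=true"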
How to update multiple documents in Solr 4.5.1 with JSON? I tried this but it does not work:
POST /solr/mycore/update/json:
{
"commit": {},
"add": {
"overwrite": true,
"doc": [{
"thumbnail": "/images/404.png",
"url": "/404.html?1",
"id": "demo:/404.html?1",
"channel": "demo",
"display_name": "One entry",
"description": "One entry is not enough."
}, {
"thumbnail": "/images/404.png",
"url": "/404.html?2",
"id": "demo:/404.html?2",
"channel": "demo",
"display_name": "Another entry",
"description": "Another entry is required."
}
]
}
}
Solr expects one "add" key in the JSON structure for each document (which might seem weird if you think about the original meaning of a key in an object), since it maps directly to the XML format used when indexing; this way you can have metadata for each document by itself.
{
"commit": {},
"add": {
"doc": {
"id": "321321",
"name": "barfoo"
}
},
"add": {
"doc": {
"id": "123123",
"name": "Foobar"
}
}
}
.. works. I think allowing an array as the element referenced by "add" would make more sense, but I haven't dug into the source to find the reasoning behind this.
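A sketch of posting such a body with curl (the core name "mycore" follows the question; curl passes the repeated "add" key through untouched, since the body is just a string to it):
curl -X POST 'http://localhost:8983/solr/mycore/update/json?commit=true' -H 'Content-Type: application/json' --data-binary '{"add": {"doc": {"id": "321321", "name": "barfoo"}}, "add": {"doc": {"id": "123123", "name": "Foobar"}}}'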
I understand that this has been fixed in Solr 4.0 and later. Look at http://wiki.apache.org/solr/UpdateJSON.
In ./exampledocs/books.json there is an example of a json file with multiple documents.
[
{
"id" : "978-0641723445",
"cat" : ["book","hardcover"],
"name" : "The Lightning Thief",
"author" : "Rick Riordan",
"series_t" : "Percy Jackson and the Olympians",
"sequence_i" : 1,
"genre_s" : "fantasy",
"inStock" : true,
"price" : 12.50,
"pages_i" : 384
}
,
{
"id" : "978-1423103349",
"cat" : ["book","paperback"],
"name" : "The Sea of Monsters",
"author" : "Rick Riordan",
"series_t" : "Percy Jackson and the Olympians",
"sequence_i" : 2,
"genre_s" : "fantasy",
"inStock" : true,
"price" : 6.49,
"pages_i" : 304
},
...
]
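Such an array file can be posted directly to the update handler; a minimal sketch, assuming a core named "mycore":
curl 'http://localhost:8983/solr/mycore/update?commit=true' -H 'Content-Type: application/json' --data-binary @books.json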
While @fiskfisk's answer is still valid JSON, it is not easy to serialize from a data structure. This one is.
elachell is correct that the array format will work if you are just adding documents with the default settings. Unfortunately, that won't work if, for instance, you need to add a custom boost to some of the documents or change the overwrite setting. You then have to use the full object structure with an "add" key for each of them, which, as they pointed out, is frustratingly annoying to serialize from most languages, since they don't allow the same key more than once in an object:
{
"commit": {},
"add": {
"doc": {
"id": "321321",
"name": "barfoo"
},
"boost": 2.0
},
"add": {
"doc": {
"id": "123123",
"name": "Foobar"
},
"boost": 1.5,
"overwrite": false
}
}
Update for SOLR 8.8 (and maybe lower).
The following JSON works for /update/json:
{
  "add": [
    {"id": "123", "field1": "foo"},
    {"id": "124", "field1": "foo"}
  ],
  "delete": ["111", "106"]
}
Another option if you are on Solr 4.10 or later is to use a custom JSON structure and tell Solr how to index it (not sure how to add boosts with this method either, but it's a nice option if you already have a data struct in JSON and don't want to convert it over to Solr's format). Here's the Solr documentation on this option:
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Index+Handlers#UploadingDatawithIndexHandlers-TransformingandIndexingCustomJSON
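A hedged sketch of that approach: the /update/json/docs handler takes a split parameter pointing at where the individual documents live inside the custom structure, plus f mappings from Solr fields to JSON paths. The file name, paths, and field names below are made-up illustrations:
curl 'http://localhost:8983/solr/mycore/update/json/docs?split=/orders&f=id:/orders/id&f=name:/orders/name&commit=true' -H 'Content-Type: application/json' --data-binary @custom.json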
I am learning elasticsearch and following along with the tutorial. I uploaded three documents into an index. When I supply the following query:
curl 'localhost:9200/vehicles/_search?q=driver.name:Jon'
I as expected get back object two and object three. However when I try querying using json:
curl localhost:9200/vehicles/_search -d'
{
"query":{
"prefix":{
"driver.name":"Jon"
}}}'
I get no results back. I am following the tutorial very closely, so I don't understand what the issue is. Any help would be really appreciated. The uploaded objects are below.
Thank you!
id:one
'{
"color": "green",
"driver": {
"born":"1989-09-12",
"name": "Ben"
},
"make": "BMW",
"model": "Aztek",
"value": 3000.0,
"year": 2003
}'
id:two
'{
"color": "black",
"driver": {
"born":"1934-09-08",
"name": "Jon"
},
"make": "Mercedes",
"model": "Benz",
"value": 10000.0,
"year": 2012
}'
id:three
'{
"color": "green",
"driver": {
"born":"1934-09-08",
"name": "Jon"
},
"make": "BMW",
"model": "Benz",
"value": 10000.0,
"year": 2012
}'
The prefix-query "matches documents that have fields containing terms with a specified prefix (not analyzed)".
Note the "not analyzed"-part. Lucene is looking for anything starting with "Jon" in the index, but the standard analyzer lowercases terms. That is, "jon" is in the index, but "Jon" is not.
Thus, if you lowercase the text in your prefix-query, it should work. Here is a runnable example: https://www.found.no/play/gist/7629456
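For example, lowercasing the prefix in the query from the question should return documents two and three:
curl localhost:9200/vehicles/_search -d'
{
  "query": {
    "prefix": {
      "driver.name": "jon"
    }}}'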
Try:
curl -XGET "http://localhost:9200/vehicles/_search" -d '
{
"query": {"query_string" : { "query" : "driver.name:Jon" }}
}'
In any case, if you are new to Elasticsearch I really recommend reading the documentation, because there are lots of types of queries. Besides, the results of queries also depend on how you index the documents, how you define the mapping, etc.
In order to use the prefix query, you need to hit a non-analyzed field. In your mappings for driver.name, if you set "index" to "not_analyzed", you can use the prefix query. Otherwise, you should use a match query or something similar.
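A minimal sketch of such a mapping, created at index-creation time; the type name "vehicle" is an assumption (the question does not say), and this is the pre-5.x string/not_analyzed syntax:
curl -XPUT 'localhost:9200/vehicles' -d '
{
  "mappings": {
    "vehicle": {
      "properties": {
        "driver": {
          "properties": {
            "name": { "type": "string", "index": "not_analyzed" }
          }
        }
      }
    }
  }
}'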
I am trying to design a JSON object that would work with Jersey and Jackson.
I am fairly new to JSON / RESTful programming, so I am wondering if the following is viable.
{
"name": "myservice",
"orders": [
{
"name": "iphone",
"description": "iPhone 5",
"providers": [
{
"name": "a",
"description": "AT&T",
"pricing": ["$40", "$70", "$120"]
},
{
"name": "b",
"description": "Verizon",
"pricing": ["$45", "$60", "$85"]
}
]
},
{
"name": "galaxy3",
"description": "Samsung Galaxy 3",
"providers": [
{
"name": "a",
"description": "AT&T",
"pricing": ["$45", "$60", "$85"]
}
]
}
]
}
Get all information regarding iPhone's Verizon provider:
curl -X GET -H 'Content-Type: application/json' https://mydomain/myservice/iphone/b
would return:
{
"name": "b",
"description": "Verizon",
"pricing": ["$45", "$60", "$85"]
}
Get list of pricing for iPhone's AT&T provider:
curl -X GET -H 'Content-Type: application/json' https://mydomain/myservice/iphone/a?pricing
Would return:
["$40", "$70", "$120"]
Any examples or feedback will be greatly appreciated!
Here is a good discussion about defining a REST API: REST Complex/Composite/Nested Resources
Here is what I would change in your JSON:
1. orders -> order, because resources are declared as singular nouns
2. providers -> provider, for the same reason
This is how I would call it from a client if I know exactly what I need to get (using composite resources):
https://<mydomain>/myservice/order/iphone/provider/b
https://<mydomain>/myservice/order/iphone/provider/a/pricing
In case you need to search for an order, you can define the request like:
https://<mydomain>/myservice/order?name=iphone -> it would return the 1st element in the "order" list
The assumption is that "name" is a key for the respective resources (order and provider)
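If it helps, this is how those calls could look from curl (the domain is a placeholder, as in the question):
curl -X GET 'https://mydomain/myservice/order/iphone/provider/b'
curl -X GET 'https://mydomain/myservice/order/iphone/provider/a/pricing'
curl -X GET 'https://mydomain/myservice/order?name=iphone'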