Couchbase FTS Index update through REST API - couchbase

According to the latest Couchbase documentation, an FTS index can be created or updated with the following endpoint:
PUT /api/index/{indexName}
Creates/updates an index definition.
I created an index named fts-idx, and the creation succeeded.
However, updating the index through the REST API fails.
Response:
responseMessage : ,{"error":"rest_create_index: error creating index: fts-idx, err: manager_api: cannot create index because an index with the same name already exists: fts-idx"
Is there anything I have missed here?

I was able to replicate this issue, and I think figured it out. It's not a bug, but it should really be documented better.
You need to pass in the index's UUID as part of the PUT (I think this is a concurrency check). You can get the index's current UUID via GET /api/index/fts-index (it's in indexDef->uuid).
Once you have that, make it part of your update PUT body:
{
  "name": "fts-index",
  "type": "fulltext-index",
  "params": {
    // ... etc ...
  },
  "sourceType": "couchbase",
  "sourceName": "travel-sample",
  "sourceUUID": "307a1042c094b7314697980312f4b66b",
  "sourceParams": {},
  "planParams": {
    // ... etc ...
  },
  "uuid": "89a125824b012319" // <--- right here
}
Once I did that, the update PUT went through just fine.
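The GET-then-PUT sequence described above could be scripted roughly as follows. This is a minimal Python sketch, not an official client: the base URL, the credentials, and the helper names build_update_body and update_index are assumptions for illustration. Only the PUT /api/index/{indexName} endpoint and the indexDef->uuid location come from the thread itself.

```python
import base64
import copy
import json
import urllib.request


def build_update_body(get_response, changes):
    """Build the PUT body: the current definition with `changes` applied,
    still carrying the uuid the server currently reports."""
    index_def = copy.deepcopy(get_response["indexDef"])
    current_uuid = index_def["uuid"]
    index_def.update(changes)
    index_def["uuid"] = current_uuid  # always send the server's current uuid
    return index_def


def _request(method, url, user, password, body=None):
    # Hypothetical helper: JSON request with HTTP basic auth.
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(url, data=data, method=method)
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    req.add_header("Authorization", f"Basic {token}")
    req.add_header("Content-Type", "application/json")
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def update_index(base_url, name, changes, user, password):
    """Fetch the current definition, then PUT it back with changes and uuid."""
    url = f"{base_url}/api/index/{name}"
    current = _request("GET", url, user, password)
    return _request("PUT", url, user, password, build_update_body(current, changes))
```

Keeping build_update_body separate from the HTTP calls makes the uuid handling easy to check without a running cluster.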

Can you SQL populate a BigQuery table and set the table column modes in the same API call?

I'm using Google Apps Script to migrate data through BigQuery, and I've run into an issue: the SQL I'm using to perform a WRITE_TRUNCATE load causes the destination table to be recreated with column modes of NULLABLE rather than their previous mode of REQUIRED.
Attempting to change the modes to REQUIRED after the data is loaded using a metadata patch causes an error even though the columns don't contain any null values.
I considered working around the issue by dropping the table and recreating it again with the same REQUIRED modes, then loading the data using WRITE_APPEND instead of WRITE_TRUNCATE. But this isn't possible because a user wants to have the same source and destination table in their SQL.
Does anyone know if it's possible to define a BigQuery.Jobs.insert request that includes the output schema information/metadata?
If it's not possible, the only alternative I can see is to use my original workaround of a WRITE_APPEND, but with a temporary table added to the process so that the destination table can appear in the source SQL. If this can be avoided, that would be nice.
Additional Information:
I did experiment with different ways of setting the schema information but when they didn't return an error message the schema seemed to get ignored.
For example, this is the JSON I'm passing into BigQuery.Jobs.insert:
jsnConfig =
{
  "configuration":
  {
    "query":
    {
      "destinationTable":
      {
        "projectId":"my-project",
        "datasetId":"sandbox_dataset",
        "tableId":"hello_world"
      },
      "writeDisposition":"WRITE_TRUNCATE",
      "useLegacySql":false,
      "query":"SELECT COL_A, COL_B, '1' AS COL_C, COL_TIMESTAMP, COL_REQUIRED FROM `my-project.sandbox_dataset.hello_world_2` ",
      "allowLargeResults":true,
      "schema":
      {
        "fields":
        [
          {
            "description":"Desc of Column A",
            "type":"STRING",
            "mode":"NULLABLE",
            "name":"COL_A"
          },
          {
            "description":"Desc of Column B",
            "type":"STRING",
            "mode":"REQUIRED",
            "name":"COL_B"
          },
          {
            "description":"Desc of Column C",
            "type":"STRING",
            "mode":"REPEATED",
            "name":"COL_C"
          },
          {
            "description":"Desc of Column Timestamp",
            "type":"INTEGER",
            "mode":"NULLABLE",
            "name":"COL_TIMESTAMP"
          },
          {
            "description":"Desc of Column Required",
            "type":"STRING",
            "mode":"REQUIRED",
            "name":"COL_REQUIRED"
          }
        ]
      }
    }
  }
}
var job = BigQuery.Jobs.insert(jsnConfig, "my-project");
The result is that the new or existing hello_world table is truncated and loaded with the data specified in the query (so part of the json package is being read), but the column descriptions and modes aren't added as defined in the schema section. They're just blank and NULLABLE in the table.
More
When I tested the REST request above using Google's API page for BigQuery.Jobs.Insert, it highlighted the "schema" property in the request as invalid. It appears the schema can be defined if you're loading the data from a file, i.e. BigQuery.Jobs.Load, but that functionality doesn't seem to be supported if you're loading the data from an SQL source.
See the documentation here: https://cloud.google.com/bigquery/docs/schemas#specify-schema-manual-python
You can pass a schema object with your load job, meaning you can set fields to mode=REQUIRED.
This is the command you should use:
bq --location=[LOCATION] load --source_format=[FORMAT] [PROJECT_ID]:[DATASET].[TABLE] [PATH_TO_DATA_FILE] [PATH_TO_SCHEMA_FILE]
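The [PATH_TO_SCHEMA_FILE] argument above points at a JSON file containing an array of field definitions. As a rough sketch, assuming the column names and modes from the question, such a file could be generated like this in Python (the helper names build_schema and write_schema_file are made up for illustration):

```python
import json

VALID_MODES = {"NULLABLE", "REQUIRED", "REPEATED"}


def build_schema(columns):
    """Turn (name, type, mode, description) tuples into schema field dicts."""
    fields = []
    for name, col_type, mode, description in columns:
        if mode not in VALID_MODES:
            raise ValueError(f"unknown mode {mode!r} for column {name!r}")
        fields.append({
            "name": name,
            "type": col_type,
            "mode": mode,
            "description": description,
        })
    return fields


def write_schema_file(columns, path):
    """Write the schema as a JSON array, the shape a schema file uses."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(build_schema(columns), fh, indent=2)


# Columns taken from the question; adjust to the real table.
COLUMNS = [
    ("COL_A", "STRING", "NULLABLE", "Desc of Column A"),
    ("COL_B", "STRING", "REQUIRED", "Desc of Column B"),
    ("COL_REQUIRED", "STRING", "REQUIRED", "Desc of Column Required"),
]
```

Calling write_schema_file(COLUMNS, "schema.json") would produce a file that can be passed as the last argument of the load command.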
As @Roy answered, this is done via load only. Can you output the logs of this command?

Confusion about Couchbase keys and indexes

I have imported a dataset into Couchbase that looks like so:
{
  "CLUSTER": "M1M",
  "CLUSTER_NAME": "MARTIN MARIETTA",
  "PRIMARY": "",
  "SET_NUM": "10000163",
  "SHORTENED_NAME": "MARTIN MARIETTA MATERIALS",
  "TYPE": "SET",
  "_class": "com.company.aad.xref.model.ClusterCodeXref"
}
I had to provide a key-generation strategy, and I made the strategy what I ultimately want my index to look like, %SET_NUM%::%TYPE%. So I have a couple of questions:
Does the key-generation automatically create a field called ID with those 2 elements, or do I need to create an ID column in my CSV dataset?
How can I create an index out of those 2 fields? I understand how to use the CREATE INDEX command with composite fields, but will that index look like the key generated by %SET_NUM%::%TYPE%? I need them to be the same, with the :: in the middle.
I hope my question is clear! Would appreciate any help.
In Couchbase, the ID/key of a document is not actually in the document itself. If you use the --generate-key template, your document would look something like:
key = "10000163::SET"
{
  "CLUSTER": "M1M",
  "CLUSTER_NAME": "MARTIN MARIETTA",
  "PRIMARY": "",
  "SET_NUM": "10000163",
  "SHORTENED_NAME": "MARTIN MARIETTA MATERIALS",
  "TYPE": "SET",
  "_class": "com.company.aad.xref.model.ClusterCodeXref"
}
There is no designated "id" field in Couchbase. You can certainly create an id field, but it will be just like any other field.
As for an index, it depends on what kind of query you want to run. You can CREATE INDEX idx_setnumandtype ON bucketname (SET_NUM, TYPE) as you mentioned. This is going to be a useful index for queries like: SELECT b.* FROM bucketname b WHERE b.SET_NUM = 'foo' AND b.TYPE = 'bar';
But if you know those two values and just need to look up a single document, you don't necessarily need to create an index or use N1QL. You can simply do a key/value GET operation. In Java, for instance: bucket.get("10000163::SET")
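To make the relationship between the key template and the document fields concrete, here is a small Python sketch that expands a %FIELD% template the way the question describes. It only illustrates the idea and is not the import tool's actual implementation; the function name generate_key is hypothetical.

```python
import re


def generate_key(template, doc):
    """Replace each %FIELD% placeholder with that field's value from doc."""
    def lookup(match):
        # Raises KeyError if the document lacks the referenced field.
        return str(doc[match.group(1)])

    return re.sub(r"%([^%]+)%", lookup, template)


doc = {"SET_NUM": "10000163", "TYPE": "SET", "CLUSTER": "M1M"}
key = generate_key("%SET_NUM%::%TYPE%", doc)
```

The resulting key lives alongside the document rather than inside it, so a key/value lookup needs only the two values and the "::" separator.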

How to avoid Keys with Duplicate Values in Couchbase.Lite

Is it possible to tell CB.Lite to reject documents that repeat a value already stored under a certain key?
For instance, if I already have the following document in CB.Lite:
{
  "Dog": {
    "Name": "Dug",
    "Color": "Blue",
    "Age": 2
  }
}
Is it possible to tell CB.Lite to reject any document with a repeated value for the key "Name", so that if I try to add the following one:
{
  "Dog": {
    "Name": "Dug",
    "Color": "Green",
    "Age": 5
  }
}
it would reject it?
I know it would not be much hassle to implement this functionality myself, but I was wondering whether CB.Lite already has something out of the box.
Not currently, at least not at commit time (this is as of 1.4.x). The closest you could get, with Couchbase doing most of the work, would be to create a View that emits the value you don't want repeated, then query it and do the enforcement yourself.
This assumes the docs themselves have different IDs. If the two documents you showed used the same document ID, there are other possibilities; for example, you could trap this and reject it in Sync Gateway.
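The "query, then enforce it yourself" idea can be sketched independently of any SDK. In the Python sketch below, an in-memory dictionary stands in for the View that emits Dog.Name; the class UniqueNameStore is purely hypothetical and does not use the Couchbase Lite API.

```python
class UniqueNameStore:
    """Toy document store that rejects a second document with the same Dog.Name."""

    def __init__(self):
        self._docs = {}    # doc_id -> document
        self._names = {}   # Dog.Name -> doc_id (stand-in for the View index)

    def save(self, doc_id, doc):
        """Save doc; return False if another document already uses the name."""
        name = doc["Dog"]["Name"]
        owner = self._names.get(name)
        if owner is not None and owner != doc_id:
            return False  # the name belongs to a different document
        previous = self._docs.get(doc_id)
        if previous is not None:
            # The document is being updated: release its old name first.
            self._names.pop(previous["Dog"]["Name"], None)
        self._docs[doc_id] = doc
        self._names[name] = doc_id
        return True


store = UniqueNameStore()
first = store.save("doc1", {"Dog": {"Name": "Dug", "Color": "Blue", "Age": 2}})
second = store.save("doc2", {"Dog": {"Name": "Dug", "Color": "Green", "Age": 5}})
```

With a real database the check and the write are separate steps, so two concurrent writers could still both pass the check; the sketch ignores that.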

Error while adding new documents to Azure Search index

I have an index with a couple of fields of type Edm.String and Collection(Edm.String). I want to have another index with the same fields plus another field of type Edm.Double. When I create such an index and try to add the same values (plus the newly added Edm.Double value) as I did to the first index, I'm getting the following error:
{
  "error": {
    "code": "",
    "message": "The request is invalid. Details: parameters : An unexpected 'StartArray' node was found when reading from the JSON reader. A 'PrimitiveValue' node was expected.\r\n"
  }
}
Does anyone know what this error means? I tried looking for it on the Internet but I couldn't find anything related to my situation. A sample request I'm sending to the new index looks like this:
POST https://myservicename.search.windows.net/indexes/newindexname/docs/index?api-version=2016-09-01
{
  "value": [{
    "@search.action": "upload",
    "keywords": ["red", "lovely", "glowing", "cute"],
    "name": "sample document",
    "weight": 0.5,
    "id": "67"
  }]
}
The old index is the same but it doesn't have the "weight" parameter.
Edit: I created the index using the portal, so I don't have the exact JSON to create the index, but the fields are roughly like this:
Field     Type                    Attributes                           Analyzer
-------------------------------------------------------------------------------------
id        Edm.String              Key, Retrievable
name      Edm.String              Searchable, Filterable, Retrievable  Eng-Microsoft
keywords  Collection(Edm.String)  Searchable, Filterable, Retrievable  Eng-Microsoft
weight    Edm.Double              Filterable, Sortable
The reason I got the error was because I made a mistake and was trying to send a Collection(Edm.String) when the actual type on the index was Edm.String.
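A mismatch like this can be caught before the request is sent by comparing each value's JSON shape with the declared field type. The following Python sketch assumes you keep a local mapping of field names to EDM types; the function name find_type_mismatches and that mapping are illustrative assumptions, not part of the Azure Search API.

```python
def find_type_mismatches(field_types, document):
    """List fields whose JSON shape (array vs. single value) disagrees
    with the EDM type declared for them in the index."""
    problems = []
    for name, value in document.items():
        if name.startswith("@"):
            continue  # control annotations are not index fields
        edm_type = field_types.get(name)
        if edm_type is None:
            problems.append(f"{name}: field is not defined in the index")
            continue
        expects_array = edm_type.startswith("Collection(")
        got_array = isinstance(value, list)
        if got_array and not expects_array:
            problems.append(f"{name}: got an array, but the index declares {edm_type}")
        elif expects_array and not got_array and value is not None:
            problems.append(f"{name}: got a single value, but the index declares {edm_type}")
    return problems


# The mistaken index definition: keywords declared as a plain string.
index_fields = {
    "id": "Edm.String",
    "name": "Edm.String",
    "keywords": "Edm.String",
    "weight": "Edm.Double",
}
document = {
    "@search.action": "upload",
    "keywords": ["red", "lovely", "glowing", "cute"],
    "name": "sample document",
    "weight": 0.5,
    "id": "67",
}
problems = find_type_mismatches(index_fields, document)
```

An array arriving where a single value is declared corresponds to the "StartArray" versus "PrimitiveValue" wording in the error message.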

Alternative to preventing duplicates in importing CSV to CouchDB

I have 2 big CSV files with millions of rows.
Because those 2 CSVs come from MySQL, I want to merge those 2 tables into one document in CouchDB.
What is the most efficient way to do this?
My current method is:
import 1st CSV
import 2nd CSV
To prevent duplication, the program searches for the document by key for each row. Once the document is found, it is updated with the columns from the 2nd CSV.
The problem is that searching for each row takes a really long time.
While importing the 2nd CSV, it updates 30 documents/second, and I have about 7 million rows. By a rough calculation, the whole import will take about 64 hours.
Thank You
It sounds like you have a "primary key" that you know from the row (or you can compute it from the row). That is ideal as the document _id.
The problem is that you get 409 Conflict if you try to add the 2nd CSV data when there is already a document with the same _id. Is that correct? (If not, please correct me so I can fix the answer.)
I think there is a good answer for you:
Use _bulk_docs to import everything, then fix the conflicts.
Begin with a clean database.
Use the bulk document API to insert all the rows from the 1st and then the 2nd CSV set, as many as possible per HTTP request, e.g. 1,000 at a time. (Bulk docs are much faster than inserting one-by-one.)
Always add "all_or_nothing": true in your _bulk_docs POST data. That will guarantee that every insertion will be successful (assuming no disasters such as power loss or full HD).
When you are done, some documents will be conflicted, which means you inserted twice for the same _id value. That is no problem. Simply follow this procedure to merge the two versions:
For each _id that has conflicts, fetch it from couch by GET /db/the_doc_id?conflicts=true.
Merge all the values from the conflicting versions into a new final version of the document.
Commit the final, merged document into CouchDB and delete the conflicted revisions. See the CouchDB Definitive Guide section on conflict resolution. (You can use _bulk_docs to speed this up too.)
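The merge step of the procedure above could look roughly like the following Python sketch. It assumes you have already fetched every conflicting revision of one _id as a dict; the function name build_resolution and the "first revision wins on clashes" rule are choices made for illustration, not something CouchDB prescribes.

```python
def build_resolution(revisions):
    """Given all conflicting revisions of one document, return a _bulk_docs
    body that keeps one merged revision and deletes the others."""
    if not revisions:
        raise ValueError("need at least one revision")
    winner = revisions[0]
    merged = {}
    # Apply later revisions first so earlier ones take precedence on clashes.
    for revision in reversed(revisions):
        for field, value in revision.items():
            if not field.startswith("_"):
                merged[field] = value
    merged["_id"] = winner["_id"]
    merged["_rev"] = winner["_rev"]
    docs = [merged]
    for loser in revisions[1:]:
        docs.append({"_id": loser["_id"], "_rev": loser["_rev"], "_deleted": True})
    return {"docs": docs}


conflicting = [
    {"_id": "some_id", "_rev": "1-0cb8fd1fd7801b94bcd2f365ce4812ba",
     "first_value": "This is the first value"},
    {"_id": "some_id", "_rev": "1-d1b74e67eee657f42e27614613936993",
     "second_value": "The second value is here"},
]
payload = build_resolution(conflicting)
```

The returned dict is meant to be serialized as the body of a POST to /db/_bulk_docs; any additional top-level options are left to the caller.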
Example
Hopefully this will clarify a bit. Note, I installed the *manage_couchdb* couchapp from http://github.com/iriscouch/manage_couchdb. It has a simple view to show conflicts.
$ curl -XPUT -Hcontent-type:application/json localhost:5984/db
{"ok":true}
$ curl -XPOST -Hcontent-type:application/json localhost:5984/db/_bulk_docs --data-binary @-
{ "all_or_nothing": true
, "docs": [ { "_id": "some_id"
, "first_value": "This is the first value"
}
, { "_id": "some_id"
, "second_value": "The second value is here"
}
]
}
[{"id":"some_id","rev":"1-d1b74e67eee657f42e27614613936993"},{"id":"some_id","rev":"1-d1b74e67eee657f42e27614613936993"}]
$ curl localhost:5984/db/_design/couchdb/_view/conflicts?reduce=false\&include_docs=true
{"total_rows":2,"offset":0,"rows":[
{"id":"some_id","key":["some_id","1-0cb8fd1fd7801b94bcd2f365ce4812ba"],"value":{"_id":"some_id","_rev":"1-0cb8fd1fd7801b94bcd2f365ce4812ba"},"doc":{"_id":"some_id","_rev":"1-0cb8fd1fd7801b94bcd2f365ce4812ba","first_value":"This is the first value"}},
{"id":"some_id","key":["some_id","1-d1b74e67eee657f42e27614613936993"],"value":{"_id":"some_id","_rev":"1-d1b74e67eee657f42e27614613936993"},"doc":{"_id":"some_id","_rev":"1-d1b74e67eee657f42e27614613936993","second_value":"The second value is here"}}
]}
$ curl -XPOST -Hcontent-type:application/json localhost:5984/db/_bulk_docs --data-binary @-
{ "all_or_nothing": true
, "docs": [ { "_id": "some_id"
, "_rev": "1-0cb8fd1fd7801b94bcd2f365ce4812ba"
, "first_value": "This is the first value"
, "second_value": "The second value is here"
}
, { "_id": "some_id"
, "_rev": "1-d1b74e67eee657f42e27614613936993"
, "_deleted": true
}
]
}
[{"id":"some_id","rev":"2-df5b9dc55e40805d7f74d1675af29c1a"},{"id":"some_id","rev":"2-123aab97613f9b621e154c1d5aa1371b"}]
$ curl localhost:5984/db/_design/couchdb/_view/conflicts?reduce=false\&include_docs=true
{"total_rows":0,"offset":0,"rows":[]}
$ curl localhost:5984/db/some_id?conflicts=true\&include_docs=true
{"_id":"some_id","_rev":"2-df5b9dc55e40805d7f74d1675af29c1a","first_value":"This is the first value","second_value":"The second value is here"}
The final two commands show that there are no conflicts, and the "merged" document is now served as "some_id".
Another option is simply to do what you are doing already, but use the bulk document API to get a performance boost.
For each batch of documents:
POST to /db/_all_docs?include_docs=true with a body like this:
{ "keys": [ "some_id_1"
, "some_id_2"
, "some_id_3"
]
}
Build your _bulk_docs update depending on the results you get.
Doc already exists, you must update it: {"key":"some_id_1", "doc": {"existing":"data"}}
Doc does not exist, you must create it: {"key":"some_id_2", "error":"not_found"}
POST to /db/_bulk_docs with a body like this:
{ "docs": [ { "_id": "some_id_1"
, "_rev": "the _rev from the previous query"
, "existing": "data"
, "perhaps": "some more data I merged in"
}
, { "_id": "some_id_2"
, "brand": "new data, since this is the first doc creation"
}
]
}
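The "build your _bulk_docs update depending on the results" step can be sketched in Python as follows. The function name build_bulk_update and the shape of new_data (a dict mapping _id to the fields from the 2nd CSV) are assumptions for illustration; the row shapes follow the two cases shown above.

```python
def build_bulk_update(rows, new_data):
    """Combine the rows returned by _all_docs?include_docs=true with new
    fields, producing the body for the follow-up POST to /db/_bulk_docs."""
    docs = []
    for row in rows:
        doc_id = row["key"]
        fields = new_data.get(doc_id, {})
        existing = row.get("doc")
        if row.get("error") == "not_found" or existing is None:
            # No current document: create it from the new fields alone.
            docs.append({"_id": doc_id, **fields})
        else:
            # Document exists: keep its _id and _rev, merge the new fields in.
            updated = dict(existing)
            updated.update(fields)
            docs.append(updated)
    return {"docs": docs}


rows = [
    {"id": "some_id_1", "key": "some_id_1", "value": {"rev": "1-abc"},
     "doc": {"_id": "some_id_1", "_rev": "1-abc", "existing": "data"}},
    {"key": "some_id_2", "error": "not_found"},
]
new_data = {
    "some_id_1": {"perhaps": "some more data I merged in"},
    "some_id_2": {"brand": "new data, since this is the first doc creation"},
}
body = build_bulk_update(rows, new_data)
```

Each batch then costs two HTTP requests in total instead of one lookup and one write per row.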