Alternative to preventing duplicates in importing CSV to CouchDB

I have 2 big CSV files with millions of rows.
Because those 2 CSVs are from MySQL, I want to merge those 2 tables into one document in CouchDB.
What is the most efficient way to do this?
My current method is:
import 1st CSV
import 2nd CSV
To prevent duplication, the program searches the document with the key for each row. After the row is found, the document is updated with the columns from the 2nd CSV.
The problem is, it really takes a long time to search for each row.
While importing the 2nd CSV, it updates about 30 documents/second and I have about 7 million rows. A rough calculation says it will take about 64 hours to complete the whole import.
Thank You

It sounds like you have a "primary key" that you know from the row (or you can compute it from the row). That is ideal as the document _id.
The problem is, you get a 409 Conflict if you try to add the 2nd CSV data but there was already a document with the same _id. Is that correct? (If not, please correct me so I can fix the answer.)
I think there is a good answer for you:
Use _bulk_docs to import everything, then fix the conflicts.
Begin with a clean database.
Use the Bulk Document API to insert all the rows from the 1st and then the 2nd CSV set, as many as possible per HTTP request, e.g. 1,000 at a time. (Bulk docs are much faster than inserting one-by-one.)
Always add "all_or_nothing": true in your _bulk_docs POST data. That will guarantee that every insertion will be successful (assuming no disasters such as power loss or a full disk).
When you are done, some documents will be conflicted, which means you inserted twice for the same _id value. That is no problem. Simply follow this procedure to merge the two versions:
For each _id that has conflicts, fetch it from CouchDB with GET /db/the_doc_id?conflicts=true.
Merge all the values from the conflicting versions into a new final version of the document.
Commit the final, merged document into CouchDB and delete the conflicted revisions. See the CouchDB Definitive Guide section on conflict resolution. (You can use _bulk_docs to speed this up too.)
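(The example below uses a couchapp to list conflicted documents. If you prefer not to install one, the standard recipe of a throwaway design document with a conflicts view works too; here is a sketch, with conflicts_check as a placeholder name.)
$ curl -XPUT -Hcontent-type:application/json localhost:5984/db/_design/conflicts_check --data-binary @-
{ "views":
  { "conflicts":
    { "map": "function (doc) { if (doc._conflicts) { emit(doc._id, [doc._rev].concat(doc._conflicts)); } }"
    }
  }
}
$ curl localhost:5984/db/_design/conflicts_check/_view/conflicts
Each row of that view is a conflicted document; its value lists the winning revision followed by the conflicting ones.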
Example
Hopefully this will clarify a bit. Note, I installed the *manage_couchdb* couchapp from http://github.com/iriscouch/manage_couchdb. It has a simple view to show conflicts.
$ curl -XPUT -Hcontent-type:application/json localhost:5984/db
{"ok":true}
$ curl -XPOST -Hcontent-type:application/json localhost:5984/db/_bulk_docs --data-binary @-
{ "all_or_nothing": true
, "docs": [ { "_id": "some_id"
, "first_value": "This is the first value"
}
, { "_id": "some_id"
, "second_value": "The second value is here"
}
]
}
[{"id":"some_id","rev":"1-d1b74e67eee657f42e27614613936993"},{"id":"some_id","rev":"1-d1b74e67eee657f42e27614613936993"}]
$ curl localhost:5984/db/_design/couchdb/_view/conflicts?reduce=false\&include_docs=true
{"total_rows":2,"offset":0,"rows":[
{"id":"some_id","key":["some_id","1-0cb8fd1fd7801b94bcd2f365ce4812ba"],"value":{"_id":"some_id","_rev":"1-0cb8fd1fd7801b94bcd2f365ce4812ba"},"doc":{"_id":"some_id","_rev":"1-0cb8fd1fd7801b94bcd2f365ce4812ba","first_value":"This is the first value"}},
{"id":"some_id","key":["some_id","1-d1b74e67eee657f42e27614613936993"],"value":{"_id":"some_id","_rev":"1-d1b74e67eee657f42e27614613936993"},"doc":{"_id":"some_id","_rev":"1-d1b74e67eee657f42e27614613936993","second_value":"The second value is here"}}
]}
$ curl -XPOST -Hcontent-type:application/json localhost:5984/db/_bulk_docs --data-binary @-
{ "all_or_nothing": true
, "docs": [ { "_id": "some_id"
, "_rev": "1-0cb8fd1fd7801b94bcd2f365ce4812ba"
, "first_value": "This is the first value"
, "second_value": "The second value is here"
}
, { "_id": "some_id"
, "_rev": "1-d1b74e67eee657f42e27614613936993"
, "_deleted": true
}
]
}
[{"id":"some_id","rev":"2-df5b9dc55e40805d7f74d1675af29c1a"},{"id":"some_id","rev":"2-123aab97613f9b621e154c1d5aa1371b"}]
$ curl localhost:5984/db/_design/couchdb/_view/conflicts?reduce=false\&include_docs=true
{"total_rows":0,"offset":0,"rows":[]}
$ curl localhost:5984/db/some_id?conflicts=true\&include_docs=true
{"_id":"some_id","_rev":"2-df5b9dc55e40805d7f74d1675af29c1a","first_value":"This is the first value","second_value":"The second value is here"}
The final two commands show that there are no conflicts, and the "merged" document is now served as "some_id".

Another option is simply to do what you are doing already, but use the bulk document API to get a performance boost.
For each batch of documents:
POST to /db/_all_docs?include_docs=true with a body like this:
{ "keys": [ "some_id_1"
, "some_id_2"
, "some_id_3"
]
}
Build your _bulk_docs update depending on the results you get (a curl sketch of both calls follows after this list).
If the doc already exists, you must update it; its row looks like: {"key":"some_id_1", "doc": {"existing":"data"}}
If the doc does not exist, you must create it; its row looks like: {"key":"some_id_2", "error":"not_found"}
POST to /db/_bulk_docs with a body like this:
{ "docs": [ { "_id": "some_id_1"
, "_rev": "the _rev from the previous query"
, "existing": "data"
, "perhaps": "some more data I merged in"
}
, { "_id": "some_id_2"
, "brand": "new data, since this is the first doc creation"
}
]
}
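In curl form, the two calls might look roughly like this (a sketch, reusing the placeholder IDs above and a database named db; the docs array of the second call is whatever update you built from the first call's results):
$ curl -XPOST -Hcontent-type:application/json localhost:5984/db/_all_docs?include_docs=true --data-binary @-
{ "keys": [ "some_id_1", "some_id_2", "some_id_3" ] }
$ curl -XPOST -Hcontent-type:application/json localhost:5984/db/_bulk_docs --data-binary @-
{ "docs": [ ...as built above... ] }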

Related

How to pull keys and values in jq csv when you don't know the key name

I have a semi-working jq filter that pulls out .key, .value.hostname and .value.attributes.sitename and shows them in a one-line CSV format. I'd like to add any tag_* keys and values (if found) at the end of the CSV line.
[.key , .value.hostname, .value.attributes.sitename ] | @csv
I need help with the .value.attributes.tag_* keys, since they might be completely missing or have different tag_ names. They all start with tag_ but the rest could be anything. I'd like to pair each found tag name and value together if possible and append them to the CSV line with the host.
{
"key": "computer1.domain.com",
"value": {
"attributes": {
"TESTID": "23423423",
"sitename": "siteidname",
"tag_robo_equip": "boopbeep",
"tag_modern": "cybertruck"
},
"hostname": "computer1.domain.com",
}
}
The comma at the end of the line "hostname": "computer1.domain.com", is an error; trailing commas are not valid JSON.
If you correct this, the solution could be:
Proposal
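A possible filter (just a sketch, not the only way; it assumes the attributes always live under .value.attributes, that every relevant key starts with tag_, that input.json stands for your data file, and the ";" separator inside the two extra columns is an arbitrary choice):
jq -r '
  (.value.attributes | to_entries | map(select(.key | startswith("tag_")))) as $tags
  | [ .key, .value.hostname, .value.attributes.sitename ]
    + [ ($tags | map(.key)   | join(";")) ]
    + [ ($tags | map(.value) | join(";")) ]
  | @csv
' input.json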
At the end you have 2 additional columns in your CSV file, one with the tag_ keys and the other with their values.

What JSON format does STRIP_OUTER_ARRAY support?

I have a file composed of a single array containing multiple records.
{
"Client": [
{
"ClientNo": 1,
"ClientName": "Alpha",
"ClientBusiness": [
{
"BusinessNo": 1,
"IndustryCode": "12345"
},
{
"BusinessNo": 2,
"IndustryCode": "23456"
}
]
},
{
"ClientNo": 2,
"ClientName": "Bravo",
"ClientBusiness": [
{
"BusinessNo": 1,
"IndustryCode": "34567"
},
{
"BusinessNo": 2,
"IndustryCode": "45678"
}
]
}
]
}
I load it with the following code:
create or replace stage stage.test
url='azure://xxx/xxx'
credentials=(azure_sas_token='xxx');
create table if not exists stage.client (json_data variant not null);
copy into stage.client
from @stage.test/client_test.json
file_format = (type = 'JSON' strip_outer_array = true);
Snowflake imports the entire file as one row.
I would like the COPY INTO command to remove the outer array structure and load the records into separate table rows.
When I load larger files, I hit the size limit for VARIANT and get the error Error parsing JSON: document is too large, max size 16777216 bytes.
If you can import the file into Snowflake, into a single row, then you can use LATERAL FLATTEN on the Client field to generate one row per element in the array.
Here's a blog post on LATERAL and FLATTEN (or you could look them up in the snowflake docs):
https://support.snowflake.net/s/article/How-To-Lateral-Join-Tutorial
If the format of the file is, as specified, a single object with a single property that contains an array with 500 MB worth of elements in it, then perhaps importing it will still work -- if that works, then LATERAL FLATTEN is exactly what you want. But that form is not particularly great for data processing. You might want to use some text processing script to massage the data if that's needed.
RECOMMENDATION #1:
The problem with your JSON is that it doesn't have an outer array. It has a single outer object containing a property with an inner array.
If you can fix the JSON, that would be the best solution, and then STRIP_OUTER_ARRAY will work as you expected.
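For instance, if you can preprocess the export before staging it (a sketch; it assumes jq is available and the file names are placeholders), stripping the wrapper object leaves just the outer array that STRIP_OUTER_ARRAY expects:
jq '.Client' client_test.json > client_fixed.json
client_fixed.json then starts with [ and ends with ], and each client element becomes its own row on load.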
You could also try to recompose the JSON (an ugly business) after reading it line by line with:
CREATE OR REPLACE TABLE X (CLIENT VARCHAR);
COPY INTO X FROM (SELECT $1 CLIENT FROM @My_Stage/Client.json);
User Response to Recommendation #1:
Thank you. So from what I gather, COPY with STRIP_OUTER_ARRAY can handle a file starting and ending with square brackets, and parse the file as if they were not there.
The real files don't have line breaks, so I can't read the file line by line. I will see if the source system can change the export.
RECOMMENDATION #2:
Also, if you would like to see what the JSON parser does, you can experiment using this code; I have parsed JSON in the COPY command using similar code. Working with your JSON data in a small project can help you shape the COPY command to work as intended.
CREATE OR REPLACE TABLE SAMPLE_JSON
(ID INTEGER,
DATA VARIANT
);
INSERT INTO SAMPLE_JSON(ID,DATA)
SELECT
1,parse_json('{
"Client": [
{
"ClientNo": 1,
"ClientName": "Alpha",
"ClientBusiness": [
{
"BusinessNo": 1,
"IndustryCode": "12345"
},
{
"BusinessNo": 2,
"IndustryCode": "23456"
}
]
},
{
"ClientNo": 2,
"ClientName": "Bravo",
"ClientBusiness": [
{
"BusinessNo": 1,
"IndustryCode": "34567"
},
{
"BusinessNo": 2,
"IndustryCode": "45678"
}
]
}
]
}');
SELECT
C.value:ClientNo AS ClientNo
,C.value:ClientName::STRING AS ClientName
,ClientBusiness.value:BusinessNo::Integer AS BusinessNo
,ClientBusiness.value:IndustryCode::Integer AS IndustryCode
from SAMPLE_JSON f
,table(flatten( f.DATA,'Client' )) C
,table(flatten(c.value:ClientBusiness,'')) ClientBusiness;
User Response to Recommendation #2:
Thank you for the parse_json example!
Trouble is, the real files are sometimes 500 MB, so the parse_json function chokes.
Follow-up on Recommendation #2:
The JSON needs to be in the NDJSON (http://ndjson.org/) format. Otherwise the JSON will be impossible to parse because of the potential for large files.
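If the source system cannot change the export, one way to turn the original file into NDJSON before staging it (a sketch; it assumes jq is available and the file names are placeholders) could be:
jq -c '.Client[]' client_test.json > client.ndjson
Each client object then sits on its own line and should load as a separate row, staying well under the 16 MB VARIANT limit.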
Hope the above helps others running into similar questions!

Elastic Search Document Update taking too long when retrieved by alias

Here is the situation.
1) There is an existing document (let's say the index is baseball-a).
2) baseball-a, baseball-b, and baseball-c are aliased to baseball
3) Update a document in baseball-a
POST /baseball-a/1/_update?pretty
{
"doc": { "my_name": "Casey at the bat2"}
}
4) Now if I do a GET baseball-a/1/, everything is updated.
5) but if I do a search
POST /baseball/_search?pretty
{
"query": { "match": { "id": "1" } }
}
then the document that is returned has the old my_name of "Casey at the bat" (missing the '2'), but 15 minutes later it shows up... How do I fix this or speed it up?
I think I figured it out. Basically you need to look at the refresh_interval for the alias by doing
GET /baseball/_settings
Mine was set to -1 and it should be set to either 1s or 5s
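Something along these lines should change it (a sketch; it assumes the cluster is on localhost:9200 and applies the setting to the concrete index rather than the alias):
curl -XPUT "localhost:9200/baseball-a/_settings" -H 'Content-Type: application/json' -d '
{ "index": { "refresh_interval": "1s" } }
'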
Additionally, after I manually ran this command
POST /baseball/_refresh
it also worked, but that is just a hassle... let Elasticsearch do it for you automatically. Now if I could only figure out why I can't set the refresh interval correctly: Updating ElasticSearch interval_refresh when aliased

Solr/Lucene update query deletes attribute from data

I am running into an issue with an attribute missing in the data after running an update query.
I run a select query like this:
curl "http://localhost:8080/solr/collection1/select?q=title%3AHans+head:true&fl=title,uid,articleId,missing_Attribute,my_otherAttribute&wt=json&indent=true"
It returns an article:
{
"title":"Hans",
"uid":"18_UNIQUEID_123",
"articleId":"123123123",
"missing_Attribute":"M",
}
So missing_Attribute = M and my_otherAttribute is not present yet, which is fine.
Then I run an update query on this document using:
curl http://localhost:8080/solr/collection1/update?commit=true --data-binary @MyUpdate.json -H 'Content-type:application/json'
with MyUpdate.json as:
[
{
"uid": "18_UNIQUEID_123",
"my_otherAttribute": {
"set": "12"
}
}
]
And run the select query again, results in:
{
"title":"Hans",
"uid":"18_UNIQUEID_123",
"articleId":"123123123",
"my_otherAttribute":"12",
}
my_otherAttribute = 12, but missing_Attribute is gone!
Why is missing_Attribute gone when I update my_otherAttribute?
Why does it not affect any of the other fields?
To answer my own question, the answer is:
https://wiki.apache.org/solr/Atomic_Updates
The issue I face is that I want to make a partial update of a document. I am using Solr 4.10, so in theory it would work, but only if the schema supports it, and ours does not. Atomic updates need every field (except copyField destinations) to be stored, because Solr rebuilds the whole document from its stored fields when applying the partial update; any field that is not stored is dropped. That is why attributes disappear.
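To check which fields are affected, the Schema API can list the field definitions (a sketch; the core name and port are taken from the question above):
curl "http://localhost:8080/solr/collection1/schema/fields?wt=json"
Any field reported with "stored":false (and not a copyField target) will be lost on an atomic update.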

Multiple Fields for Prefix in ElasticSearch and auto From and Size limit

I have 2 questions related to Elasticsearch.
1) First question: I have this query with prefix
"must": [
{
"prefix": {
"user": "John"
}
}
]
With this query I can prefix-match "John" against the user field, so it returns those documents in which "John" is found at the start of user. How can I change this query so that it checks whether "John" is a prefix of either the user or the email field?
2) Second question: I know we can apply size and from in Elasticsearch to limit the results, but do I have to explicitly provide size and from on every query to continue from the last result, or is there another way to have Elasticsearch do it for me, so that I just query and it returns the results that follow the previous ones?
First, note that the prefix-query does not do any text analysis, so you will not be matching e.g. john with your query.
You should look into the multi_match-query which takes the options of the match-query as well. Thus, you can combine multi_match with phrase_prefix and get the best of both: matching on multiple fields, and text analysis.
Here is a runnable example you can play with: https://www.found.no/play/gist/8197442
#!/bin/bash
export ELASTICSEARCH_ENDPOINT="http://localhost:9200"
# Index documents
curl -XPOST "$ELASTICSEARCH_ENDPOINT/_bulk?refresh=true" -d '
{"index":{"_index":"play","_type":"type"}}
{"user":"John Smith","email":"john.smith#gmail.com"}
{"index":{"_index":"play","_type":"type"}}
{"user":"Alice Smith","email":"john.smith#gmail.com"}
'
# Do searches
curl -XPOST "$ELASTICSEARCH_ENDPOINT/_search?pretty" -d '
{
"query": {
"multi_match": {
"fields": [
"user",
"email"
],
"query": "john",
"operator": "and",
"type": "phrase_prefix"
}
}
}
'
For your second question, look into the scroll API.
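A minimal sketch of how scrolling works (the index name reuses the play index from the example above; the exact syntax for passing the scroll ID back differs between Elasticsearch versions, so check the docs for yours):
# Open a scroll context kept alive for one minute between round trips
curl -XPOST "$ELASTICSEARCH_ENDPOINT/play/_search?scroll=1m&pretty" -d '
{
  "size": 100,
  "query": { "match_all": {} }
}
'
# Every response contains a _scroll_id; send it back to fetch the next page
# (older versions accept the raw ID as the request body, newer ones want {"scroll": "1m", "scroll_id": "..."})
curl -XPOST "$ELASTICSEARCH_ENDPOINT/_search/scroll?scroll=1m&pretty" -d 'PASTE_THE_SCROLL_ID_HERE'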