Create a CSV file from MarkLogic using the Java Client API (DMSDK)

I want to create a CSV file for 1.3M records from my MarkLogic database. I tried using CoRB for that, but it took more time than I expected.
My data looks like this:
{
  "One": {
    "Name": "One",
    "Country": "US"
  },
  "Two": {
    "State": "kentucky"
  },
  "Three": {
    "Element1": "value1",
    "Element2": "value2",
    "Element3": "value3",
    "Element4": "value4",
    so on ...
  }
}
Below are my CoRB modules.
selector.sjs
var total = cts.uris("", null, cts.collectionQuery("data"));
fn.insertBefore(total,0,fn.count(total))
transform.sjs (where I keep all the elements in an array)
var name = fn.tokenize(URI, ";");
const node = cts.doc(name);
var a = node.xpath("/One/*");
var b = node.xpath("/Two/*");
var c = node.xpath("/Three/*");
fn.stringJoin([a, b, c, name], " , ")
My properties file:
THREAD-COUNT=16
BATCH-SIZE=1000
URIS-MODULE=selector.sjs|ADHOC
PROCESS-MODULE=transform.sjs|ADHOC
PROCESS-TASK=com.marklogic.developer.corb.ExportBatchToFileTask
EXPORT-FILE-NAME=Report.csv
PRE-BATCH-TASK=com.marklogic.developer.corb.PreBatchUpdateFileTask
EXPORT-FILE-TOP-CONTENT=Col1,col2,....col16 (I have 16 columns)
It took more than an hour to create the CSV file. Also, to try this in a cluster I would first need to configure a load balancer, whereas the Java Client API distributes the work among all nodes without any load balancer.
How can I implement the same in the Java Client API? I know I can trigger the transform module using ServerTransform and ApplyTransformListener.
public static void main(String[] args) {
    DatabaseClient client = DatabaseClientFactory.newClient(
        "localhost", pwd, "x", "x", DatabaseClientFactory.Authentication.DIGEST);
    // Here I am implementing the same logic as the transform module above.
    ServerTransform txform = new ServerTransform("tsm");
    QueryManager qm = client.newQueryManager();
    StructuredQueryBuilder query = qm.newStructuredQueryBuilder();
    DataMovementManager dmm = client.newDataMovementManager();
    QueryBatcher batcher = dmm.newQueryBatcher(query.collections("data"));
    batcher.withBatchSize(2000)
           .withThreadCount(16)
           .withConsistentSnapshot()
           .onUrisReady(
               new ApplyTransformListener().withTransform(txform))
           .onBatchSuccess(batch -> {
               System.out.println(
                   batch.getTimestamp().getTime() +
                   " documents written: " +
                   batch.getJobWritesSoFar());
           })
           .onBatchFailure((batch, throwable) -> {
               throwable.printStackTrace();
           });
    // start the job and feed input to the batcher
    dmm.startJob(batcher);
    batcher.awaitCompletion();
    dmm.stopJob(batcher);
    client.release();
}
But how can I send the CSV file header like the one in CoRB (i.e. EXPORT-FILE-TOP-CONTENT)? Is there any documentation for producing a CSV file? Which class would implement that?
Any help is appreciated.
Thanks

Probably the easiest option is ml-gradle's Exporting data to CSV, which uses the Java Client API and DMSDK under the hood.
Note that you'll probably want to install a server-side REST transform to extract only the data you want in the CSV output, rather than downloading the entire document contents and then extracting on the Java side.
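For instance, a read transform in server-side JavaScript might look roughly like this (a minimal, untested sketch; the module name, field names, and column order are assumptions based on the sample document in the question):

// csv-row.sjs -- hypothetical read transform that turns one document into one CSV row
function transform(context, params, content) {
    const doc = content.toObject();
    // pick out just the fields needed for the CSV columns
    const row = [
        doc.One.Name, doc.One.Country,
        doc.Two.State,
        doc.Three.Element1, doc.Three.Element2  // ...and the remaining columns
    ].join(",");
    context.outputType = "text/plain";
    return new NodeBuilder().addText(row).toNode();
}
exports.transform = transform;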
For a working example of the code required to use DMSDK and create an aggregate CSV (one CSV for all records), see ExportToWriterListenerTest.testMassExportToWriter. For the sake of SO, here's the key code snippet (with a couple of minor simplifications, including writing the column headers (untested code)):
try (FileWriter writer = new FileWriter(outputFile)) {
    writer.write("uri,collection,contents\n");
    writer.flush();
    ExportToWriterListener exportListener = new ExportToWriterListener(writer)
        .withRecordSuffix("\n")
        .withMetadataCategory(DocumentManager.Metadata.COLLECTIONS)
        .onGenerateOutput(
            record -> {
                String uri = record.getUri();
                String collection = record.getMetadata(new DocumentMetadataHandle()).getCollections().iterator().next();
                String contents = record.getContentAs(String.class);
                return uri + "," + collection + "," + contents;
            }
        );

    QueryBatcher queryJob =
        moveMgr.newQueryBatcher(query)
            .withThreadCount(5)
            .withBatchSize(10)
            .onUrisReady(exportListener)
            .onQueryFailure( throwable -> throwable.printStackTrace() );
    moveMgr.startJob( queryJob );
    queryJob.awaitCompletion();
    moveMgr.stopJob( queryJob );
}
However, unless you know your content has no double quotes, newlines, or non-ASCII characters, a CSV library is recommended to make sure your output is properly escaped. To use a CSV library, you can of course use any tutorial out there for your library. You don't need to worry about thread safety because ExportToWriterListener runs your listeners in a synchronized block to prevent overlapping writes to the writer. Here's an example of using one CSV library, Jackson CsvMapper.
Please note that you don't have to use ExportToWriterListener . . . you can use it as a starting point to write your own listener. In particular, since your major concern is performance, you may want to have your listeners write to one file per thread, then post-process to combine things together. It's up to you.

Related

Store and update JSON Data on a Server

My web application should be able to store and update (also load) JSON data on a server.
However, the data may contain some big arrays where, each time they are saved, only a new entry has been appended.
My solution:
send updates to the server with a key-path within the JSON data.
Currently I'm sending the data with an XMLHttpRequest via jQuery, like this:
/**
 * Asynchronously writes a file on the server (via PHP script).
 * @param {String} file complete filename (path/to/file.ext)
 * @param content content that should be written. may be a js object.
 * @param {Array} updatePath (optional), json only. not the entire file is written,
 *   but the given path within the object is updated. by default the path is supposed to contain an array and the
 *   content is appended to it.
 * @param {String} key (optional) in combination with updatePath. if a key is provided, then the content is written
 *   to a field named as this parameter's content at the data located at the updatePath from the old content.
 *
 * @returns {Promise}
 */
io.write = function (file, content, updatePath, key) {
    if (utils.isObject(content)) content = JSON.stringify(content, null, "\t");
    file = io.parsePath(file);
    var data = {f: file, t: content};
    if (typeof updatePath !== "undefined") {
        if (Array.isArray(updatePath)) updatePath = updatePath.join('.');
        data.a = updatePath;
        if (typeof key !== "undefined") data.k = key;
    }
    return new Promise(function (resolve, reject) {
        $.ajax({
            type: 'POST',
            url: io.url.write,
            data: data,
            success: function (data) {
                data = data.split("\n");
                if (data[0] == "ok") resolve(data[1]);
                else reject(new Error((data[0] == "error" ? "PHP error:\n" : "") + data.slice(1).join("\n")));
            },
            cache: false,
            error: function (j, t, e) {
                reject(e);
                //throw new Error("Error writing file '" + file + "'\n" + JSON.stringify(j) + " " + e);
            }
        });
    });
};
On the server, a PHP script manages the rest like this:
- receives the data and checks if it's valid
- checks if the given file path is writable
- if the file exists and is .json
  - reads it and decodes the JSON
  - returns an error on invalid JSON
- if there is no update path given
  - just writes the data
- if there is an update path given
  - returns an error if the update path in the JSON data can't be traversed (or the file didn't exist)
  - updates the data at the update path (see the sketch below)
- writes the pretty-printed JSON to file
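For illustration, the update-path step could be traversed like this (a rough JavaScript sketch of the same idea; applyUpdate is a hypothetical helper, not part of the code above):

// Hypothetical sketch: walk a dot-separated updatePath into the data, then
// either append content to the array found there (default) or set it under key.
function applyUpdate(root, updatePath, content, key) {
    var node = root;
    var parts = updatePath.split(".");
    for (var i = 0; i < parts.length; i++) {
        if (!(parts[i] in node)) throw new Error("update path not found: " + parts[i]);
        node = node[parts[i]];
    }
    if (typeof key !== "undefined") node[key] = content; // write to a named field
    else node.push(content);                             // append to the array
    return root;
}

// usage: applyUpdate(data, "logs.entries", {time: Date.now()}) appends a new entry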
However, I'm not perfectly happy with it, and problems have kept coming up over the last few weeks.
My Questions
Generally: How would you approach this problem? Alternative suggestions, databases? Any libraries that could help?
Note: I would prefer solutions that just use PHP or some standard Apache stuff.
One problem was that sometimes multiple writes on the same file were triggered. To avoid this I used Promises client side (wrapped, because I read that jQuery's deferred stuff isn't Promise/A compliant), but I don't feel 100% sure it is working. Is there a (file) lock in PHP that works across multiple requests?
Every now and then the JSON files break, and it's not clear to me how to reproduce the problem. At the time it breaks, I don't have a history of what happened. Any general debugging strategies with a client/server saving/loading process like this?
I wrote a comet-enabled web server that does diffs on updates of JSON data structures, for exactly the same reason. The server keeps a few versions of a JSON document and serves clients holding different versions of the document with the updates they need to get to the most recent version of the JSON data.
Maybe you could reuse some of my code, written in C++ and CoffeeScript: https://github.com/TorstenRobitzki/Sioux
If you have concurrent write accesses to your data structure, are you sure that whoever writes to the file has the right version of the file in mind when reading it?

How to use update function to upload attachment in CouchDB

I would like to know what I can do to upload attachments in CouchDB using an update function.
Here is an example of my update function to add documents:
function(doc, req){
    if (!doc) {
        if (!req.form._id) {
            req.form._id = req.uuid;
        }
        req.form['|edited_by'] = req.userCtx.name;
        req.form['|edited_on'] = new Date();
        return [req.form, JSON.stringify(req.form)];
    }
    else {
        return [null, "Use POST to add a document."];
    }
}
Example for removing documents:
function(doc, req){
    if (doc) {
        for (var i in req.form) {
            doc[i] = req.form[i];
        }
        doc['|edited_by'] = req.userCtx.name;
        doc['|edited_on'] = new Date();
        doc._deleted = true;
        return [doc, JSON.stringify(doc)];
    }
    else {
        return [null, "Document does not exist."];
    }
}
Thanks for your help.
It is possible to add attachments to a document using an update function by modifying the document's _attachments property. Here's an example of an update function which will add an attachment to an existing document:
function (doc, req) {
    // skipping the create document case for simplicity
    if (!doc) {
        return [null, "update only"];
    }
    // ensure that the required form parameters are present
    if (!req.form || !req.form.name || !req.form.data) {
        return [null, "missing required post fields"];
    }
    // if there isn't an _attachments property on the doc already, create one
    if (!doc._attachments) {
        doc._attachments = {};
    }
    // create the attachment using the form data POSTed by the client
    doc._attachments[req.form.name] = {
        content_type: req.form.content_type || 'application/octet-stream',
        data: req.form.data
    };
    return [doc, "saved attachment"];
}
For each attachment, you need a name, a content type, and body data encoded as base64. The example function above requires that the client sends an HTTP POST in application/x-www-form-urlencoded format with at least two parameters: name and data (a content_type parameter will be used if provided):
name=logo.png&content_type=image/png&data=iVBORw0KGgoA...
To test the update function:
1. Find a small image and base64 encode it:
   $ base64 logo.png | sed 's/+/%2b/g' > post.txt
   The sed script encodes + characters so they don't get converted to spaces.
2. Edit post.txt and add name=logo.png&content_type=image/png&data= to the top of the document.
3. Create a new document in CouchDB using Futon.
4. Use curl to call the update function with the post.txt file as the body, substituting in the ID of the document you just created.
   curl -X POST -d @post.txt http://127.0.0.1:5984/mydb/_design/myddoc/_update/upload/193ecff8618678f96d83770cea002910
This was tested on CouchDB 1.6.1 running on OSX.
Update: @janl was kind enough to provide some details on why this answer can lead to performance and scaling issues. Uploading attachments via an upload handler has two main problems:
The upload handlers are written in JavaScript, so the CouchDB server may have to fork() a couchjs process to handle the upload. Even if a couchjs process is already running, the server has to stream the entire HTTP request to the external process over stdin. For large attachments, the transfer of the request can take significant time and system resources. For each concurrent request to an update function like this, CouchDB will have to fork a new couchjs process. Since the process runtime will be rather long because of what is explained next, you can easily run out of RAM, CPU or the ability to handle more concurrent requests.
After the _attachments property is populated by the upload handler and streamed back to the CouchDB server (!), the server must parse the response JSON, decode the base64-encoded attachment body, and write the binary body to disk. The standard method of adding an attachment to a document -- PUT /db/docid/attachmentname -- streams the binary request body directly to disk and does not require the two processing steps.
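For comparison, the standard attachment upload is a single request; assuming the document's current revision is 1-xxxx (substitute your own), it looks something like this:

curl -X PUT -H "Content-Type: image/png" --data-binary @logo.png \
  "http://127.0.0.1:5984/mydb/193ecff8618678f96d83770cea002910/logo.png?rev=1-xxxx"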
The function above will work, but there are non-trivial issues to consider before using it in a highly-scalable system.

cscript jscript JSON

This is a very very (very!!!) strange problem.
I have this JScript that runs on Windows XP and 7 using CSCRIPT from the command prompt, in a file called testJSON.js.
if ( ! this.JSON ) WScript.Echo("JSON DOESN'T EXISTS");
And, well, the message appears, but this is unexpected behavior, because JSON (as the MSDN documentation says) is one of the default objects in JScript 5.8, and my system on Windows 7 runs exactly JScript 5.8.
Now, I have temporarily solved this problem (in a somewhat complex script) by creating a new text file and MANUALLY composing a valid JSON string (and obviously this makes everything work fine even if the system doesn't have the JScript 5.8 required for JSON), but I would like to know two things mainly:
1st: Why can't I use the JSON object even if my JScript version is the one that supports that object?
2nd: I have read something about "enabling" the JSON (and other) unavailable objects in my JScript environment, but all the examples are for C#, and I would like to know if equivalent code for JScript exists or not.
You can use eval() to achieve an effect similar to JSON.parse().
eval('obj = {' + JSONstring + '}');
And afterwards, obj.toString() will let you retrieve the data similar to JSON.stringify() (just without the beautify options). See this answer for an example in the wild. The point is, you can create an object from JSON text without having to load any external libraries or switch the interpreter engine.
BIG FAT WARNING!!!
This introduces a vulnerability into the workstation running your code. If you do not control the generation of the JSON you wish to parse, or if it is possible that a 3rd party might modify the JSON between its generation and its interpretation, then consider following Helen's advice. If bad things are in the JSON, it can cause your WScript to do bad things. For example, if your JSON string or file contains the following:
};
var oSH = WSH.CreateObject("wscript.shell"),
cmd = oSH.Exec("%comspec%");
WSH.Sleep(250);
cmd.StdIn.WriteLine("net user pwnd password /add");
WSH.Sleep(250);
cmd.StdIn.WriteLine("net group Administrators pwnd /add");
WSH.Sleep(250);
cmd.Terminate();
var obj = {
"objName": {
"item1": "value 1",
"item2": "value 2"
}
... then parsing it with eval will have just added a new administrator to your computer without any visual indication that it happened.
My advice is to feel free to employ eval for private or casual use; but for widespread deployment, consider including json2.js as Helen suggests. Edit: Or...
htmlfile COM object
You can import the JSON methods by invoking the htmlfile COM object and forcing it into IE9 (or higher) compatibility mode by means of a <META> tag like this:
var htmlfile = WSH.CreateObject('htmlfile'), JSON;
htmlfile.write('<meta http-equiv="x-ua-compatible" content="IE=9" />');
htmlfile.close(JSON = htmlfile.parentWindow.JSON);
With those three lines, the JSON object and methods are copied into the JScript runtime, letting you parse JSON without using eval() or downloading json2.js. You can now do stuff like this:
var pretty = JSON.stringify(JSON.parse(json), null, '\t');
WSH.Echo(pretty);
Here's a breakdown:
// load htmlfile COM object and declare empty JSON object
var htmlfile = WSH.CreateObject('htmlfile'), JSON;
// force htmlfile to load Chakra engine
htmlfile.write('<meta http-equiv="x-ua-compatible" content="IE=9" />');
// The following statement is an overloaded compound statement, a code golfing trick.
// The "JSON = htmlfile.parentWindow.JSON" statement is executed first, copying the
// htmlfile COM object's JSON object and methods into "JSON" declared above; then
// "htmlfile.close()" ignores its argument and unloads the now unneeded COM object.
htmlfile.close(JSON = htmlfile.parentWindow.JSON);
See this answer for other methods (json2.js download via XHR, InternetExplorer.Application COM object, an HTA hybrid method, and another example of htmlfile).
Why I can't use the JSON object even if my JSCRIPT version is the one that supports that object?
According to MSDN, Windows Script Host uses the JScript 5.7 feature set by default for backward compatibility. The JScript 5.8 feature set is only used in Internet Explorer in the IE8+ Standards document modes.
You have the following options:
Include json2.js in your script. See this question for options for including external scripts in JScript scripts.
Modify the registry to expose IE9's JScript engine to Windows Script Host. UPD: This solution uses IE's JScript DLLs, but doesn't activate the 5.8 feature set.
Create a JScript execution host programmatically using the Active Script interfaces and use IActiveScriptProperty::SetProperty to force the JScript 5.8 feature set (SCRIPTLANGUAGEVERSION_5_8). Here's a C++ example.
I have read something about the "enabling" of the JSON (and other) unavailable object in my JSCRIPT environment, but all examples is for C# and I like to know if some equivalent code for JSCRIPT exists or not.
Custom script execution hosts can be created only using languages with proper COM support, such as C++, C# etc. JScript can't be used for that, because, for example, it doesn't support out parameters.
JSON encode, decode without default parser: https://gist.github.com/gnh1201/e372f5de2e076dbee205a07eb4064d8d
var $ = {};
$.json = {}; // container for the encode/decode helpers defined below
/**
 * Decode JSON
 *
 * @param string jsonString - JSON text
 *
 * @return object
 */
$.json.decode = function(jsonString) {
return (new Function("return " + jsonString)());
};
/**
 * Encode JSON
 *
 * @param object obj - Key/Value object
 *
 * @return string
 */
$.json.encode = function(obj) {
var items = [];
var isArray = (function(_obj) {
try {
return (_obj instanceof Array);
} catch (e) {
return false;
}
})(obj);
var _toString = function(_obj) {
try {
if(typeof(_obj) == "object") {
return $.json.encode(_obj);
} else {
var s = String(_obj).replace(/"/g, '\\"');
if(typeof(_obj) == "number" || typeof(_obj) == "boolean") {
return s;
} else {
return '"' + s + '"';
}
}
} catch (e) {
return "null";
}
};
for(var k in obj) {
var v = obj[k];
if(!isArray) {
items.push('"' + k + '":' + _toString(v));
} else {
items.push(_toString(v));
}
}
if(!isArray) {
return "{" + items.join(",") + "}";
} else {
return "[" + items.join(",") + "]";
}
};
/**
 * Test JSON
 *
 * @param object obj - Key/Value object
 *
 * @return boolean
 */
$.json.test = function(obj) {
var t1 = obj;
var t2 = $.json.encode(obj);
$.echo($.json.encode(t1));
var t3 = $.json.decode(t2);
var t4 = $.json.encode(t3);
$.echo(t4);
if(t2 == t4) {
$.echo("success");
return true;
} else {
$.echo("failed");
return false;
}
};
/**
 * Echo
 *
 * @param string txt
 *
 * @return void
 */
$.echo = function(txt) {
if($.isWScript()) {
WScript.Echo(txt);
} else {
try {
window.alert(txt);
} catch (e) {
console.log(txt);
}
}
};
/**
 * Check if WScript
 *
 * @return bool
 */
$.isWScript = function() {
return typeof(WScript) !== "undefined";
}
// test your data
var t1 = {"a": 1, "b": "banana", "c": {"d": 2, "e": 3}, "f": [100, 200, "3 hundreds", {"g": 4}]};
$.json.test(t1);

How to send the data in JSON structure

I have a REST service to which I am sending the JSON data as ["1","2","3"] (a list of strings), which works fine in the Firefox REST client plugin. But when sending the data from the application, the structure is in the {"0":"1","1":"2","2":"3"} format, and I am not able to pass the data. How do I convert {"0":"1","1":"2","2":"3"} to ["1","2","3"] so that I can send the data through the application? Any help would be greatly appreciated.
If the format of the JSON is { "index" : "value" }, which is what I'm seeing in {"0":"1","1":"2","2":"3"}, then we can take advantage of that information and you can do this:
var myObj = {"0":"1","1":"2","2":"3"};
var convertToList = function(object) {
    var i = 0;
    var list = [];
    while (object.hasOwnProperty(i)) { // check if value exists for index i
        list.push(object[i]);          // add value into list
        i++;                           // increment index
    }
    return list;
};
var result = convertToList(myObj); // result: ["1", "2", "3"]
See fiddle: http://jsfiddle.net/amyamy86/NzudC/
Use a fake index to "iterate" through the list. Keep in mind that this won't work if there is a break in the indices; it can't be this: {"0":"1","2":"3"}
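If the indices can have gaps, one workaround is to collect the values by key instead of counting (a sketch, assuming the keys are numeric strings):

// Sort the keys numerically, then map them to their values, so a missing
// index doesn't stop the iteration.
var keys = Object.keys(myObj).sort(function (a, b) { return a - b; });
var result = keys.map(function (k) { return myObj[k]; }); // ["1", "2", "3"]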
You need to parse the JSON back into a JavaScript object. There are parsing tools in later iterations of Dojo, as one of the other contributors already pointed out; however, most browsers support JSON.parse(), which is defined in ECMA-262 5th Edition (the specification that JS is based on). Its usage is:
var str = your_incoming_json_string,
    // here is the line ...
    obj = JSON.parse(str);
// DEBUG: pump it out to console to see what it looks like
obj.forEach(function(entry) {
    console.log(entry);
});
For the browsers that don't support JSON.parse() you can implement it using json2.js, but since you are actually using Dojo, dojo.fromJson() is the way to go. Dojo takes care of browser independence for you.
var str = your_incoming_json_string,
    // here is the line ...
    obj = dojo.fromJson(str);
// DEBUG: pump it out to console to see what it looks like
obj.forEach(function(entry) {
    console.log(entry);
});
If you're using an AMD version of Dojo then you will need to go back to the Dojo documentation and look at dojo/_base/json examples on the dojo.fromJson page.
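In AMD style that would look something like the following (a sketch; dojo/_base/json is the module that carries the legacy fromJson API):

require(["dojo/_base/json"], function (djson) {
    // djson exposes the legacy fromJson/toJson helpers
    var obj = djson.fromJson(your_incoming_json_string);
    obj.forEach(function (entry) {
        console.log(entry);
    });
});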

dojo - data from sql server for datagrid and charts

I have just started getting familiar with Dojo and creating widgets, and I have a web UI which I would now like to populate with data. My question is merely to get some references or ideas on how to do this. My databases are all SQL Server 2008 and I usually work with Microsoft .NET. I thought that I would probably have to create a service that runs the SQL queries, converts the results into JSON, and feeds that into the widgets, whether it be the datagrid or charts. I'm just not sure how to do this and if it is indeed possible. Any ideas appreciated.
EDIT:
store = new dojo.data.ItemFileWriteStore({
    url: "hof-batting.json"
});
ngrid = new dojox.grid.DataGrid({
    store: store,
    id: 'ngrid',
    structure: [
        { name: "Search Term", field: "searchterm", width: "10%" },
        { name: "Import Date", field: "importDate", width: "10%" }
    ]
}, "grid");
ngrid.startup();
I want to add data returned from my web service to this datagrid and use the same principle to add data to a chart.
You describe exactly what you need to do.
We use C# to query our database to get the data and then convert it to JSON. We use multiple techniques for JSON serialization right now. I would recommend using JSON.NET; it is what the .NET MVC team is going to use. I would not use the DataContract serialization that is currently part of .NET.
http://json.codeplex.com/
We sometimes put JSON right on the page and the javascript accesses it as a page variable. Other times we call services in .NET. We use WCF and we have also used an .ashx file to give the web client json data.
The structure of the json will be the contract between your dojo widgets and web server. I would use what the chart widgets or store will need to start the process of defining the contract.
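For example, the ItemFileWriteStore in your snippet expects JSON shaped roughly like this (illustrative sample only; the field names are taken from your grid structure, and "identifier" is whatever uniquely keys your rows):

{
    "identifier": "id",
    "items": [
        { "id": 1, "searchterm": "dojo grid", "importDate": "2013-02-01" },
        { "id": 2, "searchterm": "dojo charts", "importDate": "2013-02-02" }
    ]
}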
EDIT
WCF Interface
[OperationContract]
[WebInvoke(Method="POST", UriTemplate = "/data/{service}/",
BodyStyle = WebMessageBodyStyle.WrappedRequest)]
String RetrieveData(string service, Stream streamdata);
The implementation returns a string that is the JSON. This gets sent to the browser as JSON, but it's wrapped by .NET in an XML node. I have a utility function that cleans it:
MyUtil._xmlPrefix =
    '<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">';
MyUtil._xmlPostfix = '</string>';
MyUtil.CleanJsonResponse = function(data) {
    // summary:
    //     a method that cleans a .NET response and converts it
    //     to a javascript object with the JSON.
    //     The .NET framework doesn't easily allow for custom serialization,
    //     so the results are shipped as a string and we need to remove the
    //     crap that Microsoft adds to the response.
    var d = data;
    if (d.startsWith(MyUtil._xmlPrefix)) {
        d = d.substring(MyUtil._xmlPrefix.length);
    }
    if (d.endsWith(MyUtil._xmlPostfix)) {
        d = d.substring(0, d.length - MyUtil._xmlPostfix.length);
    }
    return dojo.fromJson(d);
};
// utility methods I have added to String
String.prototype.startsWith = function(str) {
    return this.slice(0, str.length) == str;
};
String.prototype.endsWith = function(str) {
    return this.slice(-str.length) == str;
};