I want to use Couchbase to store lots of data. I have that data in the form:
[
  {
    "foo": "bar1"
  },
  {
    "foo": "bar2"
  },
  {
    "foo": "bar3"
  }
]
I have that in a JSON file that I zipped into data.zip. I then call:
cbdocloader.exe -u Administrator -p **** -b mybucket C:\data.zip
However, this creates a single item in my bucket, not three as I expected. This actually makes sense: I should be able to store arrays, and I did not "tell" Couchbase to expect multiple items instead of one.
The temporary solution I have is to split every item into its own JSON file, add the lot of them to a single zip file, and call cbdocloader again. The problem is that I might have lots of these entries and creating all the files might take too long. Also, I saw in the docs that cbdocloader uses the filename as the key. That might be problematic in my case...
I obviously missed a step somewhere but couldn't find it in the documentation. How should I format my JSON file?
You haven't missed any steps. The cbdocloader script is very limited at the moment. Couchbase will be adding cbimport and cbexport tools in the near future that will allow you to load JSON files in various formats (including the one you mentioned). In the meantime you will need to use the workaround you already have to get your data loaded.
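If generating the individual files by hand gets tedious, the splitting can be scripted. Here is a minimal sketch in Python (the file names data.json and data_split.zip, and the use of uuid4 as the key, are assumptions for illustration only; cbdocloader will use each file name inside the zip as the document key):
import json
import uuid
import zipfile

# Read the original array of documents (assumption: it lives in data.json).
with open("data.json") as f:
    docs = json.load(f)

# Write one file per document into a zip; the file name becomes the
# document key when cbdocloader loads the archive.
with zipfile.ZipFile("data_split.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for doc in docs:
        key = str(uuid.uuid4())  # generated key; replace with a real ID if you have one
        zf.writestr(key + ".json", json.dumps(doc))
Then point cbdocloader at data_split.zip exactly as before.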
I have a system where we collect a lot of JSON configuration from different parties to configure our overall service.
The repository looks like a directory of formatted JSON files. For example foo.json:
{
  "id": "3bd0e397-d8cc-46ff-9e0d-26fa078a37f3",
  "name": "Example",
  "logo": "https://example/foo.png"
}
We have a pipeline whereby the owner of foo.json can overwrite this file by committing a new file at any time, since fast updates are required.
However, we unfortunately need to skip whole files or override some values for various $reasons.
Hence we commit something like touch foo.json.skip when we want the file to be skipped before publishing. Similarly, we add a foo.json.d/override.json to override, say, the logo because it's poorly formatted or something.
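Conceptually, the publish step applies the convention like this (an illustrative sketch only, not our real code; the paths and function name are made up):
import glob
import json
import os

def load_effective(path):
    """Return the effective config for one party's JSON file, or None if skipped."""
    # A sibling foo.json.skip marker means the whole file is dropped from publishing.
    if os.path.exists(path + ".skip"):
        return None
    with open(path) as f:
        config = json.load(f)
    # Any foo.json.d/*.json fragments are shallow-merged on top, in sorted order,
    # so e.g. an override.json can replace just the "logo" value.
    for override in sorted(glob.glob(path + ".d/*.json")):
        with open(override) as f:
            config.update(json.load(f))
    return config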
Is there a name for this sort of JSON pipeline that we have? It's inspired by systemd configuration, but maybe systemd's configuration was inspired by something else?
I'm making my first video game in Unity and I was messing around with storing data in JSON files. More specifically, language localization files. Each file stores key-value pairs of strings. Keys are grouped in categories (like "menu.pause.quit").
Now, since each language file is essentially going to have the same keys (but different values), it would be nice if VS Code recognized these keys and helped me write them with tooltips.
For example, if you open settings.json in VS Code and try to write something, there's some kind of autocompletion going on.
How does that work?
Autocompletion for JSON files is done with JSON schemas. It is described in the documentation:
https://code.visualstudio.com/docs/languages/json#_json-schemas-and-settings
Basically, you need to create a JSON schema which describes your JSON file, and you can map it to JSON files via your user or workspace settings:
// settings.json
{
  // ... other settings
  "json.schemas": [
    {
      "fileMatch": [
        "/language-file.json"
      ],
      "url": "./language-file-schema.json"
    }
  ]
}
This will provide autocompletion for language-file.json (or any other file matched by the pattern) based on language-file-schema.json (in your workspace folder, but you can also use absolute paths).
The elements of fileMatch may be patterns which will match against multiple files.
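For the localization files in the question, language-file-schema.json could look roughly like this (the key names and descriptions are only examples; the description text is what shows up in the completion tooltips):
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "menu.pause.quit": {
      "type": "string",
      "description": "Label for the Quit button in the pause menu"
    },
    "menu.pause.resume": {
      "type": "string",
      "description": "Label for the Resume button in the pause menu"
    }
  },
  "additionalProperties": false
}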
I have noticed there is a feature in the web interface of ArangoDB which allows users to download or upload data as a JSON file. However, I find nothing similar for CSV exporting. How can an existing ArangoDB collection be exported to a .csv file?
If you want to export data from ArangoDB to CSV, then you should use Arangoexport. It is included in the full packages as well as the client-only packages. You will find it next to the arangod server executable.
Basic usage:
https://docs.arangodb.com/3.4/Manual/Programs/Arangoexport/Examples.html#export-csv
Also see the CSV example with AQL query:
https://docs.arangodb.com/3.4/Manual/Programs/Arangoexport/Examples.html#export-via-aql-query
Using an AQL query for a CSV export allows you to transform the data if desired, e.g. to concatenate an array to a string or unpack nested objects. If you don't do that, then the JSON serialization of arrays/objects will be exported (which may or may not be what you want).
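For reference, the invocations look roughly like the following (the collection, field, and directory names are placeholders; check the linked documentation for the exact options in your version):
arangoexport --type csv --collection mycollection --fields "_key,name,logo" --output-directory "export"
arangoexport --type csv --custom-query "FOR d IN mycollection RETURN { name: d.name, city: d.address.city }" --fields "name,city" --output-directory "export"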
The default Arango install includes the following file:
/usr/share/arangodb3/js/contrib/CSV_export/CSVexport.js
It includes this comment:
// This is a generic CSV exporter for collections.
//
// Usage: Run with arangosh like this:
// arangosh --javascript.execute <CollName> [ <Field1> <Field2> ... ]
Unfortunately, at least in my experience, that usage tip is incorrect. Arango team, if you are reading this, please correct the file or correct my understanding.
Here's how I got it to work:
arangosh --javascript.execute "/usr/share/arangodb3/js/contrib/CSV_export/CSVexport.js" "<CollectionName>"
Please specify a password:
Then it sends the CSV data to stdout. (If you wish to send it to a file, you have to deal with the password prompt in some way.)
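For example, something along these lines should write the CSV to a file (passing the password on the command line and using --quiet to suppress the startup banner are assumptions; adjust the authentication options to your setup):
arangosh --server.password "yourpassword" --quiet --javascript.execute "/usr/share/arangodb3/js/contrib/CSV_export/CSVexport.js" "<CollectionName>" > collection.csv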
Currently I'm loading pretty massive amounts of data into our Redshift clusters from S3 (10k rows per second or so?).
This has become a problem when trying to run any queries on the data, as even when trying to roll up a couple of hours' worth of data we run into out-of-memory errors.
What I'd like to do is run a MapReduce job on the data and just load in the aggregates. I know this is supposed to be a fairly easy task, but I'm completely new to Hadoop, and I'm sorta stuck on one of the first steps.
Set up EMR cluster (done)
Load data into HDFS (I think this is what I'm supposed to do)
Currently all the data is being loaded into S3 as gzipped JSON files (which makes it easy to load into Redshift). Do I have to change the file format to get it into Hadoop? Each S3 file looks something like this:
{
  "timestamp": "2015-06-10T11:54:34.345Z",
  "key": "someguid",
  "device": { "family": "iOS", "versions": "v8.4" }
}
{
  "timestamp": "2015-06-11T15:56:44.385Z",
  "key": "some second key",
  "device": { "family": "Android", "versions": "v2.2" }
}
Each JSON object is one record/row. (Note that the JSON objects come one after another; in the real files there is no whitespace and no commas separating the JSON objects.)
It's not a huge deal for me to change the format of these files to something that will just work, but I'm not sure what that format would be (plain CSV files? Can I still gzip them?).
So questions are:
Is it possible to work with these files as is? If so, would I have fewer headaches by just changing them anyway?
After I get my files right and can import them, what is the easiest way to accomplish my goal of simply rolling up this data by hour and saving the file back to S3 so I can load it into Redshift? Ideally I would like this job to run every hour, so my Redshift tables are updated hourly with the previous hour's data. What technologies should I be reading about to accomplish this? Hive? Impala? Pig? Again, just looking for the simple solution.
From the sample data it is clear that your data is in JSON format. You can use any of MapReduce, Pig, or Hive to read and process the records.
Pig and Hive are simpler than MapReduce, as you don't need to write as much code.
If you are planning to read the data with Hive, then you can use the Hive JSON SerDe.
More detail about the implementation is available in How do you make a HIVE table out of JSON data?
If you are planning to use Pig, then you can use JsonLoader in the Pig LOAD statement. You can get more detail about JsonLoader at this link: http://joshualande.com/read-write-json-apache-pig/
You can also write your own custom UDF in Pig or Hive to read JSON data.
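As a rough illustration of the Hive route (the table name, SerDe class, and S3 paths are assumptions you would adapt; Hive reads the gzipped files transparently):
-- External table over the raw gzipped JSON files in S3 (the openx JSON SerDe jar must be added first)
ADD JAR /path/to/json-serde.jar;
CREATE EXTERNAL TABLE raw_events (
  `timestamp` STRING,
  `key` STRING,
  device STRUCT<family:STRING, versions:STRING>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-bucket/raw-events/';

-- Hourly rollup written back to S3, ready to COPY into Redshift
INSERT OVERWRITE DIRECTORY 's3://my-bucket/hourly-rollups/'
SELECT substr(`timestamp`, 1, 13) AS event_hour,
       device.family              AS device_family,
       count(*)                   AS events
FROM raw_events
GROUP BY substr(`timestamp`, 1, 13), device.family;
Note that the output uses Hive's default field delimiter, so you would adjust the Redshift COPY options (or the output row format) to match.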
I have a Couchbase Cluster with only one node (let's call it localhost) and I need to export all the data from a very big bucket (let's call it XXX) into a CSV file.
Now this seems to be a pretty easy task but I can't find the way to make it work.
According to the (really bad) documentation on the cbtransfer tool from Couchbase http://docs.couchbase.com/admin/admin/CLI/cbtransfer_tool.html, this is possible, but it isn't explained clearly. They just mention a flag if you want the transfer to occur in CSV format (?), but it is not working. Maybe someone who has already done this can give me a hand?
Using the documentation I've been able to get close to the result I want (a clean CSV file with all the documents in the XXX bucket) using this command:
/opt/couchbase/bin/cbtransfer http://localhost:8091 /path/to/export/output.csv -b XXX
But what I get is that /path/to/export/output.csv is actually a folder with a lot of folders inside, storing some kind of JSON metadata that can be used to restore the XXX bucket in another instance of Couchbase.
Has anyone been able to export data from a Couchbase bucket (Json documents) into a CSV file?
From looking at the documentation, you have to use a slightly different syntax to export to CSV. http://docs.couchbase.com/admin/admin/CLI/cbtransfer_tool.html
It needs to look like so:
cbtransfer http://[localhost]:8091 csv:./data.csv -b default -u Administrator -p password
Notice the "csv:" before the name of the CSV file.
I tested this and it does export a CSV. Just be forewarned that you need a relatively flat document structure for this to work really well, as JSON can obviously represent far more complex data structures than CSV, e.g. arrays, sub-documents, etc., and cbtransfer will not unravel those. For example, if there is a sub-document, cbtransfer will represent it as a JSON doc within the corresponding CSV line.
So depending on what your document structure is, CSV is not an ideal export format; it is a step backwards.