Importing JSON dataset into Cassandra - json

I have a large dataset consisting of about 80,000 records. I want to import this into Cassandra. I only see documentation for CSV format. Is this possible for JSON?

As of 2020, you can use the DataStax Bulk Loader utility (DSBulk) for loading and unloading Cassandra/DSE data in CSV and JSON formats. It's very flexible: it allows you to load only part of the data, flexibly map JSON fields onto table columns, etc. It supports Cassandra 2.1+ and is very fast.
In the simplest case, the data loading command would look like the following:
dsbulk load -k keyspace -t table -c json -url your_file.json
The DataStax blog has a series of articles about DSBulk: 1, 2, 3, 4, 5, 6

To insert JSON data, add the JSON keyword to the INSERT command.
Refer to this link for details:
https://docs.datastax.com/en/cql/3.3/cql/cql_using/useInsertJSON.html
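As a minimal sketch (the labdata.clients keyspace and table are borrowed from the example further down, purely for illustration), a single-row JSON insert via cqlsh could look like:
cqlsh -e "INSERT INTO labdata.clients JSON '{\"uid\": \"d50192e5-c44e-4ae8-ae7a-7cfe67c8b777\", \"gender\": \"F\", \"age\": 19}';"
Note that INSERT ... JSON requires Cassandra 2.2 or later.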

The dsbulk solution above is the most complete one; however, you may consider this trick, which converts JSON-formatted messages (one per line) to CSV on the fly (no separate conversion step necessary) and loads them into Cassandra using cqlsh, i.e.:
cat file.json | jq -r '[.uid,.gender,.age] | @csv' | cqlsh -e 'COPY labdata.clients(uid,gender,age) FROM STDIN;'
Explanations:
This requires the jq utility, installed e.g. on Ubuntu with apt install jq.
Here I have a file with the following messages:
{"uid": "d50192e5-c44e-4ae8-ae7a-7cfe67c8b777", "gender": "F", "age": 19}
{"uid": "d502331d-621e-4721-ada2-5d30b2c3801f", "gender": "M", "age": 32}
This is how I convert it to csv on the fly:
cat file.json | jq -r '[.uid,.gender,.age] | @csv'
where -r produces raw output (dropping the extra escaping), though string fields remain quoted:
"d50192e5-c44e-4ae8-ae7a-7cfe67c8b777","F",19
"d502331d-621e-4721-ada2-5d30b2c3801f","M",32
Now, if you create a clients table in the labdata keyspace for this data using cqlsh:
CREATE TABLE clients ( uid ascii PRIMARY KEY, gender ascii, age int);
then you should be able to run the COPY ... FROM STDIN command above.
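Putting it all together, an end-to-end sketch (the keyspace definition here is an assumption; adjust the replication settings to your cluster):
cqlsh -e "CREATE KEYSPACE IF NOT EXISTS labdata WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};"
cqlsh -e "CREATE TABLE IF NOT EXISTS labdata.clients (uid ascii PRIMARY KEY, gender ascii, age int);"
cat file.json | jq -r '[.uid,.gender,.age] | @csv' | cqlsh -e 'COPY labdata.clients(uid,gender,age) FROM STDIN;'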

Related

Splitting large JSON data using Unix command Split

Issue with the Unix split command for splitting large data: split -l 1000 file.json myfile. I want to split this file into multiple files of 1000 records each, but I'm getting the output as a single file, with no change.
P.S. File is created converting Pandas Dataframe to JSON.
Edit: It turns out that my JSON is formatted in a way that it contains only one line; wc -l file.json returns 0.
Here is the sample: file.json
[
{"id":683156,"overall_rating":5.0,"hotel_id":220216,"hotel_name":"Beacon Hill Hotel","title":"\u201cgreat hotel, great location\u201d","text":"The rooms here are not palatial","author_id":"C0F"},
{"id":692745,"overall_rating":5.0,"hotel_id":113317,"hotel_name":"Casablanca Hotel Times Square","title":"\u201cabsolutely delightful\u201d","text":"I travelled from Spain...","author_id":"8C1"}
]
Invoking jq once per partition plus once to determine the number of partitions would be extremely inefficient. The following solution suffices to achieve the partitioning deemed acceptable in your answer:
jq -c ".[]" file.json | split -l 1000
If, however, it is deemed necessary for each file to be pretty-printed, you could run jq -s . for each file, which would still be more efficient than running .[N:N+S] multiple times.
If each partition should itself be a single JSON array, then see Splitting / chunking JSON files with JQ in Bash or Fish shell?
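For the jq -s . suggestion above, a rough sketch of the loop (assuming the chunks were produced by the split command above with its default output names xaa, xab, ...) could be:
for f in x??; do jq -s . "$f" > "$f.json"; done
This slurps the compact objects in each chunk back into a single pretty-printed array per file.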
After asking elsewhere, it turned out the file was in fact a single line.
Reformatting it with jq (in compact form) would enable the split, though processing the file would at least require deleting the first and last characters (or adding '[' and ']' to the split files).
I'd recommend splitting the JSON array with jq (see the manual).
cat file.json | jq length # get the length of the array
cat file.json | jq -c '.[0:1000]' # first 1000 items
cat file.json | jq -c '.[1000:2000]' # second 1000 items
...
Notice -c for compact result (not pretty printed).
For automation, you can code a simple bash script to split your file into chunks given the array length (jq length).
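A minimal sketch of such a script (file name, chunk size and output naming are placeholders):
#!/bin/bash
# split file.json (a single JSON array) into chunks of 1000 items each
len=$(jq length file.json)
chunk=1000
for ((i=0; i<len; i+=chunk)); do
  jq -c ".[$i:$((i+chunk))]" file.json > "chunk_$((i/chunk)).json"
done
Note this invokes jq once per chunk, which, as the answer above points out, is less efficient than a single jq -c ".[]" pass piped to split.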

Pulling data out of JSON Object via jq to CSV in Bash

I'm working on a bash script (running via Git Bash on Windows, technically, but I don't think that matters) that will convert some JSON API data into CSV files. Most of it has gone fairly well, especially since I'm not particularly familiar with jq, as this is my first time using it.
I've got some JSON data that looks like the array below. What I'm trying to do is select the cardType, maskedPan, amount and dateTime out of the data.
This is probably the first time in my life that my Google searching has failed me. I know (or should I say think) that this is actually an object and not just a simple array.
I've not really found anything that helps me know how to grab the data I need and export it into a CSV file. I've had no issue grabbing the other data that I need, but these few pieces are proving to be a big problem for me.
The script I'm trying basically can be boiled down to this:
jq='/c/jq-win64.exe -r';
header='("cardType")';
fields='[.TransactionDetails[0].Value[0].cardType]';
$jq ''$header',(.[] | '$fields' | @csv)' < /t/API_Data/JSON/GetByDate-082719.json > /t/API_Data/CSV/test.csv;
If I do .TransactionDetails[0].Value I can get that whole chunk of data. But that is problematic in a CSV as it contains commas.
I suppose I could make this a TSV, import it into the database as one big string, and substring it out, but that isn't the "right" solution. I'm sure there is a way jq can give me what I need.
"TransactionDetails": [
  {
    "TransactionId": 123456789,
    "Name": "BlacklinePaymentDetail",
    "Value": "{\"cardType\":\"Visa\",\"maskedPan\":\"1234\",\"paymentDetails\":{\"reference\":\"123456789012\",\"amount\":99.99,\"dateTime\":\"2019/08/27 08:41:09\"}}",
    "ShowOnTill": false,
    "PrintOnOrder": false,
    "PrintOnReceipt": false
  }
]
Ideally I'd be able to just have a field in the CSV for each of cardType, maskedPan, amount and dateTime instead of pulling the "Value" that contains all of them.
Any advice would be appreciated.
The ingredient you're missing is fromjson, which converts stringified JSON to JSON. After adding enclosing braces around your sample input,
the invocation:
jq -r -f program.jq input.json
produces:
"Visa","1234",99.99,"2019/08/27 08:41:09"
where program.jq is:
.TransactionDetails[0].Value
| fromjson
| [.cardType, .maskedPan] + (.paymentDetails | [.amount, .dateTime])
| @csv
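If you also want a header row like the one your script builds, a one-liner variant of the same program (still assuming the enclosing braces around the sample) could be:
jq -r '["cardType","maskedPan","amount","dateTime"], (.TransactionDetails[0].Value | fromjson | [.cardType, .maskedPan, .paymentDetails.amount, .paymentDetails.dateTime]) | @csv' input.json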

how to import a ".csv " data-file into the Redis database

How do I import a .csv data file into the Redis database? The .csv file has "id, time, latitude, longitude" columns in it. Can you suggest the best possible way to import the CSV file so that I can perform spatial queries?
This is a very broad question, because we don't know what data structure you want to have, what queries you expect, etc. In order to solve it you need to:
1. Write down the expected queries. Write down the expected partitions. Is this file your complete dataset?
2. Write down your data structure. It will heavily depend on the answers from step 1.
3. Pick any (scripting) language you are most comfortable with. Load your file, process it with a CSV library, map it to your data structure from step 2, and push it to Redis. You can do the latter with a client library or with redis-cli.
If, for example, you want to lay your data out in sorted sets where the id is the zset's key, the timestamp is the score and lat,lon is the payload, you can do this:
$ cat data.csv
id1,1528961481,45.0,45.0
id1,1528961482,45.1,45.1
id2,1528961483,50.0,50.0
id2,1528961484,50.1,50.0
cat data.csv | awk -F "," '{print $1" "$2" "$3" "$4}' | xargs -n4 sh -c 'redis-cli -p 6370 zadd $1 $2 "$3,$4"' sh
127.0.0.1:6370> zrange id2 0 -1
1) "50.0,50.0"
2) "50.1,50.0"

Is there any way in Elasticsearch to get results as CSV file in curl API?

I am using Elasticsearch.
I need the results from Elasticsearch as a CSV file.
Is there any curl URL or any plugin to achieve this?
I've done just this using cURL and jq ("like sed, but for JSON"). For example, you can do the following to get CSV output for the top 20 values of a given facet:
$ curl -X GET 'http://localhost:9200/myindex/item/_search?from=0&size=0' -d '
{
  "from": 0,
  "size": 0,
  "facets": {
    "sourceResource.subject.name": {
      "global": true,
      "terms": {
        "order": "count",
        "size": 20,
        "all_terms": true,
        "field": "sourceResource.subject.name.not_analyzed"
      }
    }
  },
  "sort": [
    {
      "_score": "desc"
    }
  ],
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      }
    }
  }
}' | jq -r '.facets["sourceResource.subject.name"].terms[] | [.term, .count] | @csv'
"United States",33755
"Charities--Massachusetts",8304
"Almshouses--Massachusetts--Tewksbury",8304
"Shields",4232
"Coat of arms",4214
"Springfield College",3422
"Men",3136
"Trees",3086
"Session Laws--Massachusetts",2668
"Baseball players",2543
"Animals",2527
"Books",2119
"Women",2004
"Landscape",1940
"Floral",1821
"Architecture, Domestic--Lowell (Mass)--History",1785
"Parks",1745
"Buildings",1730
"Houses",1611
"Snow",1579
I've used Python successfully, and the scripting approach is intuitive and concise. The ES client for python makes life easy. First grab the latest Elasticsearch client for Python here:
http://www.elasticsearch.org/blog/unleash-the-clients-ruby-python-php-perl/#python
Then your Python script can include calls like:
import elasticsearch
import unicodedata
import csv
es = elasticsearch.Elasticsearch(["10.1.1.1:9200"])
# this returns up to 500 rows, adjust to your needs
res = es.search(index="YourIndexName", body={"query": {"match": {"title": "elasticsearch"}}}, size=500)
sample = res['hits']['hits']
# then open a csv file, and loop through the results, writing to the csv
with open('outputfile.tsv', 'wb') as csvfile:
    # we use TAB delimited, to handle cases where freeform text may have a comma
    filewriter = csv.writer(csvfile, delimiter='\t', quotechar='|', quoting=csv.QUOTE_MINIMAL)
    # create column header row
    filewriter.writerow(["column1", "column2", "column3"])  # change the column labels here
    for hit in sample:
        # fill columns 1, 2, 3 with your data
        col1 = hit["some"]["deeply"]["nested"]["field"].decode('utf-8')  # replace these nested key names with your own
        col1 = col1.replace('\n', ' ')
        # col2 = , col3 = , etc...
        filewriter.writerow([col1, col2, col3])
You may want to wrap the column['key'] references in try / except error handling, since documents are unstructured and may not always have the field (it depends on your index).
I have a complete Python sample script using the latest ES python client available here:
https://github.com/jeffsteinmetz/pyes2csv
You can use the elasticsearch-head plugin.
Once you have the plugin installed, open http://localhost:9200/_plugin/head/, navigate to the Structured Query tab, provide the query details, and select 'csv' format from the 'Output Results' dropdown.
I don't think there is a plugin that will give you CSV results directly from the search engine, so you will have to query ElasticSearch to retrieve results and then write them to a CSV file.
Command line
If you're on a Unix-like OS, then you might be able to make some headway with es2unix which will give you search results back in raw text format on the command line and so should be scriptable.
You could then dump those results to a text file or pipe them to awk or similar to format them as CSV. There is a -o flag available, but it only gives 'raw' format at the moment.
Java
I found an example using Java - but haven't tested it.
Python
You could query ElasticSearch with something like pyes and write the result set to a file with the standard csv writer library.
Perl
Using Perl then you could use Clinton Gormley's GIST linked by Rakesh - https://gist.github.com/clintongormley/2049562
Shameless plug. I wrote estab - a command line program to export elasticsearch documents to tab-separated values.
Example:
$ export MYINDEX=localhost:9200/test/default/
$ curl -XPOST $MYINDEX -d '{"name": "Tim", "color": {"fav": "red"}}'
$ curl -XPOST $MYINDEX -d '{"name": "Alice", "color": {"fav": "yellow"}}'
$ curl -XPOST $MYINDEX -d '{"name": "Brian", "color": {"fav": "green"}}'
$ estab -indices "test" -f "name color.fav"
Brian green
Tim red
Alice yellow
estab can handle exports from multiple indices, custom queries, missing values, lists of values and nested fields, and it's reasonably fast.
If you are using Kibana (app/discover in general), you can build your query in the UI, then save it and use Share -> CSV Reports. This creates a CSV with one line per record and comma-separated columns.
I have been using stash-query (https://github.com/robbydyer/stash-query) for this.
I find it quite convenient and it works well, though I struggle with the install every time I redo it (this is due to me not being very fluent with gems and Ruby).
On Ubuntu 16.04, though, what seemed to work was:
apt install ruby
sudo apt-get install libcurl3 libcurl3-gnutls libcurl4-openssl-dev
gem install stash-query
and then you should be good to go. The first command installs Ruby, the second installs the curl dependencies for Ruby (the stash-query tool works via the Elasticsearch REST API), and the third installs stash-query.
This blog post describes how to build it as well:
https://robbydyer.wordpress.com/2014/08/25/exporting-from-kibana/
You can use elasticsearch2csv, a small and effective Python 3 script that uses the Elasticsearch scroll API and handles big query responses.
You can use this GIST. It's simple, it's in Perl, and you can get some help from it.
Please download it and see the usage on GitHub. Here is the link:
GIST GitHub
Or, if you want it in Java, then go for elasticsearch-river-csv:
elasticsearch-river-csv

Proper way to import json file to mongo

I've been trying to use Mongo with some imported data, but I'm not able to use it properly with my document structure.
This is an example of the .json I import using mongoimport: https://gist.github.com/2917854
mongoimport -d test -c example data.json
I noticed that my whole document is imported as a single object, instead of one object being created for each shop.
That's why, when I try to find a shop or anything else I want to query, the whole document is returned.
db.example.find({"shops.name":"x"})
I want to be able to query the db to obtain products by id using dot notation, something similar to:
db.example.find({"shops.name":"x","categories.type":"shirts","clothes.id":"1"})
The problem is that the whole document is imported as a single object. The question is: how do I need to import the object to obtain my desired result?
Docs note that:
This utility takes a single file that contains 1 JSON/CSV/TSV string per line and inserts it.
In the structure you are using (assuming the errors on the gist are fixed), you are essentially importing one document with only a shops field.
After breaking the data into separate shop docs, import using something like the following (shops being the collection name makes more sense than using example):
mongoimport -d test -c shops data.json
and then you can query like:
db.shops.find({"name":"x","categories.type":"shirts"})
There is a parameter --jsonArray:
Accept import of data expressed with multiple MongoDB document within a single JSON array
Using this option you can feed it an array, so you only need to strip the outer object syntax, i.e. everything at the beginning up to and including "shops" :, and the } at the end.
I myself use a little tool called jq that can extract the array from the command line:
./jq '.shops' shops.json
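Putting the two together, a rough sketch (database and collection names follow the earlier answer) would be to pipe the extracted array straight into mongoimport:
./jq '.shops' shops.json | mongoimport -d test -c shops --jsonArray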
IMPORT FROM JSON
mongoimport --db "databaseName" --collection "collectionName" --type json --file "fileName.json" --jsonArray
The JSON should be in this format (an array of objects):
[
  { "name": "Name1", "msg": "This is msg 1" },
  { "name": "Name2", "msg": "This is msg 2" },
  { "name": "Name3", "msg": "This is msg 3" }
]
IMPORT FROM CSV
mongoimport --db "databaseName" --collection "collectionName" --type csv --file "fileName.csv" --headerline
More Info
https://docs.mongodb.com/getting-started/shell/import-data/
Importing a JSON file
The mongoimport command allows us to import human-readable JSON into a specific database and collection. To import JSON data into a specific database and collection, type mongoimport -d databaseName -c collectionName jsonFileName.json