Storing large amounts of queryable JSON

I am trying to find a database solution that is capable of the following:
1. Store flat, random JSON structures separated by a table name (random_json_table_1, random_json_table_2, for example).
2. Handle a large number of insert operations (10,000+/second).
3. Query the random JSON structures, for example: SELECT * FROM random_json_table_1 WHERE JSON_SELECT('data', '$.city.busses') IS NOT NULL AND JSON_SELECT('data', '$.city.busStops', 'length') > 5.
4. Run SELECT queries fast over gigabytes of data.
I had a look at Amazon Athena and it looks a bit promising, but I am curious whether there are any other solutions out there.

You may consider BigQuery.
Regarding 2), there is the BigQuery streaming interface.
And regarding 4), you can play with BigQuery public data (e.g. the popular Bitcoin transactions table) to see how fast BigQuery can be.
Below is a sample query using BigQuery Standard SQL, showing how to filter data that is stored as a JSON string.
#standardSQL
SELECT JSON_EXTRACT(json_text, '$') AS student
FROM UNNEST([
'{"age" : 1, "class" : {"students" : [{"name" : "Jane"}]}}',
'{"age" : 2, "class" : {"students" : []}}',
'{"age" : 10,"class" : {"students" : [{"name" : "John"}, {"name": "Jamie"}]}}'
]) AS json_text
WHERE CAST(JSON_EXTRACT_SCALAR(json_text, '$.age') AS INT64) > 5;
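For comparison, the query from the question might translate roughly as follows. This is only a sketch: the project, dataset, table and column names are hypothetical, and JSON_EXTRACT_ARRAY assumes a reasonably recent BigQuery release.
#standardSQL
SELECT *
FROM `my-project.my_dataset.random_json_table_1`   -- hypothetical table with a JSON string column `data`
WHERE JSON_EXTRACT(data, '$.city.busses') IS NOT NULL
  AND ARRAY_LENGTH(JSON_EXTRACT_ARRAY(data, '$.city.busStops')) > 5;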

It feels like Google's BigQuery managed database might be of value to you. The BigQuery quota documentation indicates a soft limit of 100,000 streamed rows per second per table and the ability to insert up to 10,000 rows per single request. For queries, BigQuery advertises itself as being able to process petabyte-sized tables within acceptable limits.
Here is a link to the main page for BigQuery:
https://cloud.google.com/bigquery/
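As a starting point, each "random JSON table" can simply be a table with a single STRING column holding the raw JSON, which the JSON_* functions then query. A minimal sketch, with hypothetical names:
#standardSQL
CREATE TABLE IF NOT EXISTS `my-project.my_dataset.random_json_table_1` (
  data STRING   -- raw JSON document kept as a string
);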

Related

Upload XML into Exact Online exceeding maximum size or response time

I have several XML files generated by an industry solution with new data to be uploaded into Exact Online using the XML API, either directly or using the UploadXmlTopics table in Invantive SQL with the Exact Online driver.
However, the Exact Online XML API imposes a limit of approximately 10 MB per upload, and even then the load time can be long when the system is heavily loaded.
When the load time exceeds ten minutes, part of the transactions have been applied and part have failed. On a timeout, no message is returned stating what remains to be loaded.
I cannot change the XML files, since they are automatically generated by the industry solution.
What is the best way to reliably upload the XML files into Exact Online?
Of course it is best to have the XML files changed, but there are various alternatives. Please note first of all that the performance of Exact Online varies across the day and the weekend. In my experience, the best time to upload massive amounts of data is Sunday between 13:00 and 23:00.
When uploading manually, you can split the XML files by hand into several XML files. Always split on the main topic at the path /eExact/TOPIC.
When uploading through the UploadXmlTopics table, you can use two approaches:
Calculated
Automated
Calculated XML size for Exact Online
The calculated approach is the only one available in older versions. It works as follows:
In an in-memory table or a file-based table, put one record per XML topic that you wish to upload. I normally use 'xml' as the column name.
Then determine how many fragments you need, for instance using:
select ceil(log(xmlsize / 10000, 16)) + 1
from ( select sum(length(xml)) xmlsize from xmlaccounts#inmemorystorage )
Replace 10000 by the maximum fragment size. Choose a smaller one during periods of heavy load and 1000000 for the weekend.
Memorize the outcome using for instance:
local define xmlaccountsparts "${outcome:0,0}"
Then construct the new XML to insert into UploadXmlTopics as follows:
select filenamepostfix, xml
from   ( select filenamepostfix, listagg(xml, '') xml
         from   ( select substr(md5(xml), 1, ${xmlaccountsparts}) filenamepostfix
                  ,      xml
                  from   xmlaccounts#inmemorystorage
                )
         group by filenamepostfix
       )
And insert this payload into Exact Online using UploadXMLTopics.
What this effectively does is first determine approximately how many files you need, using a logarithm with base 16. MD5 is then used to assign somewhat randomly distributed hexadecimal (base-16) values to each XML topic to upload. Taking a number of left-side characters of the MD5 values equal to the logarithm's outcome yields approximately that number of files, each with approximately the same payload size. The XML is then reconstructed per group. For example, with 160,000 characters of XML in total and a maximum fragment size of 10,000, ceil(log(160000 / 10000, 16)) + 1 = 2, so the first two characters of each MD5 value determine the grouping.
Automatic XML size for Exact Online
Newer releases have an auto-fragment option which does the heavy lifting for you. Use SQL like:
insert into UploadXMLTopics#eol
--
-- Upload seed data into Exact Online.
--
( topic
, payload
, division_code
, orig_system_reference
, fragment_payload_flag
, fragment_max_size_characters
)
select topic
, filecontents
, division_code
, filename
, true
, 10000 /* This one is in characters. You can also specify in number. */
from ...
The loaded fragments can be queried using:
select *
--
-- Check results and reload.
--
from UploadXMLTopicFragments#eol

MySQL table with {"Twitter": 28, "Total": 28, "Facebook": 1}

There is a table with one column, named "info", with content like {"Twitter": 28, "Total": 28, "Facebook": 1}. When I write SQL, I want to test whether "Total" is larger than 10 or not. Could someone help me write the query? (The table name is landslides_7d.)
(This is what I have:)
SELECT * FROM landslides_7d WHERE info.Total > 10;
Thanks.
The data format seems to be JSON. If you have MySQL 5.7 you can use JSON_EXTRACT or the shorthand operator ->. These functions don't exist in older versions. Note that JSON paths are case-sensitive, so the path has to be '$.Total' to match the key in your data.
SELECT * FROM landslides_7d WHERE JSON_EXTRACT(info, '$.Total') > 10;
or
SELECT * FROM landslides_7d WHERE info->'$.Total' > 10;
See http://dev.mysql.com/doc/refman/5.7/en/json-search-functions.html#function_json-extract
Mind that this is a full table scan. On a larger table you will want to create an index, for example via a generated column as sketched below.
If you're on an older version of MySQL, you should add an extra column to your table and manually populate it with the Total value.
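On 5.7, one way to get such an index is a stored generated column; this is only a rough sketch, and the column and index names are made up:
ALTER TABLE landslides_7d
  ADD COLUMN total INT AS (CAST(JSON_EXTRACT(info, '$.Total') AS SIGNED)) STORED,
  ADD INDEX idx_total (total);

-- Queries can then hit the index instead of scanning the table:
SELECT * FROM landslides_7d WHERE total > 10;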
You are probably storing the JSON in a single BLOB or string column. This is very inefficient, since you can't make use of indexes and need to parse the entire JSON structure in every WHERE clause. I'm not sure how much flexibility you need, but if the JSON attributes are relatively fixed, I recommend running a script (Ruby, Python, etc.) over the table contents and storing "Total" in an ordinary column. For example, you could add a new column "total" which contains the Total attribute as an INT.
A side benefit of using a script is that you can catch any improperly formatted JSON, something you can't do in a single query.
You can also keep the "total" column maintained with a trigger (on insert/update of "info"), using the JSON_EXTRACT function referenced in Johannes' answer.
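A minimal sketch of such a trigger (the trigger name is made up), assuming a plain, non-generated INT column named "total" has already been added to the table:
CREATE TRIGGER landslides_7d_total_bi
BEFORE INSERT ON landslides_7d
FOR EACH ROW
SET NEW.total = JSON_EXTRACT(NEW.info, '$.Total');
An analogous BEFORE UPDATE trigger keeps the column in sync when "info" changes.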

Converting CSV string to multiple columns in Apache Drill

Using: Apache Drill
I am trying to bring the following data into a more structured form:
"apple","juice", "box:12,shipment_id:143,pallet:B12"
"mango", "pulp", "box:7,shipment_id:133,pallet:B19,route:11"
"grape", "jam", "box:10"
Desired output:
fruit, product, box_id, shipment_id, pallet_id, route_id
apple,juice, 12, 143, B12, null
mango, pulp, 7, 133, B19, 11
grape, jam, 10, null, null, null
The dataset runs into a couple of GBs. Drill reads the input into three columns, with the whole key:value string ending up in a single column. I have successfully achieved the desired output by performing string manipulation (REGEXP_REPLACE and CONCAT) on the last column, then reading the result as JSON (CONVERT_FROM), and finally separating it into different columns using KVGEN and FLATTEN.
The execution time is quite high due to the regex functions. Is there a better approach?
(PS: the execution time is compared to a PySpark job that produces the same output.)
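For reference, the pipeline described above might look roughly like this in Drill SQL; the storage plugin, file path and regular expressions are assumptions and may need tuning:
SELECT t.fruit,
       t.product,
       FLATTEN(KVGEN(t.attrs)) AS kv   -- one row per key/value pair; pivot afterwards as needed
FROM (
  SELECT columns[0] AS fruit,
         columns[1] AS product,
         CONVERT_FROM(
           CONCAT('{"',
                  REGEXP_REPLACE(REGEXP_REPLACE(columns[2], ',', '","'), ':', '":"'),
                  '"}'),
           'JSON') AS attrs
  FROM dfs.`/path/to/fruits.csv`
) t;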
I do not see any other way to do it 100% within Apache Drill without intermediate storage.
You could try a custom function in Java to make it easier to write.
Since you have already done the work, have you tried saving the data to a Parquet file with the CTAS command? http://drill.apache.org/docs/create-table-as-ctas-command/
This would make subsequent queries a lot faster.
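A hedged sketch of what that could look like; the writable dfs.tmp workspace, the table name and the view name wrapping the existing transformation are all assumptions:
CREATE TABLE dfs.tmp.`fruits_parquet` AS
SELECT fruit, product, box_id, shipment_id, pallet_id, route_id
FROM dfs.tmp.`fruits_view`;   -- a view holding the REGEXP_REPLACE / KVGEN / FLATTEN query
Subsequent queries then read columnar Parquet instead of re-parsing the CSV and JSON on every run.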

PostgreSQL: How to check if any of the elements in a JSON array match criteria

Assuming the following schema in PostgreSQL 9.3:
create table test (data json);
insert into test values ('{
"id" : 1,
"name" : "abc",
"values" : [{"id" : 2, "name" : "cde"}, {"id" : 3, "name" : "def"}]
}'::json);
insert into test values ('{
"id" : 4,
"name" : "efg",
"values" : [{"id" : 5, "name" : "fgh"}, {"id" : 6, "name" : "ghi"}]
}'::json);
What is the best way to query for documents where at least one of the objects in the "values" array satisfies a criterion? The only way I could come up with is this:
select data
from (select
data,
json_array_elements(data->'values')->>'name' as valueName
from test) a
where valueName = 'ghi';
Is there a way to do this without the nested query? In MongoDB I could simply say:
db.test.find({values : {$elemMatch : {name : "ghi"}}});
Well... you could do something like this if you prefer subqueries:
select value
from (
select json_array_elements(data -> 'values')
from test
) s(value)
where value ->> 'name' = 'ghi'
But beyond that there is no built-in function to do what you want. You could easily create your own operator or stored procedure to take care of this, however.
Here's a fiddle btw: http://sqlfiddle.com/#!15/fb529/32
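For example, a small helper function along these lines would hide the unnesting; the function name is made up:
create or replace function json_values_have_name(data json, wanted text)
returns boolean as $$
  select exists (
    select 1
    from   json_array_elements(data -> 'values') elem
    where  elem ->> 'name' = wanted
  );
$$ language sql stable;

-- Usage:
select data from test where json_values_have_name(data, 'ghi');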
PostgreSQL is an object-relational database system and MongoDB is a NoSQL database, so there is little similarity between them (except for the obvious part that they are both used to store data).
You cannot do your search without the nested query. You can create a Postgres function or a custom type (or both) to aid you, but internally a nested query is still needed.
It is very important to understand that structured-data columns in Postgres, like json, are not meant to be used this way. They are mostly a shortcut and a utility for converting data when inserting or selecting, so that the program executing the SQL queries does not need to do extra conversions.
However, you shouldn't be searching inside the fields of structured data like this. It is extremely inefficient: you are putting a big load on your server for no reason and you can't use indexes as efficiently (corrected after Igor's comment). Your database will become unusable with a few thousand rows. (I could say more...) I strongly suggest you take the time to rearrange your data into more columns and tables, so that you can select from them easily with the use of indexes and without nested queries. Then you can use to_json() to get your data in JSON format, along the lines of the sketch below.
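A minimal sketch of that kind of rearrangement, with made-up table and column names:
create table parents  (id int primary key, name text);
create table children (id int primary key,
                       parent_id int references parents(id),
                       name text);
create index on children (name);

-- The search becomes an ordinary indexed join:
select p.*
from   parents p
join   children c on c.parent_id = p.id
where  c.name = 'ghi';

-- And JSON can still be produced on the way out:
select to_json(p) from parents p;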
EDIT/NOTE:
This answer was written when the current version of Postgres was 9.3 and applies to 9.3 and earlier. It is almost certain that Postgres will be able to fully support document storage, with fully indexed and efficient search within elements, in the (near) future. Each upgrade since 9.0 has been a step in that direction.

MySQL INSTR-like operation in MongoDB

Completely new to Mongo here.
I have the following fields in one of my MySQL tables:
id (BIGINT), text (LONGTEXT) # this column contains a long description
I am hoping to move my project from MySQL to MongoDB, but before I do that there is a very crucial query that needs to be resolved.
My current query looks for various terms in the description and returns all ids, e.g.
select id from <table> where instr(<table>.text, 'value') or instr(<table>.text, 'value2')
Is it possible to recreate this in Mongo? If so, how? Right now, using either $or or $in, it seems that I would need to have those specific values in some kind of array in my document.
MongoDB does not natively support full-text search at the moment.
You could use regular expressions, but it would be slow (regex queries cannot use indexes unless the pattern is anchored to the start of the string, e.g. /^value/).
The query would look like:
db.collection.find({ $or: [{description: /value1/}, {description: /value2/}] })
You could do some preprocessing to insert each word into a searchable array of keywords, but if the text is really long you probably don't want to go down this route.