MongoDB vs MySQL Performance - Simple Query

I am doing a comparison of MongoDB with MySQL and have imported the MySQL data into a MongoDB collection (>500,000 records).
The collection looks like this:
{
    "_id" : ObjectId(""),
    "idSequence" : ,
    "TestNumber" : ,
    "TestName" : "",
    "S1" : ,
    "S2" : ,
    "Slottxt" : "",
    "DUT" : ,
    "DUTtxt" : "",
    "DUTver" : "",
    "Voltage" : ,
    "Temperature" : ,
    "Rate" : ,
    "ParamX" : "",
    "ParamY" : "",
    "Result" : ,
    "TimeStart" : new Date(""),
    "TimeStop" : new Date(""),
    "Operator" : "",
    "ErrorNumber" : ,
    "ErrorText" : "",
    "Comments" : "",
    "Pos" : ,
    "SVNURL" : "",
    "SVNRev" : ,
    "Valid" :
}
When comparing the queries (which both return 15 records):
mysql -> SELECT TestNumber FROM db WHERE Valid=0 AND DUT=68 GROUP BY TestNumber
with
mongodb -> db.results.distinct("TestNumber", {Valid:0, DUT:68}).sort()
The results are equivalent, but MongoDB takes roughly 17 seconds, compared with 0.03 seconds for MySQL.
I appreciate that it is difficult to compare the two database architectures, and I further appreciate that one of the skills of a MongoDB admin is to organise the data structure accordingly (so it is not a fair test to simply import the MySQL structure). Ref: MySQL vs MongoDB 1000 reads
But the difference in response time is too great to be just a tuning issue.
My (default) mongodb log file reads:
Wed Mar 05 04:56:36.415 [conn4089] command NTV_Results.$cmd command: { distinct: "results", key: "TestNumber", query: { Valid: 0.0, DUT: 68.0 } } ntoreturn:1 keyUpdates:0 numYields: 6 locks(micros) r:21764672 reslen:250 16525ms
I have also tried the query:
db.results.group({
    key: { "TestNumber": 1 },
    cond: { "Valid": 0, "DUT": 68 },
    reduce: function ( curr, result ) { },
    initial: { }
})
With similar results (about 17 seconds). Any clues as to what I am doing wrong?
Both services are running on the same desktop PC, an Intel i7-3770 (quad-core, 8 threads), with Windows 7 and 16 GB RAM.

There can be many reasons for slow performance, more than can be covered in detail here, but I can offer you a "starter pack", as it were.
Creating indexes on your Valid and DUT fields is going to improve results for these and other queries. Consider the compound form for this case, created with the ensureIndex command:
db.collection.ensureIndex({ "Valid": 1, "DUT": 1})
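To confirm that the query is actually using the new index, you can inspect the query plan with explain() (a quick check; the exact output format varies between MongoDB versions):
db.collection.find({ "Valid": 0, "DUT": 68 }).explain()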
Also the use of aggregate is recommended for these types of operations:
db.collection.aggregate([
    { $match: { "Valid": 0, "DUT": 68 } },
    { $group: { _id: "$TestNumber" } }
])
This should be the equivalent of the SQL you are referring to.
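If you also want the sorted order that distinct(...).sort() gave you, a $sort stage can be appended; a small sketch:
db.collection.aggregate([
    { $match: { "Valid": 0, "DUT": 68 } },
    { $group: { _id: "$TestNumber" } },
    { $sort: { _id: 1 } }
])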
There is a SQL to Aggregation Mapping Chart that may give you some assistance with the thinking. It is also worth familiarizing yourself with the different aggregation operators in order to write effective queries.
I have spent many years writing very complex SQL for advanced tasks, and I find the aggregation framework a breath of fresh air for various problem-solving cases.
Worth your time to learn.
Also worth noting: your "default" MongoDB log file is reporting those operations because they are considered "slow queries" and are therefore brought to your attention by default. You can see more or less information, as you require, by tuning the database profiler to meet your needs.
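For example, a minimal sketch (run against the database in question) that records any operation slower than 100 ms and then inspects the most recent profile entries:
// level 1 = profile slow operations only; the second argument is the slow-ms threshold
db.setProfilingLevel(1, 100)
// profiled operations are written to the system.profile collection of the current database
db.system.profile.find().sort({ ts: -1 }).limit(5)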

Related

sql query for searching within columns with json documents

I am using MySQL 5.7, and one of the columns in my table contains multiple JSON documents, something like:
'[ {
    "animal" : "dog",
    "data" : {
        "body" : "This sentence does not contain anything about grooming"
    }
},
{
    "animal" : "cat",
    "data" : {
        "body" : "No grooming needed"
    }
},
{
    "animal" : "horse",
    "data" : {
        "body" : "He is grooming his horse after the ride."
    }
}
]'
I want to return all rows where $.data.body contains grooming more than once, but only if $.animal == horse. So in the example given above, it should not return the row, since grooming is used only once in the $.data.body section where $.animal == horse.
Is there a good way to query this in MySQL/SQL? I can do it in Python, but I am interested in knowing whether there is a way to do this in SQL/MySQL. Thanks!
Searching JSON requires complex queries, and it is hard to optimize:
SELECT ...
FROM mytable
CROSS JOIN JSON_TABLE(myjsoncolumn, '$[*]' COLUMNS(
    animal varchar(20) PATH '$.animal',
    body text PATH '$.data.body'
)) AS j
WHERE j.animal = 'horse' AND j.body LIKE '%grooming%';
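Note that LIKE '%grooming%' only checks that the word appears at least once, whereas the question asks for more than one occurrence. One common (if blunt) trick is to count occurrences by string replacement; a sketch reusing the j alias from the query above:
WHERE j.animal = 'horse'
  AND (CHAR_LENGTH(j.body) - CHAR_LENGTH(REPLACE(j.body, 'grooming', '')))
      / CHAR_LENGTH('grooming') >= 2;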
The JSON_TABLE() function is available as of MySQL 8.0.4, but not in earlier versions of MySQL.
The bottom line is that if you are trying to search the content of JSON documents, your SQL is going to be a lot more difficult to write and less efficient to run.
This would be far easier if you did not store the data in JSON, but instead stored it in normal rows and columns. From the example you show, there's no reason it needs to be JSON.
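For illustration, here is a minimal normalized sketch; the table and column names are hypothetical:
CREATE TABLE animal_notes (
    id INT AUTO_INCREMENT PRIMARY KEY,
    source_row_id INT NOT NULL,   -- points back to the row that held the JSON
    animal VARCHAR(20) NOT NULL,
    body TEXT,
    INDEX (animal)                -- makes the animal = 'horse' filter cheap
);
SELECT source_row_id
FROM animal_notes
WHERE animal = 'horse' AND body LIKE '%grooming%';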

Recommendation for storing and querying DataFactory run log?

I'd like to store and query the OUTPUT and ERROR data generated during a Data Factory run. The data is returned when calling Get-AzDataFactoryV2ActivityRun.
The intention is to use it to monitor possible pipeline execution errors, durations, etc. in an easy and fast way.
The data resembles JSON format. What would be nice is to visualize a summary of each execution through some HTML. Should I store this log in MongoDB?
Is there an easy and better way to centralize the log info of the multiple executions of different pipelines?
ResourceGroupName : Test
DataFactoryName : DFTest
ActivityRunId : 00000000-0000-0000-0000-000000000000
ActivityName : If Condition1
PipelineRunId : 00000000-0000-0000-0000-000000000000
PipelineName : Test
Input : {}
Output : {}
LinkedServiceName :
ActivityRunStart : 03/07/2019 11:27:21
ActivityRunEnd : 03/07/2019 11:27:21
DurationInMs : 000
Status : Succeeded
Error : {errorCode, message, failureType, target}
Activity 'Output' section:
"firstRow": {
    "col1": 1
},
"effectiveIntegrationRuntime": "DefaultIntegrationRuntime (West Europe)"
This is probably not the best way to monitor your ADF pipelines.
Have you considered using Azure Monitor?
Find out more:
- https://learn.microsoft.com/en-us/azure/data-factory/monitor-using-azure-monitor
- https://learn.microsoft.com/en-us/azure/azure-monitor/visualizations

couchbase N1ql query select with non-group by fields

I am new to Couchbase and I have been going through the Couchbase documentation and other online resources for a while, but I couldn't get my query working. Below are the data structure and my query:
Table1:
{
    "jobId" : "101",
    "jobName" : "abcd",
    "jobGroup" : "groupa",
    "created" : "2018-05-06T19:13:43.318Z",
    "region" : "dev"
},
{
    "jobId" : "102",
    "jobName" : "abcd2",
    "jobGroup" : "groupa",
    "created" : "2018-05-06T22:13:43.318Z",
    "region" : "dev"
},
{
    "jobId" : "103",
    "jobName" : "abcd3",
    "jobGroup" : "groupb",
    "created" : "2018-05-05T19:11:43.318Z",
    "region" : "test"
}
I need to get the jobId which has the latest job information (max on created timestamp) for a given jobGroup and region (group by jobGroup and region).
My SQL query using a self-join on jobId doesn't give me what I need.
Query:
/* Idea is to pull out the job which was executed latest for all possible
   groups and regions, and print the details of that particular job */
select * from (select max(DATE_FORMAT_STR(j.created,'1111-11-11T00:00:00+00:00')) as latest,
                      j.jobGroup, j.region
               from table1 j
               group by jobGroup, region) as viewtable
join table t on keys meta(t).id
where viewtable.latest in t.created
  and t.jobGroup = viewtable.jobGroup
  and viewtable.region = t.region
Error Result: No result displayed
Desired result :
{
    "jobId" : "102",
    "jobName" : "abcd2",
    "jobGroup" : "groupa",
    "latest" : "2018-05-06T22:13:43.318Z",
    "region" : "dev"
},
{
    "jobId" : "103",
    "jobName" : "abcd3",
    "jobGroup" : "groupb",
    "created" : "2018-05-05T19:11:43.318Z",
    "region" : "test"
}
If I understand your query correctly, this can be answered using 'group by' and no join. I tried entering your sample data and the following query gives the correct result:
select max([created,d])[1] max_for_group_region
from default d
group by jobGroup, region;
How does it work? It uses 'group by' to group documents by jobGroup and region, then creates a two-element array holding, for every document in the group:
the 'created' timestamp field
the document where the timestamp came from
It then applies the max function to the set of 2-element arrays. The max of a set of arrays looks for the maximum value in the first array position and, if there's a tie, looks at the second position, and so on. In this case we are getting the two-element array with the max timestamp.
Now we have an array [ timestamp, document ], so we apply [1] to extract just the document.
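To get output closer to the desired result (jobId, jobName, and the timestamp), the same expression can be projected field by field. A sketch under the same assumptions (bucket named default); untested:
SELECT MAX([d.created, d])[1].jobId AS jobId,
       MAX([d.created, d])[1].jobName AS jobName,
       MAX(d.created) AS latest,
       d.jobGroup, d.region
FROM default d
GROUP BY d.jobGroup, d.region;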
I'm seeing some inconsistencies and invalid JSON in your examples, so I'm going to do the best I can. First off, I'm using Couchbase Server 5.5 which provides the new ANSI JOIN syntax. There might be a way to do this in an earlier version of Couchbase Server.
Next, I created an index on the created field: CREATE INDEX ix_created ON bucketname(created).
Then, I use a subquery to get the latest date, aggregated by jobGroup and region. I then join the latest date from this query to the entire bucket and select the fields that (I think) you want in your desired result:
SELECT k.jobId, k.jobName, k.jobGroup, k.created AS latest, k.region
FROM (
    SELECT j.jobGroup, j.region, MAX(j.created) AS latestDate
    FROM so j
    GROUP BY j.jobGroup, j.region
) dt
LEFT JOIN so k ON k.created = dt.latestDate;
Problems with this approach:
If two documents have the exact same date, this isn't a reliable way to determine the latest. You can add a LIMIT 1 to the subquery, which would just pick one arbitrarily, or you could add an ORDER BY to express whatever your preference is.
Subquery performance: I don't know how large your data set is, but this could be pretty slow.
Requires Couchbase Server 5.5, which is currently in beta.
If you are using a different version of Couchbase Server, you may want to consider asking in the Couchbase N1QL Forums for a more expert answer.

MongoDB queries return no results

I'm having a problem with querying a MongoDB dataset ("On Street Crime in Camden" from data.gov.uk).
The database name is Crime_Data_in_Camden and the collection name is Street_Crime_Camden. The query to find all records, db.Street_Crime_Camden.find(), works fine, but anything else returns nothing at all. Here is a portion of the metadata:
{
    "id" : 509935,
    "name" : "Ward Name",
    "dataTypeName" : "text",
    "fieldName" : "ward_name",
    "position" : 13,
    "renderTypeName" : "text",
    "tableColumnId" : 258836,
    "width" : 100,
    "cachedContents" : {
        "largest" : "West Hampstead",
        "non_null" : 79813,
        "null" : 0,
        "top" : [ {
            "item" : "Regent's Park",
            "count" : 20
        }, {
            "item" : "Swiss Cottage",
            "count" : 19
        }, {
            "item" : "Holborn and Covent Garden",
            "count" : 18
        } ]
    }
}
I've tried 3 attempts at a basic query:
db.Street_Crime_Camden.find({"ward_name":"West Hampstead"});
db.Street_Crime_Camden.find({'meta.ward_name':'West Hampstead'});
db.Street_Crime_Camden.find({meta:{ward_name:"West Hampstead"} });
According to every piece of documentation or tutorial that I've seen, any of these approaches should be valid. And I know that there are hundreds of rows (or documents) that match those terms, so why are these queries returning nothing? Advice would be appreciated.
The common theme in the three approaches you tried is some form of ward_name = West Hampstead, but there is no attribute named ward_name in the document you shared with us.
Based on the document you show in your question the only way of addressing an attribute with the value West Hampstead is:
db.Street_Crime_Camden.find({"cachedContents.largest": "West Hampstead"});
For background; you address attributes in your documents by using dot notation so the document you included in your question could be found by any of the following find commands:
db.Street_Crime_Camden.find({"name": "Ward Name"});
db.Street_Crime_Camden.find({"position": 13});
db.Street_Crime_Camden.find({"cachedContents.top.item": "Swiss Cottage"});
db.Street_Crime_Camden.find({"cachedContents.top.1.count": 20});
... etc
These examples might help you to understand how to form find criteria. The MongoDB docs are also useful.
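One general tip: if you are unsure what your documents actually look like, fetching a single one and reading off the real attribute paths is a quick sanity check:
db.Street_Crime_Camden.findOne()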

Is it possible to query JSON data in DynamoDB?

Let's say my JSON looks like this (example provided here):
{
    "year" : 2013,
    "title" : "Turn It Down, Or Else!",
    "info" : {
        "directors" : [
            "Alice Smith",
            "Bob Jones"
        ],
        "release_date" : "2013-01-18T00:00:00Z",
        "rating" : 6.2,
        "genres" : [
            "Comedy",
            "Drama"
        ],
        "image_url" : "http://ia.media-imdb.com/images/N/O9ERWAU7FS797AJ7LU8HN09AMUP908RLlo5JF90EWR7LJKQ7##._V1_SX400_.jpg",
        "plot" : "A rock band plays their music at high volumes, annoying the neighbors.",
        "rank" : 11,
        "running_time_secs" : 5215,
        "actors" : [
            "David Matthewman",
            "Ann Thomas",
            "Jonathan G. Neff"
        ]
    }
}
I would like to query all movies where genres contains Drama.
I went through all of the examples, but it seems that I can query only on the hash key and sort key. I can't have a JSON document as the key itself, as that is not supported.
You cannot. DynamoDB requires that all attributes you are filtering for have an index.
As you want to query independently of your main index, you are limited to Global Secondary Indexes.
The documentation lists the kinds of attributes on which indexes are supported:
The index key attributes can consist of any top-level String, Number, or Binary attributes from the base table; other scalar types, document types, and set types are not allowed.
Your type would be an array of Strings. So this query operation isn't supported by DynamoDB at this time.
You might want to consider other, more flexible NoSQL document databases, such as MongoDB Atlas, if you need this kind of querying functionality.
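Another option is to fall back to a Scan with a filter expression, bearing in mind that a Scan reads the entire table and applies the filter afterwards, so it can be slow and expensive on large tables: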
// genres is a list, so test membership with contains() rather than equality;
// "info" is the top-level map attribute from the sample document
String filterExpression = "contains(info.genres, :param)";
Map<String, Object> valueMap = new HashMap<>();
valueMap.put(":param", "Drama");
ItemCollection<ScanOutcome> scanResult = table
        .scan(new ScanSpec()
                .withFilterExpression(filterExpression)
                .withValueMap(valueMap));
One example that I took from the AWS Developer Forums is as follows.
We got some hints for you from our team: filter/condition expressions for maps have to have the key names at each level of the map specified separately in the ExpressionAttributeNames map.
Your expression should look like this:
{
    "TableName": "genericPodcast",
    "FilterExpression": "#keyone.#keytwo.#keythree = :keyone",
    "ExpressionAttributeNames": {
        "#keyone": "attributes",
        "#keytwo": "playbackInfo",
        "#keythree": "episodeGuid"
    },
    "ExpressionAttributeValues": {
        ":keyone": {
            "S": "podlove-2018-05-02t19:06:11+00:00-964957ce3b62a02"
        }
    }
}