I'm currently testing some databases for my application. The main functionality is data aggregation (similar to this question: Data aggregation mongodb vs mysql).
I'm facing the same problem, so I've created some sample test data. There are no joins on the MySQL side; it's a single InnoDB table. The data set has 1.6 million rows, and I'm doing a sum and a count over the full table, without any filter, so I can compare the performance of each aggregation engine. All the data fits in memory in both cases, and in both cases there is no write load.
With MySQL (5.5.34-0ubuntu0.12.04.1) I consistently get results between 2.03 and 2.10 seconds.
With MongoDB (2.4.8, Linux 64-bit) I consistently get results between 4.1 and 4.3 seconds.
If I do some filtering on indexed fields, the MySQL time drops to between 1.18 and 1.20 seconds (the number of rows processed drops to exactly half the data set).
If I do the same filtering on indexed fields in MongoDB, the time drops only to around 3.7 seconds (again processing half the data set, which I confirmed with an explain on the match criteria).
My conclusion is that:
1) My documents are badly designed (which may well be the case), or
2) The MongoDB aggregation framework really does not fit my needs.
The questions are: what can I do (in terms of specific MongoDB configuration, document modeling, etc.) to make Mongo's results faster? Is this a case that MongoDB is simply not suited for?
My table and document schemas:
CREATE TABLE `events_normal` (
  `origem` varchar(35) DEFAULT NULL,
  `destino` varchar(35) DEFAULT NULL,
  `qtd` int(11) DEFAULT NULL,
  KEY `idx_orides` (`origem`,`destino`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
{
    "_id" : ObjectId("52adc3b444ae460f2b84c272"),
    "data" : {
        "origem" : "GRU",
        "destino" : "CGH",
        "qtdResultados" : 10
    }
}
Wherever indexes and filtering are mentioned, the fields involved are "origem" and "destino".
select sql_no_cache origem, destino, sum(qtd), count(1) from events_normal group by origem, destino;
select sql_no_cache origem, destino, sum(qtd), count(1) from events_normal where origem="GRU" group by origem, destino;
db.events.aggregate( {$group: { _id: {origem: "$data.origem", destino: "$data.destino"}, total: {$sum: "$data.qtdResultados" }, qtd: {$sum: 1} } } )
db.events.aggregate( {$match: {"data.origem":"GRU" } } , {$group: { _id: {origem: "$data.origem", destino: "$data.destino"}, total: {$sum: "$data.qtdResultados" }, qtd: {$sum: 1} } } )
Thanks!
Aggregation is not what MongoDB was originally designed for, so it's not its fastest feature.
If you really want to use MongoDB, you could use sharding so that each shard processes its share of the aggregation (make sure to select the shard key in a way that keeps each group on only one shard, or you will achieve the opposite). This, however, wouldn't be a fair comparison with MySQL anymore, because the MongoDB cluster would use a lot more hardware.
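A rough sketch of what that could look like in the mongo shell (the database name mydb is only an assumption; the shard key simply mirrors the grouping fields from the question):
sh.enableSharding("mydb")
// index on the grouping fields so the collection can be sharded on them
db.events.ensureIndex({ "data.origem": 1, "data.destino": 1 })
sh.shardCollection("mydb.events", { "data.origem": 1, "data.destino": 1 })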
Related
I have a table that contains a JSON array column (nvarchar(max)); it has millions of rows and is expected to grow to billions of rows in the future.
The table structure is like this:
[SnapshotId] - PK,
[BuildingId],
......................
[MeterData],
MeterData contains a JSON array like this:
[{
    "MeterReadingId": 0,
    "BuildingMeterId": 1,
    "Value": 1.0
}, {
    "MeterReadingId": 0,
    "BuildingMeterId": 2,
    "Value": 1.625
}]
I need to filter the "HourlySnapshot" table where, for example, "BuildingMeterId = 255", so I wrote the query below:
SELECT *
FROM [HourlySnapshot] h
CROSS APPLY OPENJSON(h.MeterData)
WITH (BuildingMeterId int '$.BuildingMeterId') AS MeterDataJson
WHERE MeterDataJson.BuildingMeterId = 255
It works fine, but performance is bad due to the JSON parsing. I read that you can overcome the performance issue by creating indexes, so I created a clustered index like the one below:
CREATE CLUSTERED INDEX CL_MeterDataModel
ON [HourlySnapshot] (MeterDataModel)
But I can't see any improvement in terms of speed. Have I done it wrong? What is the best way to improve the speed?
Thanks
The combination of a computed column and an index may help.
ALTER TABLE [HourlySnapshot]
ADD [BuildingMeterId] AS JSON_VALUE([MeterData], '$.BuildingMeterId');
CREATE NONCLUSTERED INDEX IX_ParsedBuildingMeterId ON [HourlySnapshot] (BuildingMeterId)
This causes SQL Server to parse and index the value at insert/update time. When reading, it can use the index instead of doing a full table scan.
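With the computed column in place, a filter on it should be able to seek on the new index instead of parsing JSON per row. A minimal sketch (note that if MeterData really holds an array, as in the sample above, the JSON path may need an element index such as '$[0].BuildingMeterId', or the array may need to be unnested first):
SELECT [SnapshotId], [BuildingId], [MeterData]
FROM [HourlySnapshot]
WHERE [BuildingMeterId] = 255;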
I'm doing some tests on Wikipedia's pagecount data. It consists of around 7 million lines that look like this:
es London 13 173367
The 3rd column is the count, and I want to sum it across articles that have the same name (2nd column). So, on the command line:
paste -sd + <(grep ' London ' pagecounts | cut -d ' ' -f 3) | bc
Which works great and takes 0.53s
I thought that using a DB to query the information would be faster so I loaded it all into a MongoDB database, then:
db["pagecounts"].aggregate({
$match: { "article": "London" }
}, {
$group: { _id: "London", "total": { $sum: "$count" } }
});
This works, but takes a horrifying 8.96s
Confused and disappointed, I turned to MySQL:
SELECT SUM(count) FROM pagecounts WHERE article='London';
Which took 5.08s
I don't know a great deal about the internals of databases, but I wouldn't have thought that command line tools like grep would be faster at this kind of thing. What's going on? And what can be improved?
UPDATE
As Cyrus and Michael suggested, creating an index made this WAY faster: ~0.002s.
As @Cyrus suggested, you need an index.
ALTER TABLE pagecounts ADD KEY (article);
Then try the query again.
You should, while benchmarking, use SELECT SQL_NO_CACHE ... to avoid seeing query times that are deceptively faster than the server will consistently deliver.
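For example, applying that to the query from the question:
SELECT SQL_NO_CACHE SUM(count) FROM pagecounts WHERE article = 'London';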
I have an online game currently using MySQL. I have a Player table looking like this:
create table player (
id integer primary key,
name varchar(50),
score integer
);
I have an index on "score" column and display the rankings like this:
select id, name, score from player order by score desc limit 100
I'd like to migrate my system to Redis (or, if some other NoSQL is more applicable to this kind of problem, please tell). So I wonder what is the way to display this kind of rankings table efficiently?
AFAICT, this could be a Map/Reduce job? I know next to nothing about Map/Reduce; although I've read some docs, I still don't quite understand it, as I haven't been able to find any real-life examples.
Can someone please give me a rough example of how to do the above query in Redis?
In Redis you can use sorted sets (http://redis.io/commands#sorted_set).
Once you have the scored items in a sorted set, you can get the top N by invoking ZREVRANGE players 0 N-1 (ZRANGE returns the lowest scores first, ZREVRANGE the highest).
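A rough sketch with redis-cli, assuming a sorted set named players keyed by player name:
ZADD players 2500 "alice"
ZADD players 1800 "bob"
ZADD players 3200 "carol"
ZREVRANGE players 0 99 WITHSCORES
The last command returns the top 100 members with their scores, highest first.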
Good question - in MongoDB you would have to use the group() function to answer this type of query:
select id, name, score from player order by score desc limit 100
Might look something like this:
db.player.group({
    key: { id: true, name: true },
    reduce: function(obj, prev) { if (prev.cmax < obj.score) prev.cmax = obj.score; },
    initial: { cmax: 0 } // some initial value
});
Using a MapReduce based approach is probably best, see:
http://www.mongodb.org/display/DOCS/Aggregation#Aggregation-Group
http://cookbook.mongodb.org/patterns/finding_max_and_min_values_for_a_key/
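For what it's worth, if all you need is the equivalent of ORDER BY score DESC LIMIT 100, a plain query with a sort and a limit, backed by a descending index on score, may also be enough, without group() or MapReduce. A minimal sketch:
db.player.ensureIndex({ score: -1 })
db.player.find({}, { name: 1, score: 1 }).sort({ score: -1 }).limit(100)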
How do I convert the following into a MongoDB query?
sets_progress = Photo.select('count(status) as count, status, photoset_id')
.where('photoset_id IN (?)', sets_tracked_array)
.group('photoset_id, status')
There is no 1-to-1 mapping of a SQL query to a NoSQL implementation. You'll need to precalculate your data to match the way you want to access it.
If it is small enough, then this query will need to change into a map-reduce job. More here: http://www.mongodb.org/display/DOCS/MapReduce
Here's a decent tutorial that takes a GROUP BY query and converts it to map-reduce: http://www.mongovue.com/2010/11/03/yet-another-mongodb-map-reduce-tutorial/
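A rough sketch of what that map-reduce could look like in the mongo shell, assuming a photos collection with photoset_id and status fields, and sets_tracked_array already bound to a JavaScript array of ids:
db.photos.mapReduce(
    function() { emit({ photoset_id: this.photoset_id, status: this.status }, 1); },
    function(key, values) { return Array.sum(values); },
    {
        query: { photoset_id: { $in: sets_tracked_array } },
        out: { inline: 1 }
    }
)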
What's wrong with:
$term = $_POST['search'];
function buildQuery($exploded,$count,$query)
{
if(count($exploded)>$count)
{
$query.= ' AND column LIKE "%'. $exploded[$count] .'%"';
return buildQuery($exploded,$count+1,$query);
}
return $query;
}
$exploded = explode(' ',$term);
$query = buildQuery($exploded,1,
'SELECT * FROM table WHERE column LIKE "%'. $exploded[0] .'%"');
and then querying the DB to retrieve the results in a certain order, instead of using the MyISAM-only SQL MATCH ... AGAINST?
Would it hurt performance dramatically?
The difference is in the algorithms that MySQL uses behind the scenes to find your data. Full-text searches also allow you to sort based on relevancy. A LIKE search in most conditions is going to do a full table scan, so depending on the amount of data, you could see performance issues with it. The full-text engine can also have performance issues when dealing with large row sets.
On a different note, one thing I would add to this code is something to escape the exploded values. Perhaps a call to mysql_real_escape_string()
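For example, keeping the rest of the code from the question unchanged (this assumes the legacy mysql_* extension, as in the original code):
// escape each search word before it gets interpolated into the SQL
$exploded = array_map('mysql_real_escape_string', explode(' ', $term));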
You can check out the presentation I recently did for MySQL University:
http://forge.mysql.com/wiki/Practical_Full-Text_Search_in_MySQL
Slides are also here:
http://www.slideshare.net/billkarwin/practical-full-text-search-with-my-sql
In my test, using LIKE '%pattern%' was more than 300x slower than using a MySQL FULLTEXT index. My test data was 1.5 million posts from the StackOverflow October data dump.
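For reference, the FULLTEXT approach looks roughly like this (the posts table and body column here are made up; on MySQL versions of that era the table needs to be MyISAM):
ALTER TABLE posts ADD FULLTEXT INDEX ft_body (body);
SELECT * FROM posts WHERE MATCH(body) AGAINST('search terms' IN NATURAL LANGUAGE MODE);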