Why are MongoDB and MySQL slower than grep? - mysql

I'm doing some tests on Wikipedia's pagecount data. This consists of around 7 million lines that look like this:
es London 13 173367
The 3rd column is the count and I want to sum this across articles that have the same name (2nd column). So, on the command line:
paste -sd + <(grep ' London ' pagecounts | cut -d ' ' -f 3) | bc
Which works great and takes 0.53s
I thought that using a DB to query the information would be faster so I loaded it all into a MongoDB database, then:
db["pagecounts"].aggregate({
$match: { "article": "London" }
}, {
$group: { _id: "London", "total": { $sum: "$count" } }
});
This works, but takes a horrifying 8.96s
Confused and disappointed, I turned to MySQL:
SELECT SUM(count) FROM pagecounts WHERE article='London';
Which took 5.08s
I don't know a great deal about the internals of databases, but I wouldn't have thought that command line tools like grep would be faster at this kind of thing. What's going on? And what can be improved?
UPDATE
As Cyrus and Michael suggested, creating an index made this WAY faster: ~0.002s.

As Cyrus has suggested, you need an index.
ALTER TABLE pagecounts ADD KEY (article);
Then try the query again.
While benchmarking, use SELECT SQL_NO_CACHE ... to avoid seeing query times that are deceptively faster than the server will consistently deliver.
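As an aside, the effect of the index can be sketched offline. This is a toy reproduction with SQLite standing in for MySQL and synthetic data (not the asker's actual pagecount dump): the same SUM query runs before and after CREATE INDEX, which turns a full table scan into a B-tree lookup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pagecounts (article TEXT, count INTEGER)")
# Synthetic rows: 1000 distinct article names, 20 rows each.
rows = [("art%d" % (i % 1000), i % 7) for i in range(20000)]
conn.executemany("INSERT INTO pagecounts VALUES (?, ?)", rows)

# Without an index this query scans every row...
total_before = conn.execute(
    "SELECT SUM(count) FROM pagecounts WHERE article = 'art42'").fetchone()[0]

# ...after adding an index, matching rows are found via a B-tree lookup.
conn.execute("CREATE INDEX idx_article ON pagecounts (article)")
total_after = conn.execute(
    "SELECT SUM(count) FROM pagecounts WHERE article = 'art42'").fetchone()[0]

assert total_before == total_after  # same answer, very different work
```

The result is identical either way; only the access path (and therefore the timing) changes, which is exactly what the ~5s to ~0.002s drop above reflects.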

Related

Two ways to select ranges in SQL, only one in MongoDB?

I have the following SELECT statement for SQL.
SELECT TransAmount FROM STOCK WHERE TransAmount between 100 and 110;
However, this statement generates an error from querymongo.com. It says "Failure parsing MySQL query: Unable to parse WHERE clause due to unrecognized operator". I assume it is talking about the BETWEEN clause.
Correct me if I'm wrong, but does this SQL statement do the exact same thing as the one above?
SELECT TransAmount FROM STOCK WHERE TransAmount > 100 and TransAmount < 110;
This statement generates the following MongoDB code.
db.STOCK.find({
"TransAmount": {
"$gt": 100,
"$lt": 110
}
}, {
"TransAmount": 1
});
It looks like MongoDB doesn't have a 'between' operator. Does MongoDB handle selection within ranges with a different keyword, or do you have to set it up with $gt/$lt like this?
BETWEEN is just a shortcut (a sort of symlink) for your second query; I guess it makes life easier.
MongoDB has not implemented such a shortcut yet. I have looked around a bit for a JIRA ticket requesting such an operator, but no luck.
The one and only way of doing ranges in MongoDB is to use $gt and $lt (you could count $in etc. but that is a different kind of range, not what you're looking for).
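One caveat worth spelling out: SQL's BETWEEN is inclusive on both ends, so the exact MongoDB equivalent of the first query is {"$gte": 100, "$lte": 110}, not $gt/$lt as in the translated query above. A small Python sketch of the two semantics on toy data:

```python
# SQL's BETWEEN 100 AND 110 includes both endpoints; $gt/$lt excludes them.
amounts = [99, 100, 105, 110, 111]

between = [a for a in amounts if 100 <= a <= 110]  # BETWEEN / $gte+$lte semantics
gt_lt = [a for a in amounts if 100 < a < 110]      # $gt/$lt semantics

print(between)  # [100, 105, 110]
print(gt_lt)    # [105]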

MySQL Vs MongoDB aggregation performance

I'm currently testing some databases for my application. The main functionality is data aggregation (similar to this guy here: Data aggregation mongodb vs mysql).
I'm facing the same problem. I've created some sample test data. There are no joins on the MySQL side; it's a single InnoDB table. It's a 1.6 million row data set and I'm doing a sum and a count on the full table, without any filter, so I can compare the performance of the aggregation engine of each one. All data fits in memory in both cases. In both cases, there is no write load.
With MySQL (5.5.34-0ubuntu0.12.04.1) I'm getting results always around 2.03 and 2.10 seconds.
With MongoDB (2.4.8, linux 64bits) I'm getting results always between 4.1 and 4.3 seconds.
If I do some filtering on indexed fields, MySQL result time drops to around 1.18 and 1.20 (the number of rows processed drops to exactly half the dataset).
If I do the same filtering on indexed fields on MongoDB, the result time drops only to around 3.7 seconds (again processing half the dataset, which I confirmed with an explain on the match criteria).
My conclusion is that:
1) My documents are extremely badly designed (which may well be true), or
2) The MongoDB aggregation framework really does not fit my needs.
The questions are: what can I do (in terms of specific MongoDB configurations, document modeling, etc.) to make Mongo's results faster? Is this a case that MongoDB is not suited to?
My table and document schemas:
CREATE TABLE `events_normal` (
  `origem` varchar(35) DEFAULT NULL,
  `destino` varchar(35) DEFAULT NULL,
  `qtd` int(11) DEFAULT NULL,
  KEY `idx_orides` (`origem`,`destino`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
{
    "_id" : ObjectId("52adc3b444ae460f2b84c272"),
    "data" : {
        "origem" : "GRU",
        "destino" : "CGH",
        "qtdResultados" : 10
    }
}
The indexed and filtered fields when mentioned are "origem" and "destino".
select sql_no_cache origem, destino, sum(qtd), count(1) from events_normal group by origem, destino;
select sql_no_cache origem, destino, sum(qtd), count(1) from events_normal where origem="GRU" group by origem, destino;
db.events.aggregate( {$group: { _id: {origem: "$data.origem", destino: "$data.destino"}, total: {$sum: "$data.qtdResultados" }, qtd: {$sum: 1} } } )
db.events.aggregate( {$match: {"data.origem":"GRU" } } , {$group: { _id: {origem: "$data.origem", destino: "$data.destino"}, total: {$sum: "$data.qtdResultados" }, qtd: {$sum: 1} } } )
Thanks!
Aggregation is not what MongoDB was originally designed for, so it's not its fastest feature.
If you really want to use MongoDB, you could use sharding so that each shard can process its share of the aggregation (make sure to select the shard key in a way that each group lands on only one shard, or you will achieve the opposite). This, however, wouldn't be a fair comparison to MySQL anymore, because the MongoDB cluster would be using a lot more hardware.
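For reference, the computation both engines are benchmarked on (GROUP BY origem, destino with a SUM and a COUNT) can be sketched in plain Python on toy rows; this is just to make the workload concrete, not a claim about either engine's internals:

```python
from collections import defaultdict

# Toy rows mirroring the events_normal schema: (origem, destino, qtd).
rows = [
    ("GRU", "CGH", 10),
    ("GRU", "CGH", 5),
    ("GRU", "GIG", 7),
    ("CGH", "GRU", 3),
]

# (origem, destino) -> [sum(qtd), count(*)], like the $group stage / GROUP BY.
totals = defaultdict(lambda: [0, 0])
for origem, destino, qtd in rows:
    totals[(origem, destino)][0] += qtd
    totals[(origem, destino)][1] += 1

print(dict(totals))
# {('GRU', 'CGH'): [15, 2], ('GRU', 'GIG'): [7, 1], ('CGH', 'GRU'): [3, 1]}
```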

Mysql (SQL RDB) vs MongoDB (NoSQL)for storing/querying url parameters from GET requests

Given billions of the following variable length URLs, where the number of parameters depends on the parameter "type":
test.com/req?type=x&a=1&b=test
test.com/req?type=x&a=2&b=test2
test.com/req?type=y&a=4&b=cat&c=dog&....z=0
I would like to extract and store its parameters in a database to basically execute queries like "get number of occurrences of each possible value for parameter "a" when "type" is x" as fast as possible, taking into account that:
There are 100 possible values for "type".
There will NOT be concurrent writes/reads in the DB. First I fill the DB, then I execute queries.
There will be ~10 clients querying the DB.
There is only one machine for storing the DB (no clusters/ distributed computing)
Which of the following options for the DB would be the fastest option?
1) MySQL using an EAV pattern
table 1
columns: id, type.
rows:
0 | x
1 | x
2 | y
table 2
columns: table1_id, param, value
rows:
0 | a | 1
0 | b | test
2) NoSql (mongoDb)
Please feel free to suggest any other option.
Thanks in advance.
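For concreteness, here is how the example query ("occurrences of each value of 'a' when 'type' is x") would run against the EAV layout of option 1, with SQLite standing in for MySQL and hypothetical table/column names (requests/params rather than table1/table2):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE requests (id INTEGER PRIMARY KEY, type TEXT)")
conn.execute("CREATE TABLE params (request_id INTEGER, param TEXT, value TEXT)")

conn.executemany("INSERT INTO requests VALUES (?, ?)",
                 [(0, "x"), (1, "x"), (2, "y")])
conn.executemany("INSERT INTO params VALUES (?, ?, ?)",
                 [(0, "a", "1"), (0, "b", "test"),
                  (1, "a", "2"), (1, "b", "test2"),
                  (2, "a", "4")])

# Count occurrences of each value of "a" where type = 'x':
# the EAV pattern forces a join from the entity table to the attribute table.
rows = conn.execute("""
    SELECT p.value, COUNT(*)
    FROM requests r JOIN params p ON p.request_id = r.id
    WHERE r.type = 'x' AND p.param = 'a'
    GROUP BY p.value
""").fetchall()
print(dict(rows))  # {'1': 1, '2': 1}
```

The join plus GROUP BY is the cost you pay for EAV flexibility; with billions of rows, indexes on (request_id), (param, value) and (type) would be essential.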
I think you can try ElasticSearch. It's a very fast search engine which can be used as a document-oriented (JSON) NoSQL database. If insertion speed does not play a decisive role, it will be a good solution for your problem.
The structure of the JSON document would be: {url: "your url", type: "type from url", params: {a:"val", b:"val"...}} or, more simply, {url: "your url", type: "type from url", a:"val", b:"val"...}
The size of params is not fixed, because it's schema-free.
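Whichever store you pick, the extraction step is the same. A sketch of parsing the URLs into such documents and answering the example query with only the Python standard library (toy URLs, not real traffic):

```python
from collections import Counter
from urllib.parse import parse_qs, urlparse

urls = [
    "http://test.com/req?type=x&a=1&b=test",
    "http://test.com/req?type=x&a=2&b=test2",
    "http://test.com/req?type=x&a=1&b=test3",
    "http://test.com/req?type=y&a=4&b=cat&c=dog",
]

# Occurrences of each value of "a" where type == "x".
counts = Counter()
for url in urls:
    params = parse_qs(urlparse(url).query)  # e.g. {'type': ['x'], 'a': ['1'], ...}
    if params.get("type") == ["x"]:
        counts[params["a"][0]] += 1

print(counts)  # Counter({'1': 2, '2': 1})
```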

Select MySQL vs find MongoDB

I have 2 DBs: one in MySQL and one in MongoDB, with the same data inside...
I run the following in MySQL:
SELECT tweet.testo FROM tweet WHERE tweet.testo LIKE '%pizza%'
and this is the result:
1627 rows in set (2.79 sec)
but if I execute in Mongo:
db.tweets.find({text: /pizza/ }).explain()
this is the result:
"nscannedObjects" : 1606334,
"n" : 1169,
or if I execute:
db.tweets.find({text: /pizza/i }).explain()
this is the result:
"nscannedObjects" : 1606334,
"n" : 1641,
Why is the number of rows/documents returned by MySQL/Mongo different?
Why is the number of rows/documents returned by MySQL/Mongo different??
There could be 1000000000000000 reasons, including the temperature of the sun on that particular day.
MongoDB and MySQL are two completely separate techs, so if you expect to keep both in sync you will need some kind of replicator between the two. You have not made us aware of whether this is the case.
Also, we have no idea of your coding, server setup, network setup and everything else, so really we cannot even begin to answer this.
A good answer would be to say that the reason you are seeing this is that the data in the two is different...
As for the difference between:
db.tweets.find({text: /pizza/ }).explain()
and
db.tweets.find({text: /pizza/i }).explain()
This is because MySQL, by default, compares strings case-insensitively (I believe), while MongoDB (I know) does not, so its regex match is case-sensitive (the trailing i flag makes it case-insensitive).
However about replicators, here is a good one: https://docs.continuent.com/wiki/display/TEDOC/Replicating+from+MySQL+to+MongoDB
the MySQL command
SELECT tweet.testo FROM tweet WHERE tweet.testo LIKE '%pizza%'
is equivalent to MongoDB's
db.tweets.find({text: /pizza/i })
I realized they both contain the same data, but in some cases the text in MySQL was cut off, which resulted in fewer rows being returned.
To begin with, your SQL query LIKE '%pizza%' may not pick up entries that begin with the string 'pizza' because of the wildcard on the front. Try the following SQL query to rule out any syntactical differences between the matching logic in SQL and the regex used by MongoDB:
SELECT tweet.testo FROM tweet WHERE lower(tweet.testo) LIKE '%pizza%' OR lower(tweet.testo) LIKE 'pizza%'
Disclaimer: I don't have MySQL in front of me just now so can't verify the leading-wildcard behaviour described above; however, this is consistent with other RDBMSs, so it's worth checking.
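The case-sensitivity point can at least be checked in isolation; a Python sketch contrasting /pizza/ with /pizza/i on toy tweets, mirroring why the two Mongo queries above return different n values:

```python
import re

tweets = ["I love pizza", "PIZZA party", "Pizza time", "no match here"]

# Case-sensitive, like Mongo's /pizza/ with MongoDB's default matching.
sensitive = [t for t in tweets if re.search(r"pizza", t)]
# Case-insensitive, like /pizza/i or MySQL's default collation.
insensitive = [t for t in tweets if re.search(r"pizza", t, re.IGNORECASE)]

print(len(sensitive))    # 1
print(len(insensitive))  # 3
```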

Mysql match...against vs. simple like "%term%"

What's wrong with:
$term = $_POST['search'];

function buildQuery($exploded, $count, $query)
{
    if (count($exploded) > $count) {
        $query .= ' AND column LIKE "%' . $exploded[$count] . '%"';
        return buildQuery($exploded, $count + 1, $query);
    }
    return $query;
}

$exploded = explode(' ', $term);
$query = buildQuery($exploded, 1,
    'SELECT * FROM table WHERE column LIKE "%' . $exploded[0] . '%"');
and then query the DB to retrieve the results in a certain order, instead of using the MyISAM-only SQL MATCH ... AGAINST?
Would it degrade performance dramatically?
The difference is in the algorithms that MySQL uses behind the scenes to find your data. Fulltext searches also allow you to sort based on relevancy. The LIKE search in most conditions is going to do a full table scan, so depending on the amount of data, you could see performance issues with it. The fulltext engine can also have performance issues when dealing with large row sets.
On a different note, one thing I would add to this code is something to escape the exploded values. Perhaps a call to mysql_real_escape_string().
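Better than escaping is to not concatenate user input at all. A sketch of the same query builder with bound parameters, using SQLite and hypothetical table/column names (t/col) for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col TEXT)")
conn.executemany("INSERT INTO t VALUES (?)",
                 [("foo bar",), ("foo baz",), ("qux",)])

def build_query(terms):
    # One LIKE clause per search term, with ? placeholders instead of
    # string concatenation, so user input cannot break out of the query.
    where = " AND ".join("col LIKE ?" for _ in terms)
    params = ["%" + t + "%" for t in terms]
    return "SELECT * FROM t WHERE " + where, params

sql, params = build_query("foo ba".split())
print(conn.execute(sql, params).fetchall())  # [('foo bar',), ('foo baz',)]
```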
You can check out the presentation I recently did for MySQL University:
http://forge.mysql.com/wiki/Practical_Full-Text_Search_in_MySQL
Slides are also here:
http://www.slideshare.net/billkarwin/practical-full-text-search-with-my-sql
In my test, using LIKE '%pattern%' was more than 300x slower than using a MySQL FULLTEXT index. My test data was 1.5 million posts from the StackOverflow October data dump.
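To get a feel for the difference without a MySQL server, SQLite's FTS5 extension provides the same kind of inverted index as a FULLTEXT index; this is a stand-in sketch, not the benchmark above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# FTS5 builds an inverted index over the text, like MySQL's FULLTEXT.
conn.execute("CREATE VIRTUAL TABLE posts USING fts5(body)")
conn.executemany("INSERT INTO posts VALUES (?)",
                 [("pizza is great",), ("I had pasta",), ("pizza again",)])

# An index lookup, analogous to MATCH ... AGAINST, instead of scanning
# every row the way LIKE '%pizza%' must.
rows = conn.execute(
    "SELECT body FROM posts WHERE posts MATCH 'pizza'").fetchall()
print(sorted(rows))  # [('pizza again',), ('pizza is great',)]
```

On a handful of rows the timing difference is invisible, but the access path is the one that produced the 300x gap on the 1.5 million post test set.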