Here is my sample query:
SELECT userid, COUNT(*)
FROM hits
GROUP BY userid
My table is something like this
id | userid | time ...etc
Where id is the primary key and I use this table to store every visit on a page.
Which means my table has 200,000+ rows.
For a given userid, let's say X, I want to find out its rank in that query's result, i.e. how many users have visited the page more often than the user with that userid.
I know there are many questions like this, but they aren't the same because my query has a GROUP BY.
I tried quite a few answers here; some don't return anything at all, while others take 5-10 minutes. I need it to be faster.
If anything is unclear, please ask in the comments.
Thanks
COUNT/GROUP BY in a query that's expected to return multiple rows will get progressively slower, because the query still has to touch every row in the table. Generally, if you expect to run reports like this often and you expect your table to keep growing, you should start rolling that value up into a cached value (keep storing every individual hit, but also increment a counter on that user's row in the users table). This also raises the question of whether you have an index on your userid column and a foreign key into your users table, which should speed the query up considerably.
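If you do want to compute the rank directly, here is a sketch (X is just a placeholder for the userid in question; the rank is 1 plus the number of users with strictly more visits):

SELECT COUNT(*) + 1 AS user_rank
FROM (
    SELECT userid, COUNT(*) AS visits
    FROM hits
    GROUP BY userid
) AS per_user
WHERE per_user.visits > (SELECT COUNT(*) FROM hits WHERE userid = X);

-- The index that lets both counts come straight from the index, if it doesn't already exist:
ALTER TABLE hits ADD INDEX idx_hits_userid (userid);

With that index in place, both the per-user totals and the single user's total can be read from the index without touching the base rows.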
I have a table 'user_plays_track' that keeps track of how many times a user has 'played' a track.
I use the following query to either insert a new track a user has played, or update the number of times an existing track has been played:
INSERT INTO user_plays_track (user_id, track_id)
VALUES (x, y)
ON DUPLICATE KEY UPDATE play_count = play_count + 1
Here is the structure of my table:
user_id  track_id  play_count
1        5         2
4        2         1
3        5         7
From this information, I can infer things such as the total number of times a track has been played, or the total number of plays an artist has had, by finding the sum of all the track counts.
With a thousand or so records this would soon become messy and the semantics unclear. What I wish to do is use triggers to produce what could be described as a cache.
For example, when a record is updated or inserted into 'user_plays_track', the 'tracks' table will increment its play_count column, indicating the total number of plays from all users for that track.
track_id  artist_id  track_name  play_count
2         1          Hey         1
5         1          Test        9
Furthering this, another trigger should be applied to derive further totals, such as the total number of plays for an artist. This would again be triggered when a new track play is recorded; it would find the artist_id the track belongs to and update the 'artist' table accordingly.
artist_id  artist_name  play_count
1          Bob          10
How would I go about implementing the relevant triggers to provide incrementing totals when a user 'plays' a track?
The more you want to calculate at query time, the more you want views, calculated columns and stored or user routines. The more you want to calculate at normalized base update time, the more you want cascades and triggers. The more you want to calculate at some other (scheduled or ad hoc) time, the more you use snapshots aka materialized views and updated denormalized bases. You can combine these. Any time the database is accessed, it can be enabled by and restricted by stored routines or another API.
Until you can show that they are inadequate, views and calculated columns are the simplest.
The whole idea of a DBMS is to store a representation of your application state as the database (which normalization reduces the redundancy of) and then you query and let the DBMS implement and optimize calculation of the answer. You haven't presented a reason for not doing that in the most straightforward way possible.
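For instance, the simplest version of that advice here would be plain views over the normalized user_plays_track table rather than trigger-maintained counters (a sketch only, assuming the table and column names from the question):

CREATE VIEW track_play_totals AS
SELECT track_id, SUM(play_count) AS play_count
FROM user_plays_track
GROUP BY track_id;

CREATE VIEW artist_play_totals AS
SELECT t.artist_id, SUM(u.play_count) AS play_count
FROM user_plays_track u
JOIN tracks t ON t.track_id = u.track_id
GROUP BY t.artist_id;

Querying either view gives the same totals the triggers would have cached, and there is nothing to keep in sync.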
I have a table 'tbl', something like this:
ID bigint(20) - primary key, autoincrement
field1
field2
field3
That table has 600k+ rows.
Query 1: SELECT * FROM tbl ORDER BY ID LIMIT 600000, 1 -- takes 1.68 seconds
Query 2: SELECT ID, field1 FROM tbl ORDER BY ID LIMIT 600000, 1 -- takes 1.69 seconds
Query 3: SELECT ID FROM tbl ORDER BY ID LIMIT 600000, 1 -- takes 0.16 seconds
Query 4: SELECT * FROM tbl WHERE ID = xxx -- takes 0.005 seconds
Those queries are tested in phpmyadmin.
And the result is that queries 3 and 4 together return the necessary data.
Query 1 does the same job, but much slower.
This doesn't look right to me.
Could anyone give any advice?
P.S. I'm sorry for the formatting; I'm new to this site.
New test:
Q5: CREATE TEMPORARY TABLE tmptable AS (SELECT ID FROM tbl ORDER BY ID LIMIT 600030, 30);
SELECT * FROM tbl WHERE ID IN (SELECT ID FROM tmptable); takes 0.38 sec
I still don't understand how that's possible. I recreated all the indexes... what else can I do with that table? Delete and refill it manually? :)
Query 1 looks at the table's primary key index, finds the correct 600,000 ids and their corresponding locations within the table, then goes to the table and fetches everything from those 600k locations.
Query 2 looks at the table's primary key index, finds the correct 600k ids and their corresponding locations within the table, then goes to the table and fetches whichever subset of fields are asked for from those 600k rows.
Query 3 looks at the table's primary key index, finds the correct 600k ids, and returns them. It doesn't need to look at the table at all.
Query 4 looks at the table's primary key index, finds the single entry requested, goes to the table, reads that single entry, and returns it.
Time-wise, let's build backwards:
(Q4) The table index allows lookup of a key (id) in O(log n) time, meaning every time the table doubles in size it only takes one extra step to find the key in the index*. If you have 1 million rows it would only take ~20 steps to find it. A billion rows? 30 steps. The index entry includes data on where in the table to go to find the data for that row, so MySQL jumps to that spot in the table and reads the row. The time reported for this is almost entirely overhead.
(Q3) As I mentioned, the table index is very fast; this query finds the first entry and just traverses the tree until it has the requested number of rows. I'm sure I could calculate the precise number of steps it would take, but as a maximum we'll say 20 steps x 600k rows = 12M steps; since it's traversing a tree it would likely be more like 1M steps, but the precise number is largely irrelevant. The most important thing to realize here is that once MySQL has walked the index to pull the ids it needs, it has everything you asked for. There's no need to go look at the table. The time reported for this one is essentially the time it takes MySQL to walk the index.
(Q2) This begins with the same tree-walking as discussed for query 3, but while pulling the IDs it needs, MySQL also pulls their location within the table files. It then has to go to the table file (probably already cached/mmapped in memory), and for every entry it pulled, seek to the proper place in the table and get the fields requested out of those rows. The time reported for this query is the time it takes to walk the index (as in Q3) plus the time to visit every row specified in the index.
(Q1) This is identical to Q2 when all fields are specified. As the time is essentially identical to Q2, we can see that it doesn't really take measurably more time to pull more fields out of the database, any time there is dwarfed by crawling the index and seeking to the rows.
*: Most databases use an indexing data structure (B-trees for MySQL) that has a log base much higher than 2, meaning that instead of an extra step every time the table doubles, it's more like an extra step every time the table size goes up by a factor of hundreds to thousands. This means that instead of the 20-30 steps I stated in the example, it's more like 2-5.
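One practical takeaway (a sketch, not something from the original post): if you do need full rows at a large offset, you can approximate the Q3 + Q4 behaviour in a single statement by walking the index first and joining back only for the ids you keep. This is essentially what the temporary-table test (Q5) above does in two steps.

SELECT t.*
FROM tbl t
JOIN (
    SELECT ID FROM tbl ORDER BY ID LIMIT 600000, 1
) AS page ON page.ID = t.ID;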
My database knowledge is reasonable, I would say. I'm using MySQL (InnoDB) for this and have done some Postgres work as well. Anyway...
I have a large number of Yes or No questions.
A large number of people can contribute to the same poll.
A user can choose either option, and this will be recorded in the database.
Users can change their mind later and swap choices, which will require an update to the stored data.
My current plan for storing this data:
POLLID, USERID, DECISION, TIMESTAMP
Obviously user data is in another table.
To add their choice, I would have to query to see if they have voted before, then insert if not, otherwise update.
If I want to see the poll results, I would need to iterate through all the decisions (albeit an indexed portion) every time someone views the poll.
My questions are:
Is there any more efficient way to store/query this?
Would I have an index on POLLID, or POLLID & USERID (maybe just a unique constraint)? Or other?
Additional side question: Why don't I have the option to choose HASH vs. BTREE indexes on my tables like I would in Postgres?
The design sounds good, a few ideas:
A table for polls: poll id, question.
A table for choices: choice id, text.
A table to link polls to choices: poll id->choice ids.
A table for users: user details, user ids.
A votes table: (user id, poll id), choice id, time stamp (the bracketed pair is a unique key; see the sketch below).
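A minimal sketch of that votes table, assuming MySQL/InnoDB and hypothetical column names:

CREATE TABLE votes (
    user_id   INT UNSIGNED NOT NULL,
    poll_id   INT UNSIGNED NOT NULL,
    choice_id INT UNSIGNED NOT NULL,
    voted_at  TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (user_id, poll_id)   -- one row per user per poll
) ENGINE=InnoDB;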
Inserting/updating for a single user will work fine, as you can just check if an entry exists for the user id and the poll id.
You can get the results much more easily than by iterating: use COUNT.
e.g.: SELECT COUNT(*) FROM votes WHERE pollid = id AND decision = choiceid
That would tell you how many people voted for "choiceid" in the poll "pollid".
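Or, to get the totals for every choice in a poll in one pass (using the question's column names):

SELECT decision, COUNT(*) AS votes
FROM votes
WHERE pollid = 123   -- the poll being displayed
GROUP BY decision;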
Late Edit:
This is a way of inserting if it doesn't exist and updating if it does:
IF EXISTS (SELECT * FROM TableName WHERE UserId='Uid' AND PollId = 'pollid')
UPDATE TableName SET (set values here) WHERE UserId='Uid' AND PollId = 'pollid'
ELSE
INSERT INTO TableName VALUES (insert values here)
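In MySQL specifically, since IF ... ELSE only works inside stored programs, the same insert-or-update can be collapsed into one statement, assuming a unique key on (user id, poll id) as above (the names are the hypothetical ones from the earlier sketch):

INSERT INTO votes (user_id, poll_id, choice_id, voted_at)
VALUES (42, 7, 2, NOW())
ON DUPLICATE KEY UPDATE
    choice_id = VALUES(choice_id),
    voted_at  = NOW();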
THE INFO
Currently I have two tables I am working with: a POSTS table that holds data for individual posts, and a FAVORITES table that holds data for users who opt to save favorite posts in their profile.
The tables look like this:
On the POSTS table there is only a primary key on id, no indexes that I have set. On FAVORITES I have a combined index of (postid, deviceid) that I was testing.
The POSTS table contains approx. 10,000 entries.
The FAVORITES table contains approx. 4,680,500 entries.
The query I use to grab the favorites from a particular deviceid is:
SELECT post FROM POSTS
WHERE id IN
(SELECT postid FROM favourites WHERE deviceid="12d4a4a4a4a4a4a");
THE PROBLEM:
With the amount of data being returned, and several devices having multiple favorites, it can take upwards of 7-10 seconds both to COUNT favorites for a particular device and to SELECT using the above query and subquery. When this happens during peak times, you can imagine the issues it causes.
Caching the query results is an option, but since the same user is not calling the query multiple times (each call is effectively a unique instance), I think there is a better solution. The cache would also need to be short-lived, which would nullify its benefit.
I know about indexing, and I am familiar with foreign keys, but I'm not sure how, practically, they could be applied to the query and the subquery to improve performance.
Any advice/guidance is much appreciated.
Cheers,
Jared
SELECT post FROM POSTS
INNER JOIN favourites ON POSTS.id = favourites.postid
WHERE favourites.deviceid = "12d4a4a4a4a4a4a";
Split the index on favourites into two indexes: one on deviceid and one on postid.
Why use a subquery? Have you tried a join?
SELECT post FROM posts INNER JOIN favourites ON posts.id=favourites.postid WHERE deviceid="12d4a4a4a4a4a4a"
You won't be using (only) your indices to retrieve the query results since the post field is not in any index. So you might actually end up saving time by making one query to get all the matching IDs from posts, then a second to get the post values.
Using EXPLAIN SELECT... will also help you optimize this query. Have you tried that?
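For example, just prefix the join above with EXPLAIN (the output format varies by MySQL version, but the key and rows columns will show whether an index is being used):

EXPLAIN
SELECT post
FROM posts
INNER JOIN favourites ON posts.id = favourites.postid
WHERE favourites.deviceid = "12d4a4a4a4a4a4a";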
On MySQL, composite indexes can only be used in the order the keys are defined. So for the index (postid, deviceid), you can only use the index if you have a postid and need the deviceid. In your query you're doing the opposite: you have a constant deviceid and want the corresponding postids. So your query is not using the index at all.
More information on mysql composite indexes.
You should either add a deviceid index or reverse the index so that it's (deviceid, postid).
By the way, your favorites table looks a lot like a junction table. Consider whether you need the id column at all.
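A sketch of that change (the existing index name here is a guess; check SHOW INDEX FROM favourites for the real one):

ALTER TABLE favourites
    DROP INDEX idx_post_device,               -- hypothetical name of the (postid, deviceid) index
    ADD INDEX idx_device_post (deviceid, postid);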
A couple of things you could do to improve performance:
Separate the device_id out to a device table with a surrogate primary key (an int) and a non-clustered index on the device_id varchar. The favorites table should only include the device table surrogate key. This should make the favorites table smaller and should make your favorites table index smaller. The smaller the index and smaller the table, the faster it will be to search.
Your favorites table index is wrong. It should not be (post_id,device_id). It should be (device_id,post_id) as your query needs to search by device_id first. As your favorites table row is so small, I question the value of including the post_id in the index. It just isn't worth the extra space for a possible marginal improvement in query speed.
EDIT: You need the post_id in the index to keep the entries unique (just make sure device_id is first).
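A rough sketch of the first suggestion, with hypothetical names and types (adjust to your real data):

CREATE TABLE devices (
    device_pk INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    device_id VARCHAR(64)  NOT NULL,
    UNIQUE KEY uq_device_id (device_id)
) ENGINE=InnoDB;

-- favourites then stores the small integer key, and the composite index leads with it
ALTER TABLE favourites
    ADD COLUMN device_pk INT UNSIGNED NOT NULL,
    ADD INDEX idx_fav_device_post (device_pk, postid);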
Here is what I'm trying to do, explained as a query:
DELETE FROM table ORDER BY dateRegistered DESC LIMIT 1000 *
I want to run such a query in a script I have already designed. Every time it finds older records, i.e. the 1001st record onwards, it deletes them.
So it's kind of like setting a max row count, but deleting all the older records.
Actually, is there a way to set that up in the CREATE statement?
Therefore: if I have 9023 rows in the database, when I run that query it should delete 8023 rows and leave me with 1000.
If you have a unique ID for the rows, here is the theoretically correct way, but it is not very efficient (not even if you have an index on the dateRegistered column):
DELETE FROM table
WHERE id NOT IN (
SELECT id FROM table
ORDER BY dateRegistered DESC
LIMIT 1000
)
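Note that MySQL itself refuses to modify a table it is also selecting from in a subquery (error 1093); a common workaround, sketched here only, is to wrap the subquery in a derived table so it gets materialized first:

DELETE FROM table
WHERE id NOT IN (
    SELECT id FROM (
        SELECT id FROM table
        ORDER BY dateRegistered DESC
        LIMIT 1000
    ) AS keep_rows
);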
I think you would be better off by limiting the DELETE directly by date instead of number of rows.
I don't think there is a way to set that up in the CREATE TABLE statement, at least not a portable one.
The only way that immediately occurs to me for this exact job is to do it manually.
First, get a lock on the table. You don't want the row count changing while you're doing this. (If a lock is not practical for your app, you'll have to work out a more clever queuing system rather than using this method.)
Next, get current row count:
SELECT count(*) FROM table
Once you have that, simple maths tells you how many rows need deleting. Let's say it said 1005: you need to delete 5 rows.
DELETE FROM table ORDER BY dateRegistered ASC LIMIT 5
Now, unlock the table.
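Put together, a sketch of the whole sequence (assuming the table is actually named mytable and you keep 1000 rows, as in the question's example):

LOCK TABLES mytable WRITE;

SELECT COUNT(*) FROM mytable;    -- suppose this returns 9023

-- 9023 - 1000 = 8023 rows to delete, oldest first
DELETE FROM mytable ORDER BY dateRegistered ASC LIMIT 8023;

UNLOCK TABLES;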
If a lock isn't practical for your scenario, you'll have to be a bit more clever - for example, select the unique ID of all the rows that need deleting, and queue them for gradual deletion. I'll let you work that out yourself :)