Keep first of duplicate records and delete the rest - duplicates

That question does pretty much what I want to accomplish, but my table is more complicated and does not have a primary key. I also don't quite understand the top answer there, specifically what t1 and t2 refer to. If that answer is applicable to my case, I would appreciate it if someone could explain the code.
I have several months' worth of tables that contain info on clients and the policies they hold. Every client has a unique policy ID, but they can have multiple policies, resulting in multiple records under the same policy ID. The duplicate records can be completely different or exactly the same in every field.
For my purposes, I want to keep only one record for each policy ID. Ideally the record kept would be the one with the highest Age, but it doesn't have to be if that's too complicated. Note that there may be more than one record with the maximum age for a particular policy ID; in that case it doesn't matter which of those we keep.
I do not plan on creating a primary key because there are some cases where I will keep two records under the same policy ID, and I will modify the code myself for those. I also don't want to create another table, because I am working with 10+ tables. Someone suggested using First(), but I'm not sure how to incorporate it into a query.
Please let me know if you need any additional information, and thank you for your help in advance!
=========UPDATE #1
Okay, it looks like my question was a bit unrealistic, so I will add an autonumber primary key. How should I proceed from there?

Something along these lines:
DELETE Policies.*
FROM Policies
WHERE Policies.ID Not In (
    SELECT TOP 1 p.ID
    FROM Policies AS p
    WHERE p.PolicyID = Policies.PolicyID
    ORDER BY p.CreateDate DESC, p.ID
)
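If you want to keep the record with the highest Age per policy instead, a sketch of the same pattern (assuming the new autonumber column is named ID and the age column is named Age, neither of which is confirmed in the question):

DELETE Policies.*
FROM Policies
WHERE Policies.ID Not In (
    SELECT TOP 1 p.ID
    FROM Policies AS p
    WHERE p.PolicyID = Policies.PolicyID
    ORDER BY p.Age DESC, p.ID
)

The TOP 1 subquery picks one surviving row per PolicyID (highest Age, ties broken by ID, since Access's TOP keeps ties unless the ORDER BY is unambiguous), and the outer DELETE removes every other row sharing that PolicyID.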

Related

How to efficiently design MySQL database for my particular case

I am developing a forum in PHP and MySQL, and I want to make my forum as efficient as I can.
I have made these two tables:
tbl_threads
tbl_comments
Now, the problem is that there is a like and a dislike button under each comment. I have to store the user_name of whoever clicked the Like or Dislike button, together with the comment_id. I made a user_likes column and a user_dislikes column in tbl_comments to store the comma-separated user_names. But on this forum I have read that this is not an efficient way, and I have been advised to create a third table to store the likes and dislikes so that my database design complies with 1NF.
But the problem is, if I make a third table tbl_user_opinion with two fields like this
1. comment_id
2. type (like or dislike)
So, will I have to run as many SQL queries as there are comments on my page to get the like and dislike data for each comment? Won't that be inefficient? I think there is some confusion on my part here. Can someone clarify this?
You have a relational schema of threads and comments.
There are two ways to solve this. The first, "clean" way is to build your "like" table and do COUNT(*)'s on the appropriate column.
The second way would be to store a counter in each comment, indicating how many up- and down-votes it has received.
If you want to check whether a specific user has voted on a comment, you only have to check one entry, which you can easily handle as its own query and merge the two result sets outside of your database (for this, use a query that returns the comment_id and the kind of vote the user cast in a specific thread).
Your approach with a comma-separated list does not perform well, because you cannot parse it without extra logic or a large amount of string parsing. If you have a database, use it!
("One piece of information, one record!")
The comma-separated list violates the principle of atomicity, and therefore 1NF. You'll have a hard time maintaining referential integrity and, for the most part, querying as well.
Here is one way to do it in a normalized fashion:
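A rough DDL sketch of one normalized layout consistent with the query below (column types and the exact definitions are assumptions):

CREATE TABLE COMMENT (
    COMMENT_ID INT PRIMARY KEY,
    COMMENT_TEXT TEXT                        -- stand-in for the other COMMENT fields
);

CREATE TABLE UP_VOTE (
    COMMENT_ID INT NOT NULL,
    USER_ID INT NOT NULL,
    PRIMARY KEY (COMMENT_ID, USER_ID),       -- also the clustering key
    FOREIGN KEY (COMMENT_ID) REFERENCES COMMENT (COMMENT_ID)
);

CREATE TABLE DOWN_VOTE (
    COMMENT_ID INT NOT NULL,
    USER_ID INT NOT NULL,
    PRIMARY KEY (COMMENT_ID, USER_ID),
    FOREIGN KEY (COMMENT_ID) REFERENCES COMMENT (COMMENT_ID)
);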
This is very clustering-friendly: it groups up-votes belonging to the same comment physically close together (ditto for down-votes), making the following query rather efficient:
SELECT
    COMMENT.COMMENT_ID,
    <other COMMENT fields>,
    COUNT(DISTINCT UP_VOTE.USER_ID) - COUNT(DISTINCT DOWN_VOTE.USER_ID) AS SCORE
FROM COMMENT
LEFT JOIN UP_VOTE
    ON COMMENT.COMMENT_ID = UP_VOTE.COMMENT_ID
LEFT JOIN DOWN_VOTE
    ON COMMENT.COMMENT_ID = DOWN_VOTE.COMMENT_ID
WHERE
    COMMENT.COMMENT_ID = <whatever>
GROUP BY
    COMMENT.COMMENT_ID,
    <other COMMENT fields>;
Please measure on realistic amounts of data whether this works fast enough for you. If not, denormalize the model, cache the total score in the COMMENT table, and keep it current through triggers every time a new row is inserted into or deleted from the *_VOTE tables.
If you also need to get which comments a particular user voted on, you'll need indexes on *_VOTE {USER_ID, COMMENT_ID}, i.e. the reverse of the primary/clustering key above.¹
¹ This is one of the reasons why I didn't go with just one VOTE table containing an additional field that can be either 1 (for up-vote) or -1 (for down-vote): it's less efficient to cover with secondary indexes.

Keeping id's unique Client Side and Server Side

I have been scratching my head for hours now trying to solve the following situation:
Several HTML forms on a webpage are identified by an id. Users can create forms on the client side themselves and fill in data. How can I guarantee that the id of the form the user generates is unique, and that no collision occurs in the saving process because the same id was generated on someone else's client?
The problems/questions:
A random function on the client side could return identical ids on two clients.
Looking up the SQL table for a free id wouldn't solve the problem.
Auto-incrementing a new id would complicate the whole process, because the DOM id and the SQL id differ, which brings us to the next point:
A LEFT JOIN to combine dom_id and user_id to identify the forms in the database looks like a performance killer, because I expect these tables will be huge.
The question (formed as simply as I can):
Is there a way for the client to create/fetch a unique id which will later be used as the primary key for a database entry, without any collisions? What's the best practice?
My current solution (bad):
No unique ids at all to identify the forms. Always a combination through a LEFT JOIN to identify the forms generated by the specific user. But what happens if the user says: delete my account (and my user_id) but leave the data on the server? I would lose the user id and this query wouldn't work anymore...
I am really sorry that I couldn't explain it another way, but I hope someone understands what I am faced with and can give me at least a hint.
Thank you very much!
GUIDs (Globally Unique IDentifiers) might help. See http://en.wikipedia.org/wiki/GUID
For each form the client could generate a new GUID. Theoretically it should be unique.
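On the storage side, the GUID can simply serve as the primary key, and MySQL can also generate one server-side with UUID(). A minimal sketch, with hypothetical table and column names:

CREATE TABLE forms (
    form_id CHAR(36) PRIMARY KEY,   -- GUID supplied by the client, or generated here
    user_id INT,
    created_at DATETIME
);

INSERT INTO forms (form_id, user_id, created_at)
VALUES (UUID(), 1, NOW());

Whether the GUID comes from the client or from UUID(), collisions are practically impossible, so the id can be generated before the row ever reaches the server.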
I just don't show IDs to the user until they've submitted something, at which point they get to see the generated auto-increment id. It keeps things simple. If you really do need it, you could use a sequence table, but it has some caveats which make me advise against it:
CREATE TABLE sequence (id integer default 0, sequencename varchar(32));
Incrementing:
UPDATE sequence
SET id = @generated := id + 1
WHERE sequencename = 'yoursequencename';
Getting:
SELECT @generated;

Where to store users visited pages?

I have a project where, for example, I have posts.
The task is this: I must show each user his last post visits.
This is my solution: every time a user visits a new (for him) topic, I create a new record in the visits table.
The visits table has the following structure: id, user_id, post_id, last_visit.
Now my visits table has ~14,000,000 records, and it's still growing every day.
Maybe my solution isn't optimal and there is another way to store user visits?
It's important to save every visit as a standalone record, because I also have a feature that selects and uses a user's visits. And I can't purge this table, because the data could be needed a month or a year later. How can I optimize this situation?
Nope, you don't really have much choice other than to store your visit data in a table with columns for (at a bare minimum) user id, post id, and timestamp if you need to track the last time that each user visited each post.
I question whether you need an id field in that table, rather than using a composite key on (user_id, post_id), but I'd expect that to have a minor effect, provided that you already have a unique index on (user_id, post_id). (If you don't have an index on that pair of fields, adding one should improve query performance considerably and making it a unique index or composite key will protect against accidentally inserting duplicate records.)
If performance is still an issue despite proper indexing, you should be able to improve it a bit by segmenting the table into a collection of smaller tables, but segment it by user_id or post_id (rather than by date as previous answers have suggested). If you break it up by user or post id, then you will still be able to determine whether a given user has previously viewed a given post and, if so, on what date with only a single query. If you segment it by date, then that information will be spread across all tables and, in the worst-case scenario of a user who has never previously viewed a post (which I expect to be fairly common), you'll need to separately query each and every table before having a definitive answer.
As for whether to segment it by user id or by post id, that depends on whether you will more often be looking for all posts viewed by a user (segment by user_id to get them all in one query) or all users who have viewed a post (segment by post_id).
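A sketch of the visits table with the composite key instead of the surrogate id, as suggested above (column types are assumptions):

CREATE TABLE visits (
    user_id INT NOT NULL,
    post_id INT NOT NULL,
    last_visit DATETIME NOT NULL,
    PRIMARY KEY (user_id, post_id)   -- doubles as the unique index on the pair
);

Checking whether and when a given user last visited a given post is then a single primary-key lookup:

SELECT last_visit FROM visits WHERE user_id = ? AND post_id = ?;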
If it doesn't need to be long lasting, you could store it in session instead. If it does, you could either break the records apart by table, like say 1 per month, or you could only store the last 5-10 pages visited, and delete old ones as new ones come in. You could also change it to pages visited today, this week, etc.
If you do need all 14 million records, I would create another historical table to archive the visits that are not the most relevant for the day-to-day site operation.
At the end of the month (or week, or quarter, etc...) have some scheduled logic to archive records beyond a certain cutoff point to the historical table and reduce the number of records in the "live" table. This should help increase the query speed on the "live" table since you would have less records in it.
If you do need to query all of the data, you can use both tables and have all of the data available to you.
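A sketch of that scheduled archival step, assuming a visits_archive table with the same structure and a cutoff date passed in as the parameter:

INSERT INTO visits_archive
SELECT * FROM visits
WHERE last_visit < ?;

DELETE FROM visits
WHERE last_visit < ?;

Run both statements inside one transaction so the copy and the delete succeed or fail together (assuming a transactional engine such as InnoDB).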
You could delete the ones you don't need. If you only want to show the last 10 visited posts, then (the inner SELECT is wrapped in a derived table because MySQL won't let you select from the table you are deleting from, nor use LIMIT directly in an IN subquery):
DELETE FROM visits
WHERE user_id = ?
  AND id NOT IN (
    SELECT id FROM (
      SELECT id FROM visits WHERE user_id = ?
      ORDER BY last_visit DESC LIMIT 10
    ) AS latest
  );
(I think that's the best way to do that query; any MySQL guru can tell me otherwise? You can ORDER BY in a DELETE, but its LIMIT only takes one parameter, so you can't do LIMIT 10, 100 there.)
Run it after inserting/updating each new row, or every few days if you like.
Having a structure like (id, user_id, post_id, last_visit) for your visits table makes it appear as though you are saving all posts, not just the last post per topic. Don't you need a topic ID in there somewhere so that you can determine what their last post per topic was, and so you know which row to replace when they post in the same topic more than once?
Store the post_ids in $_SESSION, and then, using MySQL's IN with one SELECT query, you will be able to show the posts he visited. But all those ids will be destroyed once the member closes his browser. Anyway, this is much faster and more optimal than using the database.
Edit: sorry, I didn't notice that you must store those records in the database and use them months later. Then I have no idea how to optimize it, but with 14 million records you should definitely use indexes.

How to properly design a simple favorites and blocked table?

I am currently writing a web app in Rails where users can mark items as favorites and also block them. I came up with two ways and wondered which one is the more common/better way.
1. Separate join tables
Would it be wise to have 2 tables for this? Like:
users_favorites
- user_id
- item_id
users_blocked
- user_id
- item_id
2. single table
users_marks (or so)
- users_id
- item_id
- type (["fav", "blk"])
Both ways seem to have advantages. Which one would you use and why?
The second one has at least the advantage (if the primary key is users_id + item_id) of making sure that no user will have an item both favorited and blocked.
I suppose I would go with that second solution, especially considering that the two tables in the first solution would have the same structure, which seems strange; it also allows you to have all the information in the same place, which might help in some cases (reporting, for instance?).
I would go with #2.
It leaves all the appropriate data in a single table.
Otherwise you might have to resort to a union or distinct joins to get a full list of details.
It's just a different status of an item, so #2 will do the job. What would you do if it were colors? Two different tables? I don't think so ;)
Edit: You might want the status in a different table and link it with a foreign key, but that's up to you. It depends on how many different statuses you expect to have. Just these two, or many others as well?
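A sketch of option #2 with the composite primary key mentioned above (column types are assumptions; the ENUM mirrors the values from the question):

CREATE TABLE users_marks (
    users_id INT NOT NULL,
    item_id  INT NOT NULL,
    type     ENUM('fav', 'blk') NOT NULL,
    PRIMARY KEY (users_id, item_id)   -- a user can't both favorite and block the same item
);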

MySQL: Table structure for a user's "views"

I've got a question on which I've had opposing pieces of advice, and would appreciate additional views.
My site has users, each with a user_id. These users can view products, and I need to keep track of the unique instances of users viewing specific products. To record a view in a separate views table, I've currently got two options:
OPTION 1:
view_id (INT,PK) | user_id (INT,FK) | product_id (INT,FK) | view_date
... and create a unique constraint over the two middle columns for easy updating with ON DUPLICATE KEY. If the same view already exists, I just update view_date. If not, I write a new row.
OPTION 2:
user_product (VARCHAR20,PK) | view_date
... merge the two ids into a VARCHAR with a separator in the middle, and use the primary key column for easy updating with ON DUPLICATE KEY in the same way as above.
The structure should accommodate up to approximately a million unique views. Any thoughts on which option might be better or worse, and why? Big thanks in advance.
EDIT:
Thanks for the answers, seems like there's a consensus. Was leaning to the same side but just needed the reassurance.
I like the first option better. In general, it's good to maintain as much atomicity as possible. If you ever want to query for all of a user's views, or something like that, it would be more difficult to do after merging two columns into one (you would need to use LIKE with a wildcard match, which will never be as fast as an indexed single-valued column). You also lose the ability to index on the fields separately.
Also, there is no reason why you couldn't have a primary or unique key that involves multiple columns, so I see no advantage to option 2. To perform your update, just use REPLACE instead of INSERT; this will allow you to easily maintain your invariant of having only one row per user/product combination.
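A sketch of that REPLACE approach against option 1's table (names follow the question; the unique constraint on user_id, product_id is what makes it work):

REPLACE INTO views (user_id, product_id, view_date)
VALUES (?, ?, NOW());

Keep in mind that REPLACE deletes and re-inserts the row, so the auto-increment view_id changes on every repeat view; if that matters, INSERT ... ON DUPLICATE KEY UPDATE avoids it.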
I think that the first option is your better choice. Later down the line I think it will make querying for different things a bit easier. Queries will likely be faster as well since there won't be string manipulation involved. Further, you can have a primary key over multiple columns if you need.
Definitely go for the first option. The second option will mean many queries from hell if you need to make reports looking for particular groups of users (get me all users that often view product X and product Y so we can offer them a discount), and the same goes for looking for specific groups of products (which products are often viewed by the same users, so we can launch a discount promotion).
I understand that it is not a requirement to remember all individual views, but I would certainly capture the number of times they visited the product; this is almost free, as you can keep a running total (INSERT 1, ON DUPLICATE KEY UPDATE view_count = view_count + 1).
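A sketch of that running total, assuming a view_count column is added to option 1's table:

INSERT INTO views (user_id, product_id, view_date, view_count)
VALUES (?, ?, NOW(), 1)
ON DUPLICATE KEY UPDATE
    view_date = NOW(),
    view_count = view_count + 1;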