Current revision of entity in MySQL

Suppose I have the following table
CREATE TABLE `entities` (
`id` INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
`timestamp` TIMESTAMP NOT NULL
DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`data` VARCHAR(255),
PRIMARY KEY (`id`,`timestamp`)
);
Each entity would normally only be referenced by id, except that there are multiple revisions for each entity, disambiguated by timestamp. The majority of my queries will be selecting the most recent revision, with only a small handful inserting new revisions, and even fewer selecting all past revisions. I expect only about a dozen revisions per id on average.
What is the most efficient (in terms of performance and storage space) method of selecting the most recent revision? Is there an accepted practice for this problem?
As I see it, there are two methods: (1) Create views around a GROUP BY
CREATE VIEW groupedEntities AS
SELECT id, max(timestamp) AS maxt FROM entities GROUP BY id;
CREATE VIEW currentEntities AS
SELECT a.id, data, timestamp FROM groupedEntities AS a
INNER JOIN entities AS b ON b.id=a.id AND b.timestamp=a.maxt
WHERE timestamp <= CURRENT_TIMESTAMP;
SELECT * FROM currentEntities WHERE id=?;
Note the <=CURRENT_TIMESTAMP allows 'deleting' an entity by setting a timestamp to the distant future. And (2) Create a separate table to store current revisions
CREATE TABLE currentEntities (
`id` INT(10) UNSIGNED PRIMARY KEY,
`timestamp` TIMESTAMP,
CONSTRAINT FOREIGN KEY (`id`, `timestamp`)
REFERENCES `entities` (`id`,`timestamp`)
);
SELECT * FROM currentEntities INNER JOIN entities USING (id, timestamp) WHERE id=?;
Or some other option (3)?

Views will eat your lunch in terms of performance, because of the way that MySQL handles views. Specifically, MySQL materializes an intermediate MyISAM table for a view, and does not "push" predicates from an outer query into a view (stored or inline).
The option of having a separate table that holds the frequently used "current" revisions would be the better of the two you present. It does add complexity: keeping the two tables in sync, different queries for current vs. historical data, the overhead of extra writes, and so on.
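One way to contain that complexity is to let the database do the bookkeeping. A minimal sketch, assuming option (2)'s tables and that new revisions always carry the latest timestamp (the trigger name is made up):
DELIMITER //
CREATE TRIGGER entities_after_insert
AFTER INSERT ON entities
FOR EACH ROW
BEGIN
  -- every new revision becomes the current one
  INSERT INTO currentEntities (id, timestamp)
  VALUES (NEW.id, NEW.timestamp)
  ON DUPLICATE KEY UPDATE timestamp = NEW.timestamp;
END//
DELIMITER ;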
Given just the original table, storing all the historical revisions in the same table as the current revision (no separate table for just the most recent revision)...
A query with an inline view with a predicate INSIDE the view definition will give the best performance:
SELECT e.id
, e.timestamp
, e.data
FROM `entities` e
JOIN ( SELECT m.id
, MAX(m.timestamp) AS `timestamp`
FROM `entities` m
WHERE m.id = ?
GROUP BY m.id
) c
ON c.id = e.id
AND c.timestamp = e.timestamp
The EXPLAIN output should show "Using where; Using index" on the step to materialize the inline view (derived table). The join predicate on the outer query is by primary key, which is optimal for the retrieval of the data column.
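For what it's worth, on MySQL 8.0+ the same groupwise maximum can be written with a window function, which sidesteps the derived-table materialization concern entirely; a sketch against the same table:
SELECT id, timestamp, data
FROM ( SELECT e.*
            , ROW_NUMBER() OVER (PARTITION BY id ORDER BY timestamp DESC) AS rn
       FROM `entities` e
       WHERE e.id = ?
     ) ranked
WHERE rn = 1;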

Related

Selecting Max record in nested Join more efficiently

I am trying to figure out the most efficient method of writing the query below. Right now it is using a user table of 3k records, scheduleday of 12k records, and scheduleuser of 300k records.
The method I am using works, but it is not fast: it is plenty fast at 100 records and under, but not at the scale I need. I know there must be a more efficient way of running this; if I take out the nested select, it runs in 0.00025 seconds. Add the nested select, and we're pushing 9+ seconds.
All I am trying to do is get the most recent date a user was scheduled. The scheduleuser table only tells the scheduleid and dayid. This is then looked up in scheduleday to get the date. I can't use max(scheduleuser.rec) because the order entered may not be in date order.
The result of this query would be:
Bob 4/6/2022
Ralph 4/7/2022
Please note this query works perfectly fine, I am looking for ways to make it more efficient.
Percona Server Mysql 5.5
SELECT
(
SELECT MAX(STR_TO_DATE(scheduleday.ddate, '%m/%d/%Y')) FROM scheduleuser su1
LEFT JOIN scheduleday ON scheduleday.scheduleid=su1.scheduleid AND scheduleday.dayid=su1.dayid
WHERE su1.idUser=users.idUser
) as lastsecheduledate, users.usersName
FROM users
users
idUser | usersName
------ | ---------
1      | bob
2      | ralph

scheduleday
scheduleid | dayid | ddate
---------- | ----- | --------
1          | 1     | 4/5/2022
1          | 2     | 4/6/2022
1          | 3     | 4/7/2022

scheduleuser (su1)
rec | idUser | dayid | scheduleid
--- | ------ | ----- | ----------
1   | 1      | 2     | 1
2   | 2      | 3     | 1
3   | 1      | 1     | 1
As requested, full query
SELECT users.iduser, users.adminName, users.firstname, users.lastname, users.lastLogin, users.area, users.type, users.terminationdate, users.termreason, users.cellphone,
(SELECT MAX(STR_TO_DATE(scheduleday.ddate, '%m/%d/%Y')) FROM scheduleuser
LEFT JOIN scheduleday ON scheduleday.scheduleid=scheduleuser.scheduleid AND scheduleday.dayid=scheduleuser.dayid
WHERE scheduleuser.iduser=users.iduser) as lastsecheduledate,
IFNULL(userrating.rating,'0.00') as userrating, IFNULL(location.area,'') as userarea, IFNULL(usertypes.name,'') as usertype, IFNULL(useropen.iduser,0) as useropen
FROM users
LEFT JOIN userrating ON userrating.iduser=users.iduser
LEFT JOIN location ON location.idarea=users.area
LEFT JOIN usertypes ON usertypes.idtype=users.type
LEFT JOIN useropen ON useropen.iduser=users.iduser
WHERE
users.type<>0 AND users.active=1
ORDER BY users.firstName
As requested, create tables
CREATE TABLE `users` (
`idUser` int(11) NOT NULL,
`usersName` varchar(255) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
ALTER TABLE `users`
ADD PRIMARY KEY (`idUser`);
ALTER TABLE `users`
MODIFY `idUser` int(11) NOT NULL AUTO_INCREMENT;
COMMIT;
CREATE TABLE `scheduleday` (
`rec` int(11) NOT NULL,
`scheduleid` int(11) NOT NULL,
`dayid` int(11) NOT NULL,
`ddate` varchar(255) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
ALTER TABLE `scheduleday`
ADD PRIMARY KEY (`rec`),
ADD KEY `dayid` (`dayid`),
ADD KEY `scheduleid` (`scheduleid`);
ALTER TABLE `scheduleday`
MODIFY `rec` int(11) NOT NULL AUTO_INCREMENT;
COMMIT;
CREATE TABLE `scheduleuser` (
`rec` int(11) NOT NULL,
`idUser` int(11) NOT NULL,
`dayid` int(11) NOT NULL,
`scheduleid` int(11) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
ALTER TABLE `scheduleuser`
ADD PRIMARY KEY (`rec`),
ADD KEY `idUser` (`idUser`),
ADD KEY `dayid` (`dayid`),
ADD KEY `scheduleid` (`scheduleid`);
ALTER TABLE `scheduleuser`
MODIFY `rec` int(11) NOT NULL AUTO_INCREMENT;
COMMIT;
I think my recommendation would be to do that subquery once with a GROUP BY and join it. Something like
SELECT users.iduser, users.adminName, users.firstname, users.lastname, users.lastLogin, users.area, users.type, users.terminationdate, users.termreason, users.cellphone,
lsd.lastscheduledate,
IFNULL(userrating.rating,'0.00') as userrating, IFNULL(location.area,'') as userarea, IFNULL(usertypes.name,'') as usertype, IFNULL(useropen.iduser,0) as useropen
FROM users
LEFT JOIN (SELECT iduser, MAX(STR_TO_DATE(scheduleday.ddate, '%m/%d/%Y')) lastscheduledate
FROM scheduleuser
LEFT JOIN scheduleday ON scheduleday.scheduleid=scheduleuser.scheduleid AND scheduleday.dayid=scheduleuser.dayid
GROUP BY iduser
) lsd
ON lsd.iduser=users.iduser
LEFT JOIN userrating ON userrating.iduser=users.iduser
LEFT JOIN location ON location.idarea=users.area
LEFT JOIN usertypes ON usertypes.idtype=users.type
LEFT JOIN useropen ON useropen.iduser=users.iduser
WHERE
users.type<>0 AND users.active=1
ORDER BY users.firstName
This will likely be more efficient since the DB can do the query once for all users, likely using your scheduleuser.iduser index.
If you are using something like above and it's still not performant, I might suggest experimenting with:
ALTER TABLE scheduleuser ADD INDEX (scheduleid, dayid)
ALTER TABLE scheduleday ADD INDEX (scheduleid, dayid)
This would ensure it can do the entire join in the subquery with the indexes. Of course, there are tradeoffs to adding more indexes, so depending on your data profile it might not be worth it (and it might not actually improve anything).
If you are using your original query, I might suggest experimenting with:
ALTER TABLE scheduleuser ADD INDEX (iduser,scheduleid, dayid)
ALTER TABLE scheduleday ADD INDEX (scheduleid, dayid)
This would allow it to do the subquery (both the JOIN and the WHERE) without touching the actual scheduleuser table at all. Again, I say "experiment" since there are tradeoffs and this might not actually improve things much.
When you nest a query in the SELECT list as you're doing, that query gets evaluated once for each record in the result set, because its WHERE clause uses a column from the outer query. You really just want to calculate the result set of max dates once, and join your users to it afterwards:
select usersName, last_scheduled
from users
left join (select su.iduser, max(sd.ddate) as last_scheduled
from scheduleuser as su left join scheduleday as sd on su.dayid = sd.dayid
and su.scheduleid = sd.scheduleid
group by su.iduser) recents on users.iduser = recents.iduser
I've obviously left your other columns off and just given you the name and date, but this is the general principle.
Bug:
STR_TO_DATE(MAX(scheduleday.ddate), '%m/%d/%Y')
MAX over the raw ddate string compares lexically, not chronologically. Change it to
MAX(STR_TO_DATE(scheduleday.ddate, '%m/%d/%Y'))
Else you will be in for a rude surprise next January.
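A quick way to see the lexical-comparison problem (the values are made up to match the ddate format):
-- Lexical MAX picks the wrong "latest" date once the year rolls over:
SELECT MAX(d) AS lexical_max
FROM (SELECT '4/7/2022' AS d
      UNION ALL
      SELECT '1/1/2023') x;
-- returns '4/7/2022', because the string '4' sorts after '1'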
Possible better indexes. Switch from MyISAM to InnoDB. The following indexes assume InnoDB; they may not work as well in MyISAM.
users: INDEX(active, type)
userrating: INDEX(iduser, rating)
location: INDEX(idarea, area)
usertypes: INDEX(idtype, name)
useropen: INDEX(iduser)
scheduleday: INDEX(scheduleid, dayid, ddate)
scheduleuser: INDEX(iduser, scheduleid, dayid)
users: INDEX(iduser)
When adding a composite index, DROP index(es) with the same leading columns.
That is, when you have both INDEX(a) and INDEX(a,b), toss the former.
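For example, since scheduleuser already has KEY `idUser` (`idUser`), adding the composite above would look something like this (the index name is illustrative; one statement, so the table is rebuilt once):
-- Replace the single-column index with the composite that covers it
ALTER TABLE scheduleuser
  DROP INDEX `idUser`,
  ADD INDEX `idUser_sched_day` (`idUser`, `scheduleid`, `dayid`);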

How can I get all business data AS WELL as if current user is following them?

In mysql how can I write a query that will fetch ALL business data, and at the same time (or not if it is better another way) check if user is following that business? I have the following relationship table to determine if a user is following a business (status=1 would mean that person is following):
CREATE TABLE IF NOT EXISTS `Relationship_User_Follows_Business` (
`user_id` int(10) unsigned NOT NULL,
`business_id` int(10) unsigned NOT NULL,
`status` tinyint(3) unsigned NOT NULL DEFAULT '0' COMMENT '1=following, 0=not following'
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
ALTER TABLE `Relationship_User_Follows_Business`
ADD UNIQUE KEY `unique_user_business_id` (`user_id`,`business_id`);
Assume business table just holds data on different businesses like name, phone number, etc. I would want to return all of the business data in my query (Business.*). I want to append the status (0 or 1) to the end of each business row to determine if the user is following that business. I have tried the following query, but it does not work because it narrows the results to only show a business if there is a relationship row. I wish to show ALL businesses regardless of whether a relationship row exists, because I only create the relationship row when a user follows:
SELECT Business.*, Relationship_User_Follows_Business.status FROM Business, Relationship_User_Follows_Business WHERE 104=Relationship_User_Follows_Business.user_id AND Business.id=Relationship_User_Follows_Business.business_id
Note that I am using 104 as a test user id. The user id would normally be dependent on user, not a static 104.
You are looking for a LEFT JOIN, not an INNER JOIN: it keeps all the records from the master table along with any matching rows from the details table. Also, avoid implicit join syntax (comma-separated tables) and use the proper explicit syntax of a join:
SELECT Business.*, Relationship_User_Follows_Business.status
FROM Business
LEFT JOIN Relationship_User_Follows_Business
ON Business.id = Relationship_User_Follows_Business.business_id
AND Relationship_User_Follows_Business.user_id = 104
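If you'd rather see 0 than NULL for businesses the user doesn't follow, wrap the status in IFNULL; a small variation on the query above:
SELECT Business.*, IFNULL(r.status, 0) AS status
FROM Business
LEFT JOIN Relationship_User_Follows_Business r
ON Business.id = r.business_id
AND r.user_id = 104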

INNER JOIN and GROUP BY to prevent duplicate results

Context:
I'm working on a simple ORM (for PHP) that automates most queries, based on a static configuration.
Thus, from the table and entity definitions, the library handles joins automatically and generates the appropriate field/table aliases. There is no problem with LEFT joins, but an INNER join may produce duplicated results in the case of a One-to-Many relation.
My thought was to automatically add a GROUP BY clause (on the auto-increment key) if necessary.
The question
Is it correct to consider that I need to add a GROUP BY clause if (and only if) the join's ON and WHERE conditions don't match a unique key of the joined table?
Example
A very simple example, where I want to select all events with (at least) one associated Showing.
If there is another way to do it without INNER JOIN, I'm interested to know how :)
CREATE TABLE `Event` (
`Id` INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
`Name` VARCHAR(255) NOT NULL
);
INSERT INTO `Event` (`Name`) VALUES ('My cool event');
CREATE TABLE `Showing` (
`Id` INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
`EventId` INT UNSIGNED NOT NULL,
`Place` VARCHAR(50) NOT NULL,
FOREIGN KEY (`EventId`) REFERENCES `Event`(`Id`),
UNIQUE (`EventId`, `Place`)
);
INSERT INTO `Showing` (`EventId`, `Place`) VALUES (1, 'School');
INSERT INTO `Showing` (`EventId`, `Place`) VALUES (1, 'Park');
-- Correct queries
SELECT t.* FROM `Event` t INNER JOIN `Showing` t1 ON t.Id=t1.`EventId` WHERE t1.`Place` = 'School';
SELECT t.* FROM `Event` t INNER JOIN `Showing` t1 ON t.Id=t1.`EventId` AND t1.`Place` = 'School';
-- Query leading to duplicate values
SELECT t.* FROM `Event` t INNER JOIN `Showing` t1 ON t.Id=t1.`EventId`;
-- Group by query to prevent duplicate values
SELECT t.* FROM `Event` t INNER JOIN `Showing` t1 ON t.Id=t1.`EventId` GROUP BY t.`Id`;
Thanks !
(this should be a comment but it's a bit long)
No problem for LEFT joins but INNER may result in duplicated results in case of relation One-to-Many
It's clear from that sentence that at least one of us is very confused about how a relational database works, and how object-relation mapping should work.
Query leading to duplicate values
The rows produced are not duplicates - you've written the query so that it doesn't show you why they are different:
SELECT t1.Place, t.*
FROM `Event` t
INNER JOIN `Showing` t1
ON t.Id = t1.EventId;
If you're not interested in the data from 'showing' then why is it in your query? If you only want events that have at least one related showing record then you should be using EXISTS - not a join (consider the case where you have a single event but 3 million showings):
SELECT t.*
FROM `Event` t
WHERE EXISTS (SELECT 1
FROM Showing
WHERE t.Id = Showing.EventId);
If you are strictly implementing ORM, then you probably shouldn't be writing queries with joins at all - but IMHO, the scenario is better served by using factories.
The data is saying that "My Cool Event" is happening at the park, and at the school. If you inner join the tables you will get more than one result.
Do this query to see what is going on:
Select t.*, t1.* FROM `Event` t INNER JOIN `Showing` t1 ON t.Id=t1.`EventId`;
That is the same query as your duplicate query, but selecting columns from both tables.
The first line of results says the event is happening at the park. The second line says that the same event is happening at the school.

How to optimise change history data for MySQL

The table this data was previously stored in approached 3-4 GB, and the data wasn't compressed before or after storage. I'm not a DBA, so I'm a little out of my depth without a good strategy.
The table is to log changes to a particular model in my application (user profiles), but with one tricky requirement: we should be able to fetch the state of a profile at any given date.
Data (single table):
id, username, email, first_name, last_name, website, avatar_url, address, city, zip, phone
The only two requirements:
be able to fetch a list of changes for a given model
be able to fetch state of model on a given date
Previously, all of the profile data was stored for a single change, even if only one column was changed. But to get a 'snapshot' for a particular date was easy enough.
My first couple of solutions in optimising the data structure:
(1) only store changed columns. This would drastically reduce the data stored, but would make it quite complicated to get a snapshot of the data. I'd have to merge all changes up to a given date (could be thousands), then apply them to a model. But that model couldn't be a fresh model (only changed data is stored). To do this, I'd have to first copy over all data from the current profiles table, then, to get a snapshot, apply the changes to those base models.
(2) store the whole of the data, but convert it to a compressed format like gzip or binary or whatnot. This would remove the ability to query the data other than to obtain changes. I couldn't, for example, fetch all changes where email = ''. I would essentially have a single column with converted data, storing the whole of the profile.
Then, I would want to use relevant MySQL table options, like ARCHIVE to further reduce space.
So my question is, are there any other options which you feel are a better approach than 1/2 above, and, if not, which would be better?
First of all, I wouldn't worry at all about a 3GB table (unless it grew to this size in a very short period of time). MySQL can take it. Space shouldn't be a concern, keep in mind that a 500 GB hard disk costs about 4 man-hours (in my country).
That being said, in order to lower your storage requirements, create one table for each field of the table you want to monitor. Assuming a profile table like this:
CREATE TABLE profile (
profile_id INT PRIMARY KEY,
username VARCHAR(50),
email VARCHAR(50) -- and so on
);
... create two history tables:
CREATE TABLE profile_history_username (
profile_id INT NOT NULL,
username VARCHAR(50) NOT NULL, -- same type as profile.username
changedAt DATETIME NOT NULL,
PRIMARY KEY (profile_id, changedAt),
CONSTRAINT profile_id_username_fk
FOREIGN KEY profile_id_fkx (profile_id)
REFERENCES profile(profile_id)
);
CREATE TABLE profile_history_email (
profile_id INT NOT NULL,
email VARCHAR(50) NOT NULL, -- same type as profile.email
changedAt DATETIME NOT NULL,
PRIMARY KEY (profile_id, changedAt),
CONSTRAINT profile_id_fk
FOREIGN KEY profile_id_email_fkx (profile_id)
REFERENCES profile(profile_id)
);
Every time you change one or more fields in profile, log the change in each relevant history table:
START TRANSACTION;
-- lock all tables
SELECT @now := NOW()
FROM profile
JOIN profile_history_email USING (profile_id)
WHERE profile_id = [a profile_id]
FOR UPDATE;
-- update main table, log change
UPDATE profile SET email = [new email] WHERE profile_id = [a profile_id];
INSERT INTO profile_history_email VALUES ([a profile_id], [new email], @now);
COMMIT;
You may also want to set appropriate AFTER triggers on profile so as to populate the history tables automatically.
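A minimal sketch of such a trigger for the email field (the trigger name is made up; you would need one per monitored field, or one trigger with an IF block per field):
DELIMITER //
CREATE TRIGGER profile_au_email
AFTER UPDATE ON profile
FOR EACH ROW
BEGIN
  -- log only when the value actually changed (<=> is NULL-safe equality)
  IF NOT (NEW.email <=> OLD.email) THEN
    INSERT INTO profile_history_email (profile_id, email, changedAt)
    VALUES (NEW.profile_id, NEW.email, NOW());
  END IF;
END//
DELIMITER ;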
Retrieving history information should be straightforward. In order to get the state of a profile at a given point in time, use this query:
SELECT
(
SELECT username FROM profile_history_username
WHERE profile_id = [a profile_id] AND changedAt = (
SELECT MAX(changedAt) FROM profile_history_username
WHERE profile_id = [a profile_id] AND changedAt <= [snapshot date]
)
) AS username,
(
SELECT email FROM profile_history_email
WHERE profile_id = [a profile_id] AND changedAt = (
SELECT MAX(changedAt) FROM profile_history_email
WHERE profile_id = [a profile_id] AND changedAt <= [snapshot date]
)
) AS email;
You can't compress the data without having to uncompress it in order to search it, which is going to severely damage the performance. If the data really is changing that often (i.e. more than an average of 20 times per record) then it would be more efficient for storage and retrieval to structure it as a series of changes:
Consider:
CREATE TABLE profile (
id INT NOT NULL AUTO_INCREMENT,
PRIMARY KEY (id)
);
CREATE TABLE profile_data (
profile_id INT NOT NULL,
attr ENUM('username', 'email', 'first_name'
, 'last_name', 'website', 'avatar_url'
, 'address', 'city', 'zip', 'phone') NOT NULL,
value VARCHAR(255),
starttime DATETIME DEFAULT CURRENT_TIMESTAMP,
endtime DATETIME,
PRIMARY KEY (profile_id, attr, starttime),
INDEX (profile_id),
FOREIGN KEY (profile_id) REFERENCES profile(id)
);
When you add a new value for an existing record, set an endtime on the record it supersedes.
Then to get the value at a date $T:
SELECT p.id, attr, value
FROM profile p
INNER JOIN profile_data d
ON p.id=d.profile_id
WHERE $T>=starttime
AND $T<=IF(endtime IS NULL, $T, endtime);
Alternately just have a start time, and:
SELECT p.id, attr, value
FROM profile p
INNER JOIN profile_data d
ON p.id=d.profile_id
WHERE $T>=starttime
AND NOT EXISTS (SELECT 1
FROM profile_data d2
WHERE d2.profile_id=d.profile_id
AND d2.attr=d.attr
AND d2.starttime>d.starttime
AND d2.starttime<=$T);
(which will be even faster with the MAX-CONCAT trick, sketched below).
But if the data is not changing with that frequency then keep it in the current structure.
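The MAX-CONCAT trick mentioned above, sketched against profile_data: it relies on DATETIME rendering as a fixed-width 19-character string, so the value can be peeled off after the groupwise MAX:
SELECT d.profile_id, d.attr
     , SUBSTRING(MAX(CONCAT(d.starttime, d.value)), 20) AS value
FROM profile_data d
WHERE d.starttime <= $T
GROUP BY d.profile_id, d.attr;
-- note: rows where value is NULL drop out of CONCAT, so MAX skips them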
You need a slowly changing dimension:
I will do this only for e-mail and telephone so you get the idea. Pay attention to the fact that I use two keys: one unique within the table, and another that is unique to the user it concerns. That is, the table key identifies the record, and the user key identifies the user:
table_id | user_id | email                | telephone | created_at | inactive_at | is_current
1        | 1       | mario@yahoo.it       | 123456    | 2012-01-02 | 2013-04-01  | no
2        | 2       | erik@telecom.de      | 123457    | 2012-01-03 | 2013-02-28  | no
3        | 3       | vanessa@o2.de        | 1234568   | 2012-01-03 | null        | yes
4        | 2       | erik@telecom.de      | 123459    | 2012-02-28 | null        | yes
5        | 1       | super.mario@yahoo.it | 654321    | 2013-04-01 | 2013-04-02  | no
6        | 1       | super.mario@yahoo.it | 123456    | 2013-04-02 | null        | yes
most recent state of the database
select * from FooTable where inactive_at is null
or
select * from FooTable where is_current = 'yes'
All changes to mario (mario is user_id 1)
select * from FooTable where user_id = 1;
All changes between 1 jan 2013 and 1 of may 2013
select * from FooTable where created_at between '2013-01-01' and '2013-05-01';
and if you need to compare with the old versions (with the help of a stored procedure, Java or PHP code... you choose):
select * from FooTable where inactive_at between '2013-01-01' and '2013-05-01';
If you want, you can do a fancy SQL statement:
select f1.table_id, f1.user_id,
case when f1.email = f2.email then 'NO_CHANGE' else concat(f1.email, ' -> ', f2.email) end,
case when f1.telephone = f2.telephone then 'NO_CHANGE' else concat(f1.telephone, ' -> ', f2.telephone) end
from FooTable f1 inner join FooTable f2
on (f1.user_id = f2.user_id)
where f2.created_at in
(select max(f3.created_at) from FooTable f3
where f3.user_id = f1.user_id and f3.created_at < f1.created_at)
and f1.created_at between '2013-01-01' and '2013-05-01';
As you can see, a juicy query, to compare each row with the previous row for the same user...
the state of the database on 2013-03-01
select * from FooTable where table_id in
(select max(table_id) from FooTable where inactive_at <= '2013-03-01' group by user_id
union
select table_id from FooTable where inactive_at is null group by user_id having count(table_id) = 1);
I think this is the easiest way to implement what you want... you could implement a multi-million-table relational model, but then it would be a pain in the arse to query it.
Your database is not big enough; I work every day with one even bigger. Now tell me: is the money you save on a new server worth the time you spend on a super-complex relational model?
BTW if the data changes too fast, this approach cannot be used...
BONUS: optimization:
create indexes on created_at, inactive_at, user_id, and composite pairs of these (see the sketch below)
perform partitioning (both horizontal and vertical)
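A sketch of the index part (FooTable as above; which pair pays off depends on your actual queries):
ALTER TABLE FooTable
  ADD INDEX (created_at),
  ADD INDEX (inactive_at),
  ADD INDEX (user_id, created_at);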
If you put each module's changes in a different table, then when you need an instance at some date you can join them and compare dates: for example, to get the instance as of the 1st of July, run a query with a condition where the date is less than or equal to the 1st of July, order it descending, and limit the count to 1. That way the joins produce exactly the instance as it was on the 1st of July. In this manner you can even figure out the most frequently updated module.
Also, if you want to keep all the data flat, try range partitioning on the basis of month; that way MySQL will handle it pretty easily.
Note: by date I mean storing the Unix timestamp of the date; it's much easier to compare.
I'll offer one more solution just for variety.
Schema
PROFILE
id INT PRIMARY KEY,
username VARCHAR(50) NOT NULL UNIQUE
PROFILE_ATTRIBUTE
id INT PRIMARY KEY,
profile_id INT NOT NULL FOREIGN KEY REFERENCES PROFILE (id),
attribute_name VARCHAR(50) NOT NULL,
attribute_value VARCHAR(255) NULL,
created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
replaced_at DATETIME NULL
For all attributes you are tracking, simply add PROFILE_ATTRIBUTE records when they are updated, and mark the previous attribute record with the DATETIME it was replaced at.
Select Current Profile
SELECT *
FROM PROFILE p
LEFT JOIN PROFILE_ATTRIBUTE pa
ON p.id = pa.profile_id
WHERE p.username = 'username'
AND pa.replaced_at IS NULL
Select Profile At Date
SELECT *
FROM PROFILE p
LEFT JOIN PROFILE_ATTRIBUTE pa
ON p.id = pa.profile_id
WHERE p.username = 'username'
AND pa.created_at < '2013-07-01'
AND '2013-07-01' <= IFNULL(pa.replaced_at, NOW())
When Updating Attributes
Insert the new attribute
Update the previous attribute's replaced_at value
It would probably be important that the created_at for a new attribute match the replaced_at for the corresponding old attribute. This would be so that there is an unbroken timeline of attribute values for a given attribute name.
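A sketch of that two-step update for one attribute, using a variable so the two timestamps line up exactly (the profile_id and values are made up, and id is assumed to be auto-generated):
START TRANSACTION;
SET @now = NOW();
-- close out the old value
UPDATE PROFILE_ATTRIBUTE
   SET replaced_at = @now
 WHERE profile_id = 42
   AND attribute_name = 'email'
   AND replaced_at IS NULL;
-- record the new value with a matching created_at
INSERT INTO PROFILE_ATTRIBUTE (profile_id, attribute_name, attribute_value, created_at)
VALUES (42, 'email', 'new@example.com', @now);
COMMIT;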
Advantages
Simple two-table architecture (I personally don't like a table-per-field approach)
Can add additional attributes with no schema changes
Easily mapped into ORM systems, assuming an application lives on top of this database
Could easily see the history for a certain attribute_name over time.
Disadvantages
Integrity is not enforced. For example, the schema doesn't prevent multiple NULL replaced_at records for the same attribute_name... perhaps this could be enforced with a two-column UNIQUE constraint
Let's say you add a new field in the future. Existing profiles would not select a value for the new field until they save a value to it. This is opposed to the value coming back as NULL if it were a column. This may or may not be an issue.
If you use this approach, be sure you have indexes on the created_at and replaced_at columns.
There may be other advantages or disadvantages. If commenters have input, I'll update this answer with more information.

Optimizing a MySQL query with a large IN() clause or join on derived table

Let's say I need to query the associates of a corporation. I have a table, "transactions", which contains data on every transaction made.
CREATE TABLE `transactions` (
`transactionID` int(11) unsigned NOT NULL,
`orderID` int(11) unsigned NOT NULL,
`customerID` int(11) unsigned NOT NULL,
`employeeID` int(11) unsigned NOT NULL,
`corporationID` int(11) unsigned NOT NULL,
PRIMARY KEY (`transactionID`),
KEY `orderID` (`orderID`),
KEY `customerID` (`customerID`),
KEY `employeeID` (`employeeID`),
KEY `corporationID` (`corporationID`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
It's fairly straightforward to query this table for associates, but there's a twist: A transaction record is registered once per employee, and so there may be multiple records for one corporation per order.
For example, if employees A and B from corporation 1 were both involved in selling a vacuum cleaner to corporation 2, there would be two records in the "transactions" table; one for each employee, and both for corporation 1. This must not affect the results, though. A trade from corporation 1, regardless of how many of its employees were involved, must be treated as one.
Easy, I thought. I'll just make a join on a derived table, like so:
SELECT corporationID FROM transactions JOIN (SELECT DISTINCT orderID FROM transactions WHERE corporationID = 1) AS foo USING (orderID)
The query returns a list of corporations who have been involved in trades with corporation 1. That's exactly what I need, but it's very slow, because MySQL can't use the corporationID index when materializing the derived table. I understand that this is the case for all subqueries/derived tables in MySQL.
I've also tried querying the collection of orderIDs separately and using a ridiculously large IN() clause (typically 100,000+ IDs), but as it turns out MySQL has issues using indices on ridiculously large IN() clauses as well, and as a result the query time does not improve.
Are there any other options available, or have I exhausted them both?
If I understand your requirement, you could try this.
select distinct t1.corporationID
from transactions t1
where exists (
select 1
from transactions t2
where t2.corporationID = 1
and t2.orderID = t1.orderID)
and t1.corporationID != 1;
or this:
select distinct t1.corporationID
from transactions t1
join transactions t2
on t2.orderID = t1.orderID
and t1.transactionID != t2.transactionID
where t2.corporationID = 1
and t1.corporationID != 1;
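Both forms hinge on finding transactions that share an orderID with corporation 1's transactions and then filtering by corporationID; a composite index that serves the join and the filter from the index alone might help (an experiment, not a guarantee):
ALTER TABLE transactions ADD INDEX `orderID_corporationID` (`orderID`, `corporationID`);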
Your data makes no sense to me. I think you are using corporationID where you mean customerID at some point in there, as your query joins the transactions table to itself for corporationID = 1 based on orderID to get the corporationIDs... which would then be 1, right?
Can you please specify what the customerID, employeeID, and corporationID mean? How do I know employees A and B are from corporation 1 - in that case, is corporation 1 the corporationID, and corporation 2 the customer, and so stored in the customerID?
If that is the case, you just need to do a group by:
SELECT customerID
FROM transactions
WHERE corporationID = 1
GROUP BY customerID
(Or select and group by orderID if you want one row per order instead of one row per customer.)
By using the group by, you ignore the fact that there are multiple records that are duplicate except for the employeeID.
Conversely, this returns all corporations that have sold to corporation 2:
SELECT corporationID
FROM transactions
WHERE customerID = 2
GROUP BY corporationID