Using IN to select a dynamic list of software names that a computer must have in order to be matched, and with the data for a known value existing in the database, the result when using:
GROUP BY id
HAVING COUNT(DISTINCT column_name) = count_of_list_values;
returns an empty result set when it should return the expected value.
NOTE: it works when searching for a single list item; the issue appears when the list expands to more than one item.
Plenty of Google searches turned up nothing useful, so I have been working in phpMyAdmin trying to figure out why it is failing.
I isolated the area where it seems to be choking and have been trying different variations of the query to get it to work.
This is the section of the query I am using:
SELECT
am_software_archive.asset_name
FROM
am_software_archive
WHERE
LOWER(am_software_archive.sw_name) IN('nodejs', 'visio 2013')
GROUP BY
am_software_archive.id
HAVING
COUNT(
DISTINCT am_software_archive.sw_name
) = 2
The data in this table exists and is valid so it should work...
The table definition
CREATE TABLE IF NOT EXISTS am_software_archive(
id BIGINT NOT NULL UNIQUE,
asset_name VARCHAR(10) NOT NULL,
sw_name VARCHAR(150) NOT NULL,
sw_developer BIGINT NOT NULL,
sw_key VARCHAR(50) DEFAULT NULL,
sw_osver VARCHAR(15) DEFAULT NULL,
CONSTRAINT PK_software PRIMARY KEY(id, asset_name),
INDEX idx_sw_name_asset(asset_name,sw_name),
INDEX idx_sw_key_asset_name(asset_name,sw_key),
INDEX idx_sw_name_sw_key(sw_name,sw_key),
INDEX idx_osver_sw_name(sw_name,sw_osver),
INDEX idx_osver_asset_name(asset_name,sw_osver)
)
In my database the expected result would be:
"ABX50269"
The actual result, however, is empty, but it shouldn't be.
You group by id, which is unique, so there can never be two or more (different) sw_name values per id.
Since asset_name is the column in your SELECT list, I think you actually want to group by asset_name, as in the sketch below.
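For reference, a sketch of the corrected query (it also counts distinct lowercased names, so the count matches the IN list regardless of how sw_name is cased in the table):

SELECT
    am_software_archive.asset_name
FROM
    am_software_archive
WHERE
    LOWER(am_software_archive.sw_name) IN ('nodejs', 'visio 2013')
GROUP BY
    am_software_archive.asset_name
HAVING
    COUNT(DISTINCT LOWER(am_software_archive.sw_name)) = 2;

With the data described in the question, this should return the expected "ABX50269", because that asset has both list items.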
I have a fairly big file that I matched with another file before uploading it to my database using MySQL. The original file was ~211k records (t1) and the result of matching it against the existing database (t2) is around 300k records -- which means I have almost 90k records' worth of removal to do before I can upload.
Since the first query where I used a LEFT JOIN to match them on name took so long, I saved the results as a new table called matchnew (the 300k records, seemingly with 90k of duplicates or bad matches). Here's a sample of the matchnew schema after I joined t1 and t2:
CREATE TABLE `rnmatchnew` (
  `id1` varchar(255) DEFAULT NULL,
  `first1` varchar(255) DEFAULT NULL,
  `last1` varchar(255) DEFAULT NULL,
  `phone1` varchar(255) DEFAULT NULL,
  `zip1` varchar(255) DEFAULT NULL,
  `id2` varchar(255) DEFAULT NULL,
  `first2` varchar(255) DEFAULT NULL,
  `last2` varchar(255) DEFAULT NULL,
  `phone2` varchar(255) DEFAULT NULL,
  `zip2` varchar(255) DEFAULT NULL
);
(And the two IDs [id1 and id2] do not match -- they're two unique identifiers from two different databases.)
Right now I'm looking at most of those duplicates or bad matches by using this simple query:
SELECT *, COUNT(id1)
FROM matchnew
GROUP BY id1
HAVING COUNT(id1) > 1;
The good thing is that each table I matched had a different unique identifier attached to it (id1 from the first table and id2 from the second, which now both exist in matchnew) -- so it should be fairly easy to see when records appear multiple times. Also, because I LEFT JOINed two existing tables together to get matchnew, I have two sets of data for each person, one from each table -- two names, two phone numbers, two addresses, etc. But I only did the LEFT JOIN on first and last name to get the biggest possible return and make sure I didn't miss anybody in case they moved, or we have different phone numbers for them, etc.
My question is: is there code I can write or add to the above query which will remove rows that fit certain criteria, but only if there is more than one row for that unique ID in the table? So for example, if my id1 was 1234567 and my query above showed that there were now three of me in the count column, is there additional code I can write to remove one or two (but not all three) of the duplicates or bad matches if the data doesn't match up with other qualifiers (e.g. phone number or zip code)?
To further clarify, if my record with id1: 1234567 from the initial t1 matched with three people with my name from t2 -- is there a way to remove up to two of the rows if, for example, the record from t1 matched the same phone number as one of the three records with the same name from t2? (The only reason why I specify "up to two" is because this example has three duplicates -- and if none of them match the phone number, I don't want to lose them all entirely in case that's a decision I can make manually.)
That was way more complicated to describe than I expected -- so please just let me know if I can provide any further clarification! Thanks so much for the help.
You first need to add an identity column for all the rows.
With an identity column id, the rows will look like this:
id  id1  phone1  first1
1   1    732     t1
2   1    732     t1
3   1    732     t2
4   1    891     t3
The query would remove only the row with id 1, because rows 1 and 2 have matching id1, phone1 and first1 values and the row with the maximum id in that group (id 2) is retained.
We group by id1, phone1 and first1; if a combination has duplicate rows, we keep only the row with the maximum id.
DELETE M FROM matchnew M
INNER JOIN (
SELECT id1, phone1, first1, MAX(id) as id
FROM matchnew
GROUP BY id1,phone1,first1
HAVING COUNT(*) > 1 )T
ON M.id < T.id
AND M.phone1 = T.phone1
AND M.id1 = T.id1
AND M.first1 = T.first1
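Before running the DELETE, the same join can be run as a SELECT to preview exactly which rows would be removed (a sanity check only, using the same grouping as above):

SELECT M.*
FROM matchnew M
INNER JOIN (
    SELECT id1, phone1, first1, MAX(id) AS id
    FROM matchnew
    GROUP BY id1, phone1, first1
    HAVING COUNT(*) > 1
) T ON M.id < T.id
   AND M.phone1 = T.phone1
   AND M.id1 = T.id1
   AND M.first1 = T.first1;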
The previous table this data was stored in approached 3-4 GB, but the data wasn't compressed before or after storage. I'm not a DBA, so I'm a little out of my depth coming up with a good strategy.
The table is to log changes to a particular model in my application (user profiles), but with one tricky requirement: we should be able to fetch the state of a profile at any given date.
Data (single table):
id, username, email, first_name, last_name, website, avatar_url, address, city, zip, phone
The only two requirements:
be able to fetch a list of changes for a given model
be able to fetch state of model on a given date
Previously, all of the profile data was stored for a single change, even if only one column was changed. But getting a 'snapshot' for a particular date was easy enough.
My first couple of solutions in optimising the data structure:
(1) only store changed columns. This would drastically reduce the data stored, but would make it quite complicated to get a snapshot. I'd have to merge all changes up to a given date (could be thousands), then apply them to a model. But that model couldn't be a fresh model (only changed data is stored). To do this, I'd have to first copy over all the data from the current profiles table, and then, to get a snapshot, apply the changes to those base models.
(2) store the whole of the data, but convert it to a compressed format like gzip or binary or whatnot. This would remove the ability to query the data other than to obtain changes. I couldn't, for example, fetch all changes where email = ''. I would essentially have a single column of converted data storing the whole of the profile.
Then, I would want to use relevant MySQL table options, like the ARCHIVE storage engine, to further reduce space.
So my question is, are there any other options which you feel are a better approach than 1/2 above, and, if not, which would be better?
First of all, I wouldn't worry at all about a 3 GB table (unless it grew to this size in a very short period of time). MySQL can take it. Space shouldn't be a concern; keep in mind that a 500 GB hard disk costs about 4 man-hours (in my country).
That being said, in order to lower your storage requirements, create one table for each field of the table you want to monitor. Assuming a profile table like this:
CREATE TABLE profile (
profile_id INT PRIMARY KEY,
username VARCHAR(50),
email VARCHAR(50) -- and so on
);
... create two history tables:
CREATE TABLE profile_history_username (
profile_id INT NOT NULL,
username VARCHAR(50) NOT NULL, -- same type as profile.username
changedAt DATETIME NOT NULL,
PRIMARY KEY (profile_id, changedAt),
CONSTRAINT profile_id_username_fk
FOREIGN KEY profile_id_fkx (profile_id)
REFERENCES profile(profile_id)
);
CREATE TABLE profile_history_email (
profile_id INT NOT NULL,
email VARCHAR(50) NOT NULL, -- same type as profile.email
changedAt DATETIME NOT NULL,
PRIMARY KEY (profile_id, changedAt),
CONSTRAINT profile_id_fk
FOREIGN KEY profile_id_email_fkx (profile_id)
REFERENCES profile(profile_id)
);
Every time you change one or more fields in profile, log the change in each relevant history table:
START TRANSACTION;
-- lock all tables
SELECT @now := NOW()
FROM profile
JOIN profile_history_email USING (profile_id)
WHERE profile_id = [a profile_id]
FOR UPDATE;
-- update main table, log change
UPDATE profile SET email = [new email] WHERE profile_id = [a profile_id];
INSERT INTO profile_history_email VALUES ([a profile_id], [new email], @now);
COMMIT;
You may also want to set appropriate AFTER triggers on profile so as to populate the history tables automatically.
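A minimal sketch of such a trigger for the email column, assuming the profile and profile_history_email tables above (the same pattern would be repeated for each monitored column):

DELIMITER //
CREATE TRIGGER profile_aur_email
AFTER UPDATE ON profile
FOR EACH ROW
BEGIN
    -- Log only when the value actually changed (NULL-safe comparison).
    IF NOT (NEW.email <=> OLD.email) THEN
        INSERT INTO profile_history_email (profile_id, email, changedAt)
        VALUES (NEW.profile_id, NEW.email, NOW());
    END IF;
END//
DELIMITER ;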
Retrieving history information should be straightforward. In order to get the state of a profile at a given point in time, use this query:
SELECT
(
SELECT username FROM profile_history_username
WHERE profile_id = [a profile_id] AND changedAt = (
SELECT MAX(changedAt) FROM profile_history_username
WHERE profile_id = [a profile_id] AND changedAt <= [snapshot date]
)
) AS username,
(
SELECT email FROM profile_history_email
WHERE profile_id = [a profile_id] AND changedAt = (
SELECT MAX(changedAt) FROM profile_history_email
WHERE profile_id = [a profile_id] AND changedAt <= [snapshot date]
)
) AS email;
You can't compress the data without having to uncompress it in order to search it - which is going to severely damage the performance. If the data really is changing that often (i.e. more than an average of 20 times per record) then it would be more efficient for storage and retrieval to structure it as a series of changes:
Consider:
CREATE TABLE profile (
    id INT NOT NULL AUTO_INCREMENT,
    PRIMARY KEY (id)
);
CREATE TABLE profile_data (
    profile_id INT NOT NULL,
    attr ENUM('username', 'email', 'first_name'
        , 'last_name', 'website', 'avatar_url'
        , 'address', 'city', 'zip', 'phone') NOT NULL,
    value VARCHAR(255),
    starttime DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    endtime DATETIME,
    PRIMARY KEY (profile_id, attr, starttime),
    INDEX (profile_id),
    FOREIGN KEY (profile_id) REFERENCES profile(id)
);
When you add a new value for an existing record, set an endtime on the record being superseded (a sketch of this two-step write follows).
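A minimal sketch of that two-step write, assuming the profile_data table above and an illustrative profile id of 1:

START TRANSACTION;

-- Close the value being superseded.
UPDATE profile_data
SET endtime = NOW()
WHERE profile_id = 1
  AND attr = 'email'
  AND endtime IS NULL;

-- Record the new value; starttime defaults to the current time.
INSERT INTO profile_data (profile_id, attr, value)
VALUES (1, 'email', 'new@example.com');

COMMIT;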
Then to get the value at a date $T:
SELECT p.id, attr, value
FROM profile p
INNER JOIN profile_data d
ON p.id=d.profile_id
WHERE $T>=starttime
AND $T<=IF(endtime IS NULL,$T, endtime);
Alternately just have a start time, and:
SELECT p.id, attr, value
FROM profile p
INNER JOIN profile_data d
ON p.id=d.profile_id
WHERE $T>=starttime
AND NOT EXISTS (SELECT 1
FROM profile_data d2
WHERE d2.profile_id=d.profile_id
AND d2.attr=d.attr
AND d2.starttime>d.starttime
AND d2.starttime>$T);
(which will be even faster with the MAX-CONCAT trick; a sketch follows).
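For illustration, a sketch of that trick against the profile_data table above: concatenating the fixed-width DATETIME in front of the value lets MAX() pick the latest row per group, and SUBSTRING() then strips the 19-character date prefix off again.

SELECT profile_id, attr,
       SUBSTRING(MAX(CONCAT(starttime, value)), 20) AS value
FROM profile_data
WHERE starttime <= '2013-07-01 00:00:00'   -- $T
GROUP BY profile_id, attr;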
But if the data is not changing with that frequency then keep it in the current structure.
You need a slowly changing dimension:
I will do this only for e-mail and telephone so you can see the idea (note that I use two keys: one that is unique in the table, and another that is unique to the user it concerns. That is, the table key identifies the record, and the user key identifies the user):
table_id, user_id, email,                 telephone, created_at, inactive_at, is_current
1,        1,       mario@yahoo.it,        123456,    2012-01-02, 2013-04-01,  no
2,        2,       erik@telecom.de,       123457,    2012-01-03, 2013-02-28,  no
3,        3,       vanessa@o2.de,         1234568,   2012-01-03, null,        yes
4,        2,       erik@telecom.de,       123459,    2012-02-28, null,        yes
5,        1,       super.mario@yahoo.it,  654321,    2013-04-01, 2013-04-02,  no
6,        1,       super.mario@yahoo.it,  123456,    2013-04-02, null,        yes
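For reference, a possible definition for this table (a sketch; the column types are assumptions, not taken from the question):

CREATE TABLE FooTable (
    table_id    BIGINT NOT NULL AUTO_INCREMENT,   -- identifies the record
    user_id     BIGINT NOT NULL,                  -- identifies the user
    email       VARCHAR(100) NOT NULL,
    telephone   VARCHAR(20) NOT NULL,
    created_at  DATE NOT NULL,
    inactive_at DATE DEFAULT NULL,                -- NULL while the row is still current
    is_current  ENUM('yes', 'no') NOT NULL DEFAULT 'yes',
    PRIMARY KEY (table_id)
);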
most recent state of the database
select * from FooTable where inactive_at is null
or
select * from FooTable where is_current = 'yes'
All changes to mario (mario is user_id 1)
select * from FooTable where user_id = 1;
All changes between 1 jan 2013 and 1 of may 2013
select * from FooTable where created_at between '2013-01-01' and '2013-05-01';
and you need to compare with the old versions (with the help of a stored procedure, Java or PHP code... you choose)
select * from FooTable where inactive_at between '2013-01-01' and '2013-05-01';
If you want, you can do a fancy SQL statement:
select f1.table_id, f1.user_id,
    case when f1.email = f2.email then 'NO_CHANGE' else concat(f1.email, ' -> ', f2.email) end,
    case when f1.telephone = f2.telephone then 'NO_CHANGE' else concat(f1.telephone, ' -> ', f2.telephone) end
from FooTable f1 inner join FooTable f2
    on (f1.user_id = f2.user_id)
where f2.created_at in
    (select max(f3.created_at) from FooTable f3
     where f3.user_id = f1.user_id and f3.created_at < f1.created_at)
and f1.created_at between '2013-01-01' and '2013-05-01';
As you can see, a juicy query that compares each user row with the previous row for that user...
the state of the database on 2013-03-01
select * from FooTable where table_id in
(select max(table_id) from FooTable where inactive_at <= '2013-03-01' group by user_id
union
select table_id from FooTable where inactive_at is null group by user_id having count(table_id) = 1);
I think this is the easiest way of implementing what you want... you could implement a multi-million-table relational model, but then it would be a pain in the arse to query it.
Your database is not big enough; I work every day with an even bigger one. Now tell me: is the money you save on a new server worth the time you spend on a super-complex relational model?
BTW if the data changes too fast, this approach cannot be used...
BONUS: optimization:
create indexes on created_at, inactive_at and user_id, plus a composite pair such as (user_id, created_at) (see the sketch below)
perform partitioning (both horizontal and vertical)
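A sketch of those indexes, assuming the composite pair meant above is (user_id, created_at):

CREATE INDEX idx_created_at   ON FooTable (created_at);
CREATE INDEX idx_inactive_at  ON FooTable (inactive_at);
CREATE INDEX idx_user_id      ON FooTable (user_id);
CREATE INDEX idx_user_created ON FooTable (user_id, created_at);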
If you put each occurring change in a different table, then when you need an instance on some date you can join them and display it by comparing dates. For example, if you want the instance at 1st of July, you can run a query with a condition where the date is equal to or less than 1st of July, order it by date descending (most recent first) and limit the count to 1; that way the joins will produce exactly the instance as it was on 1st of July. In this manner you can even figure out the most frequently updated module.
Also, if you want to keep all the data flat, try range partitioning on the basis of month (a sketch follows); that way MySQL will handle it pretty easily.
Note: by date I mean storing the unix timestamp of the date; it is much easier to compare.
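A sketch of that month-based range partitioning, using a hypothetical change-log table keyed by a unix-timestamp column (table and column names are illustrative, not from the question; note that the partitioning column must be part of every unique key):

CREATE TABLE profile_changes (
    profile_id INT NOT NULL,
    attr       VARCHAR(50) NOT NULL,
    changed_at INT UNSIGNED NOT NULL,   -- unix timestamp of the change
    value      VARCHAR(255),
    PRIMARY KEY (profile_id, attr, changed_at)
)
PARTITION BY RANGE (changed_at) (
    PARTITION p2013_06 VALUES LESS THAN (UNIX_TIMESTAMP('2013-07-01 00:00:00')),
    PARTITION p2013_07 VALUES LESS THAN (UNIX_TIMESTAMP('2013-08-01 00:00:00')),
    PARTITION pmax     VALUES LESS THAN MAXVALUE
);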
I'll offer one more solution just for variety.
Schema
PROFILE
id INT PRIMARY KEY,
username VARCHAR(50) NOT NULL UNIQUE
PROFILE_ATTRIBUTE
id INT PRIMARY KEY,
profile_id INT NOT NULL FOREIGN KEY REFERENCES PROFILE (id),
attribute_name VARCHAR(50) NOT NULL,
attribute_value VARCHAR(255) NULL,
created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
replaced_at DATETIME NULL
For all attributes you are tracking, simply add PROFILE_ATTRIBUTE records when they are updated, and mark the previous attribute record with the DATETIME it was replaced at.
Select Current Profile
SELECT *
FROM PROFILE p
LEFT JOIN PROFILE_ATTRIBUTE pa
ON p.id = pa.profile_id
WHERE p.username = 'username'
AND pa.replaced_at IS NULL
Select Profile At Date
SELECT *
FROM PROFILE p
LEFT JOIN PROFILE_ATTRIBUTE pa
ON p.id = pa.profile_id
WHERE p.username = 'username'
AND pa.created_at < '2013-07-01'
AND '2013-07-01' <= IFNULL(pa.replaced_at, NOW())
When Updating Attributes
Insert the new attribute
Update the previous attribute's replaced_at value
It would probably be important that the created_at for a new attribute match the replaced_at for the corresponding old attribute. This would be so that there is an unbroken timeline of attribute values for a given attribute name.
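A minimal sketch of that update sequence, assuming the two tables above, that PROFILE_ATTRIBUTE.id is auto-generated, and an illustrative profile_id of 42:

START TRANSACTION;

SET @now = NOW();

-- Close out the attribute value being replaced.
UPDATE PROFILE_ATTRIBUTE
SET replaced_at = @now
WHERE profile_id = 42
  AND attribute_name = 'email'
  AND replaced_at IS NULL;

-- Insert the new value; created_at matches the old row's replaced_at,
-- keeping the timeline unbroken.
INSERT INTO PROFILE_ATTRIBUTE (profile_id, attribute_name, attribute_value, created_at)
VALUES (42, 'email', 'new@example.com', @now);

COMMIT;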
Advantages
Simple two-table architecture (I personally don't like a table-per-field approach)
Can add additional attributes with no schema changes
Easily mapped into ORM systems, assuming an application lives on top of this database
Could easily see the history for a certain attribute_name over time.
Disadvantages
Integrity is not enforced. For example, the schema doesn't prevent multiple NULL replaced_at records with the same attribute_name... perhaps this could be enforced with a two-column UNIQUE constraint
Let's say you add a new field in the future. Existing profiles would not select a value for the new field until they save a value to it. This is opposed to the value coming back as NULL if it were a column. This may or may not be an issue.
If you use this approach, be sure you have indexes on the created_at and replaced_at columns.
There may be other advantages or disadvantages. If commenters have input, I'll update this answer with more information.
I've created an insert-only table for the purpose of speed and maintaining a history. Its structure is very generic, and is as follows:
CREATE TABLE `kvtable` (
  `id` bigint(20) unsigned NOT NULL auto_increment,
  `user_id` bigint(20) unsigned NOT NULL,
  `property` varchar(32) NOT NULL,
  `value` longblob NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=latin1;
It's simply a key/value table with a user_id assigned to it. This approach has its advantages as not all users have the same properties, so fields aren't wasted in a table. Also, it allows for a rolling log of changes, since I can see every change to a particular property ever made by a user.
Now, since no deletes or updates ever occur in this table, I can assume that the greatest id will always be the newest entry.
However, I want to select multiple properties at once, for example 'address1', 'address2', 'city', 'state', and I want each to be the entry of its type with the highest id.
So, if they have changed their 'state' property 8 times, and 'city' property 4 times, then I'd only want a SELECT to return the latest of each (1 state and 1 city).
I'm not sure this can even be done efficiently with this type of a table, so I'm open to different table approaches.
Please let me know if I need to provide any more information or clarify my question further.
===
I tried the following, but there could be 3 rows of 'address1' changes after the last 'address2' change. Perhaps using a GROUP BY would work?
SELECT property, value FROM kvtable WHERE user_id = 1 AND (property = 'address1' OR property = 'address2') ORDER BY id
Assuming your ids are incremental integers and you have not manually specified them out of order, you can do this with a few MAX() aggregates in a subquery. The point of the subquery is to return the latest entry per property name, per user. That is joined against the whole table to pull in the associated property values. Essentially, the subquery discards all rows which don't have a max(id) per group.
SELECT kvtable.*
FROM
kvtable
JOIN (
SELECT
MAX(id) AS id,
user_id,
property
FROM kvtable
/* optionally limit by user_id */
WHERE user_id = <someuser>
GROUP BY user_id, property
) maxids ON kvtable.id = maxids.id
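If only a handful of properties are needed, as in the address1/address2 example in the question, the same subquery can be narrowed with an IN list (a sketch):

SELECT kvtable.property, kvtable.value
FROM kvtable
JOIN (
    SELECT MAX(id) AS id
    FROM kvtable
    WHERE user_id = 1
      AND property IN ('address1', 'address2', 'city', 'state')
    GROUP BY user_id, property
) maxids ON kvtable.id = maxids.id;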
I am new to MySQL and I have been pulling my hair out over this problem for days. I need to improve/optimize this query so that it runs faster - right now it's taking over 5 seconds.
Here is the query:
SELECT SQL_NO_CACHE COUNT(*) as multiple, a.*, s.*
FROM announcements as a
INNER JOIN stores as s
ON a.username=s.username
WHERE s.username is not null AND s.state='NC'
GROUP BY a.announcement_id
ORDER BY a.dt DESC LIMIT 0,10
Stores table consists of: store_id, username, name, state, city, zip, etc...
Announcements table consists of: announcement_id, msg, dt, username
The stores table has around 10,000 records and the announcements table has around 500,000 records.
What I'm trying to accomplish in English: display the 10 most recent store announcements, BUT what makes this complicated is that stores can have multiple entries in the stores table with the same username (one row per location). So if a chain store, let's say "Chipotle", sends an announcement, I want to display only one row for their announcement with a note next to it that says "this store has multiple locations". That's why I'm using the COUNT(*) and GROUP BY: if COUNT(*) > 1 I know there are multiple locations for the announcement.
The WHERE condition can be any state, city, or zip. I'm using SQL_NO_CACHE because announcements are updated frequently, so you rarely get the same results - does that make sense?
I would really appreciate any suggestions on how to do this better. I know little about indexes, but I did create an index for the "username" field in both tables. Feel free to shred me apart here; I know I must be missing something.
Update --
DESC stores;
Field     Type          Null  Key  Default  Extra
store_id  int(11)       NO    PRI  NULL     auto_increment
username  varchar(20)   NO    MUL  NULL
name      varchar(100)  NO         NULL
street    varchar(100)  NO         NULL
city      varchar(50)   NO         NULL
state     varchar(2)    NO         NULL
zip       varchar(15)   NO         NULL

DESC announcements;
Field            Type          Null  Key  Default  Extra
dt               datetime      NO         NULL
username         varchar(20)   NO    MUL  NULL
msg              varchar(200)  NO         NULL
announcement_id  int(11)       NO    PRI  NULL     auto_increment
EXPLAIN output;
id  select_type  table  type   possible_keys  key       key_len  ref         rows    Extra
1   SIMPLE       a      index  username       PRIMARY   47       NULL        315001  Using temporary; Using filesort
1   SIMPLE       b      ref    username       username  62       a.username  1       Using where
Try something like this:
SELECT SQL_NO_CACHE s.multiple, a.*
FROM announcements as a
INNER JOIN
(
    SELECT username, COUNT(username) as multiple FROM stores
    WHERE username IS NOT NULL AND state = 'NC'
    GROUP BY username
) as s
ON a.username = s.username
ORDER BY a.dt DESC LIMIT 10
If you are ordering on the dt column but there is no index on that column, then MySQL will have to do a (slow, expensive) sort of all of your result rows on that column every time you run the query.
Try adding an index on announcements.dt -- MySQL may be able to access the rows in order by using the index, and avoid the sorting step afterwards.
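For example (a sketch; a composite index on (username, dt) is also worth testing, since the join is on username and the sort is on dt):

ALTER TABLE announcements ADD INDEX idx_dt (dt);
-- or, as an alternative to compare:
-- ALTER TABLE announcements ADD INDEX idx_username_dt (username, dt);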
Change the order of tables in your JOIN: MySQL reads rows from the first table and then
finds matching rows in the second table. If you always filter your results by fields in the stores table, then stores should be the leading table in your JOIN so that MySQL won't pick and sort unnecessary rows from the announcements table (see the sketch after this list).
In the EXPLAIN output you pasted it seems like only one shop matched the query; switching the order of tables would cause it to only look for that specific shop in the announcements table.
Add an index on the dt column (having an indexed integer column with unixtime would be even better)
If possible, create an integer userID for each username and JOIN using that column (add an index on it as well)
Not sure if MySQL still has problems with this but replacing COUNT(*) with COUNT(1) might be helpful.
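A sketch of the table-reordering suggestion above (STRAIGHT_JOIN forces MySQL to read the tables in the written order, so stores is filtered by state first; whether this actually helps should be verified with EXPLAIN):

SELECT SQL_NO_CACHE COUNT(*) AS multiple, a.*
FROM stores AS s
STRAIGHT_JOIN announcements AS a ON a.username = s.username
WHERE s.state = 'NC'
GROUP BY a.announcement_id
ORDER BY a.dt DESC
LIMIT 0, 10;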