How do I use mysql to match against multiple possibilities from a second table? - mysql

I'm not entirely sure how to ask this question, so I'll lead by providing an example table and an example output and then follow up with a more thorough explanation of what I'm attempting to accomplish.
Imagine that I have two tables. In the first is a list of companies. Some of these companies have duplicate entries due to being imported and continuously updated from different sources. For example, the company table may look something like this:
| rawName | strippedName |
| Kohl's | kohls |
| kohls.com | kohls |
| kohls Corporation | kohls |
So in this situation, we have information that has come in from three different sources. To let my program understand that all of these sources refer to the same store, I created the strippedName column (which I also use for creating URLs and whatnot).
In the second table, we have information about deals, coupons, shipping offers, etc. However, since these come in from various sources, they end up with the three different rawNames that we identified above. For example, the second table might look something like this:
| merchantName | dealInformation |
| kohls.com | 10% off everything... |
| kohl's | Free shipping on... |
| kohls corporation | 1 Day Flash Sale! |
| kohls.com | Buy one get one... |
So here we have four entries that are all from the same company. However, when a user on the site visits the listing for Kohls, I want it to display all the entries from each source.
Here is what I currently have, but it doesn't seem to be doing the trick. This seems to only work if I set the LIMIT in that sub-query to 1 so that it only brings back one of the rawNames. I need it to match against all of the rawNames.
SELECT * FROM table2
WHERE merchantName = (SELECT rawName FROM table1 WHERE strippedName = '".$strippedName."')

The quickest fix is to replace your merchantName = with merchantName IN:
SELECT * FROM table2
WHERE merchantName IN (SELECT rawName FROM table1 WHERE strippedName = '".$strippedName."')
The = operator needs to have exactly one value on each side - the IN keyword matches a value against multiple values.
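As a rough, runnable sketch of the IN approach, the snippet below uses Python's built-in sqlite3 module as a stand-in for MySQL. The sample rows come from the question; the extra 'othershop.com' row, the deals_for helper, and the COLLATE NOCASE declarations (which imitate MySQL's usual case-insensitive comparison) are assumptions made for illustration. It also passes the stripped name as a bound parameter rather than concatenating it into the SQL string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- COLLATE NOCASE imitates MySQL's default case-insensitive comparison (assumption)
    CREATE TABLE table1 (rawName TEXT COLLATE NOCASE, strippedName TEXT);
    CREATE TABLE table2 (merchantName TEXT COLLATE NOCASE, dealInformation TEXT);
    INSERT INTO table1 VALUES
        ('Kohl''s', 'kohls'),
        ('kohls.com', 'kohls'),
        ('kohls Corporation', 'kohls');
    INSERT INTO table2 VALUES
        ('kohls.com', '10% off everything...'),
        ('kohl''s', 'Free shipping on...'),
        ('kohls corporation', '1 Day Flash Sale!'),
        ('kohls.com', 'Buy one get one...'),
        ('othershop.com', 'Unrelated deal');  -- hypothetical row that must not match
""")

def deals_for(conn, stripped_name):
    """Return every deal whose merchantName matches any rawName of the company."""
    sql = """
        SELECT merchantName, dealInformation
        FROM table2
        WHERE merchantName IN (
            SELECT rawName FROM table1 WHERE strippedName = ?
        )
    """
    return conn.execute(sql, (stripped_name,)).fetchall()

deals = deals_for(conn, "kohls")
for merchant, info in deals:
    print(merchant, "|", info)
```

The same result could also be written as a JOIN between the two tables on merchantName = rawName; the IN form is simply the smallest change to the original query.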

Related

How to extract relational data from a flat table using SQL?

I have a single flat table containing a list of people which records their participation in different groups and their activities over time. The table contains following columns:
- name (first/last)
- e-mail
- secondary e-mail
- group
- event date
+ some other data in a series of columns, relevant to a specific event (meeting, workshop).
I want to extract distinct people from that into a separate table, so that further down the road it could be used for their profiles giving them a list of what they attended and relevant info. In other words, I would like to have a list of people (profiles) and then link that to a list of groups they are in and then a list of events per group they participated in.
Obviously, same people appear a number of times:
| Full name | email | secondary email | group | date |
| John Smith | jsmith@someplace.com | | AcOP | 2010-02-12 |
| John Smith | jsmith@gmail.com | jsmith@someplace.com | AcOP | 2010-03-14 |
| John Smith | jsmith@gmail.com | | CbDP | 2010-03-18 |
| John Smith | jsmith@someplace.com | | BDz | 2010-04-02 |
Of course, I would like to roll it into one record for John Smith with both e-mails in the resulting People table. I can't rule out that there might be more records for same person with other e-mails than those two - I can live with that. To make it more complex ideally I would like to derive a list of groups, creating a Groups table (possibly with further details on the groups) and then a list of meetings/activities for each group. By linking that I would then have clean relational model.
Now, the question: is there a way to perform such a transformation of data in SQL? Or do I need to write a procedure (program) that would traverse the database and do it?
The database is in MySQL, though I can also use MS Access (it was given to me in that format).
There is no tool that does this automatically. You will have to write a couple queries (unless you want to write a DTS package or something proprietary). Here's a typical approach:
Write two select statements for the two tables you wish to create-- one for users and one for groups. You may need to use DISTINCT or GROUP BY to ensure you only get one row when the source table contains duplicates.
Run the two select statements and inspect them for problems. For example, it's possible some users show up with two different email addresses, or some users have the same name and were combined incorrectly. These will need to be cleaned up in order to proceed. There is no great way to do this-- it's more or less a manual process requiring expert knowledge of the data.
Write CREATE TABLE scripts based on the two SELECT statements so that you can store the results somewhere.
Use INSERT ... SELECT (or CREATE TABLE ... SELECT) in MySQL, or SELECT ... INTO in MS Access, to populate the tables from your two SELECT statements.
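The steps above might look roughly like the following sketch, which uses Python's built-in sqlite3 module as a stand-in for MySQL. The table and column names (flat, people, grps, emails, attendance) are hypothetical, the sample rows are adapted from the question, and matching people by full name alone is an assumption that would merge two different people who share a name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Flat source table (hypothetical name), rows adapted from the question
    CREATE TABLE flat (full_name TEXT, email TEXT, secondary_email TEXT,
                       grp TEXT, event_date TEXT);
    INSERT INTO flat VALUES
        ('John Smith', 'jsmith@someplace.com', NULL, 'AcOP', '2010-02-12'),
        ('John Smith', 'jsmith@gmail.com', 'jsmith@someplace.com', 'AcOP', '2010-03-14'),
        ('John Smith', 'jsmith@gmail.com', NULL, 'CbDP', '2010-03-18'),
        ('John Smith', 'jsmith@someplace.com', NULL, 'BDz', '2010-04-02');

    -- Target tables
    CREATE TABLE people (id INTEGER PRIMARY KEY, full_name TEXT UNIQUE);
    CREATE TABLE grps (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE emails (person_id INTEGER REFERENCES people(id), email TEXT,
                         UNIQUE (person_id, email));
    CREATE TABLE attendance (person_id INTEGER REFERENCES people(id),
                             group_id INTEGER REFERENCES grps(id),
                             event_date TEXT);

    -- One row per distinct person and per distinct group
    INSERT INTO people (full_name) SELECT DISTINCT full_name FROM flat;
    INSERT INTO grps (name) SELECT DISTINCT grp FROM flat;

    -- Collect primary and secondary addresses; UNION removes duplicates
    INSERT INTO emails (person_id, email)
    SELECT p.id, f.email
    FROM flat f JOIN people p ON p.full_name = f.full_name
    UNION
    SELECT p.id, f.secondary_email
    FROM flat f JOIN people p ON p.full_name = f.full_name
    WHERE f.secondary_email IS NOT NULL;

    -- Link each source row to its person and group
    INSERT INTO attendance (person_id, group_id, event_date)
    SELECT p.id, g.id, f.event_date
    FROM flat f
    JOIN people p ON p.full_name = f.full_name
    JOIN grps g ON g.name = f.grp;
""")

def count(table):
    """Row count of one of the tables created above."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

print(count("people"), count("grps"), count("emails"), count("attendance"))
```

The manual inspection step still applies: the DISTINCT queries only produce candidates, and the people list should be reviewed before it is treated as clean.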

How do I resolve or avoid need for MySQL with multiple AUTO INCREMENT columns?

I have put a lot of effort into my database design, but I think I am
now realizing I made a major mistake.
Background: (Skip to 'Problem' if you don't need background.)
The DB supports a custom CMS layer for a website template. Users of the
template are limited to turning pages on and off, but not creating
their own 'new' pages. Further, many elements are non editable.
Therefore, if a page has a piece of text I want them to be able to edit,
I would have 'manually' assigned a static ID to it:
<h2><%= CMS.getDataItemByID(123456) %></h2>
Note: The scripting language is not relevant to this question, but the design forces
each table to have unique column names. Hence the convention of 'TableNameSingular_id'
for the primary key etc.
The scripting language would do a lookup on these tables to find the string.
mysql> SELECT * FROM CMSData WHERE CMSData_data_id = 123456;
+------------+-----------------+-----------------------------+
| CMSData_id | CMSData_data_id | CMSData_CMSDataType_type_id |
+------------+-----------------+-----------------------------+
| 1 | 123456 | 1 |
+------------+-----------------+-----------------------------+
mysql> SELECT * FROM CMSDataTypes WHERE CMSDataType_type_id = 1;
+----------------+---------------------+-----------------------+------------------------+
| CMSDataType_id | CMSDataType_type_id | CMSDataType_type_name | CMSDataType_table_name |
+----------------+---------------------+-----------------------+------------------------+
| 1 | 1 | String | CMSStrings |
+----------------+---------------------+-----------------------+------------------------+
mysql> SELECT * FROM CMSStrings WHERE CMSString_CMSData_data_id=123456;
+--------------+---------------------------+----------------------------------+
| CMSString_id | CMSString_CMSData_data_id | CMSString_string |
+--------------+---------------------------+----------------------------------+
| 1 | 123456 | The answer to the universe is 42.|
+--------------+---------------------------+----------------------------------+
The rendered text would then be:
<h2>The answer to the universe is 42.</h2>
This works great for 'static' elements, such as the example above. I used the exact same
method for other data types such as file specifications, EMail Addresses, Dates, etc.
However, it fails for when I want to allow the User to dynamically generate content.
For example, there is an 'Events' page and they will be dynamically created by the
User by clicking 'Add Event' or 'Delete Event'.
An Event table will use keys to reference other tables with the following data items:
Data Item: Table:
--------------------------------------------------
Date CMSDates
Title CMSStrings (As shown above)
Description CMSTexts (MySQL TEXT data type.)
--------------------------------------------------
Problem:
That means, each time an Event is created, I need to create the
following rows in the CMSData table;
+------------+-----------------+-----------------------------+
| CMSData_id | CMSData_data_id | CMSData_CMSDataType_type_id |
+------------+-----------------+-----------------------------+
| x | y | 6 | (Event)
| x+1 | y+1 | 5 | (Date)
| x+2 | y+2 | 1 | (Title)
| x+3 | y+3 | 3 | (Description)
+------------+-----------------+-----------------------------+
But there is the problem. In MySQL, a table can have only one AUTO_INCREMENT column.
If I query for the highest value of CMSData_data_id and just add 1 to it, there
is a chance there is a race condition, and someone else grabs it first.
How is this issue typically resolved - or avoided in the first place?
Thanks,
Eric
The id should be meaningless, except to be unique. Your design should work no matter if the block of 4 ids is contiguous or not.
Redesign your implementation to add the parts separately, not as a block of 4. Doing so should simplify things overall, and improve your scalability.
What about locking the table before writing to it (see LOCK TABLES in the MySQL manual)? That way, no other session can insert into the CMSData table while you read the last id and add your rows.
Another suggestion would be to drop the incremented id in favor of a uniquely generated one, such as a GUID.
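A minimal sketch of the "ids are meaningless, insert the parts separately" advice, using Python's built-in sqlite3 module as a stand-in for MySQL. It assumes a simplified CMSData table in which the single auto-generated key doubles as the data id (the separate CMSData_data_id column is dropped); the create_event helper is hypothetical. In MySQL the generated key would be read with LAST_INSERT_ID(), which is tracked per connection, so no table-wide MAX()+1 query is needed.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE CMSData (
        CMSData_id INTEGER PRIMARY KEY AUTOINCREMENT,  -- the only generated key
        CMSData_CMSDataType_type_id INTEGER NOT NULL
    )
""")

def create_event(conn, type_ids=(6, 5, 1, 3)):
    """Insert one CMSData row per part (Event, Date, Title, Description)
    and return the ids the database generated for them."""
    ids = []
    with conn:  # one transaction: all rows are committed, or none on error
        for type_id in type_ids:
            cur = conn.execute(
                "INSERT INTO CMSData (CMSData_CMSDataType_type_id) VALUES (?)",
                (type_id,),
            )
            ids.append(cur.lastrowid)  # id generated for this connection's insert
    return ids

event_ids = create_event(conn)
more_ids = create_event(conn)
print(event_ids, more_ids)

# Alternative from the second answer: a generated identifier instead of a counter.
guid = str(uuid.uuid4())
print(guid)
```

Nothing in this sketch relies on the four ids being contiguous; each part only needs to store the id it was given.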

Summary of MySQL detail records matching by IP address ranges - mySQL Jedi Knight required

So, I have to draw upon all the powers of the greatest mySQL minds that SO has to offer. I have to summarize detail records based on the IP address in each record. Here's the scenario:
In short, we have consortiums that want to know: "Which schools within my consortium watched which videos how many times"? In SQL terms, it amounts to COUNTing the detail records, grouped by which IP range it might fall into.
We have several university Consortiums - each with a handful of different schools that are members.
Each school within a consortium uses various IP ranges to access the videos that we serve to these schools.
The IP Ranges are specified with wild cards, so each school specifies something like '100.200.35.x, 100.201.x.x, 100.202.39.50, etc.', with the average number of ranges per school being 10 or 15.
The raw text log files to summarize are already in a database (one row for each log entry), and each row has the actual IP address that accessed the video file.
There are 100's of millions of detail records, so I fully expect this to be a long slow process that runs for a considerable period.
PHP scripts exist that can "explode" the wildcards into the individual IPs that are represented, but I fear this will be the final answer and could take weeks to run.
(For simplicity's sake, I'm only going to refer to the video filename that was accessed and COUNT the log entries for it, but in fact all the details such as start/stop/duration, etc. are there and will ultimately be part of this solution.)
With Consortium records something like this: (All table designs except log details open to suggestion):
| id|consortium |
| 10|Ivy League |
| 20|California |
And School/IP records something like this:
| id|school |consortium_id|
| 101|Harvard |10 |
| 102|Yale |10 |
| 103|UCLA |20 |
| 104|Berkeley |20 |
| id|school_id|ip_range |
| 1| 101 |100.200.x.x |
| 2| 101 |100.201.65.x |
| 3| 101 |100.202.39.50 |
| 4| 101 |100.202.39.51 |
| 5| 101 |100.200.x.x |
| 6| 101 |100.201.65.x |
| 7| 101 |100.202.39.50 |
And detail records something like this:
|session |ip_address |filename |
|560554790925|100.202.390.500|history101.mp4 |
|406417611526|43.22.90.5 |newsreel.mp4 |
|650423700223|100.202.39.50 |history101.mp4 |
|650423700223|100.202.50.12 |science101.mp4 |
|513057324209|100.202.39.56 |history101.mp4 |
I like to think I'm pretty handy with mySQL, but this one is stretching it, and am hoping that there's a spectacular function or set of steps that someone might offer.
With your existing data structure, you could do string matching as follows (but it's not very efficient):
SELECT schools.school, detail.filename, COUNT(*)
FROM schools
JOIN ipranges ON schools.id = ipranges.school_id
JOIN detail ON detail.ip_address LIKE REPLACE(ipranges.ip_range, 'x', '%')
WHERE schools.consortium_id = ?
GROUP BY schools.school, detail.filename
A better way would be to store your IP ranges as network address and prefix length:
ALTER TABLE ipranges
ADD COLUMN network INT UNSIGNED,
ADD COLUMN prefix TINYINT;
UPDATE ipranges SET
network = INET_ATON(REPLACE(ip_range, 'x', 0)),
prefix = 32 - 8*(CHAR_LENGTH(ip_range) - CHAR_LENGTH(REPLACE(ip_range,'x','')));
ALTER TABLE ipranges
DROP COLUMN ip_range;
ALTER TABLE detail
ADD COLUMN ip_address_new INT UNSIGNED;
UPDATE detail SET
ip_address_new = INET_ATON(ip_address);
ALTER TABLE detail
DROP COLUMN ip_address,
CHANGE ip_address_new ip_address INT UNSIGNED;
Then it would merely be a case of performing some bit comparisons:
SELECT schools.school, detail.filename, COUNT(*)
FROM schools
JOIN ipranges ON schools.id = ipranges.school_id
JOIN detail ON detail.ip_address & ~((1 << 32 - ipranges.prefix) - 1)
= ipranges.network
WHERE schools.consortium_id = ?
GROUP BY schools.school, detail.filename
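To make the network/prefix arithmetic concrete, here is a small Python sketch of the same conversion and bit comparison. The helper names wildcard_to_network and in_range are hypothetical, and the sketch assumes the 'x' wildcards only ever appear in the trailing octets, as in the question's examples.

```python
import ipaddress

def wildcard_to_network(ip_range):
    """Convert e.g. '100.201.65.x' to (network_as_int, prefix_length).

    Mirrors the UPDATE above: every 'x' octet becomes 0 and removes
    8 bits from the prefix length.
    """
    octets = ip_range.split(".")
    prefix = 32 - 8 * octets.count("x")
    network = int(ipaddress.IPv4Address(ip_range.replace("x", "0")))
    return network, prefix

def in_range(ip_address, network, prefix):
    """Same test as the SQL join: clear the host bits, compare to the network."""
    host_bits = 32 - prefix
    mask = ~((1 << host_bits) - 1) & 0xFFFFFFFF  # keep only the 32 address bits
    return (int(ipaddress.IPv4Address(ip_address)) & mask) == network

net24 = wildcard_to_network("100.201.65.x")
net16 = wildcard_to_network("100.200.x.x")
host = wildcard_to_network("100.202.39.50")

print(in_range("100.201.65.7", *net24))   # inside the /24
print(in_range("100.201.66.7", *net24))   # outside the /24
print(in_range("100.200.35.1", *net16))   # inside the /16
print(in_range("100.202.39.56", *host))   # a /32 matches only itself
```

Storing addresses as integers also means the detail table's column can be indexed and compared numerically, which the string LIKE approach cannot exploit in the same way.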
SELECT D.filename, S.school, COUNT(*)
FROM detail_records AS D
INNER JOIN ip_map AS I ON D.ip_address LIKE CONCAT(SUBSTRING(I.ip_range, 1, LOCATE('x', I.ip_range)-1), '%')
INNER JOIN school AS S ON S.id = I.school_id
INNER JOIN consortium AS C ON C.id = S.consortium_id
WHERE S.consortium_id = <consortium identifier>
GROUP BY D.filename, S.school

SQL statement to return elements from a column only if no elements from a different column match

Sorry for the confusing question, I will try to clarify.
I have an SQL database ( that I did not create ) that I would like to write a query for. I know very little about SQL, so it is hard for me to even know what to search for to see if this question has already been asked, so sorry if it has. It should be an easy solution for those in the know.
The query I need is for a search I would like to perform on an existing data management system. I want to return all the documents that a given user has NOT signed off on, as indicated by rows in a signoffs_table. The data is stored roughly as follows (this is a simplification of the actual schema and hides several LEFT JOINs and columns):
signoffs_table:
| id | user_id | document_id | signers_list |
The naive solution I had was to do something like the following:
SELECT document_id from signoffs_table WHERE (user_id <> $BobsID) AND signers_list LIKE "%Bob%";
This works if ONLY Bob signs the document. The problem is that if Bob and Mary have signed the document then the table looks like this:
signoffs_table:
-----------------------------------------------
| id | user_id | document_id | signers_list |
-----------------------------------------------
| 1 | 10 | 100 | "Bob,Mary,Jim" |
| 2 | 20 | 100 | "Bob,Mary,Jim" |
-----------------------------------------------
(assume Bob's ID = 10 and Mary's ID = 20).
When I run the query, I get back document_id 100 (from row #2), because that row lists Bob as a signer but has a user_id other than Bob's.
Is what I am trying to do possible with the given database structure? I can provide more details if needed. I am not sure how much detail is needed.
I guess this query is what you mean:
SELECT document_id FROM signoffs_table AS t1
WHERE signers_list LIKE "%Bob%"
AND NOT EXISTS (
SELECT 1 FROM signoffs_table AS t2
WHERE (t2.user_id = $BobsID) AND t2.document_id = t1.document_id )
I believe your design is incorrect. You have a many-to-many relationship between documents and signers. You should have a junction table, something like:
ID DocumentID SignerID
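One possible shape for that junction-table design, sketched with Python's built-in sqlite3 module as a stand-in for the real database. The table names document_signers and signoffs, Jim's id (30), and document 200 are all assumptions made for the example; only Bob = 10, Mary = 20 and document 100 come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE document_signers (   -- who is required to sign (junction table)
        document_id INTEGER NOT NULL,
        signer_id   INTEGER NOT NULL,
        PRIMARY KEY (document_id, signer_id)
    );
    CREATE TABLE signoffs (           -- who has actually signed
        document_id INTEGER NOT NULL,
        user_id     INTEGER NOT NULL,
        PRIMARY KEY (document_id, user_id)
    );
    -- Document 100: Bob (10) and Mary (20) have signed, Jim (30) has not.
    -- Document 200 (hypothetical): Bob and Mary must sign, nobody has yet.
    INSERT INTO document_signers VALUES
        (100, 10), (100, 20), (100, 30), (200, 10), (200, 20);
    INSERT INTO signoffs VALUES (100, 10), (100, 20);
""")

def unsigned_documents(conn, user_id):
    """Documents the user is required to sign but has not signed yet."""
    sql = """
        SELECT ds.document_id
        FROM document_signers AS ds
        WHERE ds.signer_id = ?
          AND NOT EXISTS (
              SELECT 1 FROM signoffs AS s
              WHERE s.document_id = ds.document_id
                AND s.user_id = ds.signer_id
          )
        ORDER BY ds.document_id
    """
    return [row[0] for row in conn.execute(sql, (user_id,))]

bob_todo = unsigned_documents(conn, 10)
jim_todo = unsigned_documents(conn, 30)
print(bob_todo, jim_todo)
```

With signers stored as rows rather than as a comma-separated string, the LIKE "%Bob%" match (which would also match a signer named "Bobby") is no longer needed.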

Create another table just to store a few options?

I'm creating a database with various tables. Let's take the user table, for example. It has fields such as marital status and system role. Each of those fields has predefined options. Does it make sense to create two new tables for each of those fields, so then when a user is added to the system, choices can be made available for selection e.g. single, married, divorced? It seems a bit of an overkill in terms of one extra query. Is this the best way to do it or do I have other options?
I would definitely create separate tables to store the available options for these various columns. This is a good thing to do as far as normalization goes, and will also save you headaches down the road when you need to add, remove, disable or change any of the options. Also, if you don't create a separate table and instead populate the values directly in the user table, you may end up having to do something like select distinct RelationshipStatus from User to get the available options, which is not as performant as just selecting 10 or however many values from a separate table.
As someone commented, over-normalization can sometimes be a pain, but I've found that not normalizing something as a way to do a quick work-around almost always comes back to haunt you.
User
----
ID
RelationshipStatusId
...other columns
RelationshipStatus
------------------
ID
Value
Description
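A minimal sketch of the two-table layout above, using Python's built-in sqlite3 module as a stand-in for MySQL. The Name column, the sample user, and the three status values are assumptions for illustration; the foreign key is what prevents a user row from pointing at an option that does not exist.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
    CREATE TABLE RelationshipStatus (
        ID INTEGER PRIMARY KEY,
        Value TEXT NOT NULL UNIQUE,
        Description TEXT
    );
    CREATE TABLE User (
        ID INTEGER PRIMARY KEY,
        Name TEXT NOT NULL,   -- hypothetical column for the example
        RelationshipStatusId INTEGER NOT NULL REFERENCES RelationshipStatus(ID)
    );
    INSERT INTO RelationshipStatus (ID, Value) VALUES
        (1, 'Single'), (2, 'Married'), (3, 'Divorced');
    INSERT INTO User (ID, Name, RelationshipStatusId) VALUES (1, 'Alice', 2);
""")

# Options for a drop-down: one cheap query against the small lookup table.
options = [row[0] for row in
           conn.execute("SELECT Value FROM RelationshipStatus ORDER BY ID")]
print(options)

# A status id that is not in the lookup table is rejected by the foreign key.
try:
    conn.execute(
        "INSERT INTO User (ID, Name, RelationshipStatusId) VALUES (2, 'Bob', 99)"
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)
```

The option list is small and rarely changes, so an application could also cache it after the first query if the extra round trip is a concern.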
You can use the ENUM datatype in MySQL to better take care of this scenario. Storing such options in a separate table is a bad idea until you have a lot of them.
mysql> DESC Classes;
+-------+-----------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+-----------------------+------+-----+---------+-------+
| id | int(11) | NO | PRI | NULL | |
| dept | char(4) | NO | | NULL | |
| level | enum('Upper','Lower') | NO | | NULL | |
+-------+-----------------------+------+-----+---------+-------+
3 rows in set (0.00 sec)
mysql> SELECT * FROM Classes;
+----+------+-------+
| id | dept | level |
+----+------+-------+
| 10 | MATH | |
+----+------+-------+
1 row in set (0.00 sec)
mysql> INSERT INTO Classes VALUES (11, 'ENG', 'Upper')
-> ;
Query OK, 1 row affected (0.00 sec)
mysql> SELECT * FROM Classes;
+----+------+-------+
| id | dept | level |
+----+------+-------+
| 10 | MATH | |
| 11 | ENG | Upper |
+----+------+-------+
2 rows in set (0.00 sec)
For design's sake, create another table (what you don't want to do) with a proper PK. This has the extra benefit of saving space: imagine having 10000 records with the word "married" in them.
An alternative is to use a "dictionary" in your application, storing an Id and the value in a structure like this:
Id Marital Status
1 Married
2 Single
.. ......
The same table, but not in a database but in the application, hardcoded, serialized or in an external file.
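In application code, such a hardcoded dictionary could be as small as the following Python sketch. The MaritalStatus enum and the label helper are hypothetical names; the ids and values are the two shown in the table above.

```python
from enum import IntEnum

class MaritalStatus(IntEnum):
    """Hardcoded lookup: the stored Id maps to a display value."""
    MARRIED = 1
    SINGLE = 2

def label(status_id):
    """Turn an Id stored in the user row back into its display text."""
    return MaritalStatus(status_id).name.capitalize()

# (Id, text) pairs for rendering a selection list
choices = [(status.value, label(status.value)) for status in MaritalStatus]
print(choices)
```

The trade-off is that adding or renaming an option now requires a code change and deployment rather than a row update.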
It also depends on the size of the rows. In terms of speed, it can be better to split the table into several.
For example, you can keep the frequently used columns in the user table and all the other, optional information in separate tables. In that case you need to take care when displaying the data as well.
I guess there is no need for over-normalization, as it will make writing queries harder: you end up having to manage too many joins.
If your predefined conditions for Marital Status are: Married, Single and Divorced, I would just store a single character like: M, S and D and would provide these options in a DropDown with fixed values.
I think Marital Status has no further possibilities unless you think of something like:
Want to be Divorced
Married but living alone.
For user role also, I would do something like that:
A - Administrator
P - Power User
R - Restricted User
G - Guest
In case you need something more elaborate, I won't create further tables.