Sorry if my question seems unclear, I'll try to explain.
I have a column in a row, for example /1/3/5/8/42/239/. Let's say I would like to find a similar row, one with as many matching "ids" as possible.
Example:
| My Column |
#1 | /1/3/7/2/4/ |
#2 | /1/5/7/2/4/ |
#3 | /1/3/6/8/4/ |
Now, by running the query on #1 I would like to get row #2, as it's the most similar. Is there any way to do it, or is it just my fantasy? Thanks for your time.
EDIT:
As suggested, I'm expanding my question. This column represents the favourite artists of a user of a music site. I search them like this: MyColumn LIKE '%/ID/%', and I remove an artist by replacing /ID/ with /.
Since you did not provide much information about your data, I have to fill the gaps with my guesses.
So you have a users table
users table
-----------
id
name
other_stuff
And you would like to store which artists are favorites of a user. So you must have an artists table
artists table
-------------
id
name
other_stuff
And to relate you can add another table called favorites
favorites table
---------------
user_id
artist_id
In that table you add a record for every artist that a user likes.
Example data
users
id | name
1 | tom
2 | john
artists
id | name
1 | michael jackson
2 | madonna
3 | deep purple
favorites
user_id | artist_id
1 | 1
1 | 3
2 | 2
To select the favorites of user tom for instance you can do
select a.name
from artists a
join favorites f on f.artist_id = a.id
join users u on f.user_id = u.id
where u.name = 'tom'
And if you add proper indexing to your table then this is really fast!
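With a favorites table like the one above, the original "most similar row" question becomes a self-join that counts shared artists. Here is a minimal sketch, assuming the three example rows from the question are stored one record per (user, artist) pair. It runs on an in-memory SQLite database through Python only so that it is self-contained; the SQL itself should carry over to MySQL largely unchanged. The function name most_similar is made up for illustration.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here; the query itself is plain SQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE favorites (
    user_id   INTEGER NOT NULL,
    artist_id INTEGER NOT NULL,
    PRIMARY KEY (user_id, artist_id)
);
""")

# Rows #1, #2 and #3 from the question, one record per (user, artist) pair.
rows = {
    1: [1, 3, 7, 2, 4],
    2: [1, 5, 7, 2, 4],
    3: [1, 3, 6, 8, 4],
}
conn.executemany(
    "INSERT INTO favorites (user_id, artist_id) VALUES (?, ?)",
    [(user, artist) for user, artists in rows.items() for artist in artists],
)


def most_similar(conn, user_id):
    """Return (other_user_id, shared_count) pairs, most similar first."""
    return conn.execute(
        """
        SELECT other.user_id, COUNT(*) AS shared
        FROM favorites AS mine
        JOIN favorites AS other
          ON other.artist_id = mine.artist_id
         AND other.user_id <> mine.user_id
        WHERE mine.user_id = ?
        GROUP BY other.user_id
        ORDER BY shared DESC, other.user_id
        """,
        (user_id,),
    ).fetchall()


similar = most_similar(conn, 1)
print(similar)  # [(2, 4), (3, 3)]
```

User 2 shares four artists with user 1 and user 3 shares three, so row #2 comes out on top, which is what the question asked for.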
Problem is you're storing this in a really, really awkward way.
I'm guessing you have to deal with an arbitrary number of values. You have two options:
- Store the multiple IDs in a blob object in JSON format. MySQL versions before 5.7.8 don't have JSON functions built in, but there are user defined functions that will extract values for you, etc.
  See: http://blog.ulf-wendel.de/2013/mysql-5-7-sql-functions-for-json-udf/
  Alternatively, switch to PostgreSQL.
- Add as many columns to your table as the maximum number of IDs you expect to have. So if /1/3/7/2/4/8/ is the longest entry, have 6 columns in your table. Reason this is bad: you'll have sparse columns that'll unnecessarily slow your tables.
I'm sure you could write some horrific regex to accomplish the task, but I caution against using complex regexes on enormous tables.
I have come across this problem and I've been trying to solve it for a few days now.
Let's say I have following tables
properties
-----------------------------------------
| id | address | building_material |
-----------------------------------------
| 1 | Street 1 | 1 |
-----------------------------------------
| 2 | Street 2 | 2 |
-----------------------------------------
building_materials
-----------------------------
| id | building_material |
-----------------------------
| 1 | Wood |
-----------------------------
| 2 | Stone |
-----------------------------
Now. I would like to provide an API where you could send a request and ask for every property that has building material of wood. Like this:
myapi.com/properties?building_material=Wood
So I would like to query database like this (I want to return the string value of building_material not the numeric value):
SELECT p.id, p.address, bm.building_material
FROM properties as p
JOIN building_materials as bm ON (p.building_material = bm.id)
WHERE building_material = "Wood"
But this will give me an error
Column 'building_material' in where clause is ambiguous
Also if I want to get property with id of 1.
SELECT p.id, p.address, bm.building_material
FROM properties as p
JOIN building_materials as bm ON (p.building_material = bm.id)
WHERE id = 1
Column 'id' in where clause is ambiguous
I understand that the error means that I have same column name in two tables and I don't specify which id I want like p.id.
The problem is that I don't know how many query parameters the API user is going to send, and I would like to avoid looping through them and changing id to p.id and building_material to bm.building_material. I also don't want the user to have to send a request to the API like this
myapi.com/properties?bm.building_material=Wood
I've thought about changing the properties table building_material to fk_building_material and changing properties table id to property_id.
I just don't like the idea that on client side I would then have to refer property's building material as fk_building_material. Is this a valid method to solve this problem or what would be the correct way of designing these tables?
The query mentions two tables, so all the columns in both tables are "on the table" for use anywhere in the query.
In one table building_material is an "id" for linking to the other table; in the other table, it is a string. While this is possible, it is confusing to the reader. And to the parser. To resolve the confusion, you must qualify building_material with which one you want; that is done with a table alias (or table) in front (as you did in all other places).
The two ids are ambiguous in the same way. But naming the key id is the "convention" used by table designers. So, it is OK for an id in one table to be different than the id in the other table. (p.id refers to one thing in one table; bm.id refers to another in another table.)
SELECT p.id, p.address, bm.building_material
FROM properties as p
JOIN building_materials as bm ON (p.building_material = bm.id)
WHERE bm.building_material = "Wood" -- Note "bm."
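On the API side, one way to keep clients sending plain building_material=Wood while the SQL stays qualified is a fixed lookup from parameter name to qualified column. The following is only a sketch under assumptions: FILTER_COLUMNS and build_query are hypothetical names, and SQLite (with its ? placeholder) stands in for MySQL so that the example runs on its own. The lookup doubles as a whitelist, so an unknown parameter name never reaches the SQL text.

```python
import sqlite3

# Hypothetical whitelist: public API parameter name -> qualified SQL column.
FILTER_COLUMNS = {
    "id": "p.id",
    "address": "p.address",
    "building_material": "bm.building_material",
}

BASE_QUERY = (
    "SELECT p.id, p.address, bm.building_material "
    "FROM properties AS p "
    "JOIN building_materials AS bm ON p.building_material = bm.id"
)


def build_query(params):
    """Build (sql, values) from request parameters, rejecting unknown names."""
    clauses, values = [], []
    for name, value in params.items():
        if name not in FILTER_COLUMNS:
            raise ValueError(f"unsupported filter: {name}")
        # Only the whitelisted column text is interpolated; values are bound.
        clauses.append(f"{FILTER_COLUMNS[name]} = ?")
        values.append(value)
    sql = BASE_QUERY
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, values


# Example data from the question, loaded into an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE building_materials (id INTEGER PRIMARY KEY, building_material TEXT);
CREATE TABLE properties (id INTEGER PRIMARY KEY, address TEXT, building_material INTEGER);
INSERT INTO building_materials VALUES (1, 'Wood'), (2, 'Stone');
INSERT INTO properties VALUES (1, 'Street 1', 1), (2, 'Street 2', 2);
""")

sql, values = build_query({"building_material": "Wood"})
wood = conn.execute(sql, values).fetchall()
print(wood)  # [(1, 'Street 1', 'Wood')]
```

The loop over parameters is still there, but it is generic: adding a filter means adding one entry to the mapping, and the table and column names never have to be renamed.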
Suppose I had these 4 tables, linked by various foreign key relationships (e.g. an area must belong to a location, a shop must belong to an area, an item price must belong to a shop, etc.)
----------------------------------
|Location Name | Location ID |
| | |
----------------------------------
-------------------------------------------------
|Area Name | Area ID | Location ID |
| | | |
-------------------------------------------------
-------------------------------------------------
| Shop Name | Shop ID | Area ID |
| | | |
-------------------------------------------------
----------------------------------
| Item Price | Shop ID |
| | |
----------------------------------
And I wanted the sum of 'Item Price' that belonged to a specific location id. So all the areas and shops item price total for location id 'x'.
One way I found to do this is to join all the tables for one location and get the amount eg:
SELECT SUM(items.item_price)
FROM items
LEFT JOIN shops ON (items.shop_id = shops.shop_id)
LEFT JOIN areas ON (shops.area_id = areas.area_id)
LEFT JOIN locations ON (areas.location_id = locations.location_id)
WHERE locations.location_id = 4;
However is this the best way to do this since it involves retrieving the full tree of the data and filtering it out? Would there be a better way if there are a million rows or is this the best way?
You can try a subquery that filters the areas first:
SELECT SUM(items.item_price)
FROM items
JOIN shops ON (items.shop_id = shops.shop_id)
-- Inner joins, so items whose shop is not in location 4 are dropped.
JOIN (SELECT area_id FROM areas WHERE location_id = 4) AS ar ON (shops.area_id = ar.area_id)
If you define the right indexes, then the query does not read all the millions of rows for each table.
Think about a telephone book and how you look up a name. Do you read the whole book cover to cover looking for the name? No, you take advantage of the fact that the book is sorted by lastname, firstname and you go directly to the name. It takes only a few tries to find the right page. In fact, on average it takes about log2(N) tries for a book with N names in it.
The same kind of search happens for each join. If you have indexes, each comparison expression uses a similar lookup to find matching rows in the joined table. It's pretty fast.
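To make the indexing point concrete, here is a small sketch with an index on each column the joins walk through. The column names with underscores, the index names and the sample rows are all assumptions made for illustration, and SQLite is used instead of MySQL so the example runs standalone. EXPLAIN QUERY PLAN is SQLite's counterpart to MySQL's EXPLAIN; its exact wording varies by version, so it is printed here rather than relied on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE locations (location_id INTEGER PRIMARY KEY, location_name TEXT);
CREATE TABLE areas (area_id INTEGER PRIMARY KEY, area_name TEXT, location_id INTEGER);
CREATE TABLE shops (shop_id INTEGER PRIMARY KEY, shop_name TEXT, area_id INTEGER);
CREATE TABLE items (item_price INTEGER, shop_id INTEGER);

-- Indexes on the columns used to walk from a location down to its items.
CREATE INDEX idx_areas_location ON areas (location_id);
CREATE INDEX idx_shops_area ON shops (area_id);
CREATE INDEX idx_items_shop ON items (shop_id);

-- Made-up sample rows: two shops in location 4, one shop in location 5.
INSERT INTO locations VALUES (4, 'North'), (5, 'South');
INSERT INTO areas VALUES (10, 'A', 4), (11, 'B', 5);
INSERT INTO shops VALUES (100, 'S1', 10), (101, 'S2', 10), (102, 'S3', 11);
INSERT INTO items VALUES (5, 100), (7, 101), (100, 102);
""")

# Start from the filtered table; areas already carries location_id,
# so the locations table does not need to be joined at all.
TOTAL_SQL = """
    SELECT SUM(items.item_price)
    FROM areas
    JOIN shops ON shops.area_id = areas.area_id
    JOIN items ON items.shop_id = shops.shop_id
    WHERE areas.location_id = ?
"""

total = conn.execute(TOTAL_SQL, (4,)).fetchone()[0]
print(total)  # 12

# Show how the database intends to execute the query.
for step in conn.execute("EXPLAIN QUERY PLAN " + TOTAL_SQL, (4,)):
    print(step)
```

With these indexes the database can go from the one location to its areas, then to their shops, then to their items, instead of scanning every row of every table.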
But if that's not fast enough, you can also use denormalization, which in this case would mean storing all the data in one wide table with many columns.
----------------------------------------------------------------------
|Location Name | Area Name | Shop Name | Item Name | Item Price |
| | | | | |
----------------------------------------------------------------------
The advantage of denormalization is that it avoids certain joins. It stores the row just like one of the rows you'd get from the result set of your example joined SQL query. You just read one row from the table and you have all the information you need.
The disadvantage of denormalization is the redundant storage of data. Presumably each shop has many items. But each item is stored on a row of its own, which means that row has to repeat the names of the shop, area, and location.
By storing those data repeatedly, you create an opportunity for "anomalies" like if you change the name of a given shop, but you mistakenly change it only on a few rows instead of everywhere the shop name appears. Now you have two names for the same shop, and someone else looking at the database has no way of knowing which one is correct.
In general, maintaining multiple normalized tables is preferable, because each "fact" is stored exactly once, so there can be no anomalies.
Creating indexes to help your queries is sufficient for most applications.
You might like my presentation, How to Design Indexes, Really, and the video: https://www.youtube.com/watch?v=ELR7-RdU9XU
I have 6 tables. These are simplified for this example.
user_items
ID | user_id | item_name | version
-------------------------------------
1 | 123 | test | 1
data
ID | name | version | info
----------------------------
1 | test | 1 | info
data_emails
ID | name | version | email_id
------------------------
1 | test | 1 | 1
2 | test | 1 | 2
emails
ID | email
-------------------
1 | email#address.com
2 | second#email.com
data_ips
ID | name | version | ip_id
----------------------------
1 | test | 1 | 1
2 | test | 1 | 2
ips
ID | ip
--------
1 | 1.2.3.4
2 | 2.3.4.5
What I am looking to achieve is the following.
The user (123) has the item with name 'test'. This is the basic information we need for a given entry.
There is data in our 'data' table and the current version is 1, so the version in our user_items table is also 1. The two tables are linked together by the name and version. The setup is like this because a user could have an item for which we don't have data; likewise there could be an item for which we have data but which no user owns.
For each item there are also 0 or more emails and ips associated. These can be the same for many items so rather than duplicate the actual email varchar over and over we have the data_emails and data_ips tables which link to the emails and ips table respectively based on the email_id/ip_id and the respective ID columns.
The emails and ips are associated with the data version again through the item name and version number.
My first question: is this a good, well-optimized database setup?
My next and main question is about joining this complex data structure.
What I had was:
PHP
- get all the user items
- loop through them and get the most recent data entry (if any)
- if there is one get the respective emails
- get the respective ips
Does that count as 3 queries or essentially infinite depending on the number of user items?
I was made to believe that the above was inefficient and as such I wanted to condense my setup into using one query to get the same data.
I have achieved that with the following code
SELECT user_items.name,GROUP_CONCAT( emails.email SEPARATOR ',' ) as emails, x.ip
FROM user_items
JOIN data AS data ON (data.name = user_items.name AND data.version = user_items.version)
LEFT JOIN data_emails AS data_emails ON (data_emails.name = user_items.name AND data_emails.version = user_items.version)
LEFT JOIN emails AS emails ON (data_emails.email_id = emails.ID)
LEFT JOIN
(SELECT name,version,GROUP_CONCAT( the_ips.ip SEPARATOR ',' ) as ip FROM data_ips
LEFT JOIN ips as the_ips ON data_ips.ip_id = the_ips.ID )
x ON (x.name = data.name AND x.version = user_items.version)
I have done loads of reading to get to this point and worked tirelessly to get here.
This works as I require - this question seeks to clarify what are the benefits of using this instead?
I have had to use a subquery (I believe?) to get the ips as previously it was multiplying results (I believe based on the complex joins). How this subquery works I suppose is my main confusion.
Summary of questions.
-Is my database well set up for my usage? Any improvements would be appreciated. And any useful resources to help me expand my knowledge would be great.
-How does the subquery in my sql actually work - what is the query doing?
-Am I correct to keep using left joins - I want to return the user item, and null values if applicable to the right.
-Am I essentially replacing a potentially infinite number of queries with 2? Does this make a REAL difference? Can the above be improved?
-Given that when I update a version of an item in my data table I now have to update the version in the user_items table, I have a few more update queries to do. Is the tradeoff of this setup in practice worthwhile?
Thanks to anyone who contributes to helping me get a better grasp of this !!
Given your data layout and your objective, the query is correct. If you've only got a small amount of data it shouldn't be a performance problem, but that will change quickly as the amount of data grows. However, when you have a large amount of data there are very few circumstances where you should ever see all your data in one go, implying that the results will be filtered in some way. Exactly how they are filtered has a huge impact on the structure of the query.
How does the subquery in my sql actually work
Currently it doesn't work properly - there is no GROUP BY
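To illustrate the missing GROUP BY: each derived table can collapse its rows to one per (name, version) before it is joined, so the emails and the ips no longer multiply each other. This is only a sketch under assumptions: it runs on SQLite, where GROUP_CONCAT(x, ',') replaces MySQL's GROUP_CONCAT(x SEPARATOR ','), it leaves out the data table for brevity, and the second user item ('other') is an invented row that shows the NULLs produced by the LEFT JOINs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_items (ID INTEGER PRIMARY KEY, user_id INTEGER, item_name TEXT, version INTEGER);
CREATE TABLE emails (ID INTEGER PRIMARY KEY, email TEXT);
CREATE TABLE ips (ID INTEGER PRIMARY KEY, ip TEXT);
CREATE TABLE data_emails (ID INTEGER PRIMARY KEY, name TEXT, version INTEGER, email_id INTEGER);
CREATE TABLE data_ips (ID INTEGER PRIMARY KEY, name TEXT, version INTEGER, ip_id INTEGER);

INSERT INTO user_items VALUES (1, 123, 'test', 1), (2, 123, 'other', 1);
INSERT INTO emails VALUES (1, 'email#address.com'), (2, 'second#email.com');
INSERT INTO ips VALUES (1, '1.2.3.4'), (2, '2.3.4.5');
INSERT INTO data_emails VALUES (1, 'test', 1, 1), (2, 'test', 1, 2);
INSERT INTO data_ips VALUES (1, 'test', 1, 1), (2, 'test', 1, 2);
""")

# Each derived table is grouped to one row per (name, version) BEFORE the
# join, so joining both of them cannot multiply emails by ips.
ITEMS_SQL = """
    SELECT ui.item_name, e.emails, i.ips
    FROM user_items AS ui
    LEFT JOIN (
        SELECT de.name, de.version, GROUP_CONCAT(em.email, ',') AS emails
        FROM data_emails AS de
        JOIN emails AS em ON em.ID = de.email_id
        GROUP BY de.name, de.version
    ) AS e ON e.name = ui.item_name AND e.version = ui.version
    LEFT JOIN (
        SELECT di.name, di.version, GROUP_CONCAT(ip.ip, ',') AS ips
        FROM data_ips AS di
        JOIN ips AS ip ON ip.ID = di.ip_id
        GROUP BY di.name, di.version
    ) AS i ON i.name = ui.item_name AND i.version = ui.version
    WHERE ui.user_id = ?
    ORDER BY ui.ID
"""

items = conn.execute(ITEMS_SQL, (123,)).fetchall()
for row in items:
    print(row)
```

The result has exactly one row per user item: 'test' with both emails and both ips, and 'other' with NULL in both list columns. The order of values inside each concatenated list is not guaranteed.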
Is the tradeoff of this setup in practice worthwhile?
No - it implies that your schema is too normalized.
Let's say I have the following scenario.
A database of LocalLibrary with two tables Books and Readers
| BookID| Title | Author |
-----------------------------
| 1 | "Title1" | "John" |
| 2 | "Title2" | "Adam" |
| 3 | "Title3" | "Adil" |
------------------------------
And the readers table looks like this.
| UserID| Name |
-----------------
| 1 | xy |
| 2 | yz |
| 3 | xz |
----------------
Now, let's say that a user can create a list of books that they have read (a bookshelf that strictly contains books from the above authors only, i.e. authors in our DB). What is the best way to represent that bookshelf in the database?
My initial thought was a comma-separated list of BookIDs in the Readers table. But it clearly sounds awkward for a relational DB, and I would also have to split it every time I display the list of a user's books. Also, when a user adds a new book to the shelf, there is no way of checking whether it already exists on their shelf except to split the comma-separated list and compare the IDs. Deleting is also not easy.
So, in one line, the question is: how does one appropriately model situations like these?
I have not done anything beyond simple SELECTs and INSERTs in MySQL. It would be very helpful if you could describe this in simple terms and provide links for further reading.
Please comment if you need more explanation.
Absolutely forget the idea of adding a comma-separated list of books to the Readers table. It will be unsearchable and very clumsy. You need a third table that joins the Books table and the Readers table. Each record in this table represents a reader reading a book.
Table ReaderList
--------------------
UserID | BookID |
--------------------
You get a list of books read by a particular user with
select l.UserID, r.Name, l.BookID, b.Title, b.Author
from ReaderList l left join Books b on l.BookID = b.BookID
left join Readers r on l.UserID = r.UserID
where l.UserID = 1
As you can see, this pattern requires the use of the keyword JOIN, which brings together data from two or more tables. You can read more about JOIN in this article
If you want, you could enhance this model adding another field to the ReaderList like the ReadingDate
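The ReaderList table also takes care of the duplicate check and the deletion that the question worried about. A minimal sketch, assuming a composite primary key on (UserID, BookID); the helper names add_book and remove_book are made up for illustration, and SQLite stands in for MySQL so the example runs standalone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ReaderList (
    UserID INTEGER NOT NULL,
    BookID INTEGER NOT NULL,
    PRIMARY KEY (UserID, BookID)  -- at most one row per (reader, book) pair
);
""")


def add_book(conn, user_id, book_id):
    """Add a book to a reader's shelf; return False if it was already there."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO ReaderList (UserID, BookID) VALUES (?, ?)",
                (user_id, book_id),
            )
        return True
    except sqlite3.IntegrityError:
        # The primary key rejected a duplicate (UserID, BookID) pair.
        return False


def remove_book(conn, user_id, book_id):
    """Remove a book from a reader's shelf; return the number of rows deleted."""
    with conn:
        cursor = conn.execute(
            "DELETE FROM ReaderList WHERE UserID = ? AND BookID = ?",
            (user_id, book_id),
        )
    return cursor.rowcount


first_add = add_book(conn, 1, 2)    # True: new entry
second_add = add_book(conn, 1, 2)   # False: already on the shelf
removed = remove_book(conn, 1, 2)   # 1: one row deleted
print(first_add, second_add, removed)
```

The database enforces uniqueness itself, so there is no list to split and compare, and deleting a book is a single DELETE statement.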
The question is:
I have a table that contains details. This table is used by users when they register, update their profile, or participate in different exams.
The report I need will have some calculations, like aggregate scores.
I would like to ask whether it is better to create a new table which holds the report I need, or to work on the same table.
Are you able to provide any further details? What fields are available in the table that you want to query? How do you want to display this information? On a website? For a report?
From what you describe, you need two tables. One table (let's call it 'users') would contain information about each user, and the other would contain the actual exam scores (let's call this table 'results').
Each person in the 'users' table has a unique ID number (I'll call it UID) to identify them, and each score in the 'results' table also has the UID of the person the score relates to. By including the UID of the user in the 'results' table you can link any number of results to one user (known as a one-to-many relationship).
The 'users' table could look like this:
userUID (UID for each person) | Name | User Details
1 | Barack Obama | President
2 | George Bush | Ex-President
The 'results' table could look like this:
UID for each exam | userUID (UID of the person who took the test) | Score
1 | 1 | 85
2 | 2 | 40
3 | 1 | 82
4 | 2 | 25
I always like to add a UID for things like the exam because it allows you to easily find a specific exam result.
Anyway... a query to get all of the results for Barack Obama would look like this:
SELECT Score FROM results WHERE userUID = 1
To get results for George Bush, you just change the userUID to 2. You would obviously need to know the UID of the user (userUID) before you ran this query.
Please note that these are VERY basic examples (involving fictional characters ;) ). You could easily add an aggregated score field to the 'users' table and update that each time you add a new result to the 'results' table. Depending upon how your code is set up this could save you a query.
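For the report itself, the aggregates can also be computed on demand from the 'results' table with GROUP BY, rather than kept in a separate report table. A minimal sketch using the example rows above; the table name 'users' and the chosen aggregates (count, sum, average) are assumptions, and SQLite stands in for MySQL so the example runs standalone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (userUID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE results (
    UID     INTEGER PRIMARY KEY,
    userUID INTEGER REFERENCES users (userUID),
    Score   INTEGER
);
INSERT INTO users VALUES (1, 'Barack Obama'), (2, 'George Bush');
INSERT INTO results VALUES (1, 1, 85), (2, 2, 40), (3, 1, 82), (4, 2, 25);
""")

# One output row per user: number of exams, total score and average score.
report = conn.execute("""
    SELECT u.Name, COUNT(r.UID), SUM(r.Score), AVG(r.Score)
    FROM users AS u
    LEFT JOIN results AS r ON r.userUID = u.userUID
    GROUP BY u.userUID, u.Name
    ORDER BY u.userUID
""").fetchall()

for row in report:
    print(row)
# ('Barack Obama', 2, 167, 83.5)
# ('George Bush', 2, 65, 32.5)
```

Computing the report this way means there is no second copy of the scores to keep in sync; a stored aggregate column is mainly worth considering if this query turns out to be too slow.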
Good luck - Hopefully this helps!