This is my first question on Stack Overflow, and I am delighted to be part of this community because it has helped me many times.
I'm not an expert in SQL or MySQL, but I'm working on a project that needs large tables (millions of rows). I have a problem with a join and I don't understand why it takes so long. Thanks in advance :)
Here are the tables:
CREATE TABLE IF NOT EXISTS tabla_maestra(
id int UNIQUE,
codigo_alta char(1),
nombre varchar(100),
empresa_apellido1 varchar(150),
apellido2 varchar(50),
tipo_via varchar(20),
nombre_via varchar(100),
numero_via varchar(50),
codigo_via char(5),
codigo_postal char(5),
nombre_poblacion varchar(100),
codigo_ine char(11),
nombre_provincia varchar(50),
telefono varchar(250) UNIQUE,
actividad varchar(100),
estado char(1),
codigo_operadora char(3)
);
CREATE TABLE IF NOT EXISTS tabla_actividades_empresas(
empresa_apellido1 varchar(150),
actividad varchar(100)
);
Here is the query I want to do:
UPDATE tabla_maestra tm
INNER JOIN tabla_actividades_empresas tae
ON (tm.nombre!='' AND tae.empresa_apellido1=tm.empresa_apellido1)
SET tm.actividad=tae.actividad;
This query takes too long, so before executing it I tried to measure how long this simpler query takes:
SELECT COUNT(*) FROM tabla_maestra tm
INNER JOIN tabla_actividades_empresas tae
ON (tm.nombre!='' AND tae.empresa_apellido1=tm.empresa_apellido1);
It is still taking too long, and I don't understand why. Here are the indexes I use:
CREATE INDEX cruce_nombre
USING HASH
ON tabla_maestra (nombre);
CREATE INDEX cruce_empresa_apellido1
USING HASH
ON tabla_maestra (empresa_apellido1);
CREATE INDEX index_actividades_empresas
USING HASH
ON tabla_actividades_empresas(empresa_apellido1);
If I use the EXPLAIN statement, these are the results:
http://oi59.tinypic.com/2zedoy0.jpg
I would be so grateful to receive any answer that could help me. Thanks a lot,
Dani.
A join involving half a million rows -- as your query plan shows -- is bound to take some time. The count(*) query is quicker because it doesn't need to read the tabla_maestra table itself, but it still needs to scan all the rows of index cruce_empresa_apellido1.
It might help some if you made index index_actividades_empresas a unique index (supposing that that's indeed appropriate) or if instead you drop that index and make column empresa_apellido1 a primary key of table tabla_actividades_empresas.
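For illustration, a minimal MySQL sketch of those two alternatives. It assumes every empresa_apellido1 value in tabla_actividades_empresas really is distinct (the statements fail otherwise), and option B additionally assumes the column contains no NULLs. Use one option or the other, not both:

```sql
-- Option A: rebuild the existing index as a unique index.
ALTER TABLE tabla_actividades_empresas
  DROP INDEX index_actividades_empresas,
  ADD UNIQUE INDEX index_actividades_empresas (empresa_apellido1);

-- Option B: drop the index and promote the column to primary key.
-- ALTER TABLE tabla_actividades_empresas
--   DROP INDEX index_actividades_empresas,
--   ADD PRIMARY KEY (empresa_apellido1);

-- Re-check the plan afterwards to see whether the optimizer's choice changed.
EXPLAIN
SELECT COUNT(*)
FROM tabla_maestra tm
INNER JOIN tabla_actividades_empresas tae
  ON tae.empresa_apellido1 = tm.empresa_apellido1
WHERE tm.nombre != '';
```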
If even that does not give you sufficient performance, then the only other thing I see to do is to give table tabla_actividades_empresas a synthetic primary key of integer type, and to change the corresponding column of tabla_maestra to match. That should help because comparing an integer to an integer is faster than comparing a string to a string, even when you can filter out (most) mismatches via a hash.
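A rough sketch of that change in MySQL, assuming tabla_actividades_empresas has no primary key yet. The column names id (on tabla_actividades_empresas), actividad_id, and the index name idx_actividad_id are hypothetical, chosen only for this example:

```sql
-- Give the lookup table a synthetic integer key (hypothetical column name: id).
ALTER TABLE tabla_actividades_empresas
  ADD COLUMN id INT NOT NULL AUTO_INCREMENT PRIMARY KEY;

-- Add a matching integer column to the big table (hypothetical name: actividad_id).
ALTER TABLE tabla_maestra
  ADD COLUMN actividad_id INT NULL,
  ADD INDEX idx_actividad_id (actividad_id);

-- One-time backfill. Note that this step still has to join on the string once;
-- only later joins benefit from the integer comparison.
UPDATE tabla_maestra tm
INNER JOIN tabla_actividades_empresas tae
  ON tae.empresa_apellido1 = tm.empresa_apellido1
SET tm.actividad_id = tae.id;
```

Since the backfill itself needs the string join, this only pays off if the two tables will be joined repeatedly afterwards.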
I agree with the others (see John Bollinger's answer, for example) about the lack of primary keys. A primary key is highly advisable for IDs. (I noticed you worry about repeated values, but a primary key handles that smoothly too; I mean MySQL's AUTO_INCREMENT.)
Why do you join on tabla_actividades_empresas.empresa_apellido1 instead of referencing tabla_maestra's ID?
If you did, you could define a foreign key for it, for example tabla_actividades_empresas.maestra_id.
Joins perform better when tables are associated through non-string types.
You can also filter tabla_maestra before joining. Note that MySQL does not allow a derived table (a subquery in the table list) to be the target of an UPDATE, so the filter goes in a WHERE clause instead. Here is an example:
UPDATE tabla_maestra AS tm
INNER JOIN tabla_actividades_empresas AS tae
ON tae.empresa_apellido1 = tm.empresa_apellido1
SET tm.actividad = tae.actividad
WHERE tm.nombre != '';
I have not tested it, but it seems like a sensible approach to try.
Also, do you need to update every row each time? If not, you can update only the missing ones. You can use a LEFT JOIN first to determine which rows need updating, then apply the UPDATE with the INNER JOIN to just those rows. Does that make sense? I'm no expert, but it may be useful to think about.
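A rough sketch of that idea, assuming rows that still need a value have actividad set to NULL (if the column holds empty strings instead, the conditions need adjusting):

```sql
-- Rows in tabla_maestra with no matching enterprise name at all:
-- these can never be filled by the join, so it helps to know how many there are.
SELECT COUNT(*)
FROM tabla_maestra tm
LEFT JOIN tabla_actividades_empresas tae
  ON tae.empresa_apellido1 = tm.empresa_apellido1
WHERE tm.nombre <> ''
  AND tae.empresa_apellido1 IS NULL;

-- Update only the rows that are still missing an activity.
UPDATE tabla_maestra tm
INNER JOIN tabla_actividades_empresas tae
  ON tae.empresa_apellido1 = tm.empresa_apellido1
SET tm.actividad = tae.actividad
WHERE tm.nombre <> ''
  AND tm.actividad IS NULL;
```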
EDIT
You may also test the multi-table form of the UPDATE:
UPDATE tabla_maestra AS main, tabla_actividades_empresas AS aggr
SET main.actividad = aggr.actividad
WHERE main.empresa_apellido1 = aggr.empresa_apellido1
AND main.nombre <> ''
Don't forget to try adjusting the relationship.
Thank you so much for your answers.
The fact is that 'tabla_maestra' contains information about enterprises, but it doesn't contain the values for the 'actividad' field (the activity of the enterprise). Moreover, the 'id' field is still empty (I will fill it in the future. It is difficult to explain why, but it has to be done this way).
I need to add the activity of each enterprise by joining with an auxiliary table, 'tabla_actividades_empresas', which contains the activity for each enterprise name. I only have to do it once. Afterwards I will be able to drop 'tabla_actividades_empresas' because I won't need it.
The only way to join them is by the field 'empresa_apellido1', that is to say, the name of the enterprise.
I have made the field 'tabla_actividades_empresas.empresa_apellido1' unique, but it doesn't improve the performance.
It doesn't make sense to define a foreign key on 'tabla_actividades_empresas' because the field 'empresa_apellido1' is unique only in 'tabla_actividades_empresas', not in 'tabla_maestra' (in that table an enterprise name can appear many times, because enterprises can have offices in different places). That is to say, 'tabla_actividades_empresas' doesn't contain repeated enterprises, but 'tabla_maestra' does.
By the way, what do you mean by "adjusting the relationship"? I have tried your subqueries with the EXPLAIN statement; they don't use the indexes correctly and the performance is worse.
I'm going to keep it brief here for convenience's sake. I'm new to SQL coding, so please excuse me if I say something weird.
I did not manage to find a solid solution to it (at least one that I would truly understand), which is precisely why I'm posting here as a last resort at this point.
The table code:
create table companies (
company_id mediumint not null auto_increment,
Name varchar(40) not null,
Address varchar(40),
FoundingDate date,
primary key (company_id)
);
create table employees (
Employee_id mediumint not null auto_increment,
Name varchar (40),
Surname varchar(40),
primary key (Employee_id)
);
create table accounts (
Account_id mediumint not null auto_increment,
Account_number varchar(10) not null,
CompanyID int(10),
Date_of_creation date,
NET_value int(30),
VAT int(3),
Total_value int(40),
EmployeeID int(10) not null,
Description varchar(40),
primary key (Account_number)
);
Table values are random strings and numbers until I figure this out.
My issue is that I'm stuck at forming correct SQL queries, namely:
Query all accounts with their designated companies. I need it to show 'NULL' value if an account has no associated company.
Query that can list all accounts whose date is less than 2018-03-16 or those without a date.
Query that will print the description of the 'Accounts' table in one column and the number of characters in that description in a different column.
Query that lists all employees whose names end with '-gh' and that have names greater than 5 characters in length.
Query that will list the top total sum amount.
Query that will list all accounts that have '02' in them (i.e. 3/02/05).
If you can answer at least one of these queries and if you can explain how you got to the solution in a simplistic manner, well... I'm afraid I have nothing to offer but honest gratitude! ^^'
Welcome to the community, but as Jerry commented, you should really try to show SOMETHING that you have tried, just to show what you THINK is needed. Also, don't just add comments to respond; edit your original post with additional details / data as people ask questions.
To try and move you forward, though, I will point out two specific links that should help you out. The first one covers the basics of querying, explaining the
select [fields] from [what table] join [other tables] where [what is your criteria] -- etc. Some Basics on querying
The next gives some very good clarification on JOIN conditions: (INNER) JOIN, which requires a matching record in BOTH tables being joined, as well as FULL OUTER JOINs, LEFT JOINs, etc.
After reviewing those, if you STILL have questions, please edit your original question, post some samples of what you THINK is working and let us know (or comment back to a specific answer), and we in the forum can follow up with you.
HINT: for your first query, which wants NULL, the answer is the LEFT JOIN shown in the visual link.
A visual representation and samples on querying
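To make the INNER JOIN versus LEFT JOIN distinction concrete, here is a sketch against the tables from the question. It is untested against your data and assumes accounts.CompanyID is meant to hold a companies.company_id value:

```sql
-- INNER JOIN: only accounts that have a matching company are returned.
SELECT a.Account_number, c.Name AS company_name
FROM accounts AS a
INNER JOIN companies AS c
  ON c.company_id = a.CompanyID;

-- LEFT JOIN: every account is returned; company_name is NULL
-- when no company matches (including when CompanyID itself is NULL).
SELECT a.Account_number, c.Name AS company_name
FROM accounts AS a
LEFT JOIN companies AS c
  ON c.company_id = a.CompanyID;
```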
We're developing a monitoring system. In our system, values are reported by agents running on different servers. The observations reported can be values like:
A numeric value, e.g. "CPU USAGE" = 55 (meaning 55% of the CPU is in use).
An event that was fired, e.g. "Backup completed".
A status, e.g. SQL Server is offline.
We want to store these observations (which are not known in advance and will be added dynamically to the system without recompiling).
We are considering adding different columns to the observations table like this:
IntMeasure -> INTEGER
FloatMeasure -> FLOAT
Status -> varchar(255)
So if the value we wish to store is a number, we can use IntMeasure or FloatMeasure according to the type. If the value is a status, we can store the status literal string (or a status id if we decide to add a Statuses(id, name) table).
We suppose a more correct design is possible, but it would probably become too slow and obscure due to joins and dynamic table names depending on types. How would a join work if we can't specify the tables in advance in the query?
I haven't done a formal study, but from my own experience I would guess that more than 80% of database design flaws are generated from designing with performance as the most important (if not only) consideration.
If a good design calls for multiple tables, create multiple tables. Don't automatically assume that joins are something to be avoided. They are rarely the true cause of performance problems.
The primary consideration, first and foremost in all stages of database design, is data integrity. "The answer may not always be correct, but we can get it to you very quickly" is not a goal any shop should be working toward. Once data integrity has been locked down, if performance ever becomes an issue, it can be addressed. Don't sacrifice data integrity, especially to solve problems that may not exist.
With that in mind, look at what you need. You have observations you need to store. These observations can vary in the number and types of attributes and can be things like the value of a measurement, the notification of an event and the change of a status, among others and with the possibility of future observations being added.
This would appear to fit into a standard "type/subtype" pattern, with the "Observation" entry being the type and each type or kind of observation being the subtype, and suggests some form of type indicator field such as:
create table Observations(
...,
ObservationKind char( 1 ) check( ObservationKind in( 'M', 'E', 'S' )),
...
);
But hardcoding a list like this in a check constraint has a very low maintainability level. It becomes part of the schema and can be altered only with DDL statements. Not something your DBA is going to look forward to.
So have the kinds of observations in their own lookup table:
ID Name Meaning
== =========== =======
M Measurement The value of some system metric (CPU_Usage).
E Event An event has been detected.
S Status A change in a status has been detected.
(The char field could just as well be int or smallint. I use char here for illustration.)
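As a sketch, the lookup table might be declared and populated like this. The column names follow the listing above; the column lengths and the unique constraint on Name are my assumptions:

```sql
create table ObservationKinds(
    ID      char( 1 )      primary key,
    Name    varchar( 32 )  not null unique,
    Meaning varchar( 255 ) not null
);

-- Adding a new kind later is an ordinary insert, not a schema change.
insert into ObservationKinds( ID, Name, Meaning )
values ( 'M', 'Measurement', 'The value of some system metric (CPU_Usage).' ),
       ( 'E', 'Event',       'An event has been detected.' ),
       ( 'S', 'Status',      'A change in a status has been detected.' );
```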
Then fill out the Observations table with a PK and the attributes that would be common to all observations.
create table Observations(
ID int identity primary key,
ObservationKind char( 1 ) not null,
DateEntered date not null,
...,
constraint FK_ObservationKind foreign key( ObservationKind )
references ObservationKinds( ID ),
constraint UQ_ObservationIDKind unique( ID, ObservationKind )
);
It may seem strange to create a unique index on the combination of Kind field and the PK, which is unique all by itself, but bear with me a moment.
Now each kind or subtype gets its own table. Note that each kind of observation gets a table, not the data type.
create table Measurements(
ID int not null,
ObservationKind char( 1 ) check( ObservationKind = 'M' ),
Name varchar( 32 ) not null, -- Such as "CPU Usage"
Value double not null, -- such as 55.00
..., -- other attributes of Measurement observations
constraint PK_Measurements primary key( ID, ObservationKind ),
constraint FK_Measurements_Observations foreign key( ID, ObservationKind )
references Observations( ID, ObservationKind )
);
The first two fields will be the same for the other kinds of observations except the check constraint will force the value to the appropriate kind. The other fields may differ in number, name and data type.
Let's examine an example tuple that may exist in the Measurements table:
ID ObservationKind Name Value ...
==== =============== ========= =====
1001 M CPU Usage 55.0 ...
In order for this tuple to exist in this table, a matching entry must first exist in the Observations table with an ID value of 1001 and an observation kind value of 'M'. No other entry with an ID value of 1001 can exist in either the Observations table or the Measurements table and cannot exist at all in any other of the "kind" tables (Events, Status). This works the same way for all the kind tables.
I would further recommend creating a view for each kind of observation which will provide a join of each kind with the main observation table:
create view MeasurementObservations as
select ...
from Observations o
join Measurements m
on m.ID = o.ID;
Any code that works solely with measurements would need to only hit this view instead of the underlying tables. Using views to create a wall of abstraction between the application code and the raw data greatly enhances the maintainability of the database.
Now the creation of another kind of observation, such as "Error", involves a simple Insert statement to the ObservationKinds table:
F Fault A fault or error has been detected.
Of course, you need to create a new table and view for these error observations, but doing so will have no impact on existing tables, views or application code (except, of course, to write the new code to work with the new observations).
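Putting those steps together, adding the new kind might look roughly like this. The Faults table and FaultObservations view follow the same pattern as Measurements; the Message column is a hypothetical attribute I made up for the example:

```sql
-- 1. Register the new kind in the lookup table.
insert into ObservationKinds( ID, Name, Meaning )
values( 'F', 'Fault', 'A fault or error has been detected.' );

-- 2. Create the subtype table; the check pins every row to kind 'F'.
create table Faults(
    ID              int not null,
    ObservationKind char( 1 ) not null check( ObservationKind = 'F' ),
    Message         varchar( 255 ) not null, -- hypothetical attribute
    constraint PK_Faults primary key( ID, ObservationKind ),
    constraint FK_Faults_Observations foreign key( ID, ObservationKind )
        references Observations( ID, ObservationKind )
);

-- 3. Create the view that application code will query.
create view FaultObservations as
select o.ID, o.DateEntered, f.Message
from   Observations o
join   Faults f
    on f.ID = o.ID
   and f.ObservationKind = o.ObservationKind;
```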
Just create it as a VARCHAR
This will allow you to store whatever data you require in it. However, it is much more difficult to run queries based on the number in the field, such as
Select * from table where MyVARCHARField > 50 -- get CPU > 50
However, if you think you want to do this, then you need either a field per item or a generalised table such as
Create Table ObservationValues ( -- table name and lengths are illustrative
Description varchar(255),
ValueType varchar(10), -- can be 'String', 'Float' or 'Int'
ValueString varchar(255),
ValueFloat float,
ValueInt int
);
Then when you are filling in the data, you can put your value in the correct field and select like this:
Select Description, ValueInt from ObservationValues where Description like '%cpu%' and ValueInt > 50
I used two columns for a similar problem. The first column held the data type and the second held the data as a VARCHAR.
The first column held codes (e.g. 1 = integer, 2 = string, 3 = date and so on), which could be used in conditions to compare values (e.g. find the max integer where type = 1).
I did not have joins, but I think you can use this approach. It will also help you if more data types are introduced later.
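A small sketch of what I mean, in MySQL syntax. The table and column names (observation_values, value_type, data_value) and the sample rows are invented for this example:

```sql
-- value_type codes: 1 = integer, 2 = string, 3 = date.
create table observation_values (
    name       varchar(100) not null,
    value_type tinyint      not null,
    data_value varchar(255) not null
);

insert into observation_values (name, value_type, data_value) values
    ('CPU USAGE',  1, '55'),
    ('CPU USAGE',  1, '70'),
    ('SQL Server', 2, 'offline');

-- Largest integer observation: restrict to the integer type code,
-- then cast the stored text to a number before comparing.
select max(cast(data_value as signed)) as max_int
from observation_values
where value_type = 1;
```

The cast is what makes the comparison numeric; without it the values would be compared as strings.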
Table 1
create table itemType1(
maincatg varchar(25),
subcatg varchar(25),
price float(5),
primary key(maincatg, subcatg)
);
Table 2
create table itemType2(
maincatg varchar(25),
subcatg varchar(25),
price float(5),
primary key(maincatg, subcatg)
);
It sounds like you're looking for a UNION rather than a JOIN.
(SELECT maincatg, subcatg, price
FROM itemType1)
UNION ALL
(SELECT maincatg, subcatg, price
FROM itemType2)
This is effectively a concatenation of the two tables.
Beyond that, I'm really not sure what you're hoping to accomplish.
This kind of table is going to be much easier to work with.
create table item(
maincatg varchar(25),
subcatg varchar(25),
price float(5),
type tinyint(1),
primary key(maincatg, subcatg)
);
It also looks like you're going to have a lot of duplicate data in your maincatg and subcatg columns. If I was doing it I would have those in another table or possibly two. You join them with ids.
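As a rough sketch of that layout in MySQL: all table and column names here are made up for illustration, and the primary key on items assumes each sub-category has at most one price per type:

```sql
create table main_categories (
    id   int not null auto_increment,
    name varchar(25) not null,
    primary key (id),
    unique (name)
);

create table sub_categories (
    id               int not null auto_increment,
    main_category_id int not null,
    name             varchar(25) not null,
    primary key (id),
    unique (main_category_id, name),
    foreign key (main_category_id) references main_categories (id)
);

create table items (
    sub_category_id int not null,
    type            tinyint(1) not null,
    price           float(5),
    primary key (sub_category_id, type),
    foreign key (sub_category_id) references sub_categories (id)
);

-- Reassemble the original column layout by joining on the ids.
select mc.name as maincatg, sc.name as subcatg, i.type, i.price
from items i
join sub_categories sc on sc.id = i.sub_category_id
join main_categories mc on mc.id = sc.main_category_id;
```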
You will probably have an easier time if you read up on database normalization. It's an easy concept to brush up on as long as you find a decent tutorial and there are plenty out there.
If you're stuck using a database that you can't change for whatever reason, you should probably edit your question to say so, or else most answers will probably be similar to mine.
Sorry, I'm not sure the question title reflects the real question, but here goes:
I'm designing a system which has a standard orders table, but with additional previous and next columns.
The question is which approach to foreign keys is better.
Here I have a basic table with the columns previous and next, which are self-referencing foreign keys. The problem with this table is that the first placed order doesn't have previous and next values, so they are left empty. If I have, say, 10,000 records and 30% of them have those columns empty, that's 3,000 rows, which is quite a lot I think, and I expect the numbers to grow. In, let's say, a year's time it could come to 30,000 rows with empty columns, and I am not sure if that's OK.
The solution I've come up with is to pair the main table with 2 other tables which have foreign keys to it. In this case those 2 additional tables are identifying tables and nothing more, and there are no longer rows with empty columns.
So the question is which solution is better when considering query speed, table optimization, and common good practices, or maybe there's an even better one that I don't know of? (P.S. I am using MySQL with the InnoDB engine.)
If your aim is to do order sets, you could simply add a new table for that, and just have a single column as a foreign key to that table in the order table.
The orders could also include a rank column to indicate in which order orders belonging to the same set come.
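create table order_sets (
id int not null auto_increment,
-- customer related data, etc...
primary key(id)
);
create table orders (
id int not null auto_increment,
name varchar(100), -- length is illustrative
quantity int,
set_id int not null, -- the order set this order belongs to
set_rank int, -- position of the order within its set
primary key(id),
foreign key (set_id) references order_sets(id)
);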
Then inserting a new order means updating the rank of all other orders which come after in the same set, if any.
Likewise, for grouping queries, things are way easier than having to follow prev and next links. I'm pretty sure you will need these queries, and performance will be much better that way.
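To illustrate, inserting an order into the middle of a set and then reading the set back in order could look roughly like this. The set id 7, the rank 2 and the order values are made-up examples:

```sql
START TRANSACTION;

-- Make room: shift every order at or after the target rank down by one.
UPDATE orders
SET set_rank = set_rank + 1
WHERE set_id = 7
  AND set_rank >= 2;

-- Insert the new order at the freed rank.
INSERT INTO orders (name, quantity, set_id, set_rank)
VALUES ('widget', 3, 7, 2);

COMMIT;

-- Read a whole set in sequence with a single query, no link following needed.
SELECT id, name, quantity, set_rank
FROM orders
WHERE set_id = 7
ORDER BY set_rank;
```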
CREATE TABLE college
(
id SERIAL PRIMARY KEY,
SCHOOL VARCHAR(100),
CColor VARCHAR(100),
CCmascot VARCHAR(100)
);
CREATE TABLE mats
(
id SERIAL PRIMARY KEY,
CColor VARCHAR(100),
CCNAME VARCHAR(100)
);
MYSQL
OK, so here is the problem. I think it's pretty simple, but I am not getting it right. I have the SCHOOL name passed to me through the URL and I use $_GET to get the college name. Now I need to query:
Using the SCHOOL name, I need to get the CColor and the CCNAME.
Your question is unclear, so an answer can only be approximated.
You need columns in both tables that can be used to join them, that is, columns whose values identify when a record in the parent table (college) matches a record in the child table (mats). Ideally you would have a foreign key in the child table mats, which could be named college_id (this uses a naming convention that references the parent table).
Given a foreign key like the one mentioned above, your query would become
select
college.ccolor
from
college inner join mats
on college.id = mats.college_id
where
mats.ccname = "<<COLOUR_DESIRED>>";
assuming ccname is the name of ccolor.
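For reference, adding such a foreign key in MySQL might look like the sketch below. It assumes both tables use InnoDB; college_id and fk_mats_college are names I am proposing, not something already in your schema, and 'Example College' is a placeholder value:

```sql
-- SERIAL in MySQL is shorthand for BIGINT UNSIGNED NOT NULL AUTO_INCREMENT UNIQUE,
-- so the referencing column has to be BIGINT UNSIGNED as well.
ALTER TABLE mats
  ADD COLUMN college_id BIGINT UNSIGNED NULL,
  ADD CONSTRAINT fk_mats_college
    FOREIGN KEY (college_id) REFERENCES college (id);

-- Once college_id is populated, look up the colour data by school name.
SELECT mats.CColor, mats.CCNAME
FROM college
INNER JOIN mats
  ON mats.college_id = college.id
WHERE college.SCHOOL = 'Example College';
```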
You have the college name and you wish to find out the colour name, if I understand correctly.
The linking attribute is CColor.
Your query should look a little bit like this:
select
m.ccname, m.ccolor
from
mats m
inner join
college c
on
c.ccolor = m.ccolor
where
c.school = ? -- bind the school name from your application here
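Since the school name arrives from the URL, it is safer to bind it as a parameter than to paste it into the SQL string. A sketch using MySQL's server-side prepared statement syntax; 'Example College' is a made-up value, and in practice your application's database driver would normally do the binding for you:

```sql
PREPARE school_lookup FROM
  'SELECT m.ccname, m.ccolor
   FROM mats m
   INNER JOIN college c ON c.ccolor = m.ccolor
   WHERE c.school = ?';

-- The user variable stands in for the value received from the application.
SET @school = 'Example College';
EXECUTE school_lookup USING @school;

DEALLOCATE PREPARE school_lookup;
```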
Database Tip of the Day: Use Foreign Key constraints, or you will have data corruption problems, and people on SO will have no idea how your columns relate to each other.
When you know the whys and the whatfors of relational modeling, you might find it necessary to go without them (although it's not recommended unless you have a really good reason), but for now, use them to explicitly define how the tables relate to each other.
Otherwise your question is kind of like asking a chef, "I have some unlabeled jars of food and what I think is oregano. How do I cook a romantic dinner for two?" (Umm.. what's in the jars??)
Foreign key documentation: http://dev.mysql.com/doc/refman/5.1/en/ansi-diff-foreign-keys.html
Join documentation: http://dev.mysql.com/doc/refman/5.1/en/join.html
SELECT college.CColor FROM college
INNER JOIN mats ON college.CColor = mats.CColor
AND mats.CColor = 'your query'