I would like to make sure my table layout is production-safe. Maybe you could give me some advice about the design. My table looks like this:
AI_Index | PartName  | String1   | String2   | ... String5 | TINYINT1 | TINYINT2 | TINYINT3
1        | Example L | somestuff | morestuff | ...         | 0        | 1        | 0
2        | Example X | morestuff | andmore   | ...         | 1        | 1        | 1
For clarification:
AI_Index is an auto-incrementing key for each row added. PartName represents a filename. All the other strings and tinyints describe the value in String1. PartName will almost never be unique; each name will appear about 5-10 times. A first estimate puts it at about 1,000 different parts, so the table will have roughly 5,000-10,000 rows.
I'm connecting to the DB with VB.NET and the MySQL connector. When someone opens a part in SolidWorks, a query checks whether the active document is present in the PartName column. If so, a UserForm shows up, displaying all values from column String1 together with the values from String2, String3, String4, String5, Tinyint1, Tinyint2 and Tinyint3. There are about 40 people working with SolidWorks, changing active parts frequently. That means roughly 500 queries a minute just checking whether the part is present.
My questions are as follows:
Does it make sense to add an index on PartName? I have read many times that a bad index decision can make the database slower.
What would an efficient query look like? My idea is that if I create a view with SELECT DISTINCT PartName, the query for the active part will be faster. Is this right?
Does it make sense to create a MySQL function that returns a TINYINT indicating whether the active document is present in PartName? Which would be faster, the view or the function?
Could you just:
SELECT String1, String2, String3, String4, String5,
       Tinyint1, Tinyint2, Tinyint3
FROM parts -- the question doesn't name the table; substitute yours
WHERE PartName = 'Example';
If the result set is empty, there is no need to show the form; otherwise you already have the data at hand to display. One query, one result set.
DISTINCT will only make it slower. You will have to do some stress testing to see whether your server can handle 500 queries per minute, as many factors besides the index and the query come into play.
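As for the index question: a point lookup like the one above is exactly what an index is for, and at 5,000-10,000 rows it is very unlikely to hurt. A minimal sketch, again assuming the table is called parts:

CREATE INDEX idx_partname ON parts (PartName);

With that in place the SELECT becomes an index lookup instead of a full table scan, which will matter far more than any view or stored function layered on top.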
I have to use this for a project at work and am running into some trouble. I have a large database (58 million rows) that I have figured out how to query down to what I want, writing the resulting row into a separate table. Here is my code so far:
INSERT INTO emissionfactors (pollutantID, fuelTypeID, sourceTypeID, emissionFactor)
SELECT pollutantID, fuelTypeID, sourceTypeID, AVG(ratePerDistance) AS emissionFactor
FROM onroad_run_1.rateperdistance
WHERE pollutantID = 45
  AND fuelTypeID = 2
  AND sourceTypeID = 32;
I have about 60 different pollutant IDs, and currently I am manually changing the pollutantID number in the WHERE clause and executing the script to write the row into my emissionfactors table. Each run takes 45 seconds, and I have several other fuel types and source types to do, so this could take about 8 hours of clicking every 45 seconds. I have some training in MATLAB and thought I could put a while loop around the above code, create an index, and have it loop through pollutant IDs 1 to 184, but I can't seem to get it to work.
Here are my goals:
- loop the pollutantID from 1 to 184.
-- not all integers in this range exist, so if a number is not found in the pollutantID column, simply move on to the next one
-- if the number is found in the pollutantID column, execute my above code to write the data into my other table
You do not need a while loop. All you need is to change your WHERE clause to use BETWEEN, and to tell the query what to base the average on by adding a GROUP BY clause:
INSERT INTO emissionfactors (pollutantID, fuelTypeID, sourceTypeID, emissionFactor)
SELECT pollutantID, fuelTypeID, sourceTypeID, AVG(ratePerDistance) AS emissionFactor
FROM onroad_run_1.rateperdistance
WHERE pollutantID BETWEEN 1 AND 184
  AND fuelTypeID = 2
  AND sourceTypeID = 32
GROUP BY pollutantID, fuelTypeID, sourceTypeID;
If in fact you want the entire range of pollutantID, fuelTypeID and sourceTypeID values that exist, you can remove the WHERE clause altogether.
INSERT INTO emissionfactors (pollutantID, fuelTypeID, sourceTypeID, emissionFactor)
SELECT pollutantID, fuelTypeID, sourceTypeID, AVG(ratePerDistance) AS emissionFactor
FROM onroad_run_1.rateperdistance
GROUP BY pollutantID, fuelTypeID, sourceTypeID;
You also don't need to check whether a pollutantID exists before executing the query: if the SELECT returns no rows for it, nothing is inserted for it.
As to the speed issue, you will need to look at adding some indexes to your table to improve performance. In this case a composite index covering pollutantID, fuelTypeID and sourceTypeID would speed things up greatly.
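For illustration, such an index could look like this (the index name is just a placeholder):

CREATE INDEX idx_rate_lookup
    ON onroad_run_1.rateperdistance (pollutantID, fuelTypeID, sourceTypeID);

Building the index on 58 million rows will itself take a while, but it only has to be done once; afterwards the grouped query can read the matching rows through the index instead of scanning the whole table.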
My advice: ask for help at work. It is better to admit early that you do not know how to do something and to get proper help. You also mention that there are other fuel types you want, but the details of that are missing from your question.
I have a table with four columns: the id of a material, the family it belongs to (there is some kind of grouping), the type of test, and the result of that test as a number (1-4). The type of test points to a table with a list of tests. I am trying to create a query (or a view, if that is more appropriate) whose result has one row per family of materials and, for each test, a column showing the average of that test over all the materials in the family...
Since it's a small project and I know the tests (I don't think they are going to change, but I'm not sure), I am thinking I could just add the tests (there are 6 of them) as columns in the table instead of keeping them as data, but it doesn't feel right to me. Also, there is a chance that a test will be added in the future. At the same time, adding a column would not be hard, and I could change the averaging code to disregard a sentinel value, so I could tell apart the rows from before a test was added.
So how would I go about doing this, and is the way I am doing it a good idea?
For now, my idea is to write a SELECT statement for each family/test pair and then somehow create a view (that is a virtual table, right?) from the results of those queries.
So if the table is
test_result
family_id | material_id | test_id | result
The query would be
SELECT AVG(result) AS test_avg
FROM test_result
WHERE family_id = 'family_id'
  AND test_id = 'test_id';
But I am not sure how to proceed, or whether there is a better way than running this six times and somehow combining the results.
I am not sure how much you know about database design.
The question you are asking, about having 6 columns for test_id in one table versus having another table with family_id and test_id as the primary key (unique identifier), is a fundamental one about database design. It has to do with first normal form. You can study up on first normal form, and on data normalization generally, if you choose to.
Here is an oversimplified version, for this case.
There are two big problems with the six columns in one table approach.
The first is this: what happens when they change their minds and add a seventh test? If that never happens, everything is fine. But if it does, you have to alter the table by adding another column, and you have to alter every query that references the table. If that's only one query in your case, you can manage it. In cases where there are hundreds of queries referencing the table, some of them inside application programs that need a maintenance cycle to revise, this can be a nightmare. That is why database tutorials are full of material that you may not need to learn if this small project is the only one you ever do.
The second is this: what happens when you have to write a query that finds every occurrence of test_id = 4, regardless of which of the six columns the value is stored in? You will have to write a query with five OR operators in the WHERE clause. This is tedious, error-prone, and runs slowly. Again, this may never be a problem.
The generally better approach is to create a third table with family_id and test_id as columns, and maybe result as a third column (I'm not sure what material_id is... is there a material table?).
The first table, families, has the family_id and any data that only depends on the family, like family_name.
The second table, tests, has the test_id and any data that only depends on the test, like test_name.
And the third table contains data that depends on both.
You then write a view that joins all three tables together, to make the data look the way you want to use it.
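To make that concrete, here is a minimal sketch of such a view. The names families, tests, family_name and test_name are assumptions based on the description above; test_result is the table from the question. Adjust everything to your actual schema:

CREATE VIEW family_test_averages AS
SELECT f.family_id,
       f.family_name,
       t.test_id,
       t.test_name,
       AVG(r.result) AS avg_result
FROM families f
JOIN test_result r ON r.family_id = f.family_id
JOIN tests t ON t.test_id = r.test_id
GROUP BY f.family_id, f.family_name, t.test_id, t.test_name;

This yields one row per family/test pair. If you really want one column per test, you can pivot this view with conditional aggregation; the point is that adding a seventh test then requires no table changes at all.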
I apologize if this covers a lot of concepts you already know. Again, I couldn't tell from your question.
I have a table of > 250k rows of 'names' (and ancillary info) which I am displaying using jQuery DataTables.
My users can choose any 'name' (row), which is then flagged as 'taken' (and timestamped).
A (very) cut down version of the table is:
Key, Name, Taken, Timestamp
I would like to display the 'taken' rows first (in timestamp order), followed by the untaken rows in key order [ASC].
The problem would be simple, but because of size constraints (both the visual UI and the data set size) my display mechanism paginates: 10 / 20 / 50 / 100 rows (user's choice).
This means that a) the total number of 'taken' rows will vary and b) the pagination length varies.
Thus I can see no obvious method of keeping track of the pagination.
(My DataTable tells me the index of the start record and the number of displayed records.)
My SQL (MySQL) at this level is weak, and I have no idea how to return a record set that accounts for the 'taken' offset without some kind of new (or internal MySQL) numeric index to paginate against.
I thought of:
- Creating a temporary table with the key and a new numeric index on each pagination.
- Creating a trigger that re-orders the table when a row is 'taken'.
- Having a "running order" column that is updated on each new 'taken'.
- Some sort of cursor-based procedure (at this point my hair was ruffled, as the explanations shot straight over the top of my head!).
All seem excessive.
I also thought of doing a lot of manipulation in PHP (involving separate queries, dependent on the pagination size and the number of names already taken, and keeping a running record of the pagination position).
To the human computer (the brain) the problem is untaxing, but translating it into SQL has foxed me, as has coming up with a fast alternative to options 1-3 (the test case for the "running order" solution took almost three minutes to complete!).
It feels like there should be a smart SQL answer to this, but all my efforts with ORDER BY, LIMIT, and the like fall over unless I return the whole dataset and do a lot of nasty counting.
Is there a big elephant in the room I am missing, or am I stuck with the hard slog to get what I need?
A query that displays the 'taken' rows (in timestamp order) first and then the untaken records in their key order [ASC] next:
SELECT *
FROM `table_name`
ORDER BY `taken` DESC, IF(`taken` = 1, `Timestamp`, `Key`) ASC
LIMIT 50, 10
The LIMIT values: 10 is the page size, and 50 is the offset of the first row on page 6 (for a 1-indexed page p and page size s, the offset is (p - 1) * s).
Adjust the condition in IF(`taken` = 1, `Timestamp`, `Key`) to match the values you actually store in the `taken` column; I assumed you store 1 when the row is 'taken' and 0 otherwise.
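One hedged variant, not a required fix: IF() returns a single type, so mixing a timestamp and an integer key in one expression makes MySQL coerce both branches to a common type, which can occasionally produce surprising orderings. Splitting the sort keys avoids the coercion:

SELECT *
FROM `table_name`
ORDER BY `taken` DESC,
         CASE WHEN `taken` = 1 THEN `Timestamp` END ASC, -- NULL for untaken rows
         `Key` ASC
LIMIT 50, 10;

For taken rows the CASE supplies the timestamp and `Key` merely breaks ties; for untaken rows the CASE is NULL across the board, so `Key` alone decides the order.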
A simplified version of my MySQL db looks like this:
Table books (ENGINE=MyISAM)
id <- KEY
publisher <- LONGTEXT
publisher_id <- INT <- This is a new field that is currently null for all records
Table publishers (ENGINE=MyISAM)
id <- KEY
name <- LONGTEXT
Currently books.publisher holds values that are repeated across many rows, while publishers.name holds each of those values exactly once.
I want to get rid of books.publisher and instead populate the books.publisher_id field.
The straightforward SQL code that describes what I want done is as follows:
UPDATE books
JOIN publishers ON books.publisher = publishers.name
SET books.publisher_id = publishers.id;
The problem is that I have a large number of records, and even though it works, it is taking forever.
Is there a faster solution than creating an index like this in advance?
CREATE INDEX publisher ON books (publisher(20));
Your question title says ".. optimize ... query without using an index?"
What have you got against using an index?
You should always examine the execution plan if a query is running slowly. I would guess it's having to scan the publishers table for each row in order to find a match. It would make sense to have an index on publishers.name to speed the lookup of an id.
You can drop the index later, but it wouldn't harm to leave it in, since you say the process will have to run for a while until other changes are made. I imagine the publishers table doesn't get updated very frequently, so the performance of INSERT and UPDATE on that table should not be an issue.
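One wrinkle: since publishers.name is a LONGTEXT column, MySQL requires a prefix length on any index over it, just like the one the question proposes for books.publisher. A sketch, where the length 64 is an assumption (pick something longer than your longest realistic publisher name):

CREATE INDEX idx_publisher_name ON publishers (name(64));

Re-check the execution plan with EXPLAIN afterwards; converting the column to VARCHAR, as the other answer suggests, removes the need for a prefix altogether.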
There are a few problems here that might be helped by optimization.
First of all, a few thousand rows doesn't count as "big" ... that's "medium."
Second, in MySQL saying "I want to do this without indexes" is like saying "I want to drive my car to New York City, but my tires are flat and I don't want to pump them up. What's the best route to New York if I'm driving on my rims?"
Third, you're using a LONGTEXT column for your publisher. Is there some reason not to use a fully indexable datatype like VARCHAR(200)? If you do that, your WHERE clause will run faster, index or none. Large-scale library catalog systems limit the length of the publisher field, so your system can too.
Fourth, from one of your comments this looks like a routine data-maintenance update, not a one-time conversion. So you need to figure out how to avoid repeating the whole job over and over. I am guessing here, but it looks like newly inserted rows in your books table have a publisher_id of NULL or zero, and your query updates that column to a valid value.
So here's what to do. First, put an index on books.publisher_id.
Second, run this variant of your maintenance query:
UPDATE books
JOIN publishers ON books.publisher = publishers.name
SET books.publisher_id = publishers.id
WHERE books.id IN (
    SELECT id FROM (SELECT id FROM books
                    WHERE publisher_id IS NULL OR publisher_id = 0
                    LIMIT 100) AS batch
);
This limits the update to rows that haven't yet been updated, 100 at a time. (MySQL doesn't allow LIMIT directly on a multi-table UPDATE, hence the derived-table wrapper.) In your weekly data-maintenance job, re-issue this query until MySQL reports that it affected zero rows (look at mysqli::$affected_rows or the equivalent in your PHP-to-MySQL interface). That's a great way to monitor database update progress and keep your update operations from getting out of hand.
Your UPDATE ... JOIN syntax is actually valid in MySQL, although it won't run in some other databases. The way to get it to run faster is to add a WHERE clause so that you are only updating the records that need it.
I am working on a side project that is quite an undertaking; my question concerns the efficiency gained by using a BOOLEAN value to decide whether further data processing is required.
For example, say I had a table listing all creatures, and a related table listing each creature's hibernation period and the calories it consumes per day during hibernation.
Would it be efficient to have a BOOLEAN "hibernates" column inside the creatures table?
If it is true, then go to the "hibernation_creature_info_relations" table, find the creature with that ID, and return that information.
The idea is that for every creature whose "hibernates" value is false, SQL is spared from searching through the large "hibernation_creature_info_relations" table.
Or is a lookup by ID in "hibernation_creature_info_relations" so fast that branching on the true/false value would actually cost more performance than it saves?
I hope this was enough information to help you understand what I am asking, if not please let me know so I can rephrase or include more details.
No, that is not a good way to do things.
Use a normal field that can be null instead.
Example
table creatures
---------------
id name info_id
1 dino null
2 dog 1
3 cat 2
table info
--------------
id info_text
1 dogs bark
2 cats miauw
Now you can just do a join:
SELECT c.name, i.info_text
FROM creatures c
LEFT JOIN info i ON (c.info_id = i.id);
If you do it like this, SQL can use an index.
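And if you only want the creatures that actually have extra info (the hibernating ones, in your original framing), an INNER JOIN does that filtering for you, with no boolean column needed; a minimal sketch against the example tables above:

SELECT c.name, i.info_text
FROM creatures c
JOIN info i ON c.info_id = i.id; -- inner join: creatures with a NULL info_id drop out

The NULL in info_id already plays the role of hibernates = false, so a separate flag would only duplicate information the foreign key carries.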
No SQL database will make good use of an index on a boolean field.
The cardinality of such a field is too low, and using indexes on low-cardinality fields slows things down instead of speeding things up.
See: MySQL: low cardinality/selectivity columns = how to index?
If you want to use the "hibernates" column only to prevent SQL from having to search through the other table, then you should follow @Johan's approach. Otherwise you can create an index on "hibernates", which may improve execution time, but keep in mind what @Johan is telling you about low cardinality.