I was reading the Django Book and came across an interesting statement.
Notice that Django doesn’t use SELECT * when looking up data and instead lists
all fields explicitly. This is by design:
in certain circumstances SELECT * can be slower,
I got this from http://www.djangobook.com/en/1.0/chapter05/
So my question is: can someone explain to me why SELECT * can be slower than listing every single column explicitly? It would be good if you could give me some examples.
Or, if you think the opposite (that it doesn't matter), can you explain why?
Update:
That's the table:
BEGIN;
CREATE TABLE "books_publisher" (
"id" serial NOT NULL PRIMARY KEY,
"name" varchar(30) NOT NULL,
"address" varchar(50) NOT NULL,
"city" varchar(60) NOT NULL,
"state_province" varchar(30) NOT NULL,
"country" varchar(50) NOT NULL,
"website" varchar(200) NOT NULL
);
And this is how Django will query it instead of SELECT * FROM books_publisher:
SELECT
id, name, address, city, state_province, country, website
FROM books_publisher;
Performance (this will matter only if you are selecting fewer columns than there are in the table).
I am not sure how Django works, but in some languages/DB drivers "select *" will cause an error if you change the table schema (say, add a new column). This is because the DB driver "caches" the table schema, and its internal schema no longer matches the actual table schema.
If you have 100 columns, SELECT * will return the data for all columns. Listing the columns explicitly will reduce the columns returned, therefore reducing the amount of data transmitted between the server and application.
This is clearly not faster in many cases, and when one of them is faster, it is by a slight margin: check for yourself by benchmarking a lot of queries :)
It might be faster to select only some columns in some cases, for instance when you select only columns that are part of a combined index (avoiding the need to read the whole row), or when you avoid accessing BLOB or TEXT columns on MySQL.
And naturally, if you select fewer columns you will transfer less data between MySQL and your application.
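To illustrate the combined-index ("covering index") case mentioned above, here is a minimal sketch; the orders table and its index are made up for the example, not taken from the question:
CREATE TABLE orders (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  customer_id INT UNSIGNED NOT NULL,
  status VARCHAR(20) NOT NULL,
  notes TEXT,
  INDEX idx_customer_status (customer_id, status)
) ENGINE=InnoDB;

-- This query can be answered from idx_customer_status alone (a covering index),
-- so the full rows (including the TEXT column) never have to be read:
SELECT customer_id, status FROM orders WHERE customer_id = 42;

-- SELECT * forces MySQL to look up every matching row in order to fetch notes as well:
SELECT * FROM orders WHERE customer_id = 42;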
I think in this exact case there will be no performance difference; that is exactly what "in certain circumstances SELECT * can be slower" is all about.
How would you accomplish this task to get the best performance?
Table schema:
CREATE TABLE `test_truck_report` (
`id` INT(11) NOT NULL AUTO_INCREMENT,
`truck_id` INT(11) NOT NULL,
`odometer_initial` INT(11) NOT NULL,
`odometer_final` INT(11) NOT NULL,
`fuel_initial` INT(11) NOT NULL,
`fuel_final` INT(11) NOT NULL,
PRIMARY KEY (`id`)
)
ENGINE=InnoDB;
What I'm trying to execute is this query:
SELECT
truck_id,
(odometer_final - odometer_initial) AS mileage,
(fuel_initial - fuel_final) AS consumed_fuel,
(consumed_fuel / mileage) AS consumption_per_km
FROM
test_truck_report
WHERE
consumption_per_km > 2
Somehow this logic, which seems obvious at first sight, doesn't work, and I'm forced to use this query instead:
SELECT
truck_id,
(odometer_final - odometer_initial) AS mileage,
(fuel_initial - fuel_final) AS consumed_fuel,
((fuel_initial - fuel_final) / (odometer_final - odometer_initial)) AS consumption_per_km
FROM
test_truck_report
WHERE
((fuel_initial - fuel_final) / (odometer_final - odometer_initial)) > 2
I assume that recalculating each calculated field every time it has to appear causes a significant performance hit. And this is just a test case; the actual working table has 50+ fields, and some of the calculated fields consist of 10+ operands. So it's a really HUGE problem at the moment.
The reason why I don't want to actually create these fields and run something like:
UPDATE
`test_truck_report`
SET
consumed_fuel = fuel_initial - fuel_final
is that existing records are constantly being updated by the users, and in that case I would need to constantly update that data as well.
So do you consider creating actual fields a better idea? Or is there some better way?
Thanks.
Try using views. (As a side note, the first query fails because MySQL doesn't let you reference a column alias in the WHERE clause, nor use one alias inside another expression in the same SELECT list.)
We need an auxiliary view:
CREATE OR REPLACE VIEW vw_truck_data AS
SELECT truck_id,
(odometer_final - odometer_initial) AS mileage,
(fuel_initial - fuel_final) AS consumed_fuel
FROM test_truck_report;
And the final view:
CREATE OR REPLACE VIEW vw_truck_consumption AS
SELECT data.*,
(data.consumed_fuel / data.mileage) AS consumption_per_km
FROM vw_truck_data data;
Now you can query whenever you want in an easy and readable way:
SELECT *
FROM vw_truck_consumption
WHERE consumption_per_km > 2
This way MySQL should only have to subtract each field once, so the performance should be at least as good as your solution, or better. Normally the CPU cost of computing extra fields is smaller than the cost of retrieving data from the database, but of course it depends on your hardware, MySQL version, configuration and data distribution. Do some measurements if it is really an issue.
Anyway, remember that you are filtering by consumption_per_km, which is a function of other fields. As MySQL seems to lack functional indexes, it will surely scan the full table and be slow.
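If you are on MySQL 5.7 or later, another option worth mentioning (a sketch, not part of the answer above) is a stored generated column plus an ordinary index; it gives you the "actual field" the question asks about without manual UPDATEs:
ALTER TABLE test_truck_report
  ADD COLUMN consumption_per_km DECIMAL(10,4)
    GENERATED ALWAYS AS ((fuel_initial - fuel_final) / (odometer_final - odometer_initial)) STORED,
  ADD INDEX idx_consumption (consumption_per_km);

-- The generated column is kept in sync automatically when users update rows,
-- and the index lets this filter avoid a full table scan:
SELECT truck_id, consumption_per_km
FROM test_truck_report
WHERE consumption_per_km > 2;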
I am looking into storing a "large" amount of data and I'm not sure what the best solution is, so any help would be most appreciated. The structure of the data is:
450,000 rows
11,000 columns
My requirements are:
1) Need as fast access as possible to a small subset of the data, e.g. rows (1, 2, 3) and columns (5, 10, 1000)
2) Needs to be scalable: I will be adding columns every month, but the number of rows is fixed.
My understanding is that it's often best to store it as:
id| row_number| column_number| value
but this would create 4,950,000,000 entries? I have tried storing the data as rows and columns, as-is, in MySQL, but it is very slow at subsetting the data.
Thanks!
Build the giant matrix table
As N.B. said in the comments, there's no cleaner way than using one MySQL row for each matrix value.
You can do it without the id column:
CREATE TABLE `stackoverflow`.`matrix` (
`rowNum` MEDIUMINT NOT NULL ,
`colNum` MEDIUMINT NOT NULL ,
`value` INT NOT NULL ,
PRIMARY KEY ( `rowNum`, `colNum` )
) ENGINE = MYISAM ;
You may add a UNIQUE INDEX on (colNum, rowNum), or just a non-unique INDEX on colNum, if you often access the matrix by column (because the PRIMARY KEY is on (`rowNum`, `colNum`) - note the order - so it will be inefficient when it comes to selecting a whole column).
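With that primary key, the subset lookup from the question becomes a straightforward query, for example (a sketch using the row/column numbers mentioned above):
SELECT rowNum, colNum, value
FROM `stackoverflow`.`matrix`
WHERE rowNum IN (1, 2, 3)
  AND colNum IN (5, 10, 1000);  -- served directly by the (rowNum, colNum) primary key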
You'll probably need more than 200 GB to store the 450,000 x 11,000 values, including indexes.
Inserting data may be slow (because there are two indexes to rebuild, and 450,000 entries [1 per row] to add whenever you add a column).
Edits should be very fast, as the indexes wouldn't change and the value is of fixed size.
If you often access the same subsets (rows + cols), maybe you can use PARTITIONing of the table if you need something "faster" than what MySQL provides by default.
After years of experience (later edit)
Re-reading this years later, I would say the "cache" ideas below are totally dumb, as it's MySQL's job to handle this sort of caching (the data should already be in the InnoDB buffer pool anyway).
A better idea, if the matrix is full of zeroes, would be not to store the zero values and to treat 0 as the "default" in the client code. That way you lighten the storage (if needed: MySQL should actually be pretty fast at answering queries even on such a 5-billion-row table).
Another thing, if storage is an issue, is to use a single ID to identify both row and col: you say the number of rows is fixed (450,000), so you may replace (row, col) with a single value (id = 450000 * col + row) [though it needs a BIGINT, so it may not be better than two columns].
Don't do what's below: don't reinvent the MySQL cache.
Add a cache (actually no)
Since you said you add values and don't seem to edit matrix values, a cache can speed up frequently requested rows/columns.
If you often read the same rows/columns, you can cache their results in another table (with the same structure, to make it easier):
CREATE TABLE `stackoverflow`.`cachedPartialMatrix` (
`rowNum` MEDIUMINT NOT NULL ,
`colNum` MEDIUMINT NOT NULL ,
`value` INT NOT NULL ,
PRIMARY KEY ( `rowNum`, `colNum` )
) ENGINE = MYISAM ;
That table will be empty at the beginning, and each SELECT on the matrix table will feed the cache. When you want to get a column/row:
SELECT the row/column from that caching table
If the SELECT returns an empty/partial result (no data returned, or not enough data to match the expected row/column numbers), then do the SELECT on the matrix table
Save the result of that SELECT from the matrix table into cachedPartialMatrix
If the caching matrix gets too big, clear it (the bigger the cached matrix is, the slower it becomes)
Smarter cache (actually, no)
You can make it even smarter with a third table that counts how many times a selection has been done:
CREATE TABLE `stackoverflow`.`requestsCounter` (
`isRowSelect` BOOLEAN NOT NULL ,
`index` INT NOT NULL ,
`count` INT NOT NULL ,
`lastDate` DATETIME NOT NULL,
PRIMARY KEY ( `isRowSelect` , `index` )
) ENGINE = MYISAM ;
When you do a request on your matrix (one may use TRIGGERS) for the Nth row or Kth column, increment the counter. When the counter gets big enough, feed the cache.
lastDate can be used to remove old values from the cache (take care: if you remove the Nth column from the cache entries because its `lastDate` is old enough, you may break the cache for some other entries), or to regularly clear the cache and keep only the recently selected values.
If we have a table such as:
create table x(
id int primary key,
something_else_1 int,
something_else_2 int,
something_else_3 int,
char_data text -- or varchar(1000)
);
This table will be queried on all fields except char_data.
Most queries will be similar to:
select id, something_else_1
from x
where something_else_2 = 2 and something_else_3 = 5;
The question is: if we have indexes etc.,
which configuration will be better - text or varchar?
Just one final note:
I know I can separate this into two tables, but separation in this case would not be the best idea, since all fields except the blob will be part of something like a unique index or similar.
this table will be queried on all fields except char_data.
Then the data type of char_data has no influence on performance. Only if you select char_data will it consume more bandwidth. Nothing else.
It's not a problem, because you are not using it in your SQL. SELECT * will be slow, but SELECT id, something_else_1 will not. WHERE id=2 AND something_else_2=1 has no effect, but WHERE char_data LIKE '%charsequence%' does. As long as you are not searching your table by char_data, you are safe.
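If you want that typical query to be answered from an index alone, so the rows (and hence char_data) are never touched, a covering index is one option; a sketch, assuming the query shown in the question and the InnoDB engine:
ALTER TABLE x
  ADD INDEX idx_lookup (something_else_2, something_else_3, something_else_1);

-- The WHERE columns come first, the selected column last; with InnoDB the
-- primary key (id) is stored in the secondary index anyway, so this query
-- can be satisfied entirely from idx_lookup:
SELECT id, something_else_1
FROM x
WHERE something_else_2 = 2 AND something_else_3 = 5;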
Besides, if you still want to search by char_data, you should enable full-text search:
ALTER TABLE `x` ADD FULLTEXT(`char_data`);
Note: before MySQL 5.6, full-text search is only supported by the MyISAM storage engine; from 5.6 on, InnoDB supports it as well.
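Once the full-text index exists, a search would look something like this (a sketch; 'charsequence' is just a placeholder search term):
-- natural-language full-text search for the word 'charsequence'
SELECT id, something_else_1
FROM x
WHERE MATCH(char_data) AGAINST('charsequence');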
This might be a basic question: I am using a temporary table in some of my PHP code, like so:
CREATE TEMPORARY TABLE ttable ( `d` DATE NOT NULL, `p` DECIMAL(11,2) NOT NULL, UNIQUE KEY `date` (`d`) );
INSERT INTO ttable( d, p ) VALUES ( '$d' , '$p' );
SELECT * FROM ttable;
As we scale up our site, will this ever be a problem? I.e., will user1's ttable and user2's ttable ever get mixed up, with user1 seeing user2's ttable and vice versa? Is it better to create a unique name for each temporary table?
Thanks.
Temporary tables are session-specific. Every time you connect to a host (in PHP, this is done with mysql_connect), temporary tables that you create exist only within that session/connection.
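To illustrate (this sketch is not from the original answer): two separate connections each get their own, completely independent ttable:
-- Connection 1 (one mysql_connect):
CREATE TEMPORARY TABLE ttable (d DATE NOT NULL, p DECIMAL(11,2) NOT NULL);
INSERT INTO ttable (d, p) VALUES ('2013-01-01', 9.99);

-- Connection 2 (a different mysql_connect): the table above is invisible here,
-- and creating ttable again yields a completely separate table.
CREATE TEMPORARY TABLE ttable (d DATE NOT NULL, p DECIMAL(11,2) NOT NULL);
SELECT * FROM ttable;  -- empty: connection 1's row is not visible here

-- When a connection closes, its temporary tables are dropped automatically.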
It is almost always better to find a different way than using temporary tables.
The only time I would consider them is under the following conditions:
The activity is rare. Meaning, a given user MIGHT do this once a week.
It is used as a holding container prior to doing a regular full import of data.
It deals with data whose structure is unknown prior to being filled.
All three of those really go with building some type of generic bulk import routines where the data mapping is defined at run time.
If you find yourself creating temp tables frequently in the application, there's probably a better way.
Scalability is going to depend on the amount of data being loaded and frequency of temp table usage. For a low trafficked site it might be okay.
We're in the process of ripping out a ton of temp table usage in a client's app. 90% of the queries in their system result in a temp table being created. Analysis of all the queries has shown that the original dev used this mechanism simply because they didn't understand SQL. We're doing this because performance has dropped off radically as new users are added to the system.
Can you post a use case? Maybe we can help provide an alternate mechanism.
UPDATE:
Now that we have a use case, here is a simple table structure to accomplish what you need.
Table ZipCodes
ZipCode char(5) [or char(10) depending on need]
CityName varchar(50)
*other columns as necessary such as latitude or whatever.
Table TempReadings
ZipCode char(5) [foreign key to the ZipCode table]
ReadingDate datetime
Temperature float (or some equivalent)
To get all the temp readings for a given zip code you would do something like:
select ZipCode, ReadingDate, Temperature
from TempReadings
where ZipCode = '90210'  -- example value: the zip code you are interested in
If you need info from the main ZipCodes table:
select Z.ZipCode, Z.CityName, TR.ReadingDate, TR.Temperature
from ZipCodes Z
inner join TempReadings TR on (TR.ZipCode = Z.ZipCode)
Add WHERE clauses as necessary. Note that none of the above requires having a separate table per zip code.
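For instance, a filtered version of the join above might look like this (the zip code and date range are just example values):
select Z.ZipCode, Z.CityName, TR.ReadingDate, TR.Temperature
from ZipCodes Z
inner join TempReadings TR on (TR.ZipCode = Z.ZipCode)
where Z.ZipCode = '90210'
  and TR.ReadingDate >= '2013-01-01'
  and TR.ReadingDate <  '2013-02-01';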
Assume that I have one big table with three columns: "user_name", "user_property", "value_of_property". Let's also assume that I have a lot of users (say 100,000) and a lot of properties (say 10,000). Then the table is going to be huge (1 billion rows).
When I extract information from the table, I always need information about a particular user. So I use, for example, WHERE user_name='Albert Gates'. So every time the MySQL server needs to analyze 1 billion rows to find those that contain "Albert Gates" as user_name.
Would it not be wise to split the big table into many small ones corresponding to fixed users?
No, I don't think that is a good idea. A better approach is to add an index on the user_name column - and perhaps another index on (user_name, user_property) for looking up a single property. Then the database does not need to scan all the rows - it just needs to find the appropriate entry in the index, which is stored in a B-Tree, making it easy to find a record in a very small amount of time.
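A sketch of what that might look like (big_table is a made-up name, since the question doesn't give one):
ALTER TABLE big_table
  ADD INDEX idx_user (user_name),
  ADD INDEX idx_user_property (user_name, user_property);

-- Now this touches only the matching index entries instead of all 1 billion rows:
SELECT user_property, value_of_property
FROM big_table
WHERE user_name = 'Albert Gates';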
If your application is still slow even after correctly indexing it can sometimes be a good idea to partition your largest tables.
One other thing you could consider is normalizing your database so that the user_name is stored in a separate table and an integer foreign key is used in its place. This can reduce storage requirements and increase performance. The same may apply to user_property.
You should normalise your design as follows:
drop table if exists users;
create table users
(
user_id int unsigned not null auto_increment primary key,
username varbinary(32) unique not null
)
engine=innodb;
drop table if exists properties;
create table properties
(
property_id smallint unsigned not null auto_increment primary key,
name varchar(255) unique not null
)
engine=innodb;
drop table if exists user_property_values;
create table user_property_values
(
user_id int unsigned not null,
property_id smallint unsigned not null,
value varchar(255) not null,
primary key (user_id, property_id),
key (property_id)
)
engine=innodb;
insert into users (username) values ('f00'),('bar'),('alpha'),('beta');
insert into properties (name) values ('age'),('gender');
insert into user_property_values values
(1,1,'30'),(1,2,'Male'),
(2,1,'24'),(2,2,'Female'),
(3,1,'18'),
(4,1,'26'),(4,2,'Male');
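With this design, the original kind of lookup ("everything about a particular user") would look something like this (just a sketch against the tables above):
select u.username, p.name as property, upv.value
from users u
join user_property_values upv on upv.user_id = u.user_id
join properties p on p.property_id = upv.property_id
where u.username = 'f00';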
From a performance perspective the innodb clustered index works wonders in this similar example (COLD run):
select count(*) from product
count(*)
========
1,000,000 (1M)
select count(*) from category
count(*)
========
250,000 (250K)
select count(*) from product_category
count(*)
========
125,431,192 (125M)
select
c.*,
p.*
from
product_category pc
inner join category c on pc.cat_id = c.cat_id
inner join product p on pc.prod_id = p.prod_id
where
pc.cat_id = 1001;
0:00:00.030: Query OK (0.03 secs)
Properly indexing your database will be the number one way of improving performance. I once had a query take half an hour (on a large dataset, but nonetheless). Then we found out that the tables had no index. Once indexed, the query took less than 10 seconds.
Why do you need to have this table structure? My fundamental problem is that you are going to have to cast the data in value_of_property every time you want to use it. That is bad in my opinion - also, storing numbers as text is crazy given that it's all binary anyway. For instance, how are you going to have required fields? Or fields that need constraints based on other fields, e.g. start and end dates?
Why not simply have the properties as fields rather than using some many-to-many relationship?
Have one flat table. When your business rules begin to show that properties should be grouped, then you can consider moving them out into other tables and having several 1:0-1 relationships with the users table. But this is not normalization, and it will degrade performance slightly due to the extra join (however, the self-documenting nature of the table names will greatly aid any developers).
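A minimal sketch of such a flat table (the property columns are only examples, not taken from the question):
CREATE TABLE users_flat (
  user_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  user_name VARCHAR(100) NOT NULL,
  age TINYINT UNSIGNED NULL,            -- numbers stored as numbers, not text
  gender ENUM('Male','Female') NULL,
  start_date DATE NOT NULL,             -- required fields can simply be NOT NULL
  end_date DATE NULL,
  CHECK (end_date IS NULL OR end_date >= start_date)  -- enforced from MySQL 8.0.16; parsed but ignored before
) ENGINE=InnoDB;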
One way I regularly see database performance get totally castrated is by having a generic
Id, Property Type, Property Name, Property Value table.
This is really lazy but exceptionally flexible, and it totally kills performance. In fact, on a new job where performance is bad, I actually ask if they have a table with this structure - it invariably becomes the center point of the database and is slow. The whole point of relational database design is that the relations are determined ahead of time. This is simply a technique that aims to speed up development at a huge cost to application speed. It also puts a huge reliance on business logic in the application layer behaving itself - which is not defensive at all. Eventually you find that you want to use properties in a key relationship, which leads to all kinds of casting in the join, which further degrades performance.
If data has a 1:1 relationship with an entity, then it should be a field on the same table. If your table gets to more than 30 fields wide, then consider moving them into another table, but don't call it normalisation, because it isn't. It is a technique to help developers group fields together, at the cost of performance, in an attempt to aid understanding.
I don't know if MySQL has an equivalent, but SQL Server 2008 has sparse columns - null values take no space.
Sparse column data types
I'm not saying an EAV approach is always wrong, but I think using a relational database for this approach is probably not the best choice.