This is a problem that bothers me whenever I need to add a new field to a table. Here the table has about 1.5 billion records (partitioned and sharded, so it is physically separated into files). I need to add a nullable varchar(1024) field that will hold JSON strings. It is possible that the field length will have to be increased in the future to accommodate longer strings.
Here are the arguments:
All existing rows will have null values for this field. (favors new table)
Only 5% of newly inserted records will have a value for it. (favors new table)
Most of the current queries on the table will need to access this field. (favors alter)
I'm not sure whether query memory allocation plays a role here, depending on where I store the field.
So should I add the field to the current table, or define another table with the same primary key to store this data?
Your comments would help me reach a decision.
Well, if your older records won't need that varchar field, you should put it in another table and, when pulling data, join on the primary key of the other table.
It's not a big deal: you can simply add a column to that table, and existing rows will just have null for the new column.
I think that, regardless of the 3 situations you have posited, you should alter the existing table, rather than creating a new one.
My reasoning is as follows:
1) Your table is very large (1.5 billion rows). If you create a new table, you would replicate the PK for 1.5 billion rows in the new table.
This will cause the following problems:
a) Wastage of DB space.
b) Time-intensive. Populating a new table with 1.5 billion rows and updating their PKs is a non-trivial exercise.
c) Rollback-segment exhaustion. If the rollback segments have insufficient space during the insertion of the new rows, the insert will fail. This will increase the DB fragmentation.
On the other hand, all these problems are avoided by altering the table:
1) There is no space wastage.
2) The operation won't be time-consuming.
3) There is no risk of rollback segment failure or DB fragmentation.
So alter the table.
Both these approaches have merits and demerits. I think I found a compromise between the two options that has the benefits of both:
Create a new table to hold the JSON string, with the same primary key as the first table. Say the first table is customer and the second is customer_json_attributes.
Alter the current table (customer) to add a flag indicating the presence of a value in the JSON field, say json_present_indicator char(1).
The application sets json_present_indicator = 'Y' in the first table if there is a value for the JSON field in the second table, and 'N' otherwise.
Select queries will use a left join with json_present_indicator = 'Y' as a join condition. This is an efficient join because the query will search the second table only when the indicator is 'Y'. Remember, only 5% of the records will have a value in the JSON field.
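A minimal sketch of this compromise design (column names, key names, and types are assumptions, not taken from an actual schema):

```sql
-- Flag column on the main table; defaults to 'N' for all existing rows.
ALTER TABLE customer
    ADD COLUMN json_present_indicator CHAR(1) NOT NULL DEFAULT 'N';

-- Side table holding the JSON only for the ~5% of rows that have it.
CREATE TABLE customer_json_attributes (
    customer_id BIGINT NOT NULL PRIMARY KEY,
    json_value  VARCHAR(1024),
    CONSTRAINT fk_cja_customer
        FOREIGN KEY (customer_id) REFERENCES customer (customer_id)
);

-- Queries touch the side table only when the indicator says a value exists.
SELECT c.*, j.json_value
FROM customer c
LEFT JOIN customer_json_attributes j
    ON j.customer_id = c.customer_id
   AND c.json_present_indicator = 'Y';
```

Because the indicator lives in the main table, widening the varchar later only alters the small side table, not the 1.5-billion-row one.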
I have a users table that contains many attributes like email, username, password, phone, etc.
I would like to save a new type of data (an integer), let's call it "superpower", but only very few users will have it. The users table contains 10K+ records, while fewer than 10 users will have a superpower (for all others it will be null).
So my question is which of the following options is more correct and better in terms of performance:
add another column in the users table called "superpower", which will be null for almost all users
have a new table called users_superpower, which will contain at most 10 records and will map users to superpowers.
Some things I have thought about:
a. The first option seems wasteful of space, but it really is just an integer...
b. The second option will require a left join every time I query the users...
c. Would the answer change if "superpower" were 5 columns, for example?
Note: I'm using Hibernate and MySQL, if that changes the answer.
This might be a matter of opinion. My viewpoint on this follows:
If superpower is an attribute of users and you are not in the habit of adding attributes, then you should add it as a column. 10,000*4 additional bytes is not very much overhead.
If superpower is just one attribute and you might add others, then I would suggest using JSON or an EAV table to store the values.
If superpower is really a new type of user with other attributes and dates and so on, then create another table. In this table, the primary key can be the user_id, making the joins between the tables even more efficient.
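A sketch of that last option, assuming a users table with an integer id primary key (the table and constraint names are illustrative):

```sql
-- Hypothetical side table for the handful of users with superpowers.
-- user_id is both the PK and the FK, so the join is a PK-to-PK lookup.
CREATE TABLE users_superpower (
    user_id    INT NOT NULL PRIMARY KEY,
    superpower INT NOT NULL,
    CONSTRAINT fk_usp_user
        FOREIGN KEY (user_id) REFERENCES users (id)
);

-- Only the ~10 matching rows incur join work; everyone else gets NULL.
SELECT u.*, s.superpower
FROM users u
LEFT JOIN users_superpower s ON s.user_id = u.id;
```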
I would go with just adding a new boolean field in your user entity which keeps track of whether or not that user has superpowers.
Appreciate that adding a new table and linking it requires a foreign key in your current users table, and this key will be another column taking up space, so it doesn't really avoid the storage. If you just want a really small column to record whether a user has superpowers, you can use a boolean variable, which would map to a MySQL BIT(1) column. Because this is a fixed-width column, NULL values would still take up a single bit of space, but this is most likely not a big storage concern compared to the rest of your table.
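A one-line sketch of that boolean-column approach (the column name has_superpower is an assumption):

```sql
-- BIT(1) is fixed width; Hibernate maps it to a Java boolean.
ALTER TABLE users
    ADD COLUMN has_superpower BIT(1) NOT NULL DEFAULT b'0';
```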
I have a use case where I have to fetch a row from a large table (millions of entries) with a filter on a 'text' column. The problem is that the following simple query on this table is timing out.
select * from st_overflow_tbl where uniue_txt_id = '123456'
I ran the EXPLAIN command and found that there is no index on uniue_txt_id. Output below.
"id","select_type","table","type","possible_keys","key","key_len","ref","rows","Extra"
1,"SIMPLE","st_overflow_tbl","ALL",NULL,NULL,NULL,NULL,12063881,"Using where"
I then tried to create an index on this table, but it failed with the following error message:
BLOB/TEXT column 'uniue_txt_id' used in key specification without a key length
The command I was running is this:
alter table st_overflow_tbl add index uti_idx (uniue_txt_id)
Given the scenario above, my questions are:
Is there a better way (than just creating an index) to fetch data when searching a 'text' field on a very large table?
How can I create an index on this table? What am I doing wrong?
More specific to my use case: if I cannot create an index on this table, can I still fetch data fast with the same select query (mentioned above)? Let's say the uniue_txt_id field will always be numeric (well, it is so far).
If the column only has a handful of characters, then change the column to an appropriate type, such as varchar(32).
If you don't want to do that, you can specify a prefix length to index:
alter table st_overflow_tbl add index uti_idx (uniue_txt_id(32))
And, finally, if it really is a text column that contains words and so on, then you can investigate a full text index.
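The three options side by side (the 32-character limit is an assumption about the data; pick one, not all three):

```sql
-- Option 1: shrink the type so the whole column is indexable.
ALTER TABLE st_overflow_tbl MODIFY uniue_txt_id VARCHAR(32) NOT NULL;
ALTER TABLE st_overflow_tbl ADD INDEX uti_idx (uniue_txt_id);

-- Option 2: keep the TEXT type, but index only a prefix.
ALTER TABLE st_overflow_tbl ADD INDEX uti_idx (uniue_txt_id(32));

-- Option 3: full-text search, for word-like content.
ALTER TABLE st_overflow_tbl ADD FULLTEXT INDEX uti_ft (uniue_txt_id);
SELECT * FROM st_overflow_tbl
WHERE MATCH(uniue_txt_id) AGAINST ('123456');
```

For an exact-match lookup like yours, options 1 and 2 are the natural fits; full text is only worthwhile if you need word searches.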
I have a few tables storing their corresponding records for my system. For example, there could be a table called templates and one called logos. In each table, one of the rows will be the system default. I would normally have added an is_default column to each table, but all of the rows except for one would have been 0.
Another colleague of mine sees another route, in which there is a system_defaults table. And that table has a column for each table. For example, this table would have a template_id column and a logo_id column. Then that column stores the corresponding default.
Is one way more correct than the other, generally? With the first, there are many rows with the same value except for one. With the second, I suppose I just have to do a join to get the details, and the table grows sideways whenever I add a new table that has a default.
The solutions mainly differ in the ways to make sure that no more than one default value is assigned for each table.
is_default solution: Here it may happen that more than one record of a table has the value 1. It depends on the SQL dialect of your database whether this can be excluded by a constraint. As far as I understand MySQL, this kind of constraint can't be expressed there.
Separate table solution: Here you can easily make sure by your table design that at most one default is present per table. By assigning not null constraints, you can also force defaults for specific tables, or not. When you introduce a new table, you are extending your database (and the software working on it) anyway, so the additional attribute on the default table won't hurt.
A middle course might be the following: have a table
Defaults (id, table_name, row_id)
with one record per table, identified by the table name. Technically, the problem of more than one default per table may also occur here. But if you only insert records into this table when a new table gets introduced, then your operative software will only need to perform updates on this table, never inserts. You can easily check this via code inspection.
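A sketch of that middle-course table (the column types and the UNIQUE constraint are assumptions):

```sql
CREATE TABLE defaults (
    id         INT NOT NULL PRIMARY KEY,
    table_name VARCHAR(64) NOT NULL UNIQUE,  -- one row per table, enforced
    row_id     INT NOT NULL                  -- PK of that table's default row
);

-- Operational code only ever updates existing rows, never inserts:
UPDATE defaults SET row_id = 42 WHERE table_name = 'templates';
```

The UNIQUE constraint on table_name gives you the at-most-one-default guarantee at the database level as well, not just by code inspection.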
I currently have a non-temporal MySQL DB and need to change it to a temporal MySQL DB. In other words, I need to be able to retain a history of changes that have been made to a record over time for reporting purposes.
My first thought for implementing this was to simply do inserts into the tables instead of updates, and when I need to select the data, simply doing a GROUP BY on some column and ordering by the timestamp DESC.
However, after thinking about it a bit, I realized that this will really mess things up, because the primary key for each insert (which would really just be simulating a number of updates to a single record) will be different, breaking any linkage that uses the primary key to reference other records in the DB.
As such, my next thought was to continue updating the main tables in the DB, but also create a new insert into an "audit table" that is simply a copy of the full record after the update, and then when I needed to report on temporal data, I could use the audit table for querying purposes.
Can someone please give me some guidance or links on how to properly do this?
Thank you.
Make the given table R temporal (i.e., maintain its history).
One design is to leave the table R as it is and create a new table R_Hist with valid_start_time and valid_end_time.
Valid time is the time when the fact is true.
The CRUD operations can be given as:
INSERT
Insert into R
Insert into R_Hist with valid_end_time as infinity
UPDATE
Update in R
Update valid_end_time with the current time for the "latest" tuple in R_Hist
Insert into R_Hist with valid_end_time as infinity
DELETE
Delete from R
Update valid_end_time with the current time for the "latest" tuple in R_Hist
SELECT
Select from R for 'snapshot' queries (implicitly the 'latest' timestamp)
Select from R_Hist for temporal operations
Instead, you can choose to create a new table for every attribute of table R. With that design you capture attribute-level temporal data, as opposed to the entity-level data of the previous design. The CRUD operations are almost the same.
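The update path can be sketched like this (the schema and the sentinel date '9999-12-31' standing in for infinity are assumptions):

```sql
-- History table mirrors R plus a validity interval.
CREATE TABLE r_hist (
    id               INT NOT NULL,
    data             VARCHAR(255),
    valid_start_time DATETIME NOT NULL,
    valid_end_time   DATETIME NOT NULL,  -- '9999-12-31' stands in for infinity
    PRIMARY KEY (id, valid_start_time)
);

-- UPDATE: first close the current history tuple...
UPDATE r_hist SET valid_end_time = NOW()
WHERE id = 1 AND valid_end_time = '9999-12-31';

-- ...then open a new one that is valid from now to "infinity".
INSERT INTO r_hist (id, data, valid_start_time, valid_end_time)
VALUES (1, 'new value', NOW(), '9999-12-31');
```

Closing the old tuple before inserting the new one matters: otherwise "the latest tuple" becomes ambiguous.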
I added a Deleted column and a DeletedDate column. Deleted defaults to false and DeletedDate to null.
Composite primary key on IDColumn, Deleted, and DeletedDate.
You can index by Deleted so you have really fast queries.
No duplicate primary key on your IDColumn, because your primary key includes Deleted and DeletedDate.
Assumption: you won't write to the same record more than once a millisecond. It could cause a duplicate-primary-key issue if DeletedDate is not unique.
So then I do a transaction-type deal for updates: select the row, take the results, update specific values, then insert. Really it's an update setting Deleted to true and DeletedDate to now(); then you have it spit out the row after the update and use that to get the primary key and/or any values not available to whatever API you built.
Not as good as a temporal table, and it takes some discipline, but it builds history into one table that is easy to report on.
I may start updating the DeletedDate column and change it to an Added/Deleted date, in addition to the added date, so I can sort records by one column (the Added/Deleted column), while always updating the AddedBy column with the same value as the Added/Deleted column for logging's sake.
Either way, you could just do a complex CASE WHEN DeletedDate IS NOT NULL THEN DeletedDate ELSE AddedDate END AS AddedDate and ORDER BY AddedDate DESC. So, yeah, whatever, this works.
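A sketch of this soft-delete-as-history design (types are assumptions; since MySQL forces primary-key columns to NOT NULL, the sketch uses a sentinel date instead of NULL for live rows):

```sql
CREATE TABLE records (
    IDColumn    INT NOT NULL,
    Data        VARCHAR(255),
    Deleted     BOOLEAN NOT NULL DEFAULT FALSE,
    DeletedDate DATETIME(3) NOT NULL DEFAULT '9999-12-31',
    PRIMARY KEY (IDColumn, Deleted, DeletedDate)
);

-- An "update" closes the live row, then inserts the new version.
UPDATE records
SET Deleted = TRUE, DeletedDate = NOW(3)
WHERE IDColumn = 1 AND Deleted = FALSE;

INSERT INTO records (IDColumn, Data) VALUES (1, 'new value');
```

DATETIME(3) gives millisecond precision, matching the once-per-millisecond assumption above.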
I'm trying to filter rows from the MySQL table where all the $_POST data from an online form is stored. Sometimes the user's internet connection stalls or the browser screws up, and the new page after form submission is not displayed (though the INSERT worked and the table row was created). They then hit refresh and submit their form twice, creating a duplicate row (identical except for the timestamp and auto-increment id columns).
I'd like to select unique form submissions. This has to be a really common task, but I can't seem to find something that lets me apply DISTINCT to every column except the timestamp and id in a succinct way (sort of like SELECT id, timestamp, DISTINCT everything_else FROM table;). At the moment, I can do:
CREATE TEMPORARY TABLE IF NOT EXISTS temp1 AS (
SELECT DISTINCT everything,except,id,and,timestamp
FROM table1
);
SELECT * FROM table1 LEFT OUTER JOIN temp1
ON table1.everything = temp1.everything
...
;
My table has 20k rows with about 25 columns (classification features for a machine learning exercise). This query takes forever (as I presume it traverses the 20k rows 20k times?); I've never even let it run to completion. What's the standard-practice way to do this?
Note: This question suggests adding an index to the relevant columns, but an index can have at most 16 key parts. Should I just choose the columns most likely to be unique? I can find about 700 duplicates in 2 seconds this way, but I can't be sure I'm not throwing away a unique row, because I also have to ignore some columns when specifying the index.
If you have a UNIQUE key (other than an AUTO_INCREMENT), simply use INSERT IGNORE ... to silently avoid duplicate rows. If you don't have a UNIQUE key, do you never need to find a row again?
If you have already allowed duplicates and you need to get rid of them, that is a different question.
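One way to get a UNIQUE key despite the 16-key-part limit is to hash the payload columns; this sketch is an assumption about the schema (row_hash, col_a, col_b are hypothetical names):

```sql
-- A hash of the ~25 payload columns acts as the UNIQUE key, sidestepping
-- the 16-key-part limit on composite indexes.
ALTER TABLE table1 ADD COLUMN row_hash CHAR(40) NOT NULL;
ALTER TABLE table1 ADD UNIQUE KEY uq_row_hash (row_hash);

-- Hash every column except id and timestamp; duplicates are silently dropped.
INSERT IGNORE INTO table1 (col_a, col_b, row_hash)
VALUES ('a', 'b', SHA1(CONCAT_WS('|', 'a', 'b')));
```

CONCAT_WS with a separator avoids accidental collisions like ('ab','c') vs ('a','bc').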
I would try to eliminate the problem in the first place. There are techniques for this. The first one that comes to mind: generate a random string and store it both in the session and as a hidden field in the form, regenerating it each time the form is displayed. When the user submits the form, check that the session key and the submitted key match, and generate a different key on each request. That way, when a user refreshes the page, he will submit an old key and it will not match.
Another solution: if this data should always be unique in the database, check whether that exact data already exists before inserting. And if the data is unique by, let's say, the email address, you can create a unique key index, so that field will have to be unique in the table.