Quite large (400k) mysql database design for multiple users - mysql

I'm seeking advice of experienced admins.
I'm working on a website where you solve word anagrams. Once a word is solved, it should never be displayed to that user again.
Wordbase contains ~400k entries. What would be the most effective solution to storing such data?
One way could be:
+---------+------------------------+
| word_id | user1 | user2 | user...|
+---------+------------------------+
| 1 | null | null | 1 |
| 2 | 1 | null | null |
| ... | | | |
| 400000 | null | 1 | null |
+---------+------------------------+
Where let's say 1 = solved.
But wouldn't it become a monster quite quickly?
(Also, even a simple query that extends it with a new user takes forever.)
The other solution is to store every solved word_id for each user, but each entry can be a 6-digit number, and the data would grow massively and rapidly as well.
Also which engine would be more effective in this example? MyISAM or InnoDB?

You would not put the users as columns. If I understand the question, you would have a table, called something like WordUsers, with one row per word/user pair:
create table WordUsers (
WordUserId int not null primary key auto_increment,
WordId int not null,
UserId int not null,
. . .
constraint fk_WordId foreign key (WordId) references Words(WordId),
constraint fk_UserId foreign key (UserId) references Users(UserId)
);
When a word is shown to a user, then you add a row to this table. The . . . can include other information, such as the date/time of the interaction.
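A minimal sketch of how such a table could be queried, using Python's sqlite3 module as a stand-in for MySQL. The table and column names (words, users, solved) and the sample rows are assumptions for illustration, and the sketch uses a composite primary key instead of the WordUserId column shown above:

```python
import sqlite3

# SQLite stands in for MySQL here; all names and rows are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE words  (word_id INTEGER PRIMARY KEY, word TEXT NOT NULL);
    CREATE TABLE users  (user_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE solved (
        user_id INTEGER NOT NULL REFERENCES users(user_id),
        word_id INTEGER NOT NULL REFERENCES words(word_id),
        PRIMARY KEY (user_id, word_id)   -- one row per solved pair
    );
    INSERT INTO words VALUES (1, 'listen'), (2, 'stream'), (3, 'night');
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO solved VALUES (1, 1), (1, 3), (2, 2);
""")

def unsolved_words(conn, user_id):
    """Return the ids of words this user has not solved yet."""
    rows = conn.execute(
        """
        SELECT w.word_id
        FROM words AS w
        WHERE NOT EXISTS (
            SELECT 1 FROM solved AS s
            WHERE s.user_id = ? AND s.word_id = w.word_id
        )
        ORDER BY w.word_id
        """,
        (user_id,),
    ).fetchall()
    return [r[0] for r in rows]

print(unsolved_words(conn, 1))
print(unsolved_words(conn, 2))
```

Only solved pairs are stored, so the table grows with activity rather than with users times words.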

If your database supports it (and I think all of them do now), why not just put a text field on the users table, fill it with a string of "N"s for "No, they haven't seen this word yet", and when they are given a word just change the "N" to "Y" at that word's position and re-save the new string? A MySQL TEXT column holds up to 65,535 bytes (MEDIUMTEXT holds up to 16,777,215, which a ~400k wordbase would need). So you make your string something like 5,000 "N"s.
Or if you want to beat yourself up a bit - use the BIT field and make it something like 5000 flags. Same concept but harder to use.
BTW: With the string of "N"s and "Y"s you should be able to write an SQL query with a test along the lines of "WHERE SUBSTR(SEEN_IT,WORD_ID,1)='N'".
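For comparison, a small sketch of the flag-string idea, again with Python's sqlite3 standing in for MySQL. The seen_it column name follows the SUBSTR example above; the helper names and the 10-character string are assumptions to keep the example short:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (user_id INTEGER PRIMARY KEY, seen_it TEXT NOT NULL)"
)
# One flag per word; position N (1-based) belongs to word_id N.
conn.execute("INSERT INTO users VALUES (1, ?)", ("N" * 10,))

def mark_seen(conn, user_id, word_id):
    """Flip the flag for word_id to 'Y' by rewriting the whole string."""
    (flags,) = conn.execute(
        "SELECT seen_it FROM users WHERE user_id = ?", (user_id,)
    ).fetchone()
    flags = flags[: word_id - 1] + "Y" + flags[word_id:]
    conn.execute(
        "UPDATE users SET seen_it = ? WHERE user_id = ?", (flags, user_id)
    )

def has_seen(conn, user_id, word_id):
    """Read a single flag with SUBSTR, as in the WHERE test above."""
    (flag,) = conn.execute(
        "SELECT SUBSTR(seen_it, ?, 1) FROM users WHERE user_id = ?",
        (word_id, user_id),
    ).fetchone()
    return flag == "Y"

mark_seen(conn, 1, 3)
print(has_seen(conn, 1, 3), has_seen(conn, 1, 4))
```

Note that every update rewrites the full string, and the word ids have to stay dense and stable for the positions to keep their meaning.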

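You should use a relational database the way it is meant to be used, relationally:
CREATE TABLE user (user_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY, user CHAR(16));
CREATE TABLE word (word_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY, word CHAR(16));
CREATE TABLE solved (word_id INT NOT NULL, user_id INT NOT NULL, PRIMARY KEY (user_id, word_id));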

Related

Data model question: how to qualify 3rd objects

I'm dealing with slightly different types hence for clarity of what I'm trying to achieve I have decided to use metaphor.
Let's say you need to create tables that describe projects by two architectural bureaus:
1st only deals with 3D plans
2nd only deals with 2D sketches
I have the following table
mysql> describe sketch;
+------------------+-------------------------------+------+-----+-------------------+
| Field | Type | Null | Key | Default |
+------------------+-------------------------------+------+-----+-------------------+
| project_id | binary(16) | NO | PRI | NULL |
| company_id | binary(16) | NO | PRI | NULL |
| type | enum('2D','3D','N/A') | YES | |'N/A' |
+------------------+-------------------------------+------+-----+-------------------+
As you can see project_id & company_id form the PRIMARY KEY
The issue arises when, in some exceptional circumstances, the same company takes on both a 2D and a 3D task under the same project ID.
Or the same company starts working on two or more sub-projects of the same type (e.g. both are 2D sketches) but within the realm of let's call it parent project with exactly the same ID.
One quick and dirty fix would be simply to add a unique ID to the above table, but that wouldn't work for me, because there are various reports and other functions which basically do this: SELECT blah FROM sketch WHERE project_id=XXX AND company_id
I could add code to filter the results of the above SQL, but I can't really change the structure of the table.
Any ideas on what options I have?
Appreciate any ideas!
And thank you very much beforehand!
As you describe the problem, company/project is not a primary key. You describe circumstances where uniqueness is violated.
Then company/project/type does seem to be a unique key and a candidate primary key. I would say that you should have a numeric primary key and declare the tripartite key as unique.
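A rough sketch of that suggestion, with Python's sqlite3 standing in for MySQL. The sketch_id column name and the sample key values are hypothetical, and type is declared NOT NULL here (an assumption, since the original column is nullable) so that the unique key behaves predictably:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sketch (
        sketch_id  INTEGER PRIMARY KEY AUTOINCREMENT,  -- hypothetical surrogate key
        project_id BLOB NOT NULL,
        company_id BLOB NOT NULL,
        type       TEXT NOT NULL DEFAULT 'N/A'
                   CHECK (type IN ('2D', '3D', 'N/A')),
        UNIQUE (project_id, company_id, type)          -- the three-part key
    )
""")

project, company = b"\x01" * 16, b"\x02" * 16   # made-up 16-byte ids
insert = "INSERT INTO sketch (project_id, company_id, type) VALUES (?, ?, ?)"
conn.execute(insert, (project, company, "2D"))
conn.execute(insert, (project, company, "3D"))   # same company and project, other type

try:
    conn.execute(insert, (project, company, "2D"))  # exact repeat
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Existing reports that filter on project_id and company_id keep working;
# they may now simply return more than one row.
rows = conn.execute(
    "SELECT type FROM sketch WHERE project_id = ? AND company_id = ? ORDER BY type",
    (project, company),
).fetchall()
print(duplicate_rejected, rows)
```

This covers the 2D-plus-3D case only. Two sub-projects of the same type under one parent project would still collide on the three-part key and would need a further distinguishing column.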

MySQL Database Table optimization for FASTER Querying & Performance

I have 2 tables in my MySQL database.
Let's call 1st main, 2nd final.
TABLE `main` has the structure:
`id` --> PRIMARY KEY (Auto Increment)
id | name    | info
---------------------
1  | Peter   | 5,9
2  | Butters | 3,3
3  | Stan    | 2,96
4  | Stewie  | 1,84
TABLE `final` has the structure:
`id` --> PRIMARY KEY (Auto Increment)
`id_main` --> ?? (Need help here)
id | id_main | name    | info(changed)
---------------------------------------
1  | 2       | Butters | 0.3,34
2  | 4       | Stewie  | 1.2,4.4
3  | 1       | Peter   | 5.7,0.9
4  | 3       | Stan    | 4.8,0.74
After analysing data in main the results get put into final.
As you can see final has an extra column (id_main) which points back to main.id
In actuality these 2 tables are 100 million+ rows each, my problem arises while performing SQL queries.
How should final (especially id & id_main) be configured so that querying from main to final is fastest?
Can I do away with final.id (PRIMARY KEY, Auto Increment) & keep
final.id_main (As an UNIQUE Index?)
OR
Should I keep id AS PRIMARY KEY (AI) & final.id_main AS UNIQUE Index?
I would be making calls like:
int id_From_Main= 10000;
SELECT `id_main` FROM `final` WHERE `id`='"+id_From_Main+"'
If there's a 1:1 relation between those tables, I don't see any reason why they would need two separate auto-incremented primary keys.
I would remove the final.id column and have the final.id_main as a non-auto-incremented primary key and a foreign key to the main.id column.
In general, you can also have a table without a primary key at all. It depends on if you want to be able to select specific individual rows or not.
I don't understand your query SELECT id_main FROM final WHERE id = '"+id_From_Main+"' — you're trying to select the value of ID from main by ID from main. What's the purpose, why are you trying to get the value you already have?
Anyway, you're not providing enough information to give you a qualified answer. You have to optimize your data structures according to the queries you'll be running.
Make sure you have indexes on the columns you use in the WHERE clause. If you're selecting by final.id_main, have an index on that column. If you're selecting by final.id_main and final.name, have a composite index on both columns, etc.
Do you really need to have the name column in both tables? It's a bad database design, unless it's some performance optimization (to avoid a join).
So, you should:
collect all queries you're currently using, set proper indexes according to them
remove any unnecessary columns (e.g. final.id, final.name)
use the EXPLAIN on your queries to get execution information (you can also use the Explain analyzer to help you interpret the results)
you can try query profiling
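The advice above could look roughly like this, sketched with Python's sqlite3 as a stand-in for MySQL. SQLite's EXPLAIN QUERY PLAN is used here in place of MySQL's EXPLAIN, its output format differs from MySQL's, and the reduced final table (no id, no name) is just one possible reading of the suggestions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE main (id INTEGER PRIMARY KEY, name TEXT, info TEXT);
    -- 1:1 with main: id_main is both the primary key and the foreign key.
    CREATE TABLE final (
        id_main INTEGER PRIMARY KEY REFERENCES main(id),
        info    TEXT
    );
    INSERT INTO main  VALUES (1, 'Peter', '5,9'), (2, 'Butters', '3,3');
    INSERT INTO final VALUES (2, '0.3,34'), (1, '5.7,0.9');
""")

# Going from main to final is a primary-key lookup on both sides.
row = conn.execute(
    """
    SELECT m.name, f.info
    FROM main AS m
    JOIN final AS f ON f.id_main = m.id
    WHERE m.id = ?
    """,
    (1,),
).fetchone()

# Ask the engine how it would run the lookup (MySQL: EXPLAIN SELECT ...).
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT info FROM final WHERE id_main = ?", (1,)
).fetchall()
plan_text = " ".join(str(r[-1]) for r in plan)

print(row)
print(plan_text)
```

If the plan reports a full scan rather than a key search, the WHERE columns are not covered by an index.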
In MySQL, an AUTO_INCREMENT column must be indexed (it has to be a key), so if you keep id, leave it as the PK and define id_main as UNIQUE.

mysql auto increment id vs combined fields primary key

I'm having a difficult time figuring out what to use for my primary key.
My table:
| gender | age | value | date updated | page id (the foreign key) |
| M | 15-24 | 100 | some date | 1
| M | 25-34 | 120 | some date | 1
| M | 35-44 | 110 | some date | 1
| F | 15-24 | 190 | some date | 1
| F | 25-34 | 230 | some date | 1
Now I need to add a primary key. I could either add an id field with auto increment and make that the PK, but that id would not be used as a foreign key or anything else in another table, so it would be kind of useless to add it.
I could also combine the page, gender and age and make them the primary key, but I am not sure what the advantage of that would be. I tried googling for a while but am still not sure what to do.
Please read the documentation of MySQL:
The primary key for a table represents the column or set of columns
that you use in your most vital queries. It has an associated index,
for fast query performance. Query performance benefits from the NOT
NULL optimization, because it cannot include any NULL values. With the
InnoDB storage engine, the table data is physically organized to do
ultra-fast lookups and sorts based on the primary key column or
columns.
If your table is big and important, but does not have an obvious
column or set of columns to use as a primary key, you might create a
separate column with auto-increment values to use as the primary key.
These unique IDs can serve as pointers to corresponding rows in other
tables when you join tables using foreign keys.
Thanks #AaronDigulla for his explanation...:
Necessary? No. Used behind the scenes? Well, it's saved to disk and
kept in the row cache, etc. Removing will slightly increase your
performance (use a watch with millisecond precision to notice).
But ... the next time someone needs to create references to this
table, they will curse you. If they are brave, they will add a PK (and
wait for a long time for the DB to create the column). If they are not
brave or dumb, they will start creating references using the business
key (i.e. the data columns) which will cause a maintenance nightmare.
Conclusion: Since the cost of having a PK (even if it's not used ATM)
is so small, let it be.
From my experience and knowledge, if you do not define a primary key, the database creates a hidden one (InnoDB, for instance, clusters the table on a hidden row ID when there is no primary key or suitable unique index). So in your situation the best solution is to create it anyway.
I don't think that choosing between an auto-increment key and a composite primary key on page id, gender and age would significantly change performance.
Anyway, a primary key on (page id, gender, age) should be a nice choice, as it also prevents duplicate entries (you can't repeat the same combination of values in another record) and leaves the table structure clearer.
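To illustrate the duplicate-prevention point, here is a short sketch with Python's sqlite3 standing in for MySQL. The table name page_stats, the column names and the dates are made up for the example; the key combines page, gender and age as described in the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE page_stats (
        page_id      INTEGER NOT NULL,
        gender       TEXT    NOT NULL,
        age          TEXT    NOT NULL,
        value        INTEGER NOT NULL,
        date_updated TEXT    NOT NULL,
        PRIMARY KEY (page_id, gender, age)   -- composite natural key
    )
""")

def add_stat(conn, page_id, gender, age, value, date_updated):
    """Insert one row; return False if the key combination already exists."""
    try:
        conn.execute(
            "INSERT INTO page_stats VALUES (?, ?, ?, ?, ?)",
            (page_id, gender, age, value, date_updated),
        )
        return True
    except sqlite3.IntegrityError:
        return False

first = add_stat(conn, 1, "M", "15-24", 100, "2024-01-01")
other_page = add_stat(conn, 2, "M", "15-24", 80, "2024-01-01")   # different page: fine
repeat = add_stat(conn, 1, "M", "15-24", 999, "2024-01-02")      # same key: rejected
print(first, other_page, repeat)
```

With only an auto-increment id as the key, the third insert would succeed and leave two conflicting rows for the same page, gender and age group.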

MySQL InnoDB hash index optimizing

I was wondering if I could optimize this further; maybe someone has struggled with the same thing.
First of all, I have this table:
CREATE TABLE `site_url` (
`id` BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
`url_hash` CHAR(32) NULL DEFAULT NULL,
`url` VARCHAR(2048) NULL DEFAULT NULL,
PRIMARY KEY (`id`),
INDEX `url_hash` (`url_hash`)
)
ENGINE=InnoDB;
where I store the site URI (the domain is in a different table, but for the purpose of this question it doesn't matter, I hope).
url_hash is the MD5 calculated from url.
It seems that all fields have a sensible length and the indexes should be correct, but there is a lot of data in it and I'm looking for more optimization.
Standard query looks like this:
select id from site_url where site_url.url_hash = MD5('something - often calculated in application rather than in mysql') and site_url.url = 'something - often calculated in application rather than in mysql'
describe gives:
+----+-------------+----------+------+---------------+----------+---------+-------+------+------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+------+---------------+----------+---------+-------+------+------------------------------------+
| 1 | SIMPLE | site_url | ref | url_hash | url_hash | 97 | const | 1 | Using index condition; Using where |
+----+-------------+----------+------+---------------+----------+---------+-------+------+------------------------------------+
But I'm wondering if I could help MySQL do that search. It must be the InnoDB engine, and I can't add a key on url because of its length.
A friend of mine told me to shorten the hash to 16 chars and store it as a number. Will an index on BIGINT be faster than one on CHAR(32)? He also suggested computing the MD5 and taking its first or last 16 chars, but I think that would cause a lot more collisions.
What are your thoughts about it?
This is your query:
select id
from site_url
where site_url.url_hash = MD5('something - often calculated in application rather than in mysql') and
site_url.url = 'something - often calculated in application rather than in mysql';
The best index for this query would be on site_url(url_hash, url, id). The caveat is that you might need to use a prefix unless you have the large prefix option set (see innodb_large_prefix).
If url_hash is the MD5 of url, why do you select by 2 keys?
select id from site_url where site_url.url_hash = MD5('something - often calculated in application rather than in mysql');
Actually you don't need the second check of site_url.url; it only guards against two URLs sharing the same hash.
But if you want, you can select by 2 fields with USE INDEX syntax:
select id from site_url USE INDEX (url_hash) where site_url.url_hash = MD5('something - often calculated in application rather than in mysql') and site_url.url = 'something - often calculated in application rather than in mysql';
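The idea of storing a shortened hash as a number, raised in the question, might be sketched as follows. Python's sqlite3 stands in for MySQL, the helper names are hypothetical, and the hash is kept as a signed 64-bit value because SQLite integers are signed (in MySQL a BIGINT UNSIGNED column would hold the unsigned form):

```python
import hashlib
import sqlite3

def url_hash64(url: str) -> int:
    """First 8 bytes (16 hex chars) of MD5(url) as a signed 64-bit integer."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE site_url (
        id       INTEGER PRIMARY KEY AUTOINCREMENT,
        url_hash INTEGER NOT NULL,   -- 8-byte key instead of CHAR(32)
        url      TEXT    NOT NULL
    );
    CREATE INDEX idx_url_hash ON site_url (url_hash);
""")

def add_url(conn, url):
    cur = conn.execute(
        "INSERT INTO site_url (url_hash, url) VALUES (?, ?)",
        (url_hash64(url), url),
    )
    return cur.lastrowid

def find_url_id(conn, url):
    """The index narrows by hash; the url comparison settles any collision."""
    row = conn.execute(
        "SELECT id FROM site_url WHERE url_hash = ? AND url = ?",
        (url_hash64(url), url),
    ).fetchone()
    return row[0] if row else None

page_id = add_url(conn, "/products?page=2")
print(find_url_id(conn, "/products?page=2") == page_id)
print(find_url_id(conn, "/missing"))
```

Truncating to 64 bits does raise the collision odds compared with the full 128-bit MD5, which is the reason the sketch keeps the comparison on url in the WHERE clause.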

What's the most efficient way to store an array of integers in a MySQL column?

I've got two tables
A:
plant_ID | name.
1 | tree
2 | shrubbery
20 | notashrubbery
B:
area_ID | name | plants
1 | forest | *needhelphere*
Now I want the area to store any number of plants, in a specific order, and some plants might show up a number of times, e.g. 2,20,1,2,2,20,1
What's the most efficient way to store this array of plants?
Keep in mind that if I search for areas with plant 2, I must not get areas which are e.g. 1,20,232,12,20 (pad with leading 0s?). What would be the query for that?
If it helps, let's assume I have a database of no more than 99999999 different plants. And yes, this question doesn't have anything to do with plants....
Bonus Question
Is it time to step away from MySQL? Is there a better DB to manage this?
If you're going to be searching both by forest and by plant, sounds like you would benefit from a full-on many-to-many relationship. Ditch your plants column, and create a whole new areas_plants table (or whatever you want to call it) to relate the two tables.
If area 1 has plants 1 and 2, and area 2 has plants 2 and 3, your areas_plants table would look like this:
area_id | plant_id | sort_idx
-----------------------------
1 | 1 | 0
1 | 2 | 1
2 | 2 | 0
2 | 3 | 1
You can then look up relationships from either side, and use simple JOINs to get the relevant data from either table. No need to muck about in LIKE conditions to figure out if it's in the list, blah, bleh, yuck. I've been there for a legacy database. No fun. Use SQL to its greatest potential.
How about this:
table: plants
plant_ID | name
1 | tree
2 | shrubbery
20 | notashrubbery
table: areas
area_ID | name
1 | forest
table: area_plant_map
area_ID | plant_ID | sequence
1 | 1 | 0
1 | 2 | 1
1 | 20 | 2
That's the standard normalized way to do it (with a mapping table).
To find all areas with a shrubbery (plant 2), do this:
SELECT *
FROM areas
INNER JOIN area_plant_map ON areas.area_ID = area_plant_map.area_ID
WHERE plant_ID = 2
You know this violates normal form?
Typically, one would have an areaplants table: area_ID, plant_ID with a unique constraint on the two and foreign keys to the other two tables. This "link" table is what gives you many-many or many-to-one relationships.
Queries on this are generally very efficient, they utilize indexes and do not require parsing strings.
8 years after this question was asked, here are 2 ideas:
1. Use the JSON type
As of MySQL 5.7.8, MySQL supports a native JSON data type defined by RFC 7159 that enables efficient access to data in JSON (JavaScript Object Notation) documents.
2. Use your own codification
Turn the plants column into a string field (varchar or text, your choice; think about performance). You can then represent the values as, for example, -21-30-2-4-20- and filter with LIKE '%-2-%'.
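A quick sketch of the second idea, using Python's sqlite3 in place of MySQL, to show why the delimiter has to appear on both ends of the string. The encode and areas_with helpers and the sample areas are assumptions for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE area (area_id INTEGER PRIMARY KEY, name TEXT, plants TEXT)"
)

def encode(plant_ids):
    """[2, 20, 1] -> '-2-20-1-' (a delimiter before and after every id)."""
    return "-" + "-".join(str(p) for p in plant_ids) + "-"

conn.execute("INSERT INTO area VALUES (1, 'forest', ?)", (encode([2, 20, 1, 2, 2, 20, 1]),))
conn.execute("INSERT INTO area VALUES (2, 'meadow', ?)", (encode([1, 20, 232, 12, 20]),))

def areas_with(conn, plant_id):
    """Areas whose list contains plant_id as a whole value."""
    pattern = f"%-{plant_id}-%"
    rows = conn.execute(
        "SELECT area_id FROM area WHERE plants LIKE ? ORDER BY area_id",
        (pattern,),
    ).fetchall()
    return [r[0] for r in rows]

# Without delimiters in the pattern, '2' also matches 20, 232 and 12.
naive = [r[0] for r in conn.execute(
    "SELECT area_id FROM area WHERE plants LIKE '%2%' ORDER BY area_id"
)]

print(areas_with(conn, 2), naive)
```

A pattern with a leading wildcard generally cannot use an ordinary index, so this kind of search tends to scan every row.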
If you somehow try one of these, I'd love it if you shared your performance results, with 100M rows as you suggested.
--
Remember that using either of these breaks first normal form, which says every column should hold a single value.
Your relation attributes should be atomic, not made up of multiple values like lists. It is too hard to search them. You need a new relation that maps the plants to the area_ID and the area_ID/plant combination is the primary key.
Use many-to-many relationship:
CREATE TABLE plant (
plant_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(255)
) ENGINE=INNODB;
CREATE TABLE area (
area_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(255)
) ENGINE=INNODB;
CREATE TABLE plant_area_xref (
plant_id INT NOT NULL,
area_id INT NOT NULL,
sort_idx INT NOT NULL,
FOREIGN KEY (plant_id) REFERENCES plant(plant_id) ON DELETE CASCADE,
FOREIGN KEY (area_id) REFERENCES area(area_id) ON DELETE CASCADE,
PRIMARY KEY (plant_id, area_id, sort_idx)
) ENGINE=INNODB;
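As a rough illustration of how the intersection table handles the ordered, repeating list from the question, here is a sketch using Python's sqlite3 in place of MySQL. The helper functions, the second area and the names of plants 12 and 232 are made up for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE plant (plant_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE area  (area_id  INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE plant_area_xref (
        plant_id INTEGER NOT NULL REFERENCES plant(plant_id),
        area_id  INTEGER NOT NULL REFERENCES area(area_id),
        sort_idx INTEGER NOT NULL,
        PRIMARY KEY (plant_id, area_id, sort_idx)
    );
    INSERT INTO plant VALUES (1, 'tree'), (2, 'shrubbery'), (12, 'moss'),
                             (20, 'notashrubbery'), (232, 'fern');
    INSERT INTO area VALUES (1, 'forest'), (2, 'meadow');
""")

def set_plants(conn, area_id, plant_ids):
    """Replace an area's list; sort_idx records the position of each entry."""
    conn.execute("DELETE FROM plant_area_xref WHERE area_id = ?", (area_id,))
    conn.executemany(
        "INSERT INTO plant_area_xref (plant_id, area_id, sort_idx) VALUES (?, ?, ?)",
        [(plant_id, area_id, idx) for idx, plant_id in enumerate(plant_ids)],
    )

def plants_in(conn, area_id):
    """The area's plants in their stored order, repeats included."""
    rows = conn.execute(
        "SELECT plant_id FROM plant_area_xref WHERE area_id = ? ORDER BY sort_idx",
        (area_id,),
    ).fetchall()
    return [r[0] for r in rows]

def areas_with(conn, plant_id):
    """Exact match on an integer column, so plant 2 never matches 20 or 232."""
    rows = conn.execute(
        "SELECT DISTINCT area_id FROM plant_area_xref WHERE plant_id = ? ORDER BY area_id",
        (plant_id,),
    ).fetchall()
    return [r[0] for r in rows]

set_plants(conn, 1, [2, 20, 1, 2, 2, 20, 1])
set_plants(conn, 2, [1, 20, 232, 12, 20])
print(plants_in(conn, 1), areas_with(conn, 2))
```

Including sort_idx in the primary key is what lets the same plant appear several times in one area.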
EDIT:
Just to answer your bonus question:
Bonus Question Is it time to step away from MySQL? Is there a better DB to manage this?
This has nothing to do with MySQL. This was just an issue with bad database design. You should use intersection tables and many-to-many relationship for cases like this in every RDBMS (MySQL, Oracle, MSSQL, PostgreSQL etc).