MySQL - Searching a row by its row number?

Question from a total MySQL newbie. I'm trying to build a table containing information about machine parts (screws, brackets, cylinders, etc), and each part corresponds to a machine that the part belongs to. The database will be designed so that whenever the client reads from the table, all of the parts from one specified machine will be selected. I'm trying to figure out the fastest way in which all rows falling under a certain category can be read from the disk.
Sorting the table is not an option as many people might be adding rows to the table at once. Using a table for each machine is not practical either, since new machines might be created. I expect it to have to handle lots of INSERT and SELECT operations, but almost no DELETE operations. I've come up with a plan to quickly identify each part belonging to any machine, and I've come to ask if it's practical:
Each row containing the data for a machine part will contain the row number of the previous part and the next part for the same machine. A separate table will contain the row number of the last part of each machine that appears on the table. A script could follow the list of these 'pointers,' skipping to different parts of the table until all of the parts were found.
TL;DR
Would this approach of searching a row by its row number be any faster than searching instead by an integer primary key (since a primary key does not necessarily indicate a position on the table)? How much faster would it be? Would it yield noticeable performance improvements over using an index?

This would be a terrible approach. Selecting rows which match some criteria is a fundamental feature of MySQL (or any other DB engine really...).
Just create a column called machine_id in your parts table and give an id to each machine.
You could put your machines in a machines table and use their primary key in the machine_id field of the parts table.
Then all you have to do to retrieve ALL parts of machine 42 is:
SELECT * FROM parts WHERE machine_id = 42;
If your database is massive you may also consider indexing the machine_id column for better performance.
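For illustration, here is a minimal sketch of that layout (the table, column, and index names other than machine_id are assumptions, not taken from the question):

CREATE TABLE machines (
    id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) NOT NULL
);

CREATE TABLE parts (
    id         INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    machine_id INT UNSIGNED NOT NULL,
    name       VARCHAR(100) NOT NULL,
    KEY idx_parts_machine_id (machine_id),  -- the index suggested above
    CONSTRAINT fk_parts_machine FOREIGN KEY (machine_id) REFERENCES machines (id)
);

-- All parts of machine 42, as in the query above:
SELECT * FROM parts WHERE machine_id = 42;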

Related

Does it make sense to split a large table into smaller ones to reduce the number of rows (not columns)? [duplicate]

In a Rails app I have a table whose data already has hundreds of millions of records, and I'm planning to split the table into multiple tables to speed up reads and writes.
I found the gem octopus, but it is for master/slave setups; I just want to split the big table.
Or what else can I do when the table is too big?
Theoretically, a properly designed table with just the right indexes will be able to handle very large tables quite easily. As the table grows, the slowdown in queries and insertions of new records is supposed to be negligible. But in practice we find that it doesn't always work that way! However, the solution definitely isn't to split the table in two. The solution is to partition.
Partitioning takes this notion a step further, by enabling you to distribute portions of individual tables across a file system according to rules which you can set largely as needed. In effect, different portions of a table are stored as separate tables in different locations. The user-selected rule by which the division of data is accomplished is known as a partitioning function, which in MySQL can be the modulus, simple matching against a set of ranges or value lists, an internal hashing function, or a linear hashing function.
If you merely split a table, your code is going to become infinitely more complicated: each time you do an insert or a retrieval you need to figure out which split you should run that query on. When you use partitions, MySQL takes care of that detail for you, and as far as the application is concerned it's still one table.
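As a minimal sketch of what that can look like (the table and column names are hypothetical, and HASH is only one of the partitioning functions mentioned in the quote above):

CREATE TABLE readings (
    id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    recorded_at DATETIME NOT NULL,
    payload     VARCHAR(255),
    PRIMARY KEY (id)
)
PARTITION BY HASH (id)
PARTITIONS 8;

-- The application still queries one logical table; MySQL routes each row to a partition:
SELECT * FROM readings WHERE id = 123456789;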
Do you have an ID on each row? If the answer is yes, you could do something like:
CREATE TABLE table2 AS (SELECT * FROM table1 WHERE id >= (SELECT COUNT(*) FROM table1)/2);
The above statement creates a new table with roughly half of the records from table1 (assuming the IDs are contiguous and start at 1).
I don't know if you've already tried, but an index should help with speed on a big table.
CREATE INDEX index_name ON table1 (id)
Note: if you created the table with a unique constraint or primary key, there's already an index.

Which is better, using a central ID store or assigning IDs based on tables

In many ERP systems (locally) I have seen that the databases (generally MySQL) use a central key store (resource identity). Why is that?
That is, one special table is maintained in the database for ID generation; it has a single cell (the first one) holding a number (the ID) that is assigned to the next inserted tuple, i.e. common ID generation for all the tables in the same database.
This table also holds an entry with the details of the last inserted batch, i.e. when 5 tuples are inserted into table ABC and, let's say, the last ID of an item in the batch is X, then an entry like ('ABC', X) is also inserted into this table (the central key store).
Is there any significance of this architecture?
And also, where can I find case studies of common large-scale custom-built ERP systems?
If I understand this correctly, you are asking why someone would replace IDs that are unique only within a table
TABLE clients (id_client AUTO_INCREMENT, name, address)
TABLE products (id_product AUTO_INCREMENT, name, price)
TABLE orders (id_order AUTO_INCREMENT, id_client, date)
TABLE order_details (id_order_detail AUTO_INCREMENT, id_order, id_product, amount)
with global IDs that are unique within the whole database
TABLE objects (id AUTO_INCREMENT)
TABLE clients (id_object, name, address)
TABLE products (id_object, name, price)
TABLE orders (id_object, id_object_client, date)
TABLE order_details (id_object, id_object_order, id_object_product, amount)
(Of course you could still call these IDs id_product etc. rather than id_object. I only used the name id_object for clarification.)
The first approach is the common one. When inserting a new row into a table you get the next available ID for the table. If two sessions want to insert at the same time, one must wait briefly.
The second approach hence leads to sessions waiting for their turn every time they want to insert data, no matter what table, as they all get their IDs from the objects table. The big advantage is that when exporting data, you have global references. Say you export orders and the recipient tells you: "We have problems with your order details 12345. There must be something wrong with your data." Wouldn't it be great if you could tell them "12345 is not an order detail ID, but a product ID. Do you have problems importing the product, or can you tell me an order detail ID this is about?" rather than spending hours looking at an order detail record that happens to have the number 12345 but really had nothing to do with the issue?
That said, it might be a better choice to use the first approach and add a UUID to all tables you'd use for external communication. No fight for the next ID and still no mistaken IDs in communication :-)
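A hedged sketch of that hybrid on the orders table from above (the external_uuid column name is an assumption):

ALTER TABLE orders ADD COLUMN external_uuid CHAR(36) NULL;

-- Backfill existing rows, then enforce NOT NULL and uniqueness:
UPDATE orders SET external_uuid = UUID() WHERE external_uuid IS NULL;
ALTER TABLE orders
    MODIFY external_uuid CHAR(36) NOT NULL,
    ADD UNIQUE KEY uk_orders_external_uuid (external_uuid);

-- New rows simply get one at insert time:
INSERT INTO orders (id_client, `date`, external_uuid)
VALUES (42, CURDATE(), UUID());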
This is a common strategy used in data warehouses to track the batch number after a successful or failed data load. If the load fails, you store something like 'ABC', 'Batch_num' and 'Error_Code' in the batch control table, so the rest of your loading logic can decide what to do with the failure and the load is easy to track; if you want to recheck, the data can be archived. These IDs are usually generated by a database sequence. In a word, this is mostly used for monitoring purposes.
There are several more techniques, each with pros and cons. But let me start by pointing out two techniques that hit a brick wall at some point when scaling up. Let's assume you have billions of items, probably scattered across multiple servers, either by sharding or by other techniques.
Brick wall #1: UUIDs are handy because clients can create them without having to ask some central server for values. But UUIDs are very random. This means that, in most situations, each reference incurs a disk hit because the id is unlikely to be cached.
Brick wall #2: Ask a central server, which has an AUTO_INCREMENT under the covers to dole out ids. I watched a social media site that was doing nothing but collecting images for sharing crash because of this. That's in spite of there being a server whose sole purpose is to hand out numbers.
Solution #1:
Here's one (of several) solutions that avoids most problems: Have a central server that hands out 100 ids at a time. After a client uses up the 100 it has been given, it asks for a new batch. If the client crashes, some of the last 100 are "lost". Oh, well; no big deal.
That solution is upwards of 100 times as good as brick wall #2. And the ids are much less random than those for brick wall #1.
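One possible shape of such an allocator, as a sketch (the table, column, and sequence names are assumptions):

CREATE TABLE id_allocator (
    sequence_name VARCHAR(64) NOT NULL PRIMARY KEY,
    next_id       BIGINT UNSIGNED NOT NULL
);

-- A client reserves a block of 100 ids in one short transaction:
START TRANSACTION;
SELECT next_id INTO @block_start FROM id_allocator WHERE sequence_name = 'items' FOR UPDATE;
UPDATE id_allocator SET next_id = next_id + 100 WHERE sequence_name = 'items';
COMMIT;
-- The client may now hand out @block_start .. @block_start + 99 without further round trips.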
Solution #2: Each client can generate its own 64-bit, semi-sequential ids. The number includes a version number, some of the clock, a dedup part, and the client id. So it is roughly chronological (worldwide) and guaranteed to be unique, yet ids created at about the same time still have good locality of reference.
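As a rough illustration only (the bit layout below, milliseconds since a chosen epoch / client id / per-client sequence, is an assumption and omits the version bits for brevity):

-- Milliseconds since an arbitrary epoch, an assigned client id, and a per-client counter:
SET @ms       = FLOOR((UNIX_TIMESTAMP(NOW(3)) - UNIX_TIMESTAMP('2024-01-01')) * 1000);
SET @client   = 42;   -- assigned once per client
SET @sequence = 7;    -- per-client counter, wraps at 1024
SELECT (@ms << 23) | (@client << 10) | @sequence AS new_id;  -- roughly chronological, unique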
Note: My techniques can be adapted for use by individual tables or as an uber-number for all tables. That distinction may have been your 'real' question. (The other Answers address that.)
The downside to such a design is that it puts a tremendous load on the central table when inserting new data. It is a built-in bottleneck.
Some "advantages" are:
Any resource id that is found anywhere in the system can be readily identified, regardless of type.
If there is any text in the table (such as a name or description), then it is all centralized, facilitating multi-lingual support.
Foreign key references can work across multiple types.
The third is not really an advantage, because it comes with a downside: the inability to specify a specific type for foreign key references.

Seeking a performant solution for accessing unique MySQL entries

I know very little about MySQL (or web development in general). I'm a Unity game dev and I've got a situation where users (from a region whose size I haven't decided yet, possibly global) can submit entries to an online database. The users must be able to then locate their entry at any time.
For this reason, I've generated a guid from .Net (System.Guid.NewGuid()) and am storing that in the database entry. This works for me! However... I'm no expert, but my gut tells me that looking up a complex string in what could be a gargantuan table might have terrible performance.
That said, it doesn't seem like anything other than a globally unique identifier will solve my problem. Is there a more elegant solution that I'm not seeing, or a way to mitigate against any issues this design pattern might create?
Thanks!
Make sure you define the GUID column as the primary key in the MySQL table. That will cause MySQL to create an index on it, which will enable MySQL to quickly find a row given the GUID. The table might be gargantuan but (assuming a regular B-tree index) the time required for a lookup will increase logarithmically relative to the size of the table. In other words, if it requires 2 reads to find a row in a 1,000-row table, finding a row in a 1,000,000-row table will only require 2 more reads, not 1,000 times as many.
As long as you have defined the primary key, the performance should be good. This is what the database is designed to do.
Obviously there are limits to everything. If you have a billion users and they're submitting thousands of these entries every second, then maybe a regular indexed MySQL table won't be sufficient. But I wouldn't go looking for some exotic solution before you even have a problem.
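A minimal sketch of such a table (the names are assumptions); CHAR(36) matches the string form of System.Guid.NewGuid(), though BINARY(16) would keep the index smaller:

CREATE TABLE entries (
    guid       CHAR(36) NOT NULL PRIMARY KEY,
    payload    TEXT,
    created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP
);

-- The lookup walks the primary-key B-tree, so it stays fast as the table grows:
SELECT payload FROM entries
WHERE guid = '3f06af63-a93c-11e4-9797-00505690773f';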
If you have a key of the row you want, and you have an index on that key, then this query will take less than a second, even if the table has a billion rows:
SELECT ... FROM t WHERE id = 1234;
The index in question might be the PRIMARY KEY, or it could be a secondary key.
GUIDs/UUIDs should be used only if you need to manufacture unique ids in multiple clients without asking the database for an id. If you do use such, be aware that GUIDs perform poorly if the table is bigger than RAM.

Displaying statistical data on a website from a large record set

I have 4 databases with tables containing lots of data. My requirement is to show the count of all the records in these tables when hovering over the corresponding div in the UI (it is an ASP.NET website). Please note that the count may change every minute or hour, since new records can be added to or deleted from the tables by another application. The issue is that getting the count takes a lot of time (since there is a lot of data), and every mouse over makes a call to the corresponding database to take the count. Is there any better approach to implement this?
I am thinking of implementing something like as below.
http://www.worldometers.info/world-population/
But to change the figures like that every second I would need a call to the database each time, right? (To get the latest count.) Is there any better approach to showing statistics like this?
By the Way, I am using MySQL.
Thanks
You need to give more details - what table engines you are using, what your count query looks like, etc.
But assuming that you are using InnoDB and you are trying to run count(*) or count(primary_id_column), you have to remember that InnoDB has clustered primary keys, which are stored with the data pages of the rows themselves rather than in separate index pages, so the count will do a full scan over the rows.
One thing you can try is to create an additional, separate, non-primary index on a unique column (like the row's id) and make sure (using an EXPLAIN on the query) that your count uses this index.
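For example (hypothetical table and column names):

CREATE INDEX idx_big_table_id ON big_table (id);  -- secondary index, separate from the clustered PK
EXPLAIN SELECT COUNT(id) FROM big_table;          -- check that the "key" column shows idx_big_table_id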
If this does not work for you, I would suggest creating a separate table (for example with columns table_name and row_count) to store the counters, and creating triggers on insert and on delete on the other tables (the ones you need to count records in) to increment or decrement these values. From my experience (we monitor the number of records daily and hourly, on tables with hundreds of millions of records and a heavy write load of ~150 inserts/sec), this is the best solution I have come up with so far.
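A sketch of that counter-table approach (big_table and the trigger names are assumptions):

CREATE TABLE row_counts (
    table_name VARCHAR(64) NOT NULL PRIMARY KEY,
    row_count  BIGINT NOT NULL DEFAULT 0
);

-- Seed with the current (slow, one-off) count:
INSERT INTO row_counts VALUES ('big_table', (SELECT COUNT(*) FROM big_table));

CREATE TRIGGER big_table_after_insert AFTER INSERT ON big_table
FOR EACH ROW UPDATE row_counts SET row_count = row_count + 1 WHERE table_name = 'big_table';

CREATE TRIGGER big_table_after_delete AFTER DELETE ON big_table
FOR EACH ROW UPDATE row_counts SET row_count = row_count - 1 WHERE table_name = 'big_table';

-- The UI now reads one tiny row instead of counting hundreds of millions:
SELECT row_count FROM row_counts WHERE table_name = 'big_table';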

How do I design a schema to handle periodic bulk inserts/updates?

(tldr; I think that periodic updates force the table to use a natural key, and so I'll have to migrate my database schema.)
I have a production database with a table like planets, which although it has good potential natural keys (e.g., the planet names, which never really change), uses a typical incremented integer as the primary key. The planets table has a self-referencing column or two, such as parent_planet_id.
Now I'm building offline cloud-based workers that re-create subsets of the planets records each week, and they need to be integrated with the main server. My plan is:
A worker instance has a mini version of the database (same schema, but no planets records)
Once per week, the worker fires up, does all its processing, creates its 100,000 or so planets records, and exports the data. (I don't think the export format matters for this particular problem: could be mysqldump, yaml, etc.)
Then, the production server imports the records: some are new records, most are updates.
This last step is what I don't know how to solve. I'm not entirely replacing the planets table each time, so the problem is that the two databases each have their own incrementing integer PK's. And so I can't just do a simple import.
I thought about exporting without the id column, but then I realized that the self-referencing columns prevent this.
I see two possible solutions:
Redesign the schema to use a natural key for the planets table. This would be a pain.
Use UUID instead of an incrementing integer for the key. Would be easier, I think, to move to. The id's would be unique, and the new rows could be safely imported. This also avoids the issues with depending on natural data in keys.
Modify the planets table to use an alternate hierarchy technique, like nested sets, closure table, or path enumeration, and then export. This will break the ID dependency.
Or, if you still do not like the idea, consider your export and import as an ETL problem.
Modify the record during the export to include PlanetName, ParentPlanetName
Import all Planets (PlanetNames) first
Then import the hierarchy (ParentPlanetName), as sketched below
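A hedged sketch of that two-pass import (staging_planets is an assumed name for the exported rows, and it assumes planets.name carries a UNIQUE key):

-- Pass 1: insert or refresh the planets themselves, parents left alone for now.
INSERT INTO planets (name)
SELECT s.planet_name
FROM staging_planets AS s
ON DUPLICATE KEY UPDATE name = VALUES(name);

-- Pass 2: resolve ParentPlanetName back into local surrogate ids.
UPDATE planets AS p
JOIN staging_planets AS s ON s.planet_name = p.name
JOIN planets AS parent    ON parent.name   = s.parent_planet_name
SET p.parent_planet_id = parent.id;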
In any case, the surrogate key from the first DB should never leave that DB -- it has no meaning outside of it.
The best solution (in terms of design) would be to refine your key architecture and implement a composite key carrying information about when and from where the planets were imported, but you do not want to do this.
An easier (I think), yet somewhat "happy engineering" solution would be to modify the imported keys. You could do it, for example, like this (a SQL sketch of steps 3-5 follows the list):
1. lock the planets table in the main system (so no new key will appear during the import),
2. create a lookup table with two columns, ID and PLANET_NAME, based on the planets table in the main system,
3. get the maximum key value from that table,
4. increment every imported key value (identifying and referencing the parent-child planet relationship) by adding the MAX value retrieved in step #3,
5. alter the main planets table and set its current auto-increment value to the actual MAX + 1,
6. now go over the table (a cursor loop within a procedure), checking whether the current planet name has a different key in your lookup; if it does, first remove the record with that lookup key (the old one) from the table, then update the key of the currently inspected row to that old id (that row was an update),
7. unlock the table.
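A SQL sketch of steps 3-5 (staging_planets is an assumed name for the imported rows):

-- Step 3: highest key currently used in the main system.
SELECT MAX(id) INTO @offset FROM planets;

-- Step 4: shift every imported id and parent reference by that offset
-- (descending order avoids transient duplicate-key collisions while shifting a keyed column).
UPDATE staging_planets
SET id = id + @offset,
    parent_planet_id = parent_planet_id + @offset
ORDER BY id DESC;

-- Step 5: move the auto-increment watermark past the shifted range;
-- ALTER TABLE needs a literal value, so build the statement dynamically.
SELECT MAX(id) + 1 INTO @next FROM staging_planets;
SET @sql = CONCAT('ALTER TABLE planets AUTO_INCREMENT = ', @next);
PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;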
Most operations are updates
So you need a "real" merge. In other words, you'll have to identify a proper order in which you can INSERT/UPDATE the data, so no FKs are violated in the process.
I'm not sure what parent_planet_id means, but assuming it means "orbits" and the word "planet" is stretched to also include moons, imagine you have only Phobos in your master database and Mars and Deimos need to be imported. This can only be done in a certain order:
INSERT Mars.
INSERT Deimos, set its parent_planet_id so it points to Mars.
UPDATE Phobos' parent_planet_id so it points to Mars.
While you could exchange steps (2) and (3), you couldn't do either before step (1).
You'll need a recursive descent to determine the proper order and then compare natural keys¹ to see what needs to be UPDATEd and what INSERTed. Unfortunately, MySQL (before 8.0) doesn't support recursive queries, so you'll need to do it manually.
I don't quite see how surrogate keys help in this process - if anything, they add one more level of indirection you'll have to reconcile eventually.
¹ Which, unlike surrogates, are meaningful across different databases. You can't just compare auto-incremented integers because the same integer value might identify different planets in different databases - you'll have false UPDATEs. GUIDs, on the other hand, will never match, even when rows describe the same planet - you'll have false INSERTs.