MySQL optimizing by splitting table into two

I currently have a table called map_tiles that will eventually have a couple hundred thousand rows. Each row in this table represents an individual tile on the world map of my game. Currently, the table structure is like so:
id int(11) PRIMARY KEY
city_id int(11)
type varchar(20)
x int(11) INDEX KEY
y int(11) INDEX KEY
level int(11)
I also want to be able to store a stringified JSON object that will contain information regarding that specific tile. Since I could have 100,000+ rows, I want to optimize my queries and table design to get the best performance possible.
So here is my scenario: A player loads a position, say at 50,50 on the world map. We will load all tiles within a 25-tile radius of the player's coordinate, so we will have to run a WHERE query against the hundreds of thousands of rows in my map_tiles table.
So, would adding another field of type text called data to the existing table give better performance? My worry is that this would slow down the main query.
Or would it be worth it to create a separate table called map_tiles_data that just has a structure like so:
tile_id int(11) PRIMARY KEY
data text
And I could run the main query that finds the tiles within the radius of the player from map_tiles, and then possibly do a UNION ALL that just pulls the stringified JSON data from the second table?
EDIT: Sorry, I should have clarified. The second table, if used, would not have a row for each corresponding tile in the map_tiles table. A row will only be added if data is to be stored on a map tile. So by default there will be 0 rows in the map_tiles_data table, while there could be 100,000+ rows in the map_tiles table. When a player does x action, the game will add a row to map_tiles_data.

There is no need to store the data in a separate table; you can use the same table. But you have to use the InnoDB plugin, set innodb_file_format=barracuda, and, since data is going to be text, use ROW_FORMAT=DYNAMIC (or COMPRESSED).
InnoDB will store the text outside the row page, so having data in the same table is more efficient than having it in a separate table (you avoid joins and foreign keys). Also add an index on x and y, as all your queries are based on the location.
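Concretely, that setup might look like this (a sketch; idx_xy is an invented index name, and the file-format settings only matter on older MySQL versions, since Barracuda/DYNAMIC is the default from 5.7 on and innodb_file_format is removed in 8.0):

    -- Older MySQL (5.5/5.6) also needs innodb_file_per_table=ON:
    SET GLOBAL innodb_file_format = Barracuda;

    ALTER TABLE map_tiles ENGINE=InnoDB ROW_FORMAT=DYNAMIC;

    ALTER TABLE map_tiles
        ADD COLUMN data TEXT NULL,   -- long values are stored off the row page
        ADD INDEX idx_xy (x, y);     -- composite index for the location lookups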
Useful reading:
InnoDB Plugin in “Barracuda” format and ROW_FORMAT=DYNAMIC. In this format InnoDB stores either the whole blob on the row page or only a 20-byte BLOB pointer, giving preference to smaller columns to be stored on the page, which is reasonable as you can store more of them. BLOBs can have a prefix index, but this no longer requires the column prefix to be stored on the page – you can build prefix indexes on blobs which are often stored outside the page.
The COMPRESSED row format is similar to DYNAMIC when it comes to handling blobs and will use the same strategy of storing BLOBs completely off-page. It will, however, always compress blobs which do not fit on the row page, even if KEY_BLOCK_SIZE is not specified and compression for normal data and index pages is not enabled.
Don't think that I am referring only to BLOBs. From a storage perspective, BLOB, TEXT, and long VARCHAR are handled the same way by InnoDB.
Ref: https://www.percona.com/blog/2010/02/09/blob-storage-in-innodb/

The issue of storing data in one table or two tables is not really your main issue. The issue is getting the neighboring tiles. I'll return to that in a moment.
JSON can be a convenient format for flexibly storing attribute/value pairs. However, it is not so useful for accessing the data in the database, so you might want to consider a hybrid form. This suggests another table, because you might want to occasionally add or remove columns.
Another consideration is maintaining history. You may want history on the JSON component, but you don't need that for the rest of the data. This suggests using a separate table.
As for optimizing the WHERE clause: I think you have three choices. The first is your current approach, which is not unreasonable.
The second is to have a third table that contains all the neighbors within a given distance (one row per tile and per neighboring tile). Unfortunately, this method doesn't allow you to easily vary the radius, which might be desirable.
The best solution is to use a GIS approach. You can investigate MySQL's support for geographic data types in the MySQL documentation.
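For illustration, a sketch of what the GIS route could look like on MySQL 5.7+, where InnoDB supports SPATIAL indexes (the pt column and idx_pt index are invented names):

    -- Add a POINT column, backfill it from x/y, then index it.
    -- (SPATIAL indexes require the column to be NOT NULL.)
    ALTER TABLE map_tiles ADD COLUMN pt POINT NULL;
    UPDATE map_tiles SET pt = POINT(x, y);
    ALTER TABLE map_tiles MODIFY pt POINT NOT NULL,
                          ADD SPATIAL INDEX idx_pt (pt);

    -- All tiles within a 25-tile box around (50, 50):
    SELECT id, x, y
    FROM map_tiles
    WHERE MBRContains(
              ST_GeomFromText('POLYGON((25 25, 75 25, 75 75, 25 75, 25 25))'),
              pt);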

Where you store your JSON really won't matter much. The main performance problem you face is that your WHERE clause cannot make effective use of ordinary indexes (because you're ultimately doing greater-/less-than range comparisons on two columns rather than a fixed lookup). A hundred thousand rows isn't that many, though, so performance from this naive solution might be acceptable for your use case; ideally you should use the geospatial types supported by MySQL.
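For reference, the naive version is just a pair of range predicates (a sketch using the coordinates from the question; a composite index on (x, y) can narrow the scan on x, but y is still filtered row by row):

    -- All tiles within a 25-tile box around (50, 50):
    SELECT id, city_id, type, x, y, level
    FROM map_tiles
    WHERE x BETWEEN 50 - 25 AND 50 + 25
      AND y BETWEEN 50 - 25 AND 50 + 25;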


Is this the right time to use a solution like meilisearch?

I am caught up in a situation where I need to index some seven columns used for search and filtering on a table, but this is obviously going to hurt the performance of inserts, updates, and deletes as the dataset on the table grows (and it's going to). Now I am thinking of using a solution like MeiliSearch for search and filtering, maintaining indexes only on primary and foreign keys, and dropping the indexes on the other columns. Is this the right way to go about a problem like this?
MeiliSearch seems to fit your use case, as it can support a lot of documents with a lot of fields.
But the way to efficiently add these documents to MeiliSearch is to add them in batches. So if you have 1 million documents, you add them in batches of 1,000 instead of one batch of 1 million. You can find more information about the limitations in the MeiliSearch documentation.
Also, we are no longer limited to 10 MB! That was the previous default payload size, which is now 100 MB, meaning that you cannot make a single request larger than 100 MB. If you want to change that limit, you can do so with the appropriate flag.

Workaround to use two unique keys on a partitioned table

We have a dataset of roughly 400M rows, 200G in size. 200k rows are added in a daily batch. It mainly serves as an archive that is indexed for full text search by another application.
In order to reduce the database footprint, the data is stored in plain MyISAM.
We are considering a range-partitioned table to streamline the backup process, but cannot figure out a good way to handle unique keys. We absolutely need two of them: one to be directly compatible with the rest of the schema (e.g. custId), another to be compatible with the full-text search app (e.g. seqId).
My understanding is that partitioned tables do not support more than one globally unique key. We would have to merge both unique keys (custId, seqId), which will not work in our case.
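The restriction is easy to demonstrate (a sketch with invented names; MySQL rejects this with "A UNIQUE INDEX must include all columns in the table's partitioning function"):

    -- Fails: seqId is unique but does not include the partition column.
    CREATE TABLE archive (
        custId BIGINT NOT NULL,
        seqId  BIGINT NOT NULL,
        body   TEXT,
        PRIMARY KEY (custId),
        UNIQUE KEY (seqId)
    )
    PARTITION BY RANGE (custId) (
        PARTITION p0 VALUES LESS THAN (1000000),
        PARTITION p1 VALUES LESS THAN MAXVALUE
    );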
Am I missing something?

Is it correct to have a BLOB field directly in the main table?

Which one is better: having a BLOB field in the same table or having a 1-TO-1 reference to it in another table?
I'm making a MySQL database whose main table is called Item(ID, Description). This table is consulted by a program I'm developing in VB.NET, which offers the possibility to double-click a specific item obtained with a query. Once its dedicated form is opened, I would like to show an image stored in a BLOB field, a sort of item preview. The problem is that I don't know where it is better to create this BLOB field.
Assuming I have a table like this: Item(ID, Description, BLOB), will the BLOB field affect the database performance on queries like:
SELECT ID, Description FROM Item;
If yes, what do you think about this solution:
Item(ID, Description)
Images(Item, File)
Where Images.Item references Item.ID, and File is the BLOB field.
You can add the BLOB field directly to your main table, as BLOB fields are not stored in-row and require a separate look-up to retrieve their contents. Your dependent table is unnecessary.
BUT another, and preferred, way is to store in your database table only a pointer to your image file (the path to the file on the server). That way you can retrieve the path and access the file from your VB.NET application.
To quote the documentation about blobs:
Each BLOB or TEXT value is represented internally by a separately allocated object. This is in contrast to all other data types, for which storage is allocated once per column when the table is opened.
In simpler terms, the blob's contents aren't stored inside the table's row, only a pointer is, which is pretty similar to what you're trying to achieve with the secondary table. To make a long story short: there's no need for another table; MySQL already does the same thing internally.
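A sketch of what that means in practice (Preview is an invented column name):

    CREATE TABLE Item (
        ID          INT AUTO_INCREMENT PRIMARY KEY,
        Description VARCHAR(255),
        Preview     MEDIUMBLOB     -- long values live off-row; the row keeps a pointer
    ) ENGINE=InnoDB;

    -- This query never needs to read the off-row image data:
    SELECT ID, Description FROM Item;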
Most of what has been said in the other Answers is mostly correct. I'll start from scratch, adding some caveats.
The two-table, 1-1, design is usually better for MyISAM, but not for InnoDB. The rest of my Answer applies only to InnoDB.
"Off-record" storage may happen to BLOB, TEXT, and 'large' VARCHAR and VARBINARY, almost equally.
"Large" columns are usually stored "off-record", thereby providing something very similar to your 1-1 design. However, by having InnoDB do the work usually leads to better performance.
The ROW_FORMAT and the size of the column make a difference.
A "small" BLOB may be stored on-record. Pro: no need for the extra fetch when you include the blob in the SELECT list. Con: clutter.
Some ROW_FORMATs cut off at 767 bytes.
Some ROW_FORMATs store 20 bytes on-record; this is just a 'pointer'; the entire blob is off-record.
etc, etc.
Off-record is beneficial when you need to filter out a bunch of rows, then fetch only a few. Also, when you don't need the column.
As a side note, TINYTEXT is possibly useless. There are situations where the 'equivalent' VARCHAR(255) performs better.
Storing an image in the table (on- or off-record) is arguably unwise if that image will be used in an HTML page. HTML is quite happy to request the <img src=...> from your server or even some other server. In this case, a smallish VARCHAR containing a url is the 'correct' design.
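In schema terms, that last suggestion is simply (a sketch; ImageUrl is an invented name):

    -- Store a URL (or server path) instead of the image bytes:
    ALTER TABLE Item
        ADD COLUMN ImageUrl VARCHAR(255) NULL;   -- e.g. '/images/item-42.png'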

indexing varchars without duplicating the data

I have a huge data-set of ~1 billion records in the following format:
|KEY (varchar(300), UNIQUE, PK) | DATA1 (int) | DATA2 (bool) | DATA4 (varchar(10)) |
Currently the data is stored in a MyISAM MySQL table, but the problem is that the key data (10G out of the 12G table size) is stored twice: once in the table and once in the index. (The data is append-only; there won't ever be an UPDATE query on the table.)
There are two major actions that run against the data-set:
contains - Simple check if a key is found
count - Aggregation (mostly) functions according to the data fields
Is there a way to store the key data only once?
One idea I had is to drop the DB altogether and simply create a 2–5 character folder structure.
This way the data assigned to the key "thesimon_wrote_this" would be stored in the filesystem as
~/data/the/sim/on_/wro/te_/thi/s.data
This way the data set will function much like a B-tree, and the "contains" and data-retrieval functions will run in almost O(1) (with the obvious HDD limitations).
This makes the backups pretty easy (backing up only files with the archive attribute), but the aggregating functions become almost useless, as I would need to grep a billion files every time. The allocation unit size is irrelevant, as I can adjust the file structure so that only 5% of the disk space is wasted.
I'm pretty sure that there is another, much more elegant, way to do this, but I can't Google it out :).
It would seem like a very good idea to consider having a fixed-width, integral key, like a 64-bit integer. Storing and searching a varchar key is very slow by comparison! You can still add an additional index on the KEY column for fast lookup, but it shouldn't be your primary key.
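A sketch of that layout (invented names; note the varchar is still stored once in the table and once in its UNIQUE index, but the primary key that every lookup pivots on is now small and fixed-width):

    CREATE TABLE big_set (
        id    BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,  -- fixed-width surrogate key
        k     VARCHAR(300) NOT NULL,
        data1 INT,
        data2 BOOL,
        data4 VARCHAR(10),
        UNIQUE KEY idx_k (k)          -- supports the 'contains' check
    );

    -- 'contains' becomes a single index lookup:
    SELECT 1 FROM big_set WHERE k = 'thesimon_wrote_this' LIMIT 1;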

Does MySQL store it in an optimal way if the same string is stored in multiple rows?

I have a table where one of the columns is a sort of id string used to group several rows from the table. Let's say the column name is "map" and one of the values for map is e.g. "walmart". The column has an index on it, because I use it to filter the rows which belong to a certain map.
I have lots of such maps, and I don't know how much space the different map values take up in the table. Does MySQL recognize that the same map value is stored in multiple rows, store it only once internally, and just reference it with an internal numeric id?
Or do I have to replace the map string with a numeric id explicitly and use a different table to pair map strings to ids if I want to decrease the size of the table?
MySQL will store the whole data for every row, regardless of whether the data already exists in a different row.
If you have a limited set of options, you could use an ENUM field, else you could pull the names into another table and join on it.
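The lookup-table variant would look something like this (a sketch; maps and map_id are invented names, and your_table stands in for the asker's table):

    CREATE TABLE maps (
        id   INT AUTO_INCREMENT PRIMARY KEY,
        name VARCHAR(50) NOT NULL UNIQUE      -- 'walmart' is stored exactly once
    );

    -- The main table keeps only the small integer id:
    ALTER TABLE your_table ADD COLUMN map_id INT NOT NULL;

    -- Filtering by map name now goes through a join:
    SELECT t.*
    FROM your_table t
    JOIN maps m ON m.id = t.map_id
    WHERE m.name = 'walmart';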
I think MySQL will duplicate your content each time: it stores data row by row, unless you explicitly specify otherwise (putting the data in another table, like you suggested).
Using another table will mean you need to add a JOIN to some of your queries: you might want to think a bit about the size of your data (is it really that big?) compared to the (small?) performance loss you may encounter because of that join.
Another solution would be using an ENUM datatype, at least if you know in advance which strings you will have in your table and there are only a few of them.
Finally, another solution might be to store an integer "code" corresponding to each string, and have those codes translated to strings by your application, totally outside of the database (or use some table to store the correspondences, but have that table cached by your application instead of using joins in SQL queries).
It would not be as "clean", but might be better for performance -- still, this may be some kind of micro-optimization that is not necessary in your case...
If you are using the same values over and over again, then there is a good functional reason to move it to a separate table, totally aside from disk space considerations: To avoid problems with inconsistent data.
Suppose you have a table of Stores, which includes a column for StoreName. Among the values in StoreName "WalMart" occurs 300 times, and then there's a "BalMart". Is that just a typo for "WalMart", or is that a different store?
Also, if there's other data associated with a store that would be constant across the chain, you should store it just once and not repeatedly.
Of course, if you're just showing locations on a map and you really don't care what they are, it's just a name to display, then this would all be irrelevant.
And if that's the case, then buying a bigger disk is probably a simpler solution than redesigning your database just to save a few bytes per record. Because if we're talking arbitrary strings for place names here, then trying to find duplicates and have look-ups for them is probably a lot of work for very little gain.