Workaround to use two unique keys on a partitioned table - MySQL

We have a dataset of roughly 400M rows, 200G in size. 200k rows are added in a daily batch. It mainly serves as an archive that is indexed for full text search by another application.
In order to reduce the database footprint, the data is stored in plain MyISAM.
We are considering a range-partitioned table to streamline the backup process, but cannot figure out a good way to handle unique keys. We absolutely need two of them: one to be directly compatible with the rest of the schema (e.g. custId), another to be compatible with the full-text search app (e.g. seqId).
My understanding is that a partitioned table cannot have more than one independent globally unique key, since every unique key must include all columns of the partitioning expression. We would have to merge both unique keys into one composite key (custId, seqId), which will not work in our case.
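For illustration (a simplified sketch; the columns other than custId/seqId are made up), MySQL rejects a definition like this, because each UNIQUE key would have to include the partitioning column:

CREATE TABLE archive (
  custId BIGINT NOT NULL,
  seqId BIGINT NOT NULL,
  batchDate DATE NOT NULL,
  body MEDIUMTEXT,
  UNIQUE KEY (custId),
  UNIQUE KEY (seqId)
) ENGINE=MyISAM
PARTITION BY RANGE (TO_DAYS(batchDate)) (
  PARTITION p2023 VALUES LESS THAN (TO_DAYS('2024-01-01')),
  PARTITION pmax VALUES LESS THAN MAXVALUE
);
-- ERROR 1503 (HY000): A UNIQUE INDEX must include all columns in the table's partitioning function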
Am I missing something?

Related

MySQL partitioning or NoSQL like AWS DynamoDB?

Business logic:
My application crawls a lot (hundreds or sometimes thousands) of webpages every few hours and stores all the links (i.e. all anchor tags) on each webpage in a MySQL table, say links. This table is growing very big day by day (already around 20 million records as of now).
Technical:
I have a combined unique index on (webpage_id, link) in the links table. Also, I have a crawl_count column in the same table.
Now whenever I crawl a webpage, I already know the webpage_id (the foreign key to the webpages table) and I get the links on that webpage (i.e. an array of link values), for which I just run an insert-or-update query without worrying about what is already in the table:
INSERT INTO ........ ON DUPLICATE KEY UPDATE crawl_count=crawl_count+1
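Spelled out, the statement is roughly of this shape (the column list is assumed from the description above):

INSERT INTO links (webpage_id, link, crawl_count)
VALUES (?, ?, 1)
ON DUPLICATE KEY UPDATE crawl_count = crawl_count + 1;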
Problem:
The table grows big every day & I want to optimize the table for performance. Options I considered are,
Partitioning: Partition the table by domain. All webpages belong to a particular domain. For example, the webpage https://www.amazon.in/gp/goldbox?ref_=nav_topnav_deals belongs to the domain https://www.amazon.in/
NoSQL like DynamoDB. I have other application tables in the MySQL DB which I do not want to migrate to DynamoDB unless it's absolutely required. Also, I have considered a change in application logic (e.g. changing the structure of the webpages table to something like
{webpage: "http://example.com/new-brands", links: [link1, link2, link3]}
and migrating this table to DynamoDB so I don't have a links table). But again, there is a size limit for every record in DynamoDB (400 KB). What if a record exceeds this limit?
I have read the pros & cons of each approach. As far as my understanding goes, DynamoDB doesn't seem to be a good fit for my situation, but I still wanted to post this question so I can make a good decision for this scenario.
PARTITION BY domain -- No. There won't be any performance gain. Anyway, you will find that one domain dominates the table, and a zillion domains show up only once. (I'm speaking from experience.)
The only concept of an "array" is a separate table. It would have, in your case, webpage_id and link as a 2-column PRIMARY KEY (which is 'unique').
Normalize. This is to avoid having lots of copies of each domain and each link. This saves some space.
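A rough sketch of what that normalized layout could look like (table and column names here are only illustrative; the raw link string is replaced by a url_id):

CREATE TABLE domains (
  domain_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  domain VARCHAR(255) NOT NULL,
  UNIQUE KEY (domain)
);

CREATE TABLE urls (
  url_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  domain_id INT UNSIGNED NOT NULL,
  path VARCHAR(2048) NOT NULL
);

CREATE TABLE links (
  webpage_id INT UNSIGNED NOT NULL,
  url_id INT UNSIGNED NOT NULL,
  crawl_count INT UNSIGNED NOT NULL DEFAULT 1,
  PRIMARY KEY (webpage_id, url_id)
);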
I assume you have two categories of links -- the ones for pages you have scanned, and the ones for pages waiting to scan. And probably the two sets are similar in size. I don't understand the purpose of crawl_count, but it adds to the cost.
I may be able to advise further if I could see the queries -- both inserting and selecting. Also, how big are the tables (GB) and what is the value of innodb_buffer_pool_size? Putting these together, we can discuss likely points of sluggishness.
Also the slowlog would help.
Are you dealing with non-ASCII URLs? URLs too long to index? Do you split URLs into domain + path? Do you strip off "#..."? And "?..."?

MySQL optimizing by splitting table into two

I currently have a table called map_tiles that will eventually have around a couple hundred thousand rows. Each row in this table represents an individual tile on a world map of my game. Currently, the table structure is like so:
id int(11) PRIMARY KEY
city_id int(11)
type varchar(20)
x int(11) INDEX KEY
y int(11) INDEX KEY
level int(11)
I also want to be able to store a stringified JSON object that will contain information regarding that specific tile. Since I could have 100,000+ rows, I want to optimize my queries and table design to get the best performance possible.
So here is my scenario: A player loads a position, say at 50,50 on the worldmap. We will load all tiles within a 25 tile radius of the player's coordinate. So, we will have to run a WHERE query on this table of hundreds of thousands of rows in my map_tiles table.
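In SQL terms that lookup is essentially a bounding-box query, roughly like this (using a square window as an approximation of the 25-tile radius):

SELECT id, city_id, type, x, y, level
FROM map_tiles
WHERE x BETWEEN 50 - 25 AND 50 + 25
  AND y BETWEEN 50 - 25 AND 50 + 25;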
So, would adding another field of type text called data to the existing table give better performance? My concern is that it would slow down the main query.
Or, would it be worth it to create a separate table called map_tiles_data, that just has the structure like so:
tile_id int(11) PRIMARY KEY
data text
And I could run the main query that finds the tiles within the radius of the player from the map_tiles, and then do a UNION ALL possibly that just pulls the JSON stringified data from the second table?
EDIT: Sorry, I should have clarified. The second table, if used, would not have a row for every tile in the map_tiles table. A row will only be added if data is to be stored on a map tile. So by default there will be 0 rows in the map_tiles_data table, while there could be 100,000+ rows in the map_tiles table. When a player does x action, then the game will add a row to map_tiles_data.
There is no need to store data in a separate table; you can use the same table. But you have to use the InnoDB plugin, set innodb_file_format=Barracuda, and, since data is going to be TEXT, use ROW_FORMAT=DYNAMIC (or COMPRESSED).
InnoDB will store the text outside the row page, so having data in the same table is more efficient than having it in a separate table (you can avoid joins and foreign keys). Also add an index on x and y, as all your queries are based on the location.
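A sketch of what that could look like (the innodb_file_format and innodb_file_per_table settings apply to MySQL 5.5-5.7 with the InnoDB plugin; innodb_file_format was removed in 8.0, where DYNAMIC is already the default):

SET GLOBAL innodb_file_per_table = ON;
SET GLOBAL innodb_file_format = 'Barracuda';

ALTER TABLE map_tiles
  ADD COLUMN data TEXT NULL,
  ENGINE = InnoDB,
  ROW_FORMAT = DYNAMIC;

ALTER TABLE map_tiles ADD INDEX idx_xy (x, y);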
Useful reading:
InnoDB Plugin in "Barracuda" format and ROW_FORMAT=DYNAMIC: in this format InnoDB stores either the whole blob on the row page or only a 20-byte BLOB pointer, giving preference to smaller columns to be stored on the page, which is reasonable as you can store more of them. BLOBs can have a prefix index, but this no longer requires the column prefix to be stored on the page – you can build prefix indexes on blobs which are often stored outside the page.
The COMPRESSED row format is similar to DYNAMIC when it comes to handling blobs and will use the same strategy of storing BLOBs completely off-page. It will, however, always compress blobs which do not fit on the row page, even if KEY_BLOCK_SIZE is not specified and compression for normal data and index pages is not enabled.
Don't think that I am referring only to BLOBs. From a storage perspective, BLOB, TEXT, and long VARCHAR are handled the same way by InnoDB.
Ref: https://www.percona.com/blog/2010/02/09/blob-storage-in-innodb/
The issue of storing data in one table or two tables is not really your main issue. The issue is getting the neighboring tiles. I'll return to that in a moment.
JSON can be a convenient format for flexibly storing attribute/value pairs. However, it is not so useful for accessing the data in the database. You might want to consider a hybrid form. This suggests another table, because you might want to occasionally add or remove columns.
Another consideration is maintaining history. You may want history on the JSON component, but you don't need that for the rest of the data. This suggests using a separate table.
As for optimizing the WHERE, I think you have three choices. The first is your current approach, which is not reasonable.
The second is to have a third table that contains all the neighbors within a given distance (one row per tile and per neighboring tile). Unfortunately, this method doesn't allow you to easily vary the radius, which might be desirable.
The best solution is to use a GIS approach. You can investigate MySQL's support for geographic (spatial) data types in the MySQL documentation.
Where you store your JSON really won't matter much. The main performance problem you face is the fact that your WHERE clause will not be able to make use of any indexes (because you're ultimately doing a greater / less than query rather than a fixed query). A hundred thousand rows isn't that many, so performance from this naive solution might be acceptable for your use case; ideally you should use the geospatial types supported by MySQL.
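If you do go the geospatial route, a minimal sketch might look like this (MySQL 5.7+ supports SPATIAL indexes on InnoDB; the coord column and index name are made up for the example):

ALTER TABLE map_tiles ADD COLUMN coord POINT NULL;
UPDATE map_tiles SET coord = POINT(x, y);
ALTER TABLE map_tiles
  MODIFY coord POINT NOT NULL,
  ADD SPATIAL INDEX sp_coord (coord);

-- all tiles inside a 50x50 box centered on (50, 50):
SELECT id, x, y
FROM map_tiles
WHERE MBRContains(
  ST_GeomFromText('POLYGON((25 25, 75 25, 75 75, 25 75, 25 25))'),
  coord);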

indexing varchars without duplicating the data

I have a huge data-set of ~1 billion records in the following format:
|KEY (varchar(300), UNIQUE, PK)|DATA1 (int)|DATA2 (bool)|DATA4 (varchar(10))|
Currently the data is stored in a MyISAM MySQL table, but the problem is that the key data (10G out of the 12G table size) is stored twice - once in the table and once in the index. (The data is append-only; there won't ever be an UPDATE query on the table.)
There are two major actions that run against the data-set:
contains - a simple check whether a key is found
count - aggregation functions (mostly) over the data fields
Is there a way to store the key data only once?
One idea I had is to drop the DB altogether and simply create a 2-5 character folder structure.
This way the data assigned to the key "thesimon_wrote_this" would be stored in the filesystem as
~/data/the/sim/on_/wro/te_/thi/s.data
This way the data set will function much like a B-tree, and the "contains" and data-retrieval functions will run in almost O(1) (with the obvious HDD limitations).
This makes backups pretty easy (backing up only files with the archive attribute), but the aggregating functions become almost useless, as I need to grep a billion files every time. The allocation unit size is irrelevant, as I can adjust the file structure so that only about 5% of the disk space is wasted.
I'm pretty sure that there is another, much more elegant way to do that; I just can't Google it :).
It would seem like a very good idea to consider having a fixed-width, integral key, like a 64-bit integer. Storing and searching a varchar key is very slow by comparison! You can still add an additional index on the KEY column for fast lookup, but it shouldn't be your primary key.
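For example, something along these lines (a sketch only; the table and column names just mirror the layout in the question, and the 300-character unique index assumes a charset/row format where it fits within the index length limit):

CREATE TABLE entries (
  id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  `key` VARCHAR(300) NOT NULL,
  data1 INT,
  data2 BOOL,
  data4 VARCHAR(10),
  UNIQUE KEY uk_key (`key`)
) ENGINE=InnoDB;

-- the "contains" check still runs against the secondary index:
SELECT 1 FROM entries WHERE `key` = 'thesimon_wrote_this' LIMIT 1;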

Bulk inserts of heavily indexed child items (SQL Server 2008)

I'm trying to create a data import mechanism for a database that requires high availability to readers while serving irregular bulk loads of new data as they are scheduled.
The new data involves just three tables with new datasets being added along with many new dataset items being referenced by them and a few dataset item metadata rows referencing those. Datasets may have tens of thousands of dataset items.
The dataset items are heavily indexed on several combinations of columns, with the vast majority (but not all) of reads including the dataset id in the WHERE clause. Because of the indexes, data inserts are now too slow to keep up with inflows, but because readers of those indexes take priority, I cannot remove the indexes on the main table and instead need to work on a copy.
I therefore need some kind of working table that I copy into, insert into and reindex before quickly switching it to become part of the queried table/view. The question is how do I quickly perform that switch?
I have looked into partitioning the dataset items table by a range of dataset id, which is a foreign key, but because this isn't part of the primary key, SQL Server doesn't seem to make that easy. I am not able to switch the old data partition for a readily indexed, updated version.
Different articles suggest use of partitioning, snapshot isolation and partitioned views but none directly answer this situation, being either about bulk loading and archiving of old data (partitioned by date) or simple transaction isolation without considering indexing.
Are there any examples that directly tackle this seemingly common problem?
What strategies do people have for minimizing the amount of time that indexes are disabled when bulk loading new data into large indexed tables?
Notice that partitioning on a column requires the column to be part of the clustered index key, not of the primary key. The two are independent.
Still, partitioning imposes lots of constraints on what operations you can perform on your table. For example, switching only works if all indexes are aligned and no foreign keys reference the table being modified.
If you can make use of partitioning under all of those restrictions, it is probably the best approach. Partitioned views give you more flexibility but have similar restrictions: all indexes are obviously aligned and incoming FKs are impossible.
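When the preconditions are met, the switch itself is a metadata-only operation and effectively instantaneous. A rough T-SQL sketch (object, constraint, and partition-function names here are invented):

-- load and index a staging table that exactly matches the target's structure,
-- on the same filegroup, constrained to a single partition's range
ALTER TABLE dbo.DatasetItems_Staging
  ADD CONSTRAINT ck_dataset_range CHECK (dataset_id >= 1000 AND dataset_id < 2000);

-- swap the fully indexed staging data into the corresponding (empty) target partition
ALTER TABLE dbo.DatasetItems_Staging
  SWITCH TO dbo.DatasetItems PARTITION $PARTITION.pf_dataset_id(1000);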
Partitioning data is not easy. It is not a click-through-wizard-and-be-done solution. The set of tradeoffs is very complex.

Maximum table size for a MySQL database

What is the maximum size for a MySQL table? Is it 2 million rows at 50 GB? 5 million rows at 80 GB?
At the higher end of the size scale, do I need to think about compressing the data? Or perhaps splitting the table if it grew too big?
I once worked with a very large (Terabyte+) MySQL database. The largest table we had was literally over a billion rows.
It worked. MySQL processed the data correctly most of the time. It was extremely unwieldy though.
Just backing up and storing the data was a challenge. It would take days to restore the table if we needed to.
We had numerous tables in the 10-100 million row range. Any significant joins to the tables were too time consuming and would take forever. So we wrote stored procedures to 'walk' the tables and process joins against ranges of 'id's. In this way we'd process the data 10-100,000 rows at a time (Join against id's 1-100,000 then 100,001-200,000, etc). This was significantly faster than joining against the entire table.
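Roughly the shape of those chunked joins (run in a loop from a stored procedure; the table and column names are illustrative):

SELECT a.id, a.some_col, b.other_col
FROM big_a AS a
JOIN big_b AS b ON b.a_id = a.id
WHERE a.id BETWEEN 1 AND 100000;

-- then repeat with id BETWEEN 100001 AND 200000, and so on,
-- accumulating results in the application or in a summary table.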
Using indexes on very large tables that aren't based on the primary key is also much more difficult. MySQL (with InnoDB) stores indexes in two pieces -- it stores secondary indexes (other than the primary index) as references to the primary key values. So indexed lookups are done in two parts: first MySQL goes to the secondary index and pulls from it the primary key values that it needs, then it does a second lookup on the primary key index to find where those rows are.
The net of this is that for very large tables (1-200 million plus rows) indexing is more restrictive. You need fewer, simpler indexes. And even simple SELECT statements that are not directly on an index may never come back. WHERE clauses must hit indexes, or forget about it.
But all that being said, things did actually work. We were able to use MySQL with these very large tables and do calculations and get answers that were correct.
About your first question: the effective maximum size for the database is usually determined by the operating system, specifically the maximum file size MySQL Server will be able to create, not by MySQL Server itself. Those limits play a big role in table size limits. MyISAM works differently from InnoDB, so any table will be dependent on those limits.
If you use InnoDB you will have more options for manipulating table sizes; resizing the tablespace is an option in this case, so if you plan to resize it, this is the way to go. Take a look at the "The table is full" error page.
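As a concrete example, for a MyISAM table that hits the default pointer-size limit, the documented fix is to raise MAX_ROWS/AVG_ROW_LENGTH so MySQL allocates larger row pointers (the table name and values below are placeholders):

ALTER TABLE big_table MAX_ROWS = 1000000000 AVG_ROW_LENGTH = 200;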
I am not sure of the real maximum record count for each table without all the necessary information (OS, table type, columns, data type and size of each, etc.), and I am not sure this is easy to calculate, but I've seen simple tables with around 1 billion records in a couple of cases and MySQL didn't give up.