How to create index on massive data (mysql) - mysql

I am currently evaluating strategy for storing supplier catalogs.
There can be multiple items in catalog vary from 100 to 0.25million.
Each item may have multiple errors. application should support browsing of catalog items
Group by Type of Error, Category, Manufacturer, Suppliers etc..
Browse items for any group, Should be able to sort and search on multiple columns (partid,
names, price etc..)
Question is when i have to provide functionality of "Multiple SEARCH and SORT and GROUP" how should i create index.
According to mysql doc & blogs for index it seems that creating index on individual column will not be used by all query.
Creating multi column index is even not specific for my case.
There might be 20 - 30 combination of group search & sort.
How do i scale and how can i make search fast.
Expecting to handle 50 million records of data.
Currently evaluating on 15 million of data.
Suggestions are welcome.
CREATE TABLE CATALOG_ITEM
(
AUTO_ID BIGINT PRIMARY KEY AUTO_INCREMENT,
TENANT_ID VARCHAR(40) NOT NULL,
CATALOG_ID VARCHAR(40) NOT NULL,
CATALOG_VERSION INT NOT NULL,
ITEM_ID VARCHAR(40) NOT NULL,
VERSION INT NOT NULL,
NAME VARCHAR(250) NOT NULL,
DESCRIPTION VARCHAR(2000) NOT NULL,
CURRENCY VARCHAR(5) NOT NULL,
PRICE DOUBLE NOT NULL,
UOM VARCHAR(10) NOT NULL,
LEAD_TIME INT DEFAULT 0,
SUPPLIER_ID VARCHAR(40) NOT NULL,
SUPPLIER_NAME VARCHAR(100) NOT NULL,
SUPPLIER_PART_ID VARCHAR(40) NOT NULL,
MANUFACTURER_PART_ID VARCHAR(40),
MANUFACTURER_NAME VARCHAR(100),
CATEGORY_CODE VARCHAR(40) NOT NULL,
CATEGORY_NAME VARCHAR(100) NOT NULL,
SOURCE_TYPE INT DEFAULT 0,
ACTIVE BOOLEAN,
SUPPLIER_PRODUCT_URL VARCHAR(250),
MANUFACTURER_PRODUCT_URL VARCHAR(250),
IMAGE_URL VARCHAR(250),
THUMBNAIL_URL VARCHAR(250),
UNIQUE(TENANT_ID,ITEM_ID,VERSION),
UNIQUE(TENANT_ID,CATALOG_ID,ITEM_ID)
);
CREATE TABLE CATALOG_ITEM_ERROR
(
ITEM_REF BIGINT,
FIELD VARCHAR(40) NOT NULL,
ERROR_TYPE INT NOT NULL,
ERROR_VALUE VARCHAR(2000)
);

If you are determined to do this solely in MySQL, then you should be creating indexes that will work for all your queries. It's OK to have 20 or 30 indexes if there are 20-30 different queries doing your sorting. But you can probalby do it with far less indexes than that.
You also need to plan how these indexes will be maintained. I'm assuming because this is for supplier catalogs that the data is not going to change much. In this case, simply creating all the indexes you need should do the job nicely. If the data rows are going to be edited or inserted frequently in realtime, then you have to consider that with your indexing - then having 20 or 30 indexes might not be such a good idea (since MySQL will be constantly having to update them all). You also have to consider which MySQL storage engine to use. If your data never changes, MyISAM (the default engine, basically fast flat files) is a good choice. If it changes a lot, then you should be using InnoDB so you can get row level locking. InnoDB would also allow you to define a clustered index, which is a special index that controls the order stuff is stored on disk. So if you had one particular query that is run 99% of the time, you could create a clustered index for it and all the data would already be in the right order on disk, and would return super super fast. But, every insert or update to the data would result in the entire table being reordered on disk, which is not fast for lots of data. You'd never use one if the data changed at all frequently, and you might have to batch load data updates (like new versions of a supplier's million rows). Again, it comes down to whether you will be updating it never, now and then, or constantly in realtime.
Finally, you should consider alternative means than doing this in MySQL. There are a lot of really good search products out there now, such as Apache Solr or Sphinx (mentioned in a comment above), which could make your life a lot easier when coding up the search interfaces themselves. You could index the catalogs in one of these and then use them provide some really awesome search features like full text and/or faceted search. It's like having a private google search engine indexing your stuff, is a good way to describe how these work. It takes time to write the code to interface with the search server, but you will most likely save that time not having to write and wrap your head around the indexing problem and other issues I mentioned above.
If you do just go with creating all the indexes though, learn how to use the EXPLAIN command in MySQL. That will let you see what MySQL's plan for executing a query will be. You can create indexes then re-run EXPLAIN on your queries and see how MySQL is going to use them. This way you can make sure that each of your query methods has indexes supporting it, and is not falling back to a scanning your entire table of data to find things. With as many rows as you're talking about, every query MUST be able to use indexes to find its data. If you get those right, it'll perform fine.

Related

Store huge json data of indefinite size with key

I need to store the geo path data comprising of geo-points which should be indexed by unique key. For example: The path traveled by vehicle indexed by its trip id. This path can be of indefinite length.
As of now, I am thinking to store the path in the form of JSON object. The options that I have in my mind are Riak and MongoDB. I want to go with open-source technology. It will be nice if it supports clustering. In case one node goes down, we won't have any downtime in our application.
MySQL is currently our source of raw data (which we will be anyhow moving to the NoSQL DB but not as of now). But with the huge amount of data (2 million geo-point entries per day), it takes MYSQL a lot of time to filter the data based on timestamp. MySQL will still be our primary data source. The solution I am looking for will act as a cache for faster path retrieval based on id.
In current MySQL schema, the fields I have are:
system_timestamp,
gps_timestamp,
speed,
lat,
lot
This table store all the geo-points of the vehicle whether vehicle is on trip or not. Here trip is based on whether driver wants to track the movement or not. If he want to track the movement, we generate a unique trip id and associate it to the driver along with the trip's start time and the end time. Later for displaying the path based on trip id, we use the start & end time of the trip to filter the data from the raw table.
I want to store the trip path into secondary database as a cache so that it's retrieval will be fast.
Which database should be my ideal choice? What other options do I have?
I'm going to go out on a limb here and say that I believe there is a less complicated way of fixing your performance issue.
I assume you are using MySQL with InnoDB and you are indexing the timestamp field(s).
If I were you, I would simply turn the relevant timestamp (system or gps) into the primary key. With InnoDB, the table data is physically organized to do ultra-fast lookups based on the primary key column(s). Also, make sure that the relevant timestamp column is of the unsigned non-null type.
Now, instead of doing a lookup for the paths in between start and end time (as you're currently doing), I would create a separate table within the same MySQL database containing pairs of trip ID/path timestamp, where "path timestamp" is the primary key from the paths table, as mentioned earlier. Primary index the trip ID. Populate this table using the same logic/mechanism you initially imagined for Riak or MongoDB. This will basically be your "caching" system, using nothing but MySQL.
A typical lookup would take the trip ID to find all of the path timestamps associated and thus all of the path data.
CREATE TABLE IF NOT EXISTS `paths` (
`system_timestamp` int(10) unsigned NOT NULL,
`gps_timestamp` int(10) NOT NULL,
`speed` smallint(8) unsigned NOT NULL,
`lat` decimal(10,6) NOT NULL,
`lng` decimal(10,6) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
ALTER TABLE `paths` ADD PRIMARY KEY (`system_timestamp`);
CREATE TABLE IF NOT EXISTS `trips` (
`trip_id` int(10) unsigned NOT NULL,
`system_timestamp` int(10) unsigned NOT NULL,
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
ALTER TABLE `trips` ADD PRIMARY KEY (`trip_id`);
SELECT * FROM `trips`
INNER JOIN `paths` ON
`trips`.`system_timestamp` = `paths`.`system_timestamp`
WHERE `trip_id` = 1;

Increase JOIN Query response time by combining tables

We have a sports shopping website that recommends products to users. our query recommends by doing a JOIN on three tables of the following effect: (1) what sports a user is interested in, (2) what products are part of that sport, and (3) eliminate products the user has already bought. We have three tables currently. The response time is 3 seconds.
In an effort to make the query response faster, we are proposing combing two tables into one table . The attached image shows the proposed logic. My question is:
is the proposed query even possible as a single query
if all else is equal, will the proposed logic be faster than the current logic - even if it is a small amount?
We are on AWS MySQL RDS. All indexes have been done correctly. Please don't discuss about migrating to Redis, MEMSql etc, i am just interested at this stage to understand if the proposed logic will be faster.
Thank you for your help!!
CREATEs
CREATE TABLE UserPreferences (
UserPreferenceId int(11) NOT NULL AUTO_INCREMENT,
UserId int(11) NOT NULL,
FamilyId int(11) NOT NULL,
InsertedDate datetime NOT NULL,
PRIMARY KEY (UserPreferenceId),
KEY userID (UserId),
KEY FamilyId (FamilyId),
KEY user (UserId),
KEY fk_UserPreferences_1 (FamilyId),
) ENGINE=InnAoDB AUTO_INCREMENT=261 DEFAULT CHARSET=utf8
CREATE TABLE ArticleToFamily (
ArticleToFamilyId int(10) unsigned NOT NULL AUTO_INCREMENT,
ArticleId int(11) DEFAULT NULL,
FamilyId int(11) unsigned NOT NULL,
InsertedDate datetime DEFAULT NULL,
Confidence int(11) NOT NULL DEFAULT '0',
Rank int(11) NOT NULL DEFAULT '0',
PRIMARY KEY ArticleToFamilyId),
KEY ArticleIdAndFamilyId` (ArticleId,FamilyId),
KEY FamilyId (FamilyId)
) ENGINE=InnoDB AUTO_INCREMENT=19795572 DEFAULT CHARSET=latin1
CREATE TABLE ItemsUserHasBought (
ItemsUserHasBoughtId int(11) NOT NULL AUTO_INCREMENT,
UserId int(11) NOT NULL,
ArticleId int(11) NOT NULL,
BuyDate datetime NOT NULL,
InsertedDate datetime NOT NULL,
UpdatedDate datetime NOT NULL,
Status char(1) NOT NULL DEFAULT '1',
PRIMARY KEY (ItemsUserHasBoughtId),
KEY ArticleId (ArticleId)
) ENGINE=InnoDB AUTO_INCREMENT=367 DEFAULT CHARSET=latin1
Don't do it.
Combining tables usually means denormalization of some kind, which is not the direction you want to be moving in a relational database. It's rarely side-effect free and often fails to achieve the desired gains. All in all, something to avoid, to be done only when all other avenues are exhausted.
Instead, check your indexes on the three tables that you have. It's likely that adding a foreign key in the right place can easily make this query run in a fraction of it's current time. Unfortunately, until we know what indexes you're already using, we can't be any more specific about how to improve it. It's also possible you're doing the right things here, and are really hitting a wall in terms of what your server is able to do... but probably not.
If indexes don't help, the next place I'd usually look is a materialized/indexed view. This is supported by Sql Server, Oracle, Postgresql, and most other modern database server engines. Sadly, like Windowing Functions, the APPLY/lateral join operation, and correct NULL handling, indexed views are among the many parts of ansi sql where MySql lags behind other dbs. MySql is sadly becoming more and more of a joke with each passing year... but then that's probably all part of Oracle's plan since the Sun acquisition. If you really want an open source DB, Postgresql has outclassed MySql for years now in pretty much every category. MySql is living now off of it's old momentum; it's popular because it's been popular, and is therefore widely available among the low-cost web hosts, but not at all because it's better.
Don't get me wrong: MySql used to be a great option. Postgresql hardly existed, and Oracle and Sql Server weren't any better back then and priced out of reach for most small businesses. But Oracle, Sql Server, Postgresql, and others have all moved on in ways that MySql hasn't. Postgresql, specifically, has gotten easier to manage while MySql has lost some of the simplicity that gave it an advantage, without picking up enough features that really matter.
But anyone can be an armchair architect, and I've editorialized way too much already. Given wholesale database change isn't likely to be an option for you by now anyway, take a long close look at your indexes. It's a good bet you'll be able to fix your problem that way. And if you can't, you can always throw more hardware at your server. Because MySql is cheaper, right?

MySql - Handle table size and performance

We are having a Analytics product. For each of our customer we give one JavaScript code, they put that in their web sites. If a user visit our customer site the java script code hit our server so that we store this page visit on behalf of this customer. Each customer contains unique domain name.
we are storing this page visits in MySql table.
Following is the table schema.
CREATE TABLE `page_visits` (
`domain` varchar(50) DEFAULT NULL,
`guid` varchar(100) DEFAULT NULL,
`sid` varchar(100) DEFAULT NULL,
`url` varchar(2500) DEFAULT NULL,
`ip` varchar(20) DEFAULT NULL,
`is_new` varchar(20) DEFAULT NULL,
`ref` varchar(2500) DEFAULT NULL,
`user_agent` varchar(255) DEFAULT NULL,
`stats_time` datetime DEFAULT NULL,
`country` varchar(50) DEFAULT NULL,
`region` varchar(50) DEFAULT NULL,
`city` varchar(50) DEFAULT NULL,
`city_lat_long` varchar(50) DEFAULT NULL,
`email` varchar(100) DEFAULT NULL,
KEY `sid_index` (`sid`) USING BTREE,
KEY `domain_index` (`domain`),
KEY `email_index` (`email`),
KEY `stats_time_index` (`stats_time`),
KEY `domain_statstime` (`domain`,`stats_time`),
KEY `domain_email` (`domain`,`email`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 |
We don't have primary key for this table.
MySql server details
It is Google cloud MySql (version is 5.6) and storage capacity is 10TB.
As of now we are having 350 million rows in our table and table size is 300 GB. We are storing all of our customer details in the same table even though there is no relation between one customer to another.
Problem 1: For few of our customers having huge number of rows in table, so performance of queries against these customers are very slow.
Example Query 1:
SELECT count(DISTINCT sid) AS count,count(sid) AS total FROM page_views WHERE domain = 'aaa' AND stats_time BETWEEN CONVERT_TZ('2015-02-05 00:00:00','+05:30','+00:00') AND CONVERT_TZ('2016-01-01 23:59:59','+05:30','+00:00');
+---------+---------+
| count | total |
+---------+---------+
| 1056546 | 2713729 |
+---------+---------+
1 row in set (13 min 19.71 sec)
I will update more queries here. We need results in below 5-10 seconds, will it be possible?
Problem 2: The table size is rapidly increasing, we might hit table size 5 TB by this year end so we want to shard our table. We want to keep all records related to one customer in one machine. What are the best practises for this sharding.
We are thinking following approaches for above issues, please suggest us best practices to overcome these issues.
Create separate table for each customer
1) What are the advantages and disadvantages if we create separate table for each customer. As of now we are having 30k customers we might hit 100k by this year end that means 100k tables in DB. We access all tables simultaneously for Read and Write.
2) We will go with same table and will create partitions based on date range
UPDATE : Is a "customer" determined by the domain? Answer is Yes
Thanks
First, a critique if the excessively large datatypes:
`domain` varchar(50) DEFAULT NULL, -- normalize to MEDIUMINT UNSIGNED (3 bytes)
`guid` varchar(100) DEFAULT NULL, -- what is this for?
`sid` varchar(100) DEFAULT NULL, -- varchar?
`url` varchar(2500) DEFAULT NULL,
`ip` varchar(20) DEFAULT NULL, -- too big for IPv4, too small for IPv6; see below
`is_new` varchar(20) DEFAULT NULL, -- flag? Consider `TINYINT` or `ENUM`
`ref` varchar(2500) DEFAULT NULL,
`user_agent` varchar(255) DEFAULT NULL, -- normalize! (add new rows as new agents are created)
`stats_time` datetime DEFAULT NULL,
`country` varchar(50) DEFAULT NULL, -- use standard 2-letter code (see below)
`region` varchar(50) DEFAULT NULL, -- see below
`city` varchar(50) DEFAULT NULL, -- see below
`city_lat_long` varchar(50) DEFAULT NULL, -- unusable in current format; toss?
`email` varchar(100) DEFAULT NULL,
For IP addresses, use inet6_aton(), then store in BINARY(16).
For country, use CHAR(2) CHARACTER SET ascii -- only 2 bytes.
country + region + city + (maybe) latlng -- normalize this to a "location".
All these changes may cut the disk footprint in half. Smaller --> more cacheable --> less I/O --> faster.
Other issues...
To greatly speed up your sid counter, change
KEY `domain_statstime` (`domain`,`stats_time`),
to
KEY dss (domain_id,`stats_time`, sid),
That will be a "covering index", hence won't have to bounce between the index and the data 2713729 times -- the bouncing is what cost 13 minutes. (domain_id is discussed below.)
This is redundant with the above index, DROP it:
KEY domain_index (domain)
Is a "customer" determined by the domain?
Every InnoDB table must have a PRIMARY KEY. There are 3 ways to get a PK; you picked the 'worst' one -- a hidden 6-byte integer fabricated by the engine. I assume there is no 'natural' PK available from some combination of columns? Then, an explicit BIGINT UNSIGNED is called for. (Yes that would be 8 bytes, but various forms of maintenance need an explicit PK.)
If most queries include WHERE domain = '...', then I recommend the following. (And this will greatly improve all such queries.)
id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
domain_id MEDIUMINT UNSIGNED NOT NULL, -- normalized to `Domains`
PRIMARY KEY(domain_id, id), -- clustering on customer gives you the speedup
INDEX(id) -- this keeps AUTO_INCREMENT happy
Recommend you look into pt-online-schema-change for making all these changes. However, I don't know if it can work without an explicit PRIMARY KEY.
"Separate table for each customer"? No. This is a common question; the resounding answer is No. I won't repeat all the reasons for not having 100K tables.
Sharding
"Sharding" is splitting the data across multiple machines.
To do sharding, you need to have code somewhere that looks at domain and decides which server will handle the query, then hands it off. Sharding is advisable when you have write scaling problems. You did not mention such, so it is unclear whether sharding is advisable.
When sharding on something like domain (or domain_id), you could use (1) a hash to pick the server, (2) a dictionary lookup (of 100K rows), or (3) a hybrid.
I like the hybrid -- hash to, say, 1024 values, then look up into a 1024-row table to see which machine has the data. Since adding a new shard and migrating a user to a different shard are major undertakings, I feel that the hybrid is a reasonable compromise. The lookup table needs to be distributed to all clients that redirect actions to shards.
If your 'writing' is running out of steam, see high speed ingestion for possible ways to speed that up.
PARTITIONing
PARTITIONing is splitting the data across multiple "sub-tables".
There are only a limited number of use cases where partitioning buys you any performance. You not indicated that any apply to your use case. Read that blog and see if you think that partitioning might be useful.
You mentioned "partition by date range". Will most of the queries include a date range? If so, such partitioning may be advisable. (See the link above for best practices.) Some other options come to mind:
Plan A: PRIMARY KEY(domain_id, stats_time, id) But that is bulky and requires even more overhead on each secondary index. (Each secondary index silently includes all the columns of the PK.)
Plan B: Have stats_time include microseconds, then tweak the values to avoid having dups. Then use stats_time instead of id. But this requires some added complexity, especially if there are multiple clients inserting data. (I can elaborate if needed.)
Plan C: Have a table that maps stats_time values to ids. Look up the id range before doing the real query, then use both WHERE id BETWEEN ... AND stats_time .... (Again, messy code.)
Summary tables
Are many of the queries of the form of counting things over date ranges? Suggest having Summary Tables based perhaps on per-hour. More discussion.
COUNT(DISTINCT sid) is especially difficult to fold into summary tables. For example, the unique counts for each hour cannot be added together to get the unique count for the day. But I have a technique for that, too.
I wouldn't do this if i were you. First thing that come to mind would be, on receive a pageview message, i send the message to a queue so that a worker can pickup and insert to database later (in bulk maybe); also i increase the counter of siteid:date in redis (for example). Doing count in sql is just a bad idea for this scenario.

How to move from MySQL to Cassandra modeling

I am trying to move from MySQL to Cassandra for a music service application I am building.
I have read the following stackexchange: MySQL Data Model to Cassandra Help?
and checked out https://wiki.apache.org/cassandra/DataModel - also the DataStax Cassandra Modeling they did with the music service also, but the documentation so far are very small and narrow that I can't ditch MySql type queries away, so I would need help on.
This is my album table that works so far in mysql
CREATE TABLE `albums` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`title` varchar(150) NOT NULL,
`description` varchar(300) NOT NULL,
`release_date` int(10) unsigned NOT NULL,
`status` enum('active','inactive','pending') NOT NULL,
`licensor_id` int(11) NOT NULL,
`score` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `status` (`status`),
KEY `licensor_id` (`licensor_id`),
KEY `batch_id` (`batch_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=1720100 ;
I also have a one to many relationship on the following tables:, artist (many artist to one album), genre(many genre to one album), songs(1 album contains many songs).
I have many pivot tables going around in order to couple these around.
So because Cassandra doesn't allow joins, I figure that doing set,list,map would help me resolve to the proper dataset.
at first my thoughts were to solve my maping by just reusing the same table:
CREATE TABLE `albums` (
`id` int(10) ,
`title` varchar(150) ,
`description` varchar(300) ,
`release_date` date ,
`status` enum('active','inactive','pending') ,
`licensor_id` int(11) ,
`data_source_provider_id` int(10) ,
`score` int(10)
`genre` <set>
`artist` <set>
PRIMARY KEY (`id`),
) ;
(apologies if the above are not the correct syntax for Cassandra, Ive only begun installing the system on a dev system)
My queries are of the following:
Give me all albums sorted by Score (Descending)
Give me all albums from a particular genre, sorted by score
Give me all albums from a particular artist, sorted by score
Give me all albums sorted by release date, then by score.
In SQL the 4 are easy when doing the join - however since Cassandra doesn't allow joins i figure that my modelling was adequent enough however #4 cannot be satisified (there are no double order by as far as i can tell).
Multiple indexes are slow, and considering that its on a large dataset (there are 1.8M records for now, but I'm planning on pumping triple the amount at least, hence why Cassandra would be useful)
My question are:
1) Is my path to go from MySQL to Cassandra correct despite being stuck on my 4 questions - or did it do it wrong? (I've done some Active Records before with MongoDB where you can have a sub entity within the document, but Cassandra only has set,list and map).
2) If I want to expand my modelling to: " I want to create a list X that contains a predefined number elements from the albums table". Would tagging each Albums element with a new field "tag" that has X be the smart way to filter things OR would it be best to create a new table, that contains all the elements that I need and just query that.
The general advice for Cassandra is to write your tables based on your queries. Don't be shy about writing the same data to multiple tables if some of those queries are not compatible with each other. (Twitter, for example would write each tweet to a table of all the followers of that user.)
That said, looking at your queries, your challenge will be that Cassandra does not inherently have a way of handling some of your sorting needs. You will need to add an analytics engine like Spark or Hadoop's M/R to sort on a non-unique (constantly changing?) field like score.
Let's look at some table definitions that will be a good start. Then you can determine if you need a full blown distributed analytics engine or whether locally sorting the results of the query will be enough.
CREATE TABLE albums(
id uuid,
title text,
description text,
releasedate timestamp,
status text,
license_id varint,
data_source_provider_id varint,
score counter,
genre set<text>,
artist set<text>,
PRIMARY KEY (id)
);
This table will store all your albums by id. Based on your use case, selecting all the albums and sorting them by score would definitely be out of the question. You could, potentially, do something clever like modulo-ing the score and putting the albums in buckets, but I'm not convinced that would scale. Any of your queries could be answered using this table plus analytics, but in the interest of completeness, let's look at some other options for putting your data in Cassandra. Each of the following tables could readily reduce the load from any analytics investigations you run that have additional parameters (like a range of dates or set of genres).
CREATE TABLE albums(
id uuid,
title text,
description text,
releasedate timestamp,
status text,
license_id varint,
data_source_provider_id varint,
score counter,
genre set<text>,
artist text,
PRIMARY KEY (artist, releasedate, title)
);
Cassandra can automatically sort immutable fields. The table above will store each artist's albums in a separate partition (each partition is colocated in your cluster and replicated based on your replication factor). If an album has multiple artists, this record would be duplicated under each artist's entry, and that's OKAY. The second and third keys (releasedate and title) are considered sorting keys. Cassandra will sort the albums first by releasedate and second by title (for the other priority, reverse their order above). Each combo of artist, releasedate and title is logically one row (although on disk, they will be stored as a widerow per artist only). For one artist, you can probably sort the entries by score locally, without direct intervention from the database.
Sorting by release date can easily be accomplished by a similar table, but changing the PRIMARY KEY to: PRIMARY KEY (releasedate, ..?). In this case, however, you probably will face a challenge in sorting (locally) if you have a substantial range of release dates.
Finally, don't try something similar for genre. Genre is too large a set to be contained in a single partition key. Hypothetically if you had a secondary way of splitting that set up, you could do PRIMARY KEY ((genre, artist)), (double parens intentional) but I don't think this fits well with your particular use case as both of such keys are required to look up an entry.

How to optimize MySQL table containing 1.6+ million records for LIKE '%abc%' querying

I have a table with this structure and it currently contains about 1.6 million records.
CREATE TABLE `chatindex` (
`timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`roomname` varchar(90) COLLATE utf8_bin NOT NULL,
`username` varchar(60) COLLATE utf8_bin NOT NULL,
`filecount` int(10) unsigned NOT NULL,
`connection` int(2) unsigned NOT NULL,
`primaryip` int(10) unsigned NOT NULL,
`primaryport` int(2) unsigned NOT NULL,
`rank` int(1) NOT NULL,
`hashcode` varchar(12) COLLATE utf8_bin NOT NULL,
PRIMARY KEY (`timestamp`,`roomname`,`username`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
Both the roomname and username columns can contain the same exact data, but the uniqueness and the important bit of each item comes from combining the timestamp with those two items.
The query that is starting to take a while (10-20 seconds) is this:
SELECT timestamp,roomname,username,primaryip,primaryport
FROM `chatindex`
WHERE username LIKE '%partialusername%'
What exactly can I do to optimize this? I can't do partialusername% because for some queries I will only have a small bit of the center of the actual username, and not the first few characters from the beginning of the actual value.
Edit:
Also, would sphinx be better for this particular purpose?
Use Fulltext indexes , these are actually designed for this purpose. Now InnoDb support fulltext indexes in MySQL 5.6.4.
Create Index on table column username (full-text indexing).
As an idea, you can create some views on this table that will contain filtered data on the basis of alphabets or other criteria and based on that your code will decide which view to use to fetch the search results.
You should use MyISAM table to do Fulltext search as it supports FULLTEXT indexes, MySQL v5.6+ is still under development phase you should not use it as a production servers and it may take ~1 year to go GA.
Now, You should convert this table as MyISAM and add FULLTEXT index which refers column in where clause:
These links can be useful:
http://dev.mysql.com/doc/refman/5.0/en/create-index.html
http://dev.mysql.com/doc/refman/5.1/en/fulltext-fine-tuning.html
On MSSQL this is a perfect case to use fulltext indexes together with CONTAIN clause. The LIKE clause fails to obtain a good performance on such big table and with so many variants of text to search for.
Take a look onto this link, there are many issues related to dinamic search conditions.
If you do an explain on the current query, you will see that you are doing a full table scan of the table which is why it is so slow. An index on username will materially speed up the search as the index can be cached by MySQL and the table row entries will only be accessed for matching users.
A fulltext index will not materially help searches like %fred% to match oldfredboy etc. so I am at loss as to why others are recommending using this. What a fulltext index does is to create a wordlist based index so that list you search for something like "explain the current query" the fulltext engine does a intersect of row IDs containing "explain" with those containing "current" and those containing "query" to get a list of ID which contain all three. Adding a fulltext index materially increases the insert , update on delete costs for the table, so it does add a performance penalty. Furthermore, you need to use the fulltext-specific "MATCH" syntax to make full use of a fulltext index.
If you do a question search on "[mysql] fulltext like" to see further discussion on this.
A normal index will do everything that you need. Searches like '%fred%' require a full scan of the index what ever you do so you need to keep the index as lean as possible. Also if a high % of hits match 'fred%', then it might also be worth first trying a like 'fred%' search first as this will do an index range scan.
One other point, why are you using the timestamp, roomname, username as the primary key? This doesn't make sense to me. If you don't use the primary key as an access path then an auto_increment id is easier. I would have thought roomname, timestamp, username would make some sense as you surely tend to access rooms within a time window.
Only add indexes that you will use.
Table index(full text indexes) is must for such high volumes of data.
Further if possible go for partitioning of table. so these will definitely improve the performance.