Cassandra + MySQL

Hi, I am new to Cassandra and have a little confusion about the DB design in the scenario below.
Currently I have 3 tables: Post, User, PostLike.
Post : stores post info
User : stores user info
PostLike :
CREATE TABLE PostLike (
    like_time timestamp,
    post_id bigint,
    user_id bigint,
    PRIMARY KEY (like_time, post_id, user_id)
);
like_time : used to store posts ordered by like time
(Cassandra provides this ordering via the OrderPreservingPartitioner).
The requirements are:
All user IDs which liked a given post, ordered by like_time; I got them using:
select * from PostLike where post_id = ?
All posts liked by a user:
select * from PostLike where user_id = ? -- this gave an error:
[Invalid query] message="PRIMARY KEY column "post_id" cannot be
restricted (preceding column "ColumnDefinition{name=user_id,
type=org.apache.cassandra.db.marshal.LongType, kind=CLUSTERING_COLUMN,
componentIndex=0, indexName=null, indexType=null}" is either not
restricted or by a non-EQ relation)"
Please suggest what I need to do here:
use MySQL together with Cassandra for these relations,
OR
create 2 separate tables in Cassandra:
CREATE TABLE PostLike (
    like_time timestamp,
    post_id bigint,
    PRIMARY KEY (like_time, post_id)
);
CREATE TABLE UserLike (
    like_time timestamp,
    user_id bigint,
    PRIMARY KEY (like_time, user_id)
);
or any other solution. Please help.

First of all, you are getting that error because you are specifying the second part of the primary key without specifying the first part. When querying Cassandra by a compound primary key, you cannot skip parts of the key. You can leave parts off the end of the key (as in, querying by the partitioning key alone; see below), but it won't work if you try to skip parts of the key.
Next, secondary indexes do not work the same in Cassandra as they do in MySQL. In Cassandra, they are provided for convenience, not for performance. The cardinality of post_id and user_id will likely be too high to be efficient. Especially in a large cluster with millions of rows, query performance drops off significantly on a high-cardinality secondary index.
The proper way to solve this is to use your second option (as etherbunny recommended), but with a re-ordering of your primary keys.
CREATE TABLE PostLike (
    like_time timestamp,
    post_id bigint,
    PRIMARY KEY (post_id, like_time)
);
CREATE TABLE UserLike (
    like_time timestamp,
    user_id bigint,
    PRIMARY KEY (user_id, like_time)
);
The first key in a Cassandra primary key is known as the partitioning key. This key will determine which token range your row will be stored in.
The remaining keys in a Cassandra primary key are known as clustering columns. The clustering columns help to determine the on-disk sort order within a partitioning key.
That last part is important, as clustering order (as well as the ORDER BY keyword) behaves very differently from MySQL or any RDBMS. This way, if you SELECT * FROM user_like WHERE user_id=34574398 ORDER BY like_time, you should see the likes for that user_id ordered by like_time. In fact, even without the ORDER BY clause, they will still be sorted by like_time. However, if you were to SELECT * FROM user_like ORDER BY like_time, your data would not sort in the expected order, because ordering only works within a partition, i.e. when a partitioning key is specified.
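The "one table per query" pattern above can be sketched end to end. The example below uses SQLite purely to illustrate the idea (in Cassandra the dual write would be two CQL INSERTs, typically in a logged batch); the table layout follows the answer's schemas, with the non-key id added as a trailing primary-key column so two likes at the same timestamp don't collide, and the helper name is invented for the demo:

```python
import sqlite3

# SQLite stand-in for the two denormalized Cassandra tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE post_like (post_id INTEGER, like_time INTEGER, user_id INTEGER,
                            PRIMARY KEY (post_id, like_time, user_id)) WITHOUT ROWID;
    CREATE TABLE user_like (user_id INTEGER, like_time INTEGER, post_id INTEGER,
                            PRIMARY KEY (user_id, like_time, post_id)) WITHOUT ROWID;
""")

def add_like(user_id, post_id, like_time):
    # Dual write: every like lands in both tables.
    conn.execute("INSERT INTO post_like VALUES (?, ?, ?)", (post_id, like_time, user_id))
    conn.execute("INSERT INTO user_like VALUES (?, ?, ?)", (user_id, like_time, post_id))

add_like(user_id=7, post_id=100, like_time=1)
add_like(user_id=8, post_id=100, like_time=2)
add_like(user_id=7, post_id=200, like_time=3)

# Who liked post 100, ordered by like time? Read post_like only.
likers = conn.execute(
    "SELECT user_id FROM post_like WHERE post_id = 100 ORDER BY like_time").fetchall()
# Which posts did user 7 like, ordered by like time? Read user_like only.
liked = conn.execute(
    "SELECT post_id FROM user_like WHERE user_id = 7 ORDER BY like_time").fetchall()
print(likers, liked)
```

Each read hits only the table whose leading key matches the query, which is the Cassandra idiom: denormalize into one table per query rather than indexing one table for all queries.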

The error above is resolved if I create indexes:
CREATE INDEX post_id_PostLike_indx ON post_like (post_id);
CREATE INDEX user_id_PostLike_indx ON post_like (user_id);

Related

Efficient way to search database rows from a large database table mysql

A MySQL database table has millions of data records. The table consists of a primary key [say user id], a serial number [which can have duplicates], and some other columns which allow null values.
E.g., say the schema is:
CREATE TABLE IF NOT EXISTS SAMPLE_TABLE (
    USER_ID INTEGER NOT NULL,
    SERIAL_NO INTEGER NOT NULL,
    DESCRIPTION VARCHAR(100),
    PRIMARY KEY (USER_ID)
) ENGINE=INNODB;
Now I want to search a data row,based on the serial number.
I tried first adding a unique index including both columns [user id and serial no.] as
CREATE UNIQUE INDEX INDEX_USERS ON SAMPLE_TABLE (USER_ID, SERIAL_NO);
and then search for the data query based on serial number as below;
SELECT * FROM SAMPLE_TABLE WHERE SERIAL_NO=?
But it didn't succeed, and I'm getting an OOM error on the MySQL server side when I execute the above select query. I'd appreciate any help on this.
You should not have added USER_ID to the index you created. You just need an index on SERIAL_NO for that query.
If you provide the relevant code, it would be better than explanations alone. However, first you should find the ID that the serial number references, then search the column corresponding to that ID.
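The first answer's point is easy to verify: a single-column index on SERIAL_NO is exactly what the query needs. A sketch using SQLite as a stand-in for MySQL (table contents invented), where EXPLAIN QUERY PLAN confirms the lookup goes through the index:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE SAMPLE_TABLE (
        USER_ID INTEGER NOT NULL PRIMARY KEY,
        SERIAL_NO INTEGER NOT NULL,
        DESCRIPTION VARCHAR(100)
    )
""")
conn.executemany(
    "INSERT INTO SAMPLE_TABLE VALUES (?, ?, ?)",
    [(i, i % 100, "row %d" % i) for i in range(1000)],
)
# Index on SERIAL_NO alone -- no need to prepend USER_ID.
conn.execute("CREATE INDEX IDX_SERIAL_NO ON SAMPLE_TABLE (SERIAL_NO)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM SAMPLE_TABLE WHERE SERIAL_NO = ?", (42,)
).fetchall()
print(plan[0][3])  # plan detail shows a search using IDX_SERIAL_NO
```

In MySQL the equivalent check is EXPLAIN SELECT ... WHERE SERIAL_NO = ?, which should show the single-column index in the `key` column. Note that the original composite index (USER_ID, SERIAL_NO) cannot serve this query, because SERIAL_NO is not a leftmost prefix of it.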

Comma separated list on MySQL database

I am implementing a friends list for users in my database, where the list will store the friends accountID.
I already have a similar structure in my database for achievements, where I have a separate table with pairs of accountID and achievementID, but my concern with this approach is that it is inefficient: if there are 1 million users with 100 achievements each, there are 100 million entries in this table. Trying to get every achievement for a user with a certain accountID would then be a linear scan of the table (I think).
I am considering having a comma-separated string of accountIDs for my friends list table. I realize how annoying it will be to deal with the data as a string, but at least it would be guaranteed log(n) search time for a user, with accountID as the primary key and the second column being the list string.
Am I wrong about the search time for these two different structures?
MySQL can make effective use of appropriate indexes, for queries designed to use those indexes, avoiding a "scan" operation on the table.
If you are ALWAYS dealing with the complete set of achievements for a user, retrieving the entire set, and storing the entire set, then a comma separated list in a single column can be a workable approach.
HOWEVER... that design breaks down when you want to deal with individual achievements. For example, if you want to retrieve a list of users that have a particular achievement. Now, you're doing expensive full scans of all achievements for all users, doing "string searches", dependent on properly formatted strings, and MySQL is unable to use an index scan to efficiently retrieve that set.
So, the rule of thumb, if you NEVER need to individually access an achievement, and NEVER need to remove an achievement from user in the database, and NEVER need to add an individual achievement for a user, and you will ONLY EVER pull the achievements as an entire set, and only store them as an entire set, in and out of the database, the comma separated list is workable.
I hesitate to recommend that approach, because it never turns out that way. Inevitably, you'll want a query to get a list of users that have a particular achievement.
With the comma separated list column, you're into some ugly SQL:
SELECT a.user_id
FROM user_achievement_list a
WHERE CONCAT(',',a.list,',') LIKE '%,123,%'
ugly in the sense that MySQL can't use an index range scan to satisfy the predicate; MySQL has to look at EVERY SINGLE list of achievements, and then do a string scan on each and every one of them, from the beginning to the end, to find out if a row matches or not.
And it's downright excruciating if you want to use the individual values in that list to do a join operation, to "lookup" a row in another table. That SQL just gets horrendously ugly.
And declarative enforcement of data integrity is impossible; you can't define any foreign key constraints that restrict the values that are added to the list, or remove all occurrences of a particular achievement_id from every list it occurs in.
Basically, you're "giving up" the advantages of a relational data store; so don't expect the database to be able to do any work with that type of column. As far as the database is concerned, it's just a blob of data; it might as well be a .jpg image stored in that column. MySQL isn't going to help with retrieving or maintaining the contents of that list.
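That full-scan behavior is easy to demonstrate. A sketch of the comma-separated-list hack using SQLite in place of MySQL (SQLite's || operator plays the role of CONCAT; the data is invented): the query returns the right rows, but the plan shows every row's string being examined.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_achievement_list (user_id INTEGER PRIMARY KEY, list TEXT)")
conn.executemany(
    "INSERT INTO user_achievement_list VALUES (?, ?)",
    [(1, "7,123,9"), (2, "4,5"), (3, "123")],
)
# Wrap the list in commas so ',123,' matches whole IDs only (123 but not 1234).
query = ("SELECT user_id FROM user_achievement_list "
         "WHERE ',' || list || ',' LIKE '%,123,%'")
rows = conn.execute(query).fetchall()
plan = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(rows)        # users 1 and 3 hold achievement 123
print(plan[0][3])  # a SCAN: no index can satisfy a leading-wildcard LIKE
```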
On the other hand, if you go with a design that stores the individual rows, each achievement for each user as a separate row, and you have an appropriate index available, the database can be MUCH more efficient at returning the list, and the SQL is more straightforward:
SELECT a.user_id
FROM user_achievements a
WHERE a.achievement_id = 123
A covering index would be appropriate for that query:
... ON user_achievements (achievement_id, user_id)
An index with user_id as the leading column would be suitable for other queries:
... ON user_achievements (user_id, achievement_id)
FOLLOWUP
Use EXPLAIN SELECT ... to see the access plan that MySQL generates.
For your example, retrieving all achievements for a given user, MySQL can do a range scan on the index to quickly locate the set of rows for the one user. MySQL doesn't need to look at every page in the index, the index is structured as a tree (at least, in the case of B-Tree indexes) so it can basically eliminate a whole boatload of pages it "knows" that the rows you are looking for can't be. And with the achievement_id also in the index, MySQL can return the resultset right from the index, without a need to visit the pages in the underlying table. (For the InnoDB engine, the PRIMARY KEY is the cluster key for the table, so the table itself is effectively an index.)
With a two column InnoDB table (user_id, achievement_id), with those two columns as the composite PRIMARY KEY, you would only need to add one secondary index, on (achievement_id, user_id).
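A minimal sketch of that two-column design, again with SQLite standing in for MySQL/InnoDB (the WITHOUT ROWID clause approximates InnoDB's clustered primary key; data is invented). The composite primary key serves per-user lookups, and the secondary index led by achievement_id serves the reverse direction:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE user_achievements (
        user_id INTEGER NOT NULL,
        achievement_id INTEGER NOT NULL,
        PRIMARY KEY (user_id, achievement_id)
    ) WITHOUT ROWID
""")
# Secondary index with achievement_id leading, for "who has achievement X?".
conn.execute("CREATE INDEX ach_user ON user_achievements (achievement_id, user_id)")
conn.executemany(
    "INSERT INTO user_achievements VALUES (?, ?)",
    [(1, 123), (1, 7), (2, 123), (3, 9)],
)
users = conn.execute(
    "SELECT user_id FROM user_achievements "
    "WHERE achievement_id = 123 ORDER BY user_id"
).fetchall()
print(users)  # every user holding achievement 123
```

Both columns of the secondary index cover the query, so the resultset comes straight from the index without touching the base table.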
FOLLOWUP
Q: By secondary index, do you mean a 3rd column that contains the key for the composite (userID, achievementID) table? My create table query looks like this:
CREATE TABLE `UserFriends`
(`AccountID` BIGINT(20) UNSIGNED NOT NULL
,`FriendAccountID` BIGINT(20) UNSIGNED NOT NULL
,`Key` BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT
, PRIMARY KEY (`Key`)
, UNIQUE KEY `AccountID` (`AccountID`, `FriendAccountID`)
);
A: No, I don't mean the addition of a third column. If the only two columns in the table are foreign keys to another table (it looks like they both refer to the same table), the columns are both NOT NULL, there is a UNIQUE constraint on the combination of the columns, and there are no other attributes on the table, then I would consider not using a surrogate primary key at all. I would make the UNIQUE KEY the PRIMARY KEY.
Personally, I would be using InnoDB, with the innodb_file_per_table option enabled. And my table definition would look something like this:
CREATE TABLE user_friend
( account_id BIGINT(20) UNSIGNED NOT NULL COMMENT 'PK, FK ref account.id'
, friend_account_id BIGINT(20) UNSIGNED NOT NULL COMMENT 'PK, FK ref account.id'
, PRIMARY KEY (account_id, friend_account_id)
, UNIQUE KEY user_friend_UX1 (friend_account_id, account_id)
, CONSTRAINT FK_user_friend_user FOREIGN KEY (account_id)
REFERENCES account (id) ON UPDATE CASCADE ON DELETE CASCADE
, CONSTRAINT FK_user_friend_friend FOREIGN KEY (friend_account_id)
REFERENCES account (id) ON UPDATE CASCADE ON DELETE CASCADE
) Engine=InnoDB;

Sql Query join suggestions

I was wondering: when having a parent table and a child table with a foreign key, like:
users
id | username | password |
users_blog
id | id_user | blog_title
is it ok to use id as auto increment also on the join table (users_blog), or will I have problems with query speed?
Also, I would like to know which fields to add as PRIMARY and which as INDEX in the users_blog table.
Hope the question is clear, sorry for my bad English :P
I don't think you actually need the id column in the users_blog table. I would make the id_user the primary index on that table unless you have another reason for doing so (perhaps the users_blog table actually has more columns and you are just not showing it to us?).
As far as performance, having the id column in the users_blog table shouldn't affect performance by itself but your queries will never use this index since it's very unlikely that you'll ever select data based on that column. Having the id_user column as the primary index will actually be of benefit for you and will speed up your joins and selects.
What's the cardinality between the user and user_blog? If it's 1:1, why do you need an id field in the user_blog table?
is it ok to use id as auto increment also on join table (users_blog)
or will i have problems of query speed?
Whether a field is auto-increment or not has no impact on how quickly you can retrieve data that is already in the database.
also i would like to know which fields to add as PRIMARY and which as
INDEX in users_blog table?
The purpose of PRIMARY KEY (and other constraints) is to enforce the correctness of data. Indexes are "just" for performance.
So what fields will be in PRIMARY KEY depends on what you wish to express with your data model:
If a users_blog row is identified with the id alone (i.e. there is a "non-identifying" relationship between these two tables), put id alone in the PRIMARY KEY.
If it is identified by a combination of id_user and id (aka. "identifying" relationship) then you'll have these two fields together in your PK.
As of indexes, that depends on how you are going to access your data. For example, if you do many JOINs you may consider an index on id_user.
A good tutorial on index performance can be found at:
http://use-the-index-luke.com
I don't see any problem with having an auto increment id column on users_blog.
The primary key can be id_user, id. As for indexing, this heavily depends on your usage.
I doubt you will be having any database related performance issue with a blog engine though, so indexing or not doesn't make much of a difference.
You don't have to use an id column in the users_blog table; you can join id_user against the users table. Also, auto increment is not a problem for performance.
It is a good idea to have an identifier column that is auto increment - this guarantees a way of uniquely identifying the row (in case all other columns are the same for two rows)
id is a good name for all table keys and it's the standard
<table>_id is the standard name for foreign keys - in your case use user_id (not id_user as you have)
mysql automatically creates indexes for columns defined as primary or foreign keys - there is no need to do anything here
IMHO, table names should be singular - ie user not users
You SQL should look something like:
create table user (
id int not null auto_increment primary key,
...
);
create table user_blog (
id int not null auto_increment primary key,
id_user int not null references user,
...
);
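The layout above can be sketched as a runnable example, using SQLite in place of MySQL (the AUTOINCREMENT syntax differs slightly, and the sample data is invented). Per the earlier advice, an index on the foreign-key column user_id is added so the join is an index search rather than a scan of user_blog for every user row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        username TEXT NOT NULL
    );
    CREATE TABLE user_blog (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        user_id INTEGER NOT NULL REFERENCES user(id),
        blog_title TEXT
    );
    -- Index the FK column: this is what speeds up the join.
    CREATE INDEX user_blog_user_id ON user_blog (user_id);
""")
conn.execute("INSERT INTO user (username) VALUES ('alice')")
conn.execute("INSERT INTO user_blog (user_id, blog_title) VALUES (1, 'My blog')")

rows = conn.execute("""
    SELECT u.username, b.blog_title
    FROM user u JOIN user_blog b ON b.user_id = u.id
""").fetchall()
print(rows)
```

Note that in SQLite (unlike InnoDB with foreign keys) the FK column gets no automatic index, which is why it is created explicitly here; in MySQL/InnoDB the engine creates one for you if none exists.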

MySQL index design with table partitioning

I have 2 MySQL tables with the following schemas for a web site that's kinda like a magazine.
Article (articleId int auto_increment,
         title varchar(100),
         titleHash guid, -- a hash of the title
         articleText varchar(4000),
         userId int)
User (userId int auto_increment,
      userName varchar(30),
      email etc...)
The most important query is;
select title,articleText,userName,email
from Article inner join user
on article.userId = user.UserId
where titleHash = <some hash>
I am thinking of using the articleId and titleHash columns together as a clustered primary key for the Article table, and userId and userName as a primary key for the User table,
as the searches will be based on the titleHash and userName columns.
Also, titleHash and userName are unique by design and will not normally change.
The articleId and userId columns are not business keys and are not visible to the application, so they'll only be used for joins.
I'm going to use MySQL table partitioning on the titleHash column, so the selects will be faster because the DB can use partition elimination based on that column.
I'm using InnoDB as the storage engine.
So here are my questions;
Do I need to create another index on the titleHash column, given that the primary key (articleId, titleHash) is not good for searches on the titleHash column, as it is the second column of the primary key?
What are the problems with this design?
I need the selects to be very fast and expect the tables to have millions of rows. Please note that the int id columns are not visible to the business layer and can never be used to find a record.
I'm coming from a SQL Server background and am going to use MySQL, because partitioning on SQL Server would cost me a fortune, as it is only available in the Enterprise edition.
So DB gurus, please help me; many thanks.
As written, your "most important query" doesn't actually appear to involve the User table at all. If there isn't just something missing, the best way to speed this up will be to get the User table out of the picture and create an index on titleHash. Boom, done.
If there's another condition on that query, we'll need to know what it is to give any more specific advice.
Given your changes, all that should be necessary as far as keys should be:
On Article:
PRIMARY KEY (articleId) (no additional columns, don't try to be fancy)
KEY (userId)
UNIQUE KEY (titleHash)
On User:
PRIMARY KEY (userId)
Don't try to get fancy with composite primary keys. Primary keys which just consist of an autoincrementing integer are handled more efficiently by InnoDB, as the key can be used internally as a row ID. In effect, you get one integer primary key "for free".
Above all else, test with real data and look at the results from EXPLAINing your query.
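A sketch of those keys, with SQLite standing in for InnoDB (titleHash is stored as TEXT here, since SQLite has no guid type, and the schema columns come from the question). EXPLAIN shows the important query resolving through the unique titleHash index rather than the primary key:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE User (
        userId INTEGER PRIMARY KEY AUTOINCREMENT,
        userName VARCHAR(30),
        email TEXT
    );
    CREATE TABLE Article (
        articleId INTEGER PRIMARY KEY AUTOINCREMENT,  -- plain auto-increment PK
        title VARCHAR(100),
        titleHash TEXT,
        articleText VARCHAR(4000),
        userId INTEGER REFERENCES User(userId)
    );
    CREATE UNIQUE INDEX article_title_hash ON Article (titleHash);
    CREATE INDEX article_user_id ON Article (userId);
""")
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT a.title, a.articleText, u.userName, u.email "
    "FROM Article a JOIN User u ON a.userId = u.userId "
    "WHERE a.titleHash = ?", ("abc",)
).fetchall()
for row in plan:
    print(row[3])  # Article is searched via the unique titleHash index
```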

primary key with multiple column lookup

I am new to databases. If I make a primary key with two columns, userId and dayOfWeek. How can I search this primary key?
(dayOfWeek is an int where 1 = Monday, 2 = Tuesday, etc.)
Would I use:
SELECT *
FROM mytable
WHERE userId = 1
AND dayOfWeek = 4
Would that scan the entire database or use information provided by the primary key? Is there another more appropriate query I could use?
A primary key index is internally much like any other index, except you're explicitly saying that THIS is the key which uniquely identifies a record. It's simply an index with a "must be unique" constraint imposed on it.
Whether the DB uses the composite primary key depends on how you specify the key fields in the query. Given your PK(userID, dayOfWeek), then
SELECT * FROM mytable WHERE (userID = 1);
SELECT * FROM mytable WHERE (userID = 1) AND (dayOfWeek = 4);
would both use the primary key index, because you've used the fields in the order they're specified within the key.
However,
SELECT * FROM mytable WHERE (dayOfWeek = 4)
will not, because userID was not specified, and userID comes before dayOfWeek in the key definition.
No query will scan the entire database, unless you specified all the tables within that database. At worst, you could expect a table scan (which is what I think you really meant) if you were searching by columns that were not the primary key or indexed.
Your example is a composite primary key because it uses more than one column as the key. MySQL automatically indexes the primary key (since v5?), so searching by all of the primary key columns is less likely to result in a table scan and more likely in an index seek/scan, depending on whether any other criteria are used. Searching by part of the primary key (i.e. user_id only) can still make use of the index, provided the columns you search by form a leftmost prefix of the key; since user_id is the first column from the left, the index can be used. Otherwise not.
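The leftmost-prefix rule is easy to check with EXPLAIN. A sketch using SQLite (WITHOUT ROWID approximates a clustered composite primary key; table name and data are from the example above or invented): queries restricting userId, alone or with dayOfWeek, search via the key, while dayOfWeek alone falls back to a scan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE mytable (
        userId INTEGER NOT NULL,
        dayOfWeek INTEGER NOT NULL,
        note TEXT,
        PRIMARY KEY (userId, dayOfWeek)
    ) WITHOUT ROWID
""")
conn.executemany("INSERT INTO mytable VALUES (?, ?, ?)",
                 [(1, 4, "gym"), (1, 5, "rest"), (2, 4, "run")])

def plan(where):
    # Return the plan detail line for a given WHERE clause.
    return conn.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM mytable WHERE " + where
    ).fetchall()[0][3]

print(plan("userId = 1"))                    # search via the primary key
print(plan("userId = 1 AND dayOfWeek = 4"))  # search via the primary key
print(plan("dayOfWeek = 4"))                 # scan: key prefix not supplied
```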