Alternative to storing a large number of tables in MySQL

Well, I have been working with a large amount of network data, in which I have to filter out certain IP addresses and store their communication with other IPs. But the number of IPs is huge, hundreds of thousands, so I would have to create that many tables. Ultimately my MySQL access will slow down; everything will slow down. Each table would have a few columns and many rows.
My Questions:
Is there a better way to deal with this, i.e. storing data for each IP?
Is there something like a table of tables?
[Edit]
The reason I am storing in different tables is that I have to keep removing and adding entries as time passes.
Here is the table structure:
CREATE TABLE IP(syn_time datetime, source_ip varchar(18), dest_ip varchar(18));
I use C++ with the ODBC connector to access the database.

Don't DROP/CREATE tables frequently. MySQL copes badly with that, and understandably so: creating tables should only happen once, when the database is set up on a new machine. Frequent table churn will hurt things like your buffer pool hit ratio, and disk I/O will spike.
Instead, use InnoDB or XtraDB, so you can delete old rows while inserting new ones.
Store the IP in a column of type INT(10) UNSIGNED; e.g. 192.168.10.50 would be stored as (192 * 2^24) + (168 * 2^16) + (10 * 2^8) + 50 = 3232238130.
Put all the information into one table, and just use a SELECT ... WHERE on an indexed column.
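A minimal sketch of that single-table layout (the table, column, and index names here are only illustrative, and the retention period is made up):

-- One table for all observed connections; IPs stored as unsigned integers.
CREATE TABLE ip_traffic (
    syn_time  DATETIME NOT NULL,
    source_ip INT UNSIGNED NOT NULL,
    dest_ip   INT UNSIGNED NOT NULL,
    INDEX idx_source_ip (source_ip),
    INDEX idx_syn_time (syn_time)
) ENGINE=InnoDB;

-- MySQL's INET_ATON()/INET_NTOA() convert between dotted quads and integers.
INSERT INTO ip_traffic VALUES (NOW(), INET_ATON('192.168.10.50'), INET_ATON('10.0.0.7'));

-- All traffic from one IP comes back via the index; no per-IP table needed.
SELECT syn_time, INET_NTOA(dest_ip) AS dest_ip
FROM ip_traffic
WHERE source_ip = INET_ATON('192.168.10.50');

-- Expire old rows instead of dropping and recreating tables.
DELETE FROM ip_traffic WHERE syn_time < NOW() - INTERVAL 7 DAY;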

Creating tables dynamically is almost always a bad idea. The alternative is normalisation. I won't go into the academic details, but I'll try to explain it in simpler terms.
You can separate relationships between data into three types: one-to-one, one-to-many and many-to-many. Think about how each bit of data relates to other bits and which type of relationship it has.
If a data relationship is one-to-one, then you can usually just stick it in the same row of the same table. Occasionally there may be a reason to separate it as if it were one-to-many, but generally speaking, stick it all in the same place.
If a data relationship is one-to-many, it should be referenced between two tables by its primary key (you've given each table a primary key, right?). The "many" side of one-to-many should have a field which references the primary key of the other table. This field is called a foreign key.
Many-to-many is the most complex relationship, and it sounds like you have a few of these. You have to create a join table. This table will contain two foreign key fields, one for each of the two tables. For each link between two records, you'll add one record to your join table.
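For example, a join table for a many-to-many relationship might look like this (the entity names are hypothetical):

-- Two entity tables and a join table linking them.
CREATE TABLE host (
    host_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    address INT UNSIGNED NOT NULL
) ENGINE=InnoDB;

CREATE TABLE capture_job (
    job_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    name   VARCHAR(50) NOT NULL
) ENGINE=InnoDB;

-- One row per link between a host and a job; both columns are foreign keys.
CREATE TABLE host_capture_job (
    host_id INT UNSIGNED NOT NULL,
    job_id  INT UNSIGNED NOT NULL,
    PRIMARY KEY (host_id, job_id),
    FOREIGN KEY (host_id) REFERENCES host(host_id),
    FOREIGN KEY (job_id)  REFERENCES capture_job(job_id)
) ENGINE=InnoDB;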
Hopefully this should get you started.

Related

Index every column to add foreign keys

I am currently learning about foreign keys and trying to add them wherever I can in my application to ensure data integrity. I am using InnoDB on MySQL.
My clicks table has a structure something like...
id, timestamp, link_id, user_id, ip_id, user_agent_id, ... etc., for about 12 _id columns.
Obviously these all point to other tables, so should I add a foreign key on them? MySQL creates an index automatically for every foreign key, so essentially I'll have an index on every column. Is this what I want?
FYI, this table will essentially be my bulkiest table. My research basically tells me I'm sacrificing performance for integrity, but doesn't suggest how harsh the performance drop will be.
Right before inserting such a row, you did 12 inserts or lookups to get the ids, correct? Then, as you do the INSERT, it will do 12 checks to verify that all of those ids have a match. Why bother; you just verified them with the code.
Sure, have FKs in development. But in production, you should have weeded out all the coding mistakes, so FKs are a waste.
A related tip -- Don't do all the work at once. Put the raw (not-yet-normalized) data into a staging table. Periodically do bulk operations to add new normalization keys and get the _id's back. Then move them into the 'real' table. This has the added advantage of decreasing the interference with reads on the table. If you are expecting more than 100 inserts/second, let's discuss further.
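A rough sketch of that staging approach, using a made-up user-agent lookup as the example (it assumes a UNIQUE key on user_agents.user_agent, and in practice you would also guard against rows arriving between the move and the cleanup):

-- Raw, not-yet-normalized rows land here first (only two columns shown).
CREATE TABLE clicks_staging (
    `timestamp` DATETIME     NOT NULL,
    user_agent  VARCHAR(255) NOT NULL
) ENGINE=InnoDB;

-- Periodically add any new user agents to the lookup table...
INSERT IGNORE INTO user_agents (user_agent)
SELECT DISTINCT user_agent FROM clicks_staging;

-- ...then move the staged rows into the real table with the _ids resolved
-- (the other lookups would be joined in the same way).
INSERT INTO clicks (`timestamp`, user_agent_id)
SELECT s.`timestamp`, ua.user_agent_id
FROM clicks_staging s
JOIN user_agents ua ON ua.user_agent = s.user_agent;

TRUNCATE TABLE clicks_staging;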
The generic answer is that if you considered a data item so important that you created a lookup table for the possible values, then you should create a foreign key relationship to ensure you are not getting any orphan records.
However, you should reconsider whether all the data items (fields) in your clicks table really need a lookup table. For example, the ip_id field probably represents an IP address. You could simply store the IP address directly in the clicks table; you do not really need a lookup table, since IP addresses have a wide range and are unique.
Based on the re-evaluation of the fields, you may be able to reduce the number of related tables, thus the number of foreign keys and indexes.
Here are three things to consider:
What is the ratio of reads to writes on this table? If you are reading much more often than writing, then more indexes could be good, but if it is the other way around then the cost of maintaining those indexes becomes harder to bear.
Are some of the foreign keys not very selective? If you have an index on the gender_id column then it is probably a waste of space. My general rule is that indexes without included columns should have about 1000 distinct values (unless values are unique), and then tweak from there; a quick way to estimate selectivity is sketched after this list.
Are some foreign keys rarely or never going to be used as a filter for a query? If you have a last_modified_user_id field but you never have any queries that will return a list of items which were last modified by a particular user then an index on that field is less useful.
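As a rough selectivity check (the column names are just examples from the table above):

-- Fraction of distinct values per candidate column; closer to 1 means more selective.
SELECT COUNT(DISTINCT gender_id)     / COUNT(*) AS gender_selectivity,
       COUNT(DISTINCT user_id)       / COUNT(*) AS user_selectivity,
       COUNT(DISTINCT user_agent_id) / COUNT(*) AS user_agent_selectivity
FROM clicks;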
A little bit of knowledge about indexes can go a long way. I recommend http://use-the-index-luke.com

How to make a users table easy to search in MySQL?

I have a question about a database I'm designing. I have one table that stores all user information; it contains 12 columns, including the email and password fields. I want to make this table efficient to search, and I'm looking at multiple options for doing this.
Have the user table have a primary key on the email value
I only have one table with the email values, so if emails are changed I don't have to worry about updating a bunch of tables.
Have the user table have both a user id primary key that's auto incremented and a key on the email
I need to put a key on the email because when the user logs in, they use their email.
Have a separate "registration" table that contains an index on both the email and an auto incremented user id.
I can then join this table to a user values table that uses the userid as a foreign key.
Which of these options will be most efficient if there is a large number of users (>100,000)? I want to design it right from the start so that I don't have to redesign once I see performance issues. My intuition says having a third "registration" table with just those two values would be most effective, so that I don't have to use a string comparison when looking up the bulk of the user data. But I'm not 100% sure.
I've looked through other questions and didn't get a definite answer for my type of situation. I considered the options I found and integrated my thoughts on each one above.
A table with only 12 (presumably related) columns does not stand out as an automatic candidate for normalizing. If you over-normalize, you will just have to undo your work down the road because of performance issues and over-complicated queries. The whole point of a relational database is to keep similar information in the same table so that you don't have to JOIN on every query!
Adding an index to any columns being searched (certainly on your 'email' column!) is the first route to take to optimize WHERE clause performance on your tables. And 100,000+ rows is not considered large by modern database standards.
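For example, assuming the table is called users (the password_hash column is just a placeholder):

-- A unique index enforces one account per address and turns the login
-- lookup into an index seek instead of a full table scan.
ALTER TABLE users ADD UNIQUE INDEX idx_users_email (email);

SELECT id, password_hash
FROM users
WHERE email = 'someone@example.com';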
If, after adding indexes on all frequently searched columns, you still want more performance, you can consider optimizations closer to the storage engine. For example, using the MyISAM table format with ROW_FORMAT = FIXED can offer drastic improvements over InnoDB. The prerequisites are that the table uses no variable-length fields (CHAR() instead of VARCHAR()) and that it is read much more often than it is inserted into or deleted from, because MyISAM uses table-level locking for writes, as opposed to InnoDB's row-level locking. A user information table is not likely to be inserted into or deleted from very often (compared to the number of reads/selects), and a fixed maximum of 100 or 200 characters could easily be used for the name/email/address/etc. fields, so it is likely a good candidate. Here is an article showing a 44% speedup from making this change.
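If you do try the fixed-row MyISAM route, the conversion is roughly as follows (column names and widths are placeholders, and note that you give up InnoDB's transactions and row-level locking):

-- Replace variable-length columns with fixed-length ones first...
ALTER TABLE users
    MODIFY email CHAR(100) NOT NULL,
    MODIFY name  CHAR(100) NOT NULL;

-- ...then switch the storage engine and row format.
ALTER TABLE users ENGINE=MyISAM ROW_FORMAT=FIXED;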
Only after all these steps should you consider over-normalization, table partitioning, or other such schemes.

Is it ever practical to store primary keys of one table as text in another table?

Imagine if we had millions of rows in Table A.
For each large row (10+ columns) of Table A, we might have 20+ rows that are exact duplicates except for a single column where we store an ID for Table B.
Would it be more EFFICIENT and/or MEMORY-SAVING to store in Table A the IDs for Table B in a text field, e.g. "B_ID1|B_ID2|B_ID3", return this data client-side, parse it, and then query for the actual data from Table B?
This is assuming we had 2+ million rows of unique data in Table A; if we stored that additional column outside the text field, we would add 2 million * 20+ rows to that table, with all the extra wasted space that implies.
Or am I very naive in my approach and understanding of SQL? I literally just started using it like a week ago and taught myself the basics for my app.
This is where a weak entity (table) is best used.
Instead of duplicating all the data in Table A, you simply create a new table that links A to B. In it, you have only the ID from Table A linked to the several IDs in Table B (and you set the primary key to be both of the foreign keys).
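In SQL, that linking table might look something like this (the table and column names are placeholders):

-- One row per (A, B) pairing; none of Table A's other columns are repeated.
CREATE TABLE a_b_link (
    a_id INT UNSIGNED NOT NULL,
    b_id INT UNSIGNED NOT NULL,
    PRIMARY KEY (a_id, b_id),
    FOREIGN KEY (a_id) REFERENCES table_a(id),
    FOREIGN KEY (b_id) REFERENCES table_b(id)
) ENGINE=InnoDB;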
If you find yourself duplicating a lot of data across multiple rows, it may indicate that your database isn't normalized (http://en.wikipedia.org/wiki/Database_normalization).
This means that you might be able to break it into multiple smaller tables that reference each other to avoid data duplication.
SQL provides the ability to index your table in a variety of ways. I'm not an expert on big data, but my first hunch would be no. Having an auto-incrementing, indexed primary key lets the SQL server do the work of maintaining the list of records in a way that lets it easily look up the info you need.
The real question comes down to how you need to parse and interact with these 2 million-odd rows. Is it a bunch of split document info? User profiles? Real-time input from some hardware device? Context is key to determining whether SQL is even the best way to approach the problem.
Can you give us a little context on what sort of project you're theorizing? Or is this a more hypothetical question?
UPDATE: Check out W3 Schools for a brief intro to SQL concepts (among other coding references)

DB Design - any way to avoid duplicating columns here?

I've got a database that stores hash values and a few pieces of data about the hash, all in one table. One of the fields is 'job_id', which is the ID for the job that the hash came from.
The problem I'm trying to solve is that with this design, a hash can only belong to one job - in reality a hash can occur in many jobs, and I'd like to know each job in which a hash occurs.
The way I'm thinking of doing this is to create a new table called 'Jobs', with the fields 'job_id', 'job_name' and 'hash_value'. When a new batch of data is inserted into the DB, the job ID and name would be created here, and each hash would go in here as well as into the original hash table, but in the Jobs table it would also be stored against the job.
I don't like this, because I'd be duplicating the hash column across tables. Is there a better way? I can add to the hash table but can't take away any columns because closed-source software depends on it. The hash value is the primary key. It's MySQL and the database stores many millions of records. Thanks in advance!
Adding the new job table is the way to go. It's the normative practice for representing a one-to-many relationship.
It's good to avoid unnecessary duplication of values. But in this case, you aren't really "duplicating" the hash_value column; rather, you are really defining a relationship between job and the table that has hash_value as the primary key.
The relationship is implemented by adding a column to the child table; that column holds the primary key value from the parent table. Typically, we add a FOREIGN KEY constraint on the column as well.
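Concretely, the new table might look like this (assuming the existing table is called hashes with hash_value as its primary key; the column types are placeholders):

CREATE TABLE job (
    job_id     INT UNSIGNED NOT NULL,
    job_name   VARCHAR(100) NOT NULL,
    hash_value CHAR(64)     NOT NULL,  -- must match the type/charset of hashes.hash_value
    PRIMARY KEY (job_id, hash_value),
    FOREIGN KEY (hash_value) REFERENCES hashes(hash_value)
) ENGINE=InnoDB;

-- Every job in which a given hash occurs (InnoDB indexes the FK column for you):
SELECT job_id, job_name FROM job WHERE hash_value = 'abc123';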
The problem I'm trying to solve is that with this design, a hash can only belong to one job - in reality a hash can occur in many jobs, and I'd like to know each job in which a hash occurs.
The way I'm thinking of doing this is to create a new table called 'Jobs', with fields 'job_id', 'job_name' and 'hash_value'.
As long as you can also get a) the foreign keys right and b) the cascades right for both "job_id" and "hash_value", that should be fine.
"Duplicate data" and "redundant data" are technical terms in relational modeling. "Technical term" means they have meanings that you're not likely to find in a dictionary. They don't mean "the same values appear in multiple tables." That should be obvious, because if you replace the values with surrogate ID numbers, those ID numbers will then appear in multiple tables.
Those technical terms actually mean "identical values with identical meaning." (Relevant: Hugh Darwen's article for definition and use of predicates.)
There might be good, practical reasons for replacing text with an ID number, but there are no theoretical reasons to do that, and normalization certainly doesn't require it. (There's no "every row has an ID number" normal form.)
If I read your question correctly, your design is fundamentally flawed, because of these facts:
the hash is the primary key (quoted from your question)
the same hash can be generated from multiple different inputs (fact)
you have millions of hashes (from question)
With the many millions of rows/hashes, eventually you'll get a hash collision.
The only sane approach is to have job_id as the primary key and hash in a column with a non-unique index on it. Finding job(s) given a hash would be straightforward.

Add Foreign Key relationships as bulk operation

I've inherited a database with hundreds of tables. Tables may have implicit FK relations that are not explicitly defined as such. I would like to be able to write a script or query that could do this for all tables. For instance, if a table has a field called user_id, then we know there's an FK relationship with the users table on the id column. Is this even doable?
Thanks in advance.
Yes, it's possible, but I would want to explore more first. Many folks design relational databases without foreign keys, especially in the MySQL world. Also, people reuse column names in different tables in the same schema (often with less than optimal results). Double-check that what you think is a foreign key can actually be used that way (same data type, width, collation/character set, etc.).
Then I would recommend you copy the tables to a test machine and start doing your ALTER TABLEs to add foreign keys. Test like heck.
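As a starting point, something along these lines could list the candidate columns, after which each constraint is added individually once the types and character sets have been checked (the schema and table names are placeholders):

-- Columns whose names end in "_id" are candidates for implicit foreign keys.
SELECT TABLE_NAME, COLUMN_NAME, COLUMN_TYPE
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'your_schema'
  AND COLUMN_NAME LIKE '%\_id'      -- \_ matches a literal underscore
  AND COLUMN_NAME <> 'id';

-- Then, per candidate, e.g. a user_id column referencing users.id:
ALTER TABLE some_table
    ADD CONSTRAINT fk_some_table_user
    FOREIGN KEY (user_id) REFERENCES users(id);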