Dedicated SQL table containing only unique strings - MySQL

I can't seem to find any examples of anyone doing this on the web, so am wondering if maybe there's a reason for that (or maybe I haven't used the right search terms). There might even already be a term for this that I'm unaware of?
To save on database storage space for regularly reoccurring strings, I'm thinking of creating a MySQL table called unique_string. It would only have two columns:
"id" : INT : PRIMARY_KEY index
"string" : varchar(255) : UNIQUE index
Any other tables anywhere in the database can then use INT columns instead of VARCHAR columns. For example a varchar field called browser would instead be an INT field called browser_unique_string_id.
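A minimal sketch of that schema (the page_request table and its columns are illustrative, not from the question):
CREATE TABLE unique_string (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    string VARCHAR(255) NOT NULL,
    PRIMARY KEY (id),
    UNIQUE KEY uq_string (string)
);
CREATE TABLE page_request (
    request_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    browser_unique_string_id INT UNSIGNED NOT NULL, -- points at unique_string.id
    PRIMARY KEY (request_id)
);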
I would not use this for anything where performance matters. In this case I'm using it to track details of every single page request (logging web stats) and an "audit trail" of user actions on intranets, but potentially other things too.
I'm also aware the SELECT queries would be complex, so I'm not worried about that. I'll most likely write some code to generate the queries to return the "real" string data.
Thoughts? I feel like I might be overlooking something obvious here.
Thanks!

I have used this structure for a similar application -- keeping track of URIs for web logs. In this case, the database was Oracle.
The performance issues are not negligible. As the database grows, there are tens of millions of URIs, so just identifying the right string during an INSERT is challenging. We handled this by building most of the update logic in Hadoop, so the database table was, in essence, just a copy of a Hadoop table.
In a regular database, you would get around this by building an index, as you suggest in your question. An index solution works well up to the limit of your available memory. In fact, this is a rather degenerate case for an index, because you really only need the index and not the underlying table. I do not know whether MySQL or SQL Server recognizes this, although columnar databases (such as Vertica) should.
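In MySQL/InnoDB you can get close to the index-only case, because a secondary index entry also carries the primary key; a lookup such as the following can be satisfied entirely from the UNIQUE index (a sketch against the unique_string table from the question):
SELECT id FROM unique_string WHERE string = 'Mozilla/5.0 ...';
-- InnoDB stores the PK value inside the secondary index entry, so the base row is never touched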
SQL Server has another option. If you declare the string as VARCHAR(MAX), it is stored on a separate data page from the rest of the data. During a full table scan, there is no need to load the additional page into memory if the column is not referenced in the query.

This is a very common design pattern in databases where the cardinality of the data is relatively small compared to the transaction table it's linked to. The queries wouldn't be very complex, just a simple join to the lookup table (see the query sketch after the example below). You can also include more than just a string in the lookup table: any other information that is commonly repeated. You're simply normalizing your model to remove duplicate data.
Example:
Request Table:
Date
Time
IP Address
Browser_ID
Browser Table:
Browser_ID
Browser_Name
Browser_Version
Browser_Properties
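The lookup query is then just a join (a sketch using the names from the example above, with column names underscored to be valid SQL):
SELECT r.Date, r.Time, r.IP_Address, b.Browser_Name, b.Browser_Version
FROM Request r
JOIN Browser b ON b.Browser_ID = r.Browser_ID;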

If you are planning on logging data in real time (as opposed to a batch job) then you want to ensure your time to write a record to the database is as quick as possible. If you are logging synchronously then the record creation time will directly affect the time it takes for an HTTP request to complete. If it is asynchronous then slow record creation times will lead to a bottleneck. However, if this is a batch job then performance will not matter as long as you can confidently create all the batched records before the next batch runs.
In order to reduce the time it takes to create a record you really want to flatten out your database structure. Your current query, in pseudocode (MySQL-flavoured), might look like:
SET @id = NULL;
SELECT id INTO @id FROM PagesTable WHERE PageName = @RequestedPageName;
IF @id IS NULL THEN
    INSERT INTO PagesTable (PageName) VALUES (@RequestedPageName);
    SET @id = LAST_INSERT_ID(); -- or whatever method your db supports for fetching the id of a newly created record
END IF;
INSERT INTO BrowserLogTable (page_id, browser_name) VALUES (@id, @BrowserName);
Whereas in a flat structure you would just need one INSERT.
If you are concerned about data integrity, which you should be, then typically you would normalise this data by querying it and writing it into a separate set of tables (or a separate database) at regular intervals, and use those for querying against.


Is SELECT faster than INSERT?

I have a big load file that I downloaded. It contains records that I will have to load into the database. Based on the size of the data, it will likely take 2 weeks or more to finish (since there is preprocessing etc). A coworker asked me to make what she called a delta file, which checks the current database to see if the data already exists based on a certain field, and iff it exists then we will keep that record in the load file, otherwise we will discard it.
I'm confused, because to implement this I would need to do a SELECT query for every record in the load file to check if it exists. A SELECT would take O(n), I'm assuming, then the insert (for a smaller data set) an additional O(1), whereas an insert alone would just take O(1).
I'd like to 1) understand why this implementation is faster (if I'm not understanding things properly) and 2) hear a possible solution for implementing this delta file, if you can think of something smarter than what I suggested.
Thanks
Databases build indexes for columns specified in the schema. The way your data is indexed can make a massive difference in performance. Without an index, a select operation may be O(n); with an index it may be O(log n) or even O(1).
Insert operations must maintain the index. For large data-loading operations you may be well off disabling indexing until the end, so you do a single index update over all the data instead of many index updates for each record you insert.
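For example, on a MyISAM table MySQL lets you defer non-unique index maintenance during a bulk load (the table name here is a placeholder):
ALTER TABLE load_target DISABLE KEYS;
-- ... bulk INSERTs / LOAD DATA INFILE here ...
ALTER TABLE load_target ENABLE KEYS; -- rebuilds the deferred indexes in one pass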
Some measurements I did the other day indicate that selects are faster than inserts in my situation. I came across this question because I am trying to learn if this is generally true or reflects something specific about the way I have it setup.

MySQL or NoSQL? Recommended way of dealing with large amounts of data

I have a database which will be used by a large number of users to store random long strings (up to 100 characters). The table columns will be: userid, stringid and the actual long string.
So it will look pretty much like this:
userid | stringid | string
Userid will be unique and stringid will be unique for each user.
The app is like a simple todo-list app, so each user will have an average of 50 todos.
I am using the stringid so that users will be able to delete a specific task at any given time.
I assume this todo app could end up with 7 million tasks in 3 years' time, and that scares me away from using MySQL.
So my question is: is this the actual recommended way of dealing with large amounts of data with long strings (every new task gets a new row)? And is MySQL the right database solution to choose for this kind of project?
I have no experience with large amounts of data yet and I am trying to prepare myself for the future.
This is not a question of "large amounts" of data (MySQL handles large amounts of data just fine, and 2 million rows isn't "large amounts" in any case).
MySQL is a relational database. So if you have data that can be normalized, that is, distributed among a number of tables in a way that ensures every data point is saved only once, then you should use MySQL (or MariaDB, or any other relational database).
If you have schema-less data and speed is more important than consistency, then you can/should use some NoSQL database. Personally I don't see how a todo list would profit from NoSQL (it doesn't really matter in this case, but I guess as of now most programming frameworks have better support for relational databases than for NoSQL).
This is a pretty straightforward relational use case. I wouldn't see a need for NoSQL here.
The table you present should work fine. However, I personally would question the need for the compound primary key as you present it. I would probably have a primary key on stringid only, to enforce uniqueness across all records, rather than a compound primary key across userid and stringid. I would then put a regular index on userid.
The reason for this is in case you want to query by stringid only (i.e. for deletes or updates): you are not tied into always having to query across both fields to leverage your index (or having to add individual indexes on stringid and userid to enable querying by each field, which means more space in memory and on disk taken up by indexes).
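In other words, something like this (a sketch; the table name todo is an assumption):
ALTER TABLE todo ADD PRIMARY KEY (stringid);
CREATE INDEX idx_userid ON todo (userid);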
As far as whether MySQL is the right solution, that is really for you to determine. I would say that MySQL should have no problem handling tables with 2 million rows and 2 indexes on two integer id fields, assuming you have allocated enough memory to hold those indexes in memory. There is certainly a ton of information available on working with MySQL, so if you are just trying to learn, it would likely be a good choice.
Regardless of what you consider a "large amount of data", modern DB engines are designed to handle a lot. The question of "Relational or NoSQL?" isn't about which option can support more data. Different relational and NoSQL solutions will handle the large amounts of data differently, some better than others.
MySQL can handle many millions of records; SQLite cannot (at least not as effectively). Mongo (NoSQL) attempts to hold its collections in memory (as well as on the file system), so I have seen it fail with less than 1 million records on servers with limited memory, although it offers sharding, which can help it scale more effectively.
The bottom line is: the number of records you store should not play into the SQL vs NoSQL decision; that decision should be driven by how you will save and retrieve the data. It sounds like your data is already normalized (e.g. userid), and if you also desire consistency when you, for example, delete a user (the todo items also get deleted), then I would suggest using a SQL solution.
I assume that all queries will reference a specific userid. I also assume that the stringid is a dummy value used internally instead of the actual task-text (your random string).
Use an InnoDB table with a compound primary key on {userid, stringid} and you will have all the performance you need, due to the way a clustered index works.
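A minimal sketch of that (the table name and the task column are illustrative; the 100-character limit comes from the question):
CREATE TABLE todo (
    userid INT UNSIGNED NOT NULL,
    stringid INT UNSIGNED NOT NULL,
    task VARCHAR(100) NOT NULL,
    PRIMARY KEY (userid, stringid)
) ENGINE=InnoDB;
-- the clustered index keeps each user's rows physically together,
-- so queries filtered on userid stay fast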

Formula to calculate MySQL single row size (MyISAM engine)

I have a situation where I have to create tables dynamically. Depending on some criteria I am going to vary the size of the columns of a particular table.
For that purpose I need to calculate the size of one row.
e.g.
If I am going to create the following table:
CREATE TABLE sample(id int, name varchar(30));
The formula would give me the size of a single row for the table above, considering all the overheads of storing a row in a MySQL table.
Is it possible, and is it feasible, to do so?
It depends on the storage engine you use, the row format chosen for that table, and also your indexes. But it is not very useful information.
Edit:
I suggest going against normalization only when you know exactly what you're doing. A DBMS is built to deal with large amounts of data. You probably don't need to serialize your structured data into a single field.
Keep in mind that your application layer then has to tokenize (or worse) the serialized field data to get the original meaning back, which certainly has larger overhead than getting the data from the DB already in structured form.
The only exception I can think of is a client-heavy architecture, where moving processing to the client side actually takes burden off the server, and you would serialize your data anyway for the sake of the transfer. In server-side code (like PHP) it is not good practice to save serialized-style data into the DB.
(Though using PHP's built-in serialization may be a good idea in some cases, your current project does not seem to benefit from it.)
VARCHAR is a variable-length data type; it has a maximum length, but the stored value can be shorter or empty, so any calculation will not be exact. Have a look at the Avg_row_length field in information_schema.TABLES.
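For example ('your_db' is a placeholder for your schema name):
SELECT TABLE_ROWS, AVG_ROW_LENGTH, DATA_LENGTH
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'your_db' AND TABLE_NAME = 'sample';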

Indexing only one MySQL column value

I have a MySQL InnoDB table with a status column. The status can be 'done' or 'processing'. As the table grows, at most 0.1% of the status values will be 'processing', whereas the other 99.9% of the values will be 'done'. This seems like a great candidate for an index due to the high selectivity for 'processing' (though not for 'done'). Is it possible to create an index for the status column that only indexes the value 'processing'? I do not want the index to waste an enormous amount of space indexing 'done'.
I'm not aware of any standard way to do this but we have solved a similar problem before by using two tables, Processing and Done in your case, the former with an index, the latter without.
Assuming that rows don't ever switch back from done to processing, here are the steps you can use:
When you create a record, insert it into the Processing table with the column set to processing.
When it's finished, set the column to done.
Periodically sweep the Processing table, moving done rows to the Done table.
That last one can be tricky. You can do the insert/delete in a transaction to ensure it transfers properly, or you could use a unique ID to detect if a row has already been transferred and then just delete it from Processing (I have no experience with MySQL's transaction support, which is why I'm also giving that option).
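A sketch of the transactional variant (assumes both tables are InnoDB and share the same columns; in a busy system you may want to key the DELETE to the exact ids just copied):
START TRANSACTION;
INSERT INTO Done SELECT * FROM Processing WHERE status = 'done';
DELETE FROM Processing WHERE status = 'done';
COMMIT;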
That way, you're only indexing a few of the 99.9% of done rows, the ones that have yet to be transferred to the Done table. It will also work with multiple states of processing as you have alluded to in comments (entries are only transferred when they hit the done state, all other states stay in the Processing table).
It's akin to having historical data (stuff that will never change again) transferred to a separate table for efficiency. It can complicate some queries where you need access to both done and non-done rows since you have to join two tables so be aware there's a trade-off.
A better solution: don't use strings to indicate statuses. Instead, use constants in your code with descriptive names mapped to integer values. Then that integer is stored in the database, and MySQL will work a LOT faster than with strings.
I don't know what language you use, but for example in PHP:
class Member
{
    const STATUS_ACTIVE = 1;
    const STATUS_BANNED = 2;
}

if ($member->getStatus() == Member::STATUS_ACTIVE)
{
}
instead of what you have now:
if ($member->getStatus() == 'active')
{
}
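On the MySQL side the column then becomes a small integer instead of a string - a sketch, assuming a member table (migrating the existing string values is a separate step):
ALTER TABLE member MODIFY status TINYINT UNSIGNED NOT NULL DEFAULT 1;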

What's the best way to implement a counter field in MySQL

I want to start counting the number of times a webpage is viewed and hence need some kind of simple counter. What is the best scalable method of doing this?
Suppose I have a table Frobs where each row corresponds to a page - some obvious options are:
1. Have an unsigned int NumViews field in the Frobs table which gets updated upon each view using UPDATE Frobs SET NumViews = NumViews + 1. Simple, but not so good at scaling as I understand it.
2. Have a separate table FrobViews where a new row is inserted for each view. To display the number of views, you then need to do a simple SELECT COUNT(*) AS NumViews FROM FrobViews WHERE FrobId = '%d' GROUP BY FrobId. This doesn't involve any updates, so it can avoid table locking in MyISAM tables; however, the read performance will suffer if you want to display the number of views on each page.
How do you do it?
There's some good advice here:
http://www.mysqlperformanceblog.com/2007/07/01/implementing-efficient-counters-with-mysql/
but I'd like to hear the views of the SO community.
I'm using InnoDB at the moment, but am interested in answers for both InnoDB and MyISAM.
If scalability is more important to you than absolute accuracy of the figures, then you could cache the view count in your application for a short time rather than hitting the database on every page view - e.g., only update the database once every 100 views.
If your application crashes between database updates then obviously you'll lose some of your data, but if you can tolerate a certain amount of inaccuracy then this might be a useful approach.
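Flushing the cached count then costs a single statement per batch (using the Frobs/NumViews names from the question; 100 is the batch size):
UPDATE Frobs SET NumViews = NumViews + 100 WHERE FrobId = ?; -- one UPDATE per 100 views instead of 100 UPDATEs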
Inserting into a database is not something you want to do on every page view. You are likely to run into problems updating your slave databases with all of the inserts, since replication is single-threaded on MySQL.
At my company we serve 25M page views a day and we have taken a tiered approach.
The view counter is stored in a separate table with 2 columns (profileId, viewCounter) both are unsigned integers.
For items that are infrequently viewed we update the table on page view.
For frequently viewed items we update MySQL about 1/10 of the time. For both types we update Memcache on every hit.
int Memcache::increment ( string $key [, int $value = 1 ] )
if (pageViews < 10000) { UPDATE page_view SET viewCounter = viewCounter + 1 WHERE profileId = :profileId }
else if (rand(0, 9) == 1) { UPDATE page_view SET viewCounter = :memcacheValue WHERE profileId = :profileId }
Doing COUNT(*) is very inefficient in InnoDB (MyISAM keeps an exact row count in its table metadata), but MyISAM's table-level locking reduces concurrency. Doing a COUNT() over 50,000 or 100,000 rows is going to take a long time. Doing a select on a PK will be very fast.
If you require more scalability, you might want to look at Redis.
I would take your second approach and aggregate the data into the table from your first solution on a regular basis. This way you get the advantages of both solutions. To be clearer:
On every hit you insert a row into a table (let's name it hit_counters). This table has only one field (the pageid). Every x seconds you run a script (via a cronjob) which aggregates the data from the hit_counters table and puts it into a second table (let's name it hits), which has two fields: the pageid and the total hits.
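The aggregation step can be a single INSERT ... SELECT (a sketch; assumes both tables are InnoDB and hits has a primary key on pageid, with the transaction preventing hits from being lost between the two statements):
START TRANSACTION;
INSERT INTO hits (pageid, total)
SELECT pageid, COUNT(*) FROM hit_counters GROUP BY pageid
ON DUPLICATE KEY UPDATE total = total + VALUES(total);
DELETE FROM hit_counters;
COMMIT;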
I'm not sure, but IMHO InnoDB does not help you very much with solution 1 if you get many hits on the same page: InnoDB locks the row while updating, so all other updates to this row will be delayed.
Depending on what your program is written in, you could also batch the updates together by counting in your application and updating the database only every x seconds. This would only work if you use a programming language with persistent in-process state (like Java servlets, but not PHP).
What I do, and it may not apply to your scenario, is that in the stored procedure that prepares/returns the data displayed on the page, I update the counter table at the same time the data is returned - that way, there is only one call to the server that both gets the data and updates the counter.
If you are not using SPs (or if there is no database data on your page) this option may not be available to you, but if you are, it's something to consider.