MySQL partitioning or NoSQL like AWS DynamoDB?

Business logic:
My application crawls a lot (hundreds or sometimes thousands) of webpages every few hours and stores all the links (i.e. all anchor tags) on each webpage in a MySQL table, say links. This table is growing very big day by day (already around 20 million records as of now).
Technical:
I have a combined unique index on (webpage_id, link) in the links table. Also, I have a crawl_count column in the same table.
Now whenever I crawl a webpage, I already know webpage_id (the foreign key to the webpages table) and I get the links on that webpage (i.e. an array of link values), for which I just run an insert-or-update query without worrying about what is already in the table:
INSERT INTO ........ ON DUPLICATE KEY UPDATE crawl_count=crawl_count+1
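A minimal sketch of the full statement, using the webpage_id, link and crawl_count columns mentioned above (the original elides the column list, so treat this as illustrative):

-- illustrative; the real column list is not shown in the question
INSERT INTO links (webpage_id, link, crawl_count)
VALUES (?, ?, 1)
ON DUPLICATE KEY UPDATE crawl_count = crawl_count + 1;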
Problem:
The table grows bigger every day and I want to optimize it for performance. The options I have considered are:
Partitioning: partition the table by domain. All webpages belong to a particular domain. For example, the webpage https://www.amazon.in/gp/goldbox?ref_=nav_topnav_deals belongs to the domain https://www.amazon.in/
NoSQL like DynamoDB. I have other application tables in the MySQL database which I do not want to migrate to DynamoDB unless it's absolutely required. I have also considered a change in application logic (e.g. changing the structure of the webpages table to something like
{webpage: "http://example.com/new-brands", links: [link1, link2, link3]}
and migrating this table to DynamoDB so I don't need a links table at all. But again, DynamoDB has a 400 KB size limit per item. What if a record exceeds this limit?
I have read the pros and cons of both approaches. As far as my understanding goes, DynamoDB doesn't seem to be a good fit for my situation, but I still wanted to post this question so I can make a good decision for this scenario.

PARTITION BY domain -- No. There won't be any performance gain. Anyway, you will find that one domain dominates the table, and a zillion domains show up only once. (I'm speaking from experience.)
The only concept of an "array" is a separate table. It would have, in your case, webpage_id and link as a 2-column PRIMARY KEY (which is 'unique').
Normalize. This is to avoid having lots of copies of each domain and each link. This saves some space.
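A sketch of what the last two points could look like; only webpage_id, link and crawl_count come from the original post, the rest of the names are made up:

-- domains and urls stored once each, referenced by id
CREATE TABLE domains (
  domain_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  domain VARCHAR(255) NOT NULL,
  UNIQUE KEY (domain)
);
CREATE TABLE urls (
  url_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  domain_id INT UNSIGNED NOT NULL,
  path VARCHAR(2048) NOT NULL
);
-- the "array" of links is just rows keyed by a 2-column PRIMARY KEY
CREATE TABLE links (
  webpage_id INT UNSIGNED NOT NULL,
  url_id INT UNSIGNED NOT NULL,
  crawl_count INT UNSIGNED NOT NULL DEFAULT 1,
  PRIMARY KEY (webpage_id, url_id)
);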
I assume you have two categories of links -- the ones for pages you have scanned, and the ones for pages waiting to be scanned. And the two sets are probably similar in size. I don't understand the purpose of crawl_count, but it adds to the cost.
I may be able to advise further if I could see the queries -- both inserting and selecting. Also, how big are the tables (GB) and what is the value of innodb_buffer_pool_size? Putting these together, we can discuss likely points of sluggishness.
Also the slowlog would help.
Are you dealing with non-ASCII URLs? URLs too long to index? Do you split URLs into domain + path? Do you strip off "#..."? And "?..."?

Related

Work around to use two unique keys on partitioned table

We have a dataset of roughly 400M rows, 200G in size. 200k rows are added in a daily batch. It mainly serves as an archive that is indexed for full text search by another application.
In order to reduce the database footprint, the data is stored in plain MyISAM.
We are considering a range-partitioned table to streamline the backup process, but cannot figure out a good way to handle unique keys. We absolutely need two of them: one to be directly compatible with the rest of the schema (e.g. custId), another to be compatible with the full text search app (e.g. seqId).
My understanding is that partitioned tables do not support more than one globally unique key. We would have to merge both unique keys into one (custId, seqId), which will not work in our case.
Am I missing something?
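For reference, MySQL enforces that every unique key on a partitioned table must include all columns used in the partitioning expression, which is why two independent unique keys are rejected. A minimal illustration (the partitioning column and partition names are made up):

-- fails with error 1503: a UNIQUE INDEX must include all columns
-- in the table's partitioning function
CREATE TABLE archive (
  custId BIGINT NOT NULL,
  seqId BIGINT NOT NULL,
  created DATE NOT NULL,
  UNIQUE KEY (custId),
  UNIQUE KEY (seqId)
)
PARTITION BY RANGE (TO_DAYS(created)) (
  PARTITION p2015 VALUES LESS THAN (TO_DAYS('2016-01-01')),
  PARTITION pmax VALUES LESS THAN MAXVALUE
);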

Can I have one million tables in my database?

Would there be any advantages/disadvantages to having one million tables in my database?
I am trying to implement comments. So far, I can think of two ways to do this:
1. Have all comments from all posts in 1 table.
2. Have a separate table for each post and store all comments from that post in its respective table.
Which one would be better?
Thanks
You're better off having one table for comments, with a field that identifies which post id each comment belongs to. It will be a lot easier to write queries to get comments for a given post id if you do this, as you won't first need to dynamically determine the name of the table you're looking in.
I can only speak for MySQL here (not sure how this works in Postgresql) but make sure you add an index on the post id field so the queries run quickly.
You can have a million tables but this might not be ideal for a number of reasons[*]. Classical RDBMS are typically deployed & optimised for storing millions/billions of rows in hundreds/thousands of tables.
As for the problem you're trying to solve, as others state, use foreign keys to relate a pair of tables: posts & comments a la [MySQL syntax]:
-- one row per post
create table post(id integer primary key, post text);
-- one row per comment; postid points at the parent post and is indexed
create table comment(id integer primary key, postid integer, comment text, key fk (postid));
{You can add constraints to enforce referential integrity between comment and post to avoid orphaned comments, but this requires certain capabilities of the storage engine to be effective.}
The generation of primary key IDs is left to the reader, but something as simple as auto-increment might give you a quick start [http://dev.mysql.com/doc/refman/5.0/en/example-auto-increment.html].
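Putting the two notes above together, a sketch using InnoDB (which enforces foreign keys) and auto-increment ids; the constraint name is made up:

create table post(
  id integer auto_increment primary key,
  post text
) engine=InnoDB;
create table comment(
  id integer auto_increment primary key,
  postid integer not null,
  comment text,
  key fk (postid),
  constraint fk_comment_post foreign key (postid) references post(id)
) engine=InnoDB;
-- fetching the comments for one post then uses the index on postid
select id, comment from comment where postid = 42;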
Which is better?
Unless this is a homework assignment, storing this kind of material in a classic RDBMS might not fit with contemporary idioms. Keep the same schema in spirit and use something like Solr/Elasticsearch to store your material and benefit from the content indexing, since I trust you'll want to avoid writing your own search engine. You can use something like Sphinx [http://sphinxsearch.com] to index MySQL in a similar manner.
[*] Without some unconventional structuring of your schema, the amount of metadata and pressure on the underlying filesystem will be problematic (for example some dated/legacy storage engines, like MyISAM on MySQL will create three files per table).
When working with relational databases, you have to understand (a little bit about) normalization. The third normal form (3NF) is easy to understand and works in almost any case. A short tutorial can be found here. Use Google if you need more/other/better examples.
One table per record is a red flag; you know you're missing something. It also means you need dynamic DDL: you must create new tables as new records arrive. It is also a security issue: the database user needs too many permissions and becomes a security risk.

MySQL Queries Pegging Server Resources -- Indexes aren't being used

I took on a volunteer project a few years ago. The site is set up with Joomla, but most of the articles are rendered with php scripts that pull info from non-Joomla tables. The database is now almost 50MB and several of the non-Joomla tables have 60,000+ rows -- I had no idea it would get this big. Even just pulling up the list of the articles that contain these scripts takes a long time -- and right now there are only about 30 of them. I initially thought the problem was because I'm on dial-up, so everything is slow, but then we started getting "resources exceeded" notices, so I figured I better find out what's going on. It's not a high traffic site -- we get less than 2,000 unique visitors in any given month.
In one particular instance, I have one table where the library holdings (books, etc.) are listed by title, author, pub date, etc. The second table contains the names mentioned in those books. I have a Joomla! article for each publication that lists the names found in that book. I also have an article that lists all of the names from all of the books. That is the query below -- but even the ones for the specific books that pull up only 1,000 or so entries are very slow.
I originally set up indexes for these tables (MyISAM), but when I went back to check, they weren't there. So I thought re-configuring the indexes would solve the problem. It didn't -- and according to EXPLAIN, they aren't even being used.
One of my problematic queries is as follows:
SELECT *
FROM pub_surnames
WHERE pub_surname_last REGEXP '^[A-B]'
ORDER BY pub_surname_last, pub_surname_first, pub_surname_middle
EXPLAIN gave:
id 1
select_type SIMPLE
table pub_surnames
type ALL
possible_keys NULL
key NULL
key_len NULL
ref NULL
rows 56422
Extra Using where; Using filesort
Also, phpmyadmin says "Current selection does not contain a unique column."
All of the fields are required for this query, but I read here that it would help if I listed them individually, so I did. The table has a primary key, and I added a second unique index containing the primary key of this table as well as the primary key of the table that holds the information about the publication itself. I also added an index on the ORDER BY fields. But I still get the same results from EXPLAIN and the performance isn't improved at all.
I set these tables up within the Joomla! database that the site uses for connection purposes and it makes it easier to back everything up. I'm wondering now if it would help if I used a separate database for our non-Joomla tables? Or would that just make it worse?
I'm not really sure where to go from here.
I think you are probably approaching this the wrong way. It was probably the quick way to get it done when you first set it up, but now that the data has grown you are paying the price.
It sounds like you are recreating a massive list "inside" an article each time a page is rendered. Even though the source data is constantly being updated, you would probably be better off storing the results. (Assuming I understand your data structure correctly.) Not knowing exactly what your php scripts are doing makes it a little complicated; it could be that it would make more sense to build a very simple component to read the data from the other tables, but I'll assume that doesn't make sense here.
Here's what I think you might want to do.
Create a cron job (it's really easy to make a script using Joomla; go take a look at the jacs repository) and use it to run whatever your php is doing. You can schedule it once a day, once an hour, or every 10 minutes, whatever makes sense.
Save the results. These could go into a database table, or you could cache them in the file system. Or both. Or possibly have the script update the articles, since they seem to be fixed (you aren't adding new ones, etc.).
Then when the user comes, you either read the article (if you stored the results there), or have a component that renders the results, or make a plugin that manages the queries for you. You should not be running queries directly from inside an article layout; it's just wrong, even if no one knows it's there. If you have to run queries, use a content plugin similar to, say, the profile plugin, which does the queries in the right place architecturally.
Not knowing the exact purpose of what you are doing, it's hard to advise more, but I think if you are managing searches for people you'd likely be better off creating a way to use finder to index and search the results.
Check out the suggestions below:
Try changing your storage engine to InnoDB, which works better for large datasets.
Replace the REGEXP in the WHERE clause with an alternative the optimizer can use (for example a range on the indexed column); that part of the query hugely affects execution time.
Instead of selecting all the columns with "*", select only the columns you need.
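A sketch of the last two suggestions, using the table and column names from the question (the index name is made up); a range predicate can use an index where REGEXP cannot:

ALTER TABLE pub_surnames
  ADD INDEX idx_surname (pub_surname_last, pub_surname_first, pub_surname_middle);
-- roughly the same rows as REGEXP '^[A-B]' under a case-insensitive collation,
-- but expressed as an indexable range, and selecting only the needed columns
SELECT pub_surname_last, pub_surname_first, pub_surname_middle
FROM pub_surnames
WHERE pub_surname_last >= 'A' AND pub_surname_last < 'C'
ORDER BY pub_surname_last, pub_surname_first, pub_surname_middle;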

Database Design (Big Table)

What is Bigtable (i.e. Google's database design)? I have this type of requirement, but I don't know how to design for it.
In Bigtable, how do you maintain relations among tables?
Create all tables with the InnoDB storage engine, which can maintain relationships (foreign keys).
Choose table fields according to, and limited to, your requirements.
The Bigtable paper published by Google may be hard to read. I hope my answer can help you start understanding it.
In the old days, an RDBMS stored data row by row: one record per row, 1, 2, 3, 4, 5...
If you want to find record 5, that's fine: the database seeks in a B+ tree (or something similar) to get the address of record 5 and loads it for you.
But the nightmare is when you want the records where the column user equals "Michael": the database has no choice but to scan every record to check whether the user is "Michael".
Bigtable stores data differently. It stores the columns in an inverted table. When we want to find all the records satisfying user = "Michael", it looks up that value as a key via a B+ tree or hash table and gets the address in the inverted table where the list of all matching records is stored.
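A toy way to picture the inverted table in plain SQL (all names are made up; real Bigtable and Lucene structures are far more involved):

CREATE TABLE records (
  id INT PRIMARY KEY,
  user_name VARCHAR(64),
  payload TEXT
);
-- the "inverted" side: keyed by the value, listing which records contain it
CREATE TABLE user_inverted (
  user_name VARCHAR(64) NOT NULL,
  record_id INT NOT NULL,
  PRIMARY KEY (user_name, record_id)
);
-- finding every record with user_name = 'Michael' is now a key seek, not a full scan
SELECT record_id FROM user_inverted WHERE user_name = 'Michael';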
Maybe a good starting point is Lucene, an open source full text search engine and a full implementation of these inverted-index principles.
Note that an inverted table is not the same as column-based storage in an RDBMS. They are different; please remember this.

Big tables and analysis in MySql

For my startup, I track everything myself rather than rely on google analytics. This is nice because I can actually have ips and user ids and everything.
This worked well until my tracking table grew to about 2 million rows. The table is called acts, and records:
ip
url
note
account_id
...where available.
Now, trying to do something like this:
SELECT COUNT(distinct ip)
FROM acts
JOIN users ON(users.ip = acts.ip)
WHERE acts.url LIKE '%some_marketing_page%';
Basically never finishes. I switched to this:
SELECT COUNT(distinct ip)
FROM acts
JOIN users ON(users.ip = acts.ip)
WHERE acts.note = 'some_marketing_page';
But it is still very slow, despite having an index on note.
I am obviously not pro at mysql. My question is:
How do companies with lots of data track things like funnel conversion rates? Is it possible to do in mysql and I am just missing some knowledge? If not, what books / blogs can I read about how sites do this?
While getting towards 'respectable', 2 million rows is still a relatively small size for a table, and therefore faster performance is typically possible.
As you found out, leading wildcards are particularly inefficient, and we'll have to find a solution for this if that use case is common for your application.
It could just be that you do not have the right set of indexes. Before I proceed, however, I wish to stress that while indexes typically improve DBMS performance with SELECT statements of all kinds, they systematically have a negative effect on the performance of "CUD" operations (i.e. the SQL INSERT, UPDATE and DELETE verbs -- the queries which write to the database rather than just read from it). In some cases the negative impact of indexes on "write" queries can be very significant.
My reason for particularly stressing the ambivalent nature of indexes is that your application appears to do a fair amount of data collection as a normal part of its operation, and you will need to watch for possible degradation as the INSERT queries slow down. A possible alternative is to perform the data collection into a relatively small table/database, with no or very few indexes, and to regularly import the data from this input database into the database where the actual data mining takes place. (After they are imported, the rows may be deleted from the "input database", keeping it small and fast for its INSERT function.)
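A sketch of that staging-then-import flow; all table and column names here are illustrative, not from the question:

-- collection goes into a small, index-light staging table
INSERT INTO acts_staging (ip, url, note, account_id) VALUES (?, ?, ?, ?);
-- a periodic job moves rows up to a high-water mark into the indexed reporting table
SET @max_id = (SELECT MAX(id) FROM acts_staging);
INSERT INTO acts_reporting (id, ip, url, note, account_id)
  SELECT id, ip, url, note, account_id FROM acts_staging WHERE id <= @max_id;
DELETE FROM acts_staging WHERE id <= @max_id;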
Another concern/question is the width of a row in the acts table (the number of columns and the sum of their widths). Bad performance could be tied to rows that are too wide, resulting in too few rows per leaf node of the table and hence a deeper-than-needed tree structure.
Back to the indexes...
In view of the few queries in the question, it appears that you could benefit from an ip + note index (an index made of at least these two keys, in this order). A full analysis of the index situation, and frankly a possible review of the database schema, cannot be done here (not enough info for that), but the general process is to list the most common use cases and see which indexes could help with them. You can gather insight into how particular queries are handled, initially or after indexes are added, with the MySQL EXPLAIN command.
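A sketch of that suggestion, using the table and column names from the question (the index name is made up):

ALTER TABLE acts ADD INDEX idx_ip_note (ip, note);
-- re-run EXPLAIN afterwards to confirm the optimizer actually uses the new index
EXPLAIN
SELECT COUNT(DISTINCT acts.ip)
FROM acts
JOIN users ON users.ip = acts.ip
WHERE acts.note = 'some_marketing_page';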
Normalization OR denormalization (or indeed a combination of both!) is often a viable idea for improving performance during mining operations as well.
Why the JOIN? If we can assume that no IP makes it into acts without an associated record in users then you don't need the join:
SELECT COUNT(distinct ip) FROM acts
WHERE acts.url LIKE '%some_marketing_page%';
If you really do need the JOIN it might pay to first select the distinct IPs from acts, then JOIN those results to users (you'll have to look at the execution plan and experiment to see if this is faster).
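One way to write that "distinct first, then join" variant, assuming users.ip is unique (check the execution plan to see whether it actually helps):

SELECT COUNT(*)
FROM (SELECT DISTINCT ip FROM acts WHERE note = 'some_marketing_page') a
JOIN users ON users.ip = a.ip;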
Secondly, that LIKE with a leading wild card is going to cause a full table scan of acts and also necessitate some expensive text searching. You have three choices to improve this:
Decompose the url into component parts before you store it so that the search matches a column value exactly.
Require the search term to appear at the beginning of the url field, not in the middle.
Investigate a full text search engine that will index the url field in such a way that even an internal LIKE search can be performed against indexes.
Finally, in the case of searching on acts.note, if an index on note doesn't provide sufficient improvement, I'd consider calculating and storing an integer hash of note and searching for that.
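A sketch of the hash-column idea; it assumes MySQL 5.7+ for the generated column (on older versions, populate the hash in application code), and the column and index names are made up:

ALTER TABLE acts
  ADD COLUMN note_hash INT UNSIGNED AS (CRC32(note)) STORED,
  ADD INDEX idx_note_hash (note_hash);
SELECT COUNT(DISTINCT ip)
FROM acts
WHERE note_hash = CRC32('some_marketing_page')
  AND note = 'some_marketing_page'; -- recheck the real value to guard against hash collisions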
Try running EXPLAIN on your query and look to see if there are any table scans.
Should this be a LEFT JOIN?
Maybe this site can help.