Does MySQL compress query responses?

I am currently trying to optimise some DB queries that get run a lot. The queries run a SELECT against a view, and this view does a lot of joins. I thought I might be able to speed things up by caching the results of the view into a table and selecting from the table instead of the view.
Let's say I have 2 tables.
People:
PersonId | Name
1        | Anne
2        | Brian
3        | Charlie
4        | Doug
CustomerPeople:
CustomerId | PersonId
1          | 1
1          | 2
1          | 3
1          | 4
2          | 1
2          | 2
and I have a view that joins the two tables to give a list of people, by name, belonging to the customer:
CustomerId | PersonName
1          | Anne
1          | Brian
1          | Charlie
1          | Doug
2          | Anne
2          | Brian
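A view along these lines would produce the output above (a minimal sketch; the view name is an assumption, and the real view in the question does many more joins):

CREATE VIEW CustomerPeopleView AS
SELECT cp.CustomerId, p.Name AS PersonName
FROM CustomerPeople cp
JOIN People p ON p.PersonId = cp.PersonId;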
When I query the view, I look at the Duration/Fetch and it is 0.10 sec/4.00 sec
I decide to cache the view data into a table and create a new table:
CustomerNamedPeople:
CustomerId | PersonName
1          | Anne
1          | Brian
1          | Charlie
1          | Doug
2          | Anne
2          | Brian
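Such a table could be materialized from the view with something like this (a sketch, reusing the assumed view name from above):

CREATE TABLE CustomerNamedPeople AS
SELECT CustomerId, PersonName
FROM CustomerPeopleView;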
This contains the exact same data; however, now when I query the table, the Duration/Fetch is 0.05 sec/6.00 sec.
My understanding is that Duration is the time it takes the MySQL engine to run the query, and Fetch is the time it takes for the data to be returned to the client (over the network). Unsurprisingly the Duration was faster, taking only 50% of the original time, which makes sense: there is no longer a join occurring. However, the Fetch took 150% of the original time and is slower.
My question here is: does MySQL do some sort of response-stream compression? Since it knows that Anne and Brian are repeated, can it send them only once and have the client "decompress" the data?
The reason I ask is that I am doing something similar but with 1,000,000 rows returned. The data in the two responses is identical, yet the view Fetch takes 20 seconds and the table Fetch takes 60 seconds. Most of the PersonNames are repeated more than once, so I am wondering whether some sort of compression is occurring in the response. Should I not expect MySQL to take the same time to Fetch two identical sets of data?

Related

DB design: little or too much data

I'm currently working on a little project that uses MySQL, but I'm struggling with the database design. I've come up with two designs. One stores more data and is actually the way I want it to be, but it makes the data really hard to work with. The other is, I think, more basic and simplifies a lot of things, but stores less data.
Design 1
Example data, items table:
id | description | time_created
1  | Car         | 2021-04-17 17:30:00
2  | Bike        | 2021-04-17 17:30:00
Example data, user_items table:
id | user_id | item_id | time_achieved
1  | 1       | 1       | 2021-04-17 17:30:04
2  | 1       | 1       | 2021-04-17 17:30:03
3  | 1       | 1       | 2021-04-17 17:30:17
4  | 1       | 1       | 2021-04-17 17:30:22
5  | 1       | 1       | 2021-04-17 17:30:34
6  | 1       | 2       | 2021-04-17 17:30:42
7  | 1       | 2       | 2021-04-17 17:30:54
Design 2
Example data, items table:
id | description | time_created
1  | Car         | 2021-04-17 17:30:00
2  | Bike        | 2021-04-17 17:30:00
Example data, user_items table:
id | user_id | item_id | count
1  | 1       | 1       | 5
2  | 1       | 2       | 2
Basically we have items that can be anything; they include a description to specify what they actually are. A user can collect items (a lot of them). These are stored in the user_items table, which contains FK columns user_id and item_id pointing to the users and items tables. The users table is left out for simplicity.
As you can see, design 1 stores a lot more rows in the user_items table; this allows us to add more information (time_achieved and more) per item that a user achieved. However, it results in more rows and probably a harder time querying. Design 2, on the other hand, simply adds a count column to record how many items the user has, but this is very limiting because we cannot add more data (achieved time, etc.) per user_item.
I'm not sure if design 1 is the right and only design for what we want to achieve. We really want to store additional metadata per user_item, but I don't know if this is the right design, since it quickly fills up the database. Does anyone have a suggestion/idea for an alternative design that stores less data than design 1 but still allows adding more info per user_item?
Thanks in advance.
Does anyone have a suggestion/idea for an alternative design which stores less data than design 1 but still allows adding more info per user_item?
Design 1 should work.
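A minimal sketch of design 1's user_items table, using the column names from the question (the types and the users table definition are assumptions):

CREATE TABLE user_items (
    id INT AUTO_INCREMENT PRIMARY KEY,
    user_id INT NOT NULL,
    item_id INT NOT NULL,
    time_achieved DATETIME NOT NULL,  -- per-row metadata; more columns can be added later
    FOREIGN KEY (user_id) REFERENCES users(id),
    FOREIGN KEY (item_id) REFERENCES items(id)
) ENGINE=InnoDB;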
A denormalized design will also work and is quicker to query, but it fills up the database faster: put id, item_id, item_desc, item_qty, user_id, username, and time_created all in one table. Some of the values will be repeated.

MySQL DB: best method to store keywords and URL index

Which of these methods would be the most efficient way of storing, retrieving, processing, and searching a large (millions of records) index of stored URLs along with their keywords?
Example 1: (Using one table)
TABLE_URLs-----------------------------------------------
ID DOMAIN KEYWORDS
1 mysite.com videos,photos,images
2 yoursite.com videos,games
3 hissite.com games,images
4 hersite.com photos,pictures
---------------------------------------------------------
Example 2: (one-to-one Relationship from one table to another)
TABLE_URLs-----------------------------------------------
ID DOMAIN
1 mysite.com
2 yoursite.com
3 hissite.com
4 hersite.com
---------------------------------------------------------
TABLE_URL_KEYWORDS---------------------------------------------
ID DOMAIN_ID KEYWORDS
1 1 videos,photos,images
2 2 videos,games
3 3 games,images
4 4 photos,pictures
---------------------------------------------------------
Example 3: (one-to-one Relationship from one table to another (Using a reference table))
TABLE_URLs-----------------------------------------------
ID DOMAIN
1 mysite.com
2 yoursite.com
3 hissite.com
4 hersite.com
---------------------------------------------------------
TABLE_URL_TO_KEYWORDS------------------------------------
ID DOMAIN_ID KEYWORDS_ID
1 1 1
2 2 2
3 3 3
4 4 4
---------------------------------------------------------
TABLE_KEYWORDS-------------------------------------------
ID KEYWORDS
1 videos,photos,images
2 videos,games
3 games,images
4 photos,pictures
---------------------------------------------------------
Example 4: (many-to-many Relationship from url to keyword ID (using reference table))
TABLE_URLs-----------------------------------------------
ID DOMAIN
1 mysite.com
2 yoursite.com
3 hissite.com
4 hersite.com
---------------------------------------------------------
TABLE_URL_TO_KEYWORDS------------------------------------
ID DOMAIN_ID KEYWORDS_ID
1 1 1
2 1 2
3 1 3
4 2 1
5 2 4
6 3 4
7 3 3
8 4 2
9 4 5
---------------------------------------------------------
TABLE_KEYWORDS-------------------------------------------
ID KEYWORDS
1 videos
2 photos
3 images
4 games
5 pictures
---------------------------------------------------------
My understanding is that Example 1 would take the largest amount of storage space, but searching through this data would be quick (repeated keywords are saved multiple times, but each keyword sits next to the relevant domain).
Whereas Example 4 would save a ton of storage space, but searching through it would take longer (duplicate keywords are not stored, but referencing multiple keywords for each domain takes longer).
Could anyone give me any insight or thoughts on which method would be best when designing a database that can handle huge amounts of data, with the foresight that you may want to display a URL with its associated keywords OR search for one or more keywords and bring up the most relevant URLs?
You do have a many-to-many relationship between URLs and keywords. The canonical way to represent this in a relational database is a bridge table, which corresponds to example 4 in your question.
Using the proper data structure, you will find that the queries are much easier to write, and as efficient as it gets.
I don't know what drives you to think that searching in a structure like the first one will be faster. It requires you to do pattern matching when searching for each single keyword, which is notably slow. On the other hand, using a junction table lets you search for exact matches, which can take advantage of indexes.
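For instance, an exact-match keyword search against the example 4 layout might look like this (a sketch using the table names from the question):

SELECT u.DOMAIN
FROM TABLE_URLs u
JOIN TABLE_URL_TO_KEYWORDS uk ON uk.DOMAIN_ID = u.ID
JOIN TABLE_KEYWORDS k ON k.ID = uk.KEYWORDS_ID
WHERE k.KEYWORDS = 'videos';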
Finally, maintaining such a structure is also much easier; adding or removing keywords can be done with INSERT and DELETE statements, while the other structures require you to do string manipulation on delimited lists, which is tedious, error-prone, and inefficient.
None of the above.
Simply have a table with 2 string columns:
CREATE TABLE domain_keywords (
    domain VARCHAR(..) NOT NULL,
    keyword VARCHAR(..) NOT NULL,
    PRIMARY KEY(domain, keyword),
    INDEX(keyword, domain)
) ENGINE=InnoDB;
Notes:
It will be faster.
It will be easier to write code.
Having a plain id is very much a waste.
Normalizing the domain and keyword buys little space savings, but at a big loss in efficiency.
"Huse database"? I predict that this table will be smaller than your Domains table. That is, this table is not your main concern for "huge".

Database table structure for storing statistics data

I am trying to create a table in my MySQL database for storing click data for my posts on a daily basis. What I have come up with is something like this:
ID | post_id | click_type | created_date
1  | 1       | page_click | 2015-12-11 18:13:13
2  | 2       | page_click | 2015-12-13 11:16:34
3  | 3       | page_click | 2015-12-13 13:24:01
4  | 1       | page_click | 2015-12-15 15:31:10
With this type of storage I can get how many clicks post number 1 received in December 2015, and I can even get how many clicks a given post received on 15 December between 1 and 11 pm. However, let's say I am getting 2,000 clicks per day: that means 2,000 rows per day, 60,000 per month, and 720,000 per year.
Another approach that comes to mind stores one row per post per day and, if there is more than one click that day, increments the count:
ID | post_id | click_type | created_date | count
1  | 1       | page_click | 2015-12-11   | 13
2  | 2       | page_click | 2015-12-11   | 26
3  | 3       | page_click | 2015-12-11   | 152
4  | 1       | page_click | 2015-12-12   | 14
5  | 2       | page_click | 2015-12-12   | 123
6  | 3       | page_click | 2015-12-12   | 163
In this approach, if every post is clicked at least once per day (which creates the row), it will generate 1,000 rows per day (let's say I have 1,000 posts), 30,000 per month, and 360,000 per year.
I am looking for advice on how to store these statistics while still being able to get daily click statistics. I have some concerns about performance (of course it's nothing for big data guys :D but sorry for my lack of experience). Do you think it will be OK if there are over 1 million rows in that table after 2-3 years? And which one do you think is going to be more effective for me?
720,000 records per year is not necessarily a lot of data. One option may be not to worry about it. Something to consider may be how long the click data matters. If after a year you don't really care anymore then you can have an historical data cleanup protocol that removes data that is older than you care about.
If you are worried about storing large amounts of data and you don't want to erase history, then you can consider pre-calculating your summarized statistics and storing them instead of your transaction detail.
The issue with this is that you have to know in advance the smallest time resolution you will continue to care about. Also, if your motivation is saving space, you have to be careful that your summary data doesn't end up taking more space than the original transactions. This can easily happen if you store summarized data at multiple resolutions, as you might in a data warehouse arrangement.
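A sketch of that pre-aggregation using the columns from the question (the table names post_clicks and daily_post_clicks are assumptions):

-- roll up raw clicks into one row per post, type, and day
INSERT INTO daily_post_clicks (post_id, click_type, click_date, click_count)
SELECT post_id, click_type, DATE(created_date), COUNT(*)
FROM post_clicks
GROUP BY post_id, click_type, DATE(created_date);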
This seems like a good application for rrdtool (http://oss.oetiker.ch/rrdtool/). It lets you specify several resolutions for different time intervals, e.g.:
average 5 min for 1 day
average 30 min for 1 week
average 2 hours for 1 month
average 1 day for 1 Year
etc. It is also often used for graphs. Usually it works with RRD files, but it can also be backed by MySQL with rrdgraph_libdbi.

SQL query to identify the max value in a subset of records, to be used as a boundary condition for batch job partitioning

I have around 2 million records in the database and I want to use the concept of partitions in one of my batch jobs. In order to do this I need to first identify the boundary records of each partition. Can anyone help me identify boundary values using a SQL query? To illustrate, consider that I have student records as follows:
STUDENT_ID STUDENT_NAME
1 JACK
2 SPARROW
3 JONNY
4 WALKER
5 SKY
6 DANNY
Now if I want to create 2 partitions, the boundary conditions will be STUDENT_ID between 1 and 3 for the first partition and STUDENT_ID between 4 and 6 for the second. Consider a similar situation in which STUDENT_ID is a string or a random id: how do I identify the boundary condition then? Currently I am thinking of first querying all the records in the database and then partitioning them in the Java code, but with 2 million records this is highly inadvisable. What should I do in this situation?
You can use the LIMIT clause in MySQL as follows:
SELECT...
LIMIT y OFFSET x
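For instance, the upper boundary of the first partition could be fetched like this (a sketch; the students table name and a partition size of 1,000,000 rows are assumptions):

-- last STUDENT_ID of the first partition of 1,000,000 rows
SELECT STUDENT_ID
FROM students
ORDER BY STUDENT_ID
LIMIT 1 OFFSET 999999;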

When is it better to flatten out data using comma-separated values to improve search query performance?

My question is about SEARCH query performance.
I've flattened out data into a read-only Person table (MySQL) that exists purely for search. The table has about 20 columns of data (mostly limited text values, dates, and booleans, plus a few columns containing unlimited text).
Person
=============================================================
id First Last DOB etc (20+ columns)...
1 John Doe 05/02/1969
2 Sara Jones 04/02/1982
3 Dave Moore 10/11/1984
Another two tables support the relationship between Person and Activity.
Activity
===================================
id activity
1 hiking
2 skiing
3 snowboarding
4 bird watching
5 etc...
PersonActivity
===================================
id PersonId ActivityId
1 2 1
2 2 3
3 2 10
4 2 16
5 2 34
6 2 37
7 2 38
8 etc…
Search considerations:
Person table has potentially 200-300k+ rows
Each person potentially has 50+ activities
Search may include Activity filter (e.g., select persons with one and/or more activities)
Returned results are displayed with person details and activities as a bulleted list
If the Person table is used only for search, I'm wondering if I should add the activities as comma separated values to the Person table instead of joining to the Activity and PersonActivity tables:
Person
===========================================================================
id First Last DOB Activity
2 Sara Jones 04/02/1982 hiking, snowboarding, golf, etc.
Given the search considerations above, would this help or hurt search performance?
Thanks for the input.
Horrible idea. You will lose the ability to use indexes in querying. Do not, under any circumstances, store data in a comma-delimited list if you ever want to search on that column. Relational databases are designed to perform well with joined tables. Your database is relatively small and should have no performance issues at all if you index properly.
You may still want to display the results in a comma-delimited fashion; MySQL has a function called GROUP_CONCAT for that.
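A sketch of that display query, using the tables from the question:

SELECT p.id, p.First, p.Last,
       GROUP_CONCAT(a.activity ORDER BY a.activity SEPARATOR ', ') AS activities
FROM Person p
JOIN PersonActivity pa ON pa.PersonId = p.id
JOIN Activity a ON a.id = pa.ActivityId
GROUP BY p.id, p.First, p.Last;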