I'm implementing an algorithm that returns the posts that are popular at the moment, based on their likes and dislikes.
To do this, for each post I sum all its likes (1) and dislikes (-1) to get its score, but each like/dislike is weighted: the more recent, the heavier. For example, at the moment a user likes a post, the like weighs 1. After 1 day it weighs 0.95 (or -0.95 for a dislike), after 2 days 0.90, and so on, down to a minimum of 0.01 reached after 21 days; in other words, the weight is roughly 1 - days/21, floored at 0.01. (PS: these are approximate values.)
Here is how my tables are structured:
Posts table
id | Title | user_id | ...
-------------------------------------------
1 | Random post | 10 | ...
2 | Another post | 36 | ...
n | ... | n | ...
Likes table
id | vote | post_id | user_id | created
----------------------------------------
1 | 1 | 2 | 10 | 2014-08-18 15:34:20
2 | -1 | 1 | 24 | 2014-08-15 18:54:12
3 | 1 | 2 | 54 | 2014-08-17 21:12:48
Here is the SQL query I'm currently using, which does the job:
SELECT Post.*,
       SUM(`Like`.vote *
           (1 - IF((TIMESTAMPDIFF(MINUTE, `Like`.created, NOW()) / 60 / 24) / 21 > 0.99,
                   0.99,
                   (TIMESTAMPDIFF(MINUTE, `Like`.created, NOW()) / 60 / 24) / 21))
       ) AS score
FROM posts Post
LEFT JOIN likes `Like` ON (Post.id = `Like`.post_id)
GROUP BY Post.id
ORDER BY score DESC
PS: I'm using TIMESTAMPDIFF with MINUTE rather than DAY, and computing the days myself, because with DAY it returns an integer and I want a float value, so the weight decays gradually over time and not day by day. So TIMESTAMPDIFF(MINUTE, `Like`.created, NOW()) / 60 / 24 gives me the number of days elapsed since the like was created, including the decimal part.
Here are my questions:
1: Look at the IF(expr1, expr2, expr3) part: it is necessary in order to set a minimum value for the like's weight, so that it never drops below 0.01 or goes negative (that way even a very old like still has a little weight). But I'm computing the same thing twice: the expression inside expr1 is repeated as expr3. Isn't there a way to avoid this duplicate expression? (See the sketch below the questions.)
2: I was going to cache this query's result and refresh it every 5 minutes, as I think it will be pretty heavy on big Post and Like tables. Is the cache really necessary or not? I'm aiming to run this query on a posts table with 50,000 entries, each with around 200 associated likes (which makes a 10,000,000-row likes table).
3: Should I create an index on the likes table for post_id? And one for created?
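Regarding question 1, for what it's worth: MySQL's LEAST() caps the ratio in a single expression, which avoids the duplication. The query rewritten as a sketch, otherwise identical to the one above:

SELECT Post.*,
       -- LEAST() clamps days/21 at 0.99, so the weight never drops below 0.01
       SUM(`Like`.vote *
           (1 - LEAST(TIMESTAMPDIFF(MINUTE, `Like`.created, NOW()) / 60 / 24 / 21,
                      0.99))) AS score
FROM posts Post
LEFT JOIN likes `Like` ON Post.id = `Like`.post_id
GROUP BY Post.id
ORDER BY score DESC;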
Thank you!
EDIT: Imagine a post can have multiple tags, and each tag can belong to multiple posts. If I want to get popular posts for one or more given tags, I can't cache each query, as there are too many possible combinations. Is the query still viable in that case?
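For reference, a sketch of that per-tag variant, assuming a posts_tags join table (not part of the schema above) and the LEAST() form of the weight:

SELECT Post.*,
       SUM(`Like`.vote *
           (1 - LEAST(TIMESTAMPDIFF(MINUTE, `Like`.created, NOW()) / 60 / 24 / 21,
                      0.99))) AS score
FROM posts Post
LEFT JOIN likes `Like` ON Post.id = `Like`.post_id
-- EXISTS avoids double-counting likes when a post matches several tags
WHERE EXISTS (SELECT 1
              FROM posts_tags pt
              WHERE pt.post_id = Post.id
                AND pt.tag_id IN (1, 2))
GROUP BY Post.id
ORDER BY score DESC;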
EDIT FOR FINAL SOLUTION: I finally ran some tests. I created a posts table with 30,000 entries and a likes table with 250,000 entries.
Without indexes, the query was incredibly long (timed out after more than 10 minutes), but with indexes on Post.id (primary), Like.id (primary) and Like.post_id it took ~0.5 s.
So I'm not caching the data, nor refreshing it every 5 minutes. If the tables keep growing, caching remains a possible fallback (anything over 1 s would not be acceptable).
2: I was going to cache this query's result and refresh it every 5 minutes, as I think it will be pretty heavy on big Post and Like tables. Is the cache really necessary or not? I'm aiming to run this query on a posts table with 50,000 entries, each with around 200 associated likes (which makes a 10,000,000-row likes table).
50,000 rows (and even 10,000,000) are considered small on current hardware. With those table sizes you probably won't need any cache, unless the query will run several times per second.
Anyway, I would do a performance test before deciding to have a cache.
3: Should I create an index on the likes table for post_id? And one for created?
I would create an index for (post_id, created, vote). That way the query can get all information from the index and doesn't need to read the table at all.
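For example (the index name is illustrative):

-- Covering index: the query above can be answered from the index alone
CREATE INDEX idx_likes_post_created_vote ON likes (post_id, created, vote);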
Edit (response to comments):
An extra index will slow down inserts/updates slightly. In the end, the path you choose dictates what you will need in terms of CPU/RAM/disk I/O.
If you have enough RAM for the DB so that you expect the entire Like table to be cached in RAM then you might be better off with an index on just post_id.
In terms of total load you need to consider the ratio between insert and select and the relative cost of insert and select with or without the index.
My gut feeling is that the total load will be lower with the index.
Regarding your question on concurrency (selecting and inserting simultaneously): what happens depends on the isolation level. The general advice is to keep inserts/updates as short as possible. If you don't do unnecessary work between the start of the insert and the commit, you should be fine.
Related
I am currently trying to optimise some DB queries that run a lot. The queries SELECT against a view, and this view does a lot of joins. I thought I might be able to speed things up by caching the results of the view into a table and selecting from the table instead of the view.
Let's say I have 2 tables.
People:
PersonId | Name
---------+--------
1        | Anne
2        | Brian
3        | Charlie
4        | Doug
CustomerPeople:
CustomerId | PersonId
-----------+---------
1          | 1
1          | 2
1          | 3
1          | 4
2          | 1
2          | 2
and I have a view that joins the two tables to give a list of people, by name, belonging to the customer:
CustomerId | PersonName
-----------+-----------
1          | Anne
1          | Brian
1          | Charlie
1          | Doug
2          | Anne
2          | Brian
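The view definition isn't shown in the question; for reference, it would be something along these lines (the view name is an assumption):

-- Joins CustomerPeople to People to resolve each PersonId to a Name
CREATE VIEW CustomerNamedPeople_v AS
SELECT cp.CustomerId, p.Name AS PersonName
FROM CustomerPeople cp
JOIN People p ON p.PersonId = cp.PersonId;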
When I query the view, I look at the Duration/Fetch and it is 0.10 sec/4.00 sec
I decide to cache the view data into a table and create a new table:
CustomerNamedPeople:
CustomerId | PersonName
-----------+-----------
1          | Anne
1          | Brian
1          | Charlie
1          | Doug
2          | Anne
2          | Brian
It contains the exact same data; however, now when I query the table, the Duration/Fetch is 0.05 sec/6.00 sec.
My understanding is that Duration is the time it takes the MySQL engine to run the query, and Fetch is the time it takes the data to be returned to the client (over the network). Unsurprisingly the Duration was faster, taking only 50% of the time, which makes sense since there is no longer a join occurring; however, the Fetch took 150% of the time and is slower.
My question here is: does MySQL do some sort of response-stream compression? Since it knows that Anne and Brian are repeated, could it send them only once and have the client "decompress" the data?
The reason I ask is that I am doing something similar but with 1,000,000 rows returned. The data in the two responses is identical, but the view Fetch takes 20 seconds while the table Fetch takes 60 seconds. Most of the PersonNames are repeated more than once, so I am wondering whether some sort of compression is occurring in the response. Should I not expect MySQL to take the same time to fetch two sets of identical data?
I'm trying to display weighted random results from my database and I'm unable to get results with expected accuracy. I've followed what I learnt here and here.
This would be my table:
+--------+-----------+
| weight | image |
+--------+-----------+
| 50 | A |
| 25 | B |
| 25 | C |
+--------+-----------+
I need image A to appear 50% of the time, image B 25% of the time, and C the remaining 25%.
The SQL statement I'm using goes like this:
SELECT image FROM images WHERE weight > 0 ORDER BY -LOG(1.0 - RAND()) / weight LIMIT 10
So in order to test this properly, I made a PHP script that iterates 10,000 times, counting how many times a, b, or c is shown, and displays the results as percentages, like this:
a total: 4976 - 49.76%
b total: 2538 - 25.38%
c total: 2486 - 24.86%
With only 10,000 results, and considering that RAND() is just a randomization function, I would consider these results accurate enough. The problem is that I ran this script about 100 times, and I realized that 98 times out of 100, b had a higher percentage count than c.
I'm trying to understand what's wrong: both values (b and c) in the table are the same, and I'm not introducing any other ordering factor. I took it up a notch and went for 100,000 iterations of the SQL statement. These are the results:
a total: 50185 - 50.185%
b total: 25201 - 25.201%
c total: 24614 - 24.614%
I ran this last test about 50 times (with long waits between runs). This time b was above c every single time, and accuracy was worse than at 10,000 iterations. You would expect that the higher the number of iterations, the smaller the percentage variation and the more accurate the results. It's obvious that either I'm doing something wrong or RAND() is not really random enough.
Mathematically speaking, if it were perfectly random, accuracy should improve with more iterations, not the opposite.
Any explanation/solution is welcome.
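For reference, the same tally can be produced without PHP, directly in MySQL. A minimal sketch, where the tally table and the run_tally procedure are my own additions, not part of the original setup:

-- Helper table: how often each image wins the weighted draw
CREATE TABLE tally (
  image VARCHAR(16) PRIMARY KEY,
  hits  INT NOT NULL
);

DELIMITER //
CREATE PROCEDURE run_tally(IN iterations INT)
BEGIN
  DECLARE i INT DEFAULT 0;
  WHILE i < iterations DO
    -- One weighted draw per iteration; LIMIT 1 keeps only the winner
    INSERT INTO tally (image, hits)
      SELECT image, 1
      FROM images
      WHERE weight > 0
      ORDER BY -LOG(1.0 - RAND()) / weight
      LIMIT 1
    ON DUPLICATE KEY UPDATE hits = hits + 1;
    SET i = i + 1;
  END WHILE;
END //
DELIMITER ;

CALL run_tally(10000);

SELECT image, hits,
       100 * hits / (SELECT SUM(hits) FROM tally) AS pct
FROM tally;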
I want to store a large amount of cryptocurrency data in a DB, and then show nice JavaScript price graphs with historical prices on a webpage.
The problem is that I am not sure which database design is best for this. I was thinking of a MySQL DB, but maybe a NoSQL DB is better in this case, I don't know.
What I need:
I need to track at least 100 cryptocurrencies with historical and current prices and other stock information like volume etc.
I am going to insert new data every 10 minutes for each coin ((6 records/hour * 24 h * 365 days) * 100 coins = 5,256,000 new records per year).
I need to query various time ranges for each coin to draw the graph on the webpage.
My idea:
I came up with the following solution, but I need to know whether it is OK or whether I am completely wrong and naive.
I would have 2 tables: a parent table where I would store all the necessary info about the coins, and a child table holding all the prices. This child table would have to contain a huge amount of data, which is what worries me.
My table structure example:
tbl_coin_detail:
id | Tick_name | Name     | Algorithm | Icon
---+-----------+----------+-----------+------------
1  | BTC       | Bitcoin  | SHA256    | path/to/img
2  | ETH       | Ethereum | Ethash    | path/to/img
...
tbl_prices:
id | price_USD | price_EUR | datetime            | Volume_Day_BTC     | FK_coin
---+-----------+-----------+---------------------+--------------------+--------
1  | 6537.2    | 5632.28   | 2018-07-01 15:00:00 | 62121.7348556964   | 1
2  | 466.89    | 401.51    | 2018-07-01 15:01:00 | 156373.79481106618 | 2
...
Another idea is to make a separate price table for each coin; that would mean 100 tables with all historical and current prices and stock info instead of one huge table.
I am really not sure what is better here. All prices in one table is good for simple querying, but I guess it can become a huge performance bottleneck; separate tables are worse for querying, because I would need to write a query per table, but they might help with performance.
Can you point me in the right direction? SQL or NoSQL: which is better here?
Thank you in advance.
MySQL recommendations...
You have Volume_Day_BTC, yet you say "6 records/hour" -- is the record daily or more fine-grained?
The volume of data is not that great, but it will be beneficial to shrink the datatypes before you get started.
id is unnecessary; use PRIMARY KEY(coin, datetime) instead.
Think carefully about the datatype for prices and volumes. At one extreme is space (hence, somewhat, speed); at the other, precision.
DOUBLE -- 8 bytes, about 16 significant digits, large range
DECIMAL(17, 11) -- 8 bytes, limited to $1M and 11 decimal places (not enough?)
DECIMAL(26, 13) -- 12 bytes, maybe big enough?
etc.
Would it be OK to summarize data over, say, one month to save space? Hourly or daily avg/hi/low, etc. This would be very useful for speeding up fetching data for graphing.
In particular, I recommend keeping a Summary table by coin+day with volume, price, etc. Consider using FLOAT (4 bytes, 7 significant digits, sufficient range) as more than good enough for graphing.
So, I am recommending 3 tables:
Coins -- 100 rows with meta info about the currencies.
Prices -- 5M rows/year of details -- unless trimmed (400MB/year)
Summary -- 36500 rows/year for graphing range more than, say, a week. (4MB/yr)
It may be worth it to have an hourly summary table for shorter-range graphs. There is no need to go with weekly or monthly summaries; they can be derived from the daily with sufficient efficiency.
Use InnoDB.
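A sketch of what those three tables could look like; all names and sizes are assumptions rather than a prescription:

CREATE TABLE Coins (
  coin      SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,
  tick_name VARCHAR(10) NOT NULL,
  name      VARCHAR(50) NOT NULL,
  algorithm VARCHAR(20),
  icon      VARCHAR(255),
  PRIMARY KEY (coin)
) ENGINE=InnoDB;

CREATE TABLE Prices (            -- detail rows, one per coin per 10 minutes
  coin           SMALLINT UNSIGNED NOT NULL,
  datetime       DATETIME NOT NULL,
  price_usd      DOUBLE NOT NULL, -- or DECIMAL, per the precision trade-off above
  price_eur      DOUBLE NOT NULL,
  volume_day_btc DOUBLE,
  PRIMARY KEY (coin, datetime)    -- no surrogate id, as recommended above
) ENGINE=InnoDB;

CREATE TABLE Summary (           -- one row per coin per day, for graphing
  coin    SMALLINT UNSIGNED NOT NULL,
  day     DATE NOT NULL,
  avg_usd FLOAT,
  hi_usd  FLOAT,
  lo_usd  FLOAT,
  volume  FLOAT,
  PRIMARY KEY (coin, day)
) ENGINE=InnoDB;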
To be honest, that's far from 'huge'. We aren't talking billions of records here, so any properly indexed DB will do just fine.
We have a database for patients that records the details of their various visits to our office, such as their weight during each visit. I want to generate a report that returns, for each patient, the visit (a row from the table) whose date is as far as possible after the patient's first visit without exceeding X days.
That's confusing, so let me try an example. Let's say I have the following table called patient_visits:
visit_id | created | patient_id | weight
---------+---------------------+------------+-------
1 | 2006-08-08 09:00:05 | 10 | 180
2 | 2006-08-15 09:01:03 | 10 | 178
3 | 2006-08-22 09:05:43 | 10 | 177
4 | 2006-08-29 08:54:38 | 10 | 176
5 | 2006-09-05 08:57:41 | 10 | 174
6 | 2006-09-12 09:02:15 | 10 | 173
In my query, if I wanted to run this report for "30 days", I would want to return the row where visit_id = 5, because it's 28 days after the first visit, and the next row is 35 days after, which is too much.
I've tried a variety of things, such as joining the table to itself, or creating a subquery in the WHERE clause to try to return the max value of created WHERE it is equal to or less than created + 30 days, but I seem to be at a loss at this point. As a last resort, I can just pull all of the data into a PHP array and build some logic there, but I'd really rather not.
The bigger picture is this: the database has about 5,000 patients, each with any number of office visits. I want the report to tell me the average weight loss across all patients when going from their first visit to X days out (that is, X days from each individual patient's first visit, not an arbitrary X-day period). I'm hoping that if I can get the above resolved, I'll be able to work out the rest.
You can get the date of the first visit, and of the latest visit within the window, using a query like this:
select
    first_visits.patient_id,
    first_visits.first_date,
    max(next_visit.created) as next_date
from (
    select patient_id, min(created) as first_date
    from patient_visits
    group by patient_id
) as first_visits
inner join patient_visits next_visit
    on next_visit.patient_id = first_visits.patient_id
   and next_visit.created between first_visits.first_date
                               and first_visits.first_date + interval 30 day
group by first_visits.patient_id, first_visits.first_date
So basically you find the start date by grouping on patient_id, then join patient_visits and take the max date within the 30-day window.
Then you can join the result back to patient_visits to get the start and end weights and calculate the loss.
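A sketch of that final step, wrapping the query above as a derived table (the alias names are mine) and averaging the loss across patients, per the bigger-picture goal:

select avg(start_visit.weight - end_visit.weight) as avg_weight_loss
from (
    -- the query above: first visit date and the latest visit
    -- within 30 days of it, per patient
    select fv.patient_id,
           fv.first_date,
           max(nv.created) as next_date
    from (
        select patient_id, min(created) as first_date
        from patient_visits
        group by patient_id
    ) as fv
    join patient_visits nv
      on nv.patient_id = fv.patient_id
     and nv.created between fv.first_date
                        and fv.first_date + interval 30 day
    group by fv.patient_id, fv.first_date
) as ranges
join patient_visits start_visit
  on start_visit.patient_id = ranges.patient_id
 and start_visit.created = ranges.first_date
join patient_visits end_visit
  on end_visit.patient_id = ranges.patient_id
 and end_visit.created = ranges.next_date;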
I've got a table which keeps track of article views. It has the following columns:
id, article_id, day, month, year, views_count.
Let's say I want to keep track of daily views, each day, for every article. If I have 1,000 user-written articles, the number of rows computes to:
365 (1 year) * 1,000 => 365,000
Which is not too bad. But let's say the number of articles grows to 1M, and three years pass. The number of rows then computes to:
365 * 3 * 1,000,000 => 1,095,000,000
Obviously, over time, this table will keep growing, and quite fast. What problems will this cause? Or should I not worry, since RDBMSs handle situations like this quite commonly?
I plan on using the views data in our reports, breaking it down by month or even year. Should I worry about 1B+ rows in a table?
The question to ask yourself (or your stakeholders) is: do you really need 1-day resolution on older data?
Have a look into how products like MRTG, via RRD, do their logging. The theory is you don't store all the data at maximum resolution indefinitely, but regularly aggregate them into larger and larger summaries.
That allows you to have 1-second resolution for perhaps the last 5-minutes, then 5-minute averages for the last hour, then hourly for a day, daily for a month, and so on.
So, for example, if you have a bunch of records like this for a single article:
year | month | day | count | type
-----+-------+-----+-------+------
2011 | 12    | 1   | 5     | day
2011 | 12    | 2   | 7     | day
2011 | 12    | 3   | 10    | day
2011 | 12    | 4   | 50    | day
At regular intervals you would then create a new record (or records) summarising these data; in this example, just the total count for the month:
year | month | day | count | type
-----+-------+-----+-------+------
2011 | 12    | 0   | 72    | month
Or the average per day:
year | month | day | count | type
-----+-------+-----+-------+------
2011 | 12 | 0 | 2.3 | month
Of course you may need some flag to indicate the "summarised" status of the data. In this case I've used a 'type' column to distinguish the "raw" records from the processed ones, allowing you to purge the day records as required:
INSERT INTO statistics (article_id, year, month, day, count, type)
SELECT article_id, year, month, max(day), sum(count), 'month'
FROM statistics
WHERE type = 'day'
GROUP BY article_id, year, month, type
(I haven't tested that query, it's just an example)
The answer is "it depends". But yes, it will probably be a lot to deal with.
However, this is generally a "cross that bridge when you need to" problem. It's a good idea to think about what you could do if this becomes a problem in the future, but it's probably too early to actually implement anything until it's necessary.
My suggestion, if it ever comes to that, is to not keep the individual records for longer than X months (where you adjust X according to your needs). Instead, store the aggregated data that you currently feed into your reports: run, say, a daily script that looks at your records, grabs any that are over X months old, creates a "daily_stats" record of some sort, and then deletes the originals (or, better yet, archives them somewhere).
This ensures that only X months' worth of raw data is ever in the DB, while you still keep quick access to an aggregated form of the stats for long-timeline reports.
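A sketch of such a rollup in SQL, using the question's day/month/year columns; the table names (article_views, monthly_stats) and the six-month cutoff are illustrative assumptions:

-- Aggregate day-level rows older than 6 months into monthly totals...
INSERT INTO monthly_stats (article_id, year, month, views_count)
SELECT article_id, year, month, SUM(views_count)
FROM article_views
WHERE (year, month) < (YEAR(NOW() - INTERVAL 6 MONTH),
                       MONTH(NOW() - INTERVAL 6 MONTH))
GROUP BY article_id, year, month;

-- ...then purge (or archive) the detail rows that were rolled up
DELETE FROM article_views
WHERE (year, month) < (YEAR(NOW() - INTERVAL 6 MONTH),
                       MONTH(NOW() - INTERVAL 6 MONTH));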
It's not something you need to worry about if you can put some practices in place.
Partition the table; this should make archiving easier to do (see the sketch after this list)
Determine how much data you need at present
Determine how much data you can archive
Ensure that the table has the right build, perhaps in terms of data types and indexes
Schedule for a time when you will archive partitions that meet the aging requirements
Schedule for index checking (and other table checks)
If you have a DBA in your team, then you can discuss it with him/her, and I'm sure they'll be glad to assist.
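For the partitioning item above, a hypothetical layout; the table name, column types, and yearly ranges are assumptions (note that MySQL requires the partition key to be part of every unique key, hence the composite primary key):

CREATE TABLE article_views (
  id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  article_id  INT UNSIGNED NOT NULL,
  day         TINYINT UNSIGNED NOT NULL,
  month       TINYINT UNSIGNED NOT NULL,
  year        SMALLINT UNSIGNED NOT NULL,
  views_count INT UNSIGNED NOT NULL DEFAULT 0,
  PRIMARY KEY (id, year)
) ENGINE=InnoDB
PARTITION BY RANGE (year) (
  PARTITION p2011 VALUES LESS THAN (2012),
  PARTITION p2012 VALUES LESS THAN (2013),
  PARTITION pmax  VALUES LESS THAN MAXVALUE
);

-- Archiving a whole year is then a cheap metadata operation:
ALTER TABLE article_views DROP PARTITION p2011;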
Also, as is done in many data warehouses, and as @Taryn's post (which I agree with) suggests: store aggregated data as well. Which aggregates to keep follows naturally from the data in the table. If records can be edited or updated, that makes it even more important to set restrictions, such as how much raw data to keep (that being the data that can still be modified), and to have procedures and jobs in place that check/update the aggregated data daily, with the ability to re-check it manually whenever changes are made. That way, data integrity is maintained. Discuss with your DBA what other approaches you can take.
By the way, in case you didn't already know: aggregated data is normally needed for weekly or monthly reports, and for many other interval-based reports. Choose the granularity of your aggregation as needed, but not so fine that it becomes tedious or excessive.