A Complicated MySQL Query

I want to perform a very complicated query on a MySQL table. The table currently stores user info such as IP, country, and event_id, along with other statistics such as date_start and date_end for specific events.
A specific event_id starts at date_start, and when the user ends it, a time() value is written to the date_end column.
I want a query that finds all the suspicious users (returning their ids). These are the rules that define a suspicious user:
There are rows in the database for the user_id that were connected from multiple countries, i.e. the country column has different values.
There are many rows in the database for a specific event_id whose SUM of (date_end - date_start) is, for example, 50% higher than the SUM of (date_end - date_start) for the other events. In simple words, the query should report the user_ids that spent too much time on some events while not spending much time on all the others. The percentage should be configurable.
I know it sounds crazy, but I tried to do it and failed badly. I did it in PHP, but it's slow, and I'm sure it can be done with queries.
Hope you understand me
Thank you

This problem is too big. Figure out how to find the users who have come from multiple countries. Then figure out how to get statistics on event durations. Then figure out how to identify outliers. Then, finally, try to merge all three solutions.
In general, use SQL to filter the data down to a manageable size, then PHP to do any further processing.
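The step-by-step approach above can be sketched as follows. This is only an illustration: the table name user_events, the sample rows, and the outlier rule (an event total that exceeds the mean of the user's other event totals by more than the threshold) are assumptions, since the question leaves the exact rule open. SQLite stands in for MySQL so the example runs on its own, and Python plays the role of PHP.

```python
import sqlite3

# Hypothetical table name and sample data; column names follow the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_events ("
    " user_id INTEGER, country TEXT, event_id INTEGER,"
    " date_start INTEGER, date_end INTEGER)"
)
conn.executemany(
    "INSERT INTO user_events VALUES (?, ?, ?, ?, ?)",
    [
        (1, "DE", 10, 0, 100),
        (1, "FR", 11, 0, 100),  # user 1 appears from two countries
        (2, "US", 10, 0, 900),  # user 2 spends far longer on event 10
        (2, "US", 11, 0, 100),
        (2, "US", 12, 0, 100),
        (3, "US", 10, 0, 100),
        (3, "US", 11, 0, 110),
    ],
)

# Step 1: users seen from more than one country.
multi_country = [
    row[0]
    for row in conn.execute(
        "SELECT user_id FROM user_events"
        " GROUP BY user_id HAVING COUNT(DISTINCT country) > 1"
        " ORDER BY user_id"
    )
]

# Step 2: total time per user and event, aggregated in SQL.
totals = {}
for user_id, event_id, total in conn.execute(
    "SELECT user_id, event_id, SUM(date_end - date_start)"
    " FROM user_events GROUP BY user_id, event_id"
):
    totals.setdefault(user_id, {})[event_id] = total


def time_outliers(totals, threshold=0.5):
    """Step 3 (assumed rule): flag users with an event whose total time
    exceeds the mean of their other events by more than `threshold`."""
    flagged = []
    for user_id, per_event in sorted(totals.items()):
        for event_id, total in per_event.items():
            others = [t for e, t in per_event.items() if e != event_id]
            if others and total > (1 + threshold) * sum(others) / len(others):
                flagged.append(user_id)
                break
    return flagged


# Step 4: merge the partial results.
suspicious = sorted(set(multi_country) | set(time_outliers(totals)))
```

The threshold argument is the configurable percentage from the question (0.5 means 50%). Only the aggregated totals leave the database, which keeps the application-side loop small.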

Related

How to store recent usage frequency in MySQL

I'm working on the Product Catalog module of an Invoicing application.
When the user creates a new invoice the product name field should be an autocomplete field which shows the most recently used products from the product catalog.
How can I store this "usage recency/frequency" in the database?
I'm thinking about adding a new field, recency, which would be increased by 1 every time the product is used and decreased by 1/(count of all products) when another product is used. I would then use this recency field for ordering, but it doesn't seem like the best solution to me.
Can you help me find the best practice for this kind of problem?
Solution for the recency calculation:
Create a new column in the products table, named last_used_on for example. Its data type should be TIMESTAMP (the MySQL representation for the Unix-time).
Advantages:
Timestamps contain both date and time parts.
They make very precise calculations and comparisons on dates and times possible.
They let you format the saved values in the date-time format of your choice.
You can convert from any date-time format into a timestamp.
For your autocomplete fields, they allow you to filter the products list as you wish. For example, you can display all products used since [date-time], fetch all products used between [date-time-1] and [date-time-2], or get the products used only on Mondays, at 1:37:12 PM, in the last two years, two months and three days (that is how flexible timestamps are).
Resources:
Unix-Time
The DATE, DATETIME, and TIMESTAMP Types
How should unix timestamps be stored in int columns?
How to convert human date to unix timestamp in Mysql?
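A minimal sketch of the last_used_on idea, assuming a products table with id and name columns. The helper names mark_used and suggest are hypothetical, and the column is stored as integer Unix seconds in SQLite here so the example runs standalone; in MySQL it would be a TIMESTAMP column as described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products ("
    " id INTEGER PRIMARY KEY, name TEXT NOT NULL, last_used_on INTEGER)"
)
conn.executemany(
    "INSERT INTO products (id, name) VALUES (?, ?)",
    [(1, "Cable"), (2, "Camera"), (3, "Case"), (4, "Desk")],
)


def mark_used(conn, product_id, used_at):
    """Record that a product was just put on an invoice (hypothetical helper)."""
    conn.execute(
        "UPDATE products SET last_used_on = ? WHERE id = ?",
        (used_at, product_id),
    )


def suggest(conn, prefix, limit=5):
    """Names starting with `prefix`, most recently used first.
    Products that were never used (NULL) sort last under DESC."""
    rows = conn.execute(
        "SELECT name FROM products WHERE name LIKE ?"
        " ORDER BY last_used_on DESC, name LIMIT ?",
        (prefix + "%", limit),
    )
    return [row[0] for row in rows]


mark_used(conn, 2, 1700000000)  # Camera
mark_used(conn, 1, 1700000500)  # Cable, used after Camera
mark_used(conn, 4, 1700000900)  # Desk
suggestions = suggest(conn, "ca")
```

Passing the timestamp in explicitly keeps the sketch deterministic; a real application would more likely let the database fill in the current time.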
Solution for the usage rate calculation:
Well, actually, you are not speaking about a frequency calculation, but about a rate - even though one can argue that frequency is a rate, too.
Frequency implies using the time as the reference unit and it's measured in Hertz (Hz = [1/second]). For example, let's say you want to query how many times a product was used in the last year.
A rate, on the other hand, is a comparison, a relation between two related units. Like for example the exchange rate USD/EUR - they are both currencies. If the comparison takes place between two terms of the same type, then the result is a number without measurement units: a percentage. Like: 50 apples / 273 apples = 0.1832 = 18.32%
That said, I suppose you tried to calculate the usage rate: the number of usages of a product in relation to the number of usages of all products. For example, for a product: usage rate of the product = 17 usages of the product / 112 total usages ≈ 0.1518 = 15.18%. And in the autocomplete you'd want to display the products with a usage rate bigger than a given percentage (9%, for example).
This is easy to implement. In the products table, add a column usages of type int or bigint and simply increment its value each time a product is used. Then, when you want to fetch the most used products, just apply a filter like the one in this SQL statement:
SELECT
    id,
    name,
    (usages * 100) / (SELECT SUM(usages) AS total_usages FROM products) AS usage_rate
FROM products
GROUP BY id
HAVING usage_rate > 9
ORDER BY usage_rate DESC;
In the end, recency, frequency and rate are three different things.
Good luck.
To allow for future flexibility, I'd suggest the following additional (*) table to store the entire history of product usage by all users:
Name: product_usage
Columns:
id - internal surrogate auto-incrementing primary key
product_id (int) - foreign key to product identifier
user_id (int) - foreign key to user identifier
timestamp (datetime) - date/time the product was used
This would allow the query to be fine tuned as necessary. E.g. you may decide to only order by past usage for the logged in user. Or perhaps total usage within a particular timeframe would be more relevant. Such a table may also have a dual purpose of auditing - e.g. to report on the most popular or unpopular products amongst all users.
(*) assuming something similar doesn't already exist in your database schema
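As a rough sketch of how such a history table could drive the autocomplete ordering: the table and column names follow the description above, while the function top_products, the sample rows, and the choice to rank by use count and then by latest use are assumptions. SQLite is used so the example is self-contained, with the date/time stored as text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product_usage ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " product_id INTEGER NOT NULL,"
    " user_id INTEGER NOT NULL,"
    " timestamp TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO product_usage (product_id, user_id, timestamp)"
    " VALUES (?, ?, ?)",
    [
        (1, 7, "2023-01-15 09:00:00"),  # outside a 2024 window
        (1, 7, "2024-03-01 09:00:00"),
        (2, 7, "2024-03-02 09:00:00"),
        (2, 7, "2024-03-03 09:00:00"),
        (3, 8, "2024-03-04 09:00:00"),  # a different user
    ],
)


def top_products(conn, user_id, since, limit=10):
    """(product_id, uses) for one user since a given date/time,
    ordered by use count, then by most recent use."""
    rows = conn.execute(
        "SELECT product_id, COUNT(*) AS uses, MAX(timestamp) AS last_used"
        " FROM product_usage"
        " WHERE user_id = ? AND timestamp >= ?"
        " GROUP BY product_id"
        " ORDER BY uses DESC, last_used DESC"
        " LIMIT ?",
        (user_id, since, limit),
    )
    return [(product_id, uses) for product_id, uses, _ in rows]


recent = top_products(conn, 7, "2024-01-01 00:00:00")
```

Dropping the user_id condition would give the overall ranking, and changing the since argument changes the timeframe, which is the flexibility the extra table buys.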
Your problem is related to many web-scale search applications, such as showing spelling corrections, related searches, or "trending" topics. You recognized correctly that both recency and frequency are important criteria in determining "popular" suggestions. In practice, it is desirable to compromise between the two: recency alone will suffer from random fluctuations, but you also don't want to use only frequency, since some products might have been purchased a lot in the past while their popularity is declining (or they might have gone out of stock or been replaced by successor models).
A very simple but effective implementation that is typically used in these scenarios is exponential smoothing. First of all, most of the time it suffices to update popularities at fixed intervals (say, once each day). Set a decay parameter α (say, 0.95) that tells you how much yesterday's orders count compared to today's. Similarly, orders from two days ago will be worth α*α ≈ 0.9 times as much as today's, and so on. To choose this parameter, note that the weight decays to one half after log(0.5)/log(α) days (about 14 days for α = 0.95).
The implementation only requires a single additional field per product, orders_decayed. Then all you have to do is update this value each night with the total daily orders:
orders_decayed = α * orders_decayed + (1-α) * orders_today.
You can sort your applicable suggestions according to this value.
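The nightly update and the half-life estimate can be sketched like this. The function names and the two order histories are made up for illustration; the update rule itself is the one given above.

```python
import math


def update_decayed(orders_decayed, orders_today, alpha=0.95):
    """One nightly update of the smoothed order count."""
    return alpha * orders_decayed + (1 - alpha) * orders_today


def half_life_days(alpha):
    """Number of days after which an order's weight has dropped to one half."""
    return math.log(0.5) / math.log(alpha)


def smoothed(daily_orders, alpha=0.95):
    """Run the nightly update over a history of daily order counts."""
    value = 0.0
    for orders_today in daily_orders:
        value = update_decayed(value, orders_today, alpha)
    return value


# Hypothetical histories over the same 31 days:
# product A sold 100 units on the first day and nothing since,
# product B sold nothing for 21 days and then 10 units a day.
score_a = smoothed([100] + [0] * 30)
score_b = smoothed([0] * 21 + [10] * 10)
```

Although A has the same total sales as B, its old burst has mostly decayed away, so B ends up with the higher score and would be suggested first.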
To have an individual user experience, you should not rely on a field in the product table, but rather on the history of the user.
The occurrences of the product in past invoices created by the user would be a good starting point. The advantage is that you don't need to add fields or tables for this functionality. You simply rely on data that is already present anyway.
Since it is an auto-complete field, maybe past usage is not really relevant. Display n search results as the user types. If you feel that results are better if you include recency in the calculation of the order, go with it.
Now, the implementation may differ depending on how and when products should be displayed, and on whether the usage frequency has to be user-specific or application-wide (overall). In both cases, I would suggest having a history table, which you can later use for other analysis.
You could design your history table with at least the following columns:
Id | ProductId | LastUsed (timestamp) | UserId
Now you can create a view that queries this table for a specific time range (for example, product frequency over the last week, last month, or last year) and gives you the most sold products for that range.
The same can be used for user-specific frequency by adding a condition that filters by UserId.
I'm thinking about adding a new field recency which would be increased by 1 every time the product was used, and decreased by 1/(count of all products), when another product is used. Then use this recency field for ordering, but it doesn't seem to me the best solution.
Yes, it is not good practice to add a column for this and update it every time. Imagine this product is highly anticipated and people love to buy it. If 1000 people or more request the product at the same time, every request updates the same row. To maintain consistency, the database has to lock that row for each update, which will definitely hurt your database and application performance. Instead, you can simply insert a new row.
Another possible solution is to use your existing invoice table, since it will already hold all the product- and user-specific information, and create a view on it to get the frequently used products, as I mentioned above.
Please note that this is just another option for achieving what you want. Personally, I would recommend having a history table instead.
The scenario
When the user creates a new invoice the product name field should be an autocomplete field which shows the most recently used products from the product catalogue.
your suggested solution
How can I store this "usage recency/frequency" in the database?
If it is a web application, don't store it in a database on your server. Each user makes different choices.
Store it in the user's browser, as a cookie or in localStorage, because that will improve the user experience.
If you still want to store it in a MySQL table, do the following:
Create a column recency, as described in the question.
Each time the item is used, increase the count by 1, as described in the question.
Don't decrease it when other items are used.
To get the most used item, run this query:
SELECT * FROM products WHERE recency = (SELECT MAX(recency) FROM products);
Side note
Use the database only if you want to show the most used products regardless of the user.
As you aren't certain which measure to choose, and it's really a user-experience problem, I advise you to offer a number of measures and give the user the option to choose the one he/she prefers. For example, the set of available measures could include the most popular products of the last week, last month, last 3 months, last year, and overall. For the sake of performance, I'd store those statistics in a separate table that is refreshed by a scheduled job running, for example, every 3 hours.

How to create tables for an unknown number of events with SQLite

I'm about to start developing an iPhone application, but I have a question/problem.
I was thinking about storing all the data in one single, huge table, but while drawing a schema I noticed that I'll be storing events: trigger events placed in IBActions or in viewDidLoad methods. I'll keep the count, but the real question is that I need to store the dates and timestamps of these events as well. For example, one user may trigger "home screen appeared" 100 times. Keeping the count is easy, but how can I store the dates? Should I create a separate table to keep each event and its timestamps?
If that's the case, I don't know how many events there will be. Wouldn't that leave me with a lot of garbage tables?
In the end I'll send this SQLite data to my back end, so it should be neat.
Can this be done in one single table? Am I missing something?
To do this in one table, you would need a record ( row ) for each event. You could
select count(1) from events ....
to get the count, order by date_created with a LIMIT N clause to get the N most recent, etc. If you insisted on keeping just one row per event, then no, I can't think of a clean way to keep track of all event dates without a second table.
To answer your other question, though: you can automatically assign the date of a record's entry by defining the column like this:
DATE_CREATED DATE DEFAULT CURRENT_TIMESTAMP
and not including that field on your insert statement. That is really your cleanest solution.
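Here is a small sketch of the one-row-per-event layout, run through Python's sqlite3 module so it can be tried outside an iPhone project. The events table and date_created column come from the answer above; the id and name columns and the sample event names other than "home screen appeared" are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " date_created DATE DEFAULT CURRENT_TIMESTAMP)"
)

# One row per occurrence; date_created is left out of the INSERT,
# so SQLite fills it in with the current UTC date and time.
for name in [
    "home screen appeared",
    "settings opened",
    "home screen appeared",
    "home screen appeared",
]:
    conn.execute("INSERT INTO events (name) VALUES (?)", (name,))

# The count per event type is derived from the rows, not stored separately.
home_count = conn.execute(
    "SELECT COUNT(1) FROM events WHERE name = ?",
    ("home screen appeared",),
).fetchone()[0]

# The two most recent events; id breaks ties within the same second.
latest = conn.execute(
    "SELECT name, date_created FROM events"
    " ORDER BY date_created DESC, id DESC LIMIT 2"
).fetchall()
```

With this layout a new kind of event is just a new value in the name column, so the number of event types does not need to be known up front.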

How should I setup the structure of my MySQL database to work for my needs?

I am working on an application that awards the top person of each category for being first. You become first in a category by having the most votes in the past 30 (or so) days. So even if you had a total of 2,000 votes but got only 2 of them within the past 30 days, someone with 10 votes who got all 10 within the past 30 days would be ranked above you. I am seeking advice on the best way to create this type of system with a MySQL database and on how to structure the database.
I am pretty unsure of the best way to go about this, any advice would be greatly appreciated!
The first decision you have to make is whether you want to keep a record for every vote cast. This can produce a huge table, but it lets you keep a lot of information, so you trade storage and performance for information. This must be decided by business logic, not implementation.
Assuming you DO want to keep every vote, store each one with a timestamp. Then all you have to do is join the person table with the vote table, use a WHERE clause to select only the last N days, and use a COUNT() aggregate to count the votes.
If you do NOT want to keep every vote, you should have an accumulation table with person, day and votecount - an analogous query with SUM() instead of COUNT() will do what you want.
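The keep-every-vote variant might look like the following sketch. The table and column names (person, vote, voted_at), the sample people, and the ranking function are assumptions, the category dimension is left out for brevity, and SQLite replaces MySQL so the example is runnable as is. The cutoff is computed in Python; in MySQL it could be expressed with date arithmetic in the WHERE clause instead.

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE vote ("
    " id INTEGER PRIMARY KEY,"
    " person_id INTEGER NOT NULL REFERENCES person(id),"
    " voted_at TEXT NOT NULL)"
)
conn.executemany("INSERT INTO person VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])
votes = (
    [(1, "2024-01-10 12:00:00")] * 20    # Alice: many old votes
    + [(1, "2024-06-20 12:00:00")] * 2   # Alice: 2 recent votes
    + [(2, "2024-06-25 12:00:00")] * 10  # Bob: 10 recent votes
)
conn.executemany("INSERT INTO vote (person_id, voted_at) VALUES (?, ?)", votes)


def ranking(conn, now, days=30):
    """People ordered by the number of votes received in the last `days` days."""
    cutoff = (now - timedelta(days=days)).strftime("%Y-%m-%d %H:%M:%S")
    return conn.execute(
        "SELECT p.name, COUNT(v.id) AS recent_votes"
        " FROM person p JOIN vote v ON v.person_id = p.id"
        " WHERE v.voted_at >= ?"
        " GROUP BY p.id, p.name"
        " ORDER BY recent_votes DESC",
        (cutoff,),
    ).fetchall()


top = ranking(conn, datetime(2024, 6, 30))
```

Bob ranks first over the 30-day window even though Alice has more votes overall, which is the behaviour the question asks for.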

MYSQL Database Schema Question

I need opinions on the best way to go about creating a table or collection of tables to handle this problem. Basically, I'm designing a site with business profiles. The profile table contains all the usual things such as name, uniqueID, address, etc. The whole idea of the site is that it's going to collect a small string of informative text. I want to allow the clients to store one per date, as many as 30 days in advance. The program is only going to show the information from the current date onward, with expired dates not being shown.
The only way I can really see this being done is a table consisting of the uniqueID, date, and the informative block of text, but this creates pretty extensive queries. Eventually this table is going to be at least 20 times larger than the table of businesses in the first place as these businesses are going to be able to post up to 30 items in this table using their uniqueID.
Now, imagine the search page brings up a list of businesses in the area. It then has to query the new table for all of those ids to get the block of information I want to show based on the date. I'm pretty sure it would be a rather intensive couple of queries just to show a rather simple block of text, but I imagine this is how status updates work for social networking sites in general? Does Facebook store updates in a table of updates tied to a user's ID number, or have they come up with a better way?
I'm just trying to gain a little more insight into DB design, so throw out any ideas you might have.
The only way I can really see this being done is a table consisting of the uniqueID, date, and the informative block of text...
Assuming you mean the profile uniqueID, and not a unique ID for the text table, you're correct.
As pascal said in his comment, you'd need a primary key on uniqueID and date, so that a person can only enter one row of text for a given date.
If you want to retrieve the next text row for a person, your SQL query would have the following clauses:
WHERE UNIQUE_ID = PROFILE.UNIQUE_ID
AND DATE >= CURRENT_DATE
ORDER BY DATE
LIMIT 1
Since you have an index on uniqueID and date, this should be a fast query.
If you want to retrieve the next 5 texts for a particular person, you'd just have to make one change:
WHERE UNIQUE_ID = PROFILE.UNIQUE_ID
AND DATE >= CURRENT_DATE
ORDER BY DATE
LIMIT 5
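To make the table layout concrete, here is a sketch under some assumptions: the table name profile_text, the function upcoming, and the sample rows are invented, and SQLite stands in for MySQL. An explicit ORDER BY on the date is included so that LIMIT returns the nearest upcoming rows, and the current date is passed in as a parameter to keep the example deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE profile_text ("
    " unique_id INTEGER NOT NULL,"
    " date TEXT NOT NULL,"
    " text TEXT NOT NULL,"
    " PRIMARY KEY (unique_id, date))"  # one text per profile per date
)
conn.executemany(
    "INSERT INTO profile_text VALUES (?, ?, ?)",
    [
        (1, "2024-05-01", "expired special"),
        (1, "2024-05-10", "today's special"),
        (1, "2024-05-11", "tomorrow's special"),
        (2, "2024-05-10", "another business"),
    ],
)


def upcoming(conn, unique_id, today, limit=1):
    """Next `limit` texts for one profile, starting from `today`."""
    return conn.execute(
        "SELECT date, text FROM profile_text"
        " WHERE unique_id = ? AND date >= ?"
        " ORDER BY date LIMIT ?",
        (unique_id, today, limit),
    ).fetchall()


next_text = upcoming(conn, 1, "2024-05-10")
next_five = upcoming(conn, 1, "2024-05-10", limit=5)
```

The composite primary key doubles as the index that serves the lookup, and it rejects a second text for the same profile and date.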

MySQL Query eliminate duplicates but only adjacent to each other

I have the following query..
SELECT Flights.flightno,
Flights.timestamp,
Flights.route
FROM Flights
WHERE Flights.adshex = '400662'
ORDER BY Flights.timestamp DESC
Which returns the following screenshot.
However I cannot use a simple group by as for example BCS6515 will appear a lot later in the list and I only want to "condense" the rows that are the same next to each other in this list.
An example of the output (note BCS6515 twice in this list as they were not adjacent in the first query)
Which is why a GROUP BY flightno will not work.
I don't think there's a good way to do so in SQL without a column to help you. At best, I'm thinking it would require a subquery that would be ugly and inefficient. You have two options that would probably end up with better performance.
One would be to code the logic yourself to prune the results. (Added:) This can be done with a procedure clause of a select statement, if you want to handle it on the database server side.
Another would be to either use other information in the table or add new information to the table for this purpose. Do you currently have something in your table that is a different value for each instance of a number of BCS6515 rows?
If not, and if I'm making correct assumptions about the data in your table, there will be only one flight with the same number per day, though the flight number is reused to denote a flight with the same start/end and times on other days. (e.g. the 10a.m. from NRT to DTW is the same flight number every day). If the timestamps were always the same day, then you could use DAY(timestamp) in the GROUP BY. However, that doesn't allow for overnight flights. Thus, you'll probably need something such as a departure date to group by to identify all the rows as belonging to the same physical flight.
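The first option, pruning the result set in application code, could look like this sketch. It assumes the rows arrive already ordered by timestamp as in the query from the question; the function name collapse_adjacent and the sample rows are hypothetical (only the flight number BCS6515 appears in the question).

```python
from itertools import groupby


def collapse_adjacent(rows):
    """Keep the first row of every run of consecutive rows that share a
    flight number. Rows are (flightno, timestamp, route) tuples."""
    return [next(run) for _, run in groupby(rows, key=lambda row: row[0])]


# Made-up rows, newest first, as ORDER BY timestamp DESC would return them.
rows = [
    ("BCS6515", 50, "route 1"),
    ("BCS6515", 40, "route 1"),
    ("XYZ100", 30, "route 2"),
    ("BCS6515", 20, "route 1"),
    ("BCS6515", 10, "route 1"),
]
condensed = collapse_adjacent(rows)
```

Unlike GROUP BY flightno, this keeps BCS6515 twice, because the two runs are separated by another flight.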
GROUP BY does not work because the timestamp value is different for the two BCS6515 records.
It will work only if you leave the timestamp out of the query:
SELECT Flights.flightno,
Flights.route
FROM Flights
WHERE Flights.adshex = '400662'
GROUP BY (Flights.flightno)