SQL: deciding whether to use it or not - mysql

Hello:
I did some text processing on a database using shell scripts and Python. For interoperability, I am thinking of doing it with SQL instead. SQL is good for some query tasks, but I am not sure whether SQL can handle all of my tasks. Consider one example database:
item | time | value
-----+------+------
   1 |  134 |     3
   2 |  304 |     1
   3 |  366 |     2
   4 |  388 |     2
   5 |  799 |     6
   6 |  111 |     7
I need to profile the sum of the values over fixed time intervals. Suppose the interval size is 100; the result should be:
time_interval | sumvalue
--------------+---------
            1 |       10    -- the time interval from 100 to 199
            3 |        5    -- the time interval from 300 to 399
            7 |        6    -- the time interval from 700 to 799
I could not find a better way to do this in my SQL textbook than falling back to shell and Python.
So, my SO friends, any suggestions?
Thanks!

You should be able to do it in MySQL with a pretty simple query:
SELECT time DIV 100, SUM(value) FROM yourtable
GROUP BY time DIV 100
The query takes advantage of the fact that integer division by 100 gives you exactly the interval groupings you described (e.g. 111 DIV 100 = 1 and 134 DIV 100 = 1).
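To match the column names in your desired output, the same query can be written with aliases (just a small variation on the query above, not a different technique):
SELECT time DIV 100 AS time_interval,
       SUM(value)   AS sumvalue
FROM yourtable
GROUP BY time DIV 100
ORDER BY time_interval;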

The question is not clear to me.
There is a database and you want to process data from it, and you are asking whether or not to use SQL? Answer: yes. SQL is an interface to many databases; it is quite standard across the major databases, with minor differences. Use it.
If you cannot decide whether to use a database at all for storing and processing these values, then the data type, the amount of data, and the relations within the data are what matter. If you want to handle a large amount of data and there are relations between datasets, then you may want to use a relational database system such as MySQL. The problem you describe is a very simple one for an RDBMS. Let me give an example:
select sum(value) from items
where time >= 100 and time < 200
But if the dataset is small, you can easily handle it with file I/O.
If you are using Python, you may want to use SQLite as the database; it is a very lightweight, simple, easy-to-use and widely used database, and you can use SQL with SQLite too.
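For illustration, a quick sketch of the same interval query in SQLite (the table name items and the column types are assumptions, not from your post); dividing two INTEGER columns in SQLite truncates, so it gives the same grouping as MySQL's DIV:
CREATE TABLE items (item INTEGER PRIMARY KEY, time INTEGER, value INTEGER);
INSERT INTO items (item, time, value) VALUES
  (1, 134, 3), (2, 304, 1), (3, 366, 2),
  (4, 388, 2), (5, 799, 6), (6, 111, 7);

-- integer division by 100 buckets the rows into 100-wide intervals
SELECT time / 100 AS time_interval, SUM(value) AS sumvalue
FROM items
GROUP BY time / 100;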
If you can give clearer details, we can help more.

Yes, a SQL-based database like MySQL will probably be a fine choice for your project. You may also want to look at SQLite if you don't want to have to set up a server.
A good introductory text on SQL would be helpful for you. I suggest SQL For Dummies by Allen Taylor.

Related

Questionnaire Database Structure for Analysis in SPSS

I am using MySQL to store questionnaire data for a study. The structure of the questionnaire is fairly simple. Each participant will complete four identical questionnaires - baseline (0 weeks), 6 weeks, 12 weeks, 36 weeks. There are 30 questions which all use a coded Likert Scale.
My proposed table to store the responses was like so:
ID | Participant | Week | Q1 | Q2 | Q3 | Q4 ...
That way I can insert a new row for each response. However, I spoke with the statistician yesterday (a professor) who told me that for analysis in SPSS it would be preferable if the data was structured more like so:
ID | Participant | W0Q1 | W0Q2 ... W6Q1 | W6Q2 ... W12Q1 | W12Q2 ...
In this case, I would have to update the entry for each participant rather than inserting. It sounded illogical to me.
I only have limited experience with SPSS. What would be the general consensus on this matter?
Whether you would use long form (the first) or wide form (the second) depends on the analysis you will do. However, SPSS Statistics provides the CASESTOVARS and VARSTOCASES commands, which make it easy to restructure either one into the other, so it doesn't much matter how you structure the cases initially.
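For reference, a minimal MySQL sketch of the long-form layout (the first structure); the exact names and types are assumptions, not from the question:
-- one row per participant per questionnaire wave
CREATE TABLE responses (
  ID          INT UNSIGNED NOT NULL AUTO_INCREMENT,
  Participant INT UNSIGNED NOT NULL,
  Week        TINYINT UNSIGNED NOT NULL,            -- 0, 6, 12 or 36
  Q1 TINYINT, Q2 TINYINT, Q3 TINYINT, Q4 TINYINT,   -- ... and so on up to Q30
  PRIMARY KEY (ID),
  UNIQUE KEY uq_participant_week (Participant, Week)
);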

Aggregate data tables

I am building a front-end to a largish database (tens of millions of rows). The data is water usage for loads of different companies, and the table looks something like:
id | company_id | datetime | reading | used | cost
=============================================================
1 | 1 | 2012-01-01 00:00:00 | 5000 | 5 | 0.50
2 | 1 | 2012-01-01 00:01:00 | 5015 | 15 | 1.50
....
On the front-end, users can select how they want to view the data, e.g. 6-hourly increments, daily increments, monthly, etc. What would be the best way to do this quickly? Given how often the data changes and how rarely any one set of data will be viewed, caching the query results in memcached or something similar is almost pointless, and there is no way to build the data beforehand as there are too many variables.
I figured using some kind of aggregate table would work, having tables such as readings, readings_6h, readings_1d with exactly the same structure, just already aggregated.
If this is a viable solution, what is the best way to keep the aggregate tables up to date and accurate? Apart from the data coming in from the meters, the table is read-only; users never have to update or write to it.
A number of possible solutions include:
1) stick to doing queries with group / aggregate functions on the fly
2) doing a basic select and saving the result:
SELECT `company_id`, CONCAT_WS(' ', DATE(`datetime`), '23:59:59') AS datetime,
       MAX(`reading`) AS reading, SUM(`used`) AS used, SUM(`cost`) AS cost
FROM `readings`
WHERE `datetime` > '$lastUpdateDateTime'
GROUP BY `company_id`, DATE(`datetime`)
3) INSERT ... ON DUPLICATE KEY UPDATE (not sure how the aggregation would be done here, or how to make sure the data stays accurate, i.e. nothing counted twice and no missing rows):
INSERT INTO `readings_6h` ...
SELECT FROM `readings` ....
ON DUPLICATE KEY UPDATE .. calculate...
4) other ideas / recommendations?
I am currently doing option 2, which takes around 15 minutes to aggregate ±100k rows into ±30k rows across the aggregate tables (_6h, _1d, _7d, _1m, _1y).
TL;DR: What is the best way to view / store aggregate data for numerous reports that can't be cached effectively?
This functionality would be best served by a feature called materialized view, which MySQL unfortunately lacks. You could consider migrating to a different database system, such as PostgreSQL.
There are ways to emulate materialized views in MySQL using stored procedures, triggers, and events. You create a stored procedure that updates the aggregate data. If the aggregate data has to be updated on every insert you could define a trigger to call the procedure. If the data has to be updated every few hours you could define a MySQL scheduler event or a cron job to do it.
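A rough sketch of the procedure-plus-event variant, for the mysql command-line client (readings and readings_1d come from the question, but the day column and the primary/unique key on (company_id, day) in readings_1d are assumptions):
DELIMITER //
CREATE PROCEDURE refresh_readings_1d()
BEGIN
  -- Rebuild the daily aggregate for the last two days only, so that
  -- late-arriving rows for "yesterday" are still picked up.
  -- REPLACE needs a unique key on (company_id, day) to act as an upsert.
  REPLACE INTO readings_1d (`company_id`, `day`, `reading`, `used`, `cost`)
  SELECT `company_id`,
         DATE(`datetime`) AS `day`,
         MAX(`reading`)   AS reading,
         SUM(`used`)      AS used,
         SUM(`cost`)      AS cost
  FROM `readings`
  WHERE `datetime` >= CURRENT_DATE - INTERVAL 1 DAY
  GROUP BY `company_id`, DATE(`datetime`);
END //
DELIMITER ;

-- refresh every hour via the event scheduler
-- (requires SET GLOBAL event_scheduler = ON)
CREATE EVENT ev_refresh_readings_1d
  ON SCHEDULE EVERY 1 HOUR
  DO CALL refresh_readings_1d();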
There is a combined approach, similar to your option 3, that does not depend on the dates of the input data; imagine what would happen if some new data arrives a moment too late and does not make it into the aggregation. (You might not have this problem, I don't know.) You could define a trigger that inserts new data into a "backlog," and have the procedure update the aggregate table from the backlog only.
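A minimal sketch of that backlog idea (again with assumed names, for the mysql command-line client):
-- copy of the readings structure used purely as a staging area
CREATE TABLE readings_backlog LIKE readings;

DELIMITER //
CREATE TRIGGER trg_readings_backlog
AFTER INSERT ON readings
FOR EACH ROW
BEGIN
  INSERT INTO readings_backlog (id, company_id, `datetime`, reading, used, cost)
  VALUES (NEW.id, NEW.company_id, NEW.`datetime`, NEW.reading, NEW.used, NEW.cost);
END //
DELIMITER ;
The refresh procedure would then aggregate whatever is sitting in readings_backlog and delete those rows afterwards, instead of re-scanning readings by date.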
All these methods are described in detail in this article: http://www.fromdual.com/mysql-materialized-views

Database suggestions for storing a "count" for every hour of the day

I have had an online archive service for over a year now. Unfortunately, I didn't put in the infrastructure to keep statistics. All I have now are the archive access logs.
For every hour there are two audio files (0-30 min in one and 30-60 min in the other). I've currently used MySQL to store the counts. It looks something like this:
| DATE | TIME | COUNT |
| 2012-06-12 | 20:00 | 39 |
| 2012-06-12 | 20:30 | 26 |
| 2012-06-12 | 21:00 | 16 |
and so on...
That makes 365 days * 24 hrs * 2 (two halves per hour) > 17,500 rows. This makes reads/writes slow, and I feel a lot of space is wasted storing it this way.
So do you know of any other database that will store this data more efficiently and is faster?
That's not too many rows. If it's properly indexed, reads should be pretty fast (writes will be a little slower, but even with tables of up to about half a million rows I hardly notice).
If you are selecting items from the database using something like
select * from my_table where date='2012-06-12'
then you need to make sure that you have an index on the date column. You can also create multi-column indexes if you are using more than one column in your WHERE clause. That will make your read statements very fast (as I said, up to on the order of a million rows).
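For example (using the hypothetical my_table from the query above; the index names are made up):
-- single-column index for WHERE date = '...'
ALTER TABLE my_table ADD INDEX idx_date (`date`);

-- composite index if you also filter on the time column
ALTER TABLE my_table ADD INDEX idx_date_time (`date`, `time`);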
If you're unacquainted with indexes, see here:
MySQL Indexes

mysql query - optimizing existing MAX-MIN query for a huge table

I have a query that produces the right results, but it takes about 45 seconds to run. That's definitely too long for presenting the data in a GUI.
So what I need is a much faster/more efficient query (something around a few milliseconds would be nice).
My data table has ~2,619,395 entries and is still growing.
Schema:
num | station | fetchDate | exportValue | error
1 | PS1 | 2010-10-01 07:05:17 | 300 | 0
2 | PS2 | 2010-10-01 07:05:19 | 297 | 0
923 | PS1 | 2011-11-13 14:45:47 | 82771 | 0
Explanation
the exportValue is always incrementing
the exportValue represents the actual absolute value
in my case there are 10 stations
every ~15 minutes 10 new entries are written to the table
error is just an indicator of whether a station is working properly
Working query:
select YEAR(fetchDate), station, MAX(exportValue) - MIN(exportValue)
from registros
where exportValue > 0 and error = 0
group by station, YEAR(fetchDate)
order by YEAR(fetchDate), station
Output:
Year | station | Max-Min
2008 | PS1 | 24012
2008 | PS2 | 23709
2009 | PS1 | 28102
2009 | PS2 | 25098
My thoughts on it:
writing several queries with BETWEEN clauses, like 'between 2008-01-01 and 2008-01-02' to fetch the MIN(exportValue) and 'between 2008-12-30 and 2008-12-31' to grab the MAX(exportValue) - problem: a lot of queries, plus the problem of having no data in a specified time range (it's not guaranteed that there will be data)
limiting the result set to my 10 stations only, using order by MIN(fetchDate) - problem: this also takes a long time to process
Additional Info:
I'm using the query in a Java application (JPA 2.0), so it would be possible to do some post-processing on the result set if necessary.
Any help/approaches/ideas are very appreciated. Thanks in advance.
Adding suitable indexes will help.
2 compound indexes will speed things up significantly:
ALTER TABLE tbl_name ADD INDEX (error, exportValue);
ALTER TABLE tbl_name ADD INDEX (station, fetchDate);
With those indexes in place, this query should be very fast.
Suggestions:
Do you have a PK set on this table? station, fetchDate?
Add indexes; you should experiment with indexes as rich.okelly suggested in his answer.
Depending on how those experiments go, try breaking your query into multiple statements inside one stored procedure; this way you will not lose time on network traffic between multiple queries sent from the client to MySQL.
You mentioned that you tried separate queries and that there is a problem when there is no data for a particular month; this is a regular case in business applications, and you should handle it in a "master query" (stored procedure or application code).
I guess fetchDate is the current date and time at the moment of record insertion; consider keeping previous months' data in a summary table with the fields year, month, station, max(exportValue), min(exportValue) - this means inserting summary records into the summary table at the end of each month; deleting, keeping or moving the detail records to a separate table is your choice.
Since your table is growing rapidly (new rows every 15 minutes), you should take that last suggestion into account; a sketch of such a summary table is shown below. There is probably no need to keep the detailed history in one place, and archiving data is a process that should be done as part of maintenance.
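A minimal sketch of that summary-table idea (registros and its columns come from the question; the table name registros_monthly and the column types are assumptions):
CREATE TABLE registros_monthly (
  year      SMALLINT    NOT NULL,
  month     TINYINT     NOT NULL,
  station   VARCHAR(10) NOT NULL,
  maxExport INT         NOT NULL,
  minExport INT         NOT NULL,
  PRIMARY KEY (year, month, station)
);

-- run at the end of each month (here: summarise the previous calendar month)
INSERT INTO registros_monthly (year, month, station, maxExport, minExport)
SELECT YEAR(fetchDate), MONTH(fetchDate), station,
       MAX(exportValue), MIN(exportValue)
FROM registros
WHERE exportValue > 0 AND error = 0
  AND fetchDate >= DATE_FORMAT(CURRENT_DATE - INTERVAL 1 MONTH, '%Y-%m-01')
  AND fetchDate <  DATE_FORMAT(CURRENT_DATE, '%Y-%m-01')
GROUP BY YEAR(fetchDate), MONTH(fetchDate), station;
The yearly report then only has to touch the small summary table:
SELECT year, station, MAX(maxExport) - MIN(minExport)
FROM registros_monthly
GROUP BY year, station
ORDER BY year, station;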

Best to build a SQL Query or extrapolate with another program?

I am having trouble developing some queries on the fly for our clients, and I sometimes find myself asking: "Would it be better to start with a subset of the data I know I'm looking for, then just import it into a program like Excel and process it there with functions such as pivot tables?"
One instance in particular I am struggling with is the following example:
I have an online member enrollment system. For simplicity's sake, let's assume the data captured is: member ID, sign-up date, referral code, and US state.
A sample member table may look like the following:
MemberID | Date | Ref | USState
=====================================
1 | 2011-01-01 | abc | AL
2 | 2011-01-02 | bcd | AR
3 | 2011-01-03 | cde | CA
4 | 2011-02-01 | abc | TX
and so on....
Ultimately, the types of queries I want to build and run with this data set can extend to:
"Show me a list of all referral codes and the number of sign ups they had by each month in a single result set".
For example:
Ref | 2011-01 | 2011-02 | 2011-03 | 2011-04
==============================================
abc | 1 | 1 | 0 | 0
bcd | 1 | 0 | 0 | 0
cde | 1 | 0 | 0 | 0
To be honest, I have no idea how to build this type of query in MySQL (I imagine that if it can be done, it would require a LOT of code: joins, subqueries, and unions).
Similarly, another sample query might be how many members signed up in each state, by month:
USState | 2011-01 | 2011-02 | 2011-03 | 2011-04
==============================================
AL | 1 | 0 | 0 | 0
AR | 1 | 0 | 0 | 0
CA | 1 | 0 | 0 | 0
TX | 0 | 1 | 0 | 0
I suppose my question is twofold:
1) Is it in fact best to try to build these out with the necessary data from within a MySQL GUI such as Navicat, or just to import the entire subset of data into Excel and work from there?
2) If I were to go the MySQL route, what is the proper way to build the subsets of data in the examples mentioned above? (Note that the queries could become far more complex, such as "show how many sign-ups came in for each particular month, by each state, and grouped by each agent as well" - each agent has 50 possible rows.)
Thank you so much for your assistance ahead of time.
I am a proponent of doing this kind of querying on the server side, at least to get just the data you need.
You should create a time-periods table. It can get as complex as you desire, going down to days even.
id | year | month | monthstart | monthend
==========================================
1  | 2011 | 1     | 1/1/2011   | 1/31/2011
...
This gives you almost limitless ability to group and query data in all sorts of interesting ways.
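A minimal sketch of such a table in MySQL (names chosen to match the query below; the exact types are assumptions):
CREATE TABLE months (
  id         INT UNSIGNED NOT NULL AUTO_INCREMENT,
  year       SMALLINT NOT NULL,
  month      TINYINT  NOT NULL,
  monthstart DATE     NOT NULL,
  monthend   DATE     NOT NULL,
  PRIMARY KEY (id)
);

INSERT INTO months (year, month, monthstart, monthend) VALUES
  (2011, 1, '2011-01-01', '2011-01-31'),
  (2011, 2, '2011-02-01', '2011-02-28');
-- ... and so on for every period you want to report on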
Getting the data for the original referral counts by month query you mentioned would be quite simple...
select a.Ref, b.year, b.month, count(*) as referralcount
from myTable a
join months b on a.Date between b.monthstart and b.monthend
group by a.Ref, b.year, b.month
order by a.Ref, b.year, b.month
The result set would be in rows like ref = abc, year = 2011, month = 1, referralcount = 1, as opposed to a column for every month. I am assuming that, since getting a larger set of data and manipulating it in Excel was an option, changing the layout of this data wouldn't be difficult.
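If you really do need the month-per-column layout straight out of MySQL, one way is conditional aggregation (a sketch, not part of the answer above; the month columns have to be hard-coded or generated by your application):
-- each SUM counts the rows whose Date falls in that month
-- (a boolean comparison evaluates to 1 or 0 in MySQL)
SELECT Ref,
       SUM(Date >= '2011-01-01' AND Date < '2011-02-01') AS `2011-01`,
       SUM(Date >= '2011-02-01' AND Date < '2011-03-01') AS `2011-02`,
       SUM(Date >= '2011-03-01' AND Date < '2011-04-01') AS `2011-03`,
       SUM(Date >= '2011-04-01' AND Date < '2011-05-01') AS `2011-04`
FROM myTable
GROUP BY Ref
ORDER BY Ref;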
Check out this previous answer that goes into a little more detail about the concept with different examples: SQL query for Figuring counts by month
I work on an Excel-based application that deals with multi-dimensional time series data, and I have recently been working on implementing predefined pivot table spreadsheets, so I know exactly what you're thinking. I'm a big proponent of giving users tools rather than writing up individual reports or a whole query language for them to use. You can create pivot tables on the fly that connect to the database, and it's not that hard. Andrew Whitechapel has a great example here. But you will also need to launch that in Excel or set up a basic Excel VSTO program, which is fairly easy to do in Visual Studio 2010. (microsoft.com/vsto)
Another thing: don't feel like you have to create ridiculously complex queries. Every join you add will slow down any relational database. I discovered years ago that doing multi-step queries into temp tables is, in most cases, much clearer, faster, and easier to write and support.
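As an illustration of that multi-step, temp-table style (a sketch using the same assumed names as above):
-- step 1: pull just the slice of data you need into a temporary table
CREATE TEMPORARY TABLE tmp_signups AS
SELECT Ref, DATE_FORMAT(Date, '%Y-%m') AS yearmonth
FROM myTable
WHERE Date >= '2011-01-01' AND Date < '2012-01-01';

-- step 2: aggregate the (much smaller) temporary table
SELECT Ref, yearmonth, COUNT(*) AS signups
FROM tmp_signups
GROUP BY Ref, yearmonth;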