I currently have a SQL table with columns timestamp, station_code, temperature, humidity, and pressure.
Together, timestamp and station_code form the unique key.
The data is gathered at regular 5-minute intervals.
I am trying to restructure the data into 3 tables (temperature, humidity, and pressure) that would have unique timestamps for rows and station_code values for columns.
I am currently at a loss on how the join statement would go.
Any help would be appreciated.
With the data in Excel, I can easily produce a pivot table.
The end goal here is to pull out the specific sensor data for analysis.
As long as the timestamps match exactly, you would just join on timestamp and station_code...
... but since MySQL does not have a true FULL OUTER JOIN, you would either have to guarantee that all 3 tables always have a match, or that specific tables always have data in a certain order (so that a LEFT JOIN could be used). Either of those constraints suggests you probably should not decompose the original table.
It's unlikely to save you much space either as the redundant timestamp and station_code values are likely to take as much space as you would save by avoiding storing NULL measurements.
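Note that the pivot itself does not require decomposing the table at all: conditional aggregation over the original table produces one column per station. A minimal sketch, assuming the original table is named readings and the station codes are 'ST01' and 'ST02' (both names are hypothetical; repeat the pattern for humidity and pressure):

SELECT
    `timestamp`,
    MAX(CASE WHEN station_code = 'ST01' THEN temperature END) AS temp_ST01,
    MAX(CASE WHEN station_code = 'ST02' THEN temperature END) AS temp_ST02
FROM readings                     -- hypothetical table name
GROUP BY `timestamp`;

Wrapping each such query in a view gives you the three "tables" without storing anything twice.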
Now I'm trying to figure out what I should do to improve my query's performance.
Currently it takes 47.55 seconds.
So, should I create indexes on some columns? Please tell me.
SELECT bw.workloadId, lrer.lecturerId, lrer.lastname, lrer.name, lrer.fathername,
       bt.title, ac.activityname,
       CAST(bw.exactday AS CHAR(45)) AS "date", bw.exacttime AS "time"
FROM base_workload AS bw
RIGHT JOIN unioncourse AS uc ON uc.idunioncourse = bw.idunioncourse
RIGHT JOIN basecoursea AS bc ON bc.idbasecoursea = uc.idbasecourse
RIGHT JOIN lecturer AS lrer ON lrer.lecturerId = uc.lecturerId
RIGHT JOIN basetitle AS bt ON bt.idbasetitle = bc.idbasetitle
RIGHT JOIN activity AS ac ON ac.activityId = bc.activityId
WHERE lrer.lecturerId IS NOT NULL
  AND bc.idbasecoursea IS NOT NULL
  AND bw.idunioncourse != ""
ORDER BY bw.exactday, bw.exacttime ASC;
From the MySQL 8.0 documentation:
Indexes are used to find rows with specific column values quickly. Without an index, MySQL must begin with the first row and then read through the entire table to find the relevant rows. The larger the table, the more this costs. If the table has an index for the columns in question, MySQL can quickly determine the position to seek to in the middle of the data file without having to look at all the data. This is much faster than reading every row sequentially.
MySQL uses indexes for these operations:
To find the rows matching a WHERE clause quickly.
To eliminate rows from consideration.
If the table has a multiple-column index, any leftmost prefix of the index can be used by the optimizer to look up rows.
To retrieve rows from other tables when performing joins.
To find the MIN() or MAX() value for a specific indexed column key_col.
To sort or group a table if the sorting or grouping is done on a leftmost prefix of a usable index (for example, ORDER BY key_part1, key_part2).
In some cases, a query can be optimized to retrieve values without consulting the data rows.
Given your requirements, you could add indexes on the columns used in the WHERE clause (and in the join conditions) for faster data retrieval.
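As a sketch only (the right choice depends on what EXPLAIN shows and on which of these columns are already primary keys), indexes on the join/filter columns might look like this:

-- index names are hypothetical; check EXPLAIN to see which indexes are actually missing
CREATE INDEX idx_bw_idunioncourse ON base_workload (idunioncourse);
CREATE INDEX idx_uc_idbasecourse  ON unioncourse (idbasecourse);
CREATE INDEX idx_uc_lecturerid    ON unioncourse (lecturerId);

Run EXPLAIN before and after to confirm the optimizer actually uses them.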
I think you can get rid of
lrer.lecturerId is not null
AND bc.idbasecoursea is not null
By changing the first 3 RIGHT JOINs to JOINs.
What is the datatype of exactday? What is the purpose of
cast(bw.exactday as char(45)) as "date"
The CAST may be unnecessary.
Re bw.exactday, bw.exacttime: It is usually better to use a single column for DATETIME instead of two columns (DATE and TIME).
What are the PRIMARY KEYs of the tables?
Please convert to LEFT JOIN if possible; I can't wrap my head around RIGHT JOINs.
This index on bw may help: INDEX(exactday, exacttime).
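Putting those suggestions together, here is a sketch of the rewritten query, on the assumption (mine, not confirmed by the question) that every row you want has a match in all five tables and that exactday is a DATE, so the CAST can be dropped:

SELECT bw.workloadId, lrer.lecturerId, lrer.lastname, lrer.name, lrer.fathername,
       bt.title, ac.activityname,
       bw.exactday AS "date",     -- assumes exactday is a DATE, so the CAST is dropped
       bw.exacttime AS "time"
FROM base_workload AS bw
JOIN unioncourse AS uc   ON uc.idunioncourse = bw.idunioncourse
JOIN basecoursea AS bc   ON bc.idbasecoursea = uc.idbasecourse
JOIN lecturer    AS lrer ON lrer.lecturerId = uc.lecturerId
JOIN basetitle   AS bt   ON bt.idbasetitle = bc.idbasetitle
JOIN activity    AS ac   ON ac.activityId = bc.activityId
WHERE bw.idunioncourse != ""
ORDER BY bw.exactday, bw.exacttime;

With inner joins, the IS NOT NULL tests are no longer needed, and INDEX(exactday, exacttime) on base_workload may help the ORDER BY.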
I have two tables imported into Access from Excel workbooks.
They are narrow tables:
INRMaster: MastDate (Date/Time, Short Date)
CIInput: CIDate (Date/Time, Short Date), INRTestResult (Number), Dose (Number), OutOfRange (Short Text)
The CIInput table was downloaded, and its date field was Date/Time containing both the date and the time of the test. I reformatted that date field to mm/dd/yyyy to match the table I created, INRMaster.
There is no primary key on either table. I also tried the join with the date fields as primary keys in both tables, and it returned nothing as well. I am creating the query using the QBE grid.
The generated SQL is as follows:
SELECT INRMaster.MastDate, CIINput.[INR test result], CIINput.Dose, CIINput.OutofRange
FROM INRMaster INNER JOIN CIINput ON INRMaster.MastDate = CIINput.CIDate
Office 365 Access, Windows 10.
Setting the Format property does not change the data. Unless you actually modify the saved values, the time part is still there, and since it is unlikely the values will agree to the second, the join will fail. Don't apply formatting in the table; view the full saved value.
Extract the date part with an expression in the query. If both fields were saved with date and time components, then extract the date portion from both. Consider:
SELECT INRMaster.MastDate, CIINput.[INR test result], CIINput.Dose, CIINput.OutofRange
FROM INRMaster INNER JOIN CIINput ON Int(INRMaster.MastDate) = Int(CIINput.CIDate);
DO NOT OPEN THE QUERY IN DESIGN VIEW. The Query Designer cannot resolve this join. You would have to join nested subqueries to enable Design View.
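If you do need Design View, here is a sketch of that nested-subquery arrangement (the JoinDate alias is hypothetical):

SELECT M.MastDate, C.[INR test result], C.Dose, C.OutofRange
FROM (SELECT MastDate, Int(MastDate) AS JoinDate FROM INRMaster) AS M
INNER JOIN (SELECT CIDate, Int(CIDate) AS JoinDate, [INR test result], Dose, OutofRange FROM CIINput) AS C
ON M.JoinDate = C.JoinDate;

Because the Int() calls now live inside the derived tables, the outer join condition is a plain equality, which is the arrangement described above as compatible with Design View.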
I have 3 tables. All 3 tables have approximately 2 million rows. Every day, 10,000-100,000 new entries are added. It takes approximately 10 seconds to run the SQL statement below. Is there a way to make this SQL statement faster?
SELECT customers.name
FROM customers
INNER JOIN hotels ON hotels.cus_id = customers.cus_id
INNER JOIN bookings ON bookings.book_id = customers.book_id
WHERE customers.gender = 0 AND
customers.cus_id = 3
LIMIT 25 OFFSET 1;
Of course this statement works fine, but it's slow. Is there a better way to write this code?
All database servers have a form of an optimization engine that determines how best to fetch the data you want. With a simple query such as the SELECT you showed, there isn't going to be any way to greatly improve performance within the SQL. As others have said, sub-queries won't help, as they get optimized into the same plan as joins.
Reduce the number of columns, add indexes, beef up the server if that's an option.
Consider caching. I'm not a MySQL expert, but I found this article interesting and worth a skim: https://www.percona.com/blog/2011/04/04/mysql-caching-methods-and-tips/
Look at the section on summary tables and consider if that would be appropriate. Does pulling every hotel, customer, and booking need to be up-to-the-minute or would inserting this into a summary table once an hour be fine?
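As a rough illustration only (the summary table and its columns are hypothetical, not from the question), an hourly-refreshed summary might look like:

-- Hypothetical summary table; refresh it from a scheduled job (cron or a MySQL EVENT)
CREATE TABLE booking_summary (
  cus_id   INT NOT NULL,
  name     VARCHAR(100) NOT NULL,
  bookings INT NOT NULL,
  PRIMARY KEY (cus_id)
);

-- Rebuild once an hour instead of aggregating the big tables on every request
TRUNCATE booking_summary;
INSERT INTO booking_summary (cus_id, name, bookings)
SELECT c.cus_id, c.name, COUNT(*)
FROM customers AS c
JOIN bookings  AS b ON b.book_id = c.book_id
GROUP BY c.cus_id, c.name;

Queries then read the small summary table instead of joining the multi-million-row tables.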
A subquery won't help, but proper indexes can improve performance, so be sure you have them, e.g.:
create index idx1 on customers(gender, cus_id, book_id, name);
create index idx2 on hotels(cus_id);
create index idx3 on bookings(book_id);
I find it a bit hard to believe that this is related to a real problem. As written, I would expect this to return the same customer name over and over.
I would recommend the following indexes:
customers(cus_id, gender, book_id, name)
hotels(cus_id)
bookings(book_id)
It is really weird that bookings are not linked to a hotel.
First, these indexes cover the query, so the data pages don't need to be accessed. The logic is to start with the where clause and use those columns first. Then add additional columns from the on and select clauses.
Only one column is used for hotels and bookings, so those indexes are trivial.
The use of OFFSET without ORDER BY is quite suspicious. The result set is in indeterminate order anyway, so there is no reason to skip the nominally "first" value.
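If pagination is really intended, here is a sketch of a deterministic version; using c.name as the sort key is my assumption, and any unique column would do:

SELECT c.name
FROM customers AS c
JOIN hotels    AS h ON h.cus_id  = c.cus_id
JOIN bookings  AS b ON b.book_id = c.book_id
WHERE c.gender = 0
  AND c.cus_id = 3
ORDER BY c.name          -- fixes the row order so OFFSET skips a well-defined row
LIMIT 25 OFFSET 1;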
SELECT t1.*
FROM
    ( SELECT key_a, key_b, MAX(date) AS date
      FROM large_table
      WHERE date <= 20150126
      GROUP BY key_a, key_b
    ) AS t2
JOIN large_table AS t1 USING (key_a, key_b, date)
large_table = 1,223,001,206 rows of data
Primary Key key_a,key_b,date
key on key_b
key on date
There are numerous missing dates between rows for a given key_a & key_b; I want the most recent row on or before the date entered.
Is it the MySQL join settings causing it to be slow?
I can copy the entire set of key_a & key_b data into a temp table with an INSERT ... SELECT and then run the same query against the temp table, but why run multiple queries (insert the selection, then select from it) when only one should be needed?
The query above only has 4,128,548 total results in the temp insert all dates table, and the date specific returns under 180,000 total.
It's not table optimization and it's not missing keys. Is it the max sort length, or the join buffer size? I have 128 GB of RAM on a 32-core server running this, so there is no reason for it to be slow. I have just never bulk-inserted a single table this large to run JOIN queries against before; if anyone else has dealt with tables this size, any info is greatly appreciated.
Edit: I have edited the query. Yes, it's late after a long day; it had a DISTINCT that wasn't needed and isn't in the actual query.
WHERE date <= 20150126
GROUP BY key_a, key_b
needs an index starting with date. It's about doing what you can with the WHERE clause, not sparse or dense.
Then... Since the inner query references only 3 columns, building a 'covering' index may be useful. (Probably useful in your case.) So, tack on the other two fields, in either order. Such as
INDEX(`date`, key_a, key_b)
For MyISAM this step is critical. For InnoDB, this is redundant, since each secondary key (such as your INDEX(date)) implicitly includes the rest of the fields of the PK.
No, the PRIMARY KEY(key_a, key_b, date) cannot serve the purpose. It's in the wrong order. Also, it is (if you are using InnoDB) "clustered" with the index.
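For concreteness, a minimal sketch of adding that covering index (the index name is arbitrary):

ALTER TABLE large_table ADD INDEX idx_date_a_b (`date`, key_a, key_b);  -- per the note above: critical on MyISAM, redundant (but valid) on InnoDB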
The query above only has 4,128,548 total results in the temp insert all dates table, and the date specific returns under 180,000 total.
Sorry, I had trouble parsing that. I assume you are saying 4M rows had 'date<...' and the subquery delivered only 180K rows. Hence, the outer query also returned 180K rows.
The first goal is to get through the 4M rows as efficiently as possible. With the index I propose, that might be about 20K blocks (#16KB each) of index scanning. That's 300MB.
Next the MAX and GROUP BY are performed. At 300MB, this will involve a disk tmp table. (See max_heap_table_size and tmp_table_size.)
Then comes the JOIN to fetch t1.*. You are using a good technique for fetching a bunch of rows from a huge table, where you need a GROUP BY (or LIMIT or ...) that is clumsy when done the obvious way. It goes like this: Write the subquery to find the PKs. Get the best index for it. Then JOIN on the PK.
Now for the JOIN. (Again, I assume InnoDB.) Since you are JOINing on the PK, each lookup into t1 will be efficient -- drill down the PK's BTree to find a row. Do that 180K times.
If those 180K lookups are scattered around the table, then this could be 180K disk hits.
Total effort: 20K + 180K = 200K disk hits, possibly less. On commodity spinning disks, this would take about 30 minutes (plus time for the tmp table). (No, only one core will be used. Anyway, I/O is probably the bottleneck.)
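(Rough arithmetic behind that estimate, assuming a typical commodity spinning disk at about 100 random reads per second, which is my assumption: 200,000 disk hits / 100 per second ≈ 2,000 seconds, i.e. a bit over 30 minutes.)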
OPTIMIZE TABLE -- almost always useless.
I assume innodb_buffer_pool_size is about 90G? If things are going to be cached, that is where it would happen (for InnoDB). Since 200K blocks is 3GB, it could be easily cached. That is, if you run the query twice, the first might be 30 minutes, but the second might be less than 3 minutes.
To get more numbers, you could do:
FLUSH STATUS;
SELECT ...;
SHOW SESSION STATUS;
and look for 'Handler%', '%sort%', 'Innodb%' and maybe a few others.
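If the full status output is too long to eyeball, the same counters can be pulled out with LIKE patterns, for example:

SHOW SESSION STATUS LIKE 'Handler%';
SHOW SESSION STATUS LIKE '%sort%';
SHOW SESSION STATUS LIKE 'Innodb%';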
What version are you running? Recent versions have a leapfrog technique that works better for MAX + GROUP BY than what I described; I believe it is the loose index scan. If so, your PK is actually optimal. (Hmmm... I should play around with that.)
PARTITIONing -- I don't see any benefit (for this query).
I'm not a database specialist, therefore I'm coming here for a little help.
I have plenty of measured data and I want to make data manipulation easier for myself. Here is my situation:
There are approx. 10 stations, measuring every day. Each day, a station produces approx. 3,000 rows (with approx. 15 columns) of data. Data has to be downloaded once a day from every station to a centralized server. That means approx. 30,000 rows inserted into the database every day. (Daily counts vary.)
I already have data from the past few years, so for every station I have a few million rows. There are also approx. 20 "dead" stations that no longer operate, but whose data from past years I still have.
Sum this all up and we get 50+ million rows, produced by 30 stations, with approx. 30,000 rows inserted every day. Looking ahead, let's assume 100 million rows in the database.
My question is obvious: how would you suggest storing this data?
Measured values (columns) are only numbers (INT or DOUBLE, plus a DATETIME); there is no text or fulltext search, so basically the only index I need is on the DATETIME column.
Data will not be updated or deleted. I just need a fast SELECT of a range of data (e.g. from 2010-01-01 to 2010-02-03).
So, as I wrote, I want to use MySQL because it's the database I know best. I've read that it should easily handle this amount of data, but still, I would appreciate any suggestions for this particular situation.
Again:
10 stations, 3,000 rows per day each => approx. 30,000 inserts per day
approx. 40-50 million rows still to be inserted from binary files
the DB is going to grow (100+ million rows)
the only thing I need is to SELECT data as fast as possible
As far as I know, MySQL should handle this amount of data. I also know that my only index will be the date and time in a DATETIME column (that should be faster than other types, am I right?).
The thing I can't decide is whether to create one huge table with 50+ million rows (including a station id column), or to create a separate table for every station. Basically, I don't need to perform any JOINs across stations. If I need time coincidence, I can just select the same time range for each station. Are there any advantages/disadvantages to either approach?
Can anyone confirm or challenge my thoughts? Do you think there is a better solution? I appreciate any help or discussion.
MySQL should be able to handle this pretty well. Instead of indexing just your DATETIME column, I suggest you create two compound indexes, as follows:
(datetime, station)
(station, datetime)
Having both these indexes in place will help accelerate queries that choose date ranges and group by stations or vice versa. The first index will also serve the purpose that just indexing datetime will serve.
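Concretely, a sketch using a hypothetical table name measurements (the column names follow the index specs above):

ALTER TABLE measurements              -- hypothetical table name
  ADD INDEX idx_dt_station (`datetime`, station),
  ADD INDEX idx_station_dt (station, `datetime`);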
You have not told us what your typical query is. Nor have you told us whether you plan to age out old data. Your data is an obvious candidate for range partitioning (http://dev.mysql.com/doc/refman/5.6/en/partitioning-range.html) but we'd need more information to help you design a workable partitioning criterion.
Edit after reading your comments.
A couple of things to keep in mind as you build up this system.
First, don't bother with partitions for now.
Second, I would get everything working with a single table. Don't split stuff by station or year. Get yourself the fastest disk storage system you can afford and a lot of RAM for your MySQL server and you should be fine.
Third, take some downtime once in a while to do OPTIMIZE TABLE; this will make sure your indexes are good.
Fourth, don't use SELECT * unless you know you need all the columns in the table. Why? Because
SELECT datetime, station, temp, dewpoint
FROM table
WHERE datetime >= DATE(NOW() - INTERVAL 60 DAY)
ORDER BY station, datetime
can be directly satisfied from sequential access to a compound covering index on
(station, datetime, temp, dewpoint)
whereas
SELECT *
FROM table
WHERE datetime >= DATE(NOW() - INTERVAL 60 DAY)
ORDER BY station, datetime
needs to random-access your table. You should read up on compound covering indexes.
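For reference, a sketch of creating that covering index, again using the hypothetical table name measurements:

CREATE INDEX idx_station_dt_cover ON measurements (station, `datetime`, temp, dewpoint);  -- table name is an assumption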
Fifth, avoid the use of functions with column names in your WHERE clauses. Don't say
WHERE YEAR(datetime) >= 2003
or anything like that. MySQL can't use indexes for that kind of query. Instead say
WHERE datetime >= '2003-01-01'
to allow the indexes to be exploited.