Analysis Services Partitioning Issue - sql-server-2008

I have a Measure Group that is partitioned daily. I can process a particular partition, and the XMLA command completes successfully. Furthermore, I have ensured that at least one partition is processed for every Measure Group; therefore my cube is "partially processed" and I should be able to browse it.
The problem... no data can be seen in the cube for any of the Measures within this Measure Group. What is really driving me crazy is that I can capture the SQL command that SSAS is executing against the server, and it returns rows!
Yet sure enough, when I check the properties of the partition I just processed, it says its size is 0.0 MB. It also has no slice; I don't know if that helps.
If anyone has seen anything like this and has any idea... I am all ears.

You must set a partition slice. That is how SSAS determines that the data should reside in that partition. Without the slicer it is discarding the rows read. Check out http://sqlcat.com/technicalnotes/archive/2007/09/11/ssas-partition-slicing.aspx for example.

Your immediate problem is unlikely to be because of the missing slices. As Mosha explains it here, defining slicing details for partitions is very important for performance reasons. Here is a quote from him:
If the data slice value for a partition is set properly, Analysis Services can quickly eliminate irrelevant partitions from the query processing and significantly reduce the amount of physical I/O and processor time needed for many queries issued against MOLAP and HOLAP partitions
Without the data slice, Analysis Services cannot limit a query to the appropriate partitions and must scan each partition even if zero cells will be returned.
The above says that if partition slices are not defined, then SSAS will not be able to optimize certain queries by scanning only from the relevant partitions.
But it also says that with no slices defined it should still return the correct results, albeit probably much slower. As a side-note, it is also implied that if slices ARE defined BUT improperly, then it could happen that wrong results get returned, or nothing at all.
Since your partitions have no slices defined, the problem must instead lie with the SQL query bindings used for creating the partitions. Have you checked that the data source is configured correctly in SSAS? When you were running the query manually you might have been connected to a different SQL Server instance than the one configured for the SSAS cube (e.g. UAT vs. PROD).
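One quick way to rule out an environment mix-up is to run the partition's bound query (or a count over it) against the exact server and database named in the SSAS data source, rather than a saved SSMS connection. A minimal sketch, assuming a hypothetical fact table and date key:

-- Hypothetical check: run this against the server/database configured in the
-- SSAS data source, not whatever instance SSMS happens to be connected to.
-- Zero rows here despite a successful process would match a 0.0 MB partition.
SELECT COUNT(*) AS RowsForPartition
FROM dbo.FactSales
WHERE DateKey = 20080101;   -- the day this daily partition is bound to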

Related

Running SUM across large amount of rows

I have a theoretical question which pertains to the SQL function SUM().
Imagine we have a table which contains a column called "value"
"value" is a DECIMAL number either positive or negative.
In our potential solution, we'd like to run a SUM() across all rows for column "value"
SELECT SUM(value)
FROM table
No problems so far, but the dataset is potentially millions of rows. Possibly even hundreds of millions of rows as the data will be retained over years.
So my questions are:
Can you run SUM() across hundreds of millions of rows?
What kind of performance could I expect on a query across that many rows? We haven't settled on a platform yet, but we're looking at using MySQL or SQL Server.
You can take a look at the columnstore indexes in SQL Server. In short, you are able to create a columnstore index on your tables - different from the traditional rowstore index.
These indexes are specially designed for optimizing aggregate queries when huge amounts of data are involved (for example, in Data Warehouse star and snowflake schemas).
From the docs:
Columnstore indexes can achieve up to 100x better performance on analytics and data warehousing workloads and up to 10x better data compression than traditional rowstore indexes.
because:
Data compression - you get many benefits here; for example, columnstore indexes read compressed data from disk, which means fewer bytes of data need to be read into memory;
Column elimination - columnstore indexes skip reading columns that are not required for the query result, which further reduces I/O for query execution and therefore improves query performance (unlike rowstore indexes);
Rowgroup elimination - table scans are optimized by using metadata to eliminate specific rowgroups based on your filtering criteria;
Batch Mode Execution - prior to SQL Server 2019, only queries involving such indexes can benefit from batch mode processing, which reduces your execution time further (check this video to see how great this mode is)
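As a rough illustration (the table, column and index names below are hypothetical, not taken from the question), creating a columnstore index for this kind of aggregate in SQL Server might look like this:

-- Hypothetical table holding the signed decimal values to be summed.
CREATE TABLE dbo.Ledger (
    id    BIGINT IDENTITY PRIMARY KEY,
    value DECIMAL(18, 4) NOT NULL
);

-- Nonclustered columnstore index: column-oriented storage plus batch mode
-- make large aggregates such as SUM(value) considerably cheaper.
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_Ledger_value
    ON dbo.Ledger (value);

-- The aggregate itself is unchanged; the optimizer picks the columnstore index.
SELECT SUM(value) AS total_value
FROM dbo.Ledger;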
You may certainly run SUM() across an entire table, and the performance would depend roughly on how many records that table has. Note that things like indices would not really help performance in this case, because SQL Server has to touch every record to compute the sum.
If running SUM on the entire table turns out not to scale well in production, one option to consider would be to maintain the sum in a separate table. Then, when a record gets inserted or deleted, you may use a trigger to update the running total appropriately. This way, accessing the sum would be roughly constant time, though you would have some additional overhead because of the trigger logic.
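As a sketch only (reusing the hypothetical dbo.Ledger table from the sketch above; all object names and column types are assumptions), such a trigger in SQL Server might look roughly like this:

-- Hypothetical single-row summary table holding the running total.
CREATE TABLE dbo.LedgerTotal (total DECIMAL(38, 4) NOT NULL);
INSERT INTO dbo.LedgerTotal (total) VALUES (0);
GO
-- Adjust the running total whenever rows are inserted, updated or deleted.
CREATE TRIGGER dbo.trg_Ledger_Total
ON dbo.Ledger
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE dbo.LedgerTotal
    SET total = total
        + COALESCE((SELECT SUM(value) FROM inserted), 0)
        - COALESCE((SELECT SUM(value) FROM deleted), 0);
END;
GO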
I'll throw out a couple of ideas. If the data sets that you are working with are absolutely massive, consider running an overnight job to create a view, or some kind of temp table, and refer to this aggregated data blob when you get into the office in the morning. Or, move everything to the cloud, like Azure Databricks, for instance, and run these jobs in Spark. Spark is blazing fast and runs jobs in parallel, so everything is done super-fast. Good luck.

mysql harddrive efficiency with millions of rows

I have a program that receives about 20 arbitrary measurements per second from some source. Each measurement has a type, timestamp, min, avg, and max value. I then need to create up to X aggregates of each measurement type.
The program can be set up with 100s of sources at the same time, which results in a lot of data that I need to store quickly and retrieve quickly.
The system that this will run on has no memory/storage/cpu limitations, but there is another service on there that is writing to the hdd at almost the limit of its capability. For this question, let's assume that this is a "top of the line" HDD and I won't be able to upgrade to an SSD.
What I'm doing right now is generating a table per measurement type (20x source) with partitioning along the timestamp value of each measurement as new measurement types are encountered. I'm doing this so as not to fragment the measurement data across the HDD, which will allow me to insert or query data with a minimum amount of 'seeking.'
Does this make sense? I have no need of doing any joins or complex queries; it's all either straightforward batch inserts or a single measurement-type query by a timestamp range.
How does MySql store the data in the tables across the HDD? How can I better design the DB to minimize the HDD seek during insert & query?
You are asking general questions which can be answered by reading the documentation or browsing knowledge-base articles using Google or whatever search engine you prefer. If you are using the MyISAM engine, which is the default, then each table is stored as three files in a db-specific directory, with the big ones being the MYD file for the row data and the MYI file for all of the indexes.
The most important thing that you can do is to get your configuration parameters correct so that MySQL can optimise access and caching. MySQL will do a better job than you can realistically expect to do yourself. See http://dev.mysql.com/doc/refman/5.1/en/option-files.html for more info and compare the settings for the my-small.cnf and my-large.cnf that you will find on your system, as that section discusses.
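For reference, a hedged sketch of the kind of per-measurement-type table the question describes, range-partitioned along the timestamp so that batch inserts and time-range queries touch contiguous partitions (all table, column and partition names here are hypothetical):

-- Hypothetical per-measurement-type table; with MyISAM each partition gets
-- its own data and index files, which keeps related rows physically together.
CREATE TABLE measurement_cpu_load (
    source_id   INT UNSIGNED NOT NULL,
    measured_at DATETIME     NOT NULL,
    min_value   DOUBLE       NOT NULL,
    avg_value   DOUBLE       NOT NULL,
    max_value   DOUBLE       NOT NULL,
    KEY idx_measured_at (measured_at)
) ENGINE=MyISAM
PARTITION BY RANGE (TO_DAYS(measured_at)) (
    PARTITION p201001 VALUES LESS THAN (TO_DAYS('2010-02-01')),
    PARTITION p201002 VALUES LESS THAN (TO_DAYS('2010-03-01')),
    PARTITION pmax    VALUES LESS THAN MAXVALUE
);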

Looking for approach to limit TempDb space needed. Does a VIEW have any place here?

We have a process that executes the following SQL pattern on a very large amount of data:
INSERT INTO Target
SELECT
    Col1, Col2, Col3, Col4, (...) Col30
    ,SUM(Col31)
    ,SUM(Col32)
FROM
    Source
GROUP BY
    Col1, Col2, Col3, Col4, (...) Col30
Because of the large numbers of rows in the Source table and the large number of columns in the Group by clause, this results in very high TEMPDB space usage and on one occasion we ran out of space.
Given that the Business requirements dictate grouping by an unusual number of columns, we need to find a way to reduce the amount of TEMPDB space that we use without affecting the performance of the resulting Target table, which serves as our main reporting table.
We were thinking of getting the Reporting Months in the Source table and then creating a CURSOR in which we individually loop and execute the above SQL only WHERE Source.ReportingMonth = #CurrentReportingMonth from the cursor, and do this until all of the Reporting Months have been processed. Unfortunately, historical data is allowed to change, so all of the data for each month must be examined each time we process a monthly cycle. The data in each month is approximately the same volume.
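In rough T-SQL terms, the idea is something like this (the cursor and variable names are only illustrative, and the grouping column list is elided as above):

-- Hedged sketch of the month-by-month loop; names are hypothetical.
DECLARE @CurrentReportingMonth INT;
DECLARE month_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT DISTINCT ReportingMonth FROM Source;
OPEN month_cursor;
FETCH NEXT FROM month_cursor INTO @CurrentReportingMonth;
WHILE @@FETCH_STATUS = 0
BEGIN
    INSERT INTO Target
    SELECT Col1, Col2, Col3, Col4, /* ... Col5 through Col29 ... */ Col30,
           SUM(Col31), SUM(Col32)
    FROM Source
    WHERE Source.ReportingMonth = @CurrentReportingMonth
    GROUP BY Col1, Col2, Col3, Col4, /* ... Col5 through Col29 ... */ Col30;
    FETCH NEXT FROM month_cursor INTO @CurrentReportingMonth;
END;
CLOSE month_cursor;
DEALLOCATE month_cursor;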
When we told our DBA that this was our intent, his response was "I think that is a very good start; however, if the resultant table is basically used for reporting and not further aggregation, we would probably be better off just replacing the table with a view since there is very little aggregation actually performed."
My thought is that a view still results in the execution of the SQL, and that by converting the SQL to a view, performance could be impacted because the SQL may be executed many times by reports that need the data that used to be stored permanently in a persistent physical table. In SQL Server, are there views that can be persisted for performance reasons to avoid having to execute the SQL multiple times? If we only have one process that runs monthly to populate the Target table, is there any advantage to turning the Target table into a view of the Source table?
Q1: Is our cursoring idea a reasonable approach to pursue to solve the TEMPDB space issue?
Q2: Does a VIEW have any place here?
Note: we also suggested that the users identify if there is data that is "old enough" to be archived.
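Regarding Q2: SQL Server can persist a view's result set as an indexed view, which is its closest equivalent to a materialized view. A minimal sketch under stated assumptions (object names are hypothetical; indexed views require SCHEMABINDING, two-part table names, non-nullable columns under SUM, and a COUNT_BIG(*) column when GROUP BY is used):

-- Hypothetical indexed view; SQL Server maintains it as dbo.Source changes.
CREATE VIEW dbo.TargetView
WITH SCHEMABINDING
AS
SELECT
    Col1, Col2, Col3,                 -- ...the full grouping column list
    SUM(Col31) AS SumCol31,           -- summed columns must be non-nullable
    SUM(Col32) AS SumCol32,
    COUNT_BIG(*) AS RowCnt            -- required when the view uses GROUP BY
FROM dbo.Source
GROUP BY Col1, Col2, Col3;            -- ...the full grouping column list
GO
-- Creating a unique clustered index is what materializes the view on disk.
CREATE UNIQUE CLUSTERED INDEX IX_TargetView
    ON dbo.TargetView (Col1, Col2, Col3);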

handling/compressing large datasets in multiple tables

In an application at our company we collect statistical data from our servers (load, disk usage and so on). Since there is a huge amount of data and we don't need all of it at all times, we've had a "compression" routine that takes the raw data, calculates min, max and average for a number of data points, stores these new values in the same table and removes the old ones after some weeks.
Now I'm tasked with rewriting this compression routine, and the new routine must keep all uncompressed data we have for one year in one table and "compressed" data in another table. My main concerns now are how to handle the data that is continuously written to the database and whether or not to use a "transaction table" (my own term, since I can't come up with a better one; I'm not talking about the commit/rollback transaction functionality).
As of now our data collectors insert all information into a table named ovak_result, and the compressed data will end up in ovak_resultcompressed. But are there any specific benefits or drawbacks to creating a table called ovak_resultuncompressed and just using ovak_result as "temporary storage"? ovak_result would be kept minimal, which would be good for the compressing routine, but I would need to shuffle all data from one table into another continually, and there would be constant reading, writing and deleting in ovak_result.
Are there any mechanisms in MySQL to handle these kind of things?
(Please note: We are talking about quite large datasets here (about 100 M rows in the uncompressed table and about 1-10 M rows in the compressed table). Also, I can do pretty much what I want with both software and hardware configurations so if you have any hints or ideas involving MySQL configurations or hardware set-up, just bring them on.)
Try reading about the ARCHIVE storage engine.
Re your clarification. Okay, I didn't get what you meant from your description. Reading more carefully, I see you did mention min, max, and average.
So what you want is a materialized view that updates aggregate calculations for a large dataset. Some RDBMS brands such as Oracle have this feature, but MySQL doesn't.
One experimental product that tries to solve this is called FlexViews (http://code.google.com/p/flexviews/). This is an open-source companion tool for MySQL. You define a query as a view against your raw dataset, and FlexViews continually monitors the MySQL binary logs, and when it sees relevant changes, it updates just the rows in the view that need to be updated.
It's pretty effective, but it has a few limitations in the types of queries you can use as your view, and it's also implemented in PHP code, so it's not fast enough to keep up if you have really high traffic updating your base table.
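If FlexViews is not a fit, the periodic "compression" pass the question describes boils down to an aggregate-then-delete job run on a schedule. A hedged sketch, where the column names, the per-hour granularity and the two-week cutoff are assumptions rather than details from the question:

-- Roll raw rows older than two weeks up into per-hour min/avg/max rows,
-- then remove the raw rows that were just summarized (hypothetical columns).
INSERT INTO ovak_resultcompressed
    (server_id, metric, period_start, min_val, avg_val, max_val)
SELECT server_id,
       metric,
       DATE_FORMAT(measured_at, '%Y-%m-%d %H:00:00') AS period_start,
       MIN(value), AVG(value), MAX(value)
FROM ovak_result
WHERE measured_at < NOW() - INTERVAL 14 DAY
GROUP BY server_id, metric, period_start;

DELETE FROM ovak_result
WHERE measured_at < NOW() - INTERVAL 14 DAY;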

Begin Viewing Query Results Before Query Ends

Let's say I query a table with 500K rows. I would like to begin viewing any rows in the fetch buffer, which holds the result set, even though the query has not yet completed. I would like to scroll through the fetch buffer. If I scroll too far ahead, I want to display a message like: "REACHED LAST ROW IN FETCH BUFFER.. QUERY HAS NOT YET COMPLETED".
Could this be accomplished using fgets() to read the fetch buffer while the query continues building the result set? Doing this implies multi-threading.
Can a feature like this, other than the FIRST ROWS hint directive, be provided in Oracle, Informix, MySQL, or other RDBMS?
The whole idea is to have the ability to start viewing rows before a long query completes, while displaying a counter of how many rows are available for immediate viewing.
EDIT: What I'm suggesting may require a fundamental change in a DB server's architecture, as to the way they handle their internal fetch buffers, e.g. locking up the result set until the query has completed, etc. A feature like the one I am suggesting would be very useful, especially for queries which take a long time to complete. Why have to wait until the whole query completes, when you could start viewing some of the results while the query continues to gather more results!
Paraphrasing:
I have a table with 500K rows. An ad-hoc query without a good index to support it requires a full table scan. I would like to immediately view the first rows returned while the full table scan continues. Then I want to scroll through the next results.
It seems that what you would like is some sort of system where there can be two (or more) threads at work. One thread would be busy synchronously fetching the data from the database, and reporting its progress to the rest of the program. The other thread would be dealing with the display.
In the meantime, I would like to display the progress of the table scan, example: "Searching...found 23 of 500,000 rows so far".
It isn't clear that your query will return 500,000 rows (indeed, let us hope it does not), though it may have to scan all 500,000 rows (and may well have only found 23 rows that match so far). Determining the number of rows to be returned is hard; determining the number of rows to be scanned is easier; determining the number of rows already scanned is very difficult.
If I scroll too far ahead, I want to display a message like: "Reached last row in look-ahead buffer...query has not completed yet".
So, the user has scrolled past the 23rd row, but the query is not yet completed.
Can this be done? Maybe like: spawn/exec, declare scroll cursor, open, fetch, etc.?
There are a couple of issues here. The DBMS (true of most databases, and certainly of IDS) remains tied up, as far as the current connection is concerned, processing the one statement. Obtaining feedback on how a query has progressed is difficult. You could look at the estimated rows returned when the query was started (information in the SQLCA structure), but those values are apt to be wrong. You'd have to decide what to do when you reach row 200 of 23, or you only get to row 23 of 5,697. It is better than nothing, but it is not reliable. Determining how far a query has progressed is very difficult. And some queries require an actual sort operation, which means that it is very hard to predict how long they will take because no data is available until the sort is done (and once the sort is done, there is only the time taken to communicate between the DBMS and the application to hold up the delivery of the data).
Informix 4GL has many virtues, but thread support is not one of them. The language was not designed with thread safety in mind, and there is no easy way to retrofit it into the product.
I do think that what you are seeking would be most easily supported by two threads. In a single-threaded program like an I4GL program, there isn't an easy way to go off and fetch rows while waiting for the user to type some more input (such as 'scroll down the next page full of data').
The FIRST ROWS optimization is a hint to the DBMS; it may or may not give a significant benefit to the perceived performance. Overall, it typically means that the query is processed less optimally from the DBMS perspective, but getting results to the user quickly can be more important than the workload on the DBMS.
Somewhere down below in a much down-voted answer, Frank shouted (but please don't SHOUT):
That's exactly what I want to do, spawn a new process to begin displaying first_rows and scroll through them even though the query has not completed.
OK. The difficulty here is organizing the IPC between the two client-side processes. If both are connected to the DBMS, they have separate connections, and therefore the temporary tables and cursors of one session are not available to the other.
When a query is executed, a temporary table is created to hold the query results for the current list. Does the IDS engine place an exclusive lock on this temp table until the query completes?
Not all queries result in a temporary table, though the result set for a scroll cursor usually does have something approximately equivalent to a temporary table. IDS does not need to place a lock on the temporary table backing a scroll cursor because only IDS can access the table. If it was a regular temp table, there'd still not be a need to lock it because it cannot be accessed except by the session that created it.
What I meant with the 500K rows is the nrows of the queried table, not how many results are expected to be returned.
Maybe a more accurate status message would be:
Searching 500,000 rows...found 23 matching rows so far
I understand that an accurate count of nrows can be obtained in sysmaster:sysactptnhdr.nrows?
Probably; you can also get a fast and accurate count with 'SELECT COUNT(*) FROM TheTable'; this does not scan anything but simply accesses the control data - probably effectively the same data as in the nrows column of the SMI table sysmaster:sysactptnhdr.
So, spawning a new process is not clearly a recipe for success; you have to transfer the query results from the spawned process to the original process. As I stated, a multithreaded solution with separate display and database access threads would work after a fashion, but there are issues with doing this using I4GL because it is not thread-aware. You'd still have to decide how the client-side code is going to store the information for display.
There are three basic limiting factors:
1. The execution plan of the query. If the execution plan has a blocking operation at the end (such as a sort or an eager spool), the engine cannot return rows early in the query execution. It must wait until all rows are fully processed, after which it will return the data as fast as possible to the client. The time for this may itself be appreciable, so this part could be applicable to what you're talking about. In general, though, you cannot guarantee that a query will have much available very soon.
2. The database connection library. When returning recordsets from a database, the driver can use server-side paging or client-side paging. Which is used can and does affect which rows will be returned and when. Client-side paging forces the entire query to be returned at once, reducing the opportunity for displaying any data before it is all in. Careful use of the proper paging method is crucial to any chance to display data early in a query's lifetime.
3. The client program's use of synchronous or asynchronous methods. If you simply copy and paste some web example code for executing a query, you will be a bit less likely to be working with early results while the query is still running; instead the method will block and you will get nothing until it is all in. Of course, server-side paging (see point #2) can alleviate this; however, in any case your application will be blocked for at least a short time if you do not specifically use an asynchronous method. For anyone reading this who is using .Net, you may want to check out Asynchronous Operations in .Net Framework.
If you get all of these right, and use the FAST FIRSTROW technique, you may be able to do some of what you're looking for. But there is no guarantee.
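For what it's worth, the FASTFIRSTROW table hint lives on in SQL Server as the FAST query hint, which asks the optimizer to prefer a plan that streams the first rows quickly even if the total work ends up higher. A hedged sketch with hypothetical table and column names:

-- Ask for a plan optimized to return the first 100 rows as soon as possible.
SELECT OrderID, CustomerID, OrderDate
FROM dbo.Orders
WHERE OrderDate >= '2010-01-01'
OPTION (FAST 100);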
It can be done, with an analytic function, but Oracle has to full scan the table to determine the count no matter what you do if there's no index. An analytic could simplify your query:
SELECT x,y,z, count(*) over () the_count
FROM your_table
WHERE ...
Each row returned will have the total count of rows returned by the query in the_count. As I said, however, Oracle will have to finish the query to determine the count before anything is returned.
Depending on how you're processing the query (e.g., a PL/SQL block in a form), you could use the above query to open a cursor, then loop through the cursor and display sets of records and give the user the chance to cancel.
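As a rough PL/SQL sketch of that approach (the filter and the display call are hypothetical stand-ins), the loop could look like this:

-- Open a cursor over the analytic query and hand rows to the UI as they arrive;
-- the_count carries the total number of rows the query will return.
DECLARE
    CURSOR c IS
        SELECT x, y, z, COUNT(*) OVER () AS the_count
        FROM your_table
        WHERE y > 0;            -- hypothetical filter standing in for "WHERE ..."
BEGIN
    FOR rec IN c LOOP
        DBMS_OUTPUT.PUT_LINE(rec.x || ' (' || rec.the_count || ' total)');
        -- a real form would display the row here and let the user cancel
    END LOOP;
END;
/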
I'm not sure how you would accomplish this, since the query has to complete prior to the results being known. No RDBMS (that I know of) offers any means of determining how many results to a query have been found prior to the query completing.
I can't speak factually for how expensive such a feature would be in Oracle because I have never seen the source code. From the outside in, however, I think it would be rather costly and could double (if not more) the length of time a query took to complete. It would mean updating an atomic counter after each result, which isn't cheap when you're talking millions of possible rows.
So I am putting up my comments in this answer.
In terms of Oracle:
Oracle maintains its own buffer cache inside the system global area (SGA) for each instance. The hit ratio on the buffer cache depends on its sizing and reaches 90% most of the time, which means 9 out of 10 reads are satisfied without reaching the disk.
Considering the above, even if there were a "way" (so to speak) to access the buffer cache for a query you run, the results would highly depend on the cache sizing factor. If the buffer cache is too small, the cache hit ratio will be low and more physical disk I/O will result, which renders the buffer cache unreliable in terms of temporary data content. If the buffer cache is too big, then parts of it will be under-utilized and memory resources will be wasted, which in turn would mean a lot of unnecessary processing while trying to peek into the buffer cache for the data you want.
Also, depending on your cache sizing and SGA memory, it would be up to the ODBC driver / optimizer to determine when and how much of each to use (cache buffering or direct disk I/O).
In terms of trying to access the "buffer cache" to find "the row" you are looking for, there might be a way to do it (or there may be in the near future), but there would be no way to know whether what you are looking for ("the row") is there or not after all.
Also, full table scans of large tables usually result in physical disk reads and a lower buffer cache hit ratio. You can get an idea of full table scan activity at the data file level by querying v$filestat and joining to SYS.dba_data_files. Following is a query you can use:
SELECT A.file_name, B.phyrds, B.phyblkrd
FROM SYS.dba_data_files A, v$filestat B
WHERE B.file# = A.file_id
ORDER BY A.file_id;
Since this whole ordeal depends heavily on multiple parameters and statistics, the results of what you are looking for remain a matter of probability driven by those factors.