SSAS Partition - Process Default or Full - sql-server-2008

We have partitions, let's say:
Log 2014,
Log 2015,
Log 2016-Jan-June,
Log 2016-July-Dec,
Log 2017-Jan-June,
Log 2017-July-Dec
Once the import routine starts, we insert the new data into the Log table and then process the cube using ADOMD.NET.
XMLA Process Type for Partition:
Log 2017-July-Dec - Process Full
All other partitions use Process Default.
We are receiving new clients and getting old log data, e.g. from 2015 and 2016. Our import reads all the data and inserts it into the Log table.
Does "Process Default" work in this case for the 2015 and 2016 partitions?
Will this log data (2015, 2016) be merged into the correct partitions once the cube has been processed?
Are the aggregations recalculated after processing a partition with the Process Default type?
Thanks,
Chandru

Process Default will not bring updated members into your historical partitions. It reads the state of the partition and its components (data, indexes) and does the minimal processing needed to bring the partition to a fully processed state. If your partition is already processed at the moment the Process Default command is issued, the command will do nothing.
Process Default is usually used during development to bring cube changes online.
In your case, Process Full (or Process Data followed by Process Index) will bring the data into the historical partitions.
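For illustration, here is a minimal Python sketch of the XMLA Process command you could issue from your import routine; the database, cube, measure group, and partition IDs are assumptions, and actually executing the command (e.g. via AdomdCommand in your ADOMD.NET code) is left as a placeholder.

# Standard XMLA Process command; the IDs below are hypothetical.
XMLA_PROCESS = """
<Process xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Type>{process_type}</Type>
  <Object>
    <DatabaseID>LogsDB</DatabaseID>
    <CubeID>LogsCube</CubeID>
    <MeasureGroupID>Log</MeasureGroupID>
    <PartitionID>{partition_id}</PartitionID>
  </Object>
</Process>
"""

# Reprocess the historical partitions that received late-arriving rows.
for partition_id in ("Log 2015", "Log 2016-Jan-June", "Log 2016-July-Dec"):
    command = XMLA_PROCESS.format(process_type="ProcessFull",
                                  partition_id=partition_id)
    # send `command` to the SSAS server here, e.g. AdomdCommand.ExecuteNonQuery()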

Related

Update large amount of data in SQL database via Airflow

I have a large table in Cloud SQL that needs to be updated every hour, and I'm considering Airflow as a potential solution. What is the best way to update a large amount of data in a Cloud SQL database from Airflow?
The constraints are:
The table still needs to be readable while the job is running
The table needs to be writable in case one job runs over time and two jobs end up running at the same time
Some of the ideas I have:
Load the data to be updated into a pandas DataFrame and run pd.to_sql
Load the data into a CSV in Cloud Storage and execute LOAD DATA LOCAL INFILE
Load the data in memory, break it into chunks, and run a multi-threaded process in which each thread updates the table chunk by chunk, using a shared connection pool to prevent exhausting connection limits
My recent Airflow-related ETL project could be a reference for you.
Input DB: large DB (billion-row-level Oracle)
Interim DB: medium DB (tens-of-millions-row-level HDF5 file)
Output DB: medium DB (tens-of-millions-row-level MySQL)
In my experience, writing to the database is the main bottleneck for such an ETL process. So, as you can see:
For the interim stage, I use HDF5 as the interim store for data transformation. The pandas to_hdf function provides seconds-level performance on large data; in my case, writing 20 million rows to HDF5 took less than 3 minutes. (A sketch follows below.)
See the performance benchmarks for pandas I/O; the HDF5 format is among the top three fastest and most popular formats: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-perf
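A minimal sketch of that interim stage (the file and key names are illustrative; to_hdf needs the PyTables package installed):

import pandas as pd

# Some illustrative transformed data.
df = pd.DataFrame({"id": range(1_000_000), "value": "x"})

# Write to HDF5; format="table" supports appending and querying,
# at a small cost versus the default fixed format.
df.to_hdf("interim_store.h5", key="log", mode="w", format="table")

# Read it back for the output stage.
df = pd.read_hdf("interim_store.h5", key="log")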
For the output stage, I use to_sql with the chunksize parameter. To speed up to_sql, you have to manually map the column types to the database column types and lengths, especially for string/varchar columns; without the manual mapping, to_sql maps them to a blob format or varchar(1000), and the default mode is about 10 times slower than the manually mapped mode.
In total, writing 20 million rows to the database via to_sql (chunksize mode) took about 20 minutes.
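A minimal sketch of that output stage (the connection URL, table name, and column lengths are illustrative):

import pandas as pd
import sqlalchemy as sa

engine = sa.create_engine("mysql+pymysql://user:pass@localhost/etl_db")
df = pd.read_hdf("interim_store.h5", key="log")

# Explicit dtype mapping: without it, pandas lets the driver pick wide
# text/blob types for strings, which is much slower to insert.
dtype = {
    "id": sa.types.BigInteger(),
    "value": sa.types.VARCHAR(length=64),
}

df.to_sql("log_output", engine, if_exists="append", index=False,
          chunksize=10_000, dtype=dtype)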
If you like the answer, please vote it up.
Here is one more idea for your reference, based on PostgreSQL partitioned tables, though it needs some DDL operations to define the partitioned table.
Currently, your main constraints are:
The table still needs to be readable while the job is running.
This means no locks are allowed.
The table needs to be writable in case one job runs over time and two jobs end up running at the same time.
It should be able to handle multiple writers at the same time.
I'll add one thing you may want to consider as well:
Reasonable read performance while writing.
Performance and user experience are key.
A partitioned table can meet all of these requirements. It is transparent to the client application.
At present you are doing ETL; you will soon face performance issues as the table size grows quickly. A partitioned table is the only solution.
The main steps are (see the sketch after this list):
Create a partitioned table with a partition list.
Normal reading and writing to the table runs as usual.
ETL process (can run in parallel):
-. ETL the data and upload it to a new table. (Very slow, minutes to hours, but no impact on the main table.)
-. Add the new table to the main table's partition list. (Super fast, at the microsecond level, to enable it in the main table.)
Normal reading and writing to the main table continues as usual, now with the new data.
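A minimal sketch of these steps on PostgreSQL, via SQLAlchemy (the connection URL, table names, and partition key are illustrative):

import sqlalchemy as sa

engine = sa.create_engine("postgresql+psycopg2://user:pass@localhost/etl_db")

# One-time setup: the main table, LIST-partitioned by batch id.
with engine.begin() as conn:
    conn.execute(sa.text("""
        CREATE TABLE IF NOT EXISTS log_main (
            batch_id int NOT NULL,
            payload  text
        ) PARTITION BY LIST (batch_id)
    """))

# Slow part, no impact on log_main: load a new batch into a standalone table.
with engine.begin() as conn:
    conn.execute(sa.text("CREATE TABLE log_batch_42 (LIKE log_main)"))
    conn.execute(sa.text("INSERT INTO log_batch_42 VALUES (42, 'new data')"))

# Fast part: attach the loaded table as a partition of the main table.
with engine.begin() as conn:
    conn.execute(sa.text(
        "ALTER TABLE log_main ATTACH PARTITION log_batch_42 FOR VALUES IN (42)"
    ))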
If you like the answer, please vote it up.
Best Regards,
WY
A crucial step to consider while setting up your workflow is to always use good connection management practices to minimize your application's footprint and reduce the likelihood of exceeding Cloud SQL connection limits. Database connections consume resources on both the server and the connecting application.
Cloud Composer has no limitations when it comes to interfacing with Cloud SQL, so either of the first two options is good.
A Python dependency is installable if it has no external dependencies and does not conflict with Composer's dependencies. In addition, 14262433 explicitly explains the process of setting up a "large data" workflow using pandas.
LOAD DATA LOCAL INFILE requires you to use the --local-infile option for the mysql client. To import data into Cloud SQL, make sure to follow the best practices.

How to get incremental data from MySQL based on the binlog

Use the parameters below to enable the binlog on a single MySQL instance:
--log-bin=XXX
--server-id=XXX
--log-bin-index=XXX
--binlog-checksum=CRC32
--binlog_format=ROW
Everything starts from here:
Read one table at the REPEATABLE READ isolation level and dump all of its data somewhere. Meanwhile, mark this read transaction by writing something unique at the very beginning and the very end of the transaction, so I can easily locate the transaction in the binlog later. (A sketch of this marker technique follows below.)
While this data is being read, active transactions are modifying it at the same time.
When the read finishes (the finish mark is added and the transaction is committed), other transactions that modified this data may have committed before or after this transaction.
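For concreteness, a minimal sketch of the marker technique described above (the connection URL, source table, and snapshot_markers table are assumptions):

import sqlalchemy as sa

# REPEATABLE READ gives the dump a consistent snapshot.
engine = sa.create_engine(
    "mysql+pymysql://user:pass@localhost/db",
    isolation_level="REPEATABLE READ",
)

snapshot_id = "snapshot-0001"
with engine.begin() as conn:
    # Begin marker: a row that is easy to find in the binlog later.
    conn.execute(sa.text("INSERT INTO snapshot_markers (label) VALUES (:l)"),
                 {"l": "BEGIN " + snapshot_id})
    # Dump the table inside the same transaction.
    rows = conn.execute(sa.text("SELECT * FROM source_table")).fetchall()
    # End marker; committed together with the begin marker.
    conn.execute(sa.text("INSERT INTO snapshot_markers (label) VALUES (:l)"),
                 {"l": "END " + snapshot_id})
# With binlog_format=ROW, the two marker inserts show up as one transaction
# in the binlog, bracketing the point in time of the snapshot.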
How can I get a precise diff (increment) of the table data next time by only reading the binlog?
Locking the table for each read is an option, but locking the table is a heavy operation and will be the last resort.
Are there any other options?

Getting stale results in multiprocessing environment

I am using 2 separate processes via multiprocessing in my application. Both have access to a MySQL database via SQLAlchemy Core (not the ORM). One process reads data from various sources and writes it to the database. The other process just reads the data from the database.
I have a query which gets the latest record from a table and displays its id. However, it always displays the first id, which was created when I started the program, rather than the latest inserted id (new rows are created every few seconds).
If I use a separate MySQL tool and run the query manually, I get correct results, but SQLAlchemy always gives me stale results.
Since you can see the changes your writer process is making with another MySQL tool, your writer process is indeed committing the data (at least if you are using InnoDB).
InnoDB shows you the state of the database as of when you started your transaction. Whatever other tool you are using probably has an autocommit feature turned on, where a new transaction is implicitly started after each query.
To see the changes in SQLAlchemy, do as zzzeek suggests and change your monitoring/reader process to begin a new transaction.
One technique I've used to do this myself is to add autocommit=True to the execution_options of my queries, e.g.:
result = conn.execute(select([table]).where(table.c.id == 123).execution_options(autocommit=True))
Assuming you're using InnoDB, the data on your connection will appear "stale" for as long as you keep the current transaction running, or until you commit the other transaction. For one process to see the data from the other process, two things need to happen: 1. the transaction that created the new data needs to be committed, and 2. the current transaction, assuming it has already read some of that data, needs to be rolled back, or committed and started again. See The InnoDB Transaction Model and Locking.
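To make this concrete, here is a minimal sketch of a reader that commits between polls so each query starts a new transaction and sees freshly committed rows (SQLAlchemy 1.4+ style; the connection URL and table name are illustrative):

import time
import sqlalchemy as sa

engine = sa.create_engine("mysql+pymysql://user:pass@localhost/db")
metadata = sa.MetaData()
records = sa.Table("records", metadata, autoload_with=engine)

while True:
    # engine.begin() commits on exit, ending the transaction, so the next
    # iteration gets a fresh snapshot that includes newly committed rows.
    with engine.begin() as conn:
        latest = conn.execute(
            sa.select(records).order_by(records.c.id.desc()).limit(1)
        ).fetchone()
        print(latest)
    time.sleep(5)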

In SQL Server CDC with SSIS, which data should be stored for windowing (LSN or Date)?

I have implemented delta detection while loading a data warehouse from transaction systems, using an identity column or a date-time column in the source transaction tables. When data needs to be extracted the next time, the maximum date-time value extracted last time is used in the filter of the extraction query to identify new or changed records. This was good enough, except when there were multiple transactions in the same millisecond.
But now we have Change Data Capture (CDC) in SQL Server 2008, and it provides something new called the LSN (Log Sequence Number), a binary value of length 10. Now I am confused: which value should be stored for windowing purposes, the LSN or the date-time? Of course, the LSN eliminates the need for storing additional date-time values in large transaction tables, but does it have any disadvantages? Which one should I use? I feel that mapping the LSN to a date-time and then storing the date-time is not a reliable method. What is your opinion?
PS: To non-BI professionals: sorry.
See Improving Incremental Loads with Change Data Capture for information on using CDC with SSIS.
After a long wait, I don't see any further answers here. I have used the LSN in my current project for windowing, and I find it better than date-time values as it is more precise and the process is simple. I recommend using the LSN. If anyone out there disagrees, please let me know...
If you set up CDC, a system table named cdc.lsn_time_mapping is added to your database, so you can use either.
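For illustration, a minimal Python sketch of LSN-based windowing via pyodbc; the connection string and the 'dbo_Orders' capture instance are assumptions:

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=.;DATABASE=DW;Trusted_Connection=yes"
)
cur = conn.cursor()

# Current end of the CDC window.
to_lsn = cur.execute("SELECT sys.fn_cdc_get_max_lsn()").fetchval()

# from_lsn would normally be the max LSN saved from the previous run; here
# we translate a date-time through the mapping functions instead.
from_lsn = cur.execute(
    "SELECT sys.fn_cdc_map_time_to_lsn('smallest greater than or equal', ?)",
    "2017-01-01 00:00:00",
).fetchval()

# Pull all changes for the (hypothetical) capture instance dbo_Orders.
rows = cur.execute(
    "SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_Orders(?, ?, N'all')",
    from_lsn, to_lsn,
).fetchall()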

What is the binlog in MySQL?

I've set the MySQL parameter innodb_flush_log_at_trx_commit=0, which means MySQL flushes transactions to disk once per second. Is it true that if MySQL fails before this flush (because of a power outage), I will lose the data from those transactions? Or will MySQL save them in the data file (ibdata1) after each transaction, regardless of the log flush?
Thanks.
The binary log contains “events” that describe database changes such as table creation operations or changes to table data. It also contains events for statements that potentially could have made changes (for example, a DELETE which matched no rows), unless row-based logging is used. The binary log also contains information about how long each statement took that updated data. The binary log has two important purposes:
For replication, the binary log on a primary replication server provides a record of the data changes to be sent to secondary servers. The primary server sends the events contained in its binary log to its secondaries, which execute those events to make the same data changes that were made on the primary.
Certain data recovery operations require the use of the binary log. After a backup has been restored, the events in the binary log that were recorded after the backup was made are re-executed. These events bring databases up to date from the point of the backup.
The binary log is not used for statements such as SELECT or SHOW that do not modify data.
https://dev.mysql.com/doc/refman/8.0/en/binary-log.html
Here is the entry in the MySQL reference manual for innodb_flush_log_at_trx_commit. With the value set to 0, you can lose up to the last second of transactions.
Note that the binlog is actually something different: it is independent of InnoDB and is used for all storage engines. Here is the chapter on the binary log in the MySQL reference manual.
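If it helps, a minimal sketch for checking the relevant settings on a running server (the connection URL is illustrative; sync_binlog is the binlog's own flush setting):

import sqlalchemy as sa

engine = sa.create_engine("mysql+pymysql://user:pass@localhost/db")

with engine.connect() as conn:
    for stmt in (
        "SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit'",
        "SHOW VARIABLES LIKE 'sync_binlog'",
        "SHOW VARIABLES LIKE 'log_bin'",
    ):
        # e.g. ('innodb_flush_log_at_trx_commit', '0')
        print(conn.execute(sa.text(stmt)).fetchone())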