SQL Server 2008: Disable index on one particular table partition - sql-server-2008

I am working with a big table (~100,000,000 rows) in SQL Server 2008. I frequently need to add and remove batches of ~30,000,000 rows to and from this table. Currently, before loading a large batch into the table, I disable the indexes, insert the data, and then rebuild the indexes; I have measured this to be the fastest approach.
Recently I have been considering table partitioning on this table to increase speed, partitioning it according to my batches.
My question: is it possible to disable the index on one particular partition, load the data into it, and then re-enable it? That way the rest of my table would not have to suffer a complete index rebuild, and my loading could be even faster.

Indexes are typically built on the same partition scheme as the table (partition-aligned), so you cannot disable an index for just one partition. For the scenario you describe, you can instead load the batch into a separate staging table (identical structure, different name) and then use the SWITCH command to move that table into your existing table as a new partition.
I have included the code I use to perform this; you will need to modify it based on your table names:
DECLARE @importPart int
DECLARE @hourlyPart int
SET @importPart = 2 -- always, so long as the Import table is only made up of 1 partition
-- get the next Hourly partition number
SELECT
    @hourlyPart = MAX(V.boundary_id) + 1
FROM
    sys.partition_range_values V
    JOIN sys.partition_functions F
        ON V.function_id = F.function_id
        AND F.name = 'pfHourly';
ALTER TABLE Import
SWITCH PARTITION @importPart
TO Hourly PARTITION @hourlyPart;
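One caveat: for the SWITCH to succeed, the staging (Import) table must have an identical structure and indexes, live on the same filegroup as the target partition, and, when switching into a range-partitioned table, carry a trusted CHECK constraint proving its rows fit the destination boundary. A minimal sketch, assuming a hypothetical LogHour datetime partitioning column:
ALTER TABLE Import WITH CHECK
ADD CONSTRAINT CK_Import_Hour
CHECK (LogHour IS NOT NULL AND LogHour >= '20110801 01:00' AND LogHour < '20110801 02:00');
The IS NOT NULL check matters if the column is nullable; without it, SQL Server cannot prove every row belongs in the target partition.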

Related

How to manually add partition details into hive metastore tables?

In my HDFS, I've partitioned data by date and event_id, and have about 1.4 million Parquet files. Today, to analyze the data in Apache Spark, I use spark.read.parquet("/path/to/root/"). Listing the files takes about 30 minutes, I have to do it every time, and it's getting annoying.
Now I want to set up an external table, using MySQL as the Hive metastore. I'm currently facing the known issue where discovering all 1.4 million partitions takes forever. As we all know, MSCK REPAIR TABLE my_table is out of the picture. I instead generated a long script (about 400 MB) containing statements like this:
ALTER TABLE my_table ADD
PARTITION (date = 'YYYY-MM-DD', event_id = "<some value>") LOCATION ''
PARTITION (date = 'YYYY-MM-DD', event_id = "<some value>") LOCATION ''
PARTITION (date = 'YYYY-MM-DD', event_id = "<some value>") LOCATION ''
...
PARTITION (date = 'YYYY-MM-DD', event_id = "<some value>") LOCATION ''
PARTITION (date = 'YYYY-MM-DD', event_id = "<some value>") LOCATION ''
It has been 3 hours, and it has still processed fewer than 100,000 partitions. I have observed a few things:
Spark does one partition at a time.
Spark seems to check each path for existence.
All of this adds to the long running time. I've searched, and haven't been able to find a way to disable either operation.
So I want to run SQL operations directly against the MySQL database and tables backing the Hive metastore, to create and manage the table entries myself. I've looked but have been unable to figure out how to manually manage those tables. Does anyone know how to do that? Specifically, I want the following:
How can I create an external table with partitions by making direct entries into the Hive metastore tables?
How can I manage an external table's partitions by making direct upsert queries against the Hive metastore tables?
Is there a good resource I could use to learn about the backing tables in the metastore? I feel doing the inserts manually would be much, much faster. Thank you.
I think the core problem here is that you have too many partitions. Partitioning should generally be done on a low-cardinality column (something with a relatively small number of distinct values, compared to the total number of records). Typically you want to err on the side of having a smaller number of large files, rather than a large number of small files.
In your example, date is probably a good partitioning column, assuming there are many records for each date. If there are a large number of distinct values for event_id, that's not a great candidate for partitioning. Just keep it as an unpartitioned column.
An alternative to partitioning for a high-cardinality column is bucketing. This groups similar values for the bucketed column so they're in the same file, but doesn't split each value across separate files. The AWS Athena docs have a good overview of the concept.
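As a rough illustration of the partition-plus-bucketing layout described above (a sketch only; the table name, bucket count, and location are made up, and the date column is named dt to avoid the reserved word):
CREATE EXTERNAL TABLE events (
    event_id STRING,
    payload  STRING
)
PARTITIONED BY (dt STRING)               -- low-cardinality: one partition per date
CLUSTERED BY (event_id) INTO 64 BUCKETS  -- high-cardinality column is bucketed, not partitioned
STORED AS PARQUET
LOCATION '/path/to/root/';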
This can be an issue with statistics auto-gathering. As a workaround, switch off hive.stats.autogather before recovering partitions.
Switch off statistics auto-gathering:
set hive.stats.autogather=false;
Run MSCK REPAIR or ALTER TABLE RECOVER PARTITIONS.
If you need statistics to be fresh, you can execute ANALYZE separately for new partitions only.
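Put together, the workaround might look like this (the partition spec in the ANALYZE statement is illustrative):
set hive.stats.autogather=false;
MSCK REPAIR TABLE my_table;
-- refresh statistics only for the partitions that are actually new:
ANALYZE TABLE my_table PARTITION (`date`='2021-01-01', event_id='abc') COMPUTE STATISTICS;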
Related tickets are HIVE-18743, HIVE-19489, and HIVE-17478.

Most efficient way to add a column to a MySql UNIQUE KEY?

I'm working with a production database on a table that has > 2 million rows, and a UNIQUE KEY over col_a, col_b.
I need to modify that index to cover col_a, col_b, and col_c.
I believe this to be a valid, atomic command to make the change:
ALTER TABLE myTable
DROP INDEX `unique_cols`,
ADD UNIQUE KEY `unique_cols` (
`col_a`,
`col_b`,
`col_c`
);
Is this the most efficient way to do it?
I'm not certain that the following is the best way for you; it's what worked for us after we suffered a few database problems ourselves and had to fix them quickly.
We work with very large tables, over 4-5 GB in size.
Those tables have more than 2 million rows.
In our experience, running any kind of ALTER query or index creation on a table is dangerous if the table is being written to.
So in our case here is what we do if the table has writes 24/7:
Create a new empty table with the correct indexes.
Copy the data to the new table row by row, using a tool like Percona's pt-online-schema-change or a manually written script (see the sketch after this list).
This allows the table to use less memory, and also saves you in case you have a MyISAM table.
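A minimal sketch of that copy-and-swap flow in plain SQL, assuming a numeric id primary key to batch on (all names here are illustrative):
CREATE TABLE myTable_new LIKE myTable;  -- identical structure, including the old indexes
ALTER TABLE myTable_new
    DROP INDEX `unique_cols`,
    ADD UNIQUE KEY `unique_cols` (`col_a`, `col_b`, `col_c`);
-- copy in small batches to keep locks short (repeat, advancing the id range):
INSERT INTO myTable_new SELECT * FROM myTable WHERE id BETWEEN 1 AND 10000;
-- once fully caught up, swap atomically:
RENAME TABLE myTable TO myTable_old, myTable_new TO myTable;
Note that a plain script like this does not capture writes that land during the copy; handling those (via triggers) is exactly what pt-online-schema-change automates.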
In the scenario that you have a very large table that is not being written to regularly, you could create the indexes while it is not in use.
This is hard to predict and can lead to problems if you've not estimated correctly.
In either case, your goal should be to:
Save memory / load on the system.
Reduce locks on the tables.
The above also holds true when we add or delete columns on our very large tables, so this is not something we do just for creating indexes, but also for adding and removing columns.
Hope this helps, and anyone is free to disagree / add to my answer.
Some more helpful answers:
https://dba.stackexchange.com/questions/54211/adding-index-to-large-mysql-tables
https://dba.stackexchange.com/a/54214
https://serverfault.com/questions/174749/modifying-columns-of-very-large-mysql-tables-with-little-or-no-downtime
most efficient way to add index to large mysql table

MySQL : avoid full scan when altering partitioning

We have a MySQL database partitioned per client, each client having a unique ID (it is LIST partitioning, not a linear partition created with RANGE over a value).
When I create or delete a partition with these queries:
ALTER TABLE "table_name"
ADD PARTITION (PARTITION "client_id_value" VALUES IN ("client_id_value"));
ALTER TABLE "table_name" DROP PARTITION "client_id_value";
MySQL does a full table scan, even if no rows correspond to the new partition yet (and I know there are none). This is a problem because we create and delete a partition in every functional test we run to verify that provisioning a dedicated, separate data space for a client works correctly. One solution would be to keep a permanent partition just for testing, but I wonder whether we can skip the full scan while altering the table.
Any ideas?

Create clustered index and/or partitioning on non-unique column?

I have a table containing log entries for a single week for about a thousand web servers. Each server writes about 60,000 entries per day to the table, so there are 420,000 entries per week for each server. The table is truncated weekly. Each log entry contains the servername, which is a varchar (this cannot be changed).
The main operation is select * from table where servername = 'particular', which retrieves the 420,000 records for a server; a C# program then analyzes that data once it has been selected.
Should I create a clustered index on the servername column to speed up the read operation? (It currently takes over half an hour to execute the above SQL statement.)
Would partitioning help? The computer has only two physical drives.
The query is run for each server once per week. After the query is run for all servers, the table is truncated.
The "standard" ideal clustered key is something like an INT IDENTITY that keeps increasing and is narrow.
However, if your primary use for this table is the listed query, then I think a clustered index on servername makes sense. You will see a big increase in speed if the table is wide, since you will eliminate an expensive key/bookmark lookup that runs on a SELECT * from a nonclustered index (unless you include all the fields in the table).
EDIT:
KM pointed out that this will slow down inserts, which is true. For this scenario you may want to consider a two-column key on (servername, idfield), where idfield is an INT IDENTITY. This would still allow access based only on servername in your query, but would insert new records at the end per server. You will still have some fragmentation and reordering.
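A sketch of that two-column clustered key, with hypothetical table and column names:
-- cluster on servername first so the weekly per-server read is one contiguous range scan,
-- with an ever-increasing identity second so inserts append within each server's range:
CREATE CLUSTERED INDEX CIX_Log_Server_Id
ON dbo.ServerLog (servername, idfield);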
based on:
The query is run for each server once per week. After the query is run
for all servers, the table is truncated.
and
for about a thousand web servers
I'd change the C# program to run a single query just once:
SELECT * FROM table ORDER BY servername, CreateDate
and have it handle "breaking" when the server name changes.
One table scan is better than 1,000. I would not slow down the main application's INSERTs into the log table (with a clustered index) just so your once-a-week queries run faster.
Yes, it would be a good idea to create a clustered index on the servername column, since right now the database has to do a full table scan to find the records that satisfy servername = 'particular'.
Horizontally partitioning the table by date would also help, since the database would then only need to deal with one day's data for all servers at a time.
Then make sure that you fire date-based queries:
SELECT * FROM table
WHERE date BETWEEN '20110801' AND '20110808'
AND servername = 'particular'

Is there any way to do a bulk/faster delete in mysql?

I have a table with 10 million records; what is the fastest way to delete old rows while retaining only the last 30 days?
I know this can be done with the event scheduler, but my worry is that if it takes too much time, it might lock the table for a long while.
It would be great if you could suggest an optimal way.
Thanks.
Offhand, I would:
Rename the table
Create an empty table with the same name as your original table
Grab the last 30 days from your "temp" table and insert them back into the new table
Drop the temp table
This will enable you to keep the table live through (almost) the entire process and get the past 30 days worth of data at your leisure.
You could try partitioned tables.
PARTITION BY LIST (TO_DAYS( date_field ))
This would give you 1 partition per day, and when you need to prune data you just:
ALTER TABLE tbl_name DROP PARTITION p#
http://dev.mysql.com/doc/refman/5.1/en/partitioning.html
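For reference, a minimal sketch of the same idea using RANGE partitioning, which is the more common variant for date-based pruning (all names and dates are illustrative; note that MySQL requires the partitioning column to be part of every unique key, including the primary key):
CREATE TABLE log_entries (
    id         INT NOT NULL AUTO_INCREMENT,
    date_field DATE NOT NULL,
    payload    VARCHAR(255),
    PRIMARY KEY (id, date_field)  -- partitioning column must appear in the PK
)
PARTITION BY RANGE (TO_DAYS(date_field)) (
    PARTITION p20110801 VALUES LESS THAN (TO_DAYS('2011-08-02')),
    PARTITION p20110802 VALUES LESS THAN (TO_DAYS('2011-08-03'))
);
-- dropping the oldest day is then a near-instant metadata operation:
ALTER TABLE log_entries DROP PARTITION p20110801;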
Not that it helps with your current problem, but if this is a regular occurrence, you might want to look into a MERGE table: just add tables for different periods in time, and remove them from the MERGE table definition when no longer needed. Another option is partitioning, where it is equally trivial to drop the (oldest) partition.
To expand on Michael Todd's answer.
If you have the space,
Create a blank staging table similar to the table you want to reduce in size
Fill the staging table with only the records you want to have in your destination table
Do a double rename like the following
Assuming:
table is the table name of the table you want to purge a large amount of data from
newtable is the staging table name
no other tables are called temptable
rename table table to temptable, newtable to table;
drop table temptable;
This will be done in a single transaction, which will require an instantaneous schema lock. Most high concurrency applications won't notice the change.
Alternatively, if you don't have the space but do have a long window in which to purge this data, you can use dynamic SQL to insert the primary keys into a temp table, and then join that temp table in a DELETE statement. When you insert into the temp table, be aware of max_allowed_packet: most installations of MySQL use 16MB (16777216 bytes), and your insert command for the temp table should stay under that limit. This approach will not lock the table. Afterwards, you'll want to run OPTIMIZE TABLE to reclaim the space for the rest of the engine to use; you probably won't be able to reclaim disk space unless you shut down the engine and move the data files.
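A rough sketch of that temp-table approach (table and column names are made up; batch the INSERT and DELETE if the key list is very large):
-- collect the keys of the rows to purge, then delete via a join:
CREATE TEMPORARY TABLE purge_ids (id INT NOT NULL, PRIMARY KEY (id));
INSERT INTO purge_ids
    SELECT id FROM big_table WHERE created_at < NOW() - INTERVAL 30 DAY;
DELETE big_table FROM big_table
    JOIN purge_ids ON purge_ids.id = big_table.id;
-- reclaim space inside the engine afterwards:
OPTIMIZE TABLE big_table;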
Shut down your application, then: SELECT .. INTO OUTFILE, filter the output file, drop the table, and LOAD DATA LOCAL INFILE optimized_db.txt. It is often cheaper to re-create the table than to UPDATE in place.