Partitioning tables after creation using Crate - partitioning

I was wondering if it's possible to do partitioning after table creation.
I'm trying to import ~2 million entries into a table (cluster), and if I partition the table before adding the entries I get memory exceptions.

It is not possible to partition a table after its creation.
2M records is not a lot of data, but if you have problems importing data into a partitioned table (e.g. if you have a lot of partitions), you could import data per partition:
COPY table_name PARTITION (partition_column = value) FROM 'file:///path/to/data.json' WITH (option = value);
See COPY FROM.
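For illustration only, here is a sketch of a per-partition import, assuming a recent CrateDB version; the table, column, and file names are placeholders, and each file is assumed to already contain only the rows for its day:

CREATE TABLE events (
    id BIGINT,
    payload TEXT,
    day TEXT
) PARTITIONED BY (day);

-- Import one day at a time so each COPY only has to build a single partition.
COPY events PARTITION (day = '2016-06-01') FROM 'file:///tmp/events_2016_06_01.json';
COPY events PARTITION (day = '2016-06-02') FROM 'file:///tmp/events_2016_06_02.json';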

Related

How to manually add partition details into hive metastore tables?

In my HDFS, I've partitioned data by date and event_id, and have about 1.4 million parquet files. Today, to analyze the data in Apache Spark, I use spark.read.parquet("/path/to/root/"). This takes about 30 minutes to list the files, I have to do this every time, and it's getting annoying.
Now, I want to set up an external table, using MySQL as the Hive metastore. I'm currently facing the known issue where discovering all 1.4 million partitions takes forever. As we all know, MSCK REPAIR TABLE my_table is out of the picture. I instead generated a long query (about 400 MB) that looks like this
ALTER TABLE my_table ADD
PARTITION (date = 'YYYY-MM-DD', event_id = "<some value>") LOCATION ''
PARTITION (date = 'YYYY-MM-DD', event_id = "<some value>") LOCATION ''
PARTITION (date = 'YYYY-MM-DD', event_id = "<some value>") LOCATION ''
...
PARTITION (date = 'YYYY-MM-DD', event_id = "<some value>") LOCATION ''
PARTITION (date = 'YYYY-MM-DD', event_id = "<some value>") LOCATION ''
It has been 3 hours, and it still has only processed less than 100,000 partitions. I have observed a few things:
Spark does one partition at a time.
Spark seems to check each path for existence.
All of this adds to the long running time. I've searched, and haven't been able to find how to disable either operation.
So, I want to manually perform SQL operations against the MySQL database and table for the Hive metastore, to create and manage the tables. I've looked but have been unable to figure out how to manually manage those tables. Please, does anyone know how to do that? Specifically, I want the following:
How can I create an external table with partitions, by making direct entries into the Hive metastore tables?
How can I manage an External table partition by making direct upsert queries against the Hive metastore tables?
Is there a good resource I could use to learn about the backing tables in the metastore? I feel doing the inserts manually would be much, much faster. Thank you.
I think the core problem here is that you have too many partitions. Partitioning should generally be done on a low-cardinality column (something with a relatively small number of distinct values, compared to the total number of records). Typically you want to err on the side of having a smaller number of large files, rather than a large number of small files.
In your example, date is probably a good partitioning column, assuming there are many records for each date. If there are a large number of distinct values for event_id, that's not a great candidate for partitioning. Just keep it as an unpartitioned column.
An alternative to partitioning for a high-cardinality column is bucketing. This groups similar values for the bucketed column so they're in the same file, but doesn't split each value across separate files. The AWS Athena docs have a good overview of the concept.
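As a rough illustration (the table name, columns, and bucket count here are placeholders, not taken from the question), a Hive DDL combining a low-cardinality partition column with bucketing on the high-cardinality column might look like this:

CREATE EXTERNAL TABLE my_events (
    event_id STRING,
    payload  STRING
)
PARTITIONED BY (`date` STRING)            -- low cardinality: one partition per day
CLUSTERED BY (event_id) INTO 64 BUCKETS   -- high cardinality: grouped into files, not split per value
STORED AS PARQUET
LOCATION '/path/to/root/';

Note that bucketing only pays off if the data is actually written with that layout (e.g. inserted through Hive or Spark with bucketing enabled); existing files are not reorganized by the DDL alone.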
This can be an issue with statistics auto-gathering. As a workaround, switch off hive.stats.autogather before recovering partitions.
Switch off statistics auto-gathering:
set hive.stats.autogather=false;
Run MSCK REPAIR or ALTER TABLE RECOVER PARTITIONS.
If you need statistics to be fresh, you can execute ANALYZE separately for new partitions only.
Related tickets are HIVE-18743, HIVE-19489, HIVE-17478
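A minimal sketch of that workaround, assuming a table named my_table partitioned by date and event_id (the partition values in the ANALYZE statement are placeholders):

-- Switch off statistics auto-gathering for this session so partition
-- recovery does not trigger per-partition stats collection.
set hive.stats.autogather=false;

-- Recover the partitions from the file system.
MSCK REPAIR TABLE my_table;

-- Later, compute statistics only for the partitions that actually need it.
ANALYZE TABLE my_table PARTITION (`date`='2020-01-01', event_id='42') COMPUTE STATISTICS;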

How to reduce index size of a table in mysql using innodb engine?

I am facing a performance issue in MySQL due to a large index on my table. The index has grown to 6GB and my instance is running with 32GB of memory. The majority of rows in that table are no longer required after a few hours and can be removed selectively. But removing them is a time-consuming solution and doesn't reduce the index size.
Please suggest some solution to manage this index.
You can OPTIMIZE your table to rebuild the index and get the space back if it is not freed even after deletion:
optimize table table_name;
But since your table is bulky, it will be locked during OPTIMIZE TABLE, and you also face the issue of removing old data that you no longer need after a few hours. So you can do as below:
Step 1: During night hours, or when there is less traffic on your DB, first rename your main table and create a new table with the same name. Then insert the last few hours of data from the old table into the new table.
This way you remove the unwanted data, and the new table is also optimized.
Step 2: To avoid this issue in the future, create a stored procedure that executes once per day during night hours and either deletes data up to the previous day (as per your requirement) from this table or moves it to a historical table.
Step 3: Since your table now only keeps a single day of data, you can run the OPTIMIZE TABLE statement to rebuild it and reclaim space easily.
Note: a DELETE statement will not rebuild the index and will not free space on the server. For that you need to optimize the table, which can be done in various ways, e.g. with an ALTER TABLE statement or an OPTIMIZE TABLE statement.
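A rough sketch of Step 1 above, assuming the table is called my_table, has a created_at column, and you only want to keep the last 6 hours (all names and the interval are placeholders):

-- Move the bloated table aside and start with a fresh, compact copy.
RENAME TABLE my_table TO my_table_old;
CREATE TABLE my_table LIKE my_table_old;

-- Copy back only the recent rows you still need.
INSERT INTO my_table
SELECT * FROM my_table_old
WHERE created_at >= NOW() - INTERVAL 6 HOUR;

-- Drop the old copy once the new table has been verified.
DROP TABLE my_table_old;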
If you can remove all the rows older than X hours, then PARTITIONing is the way to go. PARTITION BY RANGE on the hour and use DROP PARTITION to remove an old hour and REORGANIZE PARTITION to create a new hour. You should have X+2 partitions. More details.
If the deletes are more complex, please provide more details; perhaps we can come up with another solution that deals with the question about index size. Please include SHOW CREATE TABLE.
Even if you cannot use partitions for purging, it may be useful to have partitions for OPTIMIZE. Do not use OPTIMIZE PARTITION; it optimizes the entire table. Instead, use REORGANIZE PARTITION if you see you need to shrink the index.
How big is the table?
How big is innodb_buffer_pool_size?
(6GB index does not seem that bad, especially since you have 32GB of RAM.)
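A rough sketch of the hourly RANGE scheme (table, column, and partition names are hypothetical; in practice a script would add and drop the hourly boundaries):

CREATE TABLE my_log (
    id  BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    ts  DATETIME NOT NULL,
    msg VARCHAR(255),
    PRIMARY KEY (id, ts)        -- the partition column must be part of every unique key
)
PARTITION BY RANGE (TO_SECONDS(ts)) (
    PARTITION p2023010100 VALUES LESS THAN (TO_SECONDS('2023-01-01 01:00:00')),
    PARTITION p2023010101 VALUES LESS THAN (TO_SECONDS('2023-01-01 02:00:00')),
    PARTITION pFuture     VALUES LESS THAN MAXVALUE
);

-- Purging the oldest hour is a quick metadata operation with no index rebuild:
ALTER TABLE my_log DROP PARTITION p2023010100;

-- Split the catch-all partition to make room for the next hour:
ALTER TABLE my_log REORGANIZE PARTITION pFuture INTO (
    PARTITION p2023010102 VALUES LESS THAN (TO_SECONDS('2023-01-01 03:00:00')),
    PARTITION pFuture     VALUES LESS THAN MAXVALUE
);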

Table data handling - optimum usage

I have a table with 8 million records in MySQL.
I want to keep the last one week of data and delete the rest; I can take a dump and recreate the table in another schema.
I am struggling to get the queries right, so please share your views and the best approach to do this. What is the best way to delete so that it will not affect other tables in production?
Thanks.
MySQL offers you a feature called partitioning. You can do a horizontal partition and split your table by rows. 8 million isn't that much; what does the insertion rate per week look like?
CREATE TABLE MyVeryLargeTable (
    id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    my_date DATE NOT NULL,
    -- your other columns
    PRIMARY KEY (id, my_date)  -- the partitioning column must be part of every unique key
) PARTITION BY HASH (YEARWEEK(my_date)) PARTITIONS 4;
You can read more about it here: http://dev.mysql.com/doc/refman/5.1/en/partitioning.html
Edit: This creates 4 partitions, so it will last for 4 weeks; therefore I suggest switching to partitions based on months or years. The partition limit is quite high, but this really depends on what the insertion rate per week/month/year looks like.
Edit 2
MySQL 5.0 comes with an Archive engine; you should use this for your archive table ( http://dev.mysql.com/tech-resources/articles/storage-engine.html ). Now how do you get your data into the archive table? It seems like you have to write a cron job that runs at the beginning of every week, moving all old records to the archive table and deleting them from the original one. You could write a stored procedure for this, but the cron job needs to run on the shell. Keep in mind this could affect your data integrity in some way. What about upgrading to MySQL 5.1?
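A rough sketch of such a weekly move, assuming an archive table my_table_archive with the same columns and a my_date column on the source table (all names are placeholders):

-- Run once a week (e.g. from cron): copy old rows to the archive, then remove them.
INSERT INTO my_table_archive
SELECT * FROM my_table
WHERE my_date < CURDATE() - INTERVAL 7 DAY;

DELETE FROM my_table
WHERE my_date < CURDATE() - INTERVAL 7 DAY;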

Is there any way to do a bulk/faster delete in mysql?

I have a table with 10 million records. What is the fastest way to delete old rows while retaining the last 30 days?
I know this can be done with the event scheduler, but my worry is that if it takes too much time, it might lock the table for a long while.
It would be great if you could suggest some optimal way.
Thanks.
Offhand, I would:
Rename the table
Create an empty table with the same name as your original table
Grab the last 30 days from your "temp" table and insert them back into the new table
Drop the temp table
This will enable you to keep the table live through (almost) the entire process and get the past 30 days worth of data at your leisure.
You could try partitioned tables.
PARTITION BY LIST (TO_DAYS( date_field ))
This would give you 1 partition per day, and when you need to prune data you just:
ALTER TABLE tbl_name DROP PARTITION p#
http://dev.mysql.com/doc/refman/5.1/en/partitioning.html
Not that it helps you with your current problem, but if this is a regular occurance, you might want to look into a merge table: just add tables for different periods in time, and remove them from the merge table definition when no longer needed. Another option is partitioning, in which it is equally trivial to drop a (oldest) partition.
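A rough sketch of the MERGE-table idea, with hypothetical table names (note that MERGE only works over identical MyISAM tables):

-- One physical MyISAM table per period, combined behind a MERGE table.
CREATE TABLE log_2023_01 (ts DATETIME, msg VARCHAR(255)) ENGINE=MyISAM;
CREATE TABLE log_2023_02 LIKE log_2023_01;

CREATE TABLE log_all (ts DATETIME, msg VARCHAR(255))
    ENGINE=MERGE UNION=(log_2023_01, log_2023_02) INSERT_METHOD=LAST;

-- Dropping an old period: redefine the union without it, then drop the table.
ALTER TABLE log_all UNION=(log_2023_02);
DROP TABLE log_2023_01;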
To expand on Michael Todd's answer.
If you have the space,
Create a blank staging table similar to the table you want to reduce in size
Fill the staging table with only the records you want to have in your destination table
Do a double rename like the following
Assuming:
table is the table name of the table you want to purge a large amount of data from
newtable is the staging table name
no other tables are called temptable
rename table table to temptable, newtable to table;
drop table temptable;
This will be done in a single transaction, which will require an instantaneous schema lock. Most high concurrency applications won't notice the change.
Alternatively, if you don't have the space, and you have a long window to purge this data, you can use dynamic SQL to insert the primary keys into a temp table, and join the temp table in a delete statement. When you insert into the temp table, be aware of what max_allowed_packet is. Most installations of MySQL use 16MB (16777216 bytes). Your insert command for the temp table should stay under max_allowed_packet. This will not lock the table. You'll want to run OPTIMIZE TABLE afterwards to reclaim the space for the rest of the engine to use. You probably won't be able to reclaim disk space unless you shut down the engine and move the data files.
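A rough sketch of that idea, assuming the table is my_table with an integer primary key id and a created_at column (names, interval, and batch size are placeholders):

-- Collect the primary keys of the rows to purge, in manageable batches.
CREATE TEMPORARY TABLE purge_ids (id BIGINT UNSIGNED NOT NULL PRIMARY KEY);

INSERT INTO purge_ids (id)
SELECT id FROM my_table
WHERE created_at < NOW() - INTERVAL 30 DAY
LIMIT 10000;

-- Delete only the collected rows, then repeat until nothing is left to purge.
DELETE t FROM my_table t
JOIN purge_ids p ON p.id = t.id;

DROP TEMPORARY TABLE purge_ids;

-- Reclaim the freed space inside the tablespace afterwards.
OPTIMIZE TABLE my_table;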
Shut down your application, then:
SELECT .. INTO OUTFILE, filter the output, empty the table, and LOAD DATA LOCAL INFILE optimized_db.txt - it is cheaper to re-create the data than to UPDATE or DELETE in place.
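A rough sketch of that approach, assuming a created_at column; the file path and interval are placeholders:

-- Export only the rows worth keeping.
SELECT * INTO OUTFILE '/tmp/keep_rows.txt'
FROM my_table
WHERE created_at >= NOW() - INTERVAL 30 DAY;

-- Recreate or truncate the table, then reload the kept rows.
TRUNCATE TABLE my_table;
LOAD DATA INFILE '/tmp/keep_rows.txt' INTO TABLE my_table;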

SQL Server 2008: Disable index on one particular table partition

I am working with a big table (~100.000.000 rows) in SQL Server 2008. Frequently, I need to add and remove batches of ~30.000.000 rows to and from this table. Currently, before loading a large batch into the table, I disable indexes, I insert the data, then I rebuild the index. I have measured this to be the fastest approach.
Recently, I have been considering implementing table partitioning on this table to increase speed. I will partition the table according to my batches.
My question: would it be possible to disable the index of one particular partition, and load the data into that partition before enabling it again? In that case, the rest of my table would not have to suffer a complete index rebuild, and my loading could be even faster.
Indexes are typically on the Partition Scheme. For the scenario you are talking about you can actually load up a new table with the batch (identical structure, different name) and then use the SWITCH command to add this table as a new partition into your existing table.
I have included code that I use to perform this, you will need to modify it based on your table names:
DECLARE @importPart int
DECLARE @hourlyPart int
SET @importPart = 2 -- always, so long as the Import table is only made up of 1 partition
-- get the Hourly partition
SELECT
    @hourlyPart = MAX(V.boundary_id) + 1
FROM
    sys.partition_range_values V
    JOIN sys.partition_functions F
        ON V.function_id = F.function_id
        AND F.name = 'pfHourly';
ALTER TABLE Import
    SWITCH PARTITION @importPart
    TO Hourly PARTITION @hourlyPart;
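For the load itself, a rough sketch (table, filegroup, and boundary values are hypothetical): the staging table must match the target's structure and indexes, sit on the same filegroup as the target partition, and carry a CHECK constraint proving its rows fit that partition before the SWITCH is allowed.

-- Staging table with the same structure as the target, on the target filegroup.
CREATE TABLE ImportStage (
    EventTime datetime2 NOT NULL,
    Payload   varchar(200) NULL
) ON [FG_Current];

-- Constraint proving every row belongs to the target partition's range.
ALTER TABLE ImportStage WITH CHECK
    ADD CONSTRAINT CK_ImportStage_Range
    CHECK (EventTime >= '2023-01-01T10:00:00' AND EventTime < '2023-01-01T11:00:00');

-- Bulk load and index ImportStage here, then switch it into the partitioned table:
-- ALTER TABLE ImportStage SWITCH TO Hourly PARTITION 42;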