In my HDFS, I've partitioned data by date and event_id, and have about 1.4 million parquet files. Today, to analyze the data in Apache Spark, I use spark.read.parquet("/path/to/root/"). This takes about 30 minutes to list the files, I have to do this every time, and it's getting annoying.
Now, I want to set up an external table, using MySQL as the Hive Metastore. I'm currently facing the known issue where discovering all 1.4 million partitions takes forever. As we all know, MSCK REPAIR TABLE my_table is out of the picture. Instead, I generated a long query (about 400 MB) that looks like this:
ALTER TABLE my_table ADD
PARTITION (date = 'YYYY-MM-DD', event_id = "<some value>") LOCATION ''
PARTITION (date = 'YYYY-MM-DD', event_id = "<some value>") LOCATION ''
PARTITION (date = 'YYYY-MM-DD', event_id = "<some value>") LOCATION ''
...
PARTITION (date = 'YYYY-MM-DD', event_id = "<some value>") LOCATION ''
PARTITION (date = 'YYYY-MM-DD', event_id = "<some value>") LOCATION ''
It has been 3 hours, and it still has only processed less than 100,000 partitions. I have observed a few things:
Spark does one partition at a time.
Spark seems to check each path for existence.
All of this adds to the long running time. I've searched, and haven't been able to find out how to disable either operation.
So, I want to manually perform SQL operations against the MySQL database and table for the Hive metastore, to create and manage the tables. I've looked but have been unable to figure out how to manually manage those tables. Please, does anyone know how to do that? Specifically, I want the following:
How can I create an external table with partitions by making direct entries into the Hive metastore tables?
How can I manage an External table partition by making direct upsert queries against the Hive metastore tables?
Is there a good resource I could use to learn about the backing tables in the metastore? I feel that doing the inserts manually would be much faster. Thank you.
I think the core problem here is that you have too many partitions. Partitioning should generally be done on a low-cardinality column (something with a relatively small number of distinct values, compared to the total number of records). Typically you want to err on the side of having a smaller number of large files, rather than a large number of small files.
In your example, date is probably a good partitioning column, assuming there are many records for each date. If there are a large number of distinct values for event_id, that's not a great candidate for partitioning. Just keep it as an unpartitioned column.
An alternative to partitioning for a high-cardinality column is bucketing. This groups similar values of the bucketed column so they're in the same file, but doesn't create a separate file for each value. The AWS Athena docs have a good overview of the concept.
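To make the idea concrete, here is a minimal sketch in plain Python of how bucketing maps many distinct values onto a fixed number of files. It is only an illustration: zlib.crc32 stands in for the hash (Hive and Spark use their own hash functions), and the bucket count of 8 and the sample event_id values are made up.

```python
import zlib
from collections import defaultdict

def bucket_for(value: str, num_buckets: int) -> int:
    """Map a value to a bucket index with a stable hash (crc32 is a stand-in)."""
    return zlib.crc32(value.encode("utf-8")) % num_buckets

def group_into_buckets(event_ids, num_buckets):
    """Group event_ids by bucket; each bucket would be one file per date partition."""
    buckets = defaultdict(list)
    for event_id in event_ids:
        buckets[bucket_for(event_id, num_buckets)].append(event_id)
    return dict(buckets)

# Hypothetical data: 1,000 distinct event_ids land in at most 8 files, not 1,000.
event_ids = [f"event-{i}" for i in range(1000)]
buckets = group_into_buckets(event_ids, num_buckets=8)
print(len(buckets), "buckets for", len(event_ids), "distinct event_ids")
```

The same event_id always hashes to the same bucket, so a lookup by event_id only needs to read one file per date instead of scanning them all.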
This can be an issue with statistics auto-gathering. As a workaround, switch off hive.stats.autogather before recovering partitions.
Switch off statistics auto-gathering:
set hive.stats.autogather=false;
Run MSCK REPAIR or ALTER TABLE RECOVER PARTITIONS.
If you need statistics to be fresh, you can execute ANALYZE separately for new partitions only.
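The steps above could be scripted roughly as follows. This is a sketch that only builds the HiveQL text; the helper names are hypothetical, the partition columns follow the date/event_id layout from the question, and it assumes the partition values contain no quote characters.

```python
def partition_spec(date: str, event_id: str) -> str:
    """Render a partition spec for the date/event_id layout from the question."""
    return f"date='{date}', event_id='{event_id}'"

def recovery_script(table: str, new_partitions) -> str:
    """Build the statements: disable auto-gathering, repair, then analyze new partitions."""
    lines = [
        "SET hive.stats.autogather=false;",
        f"MSCK REPAIR TABLE {table};",
    ]
    # Refresh statistics only for the partitions that were just added.
    for date, event_id in new_partitions:
        spec = partition_spec(date, event_id)
        lines.append(f"ANALYZE TABLE {table} PARTITION ({spec}) COMPUTE STATISTICS;")
    return "\n".join(lines)

# Hypothetical partition values, for illustration only.
print(recovery_script("my_table", [("2020-01-01", "42"), ("2020-01-02", "42")]))
```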
Related tickets are HIVE-18743, HIVE-19489, HIVE-17478
Our applications read data from sensor complexes and write them to a database, together with their timestamp. New data are inserted about 5 times per second per sensor complex (1..10 complexes per database server; data contain 2 blobs of typically 25kB and 50kB, resp.), they are read from 1..3 machines (simple reads like: select * from table where sensorId=?sensorId and timestamp>?lastTimestamp). Rows are never updated; no reports are created on the database side; old rows are deleted after several days. Only one of the tables receives occasional updates.
The primary index of that main table is an autogenerated id, with additional indices for sensorid and timestamp.
The performance is currently abysmal. The deletion of old data takes hours(!), and many data packets are not sent to the database because the insertion process takes longer than the interval between sensor reads. How can we optimize the performance of the database in such a specific scenario?
Setting the transaction isolation level to READ_COMMITTED looks promising, and innodb_lock_wait_timeout also seems useful. Can you suggest further settings useful in our specific scenario?
Can we gain further possibilities when we get rid of the table which receives updates?
Deleting old data -- PARTITION BY RANGE(TO_DAYS(...)) lets you DROP PARTITION a looooot faster than doing DELETEs.
More details: http://mysql.rjweb.org/doc.php/partitionmaint
And that SELECT you mentioned needs this 'composite' index:
INDEX(sensorId, timestamp)
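The DROP PARTITION approach can be sketched as a small daily rotation job. The following Python only generates the DDL; it assumes the table is already partitioned with PARTITION BY RANGE (TO_DAYS(timestamp)), that partitions are named pYYYYMMDD, and that a catch-all partition called future exists. All of these names are illustrative, not taken from the question. Note also that MySQL requires every unique key, including the primary key, to contain the partitioning column.

```python
from datetime import date, timedelta

def partition_name(day: date) -> str:
    """Hypothetical naming convention: one partition per day, e.g. p20240108."""
    return f"p{day:%Y%m%d}"

def rotate_statements(table: str, oldest: date, newest: date):
    """Return DDL that drops the partition for `oldest` and adds one for `newest`.

    Assumes a catch-all partition named `future` defined with
    VALUES LESS THAN MAXVALUE, which is split to make room for the new day.
    """
    upper = newest + timedelta(days=1)  # a RANGE partition's upper bound is exclusive
    return [
        f"ALTER TABLE {table} DROP PARTITION {partition_name(oldest)};",
        f"ALTER TABLE {table} REORGANIZE PARTITION future INTO ("
        f"PARTITION {partition_name(newest)} VALUES LESS THAN (TO_DAYS('{upper:%Y-%m-%d}')), "
        f"PARTITION future VALUES LESS THAN MAXVALUE);",
    ]

for statement in rotate_statements("sensor_data", date(2024, 1, 1), date(2024, 1, 8)):
    print(statement)
```

Dropping a partition discards its rows in one metadata-level operation, which is why it is so much cheaper than a row-by-row DELETE.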
I am facing a performance issue in MySQL due to the large index size on my table. The index has grown to 6GB, and my instance is running on 32GB of memory. The majority of rows in that table are not required after a few hours and can be removed selectively. But removing them is time-consuming and doesn't reduce the index size.
Please suggest some solution to manage this index.
You can run OPTIMIZE TABLE to rebuild the indexes and reclaim space that is not released after deletion:
optimize table table_name;
But because your table is large, it will be locked during OPTIMIZE TABLE, and you also still need a way to remove old data, since you don't need data that is more than a few hours old. So you can proceed as follows:
Step 1: During night hours, or whenever there is less traffic on your database, rename your main table and create a new table with the same name. Then insert the last few hours of data from the old table into the new table.
This removes the unwanted data, and the new table will also be optimized.
Step 2: To avoid this issue in future, create a stored procedure that runs once per day during night hours and either deletes data up to the previous day (as per your requirement) from this table or moves it to a historical table.
Step 3: Since your table will then only ever hold a single day of data, you can easily run an OPTIMIZE TABLE statement to rebuild it and reclaim space.
Note: a DELETE statement will not rebuild the index and will not free space on the server. For that you need to optimize the table, which can be done in various ways, for example with an ALTER statement or an OPTIMIZE statement.
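Step 1 might look roughly like the statements generated below. This is a sketch only: the table and column names are placeholders, the _old suffix is an arbitrary choice, and the final DROP TABLE is my assumption (leave it out if you want to keep the old rows as an archive).

```python
def swap_statements(table: str, ts_column: str, keep_hours: int):
    """Build the rename / recreate / copy-back statements for Step 1."""
    old = f"{table}_old"  # hypothetical name for the renamed table
    return [
        f"RENAME TABLE {table} TO {old};",
        f"CREATE TABLE {table} LIKE {old};",  # same columns and indexes, no rows
        f"INSERT INTO {table} SELECT * FROM {old} "
        f"WHERE {ts_column} >= NOW() - INTERVAL {keep_hours} HOUR;",
        f"DROP TABLE {old};",  # assumption: the old rows are no longer needed
    ]

for statement in swap_statements("events", "created_at", keep_hours=3):
    print(statement)
```

Between the first two statements the original table name briefly does not exist, so writers will fail during that window; that is one more reason to run this during low traffic.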
If you can remove all the rows older than X hours, then PARTITIONing is the way to go. PARTITION BY RANGE on the hour and use DROP PARTITION to remove an old hour and REORGANIZE PARTITION to create a new hour. You should have X+2 partitions. More details.
If the deletes are more complex, please provide more details; perhaps we can come up with another solution that deals with the question about index size. Please include SHOW CREATE TABLE.
Even if you cannot use partitions for purging, it may be useful to have partitions for OPTIMIZE. Do not use OPTIMIZE PARTITION; it optimizes the entire table. Instead, use REORGANIZE PARTITION if you see you need to shrink the index.
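One way to read the X+2 figure: X+1 hourly partitions cover the window from X hours ago up to the current hour, and one empty catch-all partition waits to be split for the next hour. The sketch below lists that layout for a given moment. The pYYYYMMDDHH naming, the future partition, and the use of TO_SECONDS as the partitioning expression are assumptions for illustration.

```python
from datetime import datetime, timedelta

def hourly_partitions(now: datetime, keep_hours: int):
    """List (name, upper_bound) pairs: keep_hours + 1 hourly partitions plus a catch-all."""
    current = now.replace(minute=0, second=0, microsecond=0)
    parts = []
    for back in range(keep_hours, -1, -1):
        start = current - timedelta(hours=back)
        upper = start + timedelta(hours=1)  # exclusive upper bound of this hour
        parts.append((f"p{start:%Y%m%d%H}", f"TO_SECONDS('{upper:%Y-%m-%d %H:%M:%S}')"))
    parts.append(("future", "MAXVALUE"))  # empty partition, split when the next hour starts
    return parts

for name, bound in hourly_partitions(datetime(2024, 1, 1, 12, 30), keep_hours=3):
    print(f"PARTITION {name} VALUES LESS THAN ({bound})")
```

Each hour, the job would DROP the oldest partition in this list and REORGANIZE the future partition into a new hourly partition plus a fresh future partition.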
How big is the table?
How big is innodb_buffer_pool_size?
(6GB index does not seem that bad, especially since you have 32GB of RAM.)
I have a MySQL database with a lot of records (about 4,000,000,000 rows), and I want to process them in order to reduce them to about 1,000,000,000 rows.
Assume I have following tables:
table RawData: more than 5000 rows per second are inserted into RawData
table ProcessedData: this table is processed (aggregated) storage for the rows that were inserted into RawData.
minimum rows count > 20,000,000
table ProcessedDataDetail: I write the details of table ProcessedData (the data that was aggregated) here
Users want to view and search the ProcessedData table, which requires joining more than 8 other tables.
Inserting into RawData and searching in ProcessedData (ProcessedData INNER JOIN ProcessedDataDetail INNER JOIN ...) are very slow. I used a lot of indexes. Assume my data length is 1G, but my index length is 4G :). (I want to get rid of these indexes; they slow down my process.)
How can I increase the speed of this process?
I think I need a shadow table of ProcessedData, named ProcessedDataShadow. I would then process RawData and aggregate it with ProcessedDataShadow, then insert the result into ProcessedDataShadow and ProcessedData. What do you think?
(I am developing the project by C++)
thank you in advance.
Without knowing more about what your actual application is, I have these suggestions:
Use InnoDB if you aren't already. InnoDB uses row-level locks and is much better at handling concurrent updates/inserts. It will be slower if you don't work concurrently, but the row-locking is probably a must-have for you, depending on how many sources you will have for RawData.
Indexes usually speed things up, but badly chosen indexes can make things slower. I don't think you want to get rid of them, but a lot of indexes can make inserts very slow. It is possible to disable indexes when inserting batches of data, in order to prevent updating indexes on each insert.
If you will be selecting huge amounts of data that might disturb the data collection, consider using a replicated slave database server that you use only for reading. Even if that locks rows/tables, the primary (master) database won't be affected, and the slave will get back up to speed as soon as it is free to do so.
Do you need to process data in the database? If possible, maybe collect all data in the application and only insert ProcessedData.
You've not said what the structure of the data is, how it's consolidated, how promptly data needs to be available to users, or how lumpy the consolidation process can be.
However the most immediate problem will be sinking 5000 rows per second. You're going to need a very big, very fast machine (probably a sharded cluster).
If possible I'd recommend writing a consolidating buffer (using an in-memory hash table - not in the DBMS) to put the consolidated data into - even if it's only partially consolidated - then update from this into the processedData table rather than trying to populate it directly from the rawData.
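A minimal sketch of such a consolidating buffer, in Python for brevity (the same shape carries over to a C++ unordered_map). The question does not say how rows are aggregated, so the count-and-sum aggregation and the key used here are purely assumptions.

```python
from collections import defaultdict

class ConsolidatingBuffer:
    """Aggregate raw rows in memory and hand back consolidated rows on flush."""

    def __init__(self):
        # key -> [row_count, value_sum]; the aggregation itself is an assumption
        self._totals = defaultdict(lambda: [0, 0.0])

    def add(self, key, value):
        """Fold one raw row into the in-memory aggregate for its key."""
        entry = self._totals[key]
        entry[0] += 1
        entry[1] += value

    def flush(self):
        """Return consolidated (key, count, total) rows and reset the buffer."""
        rows = [(key, count, total) for key, (count, total) in self._totals.items()]
        self._totals.clear()
        return rows

# Hypothetical raw rows: (key, value)
buffer = ConsolidatingBuffer()
for key, value in [("sensor-a", 1.0), ("sensor-a", 2.0), ("sensor-b", 5.0)]:
    buffer.add(key, value)
consolidated = buffer.flush()  # three raw rows become two consolidated rows
print(consolidated)
```

Flushing on a timer or after N rows and writing the result with a single batched upsert means the database sees far fewer writes than the raw insert rate.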
Indeed, I'd probably consider separating the raw and consolidated data onto separate servers/clusters (the MySQL federated engine is handy for providing a unified view of the data).
Have you analysed your queries to see which indexes you really need? (hint - this script is very useful for this).
I have a table with 8 million records in MySQL.
I want to keep the last week of data and delete the rest. I can take a dump and recreate the table in another schema.
I am struggling to get the queries right. Please share your views and the best approach for this: what is the best way to delete so that it will not affect other tables in production?
Thanks.
MySQL offers a feature called partitioning. You can do a horizontal partition and split your tables by rows. 8 million isn't that much; what is the insertion rate per week?
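CREATE TABLE MyVeryLargeTable (
    id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    my_date DATE NOT NULL,
    -- your other columns
    PRIMARY KEY (id, my_date) -- every unique key must include the partitioning column
) PARTITION BY HASH (YEARWEEK(my_date)) PARTITIONS 4;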
You can read more about it here: http://dev.mysql.com/doc/refman/5.1/en/partitioning.html
Edit: This one creates 4 partitions, so this will last for 4 weeks. Therefore I suggest changing to partitions based on months/years. The partition limit is quite high, but it really depends on what the insertion rate per week/month/year looks like.
Edit 2
MySQL 5.0 comes with an Archive engine; you should use this for your archive table ( http://dev.mysql.com/tech-resources/articles/storage-engine.html ). Now, how do you get your data into the archive table? It seems you have to write a cron job that runs at the beginning of every week, moving all records to the archive table and deleting them from the original one. You could write a stored procedure for this, but the cron job needs to run from the shell. Keep in mind this could affect your data integrity in some way. What about upgrading to MySQL 5.1?
I am working with a big table (~100.000.000 rows) in SQL Server 2008. Frequently, I need to add and remove batches of ~30.000.000 rows to and from this table. Currently, before loading a large batch into the table, I disable indexes, I insert the data, then I rebuild the index. I have measured this to be the fastest approach.
Recently, I have been considering implementing table partitioning on this table to increase speed. I would partition the table according to my batches.
My question: is it possible to disable the index of one particular partition and load the data into that partition before enabling the index again? In that case, the rest of my table would not have to suffer a complete index rebuild, and my loading could be even faster.
Indexes are typically on the Partition Scheme. For the scenario you are talking about you can actually load up a new table with the batch (identical structure, different name) and then use the SWITCH command to add this table as a new partition into your existing table.
I have included code that I use to perform this, you will need to modify it based on your table names:
DECLARE @importPart int
DECLARE @hourlyPart int
SET @importPart = 2 -- always, so long as the Import table is only made up of 1 partition
-- get the Hourly partition
SELECT
    @hourlyPart = MAX(V.boundary_id) + 1
FROM
    sys.partition_range_values V
    JOIN sys.partition_functions F
        ON V.function_id = F.function_id
        AND F.name = 'pfHourly'
ALTER TABLE Import
    SWITCH PARTITION @importPart
    TO Hourly PARTITION @hourlyPart;