Does anyone have any tips that could help speed up the process of breaking down a table and inserting a large number of records into a new table?
I'm currently using Access and VBA to convert a table that contains records with a large string (700+ characters) into a new table where each character has its own record (row). I'm doing this by looping through the string one character at a time and inserting into the new table using simple DAO in VBA.
Currently I'm working with a small subset of data - 300 records, each with a 700-character string. This process takes about 3 hours to run, so it isn't going to scale up to the full dataset of 50,000 records!
Table 1 structure:
id - string
001 - abcdefg
becomes
Table 2 structure:
id - string
001 - a
001 - b
001 - c
...
I'm open to any suggestions that could improve things.
Cheers
Phil
Consider this example using Northwind. Create a table called Sequence with a single INTEGER (Access: Long Integer) column named seq and populate it with the values 1 to 20 (i.e. a 20-row table). Then use this ACE/Jet SQL to parse each letter of the employees' last names:
SELECT E1.EmployeeID, E1.LastName, S1.seq, MID(E1.LastName, S1.seq, 1) AS Letter
FROM Employees AS E1, Sequence AS S1
WHERE S1.seq BETWEEN 1 AND LEN(E1.LastName);
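The same join can also feed an INSERT so the new table is populated in a single statement instead of row-by-row DAO. A minimal sketch, assuming the source and target tables and columns are named Table1, Table2, ID, FullString, and Letter (placeholders for the actual names):
INSERT INTO Table2 (ID, Letter)
SELECT T1.ID, MID(T1.FullString, S1.seq, 1) AS Letter
FROM Table1 AS T1, Sequence AS S1
WHERE S1.seq BETWEEN 1 AND LEN(T1.FullString);
With strings of 700+ characters, make sure the Sequence table covers the longest string (e.g. values 1 to 1000 rather than 1 to 20).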
When doing bulk inserts, you can often get a substantial performance boost by dropping the table's indexes, doing the bulk insert, and then restoring the indexes. In one case, when inserting a couple million records into a MySQL table, I've seen this bring the run time down from 17 hours to about 20 minutes.
I can't advise specifically regarding Access (I haven't used it since Access 2, 15 or so years ago), but the general technique is applicable to pretty much any database engine.
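In MySQL terms the pattern looks roughly like this (table, index, and column names are placeholders, not taken from the posts above):
ALTER TABLE target_table DROP INDEX idx_name;
-- ... run the bulk INSERT / LOAD DATA here ...
ALTER TABLE target_table ADD INDEX idx_name (some_col);
Typically you leave the primary key in place and only drop and rebuild the secondary indexes.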
We have a routine that transposes data. Not sure if the code is optimized, but it runs significantly faster after the file has been compacted.
Doing a lot of deleting and rebuilding of tables bloats an .mdb file significantly.
Related
I am helping to manage a data resource sitting on a SQL Server (1.3M+ records, 125 columns). The data model is fixed in its current state, though a handful of columns were added in the past month.
I have a DB app that copies a subset of records from the primary table into a local table to let users efficiently review/edit, then writes the updated records back to the server. It has been working well since 2013. A typical subset is 3K to 10K records.
SELECT dbo_TblMatchedTb.* INTO TblMatchedTb
FROM dbo_TblMatchedTb
WHERE (((dbo_TblMatchedTb.INVID)=11339));
This week, while creating the local "copy", I saw an error for the first time:
"Record Too Large - err 3047"
Inspecting the dataset produced via the statement above (in a CSV export), I found 7 records MUCH longer than average. These records were 2087 chars wide, vs. about 1500 chars on average (including the CSV commas).
Via a bit of manual iteration, I was able to copy over the records when the max record width was shortened to < 1907 chars.
Question:
Is there an efficient method/query to measure the current total record width in the local table described above (3K to 10K records, 125 columns)? If I can identify records approaching some limit, I can TRIM several candidate data values (e.g., from 255 to 100 chars).
I can't touch the schema, but I can conditionally shorten some of the less-critical data values.
Any ideas?
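One low-tech approach, sketched here with Field1 to Field3 standing in for the real column names, is a query that sums the lengths of the text columns for each record; with 125 columns the expression is long, but it can be generated from the field list:
SELECT INVID,
       Len(Nz([Field1],"")) + Len(Nz([Field2],"")) + Len(Nz([Field3],"")) AS TotalWidth
FROM TblMatchedTb
ORDER BY TotalWidth DESC;
Any record whose TotalWidth approaches the limit you hit is a candidate for trimming before the copy.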
I am in a situation where I need to store data for 1,900+ cryptocurrencies every minute; I use MySQL InnoDB.
Currently, the table looks like this:
coins_minute_id | coins_minute_coin_fk | coins_minute_usd | coins_minute_btc | coins_minute_datetime | coins_minute_timestamp
coins_minute_id = autoincrement id
coins_minute_coin_fk = medium int unsigned
coins_minute_usd = decimal 20,6
coins_minute_btc = decimal 20,8
coins_minute_datetime = datetime
coins_minute_timestamp = timestamp
The table grew incredibly fast in no time; every minute, 1,900+ rows are added to it.
The data will be used for historical price display as a D3.js line graph for each cryptocurrency.
My question is: how do I best optimize this database? I have thought of collecting the data only every 5 minutes instead of every minute, but that will still add up to a lot of data in no time. I have also wondered whether it would be better to create a separate table for each cryptocurrency. Does anyone who loves designing databases know some other smart and clever way to handle this?
Kind regards
(From Comment)
SELECT coins_minute_coin_fk, coins_minute_usd
FROM coins_minutes
WHERE coins_minute_datetime >= DATE_ADD(NOW(),INTERVAL -1 DAY)
AND coins_minute_coin_fk <= 1000
ORDER BY coins_minute_coin_fk ASC
Get rid of coins_minute_ prefix; it clutters the SQL without providing any useful info.
Don't specify the time twice -- there are simple functions to convert between DATETIME and TIMESTAMP. Why do you have both a 'created' and an 'updated' timestamp? Are you doing UPDATE statements? If so, then the code is more complicated than simply "inserting", and you need a unique key to know which row to update.
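For example, keeping only the DATETIME column, the epoch value can still be derived when needed:
SELECT UNIX_TIMESTAMP(coins_minute_datetime) FROM coins_minutes;  -- DATETIME -> epoch seconds
SELECT FROM_UNIXTIME(1512345678);  -- epoch seconds -> DATETIME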
Provide SHOW CREATE TABLE; it is more descriptive than what you provided.
30 inserts/second is easily handled. 300/sec may have issues.
Do not PARTITION the table without some real reason to do so. The common valid reason is that you want to delete 'old' data periodically. If you are deleting after 3 months, I would build the table with PARTITION BY RANGE(TO_DAYS(...)) and use weekly partitions. More discussion: http://mysql.rjweb.org/doc.php/partitionmaint
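If purging old rows really is the plan, the table definition would include something along these lines (a sketch only; note that MySQL requires the partitioning column to appear in every unique key, so the PRIMARY KEY would have to include coins_minute_datetime):
ALTER TABLE coins_minutes
PARTITION BY RANGE (TO_DAYS(coins_minute_datetime)) (
    PARTITION p2018w01 VALUES LESS THAN (TO_DAYS('2018-01-08')),
    PARTITION p2018w02 VALUES LESS THAN (TO_DAYS('2018-01-15')),
    PARTITION pfuture VALUES LESS THAN MAXVALUE
);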
Show us the queries. A schema cannot be optimized without knowing how it will be accessed.
"Batch" inserts are much faster than single-row INSERT statements. This can be in the form of INSERT INTO x (a,b) VALUES (1,2), (11,22), ... or LOAD DATA INFILE. The latter is very good if you already have a CSV file.
Does your data come from a single source? Or 1900 different sources?
MySQL and MariaDB are probably identical for your task. (Again, need to see queries.) PDO is fine for either; no recoding needed.
After seeing the queries, we can discuss what PRIMARY KEY to have and what secondary INDEX(es) to have.
1 minute vs 5 minutes? Do you mean that you will gather only one-fifth as many rows in the latter case? We can discuss this after the rest of the details are brought out.
That query does not make sense in multiple ways. Why stop at "1000"? The output is quite large; what client cares about that much data? The ordering is indefinite -- the datetime is not guaranteed to be in order. Why specify the usd without specifying the datetime? Please provide a realistic query; then I can help you with INDEX(es).
I am looking for advice on whether there is any way to speed up the import of about 250 GB of data into a MySQL table (InnoDB) from eight source csv files of approx. 30 GB each. The csv's have no duplicates within themselves, but do contain duplicates between files -- in fact some individual records appear in all 8 csv files. So those duplicates need to be removed at some point in the process. My current approach creates an empty table with a primary key, and then uses eight “LOAD DATA INFILE [...] IGNORE” statements to sequentially load each csv file, while dropping duplicate entries. It works great on small sample files. But with the real data, the first file takes about 1 hour to load, then the second takes more than 2 hours, the third one more than 5, the fourth one more than 9 hours, which is where I’m at right now. It appears that as the table grows, the time required to compare the new data to the existing data is increasing... which of course makes sense. But with four more files to go, it looks like it might take another 4 or 5 days to complete if I just let it run its course.
Would I be better off importing everything with no indexes on the table, and then removing duplicates after? Or should I import each of the 8 csv's into separate temporary tables and then do a union query to create a new consolidated table without duplicates? Or are those approaches going to take just as long?
Plan A
You have a column for dedupping; let's call it name.
CREATE TABLE New (
name ...,
...
PRIMARY KEY (name) -- no other indexes
) ENGINE=InnoDB;
Then, 1 csv at a time:
* Sort the csv by name (this makes any caching work better)
* LOAD DATA ...
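A sketch of that step, with the file path, delimiters, and column list as placeholders:
LOAD DATA INFILE '/path/to/file1_sorted.csv'
IGNORE INTO TABLE New
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
(name, col2, col3);
The IGNORE clause is what silently drops rows whose name already exists in New.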
Yes, something similar to Plan A could be done with temp tables, but it might not be any faster.
Plan B
Sort all the csv files together (probably the unix "sort" can do this in a single command?).
Plan B is probably fastest, since it is extremely efficient in I/O.
I just inserted data with a form on my website; normally the data is inserted at the end of the rows, like:
auto_increment name
1 a
2 b
3 c
4 d
5 e
but the last time I inserted new data, it ended up in the middle rows of the table, like:
17 data17
30 data30
18 data18
19 data19
20 data20
The newest data was inserted into the middle rows of the table (data30).
This happens to me rarely (but it still happens). Why does this happen, and how do I prevent it in the future? Thank you.
What you see is the order in which the engine returns the result. It hardly matters which record is fetched earlier and which later, as that depends on a lot of issues. For one, don't think of your database table as a sequential file like FoxPro; it is far more sophisticated than that. Next, for every query that returns data, use an ORDER BY clause to avoid these surprises.
So always use:
SELECT columns FROM table ORDER BY column
The above will ensure you get the data the way you need it, and you won't be surprised when the DB engine finds a later record in cache while fetching an older record from slower media in another database file. If you read up on the basics of RDBMS concepts, these things are discussed, as is how MySQL works internally.
I found this great article that discusses the many wonderful features of a modern database query engine.
http://thinkingmonster.wordpress.com/mysql/mysql-architecture/
Although the entire article covers the topic well, pay extra attention to the section that talks about the Record Cache.
I have 3 very large tables with clustered indexes on composite keys. No updates, only inserts. New inserts will not be within the existing index range, but they will not align with the clustered index, and these tables get a lot of inserts (hundreds to thousands per second). What I would like to do is DBREINDEX with Fill Factor = 100, but then set a Fill Factor of 5 and have that Fill Factor applied ONLY to inserts. Right now a Fill Factor applies to the whole table only. Is there a way to have a Fill Factor that applies to inserts (or inserts and updates) only? I don't care about select speed at this time. I am loading data. When the data load is complete I will DBREINDEX at 100. A Fill Factor of 10 versus 30 doubles the rate at which new data is inserted. This load will take a couple of days, and it cannot go live until the data is loaded. The clustered indexes are aligned with the dominant query used by the end-user application.
My practice is to DBREINDEX daily, but the problem is that now that the tables are getting large, a DBREINDEX at 10 takes a long time. I have considered loading into "daily" tables and then inserting that data daily, sorted by the clustered index, into the production tables.
If you read this far, even more detail: the indexes are all composite and I am running 6 instances of the parser on an 8-core server (a lot of testing, and that seems to give the best throughput). The data out of a SINGLE parser is in PK order, and I am doing the inserts 990 values at a time (SQL value limits). The 3 active tables only share data via a foreign key relationship with a single, relatively inactive 4th table. My thought at this time is to have holding tables for each parser and then have another process that polls those tables for the next complete insert and moves the data into the production table in PK order. That is going to be a lot of work. I hope someone has a better idea.
The parses start in PK order but rarely finish in PK order. Some individual parses are so large that I could not hold all the data in memory until the end. Right now the SQL insert is slightly faster than the parse that creates the data. In an individual parse I run the insert asynchronously and go on parsing, but I don't insert until the prior insert is complete.
I agree you should have holding tables for the parser data and only insert into the main tables when you're ready. I implemented something similar in a former life (data was quasi-hashed into 10 tables based on mod 10 of the unique ID, then rolled into the primary table later - primarily to assist load speed). If you're going to use holding tables then I see no need to have them at anything but FF = 100. The fewer pages you have to use, the better.
Apparently, too, you should test the difference between permanent tables, #temp tables, and table-valued parameters. :-)
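For the final rebuild at 100% once the load is done, the newer equivalent of DBCC DBREINDEX is ALTER INDEX; a sketch with placeholder object names:
ALTER INDEX ALL ON dbo.BigTable
REBUILD WITH (FILLFACTOR = 100, SORT_IN_TEMPDB = ON);
The same statement with a lower FILLFACTOR is what you'd run during the load, but, as noted above, the fill factor applies to the whole index, not just to newly inserted pages.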