SSIS Flat File data column compare against table's column data range - sql-server-2008

I need to develop an SSIS package where I will need to import/use a flat file (which has only 1 column) and compare each row against an existing table's 2 columns (start and end).
Flat File data -
110000
111112
111113
112222
112525
113222
113434
113453
114343
114545
And compare each row of the flat file against structure/data -
id start end
8 110000 119099
8 119200 119999
3 200000 209999
3 200000 209999
2 300000 300049
2 770000 779999
2 870000 879999
Now, if I needed to implement this in a simple stored procedure it would be fairly simple; however, I am not able to get my head around it when I have to do it in an SSIS package.
Any ideas? Any help much appreciated.

At the core, you will need to use a Lookup Component. Write a query, SELECT T.id, T.start, T.end FROM dbo.MyTable AS T and use that as your source. Map the input column to the start column and select the id so that it will be added to the data flow.
If you hit run, it will perform an exact lookup and only find values of 110000 and 119200. To convert it to a range query, you will need to go into the Advanced tab. There should be 3 things you can check: amount of memory, rows and customize the query. When you check the last one, you should get a query like
SELECT * FROM
(SELECT T.id, T.start, T.end FROM dbo.MyTable AS T) AS ref
WHERE ref.start = ?
You will need to modify that to become
SELECT * FROM
(SELECT T.id, T.start, T.end FROM dbo.MyTable AS T) AS ref
WHERE ? BETWEEN ref.start AND ref.end
It's been my experience that the range queries can become rather inefficient, as the component only caches exact values it has already seen, so if the source file had 110001, 110002, 110003 you would see 3 separate queries sent to the database. For small data sets that may not be so bad, but it led to some ugly load times for my DW.
An alternative to this is to explode the ranges. In my case, I had a source system that only kept date ranges and I needed to know certain counts by day. The range lookups were not performing well, so I crafted a query to convert a single row with a range of 2010-01-01 to 2013-07-07 into many rows, each with a single date: 2010-01-01, 2010-01-02... While this approach led to a longer pre-execute phase (it took a few minutes, as the query had to generate ~30k rows per day for the past 5 years), once cached locally it was a simple seek to find a given transaction by day.
Preferably, I'd create a numbers table, fill it to the max of int and be done with it but you might get by with just using an inline table valued function to generate numbers. Your query would then look something like
SELECT
T.id
, GN.number
FROM
dbo.MyTable AS T
INNER JOIN
-- Make this big enough to satisfy your theoretical ranges
dbo.GenerateNumbers(1000000) AS GN
ON GN.number BETWEEN T.start and T.end;
That would get used in a "straight" lookup without the need for any of the advanced features. The lookup is going to get very memory hungry, though, so make the query as tight as possible. For example, cast GN.number from a bigint to an int in the source query if you know your values will fit in an int.
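dbo.GenerateNumbers above is assumed rather than shown; a minimal sketch of such an inline table valued function (the cross join of sys.all_objects is just a convenient row source, any tally source works) might be:
CREATE FUNCTION dbo.GenerateNumbers (@upper int)
RETURNS TABLE
AS
RETURN
(
    -- Produces the integers 1..@upper
    SELECT TOP (@upper)
        CAST(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS int) AS number
    FROM sys.all_objects AS a
    CROSS JOIN sys.all_objects AS b
);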


How to set up MySQL tables for fast SELECT

The question is about *.FIT files (link to definition) from sports watches, speedometers, etc. (one to extremely many files, with more arriving constantly), each of which always contains a timestamp (1 to n seconds) as well as 1 to n further parameters (which also have either a timestamp or a counter from 1 to x).
To perform data analysis, I need the data in the database to calculate e.g. the heart rates in relation to the altitude over several FIT files / training units / time periods.
Because of the changing number of parameters in a FIT file (depending on the connected devices, the device that created the file, etc.) and the possibility to integrate more/new parameters in the future, my idea was to have a separate table for each parameter instead of writing everything in one big table (which would then have extremely many "empty" cells whenever a parameter is not present in a FIT file).
Basic tables:
1 x tbl_file

id | filename | date
1  | xyz.fit  | 2022-01-01
2  | vwx.fit  | 2022-01-02
.. | ..       | ..
n x tbl_parameter_xy / tbl_parameter_yz / ....

id | timestamp/counter | file_id | value
1  | 0                 | 1       | value
2  | 1                 | 1       | value
3  | 0                 | 2       | value
.. | ..                | ..      | ..
And these parameter tables would then be linked to each other via the file_id as well as to the FIT File.
I then used a test server, set up a MySQL DB to test this, and was shocked:
SELECT * FROM tbl_parameter_xy as x
LEFT JOIN tbl_parameter_yz as y
ON x.file_id = y.file_id
WHERE x.file_id = 999
Took almost 30 seconds to give me the results.
In my parameter tables there are 209918 rows.
file_id 999 consists of 1964 rows.
But my SELECT with the JOIN returns 3857269 rows, so there must be an error somewhere, and that's the reason it takes 30 seconds.
In comparison, fetching from a "large complete" table was done in 0.5 seconds:
SELECT * FROM tbl_all_parameters
WHERE file_id = 999
After some research, I came across INDEX and thought I had the solution.
I created an index on (file_id) for each of the parameter tables, but the result was just as slow or even slower.
Right now I'm thinking about building that big "all in one" table, which makes it easier to handle and faster to select from, but I would have to alter it frequently to add new columns for new parameters. And I'm afraid it will grow so big it kills itself.
I have 2 questions:
Which table setup is recommended, primarily with a focus on SELECT speed and secondarily on the size of the DB?
Do I have a basic bug in my SELECT that makes it so slow?
You're getting a combinatorial explosion in your JOIN. Your result set contains one output row for every pair of input rows in your two parameter tables.
If you say
SELECT * FROM a LEFT JOIN b
with no ON condition at all you get COUNT(a) * COUNT(b) rows in your result set. And you said this
SELECT * FROM a LEFT JOIN b WHERE a.file_id = b.file_id
which gives you a similarly bloated result set.
You need another ON condition... possibly try this.
SELECT *
FROM tbl_parameter_xy as x
LEFT JOIN tbl_parameter_yz as y
ON x.file_id = y.file_id
AND x.timestamp = y.timestamp
if the timestamps in the two tables are somehow in sync.
But, with respect, I don't think you have a very good database design yet.
This is a tricky kind of data for which to create an optimal database layout, because it's extensible.
If you find yourself with a design where you routinely create new tables in production (for example, when adding a new device type) you have almost certainly misdesigned your database.
An approach you might take is creating an attribute / value table. It will have a lot of rows in it, but they'll be short and easy to index.
Your observations will go into a table like this:

file_id       (part of your primary key)
parameter_id  (part of your primary key)
timestamp     (part of your primary key)
value
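A minimal MySQL sketch of such an observation table (the value and timestamp types are assumptions, adjust to your data):
CREATE TABLE observation_table (
    file_id      INT NOT NULL,
    parameter_id INT NOT NULL,
    `timestamp`  INT NOT NULL,     -- or DATETIME, depending on what the FIT file provides
    value        DOUBLE NOT NULL,
    PRIMARY KEY (file_id, parameter_id, `timestamp`)
);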
Then, when you need to, say, retrieve parameters 2 and 3 from a particular file, you would do
SELECT timestamp, parameter_id, value
FROM observation_table
WHERE file_id = xxxx
AND parameter_id IN (2,3)
ORDER BY timestamp, parameter_id
The multicolumn primary key I suggested will optimize this particular query.
Once you have this working, read about denormalization.

SQL - Comparing text(combinations) on 100million table

I have a problem.
I have a table that has around 80-100 million records in it. In that table I have a field that stores from 3 up to 16 different "combinations" (varchar). A combination is a 4-digit number, a colon, and a character (A-E). For example:
'0001:A/0002:A/0005:C/9999:E'. In this case there are 4 different combinations (they can go up to 16). This field is in every row of the table, never a null.
Now the problem: I have to go through the table, find every row, and see if they are similar.
Example rows:
0001:A/0002:A/0003:C/0005:A/0684:A/0699:A/0701:A/0707:A/0709:A/0710:D/0711:C/0712:A/0713:A
0001:A/0002:A/0003:C
0001:A/0002:A/0003:A/0006:C
0701:A/0709:A/0711:C/0712:A/0713:A
As you can see, each of these rows is similar to the others (in some way). What needs to be done here is: when you send '0001:A/0002:A/0003:C' via a program (or as a parameter in SQL), it checks every row and sees whether they have the same "group". Now the catch is that it has to go both ways, it has to be done "quick", and the SQL needs to compare them somehow.
So when you send '0001:A/0002:A/0003:C/0005:A/0684:A/0699:A/0701:A/0707:A/0709:A/0710:D/0711:C/0712:A/0713:A' it has to find all fields where there are 3-16 of the same combinations and return those rows. The 3-16 can be specified via a parameter, but the problem is that you would need to find all possible combinations, because you can send '0002:A/0711:C/0713:A', and as you can see you can send 0002:A as the first parameter.
But you cannot rely on indexing, because a combination can be in any place in the string, and you can send combinations that are not "attached" (there could be a different combination in the middle).
So, sending '0001:A/0002:A/0003:C/0005:A/0684:A/0699:A/0701:A/0707:A/0709:A/0710:D/0711:C/0712:A/0713:A' has to return all fields that have the same 3-16 fields,
and it has to go both ways: if you send "0001:A/0002:A/0003:C" it has to find the row above + similar rows (all that contain all the parameters).
Some things/options I tried:
Doing LIKE for all sent combinations is not practical + too slow
Giving the field a full-text index isn't an option (don't know why exactly)
One of the few things that could work would be making some "hash" type of encoding for the fields, calculating it via a program, and searching for all rows with the same "hash" (I don't know how you would do that, given that a hash would produce different values for similar texts; maybe some hash written exactly for this purpose)
Making a new field, calculating/writing (can be done on insert) all possible combinations and checking via SQL/a program whether they have the same % of combinations, but I don't know how you can store 10080 combinations (in the case of 16) in a "varchar" effectively, or via some hash code, and then know which of them are similar.
There is another catch: this table is in use almost 24/7, and generating combinations to compare in SQL is too slow because the table is too big. It could be done via a program or something, but I don't have any clue how you could store this in a new column in such a way that you would somehow know they are the same. One possibility is to calculate the combinations on each row insert, store them via some hash code or something, calculate the "hash" via a program, and check the table like:
SELECT * FROM TABLE WHERE ROW = "a346adsad"
where the parameter would be sent via program.
This script would need to be executed really fast, under 1 minute, because there could be new inserts into the table, that you would need to check.
The whole point of this would be to see if there are any similar combinations in SQL already and blocking any new combination that would be "similar" for inserting.
I have been dealing with this problem for 3 days now without finding a possible solution; the thing that came closest was some kind of insert/hash approach, but I don't know how that could work.
Thank you in advance for any possible help, or if this is even possible!
it checks every row and sees whether they have the same "group".
IMHO if the group is a basic element of your data structure, your database structure is flawed: it should have each group in its own cell to be normalized. The structure you described makes it clear that you store a composite value in the field.
I'd tear up the table into 3:
one for the "header" information of the group sequences
one for the groups themselves
a connecting table between the two
Something along these lines:
CREATE TABLE GRP_SEQUENCE_HEADER (
ID BIGINT PRIMARY KEY,
DESCRIPTION TEXT
);
CREATE TABLE GRP (
ID BIGINT PRIMARY KEY,
GROUP_TXT CHAR(6)
);
CREATE TABLE GRP_GRP_SEQUENCE_HEADER (
GROUP_ID BIGINT,
GROUP_SEQUENCE_HEADER_ID BIGINT,
GROUP_SEQUENCE_HEADER_ORDER INT, /* For storing the order in the sequence */
PRIMARY KEY(GROUP_ID, GROUP_SEQUENCE_HEADER_ID)
);
(of course, add the foreign keys, and most importantly the indexes necessary)
Then you only have to break up the input into groups, and execute a simple query on a properly indexed table.
Also, you would probably save on the disk space too by not storing duplicates...
A sample query for finding the "similar" sequences' IDs:
SELECT ggsh.GROUP_SEQUENCE_HEADER_ID, COUNT(1)
FROM GRP_GRP_SEQUENCE_HEADER ggsh
JOIN GRP g ON ggsh.GROUP_ID = g.ID
WHERE g.GROUP_TXT IN (<groups to check for from the sequence>)
GROUP BY ggsh.GROUP_SEQUENCE_HEADER_ID
HAVING COUNT(1) BETWEEN 3 AND 16 --lower and upper boundaries
This returns all the header IDs that the current sequence is similar to.
EDIT
Rethinking it a bit more, you could even break up the group into the two parts, but as I seem to understand, you always have full groups to deal with, so it doesn't seem to be necessary.
EDIT2 If you want to speed the process up even more, I'd recommend translating the groups into numeric data using a bijection. For example, evaluate the first 4 digits as an integer, shift it by 4 bits to the left (multiply by 16, but quicker), and add the hex value of the character in the last place.
Examples:
0001:A --> 1 as integer, A is 10, so 1*16+10 = 26
...
0002:B --> 2 as integer, B is 11, so 2*16+11 = 43
...
0343:D --> 343 as integer, D is 13, so 343*16+13 = 5501
...
9999:E --> 9999 as integer, E is 14, so 9999*16+14 = 159998 (max value, if I understood correctly)
Numerical values are handled more efficiently by the DB, so this should result in an even better performance - of course with the new structure.
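As a purely illustrative sketch (SQL Server style string functions; use your dialect's equivalents), the encoding could be computed as:
-- 'NNNN:X' -> NNNN * 16 + (10..14 for A..E)
SELECT GROUP_TXT,
       CAST(LEFT(GROUP_TXT, 4) AS int) * 16
           + (ASCII(RIGHT(GROUP_TXT, 1)) - ASCII('A') + 10) AS GROUP_CODE
FROM GRP;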
So basically you want to execute a complex string manipulation on 80-100 million rows in less than a minute! Ha, ha, good one!
Oh wait, you're serious.
You cannot hope to do these searches on the fly. Read Joel Spolsky's piece on getting Back to Basics to understand why.
What you need to do is hive off those 80-100 million strings into their own table, broken up into those discrete tokens, i.e. '0001:A/0002:A/0003:C' is broken up into three records (perhaps of two columns - you're a bit vague about the relationship between the numeric and alphabetic components of the tokens). Those records can be indexed.
Then it is simply a matter of tokenizing the search strings and doing a select joining the search tokens to the new table. Not sure how well it will perform: that rather depends on how many distinct tokens you have.
As people have commented, you would benefit immensely from normalizing your data, but can you not cheat and create a temp table with the key, exploding out your column on the "/", so you go from
KEY | "0001:A/0002:A/0003:A/0006:C"
KEY1| "0001:A/0002:A/0003:A"
to
KEY | 0001:A
KEY | 0002:A
KEY | 0003:A
KEY | 0006:C
KEY1| 0001:A
KEY1| 0002:A
KEY1| 0003:A
Which would allow you to develop a query something like the following (not tested):
SELECT
    t1.key
    , t2.key
    , COUNT(*)
FROM
    temp_table t1
    INNER JOIN temp_table t2
        ON t2.combination = t1.combination
        AND t2.key <> t1.key
    INNER JOIN ( SELECT t3.key, COUNT(*) AS cnt FROM temp_table t3 GROUP BY t3.key ) t4
        ON t4.key = t1.key
GROUP BY
    t1.key
    , t2.key
    , t4.cnt
HAVING
    COUNT(*) = t4.cnt
So return the two keys where key1 is a proper subset of key?
I guess I can recommend building a special "index".
It will be quite big but you will achieve superspeedy results.
Let's consider this task as searching a set of symbols.
The design constraints are these:
The symbols are made by the pattern "NNNN:X", where NNNN is a number [0001-9999] and X is a letter [A-E].
So we have 5 * 9999 = 49995 symbols in alphabet.
Maximum length of words with this alphabet is 16.
We can build for each word set of combinations of its symbols.
For example, the word "abcd" will have the following combinations:
abcd
abc
ab
a
abd
acd
ac
ad
bcd
bc
b
bd
cd
c
d
As symbols are sorted in words we have only 2^N-1 combinations (15 for 4 symbols).
For 16-symbols word there are 2^16 - 1 = 65535 combinations.
So we make for an additional index-organized table like this one
create table spec_ndx(combination varchar2(100), original_value varchar2(100))
Performance will be excellent with price of overhead - in the worst case for each record in the original table there will be 65535 "index" records.
So for a 100-million-row table we will get a 6-trillion-row table.
But if we have short values, the size of the "special index" reduces drastically.
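With such an index table in place, a lookup becomes a single indexed equality search, for example (values illustrative, assuming the stored combinations keep their symbols sorted):
SELECT original_value
FROM spec_ndx
WHERE combination = '0001:A/0002:A/0003:C';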

Slow compound mysql update query

We're doing an update query between two database tables and it is ridiculously slow. As in: it would take 30 days to perform the query.
One table, lab.list, contains about 940,000 records, the other, mind.list about 3,700,000 (3.7 million)
The update sets a field when two BETWEEN conditions are met. This is the query:
UPDATE lab.list L , mind.list M SET L.locId = M.locId WHERE L.longip BETWEEN M.startIpNum AND M.endIpNum AND L.date BETWEEN "20100301" AND "20100401" AND L.locId = 0
As it is now, the query is performing with about 1 update every 8 seconds.
We also tried it with the mind.list table in the same database, but that doesn't matter for the query time.
UPDATE lab.list L, lab.mind M SET L.locId = M.locId WHERE longip BETWEEN M.startIpNum AND M.endIpNum AND date BETWEEN "20100301" AND "20100401" AND L.locId = 0;
Is there a way to speed up this query? Basically IMHO it should make two subsets of the databases:
mind.list.longip BETWEEN M.startIpNum AND M.endIpNum
lab.list.date BETWEEN "20100301" AND "20100401"
and then update the values for these subsets. Somewhere along the line I think I made a mistake, but where? Maybe there is a faster query possible?
We tried log_slow_queries, but that shows that it is indeed examining 100s of millions of rows, probably going up all the way to 3331 gigarows.
Tech info:
Server version: 5.5.22-0ubuntu1-log (Ubuntu)
lab.list has indexes on locId, longip, date
lab.mind has indexes on locId, startIpNum and endIpNum
hardware: 2x xeon 3.4 GHz, 4GB RAM, 128 GB SSD (so that should not be a problem!)
I would first of all try to index mind on startIpNum, endIpNum, locId in this order. locId is not used in SELECTing from mind, even if it is used for the update.
For the same reason I'd index lab on locId, date and longip (which isn't used in the first chunking, which should run on date), in this order.
Then what kind of datatype is assigned to startIpNum and endIpNum? For IPv4, it's best to convert to INTEGER and use INET_ATON and INET_NTOA for user I/O. I assume you already did this.
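For example:
SELECT INET_ATON('192.168.1.10');  -- 3232235786
SELECT INET_NTOA(3232235786);      -- '192.168.1.10'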
To run the update, you might try to segment the M database using temporary tables (a rough sketch follows this list). That is:
* select all records of lab in the given range of dates with locId = 0 into a temporary table TABLE1.
* run an analysis on TABLE1 grouping IP addresses by their first N bits (using AND with a suitable mask: 0x80000000, 0xC0000000, ... 0xF8000000... and so on), until you find that you have divided them into a "suitable" number of IP "families". These will, by and large, match with startIpNum (but that's not strictly necessary).
* say that you have divided in 1000 families of IP.
* For each family:
* select those IPs from TABLE1 to TABLE3.
* select the IPs matching that family from mind to TABLE2.
* run the update of the matching records between TABLE3 and TABLE2. This should take place in about one hundred thousandth of the time of the big query.
* copy-update TABLE3 into lab, discard TABLE3 and TABLE2.
* Repeat with next "family".
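A rough MySQL sketch of one such family pass (mask, family value and temporary-table flow are only illustrative):
-- TABLE1: the date-limited, locId = 0 subset of lab.list
CREATE TEMPORARY TABLE TABLE1 AS
    SELECT * FROM lab.list
    WHERE `date` BETWEEN '20100301' AND '20100401' AND locId = 0;

-- TABLE3: one IP "family" from TABLE1 (here: top 8 bits = 10)
CREATE TEMPORARY TABLE TABLE3 AS
    SELECT * FROM TABLE1 WHERE longip & 0xFF000000 = 0x0A000000;

-- TABLE2: the matching family from mind
CREATE TEMPORARY TABLE TABLE2 AS
    SELECT * FROM mind.list WHERE startIpNum & 0xFF000000 = 0x0A000000;

-- the small per-family update
UPDATE TABLE3 T3
JOIN TABLE2 T2 ON T3.longip BETWEEN T2.startIpNum AND T2.endIpNum
SET T3.locId = T2.locId;

-- then copy-update TABLE3 back into lab.list and repeat for the next family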
It is not really ideal, but if the slightly improved indexing does not help, I really don't see all that many options.
In the end, the query was too big or cumbersome for MySQL to handle, even after indexing. Testing the same query with the same data on a high-end Sybase server also took 3 hours.
So we abandoned the "do it all on the database server" idea and went back to scripting languages.
We did the following in python:
load a chunk of 100000 records of the 3.7 million records, and loop over the rows
for each row, set the locId and fill in the rest of the columns
All these updates together take about 5 minutes, so a huge improvement!
Conclusion:
think outside of the database box!

Query optimizing suggestion needed

I have written a script which takes about 15 hours to execute. I need some query optimization techniques or suggestions to make this script run as fast as possible...
If anyone could help, please take a look at the script:
declare @max_date date
declare @client_bp_id int

Select @max_date = MAX(tran_date) from All_Share_Txn

DELETE FROM Client_Share_Balance

DECLARE All_Client_Bp_Id CURSOR FOR
SELECT Bp_id FROM Client --Take All Client's BPID

OPEN All_Client_Bp_Id
FETCH NEXT FROM All_Client_Bp_Id
INTO @client_bp_id

WHILE @@FETCH_STATUS = 0
BEGIN
    Insert Client_Share_Balance(Bp_id,Instrument_Id,Quantity_Total,Quantity_Matured,Quantity_Pledge,AVG_Cost,Updated_At,Created_At,Company_Id,Created_By,Updated_By)
    select @client_bp_id,Instrument_Id,
    sum(case when Is_buy='True' then Quantity when Is_buy='False' then -quantity end), --as Total Quantity
    sum(case when Mature_Date_Share <= @max_date then (case Is_buy when '1' then quantity when '0' then -quantity end) else 0 end), --as Free Qty
    ISnull((select sum(case pu.IsBuy when '1' then -pu.quantity else pu.quantity end) from
        (Select * from Pledge UNION Select * from Unpledge) pu
        where pu.Client_Bp_id=@client_bp_id and pu.Instrument_Id=t1.Instrument_Id and pu.Txn_Date<=@max_date
        group by pu.Client_Bp_id,pu.Instrument_Id),0), -- as Pledge_Quantity
    dbo.Avg_Cost(@client_bp_id,Instrument_Id), --as Avg_rate
    GETDATE(),GETDATE(),309,1,1
    from All_Share_Txn t1
    where Client_Bp_id=@client_bp_id and Instrument_Id is not null
    group by Instrument_Id
    having sum(case Is_buy when '1' then quantity when '0' then -quantity end) <> 0
        or sum(case when Mature_Date_Share <= @max_date then (case Is_buy when '1' then quantity when '0' then -quantity end) else 0 end) <> 0

    FETCH NEXT FROM All_Client_Bp_Id
    INTO @client_bp_id
END

CLOSE All_Client_Bp_Id
DEALLOCATE All_Client_Bp_Id
I just need to verify whether the code could be written more efficiently.
Replace * with your column names in Select * from Pledge. It should be like
Select Instrument_Id from Pledge
Exclude the usage of Cursor.
If you have unique records in the Pledge and Unpledge tables, UNION ALL should be used, as it is faster compared with UNION.
Insert the records of All_Share_Txn into a local temporary table.
Create another local temporary table which will hold the 'Total Quantity' information along with the Instrument_Id column. Now evaluate the CASE-based condition and insert the records for the total quantity into this table. Please note: when extracting the information for this, use the All_Share_Txn local temporary table created above.
Create another local temporary table which will hold the 'Free Qty' information along with the Instrument_Id column. Now evaluate the CASE-based condition and insert the records for the free quantity into this table. Again, use the All_Share_Txn local temporary table created above.
Create another local temporary table which will hold the 'Pledge_Quantity' information along with the Instrument_Id column. Now evaluate the CASE-based condition and insert the records for the pledge quantity into this table. Again, use the All_Share_Txn local temporary table created above.
Create another local temporary table which will hold the 'Avg_rate' information along with the Instrument_Id column. Now evaluate the CASE-based condition and insert the records for the average rate into this table. Again, use the All_Share_Txn local temporary table created above.
Now, with the help of joins among the temporary tables created in the steps above, you can instantly get the result set.
If I understand your code, the cursor is the bottleneck. So I would skip the cursor and do something like this:
Insert Client_Share_Balance(Bp_id,Instrument_Id..)
select Client_Bp_id,
......
from All_Share_Txn t1
where EXISTS(SELECT NULL FROM Client WHERE Client.Bp_id = t1.Client_Bp_id)
and Instrument_Id is not null
group by Instrument_Id,Client_Bp_id
.......
Unless you care that you are reading only COMMITTED data, you can tell SQL Server to look at the data as it is, without holding any locks on the objects... basically the same behavior as WITH (NOLOCK), but Microsoft recommends not using table hints and letting SQL Server decide on the best locking method to use. Without "worrying" about whether data is committed or not, this greatly speeds up fetching data.
Add this to the top of your query
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
--Set at the connection level, not a one time thing
GO
See this link TRANSACTION ISOLATION LEVEL
Next, please check that your unique ID columns have clustered indexes on them. If they just have unique constraints, that still leaves you with a heap rather than a clustered table; heaps are basically a scattered mess. (You probably know this stuff, just mentioning it.) Next, make sure you have non-clustered indexes for all of the columns you use in WHERE, ORDER BY, GROUP BY and HAVING... and include columns that are returned frequently; this consumes more disk space, by the way. Use compressed indexes and tables if you are using Enterprise Edition or higher.
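As a purely illustrative sketch (the column choices are assumptions based on the script above), a covering non-clustered index might look like:
CREATE NONCLUSTERED INDEX IX_All_Share_Txn_Client_Instrument
ON dbo.All_Share_Txn (Client_Bp_id, Instrument_Id)
INCLUDE (Is_buy, Quantity, Mature_Date_Share);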
I would run UPDATE STATISTICS WITH FULLSCAN on the tables to give SQL Server the latest and greatest statistics. Rebuilding the clustered index will do this for you, by the way, and you can also restrict it to updating only certain statistics to help speed this process up, since it can take a while if your tables have millions of rows.
The biggest performance hit you are taking is the fact that you are grouping on aggregated results. SQL Server is going to be scanning, using work tables, sorting... everything BUT seeking indexes, all because there is no index that can help your GROUP BY, HAVING, etc.
Sometimes we have no choice in this matter, but there are tricks, like creating a #temporary table (or table variable, and yes, you can create indexes on both) and populating it with pre-calculated results, making sure that this #temporary table can be joined on. When you use ORDER BY, GROUP BY or HAVING on aggregates like SUM or on scalar-valued user-defined functions rather than on plain columns, it's going to be slow - depending on what you define as slow :) but you did state that you think it should be faster.
Some basic settings to look at for any instance of SQL Server:
tempdb should have the same number of files as you have cores, all the same size and all the same growth rate, specified in MB, not %. I would only go up to 8 files even if you have more than 8 cores. Restart the instance after making changes; you don't HAVE to, but I recommend it.
Example: 12 cores.
tempdb.mdf size= 1024MB growby= 256MB
tempdb2.ndf size= 1024MB growby= 256MB
(etc)
tempdb8.ndf size= 1024MB growby= 256MB
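In T-SQL, adding one such file might look like this (path and name are illustrative):
ALTER DATABASE tempdb
ADD FILE (NAME = tempdb2, FILENAME = 'T:\TempDB\tempdb2.ndf',
          SIZE = 1024MB, FILEGROWTH = 256MB);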
Same with your database. If you need to add more files, then do so with the recommended size settings and rebuild all of your clustered indexes; this will spread the data across the files, since it rebuilds the physical structure of the data.
Don't let SQL Server take up all the memory! Set its limit to total_avail_phys_memory minus 2GB (leave the O/S a good amount of memory).
Don't just use the PRIMARY filegroup; separate your data and your indexes into their own filegroups. If you can, put the indexes on their own RAID 10 drives and the data on RAID 5 or 6.
Make sure you're giving your SQL Server user databases the care they need with scheduled maintenance, either with a maintenance plan or by rolling your own scripts.
Data on RAID 5, Logs on RAID 10, Tempdb on RAID 10 - each LUN (drive letter) should have dedicated spindles (drives)
I hope these suggestions are helpful, if anything they should be helpful for the overall performance of the instance.

randomizing large dataset

I am trying to find a way to get a random selection from a large dataset.
We expect the set to grow to ~500K records, so it is important to find a way that keeps performing well while the set grows.
I tried a technique from: http://forums.mysql.com/read.php?24,163940,262235#msg-262235 But it's not exactly random and it doesn't play well with a LIMIT clause: you don't always get the number of records that you want.
So I thought, since the PK is auto_increment, I'd just generate a list of random IDs and use an IN clause to select the rows I want. The problem with that approach is that sometimes I need a random set of data with records having a specific status, a status that is found in at most 5% of the total set. To make that work I would first need to find out which IDs with that specific status I can use, so that's not going to work either.
I am using mysql 5.1.46, MyISAM storage engine.
It might be important to know that the query to select the random rows is going to be run very often and the table it is selecting from is appended to frequently.
Any help would be greatly appreciated!
You could solve this with some denormalization:
Build a secondary table that contains the same pkeys and statuses as your data table
Add and populate a status group column which will be a kind of sub-pkey that you auto number yourself (1-based autoincrement relative to a single status)
Pkey Status StatusPkey
1 A 1
2 A 2
3 B 1
4 B 2
5 C 1
... C ...
n C m (where m = # of C statuses)
When you don't need to filter, you can generate random numbers on the pkey as you mentioned above. When you do need to filter, generate them against the StatusPkeys of the particular status you're interested in.
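A rough MySQL sketch of that filtered pick (assuming the secondary table is called status_index and the data table data_table; names are illustrative):
-- Pick a random StatusPkey within status 'C', then fetch the matching row
SELECT d.*
FROM status_index s
JOIN data_table d ON d.id = s.Pkey
WHERE s.Status = 'C'
  AND s.StatusPkey = (
      SELECT FLOOR(1 + RAND() * MAX(StatusPkey))
      FROM status_index
      WHERE Status = 'C'
  );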
There are several ways to build this table. You could have a procedure that you run on an interval, or you could do it live. The latter would be a performance hit though, since calculating the StatusPkey could get expensive.
Check out this article by Jan Kneschke... It does a great job at explaining the pros and cons of different approaches to this problem...
You can do this efficiently, but you have to do it in two queries.
First get a random offset scaled by the number of rows that match your 5% conditions:
SELECT ROUND(RAND() * (SELECT COUNT(*) FROM MyTable WHERE ...conditions...))
This returns an integer. Next, use the integer as an offset in a LIMIT expression:
SELECT * FROM MyTable WHERE ...conditions... LIMIT 1 OFFSET ?
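For example, in MySQL the two steps could be stitched together with a session variable and a prepared statement (condition and names are illustrative; FLOOR keeps the offset below the row count):
SET @off = (SELECT CAST(FLOOR(RAND() * COUNT(*)) AS SIGNED) FROM MyTable WHERE status = 'X');

PREPARE pick FROM 'SELECT * FROM MyTable WHERE status = ''X'' LIMIT 1 OFFSET ?';
EXECUTE pick USING @off;
DEALLOCATE PREPARE pick;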
Not every problem must be solved in a single SQL query.