I'm processing data from a MySQL table where each row has a UUID associated with it. EDIT: the "UUID" is in fact an MD5 hash (VARCHAR) of the job text.
My select query looks something like:
SELECT * FROM jobs ORDER BY priority DESC LIMIT 1
I am only running one worker node right now, but would like to scale it out to several nodes without altering my schema.
The issue is that the jobs take some time, and scaling out beyond one right now would introduce a race condition where several nodes are working on the same job before it completes and the row is updated.
Is there an elegant way to effectively "shard" the data on the client-side, by specifying some modifier config value per worker node? My first thought was to use the MOD function like this:
SELECT * FROM jobs WHERE UUID MOD 2 = 0 ORDER BY priority DESC LIMIT 1
and SELECT * FROM jobs WHERE UUID MOD 2 = 1 ORDER BY priority DESC LIMIT 1
In this case I would have two workers configured as "0" and "1". But this isn't giving me an even distribution (not sure why) and feels clunky. Is there a better way?
The problem is you're storing the ID as a hex string like acbd18db4cc2f85cedef654fccc4a4d8. MySQL will not convert the hex for you. Instead, if it starts with a letter you get 0. If it starts with a number, you get the starting numbers.
select '123abc' + 0; -- 123
select 'abc123' + 0; -- 0
6 out of 16 will start with a letter, so they will all be 0, and 0 mod anything is 0. The remaining 10 of 16 will start with some number and so will be distributed properly: 5 of 16 will be 0, 5 of 16 will be 1. 6/16 + 5/16 = 11/16, about 69%, will be 0, which is very close to your observed 72%.
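The 11/16 estimate can be checked with a small simulation. This is only a sketch in Python, not MySQL itself: `mysql_leading_int` is a hypothetical helper that models just the leading-digits rule described above (MySQL's real string-to-number conversion has more cases), and the job texts are made up.

```python
import hashlib
import re

def mysql_leading_int(s: str) -> int:
    """Model of the rule above: keep the leading digits, otherwise 0."""
    m = re.match(r"\d+", s)
    return int(m.group()) if m else 0

def fraction_in_shard_zero(n_jobs: int = 20000) -> float:
    """Share of MD5 hex digests that would land in 'UUID MOD 2 = 0'."""
    zeros = 0
    for i in range(n_jobs):
        digest = hashlib.md5(f"job {i}".encode()).hexdigest()
        if mysql_leading_int(digest) % 2 == 0:
            zeros += 1
    return zeros / n_jobs

share = fraction_in_shard_zero()
print(f"share with MOD 2 = 0: {share:.3f} (predicted 11/16 = {11 / 16:.3f})")
```

The printed share should sit close to 0.69, which is the skew the question observed.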
To do this right we need to convert the 128-bit hex string into a 64-bit unsigned integer.
Slice off 64 bits with either left(uuid, 16) or right(uuid, 16).
Convert the hex (base 16) into decimal (base 10) using conv.
Cast the result to an unsigned bigint. If we skip this step MySQL appears to use a float, which loses accuracy.
select cast(conv(right(uuid, 16), 16, 10) as unsigned) mod 2
Beautiful.
That will only use 64 bits of the 128 bit checksum, but for this purpose that should be fine.
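For comparison, the same computation can be sketched outside the database. This is a Python illustration under assumptions: `shard_for` is a hypothetical helper name, the job texts are invented, and three workers are used only to show that the modulus is not limited to 2.

```python
import hashlib
from collections import Counter

def shard_for(md5_hex: str, n_workers: int) -> int:
    """Mirror of cast(conv(right(uuid, 16), 16, 10) as unsigned) mod n."""
    # The last 16 hex digits are the low 64 bits of the checksum.
    return int(md5_hex[-16:], 16) % n_workers

counts = Counter(
    shard_for(hashlib.md5(f"job {i}".encode()).hexdigest(), 3)
    for i in range(30000)
)
print(sorted(counts.items()))
```

Each of the three shards should receive roughly a third of the 30000 simulated jobs.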
Note this technique works with an MD5 checksum because it is pseudorandom. It will not work with the default MySQL uuid() function which is a UUID version 1. UUIDv1 is a timestamp + a fixed ID and will always mod the same.
UUIDv4, which is a random number, will work.
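The difference between the two UUID versions can be seen in a short Python sketch. It uses Python's uuid module rather than MySQL's uuid(), and the node and clock sequence values are made up to stand in for a single machine.

```python
import uuid

# Hypothetical fixed values standing in for one server's identity.
NODE = 0x0242AC110002
CLOCK_SEQ = 0x1234

# Last 16 hex digits of 1000 version 1 and 1000 version 4 UUIDs.
v1_tails = {uuid.uuid1(node=NODE, clock_seq=CLOCK_SEQ).hex[-16:] for _ in range(1000)}
v4_tails = {uuid.uuid4().hex[-16:] for _ in range(1000)}

print(len(v1_tails))  # 1: only the timestamp part of a v1 UUID changes
print(len(v4_tails))
```

Because every version 1 tail is identical, taking it mod N would send every row to the same worker, while the version 4 tails vary.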
Convert the hex string to decimal before modding:
where CONV(substring(uuid, 1, 8), 16, 10) mod 2 = 1
A reasonable hashing function should distribute evenly enough for this purpose.
Use substring to convert only a small part of the hash, so that conv stays well within its numeric range and does not misbehave. Any subset of the bits should also be well distributed.
I need to load several similar csv files into work tables with the same format for onward processing but for some of the data I get 'ORA-00984: column not allowed here' errors.
I can't change the layout of the csv but the ordering of the columns in the work table and the format of the sqlldr control file are in my control.
What do I need to change to get sqlldr to load this data?
EDIT: Solution: The following change to the .ctl file:
col6_fixedchar constant "abc", fixes the issue. Interestingly, sqlldr is quite happy to interpret "3600" as a number.
Below is a sample:
table:
create table test_sqlldr
(
col1_date date,
col2_char varchar2(15),
col3_int number(5),
col4_int number(5),
col5_int number(5),
-- fixed and dummy fields
col6_fixedchar varchar2(15),
col7_nullchar varchar2(20),
col8_fixedint number(5)
);
csv:
cat /tmp/test_sqlldr.csv
2019-08-27 09:00:00,abcdefghi,3600,0,0
2019-08-27 09:00:00,jklmnopqr,3600,0,0
2019-08-27 09:00:00,stuvwxyza,3600,3598,3598
2019-08-27 09:00:00,bcdefghij,3600,0,0
ctl:
cat /tmp/test_sqlldr.ctl
load data infile '/tmp/test_sqlldr.csv'
insert into table test_sqlldr
fields terminated by ',' optionally enclosed by '"' TRAILING NULLCOLS
(
col1_date timestamp 'yyyy-mm-dd hh24:mi:ss',
col2_char,
col3_int,
col4_int,
col5_int,
col6_fixedchar "abc",
col8_fixedint "3600"
)
This generates the following output:
/opt/oracle/product/112020_cl_64/cl/bin/sqlldr <db credentials> control='/tmp/test_sqlldr.ctl' ; cat test_sqlldr.log
SQL*Loader: Release 12.2.0.1.0 - Production on Wed Aug 28 10:26:00 2019
Copyright (c) 1982, 2017, Oracle and/or its affiliates. All rights reserved.
Path used: Conventional
Commit point reached - logical record count 4
Table TEST_SQLLDR:
0 Rows successfully loaded.
Check the log file:
test_sqlldr.log
for more information about the load.
SQL*Loader: Release 12.2.0.1.0 - Production on Wed Aug 28 10:26:00 2019
Copyright (c) 1982, 2017, Oracle and/or its affiliates. All rights reserved.
Control File: /tmp/test_sqlldr.ctl
Data File: /tmp/test_sqlldr.csv
Bad File: /tmp/test_sqlldr.bad
Discard File: none specified
(Allow all discards)
Number to load: ALL
Number to skip: 0
Errors allowed: 50
Bind array: 64 rows, maximum of 256000 bytes
Continuation: none specified
Path used: Conventional
Table TEST_SQLLDR, loaded from every logical record.
Insert option in effect for this table: INSERT
TRAILING NULLCOLS option in effect
Column Name Position Len Term Encl Datatype
------------------------------ ---------- ----- ---- ---- ---------------------
COL1_DATE FIRST * , O(") DATETIME yyyy-mm-dd hh24:mi:ss
COL2_CHAR NEXT * , O(") CHARACTER
COL3_INT NEXT * , O(") CHARACTER
COL4_INT NEXT * , O(") CHARACTER
COL5_INT NEXT * , O(") CHARACTER
COL6_FIXEDCHAR NEXT * , O(") CHARACTER
SQL string for column : "abc"
COL8_FIXEDINT NEXT * , O(") CHARACTER
SQL string for column : "3600"
Record 1: Rejected - Error on table TEST_SQLLDR, column COL4_INT.
ORA-00984: column not allowed here
Record 2: Rejected - Error on table TEST_SQLLDR, column COL4_INT.
ORA-00984: column not allowed here
Record 3: Rejected - Error on table TEST_SQLLDR, column COL4_INT.
ORA-00984: column not allowed here
Record 4: Rejected - Error on table TEST_SQLLDR, column COL4_INT.
ORA-00984: column not allowed here
Table TEST_SQLLDR:
0 Rows successfully loaded.
4 Rows not loaded due to data errors.
0 Rows not loaded because all WHEN clauses were failed.
0 Rows not loaded because all fields were null.
Space allocated for bind array: 115584 bytes(64 rows)
Read buffer bytes: 1048576
Total logical records skipped: 0
Total logical records read: 4
Total logical records rejected: 4
Total logical records discarded: 0
Run began on Wed Aug 28 10:26:00 2019
Run ended on Wed Aug 28 10:26:00 2019
Elapsed time was: 00:00:00.14
CPU time was: 00:00:00.03
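Try: col6_fixedchar CONSTANT "abc"
Without CONSTANT, sqlldr treats the double-quoted text as a SQL string for the column (the log shows SQL string for column : "abc"), so abc is evaluated as a SQL expression rather than a literal, which is what raises ORA-00984. 3600 happens to be a valid numeric literal, which is why that line is accepted.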
I run a MySQL query and save all the objects it returns. I would now like to flag those objects according to an extra condition: if an object satisfies the condition, a new column should hold 1 in that object's row, and 0 if it does not.
This is my script:
SELECT
s.id, al.ID, al.j, al.k, al.r, gal.i
FROM
datas as al
WHERE
AND s.id = al.ID
AND al.j between 1 and 1
AND al.k BETWEEN 15 and 16
AND al.r BETWEEN 67 and 72
The script above works and I can save all the objects it returns.
I'd like to know whether there is a way to add the following condition to the WHERE block of the query above,
( Flags & (dbo.environment('cool') +
dbo.environment('ok') -
dbo.environment('source')) ) = 25
and ((al_pp x al_pp1)-0.5/3=11
and mark the objects that do or do not satisfy this condition with 1 or 0 in a new column of the saved table.
I read some tutorials about this and saw some attempts with IF, CASE, ADD COLUMN or WHEN, but none of them solved it.
Thanks in advance
MySQL has an IF() function, so you can use it directly in your query:
SELECT IF(( Flags & (dbo.fPhotoFlags('SATURATED') +
dbo.fPhotoFlags('BRIGHT') +
dbo.fPhotoFlags('EDGE')) ) = 0
and petroRad_r < 18
and ((colc_u - colc_g) - (psfMag_u - psfMag_g)) < -0.4
, 1 -- value if true
, 0 -- value if false
) as conditional_column, ... rest of your query
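The idea is to move the condition out of WHERE and into the select list, so that rows failing it are kept and marked 0. Here is a small runnable sketch of that pattern. It uses Python with an in-memory SQLite database instead of MySQL, swaps in the portable CASE expression for IF(), and the table contents are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical cut-down version of the "datas" table from the question.
conn.execute("CREATE TABLE datas (id INTEGER, j INTEGER, r REAL)")
conn.executemany(
    "INSERT INTO datas VALUES (?, ?, ?)",
    [(1, 1, 67.5), (2, 1, 80.0), (3, 1, 70.0)],
)

rows = conn.execute(
    """
    SELECT id,
           CASE WHEN r BETWEEN 67 AND 72 THEN 1 ELSE 0 END AS conditional_column
    FROM datas
    WHERE j = 1
    ORDER BY id
    """
).fetchall()
print(rows)  # [(1, 1), (2, 0), (3, 1)]
```

Row 2 fails the range test but still appears in the result, flagged with 0.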
I was able to use BULK INSERT on a SQL Server 2008 R2 database to import a tab-delimited CSV file with more than 2 million rows. This command is planned to run every week.
I added an additional column named "lastupdateddate" to the generated table to store the timestamp at which a row is updated via an INSERT trigger. But when I ran the BULK INSERT again, it failed because of a column mismatch, as there is no such field in the raw CSV file.
Is there any way to configure BULK INSERT to ignore the "lastupdateddate" column when it runs?
Thanks.
-- EDIT:
I tried using a format file but am still unable to solve the problem.
The table looks as below.
USE AdventureWorks2008R2;
GO
CREATE TABLE AAA_Test_Table
(
Col1 smallint,
Col2 nvarchar(50) ,
Col3 nvarchar(50) ,
LastUpdatedDate datetime
);
GO
The csv "data.txt" file is:
1,DataField2,DataField3
2,DataField2,DataField3
3,DataField2,DataField3
The format file is like:
10.0
3
1 SQLCHAR 0 7 "," 1 Col1 ""
2 SQLCHAR 0 100 "," 2 Col2 SQL_Latin1_General_CP1_CI_AS
3 SQLCHAR 0 100 "," 3 Col3 SQL_Latin1_General_CP1_CI_AS
The SQL command I ran is:
DELETE AAA_Test_Table
BULK INSERT AAA_Test_Table
FROM 'C:\Windows\Temp\TestFormatFile\data.txt'
WITH (formatfile='C:\Windows\Temp\TestFormatFile\formatfile.fmt');
GO
The error received is:
Msg 4864, Level 16, State 1, Line 2
Bulk load data conversion error (type mismatch or invalid character for the specified codepage) for row 2, column 1 (Col1).
Msg 4832, Level 16, State 1, Line 2
Bulk load: An unexpected end of file was encountered in the data file.
Msg 7399, Level 16, State 1, Line 2
The OLE DB provider "BULK" for linked server "(null)" reported an error. The provider did not give any information about the error.
Msg 7330, Level 16, State 2, Line 2
Cannot fetch a row from OLE DB provider "BULK" for linked server "(null)".
Yes, you can, by using a format file (see the SQL Server documentation on format files) and passing it to the bcp command with the -f option, for example -f format_file_name.fmt.
Another option would be to import all the data (I mean all fields) and then drop the unwanted lastupdateddate column using SQL like
ALTER TABLE your_bulk_insert_table DROP COLUMN lastupdateddate
I am having a performance issue. I am trying to select from a table based on a very long list of parameters.
Currently I am using this stored proc:
CREATE PROC [dbo].[GetFileContentsFromTitles]
@MyTitles varchar(max)
AS
SELECT [Title], [Sequence] From [dbo].[MasterSequence]
WHERE charindex(',' + Title + ',', ',' + @MyTitles + ',') > 0;
Where @MyTitles can be very long (currently a string with 4000 entries separated by commas). Any suggestions? Thanks
OK, if you want performance for something like this, then you need to use the best stuff out there. First, create this function for splitting strings (which I got from Jeff Moden about two weeks ago):
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE FUNCTION [dbo].[DelimitedSplit8K]
/**********************************************************************************************************************
Purpose:
Split a given string at a given delimiter and return a list of the split elements (items).
Notes:
1. Leading and trailing delimiters are treated as if an empty string element were present.
2. Consecutive delimiters are treated as if an empty string element were present between them.
3. Except when spaces are used as a delimiter, all spaces present in each element are preserved.
Returns:
iTVF containing the following:
ItemNumber = Element position of Item as a BIGINT (not converted to INT to eliminate a CAST)
Item = Element value as a VARCHAR(8000)
Statistics on this function may be found at the following URL:
http://www.sqlservercentral.com/Forums/Topic1101315-203-4.aspx
CROSS APPLY Usage Examples and Tests:
--=====================================================================================================================
-- TEST 1:
-- This tests for various possible conditions in a string using a comma as the delimiter. The expected results are
-- laid out in the comments
--=====================================================================================================================
--===== Conditionally drop the test tables to make reruns easier for testing.
-- (this is NOT a part of the solution)
IF OBJECT_ID('tempdb..#JBMTest') IS NOT NULL DROP TABLE #JBMTest
;
--===== Create and populate a test table on the fly (this is NOT a part of the solution).
-- In the following comments, "b" is a blank and "E" is an element in the left to right order.
-- Double Quotes are used to encapsulate the output of "Item" so that you can see that all blanks
-- are preserved no matter where they may appear.
SELECT *
INTO #JBMTest
FROM ( --# & type of Return Row(s)
SELECT 0, NULL UNION ALL --1 NULL
SELECT 1, SPACE(0) UNION ALL --1 b (Empty String)
SELECT 2, SPACE(1) UNION ALL --1 b (1 space)
SELECT 3, SPACE(5) UNION ALL --1 b (5 spaces)
SELECT 4, ',' UNION ALL --2 b b (both are empty strings)
SELECT 5, '55555' UNION ALL --1 E
SELECT 6, ',55555' UNION ALL --2 b E
SELECT 7, ',55555,' UNION ALL --3 b E b
SELECT 8, '55555,' UNION ALL --2 E b
SELECT 9, '55555,1' UNION ALL --2 E E
SELECT 10, '1,55555' UNION ALL --2 E E
SELECT 11, '55555,4444,333,22,1' UNION ALL --5 E E E E E
SELECT 12, '55555,4444,,333,22,1' UNION ALL --6 E E b E E E
SELECT 13, ',55555,4444,,333,22,1,' UNION ALL --8 b E E b E E E b
SELECT 14, ',55555,4444,,,333,22,1,' UNION ALL --9 b E E b b E E E b
SELECT 15, ' 4444,55555 ' UNION ALL --2 E (w/Leading Space) E (w/Trailing Space)
SELECT 16, 'This,is,a,test.' --4 E E E E
) d (SomeID, SomeValue)
;
--===== Split the CSV column for the whole table using CROSS APPLY (this is the solution)
SELECT test.SomeID, test.SomeValue, split.ItemNumber, Item = QUOTENAME(split.Item,'"')
FROM #JBMTest test
CROSS APPLY dbo.DelimitedSplit8K(test.SomeValue,',') split
;
--=====================================================================================================================
-- TEST 2:
-- This tests for various "alpha" splits and COLLATION using all ASCII characters from 0 to 255 as a delimiter against
-- a given string. Note that not all of the delimiters will be visible and some will show up as tiny squares because
-- they are "control" characters. More specifically, this test will show you what happens to various non-accented
-- letters for your given collation depending on the delimiter you chose.
--=====================================================================================================================
WITH
cteBuildAllCharacters (String,Delimiter) AS
(
SELECT TOP 256
'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789',
CHAR(ROW_NUMBER() OVER (ORDER BY (SELECT NULL))-1)
FROM master.sys.all_columns
)
SELECT ASCII_Value = ASCII(c.Delimiter), c.Delimiter, split.ItemNumber, Item = QUOTENAME(split.Item,'"')
FROM cteBuildAllCharacters c
CROSS APPLY dbo.DelimitedSplit8K(c.String,c.Delimiter) split
ORDER BY ASCII_Value, split.ItemNumber
;
-----------------------------------------------------------------------------------------------------------------------
Other Notes:
1. Optimized for VARCHAR(8000) or less. No testing or error reporting for truncation at 8000 characters is done.
2. Optimized for single character delimiter. Multi-character delimiters should be resolved externally from this
function.
3. Optimized for use with CROSS APPLY.
4. Does not "trim" elements just in case leading or trailing blanks are intended.
5. If you don't know how a Tally table can be used to replace loops, please see the following...
http://www.sqlservercentral.com/articles/T-SQL/62867/
6. Changing this function to use NVARCHAR(MAX) will cause it to run twice as slow. It's just the nature of
VARCHAR(MAX) whether it fits in-row or not.
7. Multi-machine testing for the method of using UNPIVOT instead of 10 SELECT/UNION ALLs shows that the UNPIVOT method
is quite machine dependent and can slow things down quite a bit.
-----------------------------------------------------------------------------------------------------------------------
Credits:
This code is the product of many people's efforts including but not limited to the following:
cteTally concept originally by Itzik Ben-Gan and "decimalized" by Lynn Pettis (and others) for a bit of extra speed
and finally redacted by Jeff Moden for a different slant on readability and compactness. Hats off to Paul White for
his simple explanations of CROSS APPLY and for his detailed testing efforts. Last but not least, thanks to
Ron "BitBucket" McCullough and Wayne Sheffield for their extreme performance testing across multiple machines and
versions of SQL Server. The latest improvement brought an additional 15-20% improvement over Rev 05. Special thanks
to "Nadrek" and "peter-757102" (aka Peter de Heer) for bringing such improvements to light. Nadrek's original
improvement brought about a 10% performance gain and Peter followed that up with the content of Rev 07.
I also thank whoever wrote the first article I ever saw on "numbers tables" which is located at the following URL
and to Adam Machanic for leading me to it many years ago.
http://sqlserver2000.databases.aspfaq.com/why-should-i-consider-using-an-auxiliary-numbers-table.html
-----------------------------------------------------------------------------------------------------------------------
Revision History:
Rev 00 - 20 Jan 2010 - Concept for inline cteTally: Lynn Pettis and others.
Redaction/Implementation: Jeff Moden
- Base 10 redaction and reduction for CTE. (Total rewrite)
Rev 01 - 13 Mar 2010 - Jeff Moden
- Removed one additional concatenation and one subtraction from the SUBSTRING in the SELECT List for that tiny
bit of extra speed.
Rev 02 - 14 Apr 2010 - Jeff Moden
- No code changes. Added CROSS APPLY usage example to the header, some additional credits, and extra
documentation.
Rev 03 - 18 Apr 2010 - Jeff Moden
- No code changes. Added notes 7, 8, and 9 about certain "optimizations" that don't actually work for this
type of function.
Rev 04 - 29 Jun 2010 - Jeff Moden
- Added WITH SCHEMABINDING thanks to a note by Paul White. This prevents an unnecessary "Table Spool" when the
function is used in an UPDATE statement even though the function makes no external references.
Rev 05 - 02 Apr 2011 - Jeff Moden
- Rewritten for extreme performance improvement especially for larger strings approaching the 8K boundary and
for strings that have wider elements. The redaction of this code involved removing ALL concatenation of
delimiters, optimization of the maximum "N" value by using TOP instead of including it in the WHERE clause,
and the reduction of all previous calculations (thanks to the switch to a "zero based" cteTally) to just one
instance of one add and one instance of a subtract. The length calculation for the final element (not
followed by a delimiter) in the string to be split has been greatly simplified by using the ISNULL/NULLIF
combination to determine when the CHARINDEX returned a 0 which indicates there are no more delimiters to be
had or to start with. Depending on the width of the elements, this code is between 4 and 8 times faster on a
single CPU box than the original code especially near the 8K boundary.
- Modified comments to include more sanity checks on the usage example, etc.
- Removed "other" notes 8 and 9 as they were no longer applicable.
Rev 06 - 12 Apr 2011 - Jeff Moden
- Based on a suggestion by Ron "Bitbucket" McCullough, additional test rows were added to the sample code and
the code was changed to encapsulate the output in pipes so that spaces and empty strings could be perceived
in the output. The first "Notes" section was added. Finally, an extra test was added to the comments above.
Rev 07 - 06 May 2011 - Peter de Heer, a further 15-20% performance enhancement has been discovered and incorporated
into this code which also eliminated the need for a "zero" position in the cteTally table.
**********************************************************************************************************************/
--===== Define I/O parameters
(@pString VARCHAR(8000), @pDelimiter CHAR(1))
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
--===== "Inline" CTE Driven "Tally Table" produces values from 1 up to 10,000...
-- enough to cover VARCHAR(8000)
WITH E1(N) AS (
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
), --10E+1 or 10 rows
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front
-- for both a performance gain and prevention of accidental "overruns"
SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
),
cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)
SELECT 1 UNION ALL
SELECT t.N+1 FROM cteTally t WHERE SUBSTRING(@pString,t.N,1) = @pDelimiter
),
cteLen(N1,L1) AS(--==== Return start and length (for use in substring)
SELECT s.N1,
ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,s.N1),0)-s.N1,8000)
FROM cteStart s
)
--===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.
SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),
Item = SUBSTRING(@pString, l.N1, l.L1)
FROM cteLen l
;
Yes, it is long, but that's mostly comments to explain it and its history. Don't worry, it's the fastest thing available in T-SQL (AFAIK only SQLCLR is faster, and that's not T-SQL).
Note that it only supports up to VARCHAR(8000). If you really need VARCHAR(MAX), then it can be easily changed to that, but VARCHAR(MAX)'s are about twice as slow.
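For readers who find the CTE chain hard to follow, the splitting logic can be sketched procedurally. This is a Python illustration of the same idea (element starts, then lengths), not a replacement for the T-SQL. `delimited_split` is a hypothetical name, it assumes a single-character delimiter, and NULL input is not modelled.

```python
def delimited_split(text: str, delimiter: str) -> list[tuple[int, str]]:
    """Return (ItemNumber, Item) pairs, keeping empty elements and spaces."""
    # Like cteStart: position 0 plus the position just after every delimiter.
    starts = [0] + [i + 1 for i, ch in enumerate(text) if ch == delimiter]
    items = []
    for item_number, start in enumerate(starts, 1):
        # Like cteLen: run to the next delimiter, or to the end of the string.
        end = text.find(delimiter, start)
        if end == -1:
            end = len(text)
        items.append((item_number, text[start:end]))
    return items

print(delimited_split(",55555,", ","))
# [(1, ''), (2, '55555'), (3, '')]
```

The output matches test row 7 in the function header, where ',55555,' is expected to return three rows: blank, element, blank.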
Now you can implement your procedure like this:
CREATE PROC [dbo].[GetFileContentsFromTitles]
@MyTitles varchar(max)
AS
SELECT *
INTO #tmpTitles
FROM dbo.DelimitedSplit8K(@MyTitles, ',')
SELECT [Title], [Sequence] From [dbo].[MasterSequence]
WHERE Title IN (SELECT item FROM #tmpTitles)
I cannot test this for you without your DDL and some data, but it should be much faster. If not, then we may need to throw an index onto the [item] column in the temp table.
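The shape of this approach (split once into a temporary table, then let the engine match on the Title column) can be tried out in miniature. The following is a sketch using Python with an in-memory SQLite database rather than SQL Server. The sample rows, the index name and the Python function name are all made up, and Python's str.split stands in for the splitter function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-in for dbo.MasterSequence with a few invented rows.
conn.execute("CREATE TABLE MasterSequence (Title TEXT, Sequence TEXT)")
conn.executemany(
    "INSERT INTO MasterSequence VALUES (?, ?)",
    [("alpha", "ACGT"), ("beta", "TTGA"), ("gamma", "CCCA")],
)
conn.execute("CREATE INDEX ix_title ON MasterSequence (Title)")

def get_file_contents_from_titles(titles_csv: str) -> list[tuple[str, str]]:
    # Split the list once into an indexed temp table.
    conn.execute("CREATE TEMP TABLE IF NOT EXISTS tmpTitles (item TEXT PRIMARY KEY)")
    conn.execute("DELETE FROM tmpTitles")
    conn.executemany(
        "INSERT OR IGNORE INTO tmpTitles VALUES (?)",
        [(t,) for t in titles_csv.split(",")],
    )
    # Match whole titles instead of scanning one long string per row.
    return conn.execute(
        "SELECT Title, Sequence FROM MasterSequence "
        "WHERE Title IN (SELECT item FROM tmpTitles) ORDER BY Title"
    ).fetchall()

print(get_file_contents_from_titles("gamma,alpha,missing"))
# [('alpha', 'ACGT'), ('gamma', 'CCCA')]
```

The point of the design is that the comparison becomes an equality match on Title, which an index can serve, whereas the charindex version has to evaluate a string search for every row.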
Here's another version of the split function that uses VARCHAR(MAX):
CREATE FUNCTION [dbo].[DelimitedSplitMax]
/**********************************************************************************************************************
Purpose:
Split a given string at a given delimiter and return a list of the split elements (items).
Notes:
1. Leading and trailing delimiters are treated as if an empty string element were present.
2. Consecutive delimiters are treated as if an empty string element were present between them.
3. Except when spaces are used as a delimiter, all spaces present in each element are preserved.
Returns:
iTVF containing the following:
ItemNumber = Element position of Item as a BIGINT (not converted to INT to eliminate a CAST)
Item = Element value as a VARCHAR(MAX)
Statistics on this function may be found at the following URL:
http://www.sqlservercentral.com/Forums/Topic1101315-203-4.aspx
-----------------------------------------------------------------------------------------------------------------------
Other Notes:
1. Optimized for VARCHAR(8000) or less. No testing or error reporting for truncation at 8000 characters is done.
2. Optimized for single character delimiter. Multi-character delimiters should be resolved externally from this
function.
3. Optimized for use with CROSS APPLY.
4. Does not "trim" elements just in case leading or trailing blanks are intended.
5. If you don't know how a Tally table can be used to replace loops, please see the following...
http://www.sqlservercentral.com/articles/T-SQL/62867/
6. Changing this function to use NVARCHAR(MAX) will cause it to run twice as slow. It's just the nature of
VARCHAR(MAX) whether it fits in-row or not.
7. Multi-machine testing for the method of using UNPIVOT instead of 10 SELECT/UNION ALLs shows that the UNPIVOT method
is quite machine dependent and can slow things down quite a bit.
-----------------------------------------------------------------------------------------------------------------------
Credits:
This code is the product of many people's efforts including but not limited to the following:
cteTally concept originally by Itzik Ben-Gan and "decimalized" by Lynn Pettis (and others) for a bit of extra speed
and finally redacted by Jeff Moden for a different slant on readability and compactness. Hats off to Paul White for
his simple explanations of CROSS APPLY and for his detailed testing efforts. Last but not least, thanks to
Ron "BitBucket" McCullough and Wayne Sheffield for their extreme performance testing across multiple machines and
versions of SQL Server. The latest improvement brought an additional 15-20% improvement over Rev 05. Special thanks
to "Nadrek" and "peter-757102" (aka Peter de Heer) for bringing such improvements to light. Nadrek's original
improvement brought about a 10% performance gain and Peter followed that up with the content of Rev 07.
I also thank whoever wrote the first article I ever saw on "numbers tables" which is located at the following URL
and to Adam Machanic for leading me to it many years ago.
http://sqlserver2000.databases.aspfaq.com/why-should-i-consider-using-an-auxiliary-numbers-table.html
-----------------------------------------------------------------------------------------------------------------------
Revision History:
Rev 00 - 20 Jan 2010 - Concept for inline cteTally: Lynn Pettis and others.
Redaction/Implementation: Jeff Moden
- Base 10 redaction and reduction for CTE. (Total rewrite)
Rev 01 - 13 Mar 2010 - Jeff Moden
- Removed one additional concatenation and one subtraction from the SUBSTRING in the SELECT List for that tiny
bit of extra speed.
Rev 02 - 14 Apr 2010 - Jeff Moden
- No code changes. Added CROSS APPLY usage example to the header, some additional credits, and extra
documentation.
Rev 03 - 18 Apr 2010 - Jeff Moden
- No code changes. Added notes 7, 8, and 9 about certain "optimizations" that don't actually work for this
type of function.
Rev 04 - 29 Jun 2010 - Jeff Moden
- Added WITH SCHEMABINDING thanks to a note by Paul White. This prevents an unnecessary "Table Spool" when the
function is used in an UPDATE statement even though the function makes no external references.
Rev 05 - 02 Apr 2011 - Jeff Moden
- Rewritten for extreme performance improvement especially for larger strings approaching the 8K boundary and
for strings that have wider elements. The redaction of this code involved removing ALL concatenation of
delimiters, optimization of the maximum "N" value by using TOP instead of including it in the WHERE clause,
and the reduction of all previous calculations (thanks to the switch to a "zero based" cteTally) to just one
instance of one add and one instance of a subtract. The length calculation for the final element (not
followed by a delimiter) in the string to be split has been greatly simplified by using the ISNULL/NULLIF
combination to determine when the CHARINDEX returned a 0 which indicates there are no more delimiters to be
had or to start with. Depending on the width of the elements, this code is between 4 and 8 times faster on a
single CPU box than the original code especially near the 8K boundary.
- Modified comments to include more sanity checks on the usage example, etc.
- Removed "other" notes 8 and 9 as they were no longer applicable.
Rev 06 - 12 Apr 2011 - Jeff Moden
- Based on a suggestion by Ron "Bitbucket" McCullough, additional test rows were added to the sample code and
the code was changed to encapsulate the output in pipes so that spaces and empty strings could be perceived
in the output. The first "Notes" section was added. Finally, an extra test was added to the comments above.
Rev 07 - 06 May 2011 - Peter de Heer, a further 15-20% performance enhancement has been discovered and incorporated
into this code which also eliminated the need for a "zero" position in the cteTally table.
Rev 07a- 18 Oct 2012 - RBarryYoung, Varchar(MAX), because it's needed, even though it's slower...
**********************************************************************************************************************/
--===== Define I/O parameters
(@pString VARCHAR(MAX), @pDelimiter CHAR(1))
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
--===== "Inline" CTE Driven "Tally Table" produces values from 1 up to 100,000,000...
-- hopefully enough to cover most VARCHAR(MAX)'s
WITH E1(N) AS (
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
), --10E+1 or 10 rows
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
E8(N) AS (SELECT 1 FROM E4 a, E4 b), --10E+8 or 100,000,000 rows max
cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front
-- for both a performance gain and prevention of accidental "overruns"
SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E8
),
cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)
SELECT 1 UNION ALL
SELECT t.N+1 FROM cteTally t WHERE SUBSTRING(@pString,t.N,1) = @pDelimiter
),
cteLen(N1,L1) AS(--==== Return start and length (for use in substring)
SELECT s.N1,
ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,s.N1),0)-s.N1,999999999)
FROM cteStart s
)
--===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.
SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),
Item = SUBSTRING(@pString, l.N1, l.L1)
FROM cteLen l
;
Be forewarned, however, that I only set it up to count up to 100,000,000 characters. Also, I have not had a chance to test it yet, so you should be sure to test it yourself.