I'm trying to check the results of a data load between two databases. Unfortunately, I only have direct access to one database (MySQL); the company managing the MSSQL server can expose data to us via an API.
What I would like to do is check the consistency of certain columns across rowsets. Originally, I had hoped to be able to run a CRC or hash check against the columns, but there doesn't seem to be a compatible way of doing this.
For example, we can run CRC32 against a column in MySQL, but there isn't a reliable way of doing the same on MSSQL. Alternatively, there's CHECKSUM_AGG on MSSQL, but no alternative on MySQL.
The end result is that, if the checksums for a range differ, I would like to binary-search within that range to identify the rows that require changing.
There is currently no bulk load interface, and SSIS is not available (the MSSQL servers are not part of my company).
I thought I'd come back to this and describe the solution we ended up implementing. This was a major pain in the neck!
Firstly, because of the fixed versions of MySQL on our server and MSSQL on the remote server, there were no common encoding methods.
The MSSQL API returned data in UTF-16LE, while the MySQL database had Unicode data stored in Latin-1 tables. Sigh.
First, we concatenated the fields we were comparing, then we MD5'd the result. To get the MySQL result to match the output of the MSSQL HASHBYTES function, we had to do this:
SELECT ABS(CONV(CONCAT(
    -- Sign-extend: a leading hex digit >= '8' means the value is negative
    -- once MSSQL casts the hash's last 4 bytes to a signed INT.
    IF(MID(MD5(CONC), -8, 1) >= "8", "FFFFFFFF", ""),
    -- The last 8 hex digits are the 4 low-order bytes kept by MSSQL's CONVERT(INT, ...).
    RIGHT(MD5(CONC), 8)
), 16, -10)) AS CALC
where CONC is the result of a subselect concatenating the fields we are interested in.
On the MSSQL server, we had to do the following query:
SELECT ABS(CONVERT(INT, HASHBYTES('MD5',
    CONVERT(NVARCHAR(4000), FIELD1) +
    CONVERT(NVARCHAR(4000), FIELD2) + ...
))) AS CALC
Then we took the sum of the MD5 values across the entire range, modulo three largish primes (311, 313, 317); per the Chinese Remainder Theorem, matching all three residues is equivalent to matching the sum modulo 311 × 313 × 317 = 30,857,731.
This gave us three numbers for the range we were checking. We could be reasonably certain that if all three numbers matched for a given range on each server, then the data was consistent.
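A minimal sketch of the range fingerprint on the MySQL side (calc_rows and id are placeholder names standing in for the per-row query above, not our real schema):

-- Placeholder names: calc_rows wraps the per-row CALC query shown above.
SELECT SUM(calc) % 311 AS m311,
       SUM(calc) % 313 AS m313,
       SUM(calc) % 317 AS m317
FROM calc_rows
WHERE id BETWEEN 1 AND 100000;

If any of the three residues differed between the servers for a given ID range, we halved the range and repeated, which is the binary search mentioned above.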
I'll spare you the details of the munging we had to do to get Unicode in Latin-1 transliterated to UTF-16LE.
I have created a connection to Cloud SQL and used EXTERNAL_QUERY() to export the data to BigQuery. My problem is that I do not know a computationally efficient way to export a new day's data, since the Cloud SQL table is not partitioned; however, it does have a date column, date_field, but it is of the datatype char.
I have tried running the following query with a view to scheduling a similar one that inserts the results:
SELECT * FROM EXTERNAL_QUERY("connection", "SELECT period FROM table WHERE date_field = cast(current_date() as char);")
but it takes very long to run, whereas:
SELECT * FROM EXTERNAL_QUERY("connection", "SELECT period FROM table WHERE date_field = '2020-03-20';")
is almost instant.
Firstly, it’s highly recommended to convert the ‘date_field’ column to the datatype DATE. This would simplify your queries and improve performance going forward.
When comparing two strings, MySQL can use indexes to speed up the query. This works when the string is given as a literal such as ‘2020-03-20’. When casting the current date to a string, it’s possible that the character set used in the comparison isn’t the same as the column’s, so indexes can’t be used.
You may want to check the character set of current_date() once it has been cast, compared to the values in the ‘date_field’ column. You could then use this command instead of CAST:
CONVERT(current_date() USING enter_char_sets_here)
Here is the documentation for the different casting functions.
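For instance, the scheduled query might then look like this (utf8mb4 is an assumption here; check the actual character set of the date_field column first):

SELECT * FROM EXTERNAL_QUERY("connection",
    "SELECT period FROM table WHERE date_field = CONVERT(current_date() USING utf8mb4);");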
I am attempting to create several calculated columns in a table with different parts of a parsed filename. Using the InstrRev function is critical to isolate the base file name or extension, but InstrRev is not supported in calculated columns.
I know that there are other ways to solve my problem that don't use calculated columns, but does anyone have a valid calculated column formula that could help me?
Access lets you use VBA functions (including user-defined functions) directly from within a SQL query. However, they only work within an Access context: if you have another frontend for a JET (now ACE) database, or you're inside a computed/calculated column, they won't work, as you've just discovered.
Unfortunately, Access (JET and ACE) has only a very meagre and anaemic selection of built-in functions, and the platform has now lagged behind SQL Server (and even the open-source SQLite) significantly. Access 2016 has not made significant changes to its SQL implementation since Access 2000 (16 years of stagnation!), whereas SQL Server 2016's T-SQL language is so evolved it's almost unrecognizable compared to SQL Server 2000.
JET and ACE support the standard ODBC functions (https://msdn.microsoft.com/en-us/library/bb208907(v=office.12).aspx), but none of these perform a "reverse index-of" operation. Also absent is any form of pattern-matching function: though the LIKE operator works, it only returns a boolean result, not a character index.
In short: what you want to do is impossible.
This has been discovered by many people before you:
https://social.msdn.microsoft.com/Forums/office/en-US/6cf82b1b-8e74-4ac8-9997-61cad8bb9310/access-database-engine-incompatible-with-instrrev?forum=accessdev
The author there maintains a list of DAO/Jet/etc. reserved words, and on that list you will see that InstrRev is a VBA function and is not a part of the Jet/ACE engines.
using InStrRev() and similar functions in Jet/ACE queries outside of Access
As you have discovered, SQL queries executed from within Access can use many VBA functions that are not natively supported by the Jet/ACE dialect of SQL.
That said, computed/calculated columns are only really of use in stored VIEW objects ("Queries" objects in Access parlance), which in turn exist for user convenience rather than any programming advantage, especially as these scalar functions are evaluated for every row of data the engine processes (making them potentially very expensive and inefficient to run).
...so the only real solution is to abandon computed/calculated columns and perform this processing in your own application code - but the advantage is that your program will likely be significantly faster.
...or don't use Access and switch to a different DBMS with better active support, such as SQLite (for an in-process database), SQL Server (now with LocalDb for in-process support), or VistaDB (proprietary, but 100% Managed code). Note that Access also supports acting as a front-end for a SQL Server "backend" data-store, where you could create a VIEW that performs this operation, then query the view from your Access code or other consuming client.
There is a workaround if you must: create a duplicate column that contains the string-reversed value of your original column; you can then evaluate the ODBC LOCATE or JET SQL InStr functions on it and get the result you want (albeit reversed), but this would require double the storage space.
e.g.
RowId, FileName , FileNameRev
1 , 'Foo.txt', 'txt.ooF'
2 , 'Bar.txt', 'txt.raB'
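A sketch of how the reversed column can then be queried (the Files table and the last-dot position are hypothetical illustrations):

-- Hypothetical: FileNameRev holds the reversed FileName, kept in sync by
-- application code. InStr on the reversed value finds the last '.', and the
-- arithmetic maps it back to a position in the original string.
SELECT RowId,
       FileName,
       Len(FileName) - InStr(FileNameRev, '.') + 1 AS LastDotPos
FROM Files;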
Avoid any calculated field. It's a "super user" feature that will cause you nothing but trouble. Calculated fields, or expressions, belong in a query.
So create a simple select query:
Select
    *,
    InStrRev([FieldToCheck], "YourMatchingString") As StringMatch
From
    YourTable
Save the query, and then use this whenever you need the table values and this expression.
I am building a reporting database and get data in CSV as well as from MS SQL Server. These are mostly personal records, tied together by SSN (well, not really SSN but something very similar). For security reasons, we are not storing the SSN, but rather the SHA2-256 hash of the SSN.
On MySql 5.6, I can simply use the built-in function:
SHA2(string, 256)
For MS SQL Express (SQL Server 2008 running on Windows 7) I used this link (http://geekswithblogs.net/hroggero/archive/2009/09/19/strong-password-hashing-with-sql-server.aspx) to write an external UDF.
Each function should return the same string for the same argument... and they do not. Even more puzzling, the binary representations seem to be quite different:
MySql Output: bfd6b995588ec54ce16871bc82a7ac86dd43a2c22309ea68e479a50043683937
MSSQL Output: 0x1B4F27012B7F6E7DA6563376E3CB560FCB07FDE2E33C6C3241A5D53885ABCF71
The MSSQL output is clearly a hex encoding (0-9, A-F) of the high and low 4-bit halves of each byte. But how does MySql represent the binary characters?
Ran the MySql queries both through SqlYog and through the DOS command-line ... no difference.
I have checked the MySql documentation and everything else I could find on the web, no luck.
Actually, MySql returns the same hex codes as MSSQL, just in lowercase and without the 0x prefix. I corrected my C# code and now both functions return the same string. Hurrayyyyyyy!
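For anyone else comparing the two, here's a minimal sketch with a made-up SSN. It assumes SQL Server 2012 or later, where HASHBYTES supports SHA2_256 natively (our 2008 instance needed the CLR UDF from the link above):

-- MySQL: SHA2() returns lowercase hex with no 0x prefix.
SELECT SHA2('123-45-6789', 256) AS ssn_hash;

-- SQL Server 2012+: CONVERT style 2 drops the 0x prefix, and LOWER() matches
-- MySQL's casing. Note the plain VARCHAR literal: an N'...' literal would
-- hash UTF-16 bytes and produce a different digest for the same text.
SELECT LOWER(CONVERT(CHAR(64), HASHBYTES('SHA2_256', '123-45-6789'), 2)) AS ssn_hash;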
I'm trying to migrate an application database from MySQL to SQL Server using the vendor's conversion tool. When I do, I get a unique constraint violation error that indicates I'm trying to write a value ("Canon Inc.") twice in one column of the SQL Server table.
I logged into the live system (MySQL) and ran a duplicate-check query; it returned zero records. This tells me that MySQL doesn't see any records where the column contains duplicate values.
However, when I search more generally for Canon Inc., I do indeed find two records. But when I check the character and bit lengths of the strings, they're clearly not exactly the same.
What could the difference be between these two strings? Is there a way for me to clean this up?
My guess would be that you have a trailing space on the second "Canon Inc.". That would account for the character length being one more than the other, and I'd bet that SQL Server is ignoring the trailing space when it compares the values for uniqueness.
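If so, a cleanup along these lines might work (products and manufacturer are placeholder names for your actual table and column):

-- Find rows with leading/trailing spaces. LENGTH() is compared instead of the
-- strings because MySQL's default PAD SPACE collations treat 'Canon Inc.'
-- and 'Canon Inc. ' as equal.
SELECT id, CONCAT('[', manufacturer, ']') AS visualized
FROM products
WHERE LENGTH(manufacturer) <> LENGTH(TRIM(manufacturer));

-- Then strip the spaces before re-running the conversion tool.
UPDATE products
SET manufacturer = TRIM(manufacturer)
WHERE LENGTH(manufacturer) <> LENGTH(TRIM(manufacturer));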
I'm trying to push a brand new Ruby on Rails app to Heroku. Currently, it sits on MySQL. It looks like Heroku doesn't really support MySQL and so we are considering using PostgreSQL, which they DO support.
How difficult should I expect this to be? What do I need to do to make this happen?
Again, please note that my DBs as of right now (both development & production) are completely empty.
Common issues:
1. GROUP BY behavior. PostgreSQL has a rather strict GROUP BY: if you use a GROUP BY clause, then every column in your SELECT must either appear in your GROUP BY or be used in an aggregate function.
2. Data truncation. MySQL will quietly truncate a long string to fit inside a char(n) column unless your server is in strict mode; PostgreSQL will complain and make you truncate your string yourself.
3. Quoting is different: MySQL uses backticks for quoting identifiers, whereas PostgreSQL uses double quotes.
4. LIKE is case-insensitive in MySQL but not in PostgreSQL. This leads many MySQL users to use LIKE as a case-insensitive string equality operator.
(1) will be an issue if you use AR's group method in any of your queries or GROUP BY in any raw SQL. Do some searching for the error column "X" must appear in the GROUP BY clause or be used in an aggregate function and you'll see some examples and common solutions.
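For example, a query like this (hypothetical orders table) runs on MySQL without ONLY_FULL_GROUP_BY but is rejected by PostgreSQL:

-- Rejected by PostgreSQL: "name" is neither grouped nor aggregated.
SELECT customer_id, name, COUNT(*) AS order_count
FROM orders
GROUP BY customer_id;

-- PostgreSQL-safe version: group by every non-aggregated column.
SELECT customer_id, name, COUNT(*) AS order_count
FROM orders
GROUP BY customer_id, name;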
(2) will be an issue if you use string columns anywhere in your application and your models aren't properly validating the length of all incoming string values. Note that creating a string column in Rails without specifying a limit actually creates a varchar(255) column so there actually is an implicit :limit => 255 even though you didn't specify one. An alternative is to use t.text for your strings instead of t.string; this will let you work with arbitrarily large strings without penalty (for PostgreSQL at least). As Erwin notes below (and every other chance he gets), varchar(n) is a bit of an anachronism in the PostgreSQL world.
(3) shouldn't be a problem unless you have raw SQL in your code.
(4) will be an issue if you're using LIKE anywhere in your application. You can fix this one by changing a like b to lower(a) like lower(b) (or upper(a) like upper(b) if you like to shout), or to a ilike b, but be aware that PostgreSQL's ILIKE is non-standard.
There are other differences that can cause trouble but those seem like the most common issues.
You'll have to review a few things to feel safe:
group calls.
Raw SQL (including any snippets in where calls).
String length validations in your models.
All uses of LIKE.
If you have no data to migrate, it should be as simple as telling your Gemfile to use the pg gem instead, running bundle install, and updating your database.yml file to point to your PostgreSQL databases. Then just run your migrations (rake db:migrate) and everything should work great.
Don't feel you have to migrate to Postgres - there are several MySQL Addon providers available on Heroku - http://addons.heroku.com/cleardb is the one I've had the most success with.
It should be simplicity itself: port the DDL from MySQL to PostgreSQL.
Does Heroku have any schema creation scripts? I'd depend on those if they were available.
MySQL and PostgreSQL are different (e.g. AUTO_INCREMENT identity columns in MySQL, sequences in PostgreSQL). But the port shouldn't be too hard. How many tables? Tens are doable.
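For example, one common piece of such a port (hypothetical users table):

-- MySQL DDL with an AUTO_INCREMENT identity column...
CREATE TABLE users (
    id   INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255) NOT NULL
);

-- ...becomes a sequence-backed SERIAL column in PostgreSQL.
CREATE TABLE users (
    id   SERIAL PRIMARY KEY,
    name VARCHAR(255) NOT NULL
);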