Use of wildcards in external table definition - partitioning

Thanks for reading! I would like to define an external table on a storage account where the path format is as follows:
flowevents/resourceId=/SUBSCRIPTIONS/<unique>/RESOURCEGROUPS/<unique>/PROVIDERS/MICROSOFT.NETWORK/NETWORKSECURITYGROUPS/<unique>/y=2022/m=05/d=11/h=09/m=00/<unique>/datafiles
I would like to partition the external table by date. The relevant documentation for this is located here. My understanding and experimentation indicate that this might not be possible, given the URI path above, where unique values appear before the values I would like to partition on, and given the answer provided by Slavik here.
Is it possible to create an external table using wildcards to traverse the folders to achieve the partition scheme described above?
Is the only way to solve this to define multiple storage connection strings for all possible values of <unique>? Is there an upper limit to how many values may be provided?
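For reference, a minimal sketch of the documented date-based partitioning syntax is shown below; the table name, schema, storage account and key are placeholders rather than my real definition:

.create external table FlowEvents (Timestamp: datetime, Records: dynamic)
kind = storage
partition by (Date: datetime = bin(Timestamp, 1d))
pathformat = ("y=" datetime_pattern("yyyy", Date) "/m=" datetime_pattern("MM", Date) "/d=" datetime_pattern("dd", Date))
dataformat = json
(
   h@'https://<storageaccount>.blob.core.windows.net/flowevents;<storagekey>'
)

The open question is how the <unique> segments that precede the date folders can be expressed, short of baking every possible prefix into its own connection string.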

The path traversal functionality I'm looking for can be found in LightIngest:
-prefix:resourceId=/SUBSCRIPTIONS/00-00C-00-00-00/RESOURCEGROUPS/ -pattern:*/PROVIDERS/MICROSOFT.NETWORK/NETWORKSECURITYGROUPS/*/y=2021/m=11/d=10/*.json
It does not seem to be supported when defining external tables. A possible reason for this is that the engine will get overloaded if you load too many files from external storage. I got the following error message when I defined 50 connection strings:
Partial query failure: Input stream/record/field too large (E_INPUT_STREAM_TOO_LARGE). (message: '', details: '')
It worked as intended when I provided 30 connection strings and used four virtual columns for partitioning. This error message is not described in the documentation, by the way.
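Roughly, the syntax for string (virtual) partition columns plus a comma-separated list of connection strings looks like this; everything below is a placeholder sketch rather than the exact working definition, and the precise split between what sits in the connection strings and what sits in pathformat is not reproduced here:

.create external table FlowEvents (Timestamp: datetime, Records: dynamic)
kind = storage
partition by (
    SubscriptionId: string,
    ResourceGroup: string,
    Nsg: string,
    Date: datetime = bin(Timestamp, 1d)
)
pathformat = (
    "resourceId=/SUBSCRIPTIONS/" SubscriptionId
    "/RESOURCEGROUPS/" ResourceGroup
    "/PROVIDERS/MICROSOFT.NETWORK/NETWORKSECURITYGROUPS/" Nsg
    "/y=" datetime_pattern("yyyy", Date) "/m=" datetime_pattern("MM", Date) "/d=" datetime_pattern("dd", Date)
)
dataformat = json
(
    h@'https://<account1>.blob.core.windows.net/flowevents;<key1>',
    h@'https://<account2>.blob.core.windows.net/flowevents;<key2>'
)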
Update, for Kusto developers: I attempted to use virtual columns for the whole URI path and then query to generate the connection string. I verified that the table definition is correct using:
.show external table X artifacts limit 1
It would show the partitions with populated values. However, querying the external table using the recommended operators ("in" or "has") to navigate the partitions does not work: the query runs forever, despite only fetching a small file and running on a cluster of D14_v2 VMs. If I define an external table just for that file, it loads just fine.
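The kind of filter that hangs looks roughly like this, using the placeholder names from the sketches above:

external_table("FlowEvents")
| where SubscriptionId in ("00-00C-00-00-00") and Date == datetime(2022-05-11)
| take 10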

Related

MySQL - Invalid GIS data provided to function st_polygonfromtext

I have a table in MySQL with geometry data in one of the columns. The data type is text, and I need to save it as Polygon geometry.
I have tried a few solutions, but keep running into the "Invalid GIS data provided to function st_polygonfromtext." error.
Here's some data to work with and an example:
https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=78ac63e16ccb5b1e4012c21809cba5ff
The table has 25k rows, and there are likely some bad geometries in there. When I attempt to update a subset of rows, it seems to work, as it did in the fiddle example. It fails when I attempt to update all 25k rows.
Someone suggested wrapping the statements in TRY and CATCH: Detecting faulty geometry WKT and returning the faulty record
I am not too familiar with using them in MySQL, or with stored procedures either.
I need a spatial index on the table to be able to use spatial functions and filter queries by location.
Plan A: Create a new table and try to convert as you INSERT IGNORE INTO that table from your existing table (a sketch follows Plan B below). I don't know if this will apply the "IGNORE" to conversion failures. Also, you would end up with only the "good" values. What do you want to do about the "bad" values?
Plan B: Write a loop in application code -- read one row, convert the varchar value, check for errors.
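A sketch of Plan A, with made-up table and column names:

-- Copy rows into a new table, converting the WKT text to a geometry column.
-- Whether INSERT IGNORE really skips rows whose WKT makes ST_PolygonFromText
-- fail is exactly the open question above, so test it on a small subset first.
CREATE TABLE parcels_clean (
    id   INT PRIMARY KEY,
    geom POLYGON
);

INSERT IGNORE INTO parcels_clean (id, geom)
SELECT id, ST_PolygonFromText(wkt_text)
FROM parcels_raw;

Note that MySQL requires a SPATIAL index column to be NOT NULL, so the index you mention would be added only after the "bad" rows have been dealt with.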

How to scope a MySQL JOOQ rename table query to the same database?

I have a scala application that manages multiple MySQL database schemas, which includes modifying (adding, renaming, etc.) tables. The commands are issued over a connection pool that connects to a generic management database in the database server.
Because the application is designed to be cross-database, I use JOOQ to render SQL queries (execution is done via a separate JDBC module).
I experience issues with jOOQ's alterTable(...).renameTo(...) DSL - consider the following example:
We have a table "TestTable" in database "TestDatabase". Let's say I want to rename that table simply to "Foo", keeping it in "TestDatabase".
This code:
...
val context = DSL.using(SQLDialect.MYSQL_5_7)
val query = context
.alterTable(table(name("TestDatabase", "TestTable")))
.renameTo(name("TestDatabase", "Foo"))
...
Generates: ALTER TABLE `TestDatabase`.`TestTable` RENAME TO `Foo`
However, since the connection pool I'm using is connected to my management database, it just renames the table to "Foo" and moves it to my management database. I would have expected the SQL to be: ALTER TABLE `TestDatabase`.`TestTable` RENAME TO `TestDatabase`.`Foo`. I tried a variety of alternatives to invoke the .renameTo method and convince it to use the fully qualified name, to no avail:
.renameTo(table(name(...))) -> same behaviour.
.renameTo("`TestDatabase`.`Foo`") -> Escapes the name with backticks, treats it as one name instead of a qualified name.
I'm wondering if I'm missing something, if this is intended behaviour, or maybe even a bug or design shortcoming of JOOQ.
Is there a way to rename the table using fully qualified names?
Thank you!
That's a bug in jOOQ: https://github.com/jOOQ/jOOQ/issues/8042
Your workaround is close. This doesn't work:
.renameTo("`TestDatabase`.`Foo`")
As you've noticed, behind the scenes, the DSL.name() API is used to wrap the target name, because the renameTo() method doesn't implement the plain SQL templating API. You can, however, explicitly use plain SQL templating by writing as a workaround:
.renameTo(table("`TestDatabase`.`Foo`"))
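With the plain SQL table expression, the qualified target name survives rendering, so the generated statement should be along the lines of:
ALTER TABLE `TestDatabase`.`TestTable` RENAME TO `TestDatabase`.`Foo`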

No columns returned SSIS

I am implementing an SSIS package and am currently trying to do the following.
1. Truncate the destination table.
2. Fetch the data by executing the stored procedure and insert it into the destination table.
I have created an Execute SQL task to address step 1, and a data flow with an OLE DB source and OLE DB destination to address the second point. It has been working successfully so far, but it isn't working for one of my stored procedures that uses temp tables.
When I edit the OLE DB source and click the preview button, I get the error "no columns returned".
I know that SSIS has an issue generating columns when executing stored procedures that depend on temp tables. I have converted the stored proc to use table variables, and it is now able to return columns in SSIS when I do a preview. The only downside is that the stored procedure takes longer to execute: 1 hour 15 minutes, compared to 15 minutes when using temp tables.
I did see a suggestion to use SET FMTONLY before executing the stored procedure as an alternative to changing to table variables, but that didn't seem to work, as I got a syntax or permission-denied error.
Could somebody suggest a solution to my problem that does not compromise performance?
Sounds like you've already read all the approaches to using Temp tables in SSIS, including the IF 1=0... trick? If you haven't seen that one yet, google it.
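In case it helps, the trick as it's usually described (for the older FMTONLY-style metadata discovery) is a block at the top of the proc that never runs but still exposes the result shape; the proc and column names below are made up:

CREATE PROCEDURE dbo.MyTempTableProc
AS
BEGIN
    -- Never executes at run time, but metadata-only passes still compile it,
    -- so the driver can see the column shape of the real result set.
    IF 1 = 0
        SELECT CAST(NULL AS int)            AS OrderID,
               CAST(NULL AS datetime)       AS OrderDate,
               CAST(NULL AS decimal(18, 2)) AS Amount;

    CREATE TABLE #work (OrderID int, OrderDate datetime, Amount decimal(18, 2));
    -- ... populate #work here ...
    SELECT OrderID, OrderDate, Amount FROM #work;
END;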
You say that using Table Variables causes your stored procedure to take about 5 times longer than using Temp Tables. The most likely reason for that is that you are indexing your temp tables but not your table variables. If you didn't know that table variables can be indexed, they can. You might try that.
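For example, a table variable can carry indexes too (a sketch with made-up columns; PRIMARY KEY/UNIQUE constraints work everywhere, and the inline INDEX syntax needs SQL Server 2014 or later):

DECLARE @work TABLE
(
    OrderID   int PRIMARY KEY,           -- gives you a clustered index
    OrderDate datetime,
    Amount    decimal(18, 2),
    INDEX IX_OrderDate (OrderDate)       -- inline nonclustered index, SQL Server 2014+
);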
Finally, a solution that you haven't mentioned is that you can replace your temporary table with a real table that gets truncated when you're done using it.
Short comment:
Try EXEC WITH RESULT SETS and specify the metadata yourself for a proc with temp tables; or use the Script Component as a source and specify the Output columns yourself.
Long comment:
Technically speaking, it is the driver/database you are using in SSIS that would decide the behavior when working with temp tables.
Metadata is an important factor when using SSIS's pipeline components. By metadata, I mean the names of the columns, their data types etc that a pipeline component uses. When designing a data flow, someone/something should provide this metadata to the components that require it.
In most cases, SSIS automatically retrieves the metadata. Components that do not connect to an external data source, like Conditional Split etc., get their metadata from the other components they are connected to. For the pipeline components that connect to an external data source (like OLE DB source, OLE DB destination, Lookup etc.), SSIS provides a mechanism to get this metadata without human involvement. This mechanism involves the driver connecting to the database and retrieving the metadata of the output. If the driver/database is capable of returning the metadata, then that metadata is used. If the driver/database is incapable, then you get the errors you are seeing. The rest of my comments are based on the assumption that you are using a SQL Server database in your question.
When working with a SQL Server database in SSIS, we typically use the native client drivers provided by Microsoft. When trying to get the metadata, these drivers try to get it without actually executing the SQL statement (actual execution can have side effects, and might take more than a few seconds/minutes/hours; you don't want side effects and long wait times at package design time). So to get the metadata, the driver relies on the metadata of the actual objects used in the SQL command. If the command uses a physical table or view, SQL Server already has the metadata available and can supply it to the driver. If it is a temp table, SQL Server does not have the metadata until it can create the temp table. With the FMTONLY option, you can use it in such a way that the temp tables are created but heavy processing and side effects are avoided, and thus retrieve the metadata without penalties. The native client drivers from 2012 onward rely on newer functionality to retrieve metadata than the earlier drivers: they use the sp_describe_first_result_set proc. So whether you can get metadata or not is determined by the capability of sp_describe_first_result_set.
So while SSIS can automatically get the metadata (because of the driver/database), it does not automatically get the metadata in some cases (again because of the driver/database). In cases involving the second scenario, some other process (typically a human) can help the driver infer metadata or provide the metadata to the component directly.
To help the driver, in the case of SQL Server 2012 and after, you can use the WITH RESULT SETS clause to specify the output metadata. When this clause is present, the driver will use it and doesn't try to query the metadata from system objects, thus avoiding the error you would otherwise get. If you are using the drivers that came with SQL Server 2008, you can use FMTONLY. This option is at the driver/database level.
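A sketch of that clause, with a hypothetical procedure name and column list that you would replace with the proc's real output:

EXEC dbo.MyTempTableProc
WITH RESULT SETS
((
    OrderID   int,
    OrderDate datetime,
    Amount    decimal(18, 2)
));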
Another option is to use a Script Component as the source and specify the columns/metadata yourself in its Output section. SSIS would not try to retrieve metadata from the data source in this case, but would rely on the definitions you provided in the Output section of the Script Component.
As you can see, both options involve a human (or some other process) specifying the metadata instead of SSIS trying to retrieve it in an automated fashion. I would prefer the first option when working with SQL Server and the second option when working with databases like MySQL.

Data Cleanse ENTIRE Access Table of Specific Value (SQL Update Query Issues)

I've been searching for a quick way to do this after my first few thoughts have failed me, but I haven't found anything.
My Issue
I'm importing raw client data into an Access database, where the flat file they provide is parsed and converted into a standardized format for our organization. I do this for all of our clients, but this particular client's software gives us a file that puts "(NULL)" in every field that should be NULL. As a result, I have a ton of "(NULL)" strings rather than null fields!
My goal is to do a data cleanse of the entire TABLE, rather than perform the cleanse at the FIELD level (as I do in my temporary solution below).
Data Cleanse
Temporary Solution:
I can't add those strings to our data warehouse, so for now I just have a query with an IIF statement that replaces "(NULL)" with "" for each field (which took a while to set up, since the client file has roughly 96 fields). This works. However, we work with hundreds of clients, so I'd like to build a scalable solution that doesn't require many changes if another client has a similar file; not to mention that if this client changes something in their file, I might have to redo my field-specific statements.
Long-term Solution:
My first thought was an UPDATE query. I was hoping I could do something like:
UPDATE [ImportedRaw_T]
SET [ImportedRaw_T].* = ""
WHERE ((([ImportedRaw_T].* = "(NULL)")));
This would be easily scalable, since for further clients I'd only need to change the table name and replace "(NULL)" with their particular default. Unfortunately, you can't use SELECT * with an update query.
Can anyone think of a work-around to the SELECT * issue for the update query, or a better solution for cleansing an entire table rather than doing the cleanse at the field level?
SIDE NOTES
This conversion is 100% automated currently (Access is called via a watch folder batch), so anything requiring manual data manipulation / human intervention is out.
I've tried using a batch script to just cleanse the data in the .txt file before importing to Access - however, this caused an issue with the fixed-width format of the .txt, which has caused even larger issues with the automatic import of the file to Access. So I'd prefer to do this in Access if possible.
Any thoughts and suggestions are greatly appreciated. Thanks!
Unfortunately, it is impossible to implement this in SQL using wildcards instead of column names; there is no such syntax.
I would suggest a VBA solution: cycle through all the table's fields and, if a field's data type is string, generate and execute a SQL UPDATE command for that field.
Also, use Null instead of "" if you really need Nulls in the field rather than empty strings; they may behave differently in calculations.
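Each generated statement would then look something like this (the field name is just an example):

UPDATE [ImportedRaw_T]
SET [SomeTextField] = Null
WHERE [SomeTextField] = "(NULL)";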

At run time, how do I verify that the database schema matches my objects?

I have data access objects that have been generated by SqlMetal; however, the database is created by running a SQL script.
Is there an easy way to verify that all table and column names and types match the attributes on the classes that SqlMetal created?
I guess the easiest way to do this would be to have some kind of version number hidden in a config table in your schema. Then at run time, check the version number returned.
Much easier than doing a full scan. Set the version number in your SQL script and somewhere in your data access objects.
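A minimal sketch of that idea, assuming a single-row config table with illustrative names:

-- Created by the same SQL script that builds the rest of the schema
CREATE TABLE SchemaInfo (SchemaVersion INT NOT NULL);
INSERT INTO SchemaInfo (SchemaVersion) VALUES (42);

-- At start-up, the application reads this value and compares it with the
-- version its SqlMetal-generated classes were built against
SELECT SchemaVersion FROM SchemaInfo;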