What are Azure Data Explorer external table partitions good for?

Adding a partition to the external table definition does not help with a query on the partition.
Blob path example
/data/1234/2021/12/02/9483D.parquet
/data/1235/2021/12/02/12345.parquet
Partition (pseudo syntax, not the real one): '/data/'uniqueid'/yyyy/MM/dd/'
So only two uniqueid values appear in the storage paths. The total file count is ~1 million, spread across different dates in the path.
So I defined 2 partitions as virtual columns:
uniqueid
datetime
Executing a query on the uniqueid like:
table | summarize by uniqueid
goes over all files in the blob storage for some reason.
Since uniqueid is a partition and a virtual column, shouldn't the query be very fast, given that there are only 2 values for it in the path?
Am I totally missing the point of partitioning?
EDIT: add sample:
.create external table ['sensordata'] (['timestamp']:long,['value']:real)
kind = adl
partition by (['uniqueid']:string ,['datecreated']:datetime )
pathformat = (['uniqueid'] '/' datetime_pattern("yyyy/MM/dd", ['datecreated']))
dataformat = parquet
(
h@'abfss://XXXXXX@YYYYYYYY.dfs.core.windows.net/histdata;impersonate'
)
with (FileExtension='.parquet')
Query sample:
sensordata
| summarize by uniqueid

Thanks for your input, @user998888.
We have many optimizations for partitioned external tables, and we invest significant effort in adding more. But we haven't yet optimized the kind of query you provided. It's on our list.

Related

Specific SQL queries with JOIN vs. multiple methods

I have a NodeJS app and a MySQL database. I use the mysql npm package to access it.
Very often in my SQL schema I have relations between tables like this:
Table User:
SERIAL id
...
Table Group:
SERIAL id
...
Table Group_Constraints:
SERIAL ConstraintId
...
Table UserToGroup:
BIGINT UserId
BIGINT GroupId
Table GroupToConstraint:
BIGINT GroupId
BIGINT ConstraintId
Now, in my User model, I want to have a function that gives me the users, their groups, and each group's constraints.
For now, I do a big custom SQL request with some JOINs. It works, but it leaves me with many "custom" functions like getUsersWithGroupAndConstraints.
As I have this design in many different places in my code, it becomes very hard to maintain.
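For illustration, such a query might look roughly like this, using the tables sketched above (anything beyond the id/key columns listed there is an assumption):
-- One possible shape of the "big custom SQL request" with JOINs;
-- table and key names are taken from the schema sketch above.
SELECT u.id AS user_id, g.id AS group_id, gc.ConstraintId
FROM `User` u
JOIN UserToGroup ug ON ug.UserId = u.id
JOIN `Group` g ON g.id = ug.GroupId              -- GROUP is a reserved word in MySQL, hence the backticks
JOIN GroupToConstraint gtc ON gtc.GroupId = g.id
JOIN Group_Constraints gc ON gc.ConstraintId = gtc.ConstraintId;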
I wish my code were a bit more generic, with User/Group/Constraint models that I would then query this way:
User.getAll().forEach(user =>
Group.getAll(user).forEach(group =>
Constraint.getAll(group)))
But that would go from 1 SQL query to User.length * Group.length SQL queries.
I cannot find a way to achieve a clean design without a HUGE number of SQL queries and therefore, I guess, very poor performance.
How can I do this?

MySql SELECT Query performance issues in huge database

I have a pretty huge MySQL database and I am having performance issues while selecting data. Let me first explain what I am doing in my project: I have a list of files. Every file should be analyzed with a number of tools. The result of the analysis is stored in a results table.
I have one table with files (samples). The table contains about 10 million rows. The schema looks like this:
idsample|sha256|path|...
The other (really small table) is a table which identifies the tool. Schema:
idtool|name
The third table is going to be the biggest one. The table contains all results of the tools I am using to analyze the files (The number of rows will be the number of files TIMES the number of tools). Schema:
id|idsample|idtool|result information| ...
What I am looking for is a query, which returns UNPROCESSED files for a given tool id (where no result exists yet).
The (most efficient) way I found so far to query those entries is the following:
SELECT
s.idsample
FROM
samples AS s
WHERE
s.idsample NOT IN (
SELECT
idsample
FROM
results
WHERE
idtool = 1
)
LIMIT 100
The problem is that the query is getting slower and slower as the results table is growing.
Do you have any suggestions for improvements? One further problem is that I cannot change the structure of the tables, as this is a shared database which is also used by other projects. (I think) the only way to improve things is to find a more efficient SELECT query.
Thank you very much,
Philipp
A LEFT JOIN may perform better, especially if idsample is indexed in both tables; in my experience, these kinds of queries are better served by JOINs than by that kind of subquery.
SELECT s.idsample
FROM samples AS s
LEFT JOIN results AS r ON s.idsample = r.idsample AND r.idtool = 1
WHERE r.idsample IS NULL
LIMIT 100
;
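If those indexes are not already in place, they might look something like this (the index names are assumptions, and the first one is unnecessary if idsample is already the primary key of samples):
-- Indexes on the join/filter columns so the LEFT JOIN can avoid scanning results.
CREATE INDEX idx_samples_idsample ON samples (idsample);
CREATE INDEX idx_results_sample_tool ON results (idsample, idtool);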
Another more involved possible solution would be to create a fourth table with the full "unprocessed list", and then use triggers on the other three tables to maintain it; i.e.
when a new tool is added, add all the current files to that fourth table (with the new tool).
when a new file is added, add all the current tools to that fourth table (with the new file).
when a new result is entered, remove the corresponding record from the fourth table (a sketch of this trigger follows below).
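A minimal sketch of that last trigger, assuming the fourth table is named unprocessed and has (idsample, idtool) columns:
-- Remove a file/tool pair from the to-do list once its result exists.
CREATE TRIGGER trg_results_after_insert
AFTER INSERT ON results
FOR EACH ROW
DELETE FROM unprocessed
WHERE idsample = NEW.idsample
  AND idtool = NEW.idtool;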

Select decoded JSON data from joined MySQL tables

Could you please tell me how to make a selection from a MySQL database if I have two tables with JSON data? One of them has the following structure:
Table Trees
(id, name, value) - three columns
which includes following data
1, trees, [{"name":"Oaktree","value":1,"target":null},{"name":"Appletree","value":2,"target":null},{"name":"Plumtree","value":3,"target":null}]
2, length, [{"name":"10m","value":1,"target":null},{"name":"15m","value":2,"target":null},{"name":"20m","value":3,"target":null}]
3, age, [{"name":"5y","value":1,"target":null},{"name":"10y","value":2,"target":null},{"name":"20y","value":3,"target":null}]
The second table has the following structure:
Table SelectedTrees
(properties) - only one column
which includes the following data
[{"id":"1","value":["1","3"]},{"id":"2","value":["1", "2", "3"]},{"id":"3","value":["2"]}]
It contains the selected data from the Trees table: the id in the properties column of SelectedTrees corresponds to the id column of the Trees table. I would like to select the real (json_decoded) values from the database, like:
Trees = Oaktree, Plumtree
Length = 10m, 15m, 20m
Age = 10y
How could I do this?
Thanks in advance.
Jan
In a nutshell, this is not possible. Relational databases are built for quickly comparing constant values that they can index. JSON is just a string to MySQL, and any kind of partial string matching triggers a so-called table scan, which is essentially going to become freaking slow when you get serious amounts of data.
You COULD get it to work like this:
SELECT * FROM Trees
JOIN SelectedTrees
ON properties LIKE CONCAT('%"id":"', Trees.id, '"%')
This is however just a hack that you should never want to use in any production system, and I advise against using it in a test system. Instead refactor your database so there's never going to be any JSON in there that you are going to match on in your queries. It's fine to store secondary data as JSON, just make sure the IDs and names are extracted before insertion, and then insert in separate columns in the database tables so the DB engine can do its relational magic.
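For illustration, one possible refactoring along those lines might look like this (all table and column names below are assumptions, not part of the original schema):
-- Store each option and each selection as plain indexed rows instead of JSON blobs.
CREATE TABLE TreeOptions (
    tree_id INT NOT NULL,          -- references Trees.id ('trees', 'length', 'age')
    value   INT NOT NULL,          -- e.g. 1, 2, 3
    name    VARCHAR(50) NOT NULL,  -- e.g. 'Oaktree', '10m', '5y'
    PRIMARY KEY (tree_id, value)
);
CREATE TABLE SelectedTreeValues (
    tree_id INT NOT NULL,          -- references Trees.id
    value   INT NOT NULL,          -- references TreeOptions.value
    PRIMARY KEY (tree_id, value)
);
-- The desired output then becomes a plain join:
SELECT t.name AS category, GROUP_CONCAT(o.name) AS selected
FROM Trees t
JOIN SelectedTreeValues s ON s.tree_id = t.id
JOIN TreeOptions o ON o.tree_id = s.tree_id AND o.value = s.value
GROUP BY t.name;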

Performance Issues with DB Design and Heavy Data

I asked the following question regarding DB Design and Performance issue in my application today.
DB Design and Data Retrieval from a heavy table
But I could not get many replies on that; I may not have explained the question properly. Now I have re-defined my question, hoping to get some suggestions from the experts.
I am facing performance issues while selecting data from a particular table. The business logic of the application is as following:
I have a number of import processes whose results are shown to the user as pivoted columns under their parent column names. As the columns get pivoted, the system takes time to convert rows into columns, which results in slow performance.
The database tables related to this functionality are as following:
There can be N number of clients. CLT_Clients table stores client information.
There can be N number of projects associated to a client. PRJ_Projects table stores project information and a link to the client.
There can be N number of listings associated to a project. PRJ_Listings table stores listing information and a link to the project.
There can be N number of source entities associated to a listing. ST_Entities table stores source entity information and a link to the listing.
This source entity is the actual import that contains the InvestorID, position values, source date, active and formula status.
The name of the import, e.g. L1Entity1, is stored in the ST_Entities table along with an ID field, i.e. EntityID.
The InvestorID, Position, Source Date, Active and Formula values get stored in the ST_Positions table.
Database Diagram
The data needs to be viewed as follows:
With this design I’m able to handle N number of imports because the Position, Source Date, IsActive, Formula columns get Pivoted.
The problem I'm facing with this design is that the system performs very slowly when it has to select data for more than 10-12 source entities, and the requirement is to show about 150 source entities. Because the data is stored row-wise and I need to show it column-wise, dynamic queries are written to pivot these columns, which takes a long time.
Ques 1: Please comment on whether my current database design is correct or needs to be changed to a new design with 150 columns each for Position, Source Date, IsActive and Formula. In that new design the data would already be stored the way I need to retrieve it, i.e. I would not have to pivot/unpivot it. But the downsides are:
a) There would be more than 600 columns in this table.
b) There would be a limit, i.e. 150, on the number of source entities.
Ques 2: If I need to stick to my current design, what can be done to improve the performance?
Please see below the indexing and Pivot method information:
Regarding indexes on the Position table, I have also put a clustered index on the ProjectID field, as data is selected from the Position table on the basis of either ProjectID or EntityID.
Whenever EntityID is used to select data from the Position table, it is always used in a JOIN. And whenever ProjectID is used to select data from this table, it is always used in a WHERE clause.
The point to note here is that I have a clustered index on ProjectID but no index on the pivoted column or on EntityID. Is there any room for improvement here?
Pivoting Method used:
Example 1:
'Select * From
(
Select DD.InvestorID,Cast(1 As Bit) As IsDSInvestor,DD.Position,
Case DD.ProjectID
When ' + CAST(@ProjectID AS VARCHAR) +' Then DE.SourceName
Else ''' + @PPDeliveryDate + '''+DE.SourceName
End As SourceName
From DE_PositionData DD
Inner Join DE_DataEntities DE ON DE.EntityID=DD.EntityID
Where DD.ProjectID IN (' + CAST(@ProjectID AS VARCHAR) +',' + CAST(@PreviousProjectID AS VARCHAR) +') AND InvestorID IS NOT NULL
) IDD
Pivot
(
max(Position) for SourceName in ('+ @DataColumns+')
) as p1'
Example2:
'Select * From
(
Select DD.InvestorID As DSSOFID,Cast(1 As Bit) As IsActiveInvestor,
Case ST.SourceTypeCode
When ''RSH'' Then Cast(IsNull(DD.IsActive,0) As Int)
Else Cast(IsNull(DD.IsActive,1) As Int)
End As IsActive,
''~''+DE.SourceName As ActiveSourceName
From DE_DataEntities DE
Left Join DE_PositionData DD ON DE.EntityID=DD.EntityID
Left Join
(
Select * From #DataSources
Where ProjectID=' + CAST(@ProjectID AS VARCHAR) +'
) ST ON ST.ESourceID=DE.ESourceID
Where DE.ProjectID=' + CAST(@ProjectID AS VARCHAR) +' AND ST.SourceTypeCode NOT IN (''PBC'',''EBL'',''REG'')
AND InvestorID IS NOT NULL
) IDD
Pivot
(
Max(IsActive) for ActiveSourceName in ('+ @DataColumns+')
) As p1'
I would suggest the following.
You should store your data in the normalized format. You should be able to set up indexes to make the pivoting of the data go faster. If you post the actual query that you are using for pivoting, we might be able to help.
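For illustration, an index to support the JOIN on EntityID used by the question's pivot queries might look like this (the index name and the INCLUDE list are assumptions):
-- Nonclustered index so the join from DE_DataEntities to DE_PositionData can seek on EntityID.
CREATE NONCLUSTERED INDEX IX_DE_PositionData_EntityID
    ON DE_PositionData (EntityID)
    INCLUDE (InvestorID, Position, IsActive);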
There are many reasons you want to store the data this way:
Flexibility in the number of repeating blocks
Many databases have limits on the number of columns or total width of a table. You don't want your design to approach those limits.
You may want additional information on each block. I always include CreatedBy and CreatedAt columns in my tables, and you would want this per block.
You have additional flexibility in summarization.
Adding/removing intermediate values is cumbersome in the wide, pivoted layout.
That said, the pivoted table has one key advantage: it is what users want to see. If your data is updated only once per day, then you should create a reporting table with the pivot.
If your data is updated incrementally throughout the day, then you can set up triggers (or stored procedure code) to update the base tables and the reporting summary.
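A minimal sketch of such a reporting table, using the ST_Entities/ST_Positions layout from the question with just two import names for brevity (the SourceName and EntityID columns of those tables, and the RPT_Positions name, are assumptions):
-- Rebuild the reporting table with the pivoted view of positions.
IF OBJECT_ID('RPT_Positions') IS NOT NULL DROP TABLE RPT_Positions;
SELECT InvestorID, [L1Entity1], [L1Entity2]
INTO RPT_Positions
FROM
(
    SELECT P.InvestorID, E.SourceName, P.Position
    FROM ST_Positions AS P
    INNER JOIN ST_Entities AS E ON E.EntityID = P.EntityID
) AS src
PIVOT
(
    MAX(Position) FOR SourceName IN ([L1Entity1], [L1Entity2])
) AS pvt;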
But as I said earlier, you should ask another question about the particular method you are using for pivoting. Perhaps we can improve the performance there.

DB Design and Data Retrieval from a heavy table

I have a requirement to have 612 columns in my database table. The number of columns per data type is:
BigInt – 150 (PositionCol1, PositionCol2, …, PositionCol150)
Int – 5
SmallInt – 5
Date – 150 (SourceDateCol1, SourceDateCol2, …, SourceDateCol150)
DateTime – 2
Varchar(2000) – 150 (FormulaCol1, FormulaCol2, …, FormulaCol150)
Bit – 150 (IsActiveCol1, IsActiveCol2, …, IsActiveCol150)
When the user does the import for the first time, the data gets stored in PositionCol1, SourceDateCol1, FormulaCol1, IsActiveCol1, etc. (plus the other DateTime, Int and SmallInt columns).
When the user does the import for the second time, the data gets stored in PositionCol2, SourceDateCol2, FormulaCol2, IsActiveCol2, etc., and so on.
There is a ProjectID column in the table for which data is being imported.
Before starting the import process, the user maps the Excel column names to the database column names (PositionCol1, SourceDateCol1, FormulaCol1, IsActiveCol1), and this mapping gets stored in a separate table, so that when the data is retrieved it can be shown under these mapped column names instead of the DB column names. E.g.
PositionCol1 may be mapped to SAPDATA
SourceDateCol1 may be mapped to SAPDATE
FormulaCol1 may be mapped to SAPFORMULA
IsActiveCol1 may be mapped to SAPISACTIVE
40,000 rows will be added to this table every day; my question is whether SQL will be able to handle the load of that much data in the long run.
Most of the time, a row will have data in about 200-300 columns; in the worst case it will have data in all 612 columns. With this in mind, should I make some changes to the design to avoid future performance issues? If so, please suggest what could be done.
If I stick with my current design, what points should I take care of, apart from indexing, to get optimal performance when retrieving data from this huge table?
If I need to retrieve data for a particular entity, e.g. SAPDATA, I'll have to go to my mapping table, get the database column name mapped to SAPDATA (PositionCol1 in this case), and retrieve it. But that way I'll have to write dynamic queries. Is there any better way?
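For illustration, the kind of dynamic query being described might look like this (ColumnMapping, DBColumnName, MappedName and ImportData are hypothetical names for the mapping table, its columns and the 612-column table):
-- Hypothetical sketch: look up which physical column SAPDATA maps to,
-- then build and run a dynamic query against the wide table.
DECLARE @ProjectID int = 1;
DECLARE @col sysname, @sql nvarchar(max);
SELECT @col = DBColumnName              -- e.g. 'PositionCol1'
FROM ColumnMapping
WHERE MappedName = 'SAPDATA';
SET @sql = N'SELECT ' + QUOTENAME(@col) + N' FROM ImportData WHERE ProjectID = @pid';
EXEC sp_executesql @sql, N'@pid int', @pid = @ProjectID;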
Don't stick with your current design. Your repeating groups are unwieldy and self-limiting... What happens when somebody uploads 151 times? Normalise this table so that you have one of each type per row rather than 150. You won't need the mapping this way, as you can select SAPDATA from the position column without worrying whether it is 1-150.
You probably want a PROJECTS table with an ID, and a PROJECT_UPLOADS table with an ID and an FK to the PROJECTS table. This table would have Position, SourceDate, Formula and IsActive columns, given your use case above.
Then you could do things like
select p.name, pu.position from PROJECTS p inner join PROJECT_UPLOADS pu on pu.projectid = p.id WHERE pu.position = 'SAPDATA'
etc.