Handling very large datasets (>1M records) using Power Automate - csv

Before I go much further down this thought process, I wanted to check whether the idea was feasible at all. Essentially, I have two datasets, each of which will consist of ~500K records. For the sake of discussion, we can assume they will be presented in CSV files for ingesting.
Basically, what I'll need to do is take records from the first dataset, do a lookup against the second dataset, and then merge the two together and write the results to an output CSV file. The expected number of records after the merge will be in the range of 1.5-2M.
So, my questions are:
Will Power Automate allow me to work with CSV datasets of those sizes?
Will the "Apply to each" operator function across that large of a dataset?
Will Power Automate allow me to produce the export CSV file with that size?
Will the process actually complete, or will it eventually just hit some sort of internal timeout error?
I know that I can use more traditional services like SQL Server Integration Services for this, but I'm wondering whether Power Automate has matured enough to handle this level of ETL operation.
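For scale reference, the lookup-and-merge described above is the kind of operation that, outside Power Automate, can be sketched in a few lines. The file names and the join key below are hypothetical placeholders, not part of the original question:

    import pandas as pd

    # ~500K records each; "record_id" is a hypothetical join key
    primary = pd.read_csv("dataset_a.csv")
    lookup = pd.read_csv("dataset_b.csv")

    # A left join where lookup matches are one-to-many is what pushes
    # the output toward the 1.5-2M records mentioned above.
    merged = primary.merge(lookup, on="record_id", how="left")
    merged.to_csv("merged_output.csv", index=False)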

Related

Flat File as Input - MySQL Best Practice

I receive a flat file (CSV) each day, the contents of which get imported into my database (rather than data entry through a web form, POS or the like). There are 40 fields in a record and I'm up to 600,000 unique records.
Up until now, I haven't seen the need to make this a relational database, though there certainly is some normalization that would make it more efficient: repeating products, stores, customers, resellers, etc.
If I was starting this from the beginning and incrementally inputting the data somehow, I'd know how to do all that (every resource I've gone through covers it that way, but none cover it when you already have a large volume of data and need to make it relational). And with the CSVs coming in each day, I'm not quite sure how to import the data once the database is set up. If I were to split those 40 fields into, say, 5 tables, would I then have to split that daily file the same way and import them one at a time? Would foreign keys update that way?
If someone could push me in the right direction I'll go do more digging on my own.
If you were faced with the same project, how would you create such a database and perform the daily updates?
Thanks!
Create your database structure independently of what you have right now (CSV structure and data). E.g. organize your tables to fit your future needs, think through and define the relations between them well, and apply proper indexes.
As the second step - unavoidable in my opinion - write a little program in a programming language of your choice (a minimal sketch of such a program follows the list below). It should be able to
mainly read records/lines from a (CSV) file,
validate/sanitize the fetched data
import/save the data into the corresponding database tables, as needed. By "as needed" I mean that, in time, a multitude of factors can appear which could unexpectedly influence your initial structural decisions. For example, the need for some temporary tables. You should also take advantage of triggers and stored procedures.
properly handle the errors and exceptions raised along the importing process. For example, due to occasional "duplicate key" issues - because data in files can be error-prone - some records may not be importable on a given day. That doesn't mean that the import should break. Read a record, try to save it. If a problem appears, handle it (copy the line to another file, or save it in a special table, for later editing/revision and re-import) and let the program continue with the next records.
properly log all (main) operations and maintain counters of the read and of the problematic records.
automatically copy each daily file - after import - into a backup directory, until it's no longer needed.
optionally notify you by email about the status of the operations.
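A minimal sketch of such an import program, in Python, could look like the following. The table and column names are hypothetical, sqlite3 stands in for your MySQL connection, and a backup/ directory is assumed to exist:

    import csv
    import logging
    import shutil
    import sqlite3

    logging.basicConfig(filename="import.log", level=logging.INFO)

    def import_daily_file(csv_path, db_path="mydata.db"):
        conn = sqlite3.connect(db_path)
        read_count, error_count = 0, 0
        with open(csv_path, newline="") as src, \
             open(csv_path + ".rejected", "w", newline="") as rejected:
            reader = csv.DictReader(src)
            reject_writer = csv.DictWriter(rejected, fieldnames=reader.fieldnames)
            reject_writer.writeheader()
            for row in reader:
                read_count += 1
                try:
                    # Validate/sanitize, then save into the relevant tables.
                    if not row["product_code"]:
                        raise ValueError("missing product code")
                    conn.execute(
                        "INSERT INTO sales (product_code, store_id, amount) "
                        "VALUES (?, ?, ?)",
                        (row["product_code"], row["store_id"], row["amount"]),
                    )
                except Exception as exc:
                    # Don't break the whole import; set the record aside
                    # for later revision and re-import.
                    error_count += 1
                    reject_writer.writerow(row)
                    logging.warning("row %d rejected: %s", read_count, exc)
        conn.commit()
        conn.close()
        # Keep a copy of the processed file until it is no longer needed.
        shutil.copy(csv_path, "backup/")
        logging.info("%s: read=%d rejected=%d", csv_path, read_count, error_count)
        return read_count, error_count

A cron job or scheduled task can then call this function once or twice a day, as described in the third step below.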
The third step would be to find a solution to automate the whole cycle. For example, find a task/cron-job manager to start your program daily, once or even twice a day, without you having to do this manually.
Regarding splitting the file into different files based on your database structure: it wouldn't be necessary, i.e. it would be a redundant step, since your program should manage to read the file and handle the data import accordingly.
As for the type of program: it should be a web solution, so that you can access and modify it any time you need.
Good luck.

Analyzing multiple Json with Tableau

I'm beginning to use Tableau and I have a project involving multiple website logs stored as JSON. I have one log for each day for about a month, each weighing about 500-600 MB.
Is it possible to open (and join) multiple JSON files in Tableau? If yes, how? I can load them in parallel, but not join them.
EDIT: I can load multiple JSON files and define their relationship, so this is OK. I still have the memory issue:
I'm worried that by joining them all, I will not have enough memory to make it work. Are the loaded files stored in RAM or in an internal DB?
What would be the best way to do this? Should I merge all the JSON first, or load them into a database and use a connector to Tableau? If so, what would be a good choice of DB?
I'm aware some of these questions are opinion-based, but I have no clue about this and I really need some guidance to get started.
For this volume of data, you probably want to preprocess, filter, aggregate and index it ahead of time - either using a database, something like Parquet and Spark and/or Tableau extracts.
If you use extracts, you probably want to filter and aggregate them for specific purposes. Just be aware that if you aggregate the data when you make the extract, you need to be careful that any further aggregations you perform in the visualization are well defined. Additive functions like SUM(), MIN() and MAX() are safe: sums of partial sums are still correct sums. But averages of averages and count distincts of count distincts often are not.
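To make that caution concrete, here is a tiny made-up example showing how an average of pre-aggregated daily averages diverges from the true overall average:

    import pandas as pd

    df = pd.DataFrame({
        "day": ["mon", "mon", "tue"],
        "response_ms": [100, 100, 400],
    })

    daily_avg = df.groupby("day")["response_ms"].mean()  # mon=100, tue=400
    print(daily_avg.mean())          # 250.0 -- average of daily averages
    print(df["response_ms"].mean())  # 200.0 -- true overall average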
Tableau sends a query to the database and then renders a visualization based on the query result set. The volume of data returned depends on the query, which depends on what you specify in Tableau. Tableau caches results, and you can also create an extract which serves as a persistent, potentially filtered and aggregated, cache. See this related Stack Overflow answer.
For text files and extracts, Tableau loads them into memory via its Data Engine process today -- to be replaced by a new in-memory database called Hyper in the future. The concept is the same though: Tableau sends the data source a query which returns a result set. For data of the size you are talking about, you might want to test using some sort of database if the volume exceeds what comfortably fits in memory.
The JSON driver is very convenient for exploring JSON data, and I would definitely start there. You can avoid an entire ETL step if that serves your needs. But at high volume of data, you might need to move to some sort of external data source to handle production loads. FYI, the UNION feature with Tableau's JSON driver is not (yet) available as of version 10.1.
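If you do go the database route, a minimal sketch of the preprocessing step might look like the following. It assumes newline-delimited JSON (one event per line); sqlite3 is used here only as a stand-in for whatever database you would point Tableau at, and the events table and groupby columns are hypothetical:

    import glob
    import sqlite3
    import pandas as pd

    conn = sqlite3.connect("weblogs.db")

    for path in sorted(glob.glob("logs/*.json")):
        # One file per day; read_json(lines=True) expects one JSON object per line.
        df = pd.read_json(path, lines=True)
        # Filter/aggregate here if the raw events are too fine-grained, e.g.
        # df = df.groupby(["date", "page"]).size().reset_index(name="hits")
        df.to_sql("events", conn, if_exists="append", index=False)

    conn.close()

Tableau would then connect to that database (or to an extract built from it) instead of the raw JSON files.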
I think the answer which nobody gave is that No, you cannot join two JSON files in Tableau. Please correct me if I'm wrong.
I believe we can join 2 JSON tables in Tableau.
First, extract the columns from the JSON data as below:
select
  get_json_object(JSON_column, '$.Attribute1') as Attribute1,
  get_json_object(JSON_column, '$.Attribute2') as Attribute2
from table_name;
Perform the above for the required tables and join them.

Database (OLTP) and Reporting

I am working on a trading platform that has reporting as a big portion of its business.
The set up is the following:
SQL OLTP database (about 200 tables) – rather small in number of records (the biggest table has about 20,000 records, but it keeps growing every week).
For reporting services, SQL views are used to query the live transaction database. Imagine the result set of the views as a de-normalized one, in the spirit of a data warehouse approach. These data sets are then passed to a third-party reporting platform (like Tableau, Power BI or SiSense), which takes them and throws them into cubes (probably some columnar structure, like MongoDB, Hadoop, etc.). From there the reports are generated.
Current challenges.
The SQL views (about 8) are huge and very hard to maintain. To give you an example, one of the views outputs 100 fields, but each of these is a calculated field with complicated CASE statements, nested IF statements, inline functions, and what not, which makes this view as big as 700 lines of SQL code. I inherited these from another employee and now, sadly, I have to maintain them.
Because the data grows by several hundred records weekly (through migration and transactions) and the number of fields in the views also grows (a few every week), the cube build takes longer and longer. To give you an example, a few months ago we set the cube to rebuild every 10 minutes to refresh the data (the build was taking 5 minutes). It currently takes 12-15 minutes, so we set it to every 30 minutes. As you can imagine, this will get worse as the data and the number of fields keep growing, and we kind of need the data as current as possible.
The only good thing is that once the cube is built, the reports load fast because they are being pulled from the 3rd party platform, so no concerns here.
What I have in mind
I would like to get rid of the views so I can ease maintenance and also keep the duration of the cube rebuild to a minimum.
Options:
Build a data warehouse, and then build SSIS packages to populate this structure with the live transactional data. The de-normalized structure would probably look very similar to the views mentioned above. The drawback here is that I don't really feel like I'm simplifying much; I'm actually adding one more layer, which is the data migration from the OLTP to the OLAP (data warehouse). And I would still have to rebuild the cube.
Turn the current views into SQL indexed views (materialized views), but in their current state I simply cannot do it, because of the aggregates and inline functions used heavily across the entire view.
Another option I read about is to build an ODS (Operational Data Store) – a database containing the necessary tables, similar to the SQL views I have now, refreshed constantly – maybe using triggers, or transaction logs? But I am not sure what building such a thing involves or how hard it is to maintain.
Question:
What approach should I take?
Do any of the 3 above make any sense?
Of course, I am interested in other ideas or suggestions, as well.
Thank you!
From my experience, your best approach will be option 1. It is costly, but will give you better benefits. Create a ROLAP DWH (I recommend Kimball's "The Data Warehouse Toolkit" for best practices and design patterns), and if you have the opportunity, use a columnar data store (like Amazon Redshift or SAP Sybase IQ). All the CASE statements, nested IFs and other operations you mentioned would be applied at ETL time, so in the ROLAP everything is precalculated and optimized for consumption. And don't forget about applying indexes (depending on the underlying technology you use). Some database vendors have already published "indexing best practices" for ROLAP, which will tell you which type of index to apply depending on the type of table (dimension) and the data type, for example.

One SQL table for all or multiple tables per regression record?

I am moving a design flow, consisting of running a regression of multiple simulations on a server farm, from using files over NFS to using a MySQL DB for extra speed. (We have an associated flow that has just this optimisation, so we know it can work.)
We will probably run on the order of 1000 regressions over one year, each of approx 100K simulations, with each simulation storing a small record of its results/runtime/...
In the current flow, each regression's results are stored in a separate (CSV) file. Currently, each regression in the DB is stored in the same regressions table, and all simulation results from every regression are stored in the one sim_results table.
To minimise changes from the current flow, I would like to consider creating separate sim_results tables for each regression but
I don't know how to create a separate table from an individual regression record (which has ID as its primary index).
I don't know if I should do it this way, to better mimic the current flow, or go with the one sim_results table because it may be "The SQL way".
Help appreciated!
the SQL way is typically that you don't create multiple tables which each correspond to a different series of rows, except in the case where you're breaking out those tables for the purpose of sharding the data among multiple nodes (e.g. horizontal sharding). Horizontal sharding is generally a complex task that has lots of caveats.
But overall, the way you design your schema has to do with the use cases you need to suit. Particularly if you want to run queries over many simulations at once, storing all the data in a single series of tables is how you'd do that. If OTOH you don't really have any querying plans, then you probably don't need a relational DB in the first place.
I'm not sure of the format of your data, but one schema design that is common for large amounts of data to be "analyzed" is the star schema. The wikipedia page is a good read.
If you are heading towards creating many tables, SQLAlchemy's Table() construct is a Python data structure, which you can build programmatically. Build a function that creates new Table() objects as needed and then calls create() on them. I've worked with companies that have had to work hard to get off of this particular design though so I'd really consider if this scheme is worth it. Relational tables properly configured can store billions of rows without issue.
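If you do end up creating per-regression tables, a minimal sketch of the programmatic Table() approach mentioned above could look like this (the connection URL and result columns are hypothetical placeholders):

    from sqlalchemy import (Column, Float, Integer, MetaData, String,
                            Table, create_engine)

    # Placeholder DSN; point this at your actual MySQL instance.
    engine = create_engine("mysql+pymysql://user:pass@dbhost/regressions")
    metadata = MetaData()

    def make_results_table(regression_id):
        # Build a Table object for this regression, then emit CREATE TABLE.
        table = Table(
            "sim_results_%d" % regression_id,
            metadata,
            Column("sim_id", Integer, primary_key=True),
            Column("status", String(16)),
            Column("runtime_s", Float),
        )
        table.create(bind=engine, checkfirst=True)
        return table

That said, as noted above, a single properly indexed sim_results table keyed by regression ID is usually the simpler and more conventional design.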

Automate creation and deployment of SSRS report from single table query

What is the most efficient way to automate both creation and deployment of simple SSRS reports from one underlying query?
An example query might look like
SELECT Name, ID, Date FROM Errorlog
The query could contain quite a few columns and anywhere from 1 to 1 million rows.
The business purpose behind this question is that I have a sizable number of report queries that need to go out as SSRS reports. I also need the capacity to turn any query I write instantly (or within a matter of seconds) into a simple SSRS report. Unfortunately, doing it through BIDS manually (using toolbox items and creating datasets) is cumbersome, slow and unnecessarily repetitive. The only things I am concerned with are making sure the interactive page height/width is zero (to allow scrolling) and that columns are autosized.
How would you accomplish this in a way that is smooth and repeatable?
Let me start by saying that I don't think SSRS will be very good at this. Specifically, on two points this may be troublesome.
First, the number of rows may become a problem. One million rows is typically a bit much for Reporting Services 2008 (though it does depend on the context a bit); it's much better at displaying either aggregated data or a limited number of data rows (up to a few thousand, though again, depending on context).
Second, a dynamic number of columns being returned by the SQL side will be a problem. There are only two ways around this that I know of:
Have a denormalized data set with a fixed number of columns, and one or more columns that contain the grouping. Then use a matrix to generate columns dynamically in SSRS. This does have a considerable performance impact.
Generate the RDL dynamically. There's information on the schema to do this, and if you create a good starting point it's very possible. After generating the RDL you'll have to execute it - how to do that depends on your specific setup.
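A rough sketch of the second approach, under the assumption that you maintain a hand-built starting-point template.rdl containing {FIELDS} and {QUERY} placeholders (server and database names below are placeholders too, and the Tablix columns would need the same kind of treatment):

    import pyodbc

    QUERY = "SELECT Name, ID, Date FROM Errorlog"

    # Run the query once just to discover the column names.
    conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                          "SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes")
    cursor = conn.cursor()
    cursor.execute(QUERY)
    columns = [desc[0] for desc in cursor.description]

    # Splice <Field> definitions into the starting-point RDL.
    field_xml = "".join(
        '<Field Name="%s"><DataField>%s</DataField></Field>' % (c, c)
        for c in columns
    )
    with open("template.rdl", encoding="utf-8") as f:
        rdl = f.read().replace("{FIELDS}", field_xml).replace("{QUERY}", QUERY)
    with open("Errorlog.rdl", "w", encoding="utf-8") as f:
        f.write(rdl)

The generated .rdl then still has to be deployed and executed, for example via rs.exe or the Reporting Services web service, depending on your setup.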
Bottom line is that I wouldn't recommend using SSRS for the task you describe. Consider other technologies that may be better suited to this task, e.g. SSIS packages, or perhaps another custom-made or third-party tool.
If I were you, I'd utilize 'Access Data Projects', which have a wizard for creating reports that are then easy to upsize to Reporting Services. Right-click IMPORT into a solution full of RDL, and it prompts for an MS Access file.
You can easily make a couple of columns into a report using an Access wizard, and then upsize to SSRS. I've done it hundreds upon hundreds of times like this.