I have been trying to run a report for my CEO that shows income. Our agency management software uses FoxPro databases (the software originally came out in the early '80s, I think). I have linked the .dbf files to an Access database, and I have been setting up queries that build on other queries so I can get the information I need on a live basis without having to export the data. The problem I have run into is that I cleaned up the selection criteria in the first query, but I did not run that query (each of these takes about ten minutes to run). When I ran the last query (whose data is based on the first), I still had bad data in the result.
So here's the dumb question: (a) do I need to create a macro that runs the queries (there are four of them) in sequence so that they are all updated each time, (b) is there some better way to do this, and/or (c) does Access automatically run the prior queries when I run the downstream query?
I am working on an application that is supposed to read data (JSON) from a particular source, do the required business transformations on it, and then load it into a MySQL database in real time (the time taken for all of this should be in milliseconds).
So far I have been able to do this using Apache Beam with the Spark runner and Java, but it takes longer than expected (approx. 25 secs for close to 2 million records).
I am very new to Apache Beam, and I would like to know whether there is something I can do to improve the application's performance, or whether I should move to some other tech stack that would help me achieve this.
When speed is the main criterion, you need to understand the SQL that is being generated by the intermediate package(s).
The layers (beam/runner/java/etc) are handy to make the logic a little simpler. But you cannot necessarily trust them to generate optimal SQL.
Plan A: Provide the generated SQL for critique.
Plan B: Jettison the layers and use SQL directly. (There would, of course, be a simple API that would clearly expose what the generated SQL will be.)
As a general rule with SQL, it is better to operate on hundreds of thousands of rows in a single query. Using a loop to issue hundreds or thousands of one-row SQL statements is likely to be literally 10 times as slow.
That may involve loading all the data unmodified into a table, then doing the ETL in SQL -- operating on one 'column' at a time instead of one 'row' at a time. Show us one JSON string together with SHOW CREATE TABLE and explain the "T" (transformation) of your ETL.
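To make that concrete, here is a minimal sketch of the staging-then-set-based-transform idea, assuming MySQL 5.7+ JSON functions; the table and column names, and the transformation itself, are invented placeholders, not your schema:

-- Staging table: bulk-load one raw JSON document per row, unmodified.
CREATE TABLE staging_raw (
    id  BIGINT AUTO_INCREMENT PRIMARY KEY,
    doc JSON NOT NULL
);

-- One set-based statement then does the "T" for the whole batch,
-- extracting a column at a time instead of looping over rows.
INSERT INTO target_orders (order_id, customer_id, amount, created_at)
SELECT doc->>'$.orderId',
       doc->>'$.customerId',
       CAST(doc->>'$.amount' AS DECIMAL(12,2)),
       STR_TO_DATE(doc->>'$.createdAt', '%Y-%m-%dT%H:%i:%s')
FROM   staging_raw;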
I've got a quiz application that is running several queries to a MySQL database. The backend application is running using Java. Every time the app runs a query to the database, there are additional queries that are being executed that I am not specifying within the application. As a result, this is causing a lot of additional overhead to the database, sometimes resulting in an error.
For example, I've got a 'Questions' table that only contains regular characters, such as the below:
The application does a simple SELECT * from Questions to get the list of questions. However, when that is executed, I can see in the database logs that there are 4 additional queries that are also run (the first I assume is the connectivity to the database), which I have not specified. Those are:
Could someone tell me why this is happening? Essentially, every query that is run against the database (specified by the application) produces the same 3 additional queries, which (to me) are coming out of nowhere.
I am an ex MultiValue developer who, over the last 6 months, has been thrust into the world of SQL, and apologies in advance for the length of the question. So far I have got by with general instinct (maybe some ignorance!) and a lot of help from the good people on this site answering previously asked questions.
First some background …
I have an existing reporting database (SQL Server) and a new application (using MySQL) that I am looking to copy data from at either 30-minute, hourly, or daily intervals (the interval will be based on reporting needs). I have a linked server created so that I can see the MySQL database from SQL Server, and I have the relevant privileges on both databases to do reads/writes/updates etc.
The data that I am looking to move to reporting on the 30-minute or hourly schedule is typically header/transaction in nature, and it has both created and modified date/time stamp columns available for use.
Looking at the reporting DB's other feeds, MERGE is the statement used most frequently across linked servers, but to other SQL Server databases. The MERGE statements also seem to do a full table-to-table comparison, which in some cases takes a while (>5 mins) to complete. Whilst MERGE seems to be a safe option, I do notice a performance hit on reporting whilst the larger tables are being processed.
In looking at delta loads only, using dynamic date ranges (e.g. between -1 hour:00:00 and -1 hour:59:59) on the created and modified timestamps, my concern would be that the failure of any one job execution could leave the databases out of sync.
Rather than initially asking for specific SQL statements, what I am looking for is a general approach/statement design for the more regularly (hourly) executed statements, the ideal being to safely perform delta loads of just the new or modified rows over a SQL Server to MySQL connection.
I hope the information given is sufficient; any help/suggestions/pointers to reading material are gratefully accepted.
Thanks in advance
Darren
I have done a bit of “playing” over the weekend.
The approach I have working pulls the data (inserts and updates) from MySQL via OPENQUERY into a CTE. I then MERGE the CTE into the SQL Server table.
The OPENQUERY seems slow (by comparison to other linked tables), but the MERGE is much faster due to limiting the amount of source data.
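For anyone following along, a rough T-SQL sketch of that pattern, with an invented linked server name (MYSQL_LINK) and invented table/column names; filtering on the modified timestamp inside the OPENQUERY string, with a generous overlap window, is what limits the source data and gives some protection against a missed run:

;WITH src AS (
    SELECT *
    FROM OPENQUERY(MYSQL_LINK,
        'SELECT id, col1, col2, created_at, modified_at
         FROM   app_db.transactions
         WHERE  modified_at >= NOW() - INTERVAL 2 HOUR')
)
MERGE dbo.Transactions AS tgt
USING src
   ON tgt.id = src.id
WHEN MATCHED THEN
    UPDATE SET tgt.col1 = src.col1,
               tgt.col2 = src.col2,
               tgt.modified_at = src.modified_at
WHEN NOT MATCHED BY TARGET THEN
    INSERT (id, col1, col2, created_at, modified_at)
    VALUES (src.id, src.col1, src.col2, src.created_at, src.modified_at);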
I am working on a project where I am storing data in a SQL Server database for data mining. I'm at the first step of data mining: collecting data.
All the data is currently stored in a SQL Server 2008 database, in a couple of different tables at the moment. The table adds about 100,000 rows per day.
At this rate the table will have more than a million records in about a month's time.
I am also running certain SELECT statements against these tables to get up-to-the-minute, real-time statistics.
My question is how to handle such large data without impacting query performance. I have already added some indexes to help with the select statements.
One idea is to archive the database once it hits a certain number of rows. Is this the best solution going forward?
Can anyone recommend the best way to handle such data, keeping in mind that down the road I want to do some data mining if possible?
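To make the archiving idea concrete, something like the following is roughly what I had in mind, run on a schedule (the table and column names here are just placeholders):

-- Move rows older than 90 days out of the "hot" table and into an archive
-- table in a single statement, so the realtime selects scan less data.
DELETE FROM dbo.SensorData
OUTPUT deleted.Id, deleted.DeviceId, deleted.Reading, deleted.CollectedAt
  INTO dbo.SensorData_Archive (Id, DeviceId, Reading, CollectedAt)
WHERE CollectedAt < DATEADD(DAY, -90, GETDATE());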
Thanks
UPDATE: I have not researched enough to decide what tool I would use for data mining. My first order of business is to collect the relevant information, and then do the data mining.
My question is how to manage the growing table so that running selects against it does not cause performance issues.
What tool will you be using to data mine? If you use a tool that reads from a relational source, then you should check the workload it is submitting to the database and optimise based on that. So you won't know what indexes you'll need until you actually start doing data mining.
If you are using the SQL Server data mining tools, then they pretty much run off SQL Server cubes (which pre-aggregate the data). So in this case you want to consider which data structure will allow you to build cubes quickly and easily.
That data structure would be a star schema. But there is additional work required to get it into a star schema, and in most cases you can build a cube off a normalised/OLTP structure OK.
So assuming you are using SQL Server data mining tools, your next step is to build a cube of the tables you have right now and see what challenges you have.
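For reference, a bare-bones sketch of what such a star schema could look like; the table and column names are placeholders, not a prescription for your data:

-- Dimension tables hold the descriptive attributes.
CREATE TABLE DimDate (
    DateKey   INT      PRIMARY KEY,   -- e.g. 20240131
    FullDate  DATE     NOT NULL,
    [Month]   TINYINT  NOT NULL,
    [Year]    SMALLINT NOT NULL
);

CREATE TABLE DimDevice (
    DeviceKey  INT IDENTITY PRIMARY KEY,
    DeviceName NVARCHAR(100) NOT NULL
);

-- The fact table holds the measures plus foreign keys to the dimensions;
-- a cube would pre-aggregate off this structure.
CREATE TABLE FactReadings (
    DateKey   INT NOT NULL REFERENCES DimDate(DateKey),
    DeviceKey INT NOT NULL REFERENCES DimDevice(DeviceKey),
    Reading   DECIMAL(18,4) NOT NULL
);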
I have a mysql query that is taking 8 seconds to execute/fetch (in workbench).
I won't go into the details of why it may be slow (I think the GROUP BY isn't helping, though).
What I really want to know is how I can basically cache it so it responds more quickly, because the tables only change about 5-10 times per hour, while users access the site thousands of times per hour.
Is there a way to just have the results regenerated/cached when the db changes so results are not constantly regenerated?
I'm quite new to SQL, so even basic suggestions may go a long way.
I am not familiar with such a caching facility in MySQL. There are alternatives.
One mechanism would be to use application level caching. The application would store the previous result and use that if possible. Note this wouldn't really work well for multiple users.
What you might want to do is store the report in a separate table and regenerate it every five minutes or so. This would be a simple mechanism, using a job scheduler to run the refresh.
A variation on this would be to have a stored procedure that first checks if the data has changed. If the underlying data has changed, then the stored procedure would regenerate the report table. When the stored procedure is done, the report table would be up-to-date.
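As a sketch only (it assumes the underlying table has an updated_at column and that a one-row report_meta table records when the report was last built; <ReportQuery> is the same placeholder used in the EDIT below), such a stored procedure might look like:

DELIMITER //
CREATE PROCEDURE refresh_report_if_changed()
BEGIN
    DECLARE last_change DATETIME;

    -- Has anything changed since the report table was last built?
    SELECT MAX(updated_at) INTO last_change FROM answers_raw;

    IF last_change > (SELECT refreshed_at FROM report_meta LIMIT 1) THEN
        DELETE FROM ReportTable;
        INSERT INTO ReportTable
        SELECT * FROM <ReportQuery>;
        UPDATE report_meta SET refreshed_at = last_change;
    END IF;
END//
DELIMITER ;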
An alternative would be to use triggers that fire whenever the underlying data changes. The trigger could re-run the query, storing the results in a table (as above). Alternatively, the trigger could just update the rows in the report that would have changed (harder, because it involves understanding the business logic behind the report).
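A minimal sketch of that second variant (incrementally updating the report), with invented table names; it assumes report_scores has a unique key on question_id:

DELIMITER //
CREATE TRIGGER trg_scores_after_insert
AFTER INSERT ON scores
FOR EACH ROW
BEGIN
    -- Bump the pre-computed count for the affected question only.
    INSERT INTO report_scores (question_id, answer_count)
    VALUES (NEW.question_id, 1)
    ON DUPLICATE KEY UPDATE answer_count = answer_count + 1;
END//
DELIMITER ;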
All of these require some change to the application. If your application query is stored in a view (something like vw_FetchReport1), then the change is trivial and all on the server side. If the query is embedded in the application, then you need to replace it with something else. I strongly advocate using views (or, in other databases, user-defined functions or stored procedures) for database access. This defines the API for the database application and greatly facilitates changes such as the ones described here.
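For example, once the report table exists, the view can simply point at it and the application keeps calling the same name:

CREATE VIEW vw_FetchReport1 AS
SELECT *
FROM   ReportTable;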
EDIT: (in response to comment)
More information about scheduling jobs in MySQL is here. I would expect the SQL code to be something like:
truncate table ReportTable;
insert into ReportTable
select * from <ReportQuery>;
(In practice, you would include column lists in the select and insert statements.)
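A sketch of how that refresh could be scheduled with MySQL's event scheduler (the scheduler has to be enabled, and <ReportQuery> is still a placeholder for your real report query):

SET GLOBAL event_scheduler = ON;

DELIMITER //
CREATE EVENT ev_refresh_report
ON SCHEDULE EVERY 5 MINUTE
DO
BEGIN
    TRUNCATE TABLE ReportTable;
    INSERT INTO ReportTable
    SELECT * FROM <ReportQuery>;
END//
DELIMITER ;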
A simple solution that can be used to speed up the response time of long-running queries is to periodically generate summarized tables, based on how often the underlying data refreshes or on business needs.
For example, if your business doesn't care about sub-minute "accuracy", you can run the process once a minute and have your user interface query this calculated table instead of summarizing the raw data on the fly.
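A hypothetical example of such a calculated table, refreshed once a minute by a scheduled job (all names invented):

-- Rebuild the summary from the raw data in one aggregate pass.
DELETE FROM question_stats;
INSERT INTO question_stats (question_id, attempts, correct_answers)
SELECT question_id, COUNT(*), SUM(is_correct)
FROM   answers_raw
GROUP  BY question_id;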