How to manage Apache Superset Filterbox queries on large database? - data-analysis

Apache Superset's Filterbox is a feature that lets users add filters dynamically to the charts on a dashboard. It works great for queries that execute in under a minute, but I don't understand how to deal with queries that can take 2 to 3 minutes or more.

FilterBox [soon to be deprecated] can be pointed at the fact table (it will generate a SELECT DISTINCT {dimension} FROM {table} LIMIT 10000). There's a field in the dataset configuration modal that lets you add a predicate to filter / limit the scan. For large tables we recommend filtering on an indexed or partitioned time field.
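As a sketch, the predicate field might hold something like the following (the partition column name `ds` is a hypothetical example; use whatever time column your table is indexed or partitioned on):

```sql
-- Hypothetical predicate for the dataset configuration field,
-- restricting the SELECT DISTINCT scan to recent partitions only:
ds >= CURRENT_DATE - INTERVAL 7 DAY
```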
It's also possible to point the Filter Box away from the large fact table and at a reference or dimension table instead, where the SELECT DISTINCT is much cheaper. The requirement is that the field name be the same, so the filter propagates properly to the other charts in the dashboard.
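For illustration, the query Superset would generate against a small dimension table looks like this (`dim_country` and the `country` column are hypothetical; the column name must match the field used on the other charts):

```sql
-- Cheap DISTINCT scan against a small reference table
-- instead of the large fact table.
SELECT DISTINCT country
FROM dim_country
LIMIT 10000;
```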

Related

MySQL - How to maintain acceptable response time while querying with many filters (Should I use Redis?)

I have a table called users with a couple dozen columns such as height, weight, city, state, country, age, gender etc...
The only keys/indices on the table are for the columns id and email.
I have a search feature on my website that filters users based on these various columns. The query could contain anywhere from zero to a few dozen different where clauses (such as where `age` > 40).
The search is set to LIMIT 50 and ORDER BY `id`.
There are about 100k rows in the database right now.
If I perform a search with zero or loose filters, MySQL basically just returns the first 50 rows and doesn't have to read many more rows than that. It usually takes less than 1 second to complete this type of query.
If I create a search with a lot of complex filters (for instance, 5+ where clauses), MySQL ends up reading through the entire database of 100k rows, trying to accumulate 50 valid rows, and the resulting query takes about 30 seconds.
How can I more efficiently query to improve the response time?
I am open to using caching (I already use Redis for other caching purposes, but I don't know where to start with properly caching a MySQL table).
I am open to adding indices, although there are a lot of different combinations of where clauses that can be built. Also, several of the columns are JSON where I am searching for rows that contain certain elements. To my knowledge I don't think an index is a viable solution for that type of query.
I am using MySQL version 8.0.15.
In general you need to create indexes on the columns mentioned in the criteria of your WHERE clauses. You can also index JSON columns by creating an index on a generated column: https://dev.mysql.com/doc/refman/8.0/en/create-table-secondary-indexes.html.
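For example, a generated-column index over a JSON attribute might look like the sketch below (the `profile` JSON column and the `$.city` path are hypothetical; requires MySQL 5.7+, and the question's 8.0.15 qualifies):

```sql
-- Expose a JSON attribute as an indexable generated column.
ALTER TABLE users
  ADD COLUMN city_name VARCHAR(64)
    GENERATED ALWAYS AS (profile->>'$.city') STORED,
  ADD INDEX idx_city_name (city_name);

-- Queries filtering on the matching expression can now use the index:
SELECT id FROM users WHERE profile->>'$.city' = 'Berlin';
```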
Per the responses in the comments from ysth and Paul, the problem was just server capacity. After upgrading to an 8 GB RAM server, query times dropped to under 1 second.

Solr indexing structure with MySQL

I have three to five search fields in my application and am planning to integrate this with Apache Solr. I tried the same with a single table and it is working fine. Here are my questions.
Can we index multiple tables in the same core? Or should I create a separate core for each index (I guess this concept is wrong)?
Suppose I have 4 tables: users, careers, education and location. I have two search boxes on a PHP page: one searches simple locations (just like an autocomplete box) and the other searches for a keyword, which should check the careers and education tables. If multiple indexes are possible under a single core:
2.1 How do we define the query here ?
2.2 Can we specify index name in query (like table name in mysql) ?
Links which can answer my concerns are enough.
If you're expecting to query the same data as part of the same request, such as auto-completing users, educations and locations at the same time, indexing them to the same core is probably what you want.
The term "core" is probably identical to the term "index" in your usage, and having multiple sets of data in the same index will usually be achieved through having a field that indicates the type of document (and then applying a filter query if you want to get documents of only one type, such as fq=type:location). You can use the grouping feature of Solr to get separate result sets of documents back for each query as well.
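As a sketch, requests against a single core holding mixed document types might look like this (the core name `app` and the field values are hypothetical; `fq`, `group`, and `group.field` are standard Solr parameters):

```
http://localhost:8983/solr/app/select?q=name:jo*&fq=type:location
http://localhost:8983/solr/app/select?q=keyword&group=true&group.field=type
```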
If you're only ever going to query the data separately, having them in separate indexes is probably the way to go, as you'll be able to scale, analyze, and tune each index independently in that case (and avoid always having to add a filter query to get the type of content you're looking for).
Specifying the index name is the same as specifying the core, and is part of the URL to Solr: http://localhost:8983/solr/index1/ or http://localhost:8983/solr/index2/.

MYSQL Group By Date Performance Query

I'm currently writing a query that will run multiple times on a MyISAM table in a MySQL DB.
The query aggregates a lot of rows (could be anything up to 100,000+) into monthly totals. The SQL I'm currently using is
SELECT DATE_FORMAT(ct_cdatetime, "%m-%Y") AS Month, SUM(ct_total), SUM(ct_charge)
FROM client_transaction
WHERE (...omitted for this example...)
GROUP BY DATE_FORMAT(ct_cdatetime, "%m-%Y") ORDER BY ct_cdatetime ASC
I'm aware of the performance issue of forcing MySQL to cast the date to a string. Would it be quicker and / or better practice to
1) Leave it as is
2) Select all the rows and group them in PHP in an array.
3) Have a month-year int field in the database and update it when I add the row (e.g. 714 for July 2014)?
The answer to the question of which is fastest is simple: try both and measure.
The performance on the SQL side is not really affected by the date conversion. It is determined by the group by, and in particular, the sorting for the ordering.
I am skeptical that transferring the data and doing the work on the application side would be faster. In particular, you have to transfer a large (ish) amount of data from the database to the application. Then, you have to replicate the work in the application that would be done in the database.
However, the in-memory algorithms on the application side can be faster than the more generic algorithms on the database-side. It is possible that doing the work on the application side could be faster.
The following query is faster because it just extracts the components of a date instead of creating a formatted text:
SELECT YEAR(dtfield), MONTH(dtfield), COUNT(*)
FROM mytable
GROUP BY YEAR(dtfield), MONTH(dtfield);
Please keep in mind that MySQL cannot use any index optimization for the GROUP BY criteria when functions are used. If you have a big table and need these computations often, create separate indexed columns for them (as noted by hellcode).
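A sketch of the indexed-column idea, option 3 from the question, using a generated column so the value maintains itself (requires MySQL 5.7+; on older versions you would populate the column in application code; column names follow the question):

```sql
-- Precompute an indexable month key (e.g. 201407 for July 2014).
ALTER TABLE client_transaction
  ADD COLUMN ct_month INT
    GENERATED ALWAYS AS (YEAR(ct_cdatetime) * 100 + MONTH(ct_cdatetime)) STORED,
  ADD INDEX idx_ct_month (ct_month);

-- The GROUP BY can now work from the index instead of a function result.
SELECT ct_month, SUM(ct_total), SUM(ct_charge)
FROM client_transaction
GROUP BY ct_month
ORDER BY ct_month ASC;
```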

What is difference between INDEX and VIEW in MySQL

Which one is faster, an index or a view? Both are used for optimization and both are applied to a table's columns. Can anyone explain the difference between them, which one is faster, and in which scenarios we should use a view versus an index?
VIEW
A view is a logical table. It is not a physical object: it stores no data of its own and simply refers to data that is stored in base tables.
A view is a logical entity, a SQL statement stored in the database's system catalog. When the view is queried, the engine may materialize its data in a temporary table.
INDEX
Indexes are pointers that map to the physical address of data, so using indexes makes data retrieval faster.
An index is a performance-tuning method of allowing faster retrieval of records. An index creates an entry for each value that appears in the indexed columns.
ANALOGY:
Suppose in a shop, assume you have multiple racks. Categorizing each rack based on the items saved is like creating an index. So, you would know where exactly to look for to find a particular item. This is indexing.
In the same shop, if you want to see several kinds of data, say products, inventory, and sales, as one consolidated report, that can be compared to a view.
Hope this analogy explains when you have to use a view and when you have to use an index!
Both are different things in the perspective of SQL.
VIEWS
A view is nothing more than a SQL statement that is stored in the database with an associated name. A view is actually a composition of a table in the form of a predefined SQL query.
A view can contain all rows of a table or only selected rows, and can be created from one or many tables, depending on the SQL query written to create it. Views, which are a kind of virtual table, allow users to do the following:
Structure data in a way that users or classes of users find natural or intuitive.
Restrict access to the data such that a user can see and (sometimes) modify exactly what they need and no more.
Summarize data from various tables which can be used to generate reports.
INDEXES
While Indexes are special lookup tables that the database search engine can use to speed up data retrieval. Simply put, an index is a pointer to data in a table. An index in a database is very similar to an index in the back of a book.
For example, if you want to reference all pages in a book that discuss a certain topic, you first refer to the index, which lists all topics alphabetically and are then referred to one or more specific page numbers.
An index helps speed up SELECT queries and WHERE clauses, but it slows down data input, with UPDATE and INSERT statements. Indexes can be created or dropped with no effect on the data.
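To make the distinction concrete, here is what creating each looks like (the `users` table and its columns are hypothetical):

```sql
-- A view: a stored query with a name; no extra data on disk.
CREATE VIEW active_users AS
SELECT id, name, email
FROM users
WHERE status = 'active';

-- An index: an extra lookup structure that speeds up reads on `status`
-- at the cost of slightly slower INSERT/UPDATE on the table.
CREATE INDEX idx_users_status ON users (status);

-- The view is then queried like a table:
SELECT name FROM active_users;
```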
view:
1) A view is also one of the database objects.
A view contains the logical data of a base table, whereas the base table holds the actual (physical) data. Put another way, a view is like a window through which data from a table can be viewed or changed.
2) It is simply a stored SQL statement with an object name. It can be used in any SELECT statement like a table.
index:
1) Indexes are created on columns; by using indexes, rows can be fetched quickly.
2) It is a way of cataloging the table info based on one or more columns. One table may contain one or more indexes. An index is like a 2-D structure holding the ROWID and the indexed column (ordered). When table data is retrieved based on that column (i.e. it appears in the WHERE clause), the index comes into the picture automatically and its pointers locate the required ROWIDs. These ROWIDs are then matched with the actual table's ROWIDs and the matching records are returned.

Fast mysql query to randomly select N usernames

In my JSP application I have a search box that lets users search for user names in the database. I send an AJAX call on each keystroke and fetch 5 random names starting with the entered string.
I am using the below query:
select userid,name,pic from tbl_mst_users where name like 'queryStr%' order by rand() limit 5
But this is very slow as I have more than 2000 records in my table.
Is there any better approach that takes less time and lets me achieve the same? I need random values.
How slow is "very slow", in seconds?
The reason why your query could be slow is most likely that you didn't place an index on name. 2000 rows should be a piece of cake for MySQL to handle.
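If the index is missing, adding one is a one-line change (the index name is arbitrary); a prefix LIKE such as 'queryStr%' can use a B-tree index on `name`:

```sql
-- Let the LIKE 'queryStr%' predicate seek the index instead of scanning.
ALTER TABLE tbl_mst_users ADD INDEX idx_name (name);
```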
The other possible reason is that you have many columns in the SELECT clause. I assume in this case the MySQL engine first copies all this data to a temp table before sorting this large result set.
I advise the following, so that you work only with indexes, for as long as possible:
SELECT userid, name, pic
FROM tbl_mst_users
JOIN (
-- here, MySQL works on indexes only
SELECT userid
FROM tbl_mst_users
WHERE name LIKE 'queryStr%'
ORDER BY RAND() LIMIT 5
) AS sub USING(userid); -- join other columns only after picking the rows in the sub-query.
This method is a bit better, but still does not scale well. However, it should be sufficient for small tables (2000 rows is, indeed, small).
The link provided by #user1461434 is quite interesting. It describes a solution with almost constant performance. Only drawback is that it returns only one random row at a time.
1. Does the table have an index on name? If not, add one.
2. MediaWiki uses an interesting trick (for Wikipedia's Special:Random feature): the table with the articles has an extra column with a random number (generated when the article is created). To get a random article, generate a random number and fetch the article with the next larger or smaller (I don't recall which) value in the random-number column. With an index, this can be very fast. (And MediaWiki is written in PHP and developed for MySQL.)
This approach can cause a problem if the resulting numbers are badly distributed; IIRC, this has been fixed in MediaWiki, so if you decide to do it this way you should take a look at the code to see how it's currently done (they probably regenerate the random-number column periodically).
3. http://jan.kneschke.de/projects/mysql/order-by-rand/
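A sketch of the MediaWiki-style trick in MySQL terms, assuming a hypothetical indexed `rand_val` column populated with RAND() at insert time; as noted above, its drawback is that it returns one random row per query:

```sql
-- Seek to the next row at or above a fresh random number;
-- the index on rand_val makes this a cheap range seek
-- instead of a full sort.
SELECT userid, name, pic
FROM tbl_mst_users
WHERE rand_val >= RAND()
ORDER BY rand_val
LIMIT 1;
```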