How to analyze the output data to obtain the required information - data-analysis

I have run blastn (ncbi-blast+) with my miRNA reads against the miRBase mature miRNA sequences and obtained the output. I would like to analyze the data and separate out information such as query read, length, read count, % identity, gaps, Q.start, description, etc. from the output.
How should I proceed further? Is there any script or command that would be helpful?
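If the search can be re-run, the simplest route is blastn's tabular output (-outfmt 6), which already splits each hit into columns (query id, % identity, alignment length, gaps, query start, etc.) and can be loaded straight into a table. Note that the read count is not part of the BLAST output itself, so it would have to come from the read IDs or a separate count file. Below is a minimal sketch, assuming the output is regenerated with an extra stitle column for the description and saved as hits.tsv (the command line and file names are my assumptions):

# Assumed re-run of the search with tabular output, e.g.:
#   blastn -query mirna_reads.fa -db mature_mirbase -outfmt "6 std stitle" -out hits.tsv
import pandas as pd

# The 12 standard tabular fields plus the subject title (description).
cols = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
        "qstart", "qend", "sstart", "send", "evalue", "bitscore", "stitle"]
hits = pd.read_csv("hits.tsv", sep="\t", names=cols)

# Keep just the columns of interest and write them back out.
hits[["qseqid", "length", "pident", "gapopen", "qstart", "stitle"]].to_csv(
    "parsed_hits.csv", index=False)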

Related

Handling very large datasets (>1M records) using Power Automate

Before I go much further down this thought process, I wanted to check whether the idea was feasible at all. Essentially, I have two datasets, each of which will consist of ~500K records. For the sake of discussion, we can assume they will be presented as CSV files for ingestion.
Basically, what I'll need to do is take records from the first dataset, do a lookup against the second dataset, and then merge the two and write the results to an output CSV file. The expected number of records after the merge will be in the range of 1.5-2M.
So, my questions are:
Will Power Automate allow me to work with CSV datasets of those sizes?
Will the "Apply to each" operator function across that large of a dataset?
Will Power Automate allow me to produce the export CSV file with that size?
Will the process actually complete, or will it eventually just hit some sort of internal timeout error?
I know that I can use more traditional services like SQL Server Integration Services for this, but I'm wondering whether Power Automate has matured enough to handle this level of ETL operation.

Why do I sometimes get a malformed data error for some Contour boards but not others?

I have a dataset that, when used as an input to build another dataset, results in a 'malformed record' error indicating that something is wrong with the raw data file (some malformed values). I would therefore expect not to be able to use that dataset in Contour at all. So why do I sometimes get a malformed record error for some boards in Contour but not others?
So the answer here is that you will be able to perform some operations in Contour, and others you won't. This depends on whether or not the Spark job being executed under the hood in Contour actually encounters the malformed records. Basically, Spark is lazy, so it won't perform all operations over all of the data - only what it needs to show you the results. So if the operation performed in Contour doesn't touch the particular columns/rows where the malformed records exist, you'll be able to use the dataset.
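As a rough illustration of that laziness outside Contour - a hypothetical PySpark sketch with made-up file, schema and column names:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# FAILFAST makes Spark raise an error only when it actually parses a malformed record.
df = (spark.read
      .schema("id INT, amount DOUBLE")
      .option("mode", "FAILFAST")
      .csv("raw_data.csv"))

df.select("id").limit(10).show()   # may succeed: only the rows/columns it needs are read
df.groupBy().sum("amount").show()  # forces a full scan of "amount" and can surface the bad records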

How can I get a list of queries and their execution counts

I want to get a list of the queries executed against my MySQL instance, along with their execution counts and durations.
I can get these stats in something like Datadog APM, but I would like to be able to run a query for them locally.
Is there a table or schema I need to look at?
Turn on the "general log" and have it write to a file.
Wait a finite amount of time.
Then use pt-query-digest to summarize the results.
Turn off the general log before it fills up the disk.
The slow query log (with a small value for long_query_time) is more useful for finding naughty queries.
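A rough sketch of those steps driven from Python (connection details, the log path and the pt-query-digest call are placeholders; mysql-connector-python and Percona Toolkit are assumed to be installed):

import subprocess
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="root", password="...")
cur = conn.cursor()

# 1. Send the general log to a file and switch it on.
cur.execute("SET GLOBAL log_output = 'FILE'")
cur.execute("SET GLOBAL general_log_file = '/var/log/mysql/general.log'")
cur.execute("SET GLOBAL general_log = 'ON'")

# 2. ...wait while traffic runs, then switch it off before it fills the disk.
cur.execute("SET GLOBAL general_log = 'OFF'")

# 3. Summarize with pt-query-digest; the general log gives per-query counts,
#    while the slow query log (--type slowlog) is the one that carries durations.
subprocess.run(["pt-query-digest", "--type", "genlog",
                "/var/log/mysql/general.log"], check=True)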

Analyzing multiple JSON files with Tableau

I'm beginning to use Tableau and I have a project involving multiple website logs stored as JSON. I have one log for each day for about a month, each weighing about 500-600 MB.
Is it possible to open (and join) multiple JSON files in Tableau? If yes, how? I can load them in parallel, but not join them.
EDIT: I can load multiple JSON files and define their relationships, so this is OK. I still have the memory issue:
I am worried that by joining them all, I will not have enough memory to make it work. Are the loaded files stored in RAM or in an internal DB?
What would be the best way to do this? Should I merge all the JSON first, or load them into a database and use a connector to Tableau? If so, what could be a good choice of DB?
I'm aware some of these questions are opinion-based, but I have no clue about this and I really need some guidelines to get started.
For this volume of data, you probably want to preprocess, filter, aggregate and index it ahead of time - either using a database, something like Parquet and Spark and/or Tableau extracts.
If you use extracts, you probably want to filter and aggregate them for specific purposes. Just be aware that if you aggregate the data when you make the extract, you need to be careful that any further aggregations you perform in the visualization are well defined. Functions like SUM(), MIN() and MAX() compose safely - sums of partial sums are still correct sums. But averages of averages and count distincts of count distincts often are not.
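A tiny worked example (made-up numbers) of why averages of averages mislead:

# Two groups of very different sizes.
group_a = [10, 10, 10, 10]   # mean 10.0
group_b = [100]              # mean 100.0

overall_mean = sum(group_a + group_b) / len(group_a + group_b)  # 140 / 5 = 28.0
mean_of_means = (10.0 + 100.0) / 2                              # 55.0 - the wrong answer

# Sums, by contrast, re-aggregate safely.
assert sum([sum(group_a), sum(group_b)]) == sum(group_a + group_b)  # 140 == 140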
Tableau sends a query to the database and then renders a visualization based on the query result set. The volume of data returned depends on the query, which depends on what you specify in Tableau. Tableau caches results, and you can also create an extract which serves as a persistent, potentially filtered and aggregated, cache. See this related Stack Overflow answer.
For text files and extracts, Tableau loads them into memory via its Data Engine process today -- to be replaced by a new in-memory database called Hyper in the future. The concept is the same though: Tableau sends the data source a query, which returns a result set. For data of the size you are talking about, you might want to test using some sort of database if the volume exceeds what comfortably fits in memory.
The JSON driver is very convenient for exploring JSON data, and I would definitely start there. You can avoid an entire ETL step if that serves your needs. But at high volumes of data, you might need to move to some sort of external data source to handle production loads. FYI, the UNION feature with Tableau's JSON driver is not (yet) available as of version 10.1.
I think the answer that nobody gave is: no, you cannot join two JSON files in Tableau. Please correct me if I'm wrong.
I believe we can join 2 JSON tables in Tableau.
First, extract the columns from the JSON data as below (get_json_object is a Hive/Spark SQL function, so this step runs in the SQL layer before connecting Tableau):
select
get_json_object(JSON_column, '$.Attribute1') as Attribute1,
get_json_object(JSON_column, '$.Attribute2') as Attribute2
from table_name;
Perform the above for the required tables and then join them in Tableau.

Informatica - Can anyone please explain how mappings and transformations execute on input data

I have been searching around and did not find a good post explaining how mappings and transformations process input data.
Does the entire mapping execute for each record? (I understand this cannot be the case, as Sorter, Aggregator, etc. need to see the entire data set (or a defined batch size) in order to do their job.)
Maybe it depends on the transformation type - certain transformations may hold the entire input (or a defined batch) before they release any output.
Maybe this question applies to all data integration tools.
Briefly speaking, this depends on the transformation type - e.g. Expression processes rows without caching them, Aggregator and Sorter need to cache all input data, and Joiner caches just the master pipeline.
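A rough Python analogy (not Informatica code) of the difference between a row-by-row transformation and one that has to cache its input:

# Rough analogy only; field names are made up.
def expression(rows):
    # Expression-style: each output row can be emitted as soon as its input row arrives.
    for row in rows:
        yield {**row, "total": row["qty"] * row["price"]}

def sorter(rows):
    # Sorter/Aggregator-style: nothing can be emitted until every input row has been seen.
    return sorted(rows, key=lambda r: r["qty"])

data = [{"qty": 3, "price": 2.0}, {"qty": 1, "price": 5.0}]
print(list(expression(data)))
print(sorter(data))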