Best way to index very long filepaths in MySQL

Given the following filepaths, stored in a MySQL database:
.//.git/refs/remotes/origin/HEAD
.//.git/refs/tags
.//__init__.py
.//__init__.pyc
.//forms.py
.//forms.pyc
.//models.py
.//models.pyc
.//settings.py
.//settings.pyc
.//static
.//static/css
.//static/css/all.css
.//static/images
.//static/images/bg.png
.//static/images/favicon.ico
.//static/images/pds-header-logo.png
.//static/images/pds-logo.png
.//static/images/revolver.png
.//static/js
.//static/js/all.js
.//templates
.//templates/base.html
.//templates/default.html
.//templates/overview.html
.//urls.py
.//urls.pyc
.//views.py
.//views.pyc
.//wsgi.py
.//wsgi.pyc
Someone needs to be able to search the path. For example, if the user searched for "static", it would return the results with "static" in the path:
.//static
.//static/css
.//static/css/all.css
.//static/images
.//static/images/bg.png
.//static/images/favicon.ico
.//static/images/pds-header-logo.png
.//static/images/pds-logo.png
.//static/images/revolver.png
.//static/js
.//static/js/all.js
The search that I currently have is something like:
`SELECT path FROM files WHERE path LIKE '%search%';`
Is there a way to index this column / improve this search (remove the LIKE %%)? I probably have 1M+ filepaths on this system. Note that a filepath could be 200+ characters.

You can't. A search with a leading wildcard can't make use of an index.
If you index the file path, the best an index can support is a prefix query like:
`SELECT path FROM files WHERE path LIKE '/static/images/%';`
Given your scenario, if you want to allow wildcard search,
your best bet is to explode each path into its component keywords:
static
images
revolver
.png
Then store each keyword in a keyword table and build the relationship between keywords and files.
The wildcard search then becomes an exact lookup against the keyword table.
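A minimal sketch of that keyword-table approach, in Python with sqlite3 so it runs self-contained (the schema and queries carry over to MySQL; all table and column names here are illustrative assumptions, not from the original post):

import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE files (id INTEGER PRIMARY KEY, path TEXT);
    CREATE TABLE keywords (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
    CREATE TABLE file_keywords (file_id INTEGER, keyword_id INTEGER);
""")

def add_path(path):
    """Store a path and index every path component as a keyword."""
    file_id = conn.execute("INSERT INTO files (path) VALUES (?)", (path,)).lastrowid
    # Split the path into keywords on slashes and dots.
    for word in set(re.split(r"[/.]+", path)) - {""}:
        conn.execute("INSERT OR IGNORE INTO keywords (word) VALUES (?)", (word,))
        kw_id = conn.execute("SELECT id FROM keywords WHERE word = ?", (word,)).fetchone()[0]
        conn.execute("INSERT INTO file_keywords VALUES (?, ?)", (file_id, kw_id))

add_path(".//static/images/bg.png")
add_path(".//templates/base.html")

# The search is now an exact, index-backed match instead of LIKE '%static%':
rows = conn.execute("""
    SELECT f.path FROM files f
    JOIN file_keywords fk ON fk.file_id = f.id
    JOIN keywords k ON k.id = fk.keyword_id
    WHERE k.word = ?
""", ("static",)).fetchall()
print([r[0] for r in rows])  # ['.//static/images/bg.png']

The UNIQUE constraint on keywords.word gives you the index that makes the lookup fast.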

I guess you could actually have an "index of partial names". Something like this:
id | name    | parent
----------------------
 1 | static  |   0    // at root
 2 | css     |   1    // parent is "static"
 3 | all.css |   2    // parent is "css"
 4 | images  |   1    // parent is "static"
 5 | bg.png  |   4    // parent is "images"
It will need a bit of work to read out the original filename, unless you store that as well.
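Reading the original path back out is just a walk up the parent pointers; a short sketch, assuming the table rows above have been loaded into a dict:

# Hypothetical in-memory copy of the table: id -> (name, parent_id)
nodes = {
    1: ("static", 0),
    2: ("css", 1),
    3: ("all.css", 2),
    4: ("images", 1),
    5: ("bg.png", 4),
}

def full_path(node_id):
    """Follow parent pointers up to the root (parent 0) and join the parts."""
    parts = []
    while node_id != 0:
        name, parent = nodes[node_id]
        parts.append(name)
        node_id = parent
    return "/".join(reversed(parts))

print(full_path(5))  # static/images/bg.png

On MySQL 8+ the same walk can be done server-side with a recursive CTE.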


Output files with mixed wildcards

I'm stuck on how to do this with Snakemake.
First, say my "all" rule is:
rule all:
    input: "SA.txt", "SA_T1.txt", "SA_T2.txt",
           "SB.txt", "SB_T1.txt", "SB_T2.txt", "SB_T3.txt"
Notice that SA has two _T# files while SB has three such files, a crucial element of this.
Now I want to write a rule like this to generate these files:
rule X:
    output: N="S{X}.txt", T="S{X}_T{Y}.txt"
    (etc.)
But Snakemake requires that both output templates have the same wildcards, which these don't. Further, even if Snakemake could handle the multiple wildcards, it would presumably want to find a single filename match for the S{X}_T{Y}.txt template, but I want that to match ALL files where {X} matches the first template's {X}, i.e. I want output.T to be a list, not a single file. So it would seem the way to do it is:
def myExpand(T, wildcards):
    T = T.replace("{X}", wildcards.X)
    T = [T.replace("{Y}", S) for S in theXYs[wildcards.X]]
    return T

rule X:
    output: N="S{X}.txt", T=lambda wildcards: myExpand("S{X}_T{Y}.txt", wildcards)
    (etc.)
But I can't do this, because a lambda function cannot be used in an output section.
How do I do it?
It seems to me this argues for supporting lambda functions on output statements, providing a wildcards dictionary filled with values from already-parsed sections of the output statement.
Additions responding to comments:
The value of wildcard Y is needed because other rules have inputs for those files that have the wildcard Y.
My rule knows the different values for Y (and X) that it needs to work with from data read from a database into python dictionaries.
There are many values for X, and from 2 to 6 values of Y for each value of X. I don't think it makes sense to use separate rules for each value of X. However, I might be wrong as I recently learned that one can put a rule inside a loop, and create multiple rules.
More info about the workflow: I am combining somatic variant VCF files for several tumor samples from one person together into a single VCF file, and doing it such that for each called variant in any one tumor, all tumors not calling that variant are analyzed to determine read depth at the variant, which is included in the merged VCF file.
The full process involves about 14 steps, which could perhaps be as many as 14 rules. I actually didn't want to use 14 rules, but preferred to just do it all in one rule.
However, I now think the solution is to indeed use lots of separate rules. I was avoiding this partly because of the large number of unnecessary intermediate files, but actually, these exist anyway, temporarily, within a single large rule. With multiple rules I can mark them temp() so Snakemake will delete them at the end.
For the sake of fleshing out this discussion, which I believe is a legitimate one, let's assume a simple situation that might arise. Say that for each of a number of persons, you have N (>=2) tumor VCF files, as I do, and that you want to write a rule that will produce N+1 output files, one output file per tumor plus one more file that is associated with the person. Use wildcard X for person ID and wildcard Y for tumor ID within person X. Say that the operation is to put all variants present in ALL tumor VCF files into the person output VCF file, and all OTHER variants into the corresponding tumor output files for the tumors in which they appear. Say a single program generates all N+1 files from the N input files. How do you write the rule?
You want this:
rule:
    output: COMMON="{X}.common.vcf", INDIV="{X}.{Y}.indiv.vcf"
    input: "{X}.{Y}.vcf"
    shell: """
        getCommonAndIndividualVariants --inputs {input} \
            --common {output.COMMON} --indiv {output.INDIV}
        """
But that violates the rules for output wildcards.
The way I did it, which is less than satisfactory but works, is to use two rules: the first has the output template with more wildcards, the second has the template with fewer wildcards, and the second rule creates temporary output files which the first rule renames to their final names:
rule A:
    output: "{X}.{Y}.indiv.vcf"
    input: "{X}.common.vcf"
    shell: "mv {output}.tmp {output}"

rule B:
    output: "{X}.common.vcf"
    input: lambda wildcards: \
        expand("{X}.{Y}.vcf", X=wildcards.X, Y=getYfromDB(wildcards.X))
    params: OUT=lambda wildcards: \
        expand("{X}.{Y}.indiv.vcf.tmp", X=wildcards.X, Y=getYfromDB(wildcards.X))
    shell: """
        getCommonAndIndividualVariants --inputs {input} \
            --common {output} --indiv {params.OUT}
        """
I do not know how the rest of your workflow looks, and what the best solution is will depend on that context.
1
What about splitting up the rule into two, one creating "SA.txt", "SA_T1.txt", "SA_T2.txt" and another "SB.txt", "SB_T1.txt", "SB_T2.txt", "SB_T3.txt"?
2
Another possibility is to only have the {X} files in the output-directive, but have the rule create the other files, even though they are not in the output-directive. This does not work if the {Y} files are part of the DAG.
3 (Best solution?)
A third and potentially the best solution might be to have aggregated wildcards in the X rule and the rule that requires the output from X.
Then the solution would be
rule X:
    output: N="S{X_prime}.txt", T="S{Y_prime}.txt"
The rule which requires these files can look like:
rule all:
    input:
        expand("S{X_prime}.txt", X_prime="A_T1 A_T2".split()),
        expand("S{Y_prime}.txt", Y_prime="B_T1 B_T2 B_T3".split())
If this does not meet your requirements we can discuss it further :)
P.S. You might need to use wildcard_constraints to disambiguate the outputs of rule X:
list_of_all_valid_X_prime_values = "A_T1 A_T2".split()
list_of_all_valid_Y_prime_values = "B_T1 B_T2 B_T3".split()

wildcard_constraints:
    X_prime = "({})".format("|".join(list_of_all_valid_X_prime_values)),
    Y_prime = "({})".format("|".join(list_of_all_valid_Y_prime_values))

rule all:
    ...
My understanding is that snakemake works by taking steps that look as follows:
Look at the name of a file it is asked to generate.
Find the rule that has a matching pattern in its output.
Infer the values of the wildcards from the above match.
Use the wildcards to determine what the name of the inputs of the chosen rule should be.
Your rule can generate its output without needing an input, so the problem of inferring the value of wildcard Y is not evident.
How does your rule know how many different values for Y it needs to work with?
If you find a way to determine the values for Y just knowing the value for X and predefined "python-level" functions and variables, then there may be a way to have Y as an internal variable of your rule, and not a wildcard.
In this case, the workflow could be driven only by files S{X}.txt. The S{X}_T{Y}.txt would just be a byproduct of the execution of the rule, not in its explicit output.
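A sketch of that idea: the rule below assumes a Python-level helper getYfromDB(X) (like the one in the asker's own workaround above) that maps X to its Y values, and writes the S{X}_T{Y}.txt files as undeclared byproducts:

def getYfromDB(x):
    # Hypothetical stand-in; the real mapping comes from the asker's database.
    return {"A": ["1", "2"], "B": ["1", "2", "3"]}[x]

rule X:
    output: "S{X}.txt"
    run:
        with open(output[0], "w") as f:
            f.write("summary for %s\n" % wildcards.X)
        # Byproducts: one file per Y, deliberately not declared as output.
        for y in getYfromDB(wildcards.X):
            with open("S%s_T%s.txt" % (wildcards.X, y), "w") as f:
                f.write("per-sample output for %s, T%s\n" % (wildcards.X, y))

As noted above, this only works if no other rule needs the S{X}_T{Y}.txt files as inputs in the DAG.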

PhpStorm - change the path to images - how to update all paths with a batch process

I want to change the path to the images I have on my site; however, there are hundreds to change.
Is there a batch process for this?
You should try the most obvious approach: global find & replace.
Edit | Find | Replace in Path... (Ctrl + Shift + R using Default keymap).
The search functionality allows you to specify a very narrow search scope: only a specific folder, or a user-defined custom scope where you can include files on a per-file level.
If the search finds too many possible occurrences (the search is too broad, e.g. the folder name is not unique enough), you can still review and exclude particular occurrences before doing the actual replace.

Using PHP's nowdoc syntax to store texts in arrays in Fat-Free Framework

I am working on a page which is going to present 20 products. I would like to avoid using any DB (the page is going to be simple), so I am thinking about storing the products' data in the [globals] array. The catch is that each product description is quite long, between 500 and 1000 words, and it is formatted, which makes this very complicated. I am wondering if it is possible to use something like PHP's nowdoc syntax to manage such long texts in the Fat-Free Framework (http://www.php.net/manual/en/language.types.string.php#language.types.string.syntax.nowdoc).
Do you have any other idea how to store long texts in arrays in F3?
Thanks in advance
Macrin
The user guide has an example of a very long string:
[globals]
str="this is a \
very long \
string"
Me, I would keep each product's description (with any other info, like photo URL or price) in a separate text file in a dedicated directory (let's say products). Then in index.php or any other route handler I would scan this directory and load the descriptions:
$productsDir = __DIR__ . '/products';
$productsInfo = [];
foreach (new DirectoryIterator($productsDir) as $fileInfo) {
    if ($fileInfo->isDot()) continue;
    $productsInfo[] = file_get_contents($fileInfo->getPathname());
}
var_dump($productsInfo);
You can use the JIG database and its data mapper.
https://fatfreeframework.com/3.6/jig-mapper
It can store your product items in plain .json files and you also get some basic CRUD and search functionality. You can also hook in Cortex later, if you ever want to upgrade to a real DB.

How can I go about querying a database for non-similar, but almost matching items

How can I go about querying a database for items that are not only exactly similar to a sample, but also those that are almost similar? Much like search engines work, but only for a small project, preferably in Java. For example:
String sample = "Sample";
I would like to retrieve all the following whenever I query sample:
String exactMatch = "Sample";
String nonExactMatch = "S amp le";
String nonExactMatch_2 = "ampls";
You need to define what similar means in terms that your database can understand.
Levenshtein distance, for example, is one possibility.
In your example, sample matches...
..."Sample", if you search without case sensitivity.
..."S amp le", if you remove a set of ignored characters (here space only) from both the query string and the target string. You can store the new value in the database:
ActualValue      SearchFor
John Q. Smith    johnqsmith%
When someone searches for "John Q. Smith, Esq." you can boil it down to johnqsmithesq and run
WHERE 'johnqsmithesq' LIKE SearchFor
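A sketch of that normalization step in Python (the exact rules, lowercasing plus the set of ignored characters, are illustrative assumptions):

IGNORED = set(" .,'\"-")

def normalize(s):
    """Lowercase and drop ignored characters, so 'John Q. Smith, Esq.' becomes 'johnqsmithesq'."""
    return "".join(ch for ch in s.lower() if ch not in IGNORED)

print(normalize("John Q. Smith, Esq."))  # johnqsmithesq

Store the normalized form in the SearchFor column at insert time, and apply the same function to the query string before comparing.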
"ampls" is more tricky. Why is it that 'ampls' is matched by 'sample'? A common substring? A number of shared letters? Does their order count (i.e. are anagrams valid)? Many approaches are possible, but it is you who must decide. You might use Levenshtein distance, or maybe store a string such as "100020010003..." where every digit encodes the number of letters you have, up to 9 (so 3 C's and 2 B's but no A's would give "023...") and then run the Levenshtein distance between this syndrome and the one from each term in the DB:
ActualValue    Search1      Rhymes  abcdefghij_Contains  anagramOf
John Q. Smith  johnqsmith%  ith     0000000211011...     hhijmnoqst
...and so on.
One approach is to ask oneself: how must I transform both the search value and the value searched for, so that they match? Then proceed to implement that transformation in code.
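For the fuzzier cases, a plain Levenshtein distance is a reasonable starting point; a minimal, self-contained implementation:

def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("sample", "ampls"))  # 2

You would then keep every term whose distance to the (normalized) query falls under some threshold you pick.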
You can also use MATCH ... AGAINST on MyISAM FULLTEXT-indexed columns, e.g. WHERE MATCH(name) AGAINST('sample').

HTML Form Posts - Using Checkboxes with a Long Name attribute

I'm having a bit of an issue with some coding. I have made a file restore PHP script that will allow a person to place a checkbox next to the name of a file - and then when they click the "Restore" button at the bottom, it will restore the file in question from a backup.
Unfortunately, there seems to be a problem. The full path and name of the file are in the checkbox's "name" attribute, so that they are passed along to the next script as the location of the file that needs to be restored.
As an example:
<input type="checkbox" name="/backups/Sunday/111111111111-com/www/components/com_virtuemart/" />
See how long the "name" attribute is? In many cases, the restore works - but once the name attribute gets longer, it doesn't work anymore. In the above attribute, the "name" field is 63 characters long.
Now, if another one is tried:
<input type="checkbox" name="/backups/Sunday/111111111111-com/www/components/com_virtuemart/js/" />
The above "name" attribute is 67 characters long. It DOES NOT work.
On the script that the POST data is being posted to, I did a var_dump($_POST); to see what the output was. In the first case where the "name" attribute was 63 characters long, the var_dump displays it. But in the second case where the attribute was 67 characters long, it does not display it - and therefore the file is not restored.
Is there any way around this supposed attribute size limit? I looked online and saw several posts where individuals said there was no limit to the length of the "name" attribute - but apparently there is one.
Thank you!
It's the Suhosin PHP hardener that's doing it, no doubt; its suhosin.post.max_name_length directive defaults to 64 characters, which matches the cutoff you are seeing. You can either raise that limit in the Suhosin config, or take a different approach.
Maybe the easiest way, without recoding a chunk of your script to use aliasing or to store a key=>value array persistently, would be to simply name your inputs "files[]" and put the path to the file in the checkbox value. Then you can just do:
foreach ($_POST['files'] as $f) {
    // $f is the file path
}
However, I myself don't like to do things like this. I'd try to store a key=>value array somewhere. If you don't want to use a database, just serialize a PHP array into a file, and give each checkbox the integer array key for its file as its value. Then, in the processing script, you can simply fetch the files from the stored array at the posted integer indexes.