I am using the Transformer model from Hugging Face for machine translation. However, my input data has relational information.
I want to build a graph like the following:
________
| |
| \|/
He ended his meeting on Tuesday night.
/|\ | | /|\
| | | |
|__| |___________|
Essentially, each token in the sentence is a node, and there can be an edge between any two tokens.
In a normal transformer, the tokens are mapped to token embeddings, and each position is encoded as a positional embedding.
How could I do something similar with the edge information?
In theory, I could take the edge type and the positional encoding of a node and output an embedding. The embeddings of all the edges could then be added to the positional embeddings of the corresponding nodes.
Ideally, I would like to implement this with the Hugging Face transformer.
I am struggling to understand how I could update the positional embedding here:
https://github.com/huggingface/transformers/blob/main/src/transformers/models/longformer/modeling_longformer.py#L453
self.position_embeddings = nn.Embedding(
config.max_position_embeddings, config.hidden_size, padding_idx=self.padding_idx
)
Your question is generally about how to use linguistic representations to enrich word representations in Transformer language models. This is an open research question with no clear answer. One option given by Prange et al. (2022) is to define a function which returns a subgraph of the whole-sentence graph given a particular token, and to process the token's subgraph so that it is represented in a fixed-length vector which can then be readily combined with the token's representation for use in the rest of the Transformer LM. Their related work section reviews other approaches and is worth a look.
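The simpler idea from the question (sum one embedding per edge type into the tokens that the edge connects) can be sketched in plain PyTorch. This is only a minimal sketch and is not the method of Prange et al.; the class name EdgeTypeEmbedding, the edge format (source_index, target_index, edge_type), and all sizes are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class EdgeTypeEmbedding(nn.Module):
    """Per-token sum of edge-type embeddings (hypothetical sketch)."""

    def __init__(self, num_edge_types, hidden_size):
        super().__init__()
        self.edge_embeddings = nn.Embedding(num_edge_types, hidden_size)

    def forward(self, seq_len, edges):
        # edges: list of (source_index, target_index, edge_type) for ONE sentence.
        weight = self.edge_embeddings.weight
        edge_sum = torch.zeros(seq_len, weight.size(1), dtype=weight.dtype, device=weight.device)
        if not edges:
            return edge_sum
        src, dst, etype = (torch.tensor(col, device=weight.device) for col in zip(*edges))
        edge_vectors = self.edge_embeddings(etype)  # (num_edges, hidden_size)
        # Add each edge's vector to both tokens it connects.
        edge_sum = edge_sum.index_add(0, src, edge_vectors)
        edge_sum = edge_sum.index_add(0, dst, edge_vectors)
        return edge_sum


torch.manual_seed(0)
hidden_size = 16
token_embeddings = nn.Embedding(100, hidden_size)   # stand-in for the model's own table
edge_layer = EdgeTypeEmbedding(num_edge_types=4, hidden_size=hidden_size)

input_ids = torch.arange(8)                          # 8 made-up token ids
edges = [(0, 2, 1), (3, 6, 2)]                       # made-up edges and edge types
edge_term = edge_layer(len(input_ids), edges)
enriched = token_embeddings(input_ids) + edge_term   # shape: (8, 16)
```

Most Hugging Face models accept precomputed embeddings through the inputs_embeds argument, so token embeddings plus an edge term like this could probably be fed in that way, leaving the model to add its own positional embeddings. Whether that helps translation quality is exactly the open question described above.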
I am currently working with single-cell data from human and zebrafish, both from brain tissue.
My task is to integrate them. The steps I have followed so far:
Find human orthologs for zebrafish genes in BioMart
Keep only the one-to-one orthologs
Subset the zebrafish Seurat object based on the orthologs and replace the gene names with the human gene names
Create a new object for zebrafish and run normalization and FindVariableFeatures
Then use this object with my human object for integration
Human object: 20620 features across 2989 samples
Zebrafish object: 6721 features across 6036 samples
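The one-to-one filtering and renaming steps listed above can be sketched independently of Seurat. The following is a minimal Python illustration of the logic only; the function name one_to_one_orthologs and the gene names are made up, and the real ortholog table would come from the BioMart export.

```python
from collections import Counter


def one_to_one_orthologs(pairs):
    """Return {zebrafish_gene: human_gene} for genes that appear in exactly one pair."""
    pairs = set(pairs)  # drop exact duplicate rows
    zf_counts = Counter(zf for zf, _ in pairs)
    hs_counts = Counter(hs for _, hs in pairs)
    return {zf: hs for zf, hs in pairs if zf_counts[zf] == 1 and hs_counts[hs] == 1}


# Made-up ortholog table rows: (zebrafish gene, human gene).
ortholog_pairs = [
    ("gene_a", "GENE1"),
    ("gene_b1", "GENE2"),   # two zebrafish genes map to GENE2 -> both dropped
    ("gene_b2", "GENE2"),
    ("gene_c", "GENE3"),    # gene_c maps to two human genes -> dropped
    ("gene_c", "GENE4"),
]
mapping = one_to_one_orthologs(ortholog_pairs)

# Keep and rename only the zebrafish features that survive the filter.
zebrafish_features = ["gene_a", "gene_b1", "gene_c", "gene_x"]
renamed = [mapping[g] for g in zebrafish_features if g in mapping]
```

Checking that the renamed feature list has no duplicates and overlaps well with the human feature names is a cheap sanity check before running the integration.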
features <- SelectIntegrationFeatures(object.list = double.list)
anchors <- FindIntegrationAnchors(object.list = double.list,
anchor.features = features,
normalization.method="LogNormalize",
nn.method="rann")
This identifies 2085 anchors!
I used nn.method="rann" because with the default I get this error:
Error: C stack usage 7973252 is too close to the limit
Then I am running the integration like this
ZF_HUMAN.combined <- IntegrateData(anchorset = anchors,
new.assay.name = "integrated")
and the error I am receiving is like this
Scaling features for provided objects
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=00s
Finding all pairwise anchors
| | 0 % ~calculating Running CCA
Merging objects
Finding neighborhoods
Finding anchors
Found 9265 anchors
Filtering anchors
Retained 2085 anchors
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=22s
To solve this, I tried changing the arguments of FindIntegrationAnchors,
e.g. I used l2.norm=F. The only thing that changed was the number of anchors, which decreased.
I am wondering whether using nn.method="rann" in FindIntegrationAnchors is messing things up.
Any help would be appreciated; I have been struggling with this for a long time and don't know what else to try.
My question is: is there any algorithm that can convert a SMILES structure into a topological fingerprint? For example, if glycerol is the input, the answer would be 3x -OH, 2x -CH2 and 1x -CH.
I'm trying to build a Python script that can predict the density of a mixture using an artificial neural network. As input I want to have the structure/fingerprint of my molecules, starting from the SMILES structure.
I'm already familiar with RDKit and the Morgan fingerprint, but that is not what I'm looking for. I'm also aware that I can use the 'matching substructure' search in RDKit, but then I would have to define all the different subgroups. Is there a more convenient/shorter way?
For most structures, there is no ready-made option to find the fragments. However, there is a module in RDKit (rdkit.Chem.Fragments) that gives you the number of fragments, especially when the fragment is a functional group. Check it out here. As an example, let's say you want to find the number of aliphatic -OH groups in your molecule. You can simply call the following function to do that:
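from rdkit import Chem
from rdkit.Chem.Fragments import fr_Al_OH
mol = Chem.MolFromSmiles("OCC(O)CO")  # glycerol, the example from the question
fr_Al_OH(mol)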
or the following would return the number of aromatic -OH groups:
from rdkit.Chem.Fragments import fr_Ar_OH
fr_Ar_OH(mol)
Similarly, there are 83 more functions available, and some of them should be useful for your task. For groups without a pre-written function, you can always go to the source code of these RDKit modules, see how they are implemented, and then write your own for your features. But as you already mentioned, the way to do that is to define a SMARTS string and then do fragment matching. The fragment matching module can be found here.
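To see which of the pre-written counters fire for a given molecule, one option is to loop over every fr_* function in rdkit.Chem.Fragments. This is a minimal sketch; the helper name fragment_counts is made up, and it assumes every attribute of the module whose name starts with fr_ is one of these counters.

```python
from rdkit import Chem
from rdkit.Chem import Fragments


def fragment_counts(smiles):
    """Return {function_name: count} for every non-zero fr_* counter."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    counts = {}
    for name in dir(Fragments):
        func = getattr(Fragments, name)
        if name.startswith("fr_") and callable(func):
            value = func(mol)
            if value:
                counts[name] = value
    return counts


glycerol_counts = fragment_counts("OCC(O)CO")  # glycerol
print(glycerol_counts)
```

The resulting dictionary can be turned into a fixed-length input vector by using the sorted list of all fr_* names as the column order. Note that several counters overlap (one -OH can be counted by more than one function), so this is a descriptor vector rather than a clean decomposition of the molecule.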
If you want to predict densities of pure components before predicting the mixtures I recommend the following paper:
https://pubs.acs.org/doi/abs/10.1021/acs.iecr.6b03809
You can use the fragments specified by RDKit, as mnis proposes. Or you could specify the groups as SMARTS patterns and look for them yourself using GetSubstructMatches, as you proposed.
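For the glycerol example from the question, the SMARTS route can be sketched as follows. The three patterns are hand-written assumptions for this one example and are not the UNIFAC definitions; the helper name count_groups is made up.

```python
from rdkit import Chem

# Hand-written SMARTS for the three groups in the glycerol example (assumed definitions).
GROUP_SMARTS = {
    "OH": "[OX2H]",     # two-connected oxygen carrying one hydrogen
    "CH2": "[CX4H2]",   # sp3 carbon with two hydrogens
    "CH": "[CX4H1]",    # sp3 carbon with one hydrogen
}


def count_groups(smiles, group_smarts=GROUP_SMARTS):
    """Count the matches of each SMARTS pattern in one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    return {
        name: len(mol.GetSubstructMatches(Chem.MolFromSmarts(smarts)))
        for name, smarts in group_smarts.items()
    }


glycerol_groups = count_groups("OCC(O)CO")
print(glycerol_groups)
```

Patterns this loose will also match inside other groups (for example, [OX2H] also matches the -OH of a carboxylic acid), which is why a complete group set needs priorities or more specific patterns.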
Dissecting a molecule into specific groups is not as straightforward as it might first appear. You could also use an algorithm I published a while ago:
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-019-0382-3
It includes a list of SMARTS for the UNIFAC model, but you could also use them for other things, like density prediction.
I am trying to create a KNIME workflow that would accept a list of compounds and carry out bioisosteric replacements (we will use the following example here: carboxylic acid to tetrazole) automatically.
NOTE: I am using the following workflow as inspiration : RDKit-bioisosteres (myexperiment.org). This uses a text file as SMARTS input. I cannot seem to replicate the SMARTS format used here.
For this, I plan to use the RDKit One Component Reaction node, which takes as input a set of compounds to carry out the reaction on and a SMARTS string that defines the reaction.
My issue is the generation of a working SMARTS string describing the reaction.
I would like to input two SDF files (or another format, I am not particularly attached to SDF): one with the group to replace (carboxylic acid) and one with the list of possible bioisosteric replacements (tetrazole). I would then combine these two in KNIME and generate a SMARTS string for the reaction, to be used in the RDKit One Component Reaction node.
NOTE: The input SDF files have the structures written with an
attachment point (*COOH for the carboxylic acid for example) which
defines where the group to replace is attached. I suspect this is the
cause of many of the issues I am experiencing.
So far, I can easily generate the reactions in RXN format using the Reaction Builder node from the Indigo node package. However, converting this reaction into a SMARTS string that is accepted by the RDKit One Component Reaction node has proven tricky.
What I have tried so far:
1. Converting RXN to SMARTS (Molecule Type Cast node): gives the following error: scanner: BufferScanner::read() error
2. Converting the Source and Target molecules into SMARTS (Molecule Type Cast node): gives the following error: SMILES loader: unrecognised lowercase symbol: y
   Showing this as a string in KNIME reveals that the conversion is not carried out and the string is still in SDF format: *filename*.sdf 0 0 0 0 0 0 0 V3000M V30 BEGIN etc.
3. Converting the Source and Target molecules into RDKit first (RDKit from Molecule node), then from RDKit into SMARTS (RDKit to Molecule node, SMARTS option). This outputs the following SMARTS strings:
Carboxylic acid : [#6](-[#8])=[#8]
Tetrazole : [#6]1:[#7H]:[#7]:[#7]:[#7]:1
This is as close as I've managed to get. I can then join these two SMARTS strings with >> in between (output: [#6](-[#8])=[#8]>>[#6]1:[#7H]:[#7]:[#7]:[#7]:1) to create a reaction SMARTS string, but this is not accepted as input by the RDKit One Component Reaction node.
Error message in KNIME console :
ERROR RDKit One Component Reaction 0:40 Creation of Reaction from SMARTS value failed: null
WARN RDKit One Component Reaction 0:40 Invalid Reaction SMARTS: missing
Note that the SMARTS strings generated by this last option (3.) are very different from the ones used in the myexperiment.org example ([*:1][C:2]([OH])=O>>[*:1][C:2]1=NNN=N1). I also seem to have lost the attachment point information through these conversions, which is likely to cause issues in the rest of the workflow.
Therefore, I am looking for a way to generate the SMARTS strings used in the myexperiment.org example for my own sets of substituents. Obviously, doing this by hand is not an option. I would also like this workflow to use only the open-source nodes available in KNIME and not proprietary nodes (Schrodinger etc.).
Hopefully someone can help me out with this. I am happy to upload my current workflow and the source files if needed.
Thanks in advance for your help,
Stay safe and healthy!
-Antoine
What you're describing is template generation, which has been a consistent field of work in reaction prediction and/or retrosynthesis in cheminformatics for a long time.
I'm not particularly familiar with KNIME myself, though I know RDKit extensively: Your last option (3) is closest to what I'd consider a usable workflow. The way I would do this:
Load the substitution pair molecules from SDF into RDKit mol objects.
Export these RDKit mol objects as SMARTS strings using rdkit.Chem.MolToSmarts().
Concatenate these strings into the form before_substructure>>after_substructure to generate a reaction SMARTS string.
Load this SMARTS string into a reaction object: rxn = rdkit.Chem.AllChem.ReactionFromSmarts()
Use the rxn.RunReactants() method to generate your bioisosterically substituted products.
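The last two steps can be sketched in plain RDKit as follows. This is a minimal sketch outside KNIME that uses the atom-mapped reaction SMARTS quoted in the question rather than one generated from SDF files; the helper name apply_replacement and the benzoic acid test input are assumptions for illustration.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Reaction SMARTS quoted in the question (carboxylic acid -> tetrazole).
REACTION_SMARTS = "[*:1][C:2]([OH])=O>>[*:1][C:2]1=NNN=N1"


def apply_replacement(smiles, reaction_smarts=REACTION_SMARTS):
    """Return the unique product SMILES from applying the reaction to one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    rxn = AllChem.ReactionFromSmarts(reaction_smarts)
    products = set()
    for product_set in rxn.RunReactants((mol,)):
        for product in product_set:
            try:
                Chem.SanitizeMol(product)  # products come back unsanitized
            except Exception:
                continue  # skip products RDKit cannot sanitize
            products.add(Chem.MolToSmiles(product))
    return sorted(products)


tetrazoles = apply_replacement("OC(=O)c1ccccc1")  # benzoic acid as a test input
print(tetrazoles)
```

The mapped atom [*:1] plays the role of the attachment point: it is matched in the input molecule and carried over unchanged into the product, which is the information that is lost when the groups are exported without atom maps.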
The error you quote for the RDKit One Component Reaction node input cuts off just before the important information, unfortunately. Running rdkit.Chem.AllChem.ReactionFromSmarts("[#6](-[#8])=[#8]>>[#6]1:[#7H]:[#7]:[#7]:[#7]:1") produces no errors for me locally, which leads me to believe this is specific to the KNIME node functionality.
Note that the difference between [#6](-[#8])=[#8] and [*:1][C:2]([OH])=O is relatively minimal: the former represents an O-C=O substructure, the latter represents a ~COOH group. Within the square brackets of the latter, the :num refers to an optional 'atom map' number, which allows a one-to-one mapping of reactant and product atoms. For example, [C:1][C:3].[C:2][C:4]>>[C:1][C:3][C:4][C:2] allows you to track which carbon is which during a reaction, for situations where it may matter. The token [*:1] means "any atom" and is equivalent to a wavy line in organic chemistry (and it is mapped to #1).
There are only two situations I can think of where [#6](-[#8])=[#8] and [*:1][C:2]([OH])=O might differ:
You have methanoic acid as a potential input for substitution (former will match, latter might not - I can't remember how implicit hydrogens are treated in this situation)
Inputs are over/under protonated. (COO- != COOH)
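Both situations are easy to check directly with a substructure match. The sketch below uses three test molecules chosen for illustration (acetic acid, formic acid, acetate) and RDKit's default handling of implicit hydrogens.

```python
from rdkit import Chem

generic = Chem.MolFromSmarts("[#6](-[#8])=[#8]")
mapped = Chem.MolFromSmarts("[*:1][C:2]([OH])=O")

test_molecules = {
    "acetic acid": "CC(=O)O",
    "formic acid": "OC=O",
    "acetate": "CC(=O)[O-]",
}

matches = {}
for name, smiles in test_molecules.items():
    mol = Chem.MolFromSmiles(smiles)
    # (matches the generic pattern, matches the atom-mapped pattern)
    matches[name] = (mol.HasSubstructMatch(generic), mol.HasSubstructMatch(mapped))
    print(f"{name}: generic={matches[name][0]}, mapped={matches[name][1]}")
```

With implicit hydrogens, [*:1] needs a real atom in the molecular graph, so formic acid is matched only by the generic pattern; acetate fails the mapped pattern because [OH] requires a hydrogen on the oxygen. Calling Chem.AddHs on the molecule first should make the hydrogen of formic acid available to [*:1].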
Converting these reaction SMARTS to RDKit reaction objects and running them on input molecule objects should potentially create a number of substituted products. Note: Typically, in extensive projects, there will be some SMARTS templates that require some degree of manual intervention - indicating attachment points, specifying explicit hydrogens, etc. If you need any help or have any questions don't hesitate to drop a comment and I'll do my best to help with specifics.
I am starting to work with Vowpal Wabbit with Python and I am kinda struggling with its lack of documentation.
Do you guys know what modeling it uses as a cost/reward estimation for each arm? Do you know how to retrieve this current estimation?
from vowpalwabbit import pyvw

vw = pyvw.vw("--cb_explore 2 --epsilon 0.2")
example = "2:-20:0.5 | Anna"  # action:cost:probability | features
vw.learn(example)
example = "1:-10:0.1 | Anna"
vw.learn(example)
vw.predict(" | Anna")
Output would be:
[0.10000000149011612, 0.9000000357627869]
How can I also get the expected value for each arm? Something like
[-10.00, -20.00]
When using _explore you get back a PMF over the given actions. This is true for CB and CB_adf.
However, when using the non-explore version of each, things differ a bit.
--cb is going to give you the chosen action directly, whereas --cb_adf is going to return the score for each given action.
So in this situation changing to using action dependent features (ADF) should provide the score/estimated cost.
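To see why the output in the question is a PMF and not a pair of cost estimates, here is a minimal sketch of the plain epsilon-greedy rule that --epsilon selects. It is an illustration of the rule, not Vowpal Wabbit's actual implementation, and the cost estimates passed in are hypothetical.

```python
def epsilon_greedy_pmf(estimated_costs, epsilon):
    """Explore uniformly with probability epsilon, otherwise pick the lowest-cost action."""
    num_actions = len(estimated_costs)
    best = min(range(num_actions), key=lambda i: estimated_costs[i])
    pmf = [epsilon / num_actions] * num_actions
    pmf[best] += 1.0 - epsilon
    return pmf


# Hypothetical cost estimates in which action 2 currently looks cheaper than action 1.
pmf = epsilon_greedy_pmf(estimated_costs=[-10.0, -20.0], epsilon=0.2)
print(pmf)  # approximately [0.1, 0.9]
```

With two actions and epsilon 0.2, every action gets 0.2 / 2 = 0.1 and the currently best action gets the remaining 0.8 on top, which is consistent with the [0.1, 0.9] shown in the question. The PMF therefore only reveals which action is currently ranked best, not the estimated costs themselves.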
I have a problem similar to the one below:
Why did NLTK NaiveBayes classifier misclassify one record?
In my case, I queried a positive feed and built positive_vocab, and then queried a negative feed and built negative_voca. I took the data from the feeds, cleaned it, and built the classifier. How do I build the neutral_vocab? Is there a way I can instruct the NLTK classifier to return a neutral label when the given word is not found in negative_voca or positive_vocab? How do I do that?
In my current implementation, if I give a word that is not present in either set, it returns positive by default. Instead, it should return neutral or notfound.
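The behaviour I am after could be sketched like this, with a vocabulary check placed in front of the classifier. This is only a sketch of the desired logic: classify_with_neutral is a made-up helper, the classify argument stands in for whatever call produces a label from the trained classifier, and the tiny vocabularies are invented.

```python
def classify_with_neutral(word, classify, positive_vocab, negative_vocab):
    """Return 'neutral' for words in neither vocabulary, else defer to the classifier."""
    if word not in positive_vocab and word not in negative_vocab:
        return "neutral"
    return classify(word)


# Toy stand-ins for the two vocabularies and the trained classifier.
positive_vocab = {"good", "great"}
negative_vocab = {"bad", "awful"}


def toy_classify(word):
    return "positive" if word in positive_vocab else "negative"


labels = [
    classify_with_neutral(w, toy_classify, positive_vocab, negative_vocab)
    for w in ["great", "awful", "table"]
]
```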