package molenc
Sources
sha256=a35d49c3c89ad607b46ec171248456218b08a6dc8ab34aab5d3081d4e7c0855a
md5=cab545a0a8b78f78d54c33750349022c
Description
Chemical fingerprints are lossy encodings of molecules. molenc encodes molecules as unfolded-counted fingerprints (i.e., potentially very long but sparse vectors of positive integers).
Currently, Faulon fingerprints and atom pairs are supported.
Currently, atom types are the quadruplet (#pi-electrons, element symbol, #HA neighbors, formal charge). In the future, pharmacophore features might be supported (a more abstract/fuzzy atom typing scheme).
Bibliography:
Faulon, J. L., Visco, D. P., & Pophale, R. S. (2003). The signature molecular descriptor. 1. Using extended valence sequences in QSAR and QSPR studies. Journal of Chemical Information and Computer Sciences, 43(3), 707-720.
Carhart, R. E., Smith, D. H., & Venkataraghavan, R. (1985). Atom pairs as molecular features in structure-activity studies: definition and applications. Journal of Chemical Information and Computer Sciences, 25(2), 64-73.
Kearsley, S. K., Sallamack, S., Fluder, E. M., Andose, J. D., Mosley, R. T., & Sheridan, R. P. (1996). Chemical similarity using physiochemical property descriptors. Journal of Chemical Information and Computer Sciences, 36(1), 118-127.
OpenSMILES specification. Craig A. James et al. v1.0 2016-05-15. http://opensmiles.org/opensmiles.html
Published: 31 Jan 2024
README
Introduction
MolEnc: a molecular encoder using rdkit and OCaml.
The implemented fingerprint is J-L Faulon's "Signature Molecular Descriptor" (SMD [1]). This is an unfolded-counted chemical fingerprint. Such fingerprints are less lossy than well-known chemical fingerprints like ECFP4: SMD encoding does not introduce feature collisions. In addition, a feature dictionary is created at encoding time. This dictionary can be used later to map a given feature index back to an atom environment. MolEnc also implements unfolded-counted atom pairs [2].
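To make the idea of an unfolded-counted fingerprint with a feature dictionary concrete, here is a minimal Python sketch. It is not molenc's implementation: the function name `encode` and the feature labels are hypothetical, and real molenc features are atom environments rather than short strings. The sketch only shows why giving each distinct feature its own index avoids collisions, and how the dictionary allows a reverse lookup.

```python
from collections import Counter

def encode(features, dictionary):
    """Return a sparse counted fingerprint {feature_index: count}.

    `features` is a list of hashable feature labels for one molecule.
    `dictionary` maps label -> index and is extended in place, so two
    distinct labels never share an index (no folding, no collisions).
    """
    fingerprint = Counter()
    for label in features:
        # A new label gets the next free index; a known label keeps its index.
        index = dictionary.setdefault(label, len(dictionary))
        fingerprint[index] += 1
    return dict(fingerprint)

# Hypothetical feature labels, for illustration only.
dictionary = {}
fp_a = encode(["C-arom", "C-arom", "O-hydroxyl"], dictionary)
fp_b = encode(["O-hydroxyl", "N-amine"], dictionary)

# Reverse lookup: feature index -> feature label.
index_to_label = {index: label for label, index in dictionary.items()}
```

Only non-zero counts are stored, which keeps a potentially very long vector compact.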
For SMD, we recommend using a radius of zero to one (molenc.sh -r 0:1 ...) or zero to two.
Currently, the atom typing scheme being used is: (#pi-electrons, element symbol, #HA neighbors, formal charge).
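The following Python sketch illustrates how an atom type quadruplet and a radius range such as 0:1 could combine into per-atom features. It is a simplified layered neighborhood, not Faulon's canonical signature and not molenc's code: the names `AtomType` and `environments`, the hand-written adjacency list, and the pi-electron counts are all assumptions made for illustration.

```python
from typing import NamedTuple

class AtomType(NamedTuple):
    pi_electrons: int
    element: str
    heavy_neighbors: int
    formal_charge: int

def environments(atom_types, adjacency, max_radius):
    """List one feature per atom and per radius in 0..max_radius.

    The feature for (atom, r) is a tuple of layers; layer d holds the
    sorted atom types found exactly d bonds away from the center atom.
    """
    features = []
    for center in range(len(atom_types)):
        seen = {center}
        frontier = [center]
        layers = []
        for _radius in range(max_radius + 1):
            layers.append(tuple(sorted(atom_types[a] for a in frontier)))
            features.append(tuple(layers))
            # Expand to the next shell of not-yet-visited neighbors.
            next_frontier = []
            for a in frontier:
                for b in adjacency[a]:
                    if b not in seen:
                        seen.add(b)
                        next_frontier.append(b)
            frontier = next_frontier
    return features

# Heavy atoms of ethanol (C-C-O), typed by hand for this example.
methyl = AtomType(0, "C", 1, 0)
methylene = AtomType(0, "C", 2, 0)
hydroxyl = AtomType(0, "O", 1, 0)
types = [methyl, methylene, hydroxyl]
adjacency = [[1], [0, 2], [1]]

feats = environments(types, adjacency, max_radius=1)
```

With three heavy atoms and radii 0 and 1, the sketch yields six features; identical environments in a larger molecule would simply be counted more than once.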
In the future, we might add pharmacophore feature points [3] (Donor, Acceptor, PosIonizable, NegIonizable, Aromatic, Hydrophobe) to allow a fuzzier description of molecules.
How to install the software
For beginners/non-opam users: download the latest self-installer shell script from https://github.com/UnixJunkie/molenc/releases.
Then execute:
./molenc-5.0.1.sh ~/usr/molenc-5.0.1
This will create ~/usr/molenc-5.0.1/bin/molenc.sh, among other things inside the same directory.
For opam users:
opam install molenc
Do not hesitate to contact the author if you have problems installing or using the software, or if you have any questions.
Usage
molenc.sh -i input.smi -o output.txt
[-d encoding.dix]: reuse existing feature dictionary
[-r i:j]: fingerprint radius (default=0:1)
[--pairs]: use atom pairs instead of Faulon's FP
[-m <int>]: maximum allowed atom-pair distance (default: no limit)
[--seq]: sequential mode (disable parallelization)
[-v]: debug mode; keep temp files
[-n <int>]: max jobs in parallel
[-c <int>]: chunk size
[--no-std]: don't standardize input file molecules (ONLY USE IF THEY HAVE ALREADY BEEN STANDARDIZED)
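As an illustration of what counted atom pairs with a maximum distance might look like, here is a small Python sketch. It rests on several assumptions that the options above do not confirm: distances are counted in bonds along the shortest path, pairs beyond the limit are simply dropped, and atom types are reduced to element symbols. The function names `bond_distances` and `atom_pairs` are hypothetical.

```python
from collections import Counter, deque

def bond_distances(adjacency, source):
    """Breadth-first search: bond-count distance from `source` to each reachable atom."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        a = queue.popleft()
        for b in adjacency[a]:
            if b not in dist:
                dist[b] = dist[a] + 1
                queue.append(b)
    return dist

def atom_pairs(atom_types, adjacency, max_dist=None):
    """Count (type_i, distance, type_j) features over all unordered atom pairs."""
    counts = Counter()
    for i in range(len(atom_types)):
        for j, d in bond_distances(adjacency, i).items():
            if j <= i:
                continue  # visit each unordered pair once, skip self-pairs
            if max_dist is not None and d > max_dist:
                continue  # assumed behavior of a distance limit
            lo, hi = sorted((atom_types[i], atom_types[j]))
            counts[(lo, d, hi)] += 1
    return counts

# Heavy atoms of ethanol (C-C-O) with simplified, element-only atom types.
types = ["C", "C", "O"]
adjacency = [[1], [0, 2], [1]]

all_pairs = atom_pairs(types, adjacency)
short_pairs = atom_pairs(types, adjacency, max_dist=1)
```

Sorting the two atom types makes the feature independent of the order in which the atoms are visited.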
How to encode a database of molecules:
molenc.sh -i molecules.smi -o molecules.txt
How to encode another database of molecules, but reusing the feature dictionary from another database:
molenc.sh -i other_molecules.smi -o other_molecules.txt -d molecules.txt.dix
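Reusing a dictionary keeps feature indices consistent between databases, so fingerprints from both can be compared index by index. The Python sketch below illustrates this with a read-only dictionary. It is an assumption-laden illustration: the function `encode_frozen` and the feature labels are hypothetical, and how molenc itself treats features missing from the dictionary is not stated here, so the sketch merely reports them separately.

```python
from collections import Counter

def encode_frozen(features, dictionary):
    """Encode with a read-only label -> index dictionary.

    Returns (fingerprint, unknown): sparse counts for known labels, and
    the list of labels that the dictionary does not contain.
    """
    fingerprint = Counter()
    unknown = []
    for label in features:
        index = dictionary.get(label)
        if index is None:
            unknown.append(label)  # illustrative policy only
        else:
            fingerprint[index] += 1
    return dict(fingerprint), unknown

# Hypothetical dictionary produced while encoding a first database.
reference = {"C-arom": 0, "O-hydroxyl": 1}
fp, unknown = encode_frozen(["O-hydroxyl", "S-thiol", "O-hydroxyl"], reference)
```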
Bibliography
[1] Faulon, J. L., Visco, D. P., & Pophale, R. S. (2003). The signature molecular descriptor. 1. Using extended valence sequences in QSAR and QSPR studies. Journal of Chemical Information and Computer Sciences, 43(3), 707-720.
[2] Carhart, R. E., Smith, D. H., & Venkataraghavan, R. (1985). Atom pairs as molecular features in structure-activity studies: definition and applications. Journal of Chemical Information and Computer Sciences, 25(2), 64-73.
[3] Kearsley, S. K., Sallamack, S., Fluder, E. M., Andose, J. D., Mosley, R. T., & Sheridan, R. P. (1996). Chemical similarity using physiochemical property descriptors. Journal of Chemical Information and Computer Sciences, 36(1), 118-127.
Dependencies (16)
- pyml >= "20211015"
- vector3
- parany >= "12.1.1"
- ocamlgraph
- ocaml >= "5.1.0"
- minicli >= "5.0.0"
- line_oriented >= "1.2.0"
- dune >= "1.11"
- dolog >= "5.0.0"
- dokeysto
- cpm >= "9.0.0"
- conf-rdkit
- conf-python-3
- conf-graphviz
- bst >= "2.0.0"
- batteries >= "3.5.0"
Dev Dependencies
None
Used by (7)
Conflicts
None