Alignment¶
The alignment process takes the available mappings and produces a set of stable
merges (output/domain_ontology/merges.csv).
In the merge example (one to one) below, the node SNOMED:31387002 is deemed
the same as (i.e. a duplicate of) the node MONDO:0004979; indeed, both nodes
represent the disease concept asthma.
| source_id | target_id |
|---|---|
| SNOMED:31387002 | MONDO:0004979 |
Stable means that, from the perspective of the source ontology, there are
no splits: no source node is merged into more than one target node. In the
merge example (many to one) below, two OMIM nodes are mapped to the same MONDO
node: we retain the concept representation granularity of the target ontology
but lose it from the perspective of the source ontology.
| source_id | target_id |
|---|---|
| OMIM:608584 | MONDO:0004979 |
| OMIM:600807 | MONDO:0004979 |
If a source concept matches two different concepts of the target ontology, merging is inappropriate; such cases are handled in the connectivity process.
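The stability condition can be checked mechanically. The following is a minimal sketch, not the project's own code: it assumes merges are available as (source_id, target_id) pairs, and both the function name `find_splits` and the ID `EXAMPLE:1` are made up for illustration.

```python
from collections import defaultdict


def find_splits(merges):
    """Return the source IDs that are merged into more than one target."""
    targets_by_source = defaultdict(set)
    for source_id, target_id in merges:
        targets_by_source[source_id].add(target_id)
    # A merge set is stable when this dictionary is empty.
    return {s: t for s, t in targets_by_source.items() if len(t) > 1}


# The one-to-one and many-to-one examples from this page.
merges = [
    ("SNOMED:31387002", "MONDO:0004979"),
    ("OMIM:608584", "MONDO:0004979"),
    ("OMIM:600807", "MONDO:0004979"),
]
stable = find_splits(merges) == {}

# Hypothetical split: one source node pointing at two different targets.
unstable = find_splits(merges + [("OMIM:608584", "EXAMPLE:1")])
```

Many sources sharing one target pass the check; only a source with several targets is reported.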
Steps¶
Establishing source alignment order¶
First we produce the source alignment order. The alignment process is a
sequence of steps, where in each step we attempt to merge nodes onto a given
ontology. The goal is to end up with the minimal number of nodes, from the
minimal number of different sources.
Therefore the alignment order puts the seed ontology first (this should have
the most mappings and the desired hierarchy), followed by the remaining
ontologies ordered by the frequency of their nodes. For example, in the
example data set the order would be: MONDO, MEDDRA, ICD10CM, MESH, ...
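One way to sketch this ordering is shown below. It rests on assumptions: node IDs are taken to be CURIE-style strings (PREFIX:local_id), "frequency" is read as the number of nodes per source in descending order, and the helper names and toy node IDs are invented for the example.

```python
from collections import Counter


def source_of(node_id):
    """Return the source prefix of a CURIE-style node ID (assumed format)."""
    return node_id.split(":", 1)[0]


def alignment_order(node_ids, seed):
    """Seed ontology first, then the other sources by descending node count."""
    counts = Counter(source_of(n) for n in node_ids)
    rest = [s for s in counts if s != seed]
    # Ties are broken alphabetically so the order is deterministic.
    rest.sort(key=lambda s: (-counts[s], s))
    return [seed] + rest


# Toy node list; the local IDs are made up.
nodes = [
    "MONDO:1", "MESH:1",
    "MEDDRA:1", "MEDDRA:2", "MEDDRA:3",
    "ICD10CM:1", "ICD10CM:2",
]
order = alignment_order(nodes, seed="MONDO")
```

The seed is placed first regardless of how many nodes it contributes.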
Pre-processing mappings¶
Next we pre-process the mappings.
Mappings typically contain internal code reassignments: mappings between
nodes of the same source that describe the new code(s) of deprecated nodes.
In the internal code reassignment mapping example below, the node
MONDO:0022856 was deprecated and replaced by the node MONDO:0001217. The input
table nodes_obsolete.csv helps to determine which node ID is deprecated and
which is current (some input mappings may be parsed inconsistently with regard
to directionality).
| source_id | target_id |
|---|---|
| MONDO:0022856 | MONDO:0001217 |
The internal code reassignments are removed from the full mapping set
(mappings.csv). The remaining mapping set is then updated using the internal
code reassignment mappings (the full mapping set may contain mappings from
different sources that are not necessarily up to date, as described here).
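The pre-processing step can be illustrated with a small sketch. It is a simplification under stated assumptions: mappings are (source_id, target_id) pairs, a mapping between two IDs with the same prefix is treated as an internal code reassignment, each deprecated node has a single replacement, and chains of reassignments are not followed. The function names and the ID `EXAMPLE:1` are hypothetical.

```python
def source_of(node_id):
    """Return the source prefix of a CURIE-style node ID (assumed format)."""
    return node_id.split(":", 1)[0]


def preprocess(mappings, obsolete_ids):
    """Split off internal code reassignments and apply them to the rest."""
    reassignments = {}  # deprecated ID -> current ID
    remaining = []
    for a, b in mappings:
        if source_of(a) == source_of(b):
            # Use the obsolete-node list to fix the direction if needed.
            if b in obsolete_ids and a not in obsolete_ids:
                a, b = b, a
            reassignments[a] = b
        else:
            remaining.append((a, b))
    # Replace deprecated IDs on either side of the remaining mappings.
    updated = [
        (reassignments.get(a, a), reassignments.get(b, b))
        for a, b in remaining
    ]
    return reassignments, updated


mappings = [
    ("MONDO:0001217", "MONDO:0022856"),  # reassignment, parsed "backwards"
    ("MONDO:0022856", "EXAMPLE:1"),      # made-up mapping using the old code
]
reassignments, updated = preprocess(mappings, obsolete_ids={"MONDO:0022856"})
```

After this step the remaining mappings only refer to current node IDs, so a mapping recorded against a deprecated code is not lost.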
Aligning sources for mapping type groups¶
The alignment is run in several batches, where each batch aligns nodes to each source as specified by the source alignment order. First it uses the strongest mapping relation type group, equivalence, then database reference and the rest.
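The batch structure amounts to two nested loops. The sketch below only shows the iteration order; the function name and the group identifier `database_reference` are assumptions, and the remaining type groups are left out.

```python
def alignment_batches(type_groups, source_order):
    """Yield (type_group, source) pairs in the order they would be aligned."""
    for type_group in type_groups:   # strongest mapping type group first
        for source in source_order:  # seed ontology first
            yield type_group, source


steps = list(alignment_batches(
    ["equivalence", "database_reference"],
    ["MONDO", "MEDDRA", "ICD10CM", "MESH"],
))
```

Every source is visited with equivalence mappings before any source is visited with a weaker group.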
Aligning nodes to a source¶
This process is repeated for each source.
- Filter mappings for the source: filters the available mappings to find all where either the source or the target node is from the given source (e.g. MONDO).
- Filter mappings for the permitted type: filters the source mappings for the mapping type group (e.g. equivalence).
- Orient mappings towards source: source and target node IDs are flipped where necessary so that the target node ID is always from the ontology that we are aligning onto (e.g. MONDO:0004979, OMIM:608584 becomes OMIM:608584, MONDO:0004979).
- Get one or many source to one target node mappings:
    - mappings are de-duplicated: if two mappings have the same source and target node IDs but a different mapping relation, they are reduced to one (note that these are in the same type group).
    - filtering for unmapped nodes: only those mappings are retained where the source node ID is unmapped.
    - filtering for multiplicity: only those mappings are kept that won't form a one source to many target (i.e. split) mapping cluster, so the remaining mappings are one or many source to one target node mappings. The rest are dropped and saved for debugging in the folder PROJECT_FOLDER/output/intermediate/dropped_mappings, in files named with the step ID, aligned source ID and the mapping strength (e.g. ../equivalence_1_MONDO.csv).
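The steps in the list above can be sketched as a single function. This is an illustration under assumptions rather than the project's implementation: mappings are (source_id, target_id, relation) triples, "unmapped" is read as "not merged in an earlier step", and the relation names, the `relation_groups` lookup and the `EXAMPLE:` IDs are hypothetical.

```python
def source_of(node_id):
    """Return the source prefix of a CURIE-style node ID (assumed format)."""
    return node_id.split(":", 1)[0]


def align_to_source(mappings, source, type_group, relation_groups, merged_ids):
    """Return (merges, dropped) for one source and one mapping type group."""
    oriented = set()  # a set of pairs also de-duplicates across relations
    for a, b, relation in mappings:
        # Filter for the permitted type group.
        if relation_groups.get(relation) != type_group:
            continue
        # Filter for the source and orient so the target is from `source`.
        if source_of(b) == source and source_of(a) != source:
            oriented.add((a, b))
        elif source_of(a) == source and source_of(b) != source:
            oriented.add((b, a))
    # Keep only mappings whose source node has not been merged yet.
    candidates = sorted(p for p in oriented if p[0] not in merged_ids)
    # Multiplicity filter: a source with several targets would be a split.
    targets = {}
    for s, t in candidates:
        targets.setdefault(s, set()).add(t)
    merges = [p for p in candidates if len(targets[p[0]]) == 1]
    dropped = [p for p in candidates if len(targets[p[0]]) > 1]
    return merges, dropped


relation_groups = {  # hypothetical relation names
    "equivalent_to": "equivalence",
    "exact_match": "equivalence",
    "xref": "database_reference",
}
mappings = [
    ("MONDO:0004979", "OMIM:608584", "equivalent_to"),  # needs flipping
    ("OMIM:608584", "MONDO:0004979", "exact_match"),    # duplicate pair
    ("OMIM:600807", "MONDO:0004979", "equivalent_to"),
    ("EXAMPLE:1", "MONDO:0004979", "equivalent_to"),    # split, part 1
    ("EXAMPLE:1", "MONDO:0001217", "equivalent_to"),    # split, part 2
    ("EXAMPLE:2", "MONDO:0004979", "xref"),             # other type group
]
merges, dropped = align_to_source(
    mappings, "MONDO", "equivalence", relation_groups, merged_ids=set()
)
```

In this toy run the two OMIM nodes form a many-to-one merge onto MONDO:0004979, while both mappings of the split node are dropped rather than one being picked arbitrarily.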
Filtered mappings are saved as merges to the source