Project structure and data¶

This page describes the project folder structure, and the input and output data.

Project folder organisation¶

We recommend creating an empty folder for the project. The path to this folder is the input to the API. The input folder must be created manually and populated with the data; the API generates the remaining folders.

PROJECT_FOLDER
├── input
│   └── ...
└── output
    ├── domain_ontology
    │   ├── ...
    ├── intermediate
    │   ├── ...
    └── report
        └── ...
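As a minimal sketch of the manual step described above, the project folder and its empty input folder could be prepared with Python's standard library. The function name `create_project_folder` is hypothetical and not part of the API; the demonstration uses a temporary directory so it has no side effects.

```python
import tempfile
from pathlib import Path


def create_project_folder(project_path: Path) -> Path:
    """Create the project folder with an empty `input` subfolder.

    Hypothetical helper: only `input` is created here, because the
    remaining folders are generated by the API.
    """
    input_path = Path(project_path) / "input"
    input_path.mkdir(parents=True, exist_ok=True)
    return input_path


# Demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    input_dir = create_project_folder(Path(tmp) / "PROJECT_FOLDER")
    input_dir_name = input_dir.name
    input_dir_created = input_dir.is_dir()
    print(input_dir_name, input_dir_created)
```

The input tables then need to be copied into the returned folder before the project path is handed to the API.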

Data schema¶

General points:

The default_id column is mandatory in the nodes and nodes_obsolete tables.
The default_label column is optional in all tables.
source_id and target_id must be in the same format as default_id; for more on this, see Node ID schema.

Node ID schema¶

All node IDs must be formatted as NAMESPACE:CODE, for example MONDO:0000123 or SNOMED:1233425, where

NAMESPACE denotes the source of the node, e.g. MONDO, SNOMED; this should be all caps for uniformity.
CODE is the actual node identifier code, which must not contain the : character.
The NAMESPACE and CODE parts must be joined by the : character.
There must be exactly one : character in the node ID.
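The rules above can be sketched as a small validator. The function names `parse_node_id` and `is_valid_node_id` are hypothetical and not part of the library. Because the all-caps namespace is a recommendation rather than a hard rule, this sketch makes that check optional through an assumed `require_upper_namespace` flag.

```python
from typing import Tuple


def parse_node_id(node_id: str) -> Tuple[str, str]:
    """Split a node ID into (NAMESPACE, CODE), or raise ValueError."""
    parts = node_id.split(":")
    # Exactly one ":" yields exactly two parts.
    if len(parts) != 2:
        raise ValueError(f"expected exactly one ':' in {node_id!r}")
    namespace, code = parts
    if not namespace or not code:
        raise ValueError(f"empty NAMESPACE or CODE in {node_id!r}")
    return namespace, code


def is_valid_node_id(node_id: str, require_upper_namespace: bool = True) -> bool:
    """Return True if the node ID follows the NAMESPACE:CODE schema."""
    try:
        namespace, _ = parse_node_id(node_id)
    except ValueError:
        return False
    # Optional uniformity check: the namespace contains no lowercase letters.
    if require_upper_namespace and namespace != namespace.upper():
        return False
    return True


print(parse_node_id("MONDO:0000123"))
print(is_valid_node_id("MONDO:0000:123"))
```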

Input data¶

The input consists of four tables with the schemas specified below. All of the tables must be present; however, nodes_obsolete.csv can be an empty table containing only a header.

PROJECT_FOLDER
├── input
│   ├── config.json
│   ├── edges_hierarchy.csv
│   ├── mappings.csv
│   ├── nodes.csv
│   └── nodes_obsolete.csv
├── ...
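Before running the API, it may help to confirm that the input folder contains the expected files. The sketch below is an assumption-laden illustration: `missing_input_files` is a hypothetical helper, and `config.json` is treated as expected only because it appears in the folder tree above.

```python
import tempfile
from pathlib import Path
from typing import List

# File names taken from the folder tree above.
EXPECTED_INPUT_FILES = (
    "config.json",
    "edges_hierarchy.csv",
    "mappings.csv",
    "nodes.csv",
    "nodes_obsolete.csv",
)


def missing_input_files(project_path: Path) -> List[str]:
    """Return the expected input file names that are not present."""
    input_path = Path(project_path) / "input"
    return [
        name for name in EXPECTED_INPUT_FILES
        if not (input_path / name).is_file()
    ]


# Demonstration: a project that contains only nodes.csv so far.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "PROJECT_FOLDER"
    (project / "input").mkdir(parents=True)
    (project / "input" / "nodes.csv").touch()
    missing = missing_input_files(project)
    print(missing)
```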

Table `nodes.csv`¶

default_id	default_label
MONDO:0004979	asthma

Table `nodes_obsolete.csv`¶

default_id	default_label
MONDO:0006775	obsolete haemophilus influenzae meningitis

Table `edges_hierarchy.csv`¶

source_id	target_id	relation	prov
MONDO:0008798	MONDO:0019211	rdfs:subClassOf	MONDO

Table `mappings.csv`¶

source_id	target_id	relation	prov
MONDO:0004979	SNOMED:31387002	equivalent_to	MONDO
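The table schemas above can be checked with a short header test. This is a sketch under stated assumptions: the files are taken to be comma-delimited (the `delimiter` parameter can be changed if they are not), all four columns shown for the edge and mapping tables are assumed to be required, and `missing_columns` is a hypothetical helper rather than a library function.

```python
import csv
import io
from typing import List

# Required columns per table; default_label is optional and therefore omitted.
REQUIRED_COLUMNS = {
    "nodes.csv": ["default_id"],
    "nodes_obsolete.csv": ["default_id"],
    "edges_hierarchy.csv": ["source_id", "target_id", "relation", "prov"],
    "mappings.csv": ["source_id", "target_id", "relation", "prov"],
}


def missing_columns(table_name: str, text: str, delimiter: str = ",") -> List[str]:
    """Return the required columns absent from the table's header row."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader, [])  # an entirely empty file has no header
    return [col for col in REQUIRED_COLUMNS[table_name] if col not in header]


nodes_text = "default_id,default_label\nMONDO:0004979,asthma\n"
obsolete_text = "default_id,default_label\n"  # header only, no rows
broken_mappings_text = "source_id,target_id\nMONDO:0004979,SNOMED:31387002\n"

print(missing_columns("nodes.csv", nodes_text))
print(missing_columns("nodes_obsolete.csv", obsolete_text))
print(missing_columns("mappings.csv", broken_mappings_text))
```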

Output data¶

PROJECT_FOLDER
├── input
│   └── ...
└── output
    ├── domain_ontology
    │   ├── edges_hierarchy.csv
    │   ├── mappings.csv
    │   ├── merges.csv
    │   └── nodes.csv
    ├── intermediate
    │   ├── data_tests
    │   │   └── ...
    │   ├── dropped_mappings
    │   │   └── equivalence_1_MONDO.csv
    │   ├── ...
    └── report
        ├── data_docs
        │   └── ...
        ├── data_profile_reports
        │   ├── nodes_report.html
        │   └── ...
        ├── index.html
        └── logs
            └── onto-merger.logger

Folders¶

domain_ontology: contains the output of the alignment process, i.e. the final merged ontology in table format.
intermediate: contains the intermediate files generated during the alignment process, data testing and data profiling.
report: contains the report of the alignment process with links to the data profiling and data testing pages. It also includes the log output file, onto-merger.logger.

Table `edges_hierarchy.csv`¶

Same format as in the input set. Contains the merged ontology hierarchy.

Table `mappings.csv`¶

Same format as in the input set. Node IDs may be updated with canonical node IDs and/or internal code reassignments.

Table `merges.csv`¶

source_id	target_id
SNOMED:31387002	MONDO:0004979
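One way to use this table is as a lookup from a merged node ID to the node it was merged into. The sketch below assumes that source_id is the merged node and target_id is the canonical node, which matches the example row but is not stated explicitly here; `canonical_id` is a hypothetical helper and performs a single lookup only.

```python
from typing import Dict

# The example row above, held as a source_id -> target_id mapping.
merges: Dict[str, str] = {"SNOMED:31387002": "MONDO:0004979"}


def canonical_id(node_id: str, merges: Dict[str, str]) -> str:
    """Return the assumed canonical ID, or the ID itself if it was not merged."""
    return merges.get(node_id, node_id)


print(canonical_id("SNOMED:31387002", merges))
print(canonical_id("MONDO:0004979", merges))
```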

Table `nodes.csv`¶

Same format as in the input set. Contains the final set of nodes (i.e. canonical nodes). Merged nodes are excluded from this table.
