Project structure and data

This page describes the project folder structure, and the input and output data.

Project folder organisation

We recommend creating an empty folder for the project. The path of this folder is the input to the API. The input folder must be created and populated with the data manually; the remaining folders are generated by the API.

PROJECT_FOLDER
├── input
│   └── ...
└── output
    ├── domain_ontology
    │   ├── ...
    ├── intermediate
    │   ├── ...
    └── report
        └── ...

Data schema

General points:

  • The default_id column is mandatory in the nodes and nodes_obsolete tables.

  • The default_label column is optional in all tables.

  • source_id and target_id must use the same format as default_id;
    see the node ID schema for details.

Node ID schema

All node IDs must be formatted as NAMESPACE:CODE, for example MONDO:0000123 or SNOMED:1233425, where

  • NAMESPACE: denotes the source of the node, e.g. MONDO, SNOMED;
    this should be all caps for uniformity.
  • CODE: the node identifier code itself; it must not contain the : character.

  • NAMESPACE and CODE parts must be concatenated by the : character.

  • There must be only one : character in the node ID.
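The rules above can be checked with a few lines of Python. This is a minimal sketch, not part of the API: the function name is_valid_node_id and the require_upper_namespace switch are hypothetical, and rejecting an empty NAMESPACE or CODE is an assumption that the schema does not state explicitly.

```python
def is_valid_node_id(node_id: str, require_upper_namespace: bool = True) -> bool:
    """Check a node ID against the NAMESPACE:CODE rules described above."""
    # There must be exactly one ':' in the node ID.
    if node_id.count(":") != 1:
        return False
    namespace, code = node_id.split(":")
    # Assumption: neither part may be empty.
    if not namespace or not code:
        return False
    # The schema recommends an all-caps NAMESPACE for uniformity.
    if require_upper_namespace and namespace != namespace.upper():
        return False
    return True
```

Because the all-caps namespace is a recommendation ("should"), the check can be relaxed by passing require_upper_namespace=False.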


Input data

The input requires four tables with the schemas specified below. All four tables must be present; however, nodes_obsolete.csv can be an empty table containing only a header.

PROJECT_FOLDER
├── input
│   ├── config.json
│   ├── edges_hierarchy.csv
│   ├── mappings.csv
│   ├── nodes.csv
│   └── nodes_obsolete.csv
├── ...
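Before running the API it can be useful to confirm that the input folder is complete. The sketch below is an illustration only: the helper name missing_input_files is hypothetical, and the file list is copied from the folder tree above (the four tables plus config.json).

```python
from pathlib import Path

# File names taken from the input folder tree shown above.
EXPECTED_INPUT_FILES = [
    "config.json",
    "edges_hierarchy.csv",
    "mappings.csv",
    "nodes.csv",
    "nodes_obsolete.csv",
]


def missing_input_files(project_folder: str) -> list[str]:
    """Return the expected input files that are absent from PROJECT_FOLDER/input."""
    input_folder = Path(project_folder) / "input"
    return [
        name
        for name in EXPECTED_INPUT_FILES
        if not (input_folder / name).is_file()
    ]
```

An empty list means every expected file is present; otherwise the returned names show what still needs to be added to the input folder.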

Table nodes.csv

default_id       default_label
-------------    -------------
MONDO:0004979    asthma

Table nodes_obsolete.csv

default_id       default_label
-------------    ------------------------------------------
MONDO:0006775    obsolete haemophilus influenzae meningitis

Table edges_hierarchy.csv

source_id        target_id        relation           prov
-------------    -------------    ---------------    -----
MONDO:0008798    MONDO:0019211    rdfs:subClassOf    MONDO

Table mappings.csv

source_id        target_id          relation         prov
-------------    ---------------    -------------    -----
MONDO:0004979    SNOMED:31387002    equivalent_to    MONDO
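The table headers can be checked against the schema before the data is passed to the API. The following is a rough sketch under stated assumptions: the helper missing_columns is hypothetical, and treating source_id, target_id, relation and prov as required in the edge and mapping tables is inferred from the examples above rather than stated by the schema. default_label is left out because it is optional.

```python
import csv
import io

# Assumption: required columns per table, inferred from the example tables.
# default_label is optional and therefore not listed.
EXPECTED_COLUMNS = {
    "nodes.csv": ["default_id"],
    "nodes_obsolete.csv": ["default_id"],
    "edges_hierarchy.csv": ["source_id", "target_id", "relation", "prov"],
    "mappings.csv": ["source_id", "target_id", "relation", "prov"],
}


def missing_columns(table_name: str, csv_text: str) -> list[str]:
    """Return the expected columns that are absent from the CSV header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty file yields an empty header
    return [col for col in EXPECTED_COLUMNS[table_name] if col not in header]
```

For example, passing the text of a nodes.csv file whose header is default_id,default_label returns an empty list, since the only required column is present.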


Output data

PROJECT_FOLDER
├── input
│   └── ...
└── output
    ├── domain_ontology
    │   ├── edges_hierarchy.csv
    │   ├── mappings.csv
    │   ├── merges.csv
    │   └── nodes.csv
    ├── intermediate
    │   ├── data_tests
    │   │   └── ...
    │   ├── dropped_mappings
    │   │   └── equivalence_1_MONDO.csv
    │   ├── ...
    └── report
        ├── data_docs
        │   └── ...
        ├── data_profile_reports
        │   ├── nodes_report.html
        │   └── ...
        ├── index.html
        └── logs
            └── onto-merger.logger

Folders

  • domain_ontology: contains the output of the alignment process, i.e. the
    final merged ontology in table format.
  • intermediate: contains the intermediate files generated during the
    alignment process, data testing and data profiling.
  • report: contains the report of the alignment process with links to the
    data profiling and data testing pages. Also includes the log output file:
    onto-merger.logger.

Table edges_hierarchy.csv

Same format as in the input set. Contains the merged ontology hierarchy.

Table mappings.csv

Same format as in the input set. Node IDs may be updated with canonical node IDs and/or internal code reassignments.

Table merges.csv

source_id          target_id
---------------    -------------
SNOMED:31387002    MONDO:0004979
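One possible use of merges.csv is to translate an original node ID into its canonical counterpart. The sketch below assumes that each row maps a merged node (source_id) to the canonical node it was merged into (target_id); the example row suggests this reading, but it is an interpretation rather than a documented guarantee. The helper names load_merges and to_canonical are hypothetical.

```python
import csv
import io


def load_merges(csv_text: str) -> dict[str, str]:
    """Build a lookup from merged node ID to canonical node ID.

    Assumption: source_id is the merged node and target_id is its canonical node.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["source_id"]: row["target_id"] for row in reader}


def to_canonical(node_id: str, merges: dict[str, str]) -> str:
    """Return the canonical ID for a node, or the ID itself if it was not merged."""
    return merges.get(node_id, node_id)
```

With the example row above, to_canonical("SNOMED:31387002", merges) gives "MONDO:0004979", while an ID that does not appear as a source_id is returned unchanged.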

Table nodes.csv

Same format as in the input set. Contains the final set of nodes (i.e. canonical nodes). Merged nodes are excluded from this table.