HIDE™: Health Information DE-identification

Version History - HIDE-1.0
This page is provided for informational purposes. For the latest download, please visit the software page.

Installation

Prerequisites

  • Perl v5.10.0 or later (earlier versions may work but are not supported).
  • File::Cat Perl module (hint: sudo perl -MCPAN -e "install File::Cat").
  • Java 1.6 or later (earlier versions may work but are not supported).
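
A quick way to confirm that the tools are reachable before installing is sketched below in Python. This is not part of HIDE: the function name `missing_prerequisites` and its injectable `which`/`run` parameters are assumptions made for illustration. It also looks for `ant`, because the installation steps call it to compile MALLET. It only checks for presence and does not verify version numbers.

```python
import shutil
import subprocess


def missing_prerequisites(which=shutil.which, run=subprocess.run):
    """Return the names of prerequisites that could not be found.

    `which` and `run` default to the standard-library functions and are
    parameters only so the check can be exercised without real tools.
    """
    # Executables that must be on the PATH (ant is needed to build MALLET).
    missing = [tool for tool in ("perl", "java", "ant") if which(tool) is None]

    if "perl" not in missing:
        # perl exits with a non-zero status if the -M module cannot be loaded.
        result = run(["perl", "-MFile::Cat", "-e", "1"], capture_output=True)
        if result.returncode != 0:
            missing.append("File::Cat")

    return missing
```

Calling `missing_prerequisites()` returns an empty list when everything was found, or the names of whatever is missing otherwise.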

Instructions

  • 1. Install all of the prerequisites.
  • 2. Extract the hide-1.0.tar.gz archive.
      tar -xvzf hide-1.0.tar.gz
  • 3. Go to the hide-1.0/lib directory, extract MALLET, create a symbolic link to it, and compile it with ant.
      cd path/to/hide-1.0/lib
      tar -xvzf mallet-2.0-RC2.tar.gz
      ln -s mallet-2.0-RC2 mallet
      cd ./mallet
      ant

Optional

    An annotation tool can be used to annotate files and generate training data. We currently recommend Callisto.

    Usage
    HIDE must be run from the top-level HIDE installation directory. Before running HIDE, change to that directory:
    cd /path/to/hide-1.0/

    TRAIN - Training a new mark-up/de-identification model

    HIDE-train.pl crfmodelfile [list of files]

    The list of files consists of the marked-up, SGML-formatted files on which you wish to train the CRF classifier.

    MARK-UP - Using a mark-up/de-identification model to mark-up new files

    HIDE-markup.pl crfmodelfile outputdir [list of files]

    The list of files contains the files you wish to mark up with SGML tags using the trained CRF model.
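
When many files need to be marked up, the argument list can be assembled programmatically. The following Python sketch is not part of HIDE: the helper name `build_markup_command` and the file names are hypothetical, and launching the script through the `perl` interpreter is an assumption. Only the argument order (model file, output directory, then input files) comes from the usage line above.

```python
def build_markup_command(model_file, output_dir, input_files):
    """Assemble the HIDE-markup.pl argument list in the documented order."""
    # Assumption: the script is launched through the perl interpreter.
    command = ["perl", "HIDE-markup.pl", str(model_file), str(output_dir)]
    # Input files follow the model file and the output directory.
    command.extend(str(path) for path in input_files)
    return command


# Hypothetical file names, for illustration only.
command = build_markup_command("model.crf", "marked_up", ["note1.txt", "note2.txt"])
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command, cwd="/path/to/hide-1.0", check=True)` so that the script runs from the top-level installation directory, as required.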

    DE-ID

    (from original text files)

    HIDE crfmodelfile outputdir deidconfigfile [list of files]

    The list of files contains the files you wish to de-identify; the results are placed in the outputdir directory.

    Remember to modify the deidconfigfile to match the tags that you used to mark up your data.

    (from marked up files)

    ./lib/HIDE-DEID.pl deidconfigfile outputdir [list of files]

    The list of files contains the marked-up files whose SGML tags are to be replaced according to the configuration in the deidconfigfile; the results are placed in the outputdir directory.

    Remember to modify the deidconfigfile to match the tags that you used to mark up your data.

    Input/Output File Format

    HIDE reads and writes SGML exclusively. Only the tag name is used as the label for the enclosed word or phrase; SGML attributes are not supported, so <name type="sometype"> is interpreted the same as <name>.
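
To make the labelling rule concrete, here is a minimal Python sketch that pulls (label, phrase) pairs out of a marked-up string and discards any attributes. It is an illustration only, not HIDE's own parser: the name `extract_labels` and the regular expression are assumptions, and it assumes tags are properly closed and not nested.

```python
import re

# Matches <tag ...>text</tag>. Anything after the tag name inside the
# opening tag (i.e. attributes) is matched but not captured.
TAGGED_SPAN = re.compile(r"<(\w+)(?:\s[^>]*)?>(.*?)</\1>", re.DOTALL)


def extract_labels(text):
    """Return a (label, phrase) pair for each tagged span in text."""
    return [(m.group(1), m.group(2)) for m in TAGGED_SPAN.finditer(text)]


sample = 'Seen by <name type="sometype">Jane Roe</name>, age <age>36</age>.'
print(extract_labels(sample))
# → [('name', 'Jane Roe'), ('age', '36')]
```

The `type` attribute has no effect on the result: the span is labelled `name` either way.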

    Example

    This is a made-up pathology report about <name>John Doe</name>. He is a <age>36</age> year old male. This report is in reference to <MRN>1234123-123</MRN>.

    The above example can be used to train the CRF classifier. Note: a single training example is unlikely to be enough; you will want many examples to train the markup/de-identification model. Again, an annotation tool such as Callisto can be used.
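
Hand-annotated training files can easily contain a mistyped closing tag (for example, <name> where </name> was intended). The Python sketch below flags such mismatches before training. It is a hypothetical helper, not part of HIDE, and it assumes labelled spans are never nested.

```python
import re

# Matches an opening or closing tag; attributes, if any, are ignored.
TAG = re.compile(r"<(/?)(\w+)(?:\s[^>]*)?>")


def find_tag_errors(text):
    """Return a list of human-readable tag pairing problems in text."""
    errors = []
    open_tag = None  # name of the tag currently open, if any
    for match in TAG.finditer(text):
        closing = match.group(1) == "/"
        name = match.group(2)
        if not closing:
            if open_tag is not None:
                errors.append(f"<{name}> opened before <{open_tag}> was closed")
            open_tag = name
        elif open_tag is None:
            errors.append(f"</{name}> has no matching opening tag")
        else:
            if name != open_tag:
                errors.append(f"</{name}> closes <{open_tag}>")
            open_tag = None
    if open_tag is not None:
        errors.append(f"<{open_tag}> is never closed")
    return errors


print(find_tag_errors("Patient <name>Jane Roe<name> is <age>36</age>."))
```

An empty list means every opening tag was followed by its matching closing tag.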

    The output from HIDE will be in the same format as the example above.