
How to use a CSV with Pheno-Ranker

If you have a CSV, you can use Pheno-Ranker too 😄 !

About inter-cohort analysis with CSV

Please note that performing inter-cohort analysis may be difficult due to potential inconsistencies or lack of common variables across different CSVs.
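Before comparing cohorts, it can help to check which columns the CSVs actually share. The following is a minimal Python sketch, not part of Pheno-Ranker; the cohort contents, column names, and helper functions are hypothetical and only illustrate the idea.

```python
import csv
import io


def read_header(text, sep=";"):
    """Return the header row of a CSV given as a string."""
    reader = csv.reader(io.StringIO(text), delimiter=sep)
    return next(reader)


def shared_columns(*headers):
    """Return the sorted column names present in every header."""
    common = set(headers[0])
    for header in headers[1:]:
        common &= set(header)
    return sorted(common)


# Two made-up cohorts with partially overlapping variables
cohort_a = "Id;Sex;Age;Diagnosis\nA1;Male;34;D1\n"
cohort_b = "Id;Sex;Ethnicity;Diagnosis\nB1;Female;X;D2\n"

common = shared_columns(read_header(cohort_a), read_header(cohort_b))
print(common)
```

Columns missing from this list exist in only one of the cohorts, so they cannot contribute to a like-for-like comparison.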

Best practices for CSV data quality

Here are a few important considerations regarding the CSV format. Most of these are common-sense guidelines:

  • Ensure there are no duplicated column names (headers).

  • While it's acceptable not to include ontologies/terminologies, please maintain consistent nomenclature for values (e.g., avoid using M and Male to refer to the same concept).

  • For columns, you can use any separator of your choice (default is ;), but if you have nested values in columns, you must specify the delimiter with --array-separator (default is |).

  • Pheno-Ranker was built with speed in mind, but if you have more than 10K rows, be aware that the calculations may take more than a few seconds.

  • Qualitative values are preferred over quantitative ones. If your data includes quantitative values (especially numerical or floating-point numbers), we recommend discretizing them into ranges.
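Some of the guidelines above can be checked or applied before conversion. The sketch below is an illustration in Python under stated assumptions: the helper names, the synonym map, and the age ranges are all hypothetical and are not provided by Pheno-Ranker.

```python
from collections import Counter


def duplicated_headers(header):
    """Return column names that appear more than once, in first-seen order."""
    counts = Counter(header)
    found = []
    for name in header:
        if counts[name] > 1 and name not in found:
            found.append(name)
    return found


# Hypothetical synonym map: one spelling per concept
SEX_SYNONYMS = {"M": "Male", "Male": "Male", "F": "Female", "Female": "Female"}


def normalize(value, synonyms):
    """Replace a value by its agreed spelling, leaving unknown values as-is."""
    return synonyms.get(value, value)


def discretize(value, edges, labels):
    """Map a number to the label of the first range whose upper edge exceeds it.

    `labels` must have one more element than `edges`; the last label
    covers everything at or above the last edge.
    """
    for upper, label in zip(edges, labels):
        if value < upper:
            return label
    return labels[-1]


# Hypothetical age ranges
AGE_EDGES = [18, 40, 65]
AGE_LABELS = ["0-17", "18-39", "40-64", "65+"]

print(duplicated_headers(["Id", "Sex", "Age", "Sex"]))
print(normalize("M", SEX_SYNONYMS))
print(discretize(34.5, AGE_EDGES, AGE_LABELS))
```

The choice of ranges depends on the variable and the question being asked; the edges used here are only an example.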

If your dataset contains both quantitative and qualitative variables, consider exploring Factor Analysis of Mixed Data. Alternatively, if your data is purely numeric, you may find K-means clustering useful.

OK, let's convert a CSV then.

Imagine you have a file named example.csv that uses ; as a column separator, and it looks like this:

Foo;Bar;Baz
foo1;bar1a,bar1b;baz1
foo2;bar2a,bar2b;baz2

Column Bar is an array and columns Foo and Baz aren't. Your file does not have a column that can be used as an identifier for each row.

OK, we are going to use the included utility to convert example.csv:

csv2pheno-ranker -i example.csv --generate-primary-key --primary-key-name Id --array-separator ','

Where:

  • --generate-primary-key forces the generation of a primary key field for each record in your CSV, if one does not already exist. Use this option when your data lacks a unique identifier. The name of the newly created primary key field should be specified using --primary-key-name.

  • --primary-key-name Id specifies the name Id for the primary key field. This option is used together with --generate-primary-key to name the newly generated primary key field, or alone, to identify the existing field to be used as a primary key in your CSV data. The specified field must be a single-value field (non-array).

  • --array-separator ',' specifies the delimiter for nested values in columns.

One of the results will be this file named example.json:

[
   {
      "Bar" : [
         "bar1a",
         "bar1b"
      ],
      "Baz" : "baz1",
      "Foo" : "foo1",
      "Id" : "PR_00000001"
   },
   {
      "Bar" : [
         "bar2a",
         "bar2b"
      ],
      "Baz" : "baz2",
      "Foo" : "foo2",
      "Id" : "PR_00000002"
   }
]

And the other will be this file named example_config.yaml:

---
allowed_terms:
- Bar
- Baz
- Foo
- Id
array_terms:
- Bar
format: CSV
id_correspondence:
  CSV:
    - Bar: Bar
primary_key: Id
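To make the mapping from CSV rows to JSON records concrete, here is a rough Python sketch that reproduces the records shown above for this small example. It is not how csv2pheno-ranker is implemented: the function name and its parameters are hypothetical, and the array columns are passed in explicitly rather than detected.

```python
import csv
import io
import json


def csv_to_records(text, sep=";", array_sep="|", array_columns=(), key_name="Id"):
    """Turn CSV text into a list of dicts with a generated primary key."""
    reader = csv.DictReader(io.StringIO(text), delimiter=sep)
    records = []
    for count, row in enumerate(reader, start=1):
        record = {}
        for column, value in row.items():
            if column in array_columns:
                # Nested values become a JSON array
                record[column] = value.split(array_sep)
            else:
                record[column] = value
        # Same ID pattern as in the example output (PR_ plus 8 digits)
        record[key_name] = f"PR_{count:08d}"
        records.append(record)
    return records


example = "Foo;Bar;Baz\nfoo1;bar1a,bar1b;baz1\nfoo2;bar2a,bar2b;baz2\n"
records = csv_to_records(example, array_sep=",", array_columns={"Bar"})
print(json.dumps(records, indent=3, sort_keys=True))
```

The sketch only covers the JSON side; the configuration file is still produced by csv2pheno-ranker itself.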

Once you have these two files you can run Pheno-Ranker by using:

pheno-ranker -r example.json --config example_config.yaml

If you want to exclude or include columns (i.e., terms) you can use the corresponding flag:

pheno-ranker -r example.json --exclude-terms Id Foo --config example_config.yaml

Remember to always use --config example_config.yaml.

Good luck!

NAME

csv2pheno-ranker: A script to convert a CSV to an input suitable for Pheno-Ranker

SYNOPSIS

csv2pheno-ranker -i <input.csv> [-options]

  Arguments:
    -i, --input <input.csv>          CSV file

  Options:
    -generate-primary-key            Generates a primary key if absent. Use --primary-key-name to set its name
    -primary-key-name <name>         Sets the name for the primary key. Must be a single, non-array field
    -sep, --separator <char>         Delimiter for CSV fields [;] (e.g., --sep $'\t' for tabs)
    -array-separator <char>          Delimiter for nested arrays [|] (e.g., --array-separator ';' for semicolons)
    -output-dir <directory>          Specify the directory where output files will be stored. If not specified, outputs will be placed in the same directory as the input file

  Generic Options:
    -debug <level>                   Print debugging (from 1 to 5, being 5 max)
    -h, --help                       Brief help message
    -man                             Full documentation
    -v, --verbose                    Verbosity on
    -V, --version                    Print version

DESCRIPTION

Numerous tools exist for CSV to JSON conversion, but our focus here was on creating JSON specifically for Pheno-Ranker. The script supports both basic CSV files and complex, comma-separated CSV files with nested fields, ensuring seamless Pheno-Ranker integration.

The script will create both a JSON file and the configuration file for Pheno-Ranker. Then, you can run Pheno-Ranker as:

$ pheno-ranker -r my_csv.json --config my_csv_config.yaml

Note that we load all data in memory before dumping the JSON file. If you have a huge CSV (e.g., >5M rows), please use a computer that has enough RAM.

SUMMARY

A script to convert a CSV to an input suitable for Pheno-Ranker

INSTALLATION

(only needed if you did not install Pheno-Ranker)

$ cpanm --installdeps .

System requirements

* Ideally a Debian-based distribution (Ubuntu or Mint), but any other (e.g., CentOS, openSUSE) should do as well.
* Perl 5 (>= 5.10 core; installed by default in most Linux distributions). Check the version with "perl -v"
* 1GB of RAM.
* 1 core (it only uses one core per job).
* At least 1GB HDD.

HOW TO RUN CSV2PHENO-RANKER

The software requires a CSV file as the input and operates with default settings. By default, both the JSON file and the configuration file will be created in the same directory as the input file, and will share the same basename.

If you have columns with nested values make sure that you use --array-separator to define the delimiting character (default is "|").
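If you are unsure which columns hold nested values, a quick scan for the delimiting character can help before choosing --array-separator. This is a minimal Python sketch with a hypothetical helper and made-up sample data; it is not part of csv2pheno-ranker.

```python
import csv
import io


def columns_containing(text, char, sep=";"):
    """Return the columns where at least one value contains `char`."""
    reader = csv.DictReader(io.StringIO(text), delimiter=sep)
    found = []
    for row in reader:
        for column, value in row.items():
            if value and char in value and column not in found:
                found.append(column)
    return found


# Made-up sample that uses the default array separator "|"
sample = "Foo;Bar;Baz\nfoo1;bar1a|bar1b;baz1\nfoo2;bar2a;baz2\n"
nested = columns_containing(sample, "|")
print(nested)
```

A column reported here may also contain the character as part of an ordinary value, so the result is a hint to review rather than a definitive answer.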

If you want to change some parameters, please take a look at the synopsis.

Examples:

$ ./csv2pheno-ranker -i example.csv

$ ./csv2pheno-ranker -i example.csv --generate-primary-key --primary-key-name ID

$ ./csv2pheno-ranker -i example.csv --generate-primary-key --primary-key-name ID  --output-dir /my-path --sep ';' --array-separator ','

COMMON ERRORS AND SOLUTIONS

* Error message: Foo
  Solution: Bar


AUTHOR

Written by Manuel Rueda, PhD. Info about CNAG can be found at https://www.cnag.eu.

COPYRIGHT AND LICENSE

This Perl file is copyrighted. See the LICENSE file included in this distribution.