bio_analytics.pl -- Computational biology data analytics.

Collects a number of biological data analytics tasks. This library provides tools for the bio_db served data, empowering downstream analyses of user's experimental data.

Installation and loading:

?- pack_install(bio_analytics).
        % also installs pack(lib). other dependencies are installed at load time via lib
?- use_module(library(lib)).
?- lib(bio_analytics).

The library comes with one experimental dataset: data/silac/bt.csv which is used in the example files in directory examples/ .

author: - nicos angelopoulos
version: - 0.1 2019/4/22; - 0.2 2019/5/08; - 0.3 2019/5/11; - 0.4 2020/9/18; - 0.5 2023/1/2; - 0.6 2023/6/6; - 0.7 2024/10/7
license: - MIT

bio_analytics_version(-Vers, -Date)

Version and release date.

?- bio_analytics_version(V,D).
V = 0:7:0,
D = date(2024,10,17).

bio_diffex(+Exp, -DEs, -NonDEs, +Opts)

Select a sublist of significantly, differentially expressed genes in an experiment.

Exp is in the form of a mtx/1 matrix. Selected (DEs) and rejected (NonDEs) are returned in common format, of either (default) Gene-Ev, where Ev is the expression value for Gene in Exp, or sub-matrices of Exp. The format is controlled by option as_pairs.

We use ev as short for expression value, to avoid confusion with exp which short for experiment.

Opts

as_non(AsNon=pvalue): what to return as non differentially expressed: either everything in Gcnm (AsNon=all, default when EvLog is true), or only those with numeric pvalue (AsNon=pvalue, default when EvLog is false)
as_pairs(AsPairs=true): whether to return pairs or matrices
de_max(DexMax=false): puts a cap on the number of differentially expressed genes returned
de_mtx(DexMtx): returns the selected matrix. if an atom, its taken to be a filename
exp_pv_cnm(ExpPcnm=adj.pvalue): the experimental column (found in MsF) on which Pcut is applied as a filter
exp_pv_cut(Pcut=0.05): p.value cut off for experimental input from MSstats
exp_ev_log(EvLog=true): are the expression values log values
exp_ev_cnm(EvCnm): expession value column name (log2FC if EvLog=true, expression otherwise)
exp_ev_cut_let(EvLet = -1): expression values below (less or equal) which values are selected
exp_ev_cut_get(EvGet): expression values above (greater or equal) which values are selected (defaults is 2 if EvLog is false, and 1 otherwise)
exp_ev_include_inf(IncInf=false): include infinity values as diffexs ? (default is false if EvLog is false, and true otherwise)
gene_id_cnm(Gcnm=Symbols): column name for the key value in the pair lists: DEs and NonDEs.
mtx_dx_in(Mtx): can be used to return the Mtx used (from mtx(Exp,Mtx) call)
which(Wch=dx(UpIs,DownIs)): returns the up and down-regulated indices

?- lib(mtx),
   absolute_file_name(pack('bio_analytics/data/silac/bt.csv'), CsvF),
   mtx(CsvF, Mtx, convert(true)),
   assert(mtx_data(Mtx)).

CsvF = '/usr/local/users/nicos/local/git/lib/swipl/pack/bio_analytics/data/silac/bt.csv',
Mtx = [row('Protein IDs', 'Symbols', log2FC, adj.pvalue), row('Q9P126', ...), row(..., ..., ..., ...)|...].

?- lib(debug_call),
   debug(testo),
   mtx_data(Mtx),
   debug_call(testo, dims, mtx/Mtx),
   bio_diffex(Mtx, DEPrs, NonDEPrs, []),
   debug_call(testo, length, de/DEPrs),
   debug_call(testo, length, nde/NonDEPrs).

% Dimensions for matrix,  (mtx) nR: 1245, nC: 4.
% Length for list, de: 426.
% Length for list, nde: 818.

DEPrs = ['CLEC1B'- -4.80614893171469, 'LGALS1'- -2.065877096524, ... - ...|...],
NonDEPrs = ['CNN2'- -0.69348026828097, 'CXCR4'-0.73039667221395, ... - ...|...].

?-
    mtx_data(Mtx),
    bio_diffex(Mtx, DEPrs, NonDEPrs, exp_pv_cut(0.01)),
    debug_call(testo, length, de/DEPrs),
    debug_call(testo, length, nde/NonDEPrs).

% Length for list, de: 286.
% Length for list, nde: 958.

?-
    mtx_data(Mtx),
    Opts = [exp_ev_cut_let(inf),exp_ev_cut_get(-inf)],
    bio_diffex(Mtx, DEPrs, NonDEPrs, Opts),
    debug_call(testo, length, de/DEPrs),
    debug_call(testo, length, nde/NonDEPrs).

% Length for list, de: 581.
% Length for list, nde: 663.

?- mtx_data(Mtx),
    Opts = [exp_ev_cut_let(inf),exp_ev_cut_get(-inf),as_pairs(false)],
    bio_diffex(Mtx, DEs, NonDEs, Opts),
    debug_call(testo, length, de/DEPrs),
    debug_call(testo, length, nde/NonDEPrs).

% Length for list, de: 582.
% Length for list, nde: 664.

Mtx = [row('Protein IDs', 'Symbols', log2FC, adj.pvalue), row(..., ..., ..., ...)|...],
Opts = [exp_ev_cut_let(inf), exp_ev_cut_get(-inf), as_pairs(false)],
DEs = [row('Protein IDs', 'Symbols', log2FC, adj.pvalue), row('Q9P126', 'CLEC1B', -4.80614893171469, 2.057e-37), row(..., ..., ..., ...)|...],
NonDEs = [row('Protein IDs', 'Symbols', log2FC, adj.pvalue), row('B4DUT8;Q6FHC3;Q6FHE4;Q99439;B4DDF4;Q53GK7;B4DN57;B4DHU5;K7ES69;H3BVI6;H3BQH0', 'CNN2', -0.69348026828097, 0.0814675), row(..., ..., ..., ...)|...].

author: - nicos angelopoulos
version: - 0.1 2019/5/2; - 0.2 2020/9/3, ability to return sub-matrices (9/14): diffex_mtx(); - 0.3 2022/12/16, changed name from exp_diffex/4 to bio_diffex/4.

bio_p_adjust(+Obj, +Adj, +Opts)

Adj is the p.adjusted version of Obj.

Obj can be a list, an R vector, or Cid:Mtx term (see pl_vector/3). In the later case RelPos (see options) is the relative position for the new column (see mtx_relative_position/4), if none is given Adj is returned as a list.

The adjustment happens via p.adjust() function in R.

Opts

debug(Dbg=true): informational, progress messages
hdr(Hdr=q.value): the column header if we adding Adj
rel(RelPos=1): when Obj is a matrix and this is given, it is taken to be the relative position for a new column of the new values to be added to the matrix (see mtx_relative_position/4). Def is to append the column one after the pvalue column. New matrix returned in Adj.

Opts are passed to pl_vector/3 (predicate only loaded at run time, so this is a promised dependency to pack(b_real)).

Examples

?-   Mtx = [row(exp,pv),row(a,0.1),row(b,0.02),row(c,0.001)],
     bio_p_adjust( pv:Mtx, Adj, cps(Cps) ).

author: - nicos angelopoulos
version: - 0.1 2023/09/02
See also: - pl_vector/3

bio_symbols(+Vect, -Symbs, +Opts)

Convert gene identifiers pointed by Vect to Symbols.

Vect can be a list or pairlist (K-V) of which the K element is taken to be the identifier. Vect can also be an atomic representing the Nth column on column for Mtx.

Opts

debug(Dbg=false): progress, informational messages
mtx(Mtx): to be used if Vect is atomic
org(Org=hs): organism the data come from, via bio_db_organism/2
gid(Gid=symb): the type of the experimental gene ids for the organism. default depends on Org, but currently all map to symb (caution, this used to be org_exp_id(), changed in v0.2

?- hgnc_homs_hgnc_ncbi( 19295, Ncbi ), bio_symbols( [Ncbi], Symbs, gid(ncbi) ).
Ncbi = 114783,
Symbs = ['LMTK3'].

author: - nicos angelopoulos
version: - 0:1 2022/12/21; - 0:2 2024/06/05, changed option org_exp_id() to gid()
See also: - bio_db_organism/2

bio_volcano_plot(+Mtx, +Opts)

Create a volcano plot of differential expressed values (x) against p-values (y).

Uses ggplot2 and bio_analytics:bio_diffex/4. The predicate creates an R data.frame and creates a ggplot2 object from which typically output files are produced.

Opts

clr_down(ClrDw=brandeisblue): bio_colour_hex/2 compatible colour for down regulated entries
clr_inv(ClrInv=cadetgrey): colour for invariant entries
clr_line(ClrLn=bazaar): colour for cut-off lines
clr_up(ClrUp=cadmiumred): colour for up regulated entries
dir(Dir): as understood by os_dir_stem_ext/2- see Odir, below too
debug(Dbg=true): informational, progress messages
ext(Ext=pdf): extension that defines the type of output (passed to os_dir_stem_ext/2)
legend_cnm(LgCnm=regulation): the legend frame column name (also appears as legend title, if none is given)
legend_down(LegUp=down): token for up-regulation
legend_inv(LegUp=invariant): token for invariants
legend_title(LgTitle): by default LgCnm is used.
legend_up(LegUp=up): token for up-regulation
lim_x_max(LimXmax): if LimXmin is given this defaults to -LimXmin. By default ggplot2 decides this.
lim_x_min(LimXmin): if LimXmax is given this defaults to -LimXmax. By default ggplot2 decides this.
lim_y_max(LimXmax): By default ggplot2 decides this.
lim_y_min(LimY=0): this is not used unless LimYmax is given
odir(Odir): as understood by os_dir_stem_ext/2- preferred to Dir above
plot_file(File): returns the output file
rvar_cv(Rvar=bvp_cv): R var for colors vector
rvar_df(Rvar=bvp_df): R variable for the 3 column data frame
rvar_gt(Rvar=bvp_gt): R variable holding the plot term
rvar_rmv(RRmv=true): remove the three R variables from above
stem(Stem=volc_plot): stem for output, set to false if no output is wanted
theme(Theme=classic): theme_Theme should be a ggplot() theme

Opts are passed to bio_diffex/4. This predicate also pinches the defaults from there.

Examples, fixme: use iris dataset.

?- bio_volcano_plot([]).

author: - nicos angelopoulos
version: - 0.1 2022/12/15
See also: - bio_diffex/4; - os_dir_stem_ext/2
To be done: - add args for influencing of the gg term and the output-to-file call

exp_gene_family_string_graph(+Exp, +Family, -Graph, +Opts)

exp_gene_family_string_graph(+Exp, +Family, ?DEPrs, ?NonDEPrs, -Graph, +Opts)

Generate and possibly plot the STRING graph of a known gene family as affected in a biological Exp_eriment. Each family gene is placed of the following states according to wether it was identified in Exp and how it was modified (in relation to background conditions):

Red = upregulated
Green = downregulated
Orange = either both directions or unknown direction
Grey = not identified

Note that de-reguation trumps identification, that is, currently there is no distiction between genes that seen to be both significantly de-regulated and identified versus just those that are simply significantly de-regulated.

In the exp_gene_family_string_graph/6 version DEPrs and NonDEPrs can be provided, which optimises use in loops over families (see exp_go_over_string_graphs/4).

Opts

exp_ev_log(EvLog=true): are the expression values (ev) log values ? (also passed to bio_diffex/4)
faint_factor(Fctr=1.2): faint factor for non-significant nodes
include_non_present(IncP=true): false excludes non-indentified family genes
include_non_significant(IncS=true): false excluded non-significant family genes
node_colours(Clrs=clrs(red,green,yellow,khaki4,grey)): colours for the different types of nodes
org(Org=hs): organism in which Family is looked in (and experiment was performed)
org_exp_id(ExpID): the type of the experimental gene ids for the organism. default depends on Org, but currently all map to symb
plot(Plot=true): whether to plot the graph via wgraph_plot/2
wgraph_plot_opts(WgOpts=[]): options for wgraph_plot/2 (in preference to any defaults from Self)
wgraph_plot_defs(WgDefs=[]): options for wgraph_plot/2 (added to the end, after any defaults from Self)

These Options are passed to a number of other pack predicates: bio_diffex/4, bio_symbols/3, symbols_string_graph/3 and, selectively, wgraph_plot/2.

See [pack('bio_analytics/examples/bt.pl')].

?- lib(real), lib(mtx),
   absolute_file_name( pack('bio_analytics/data/silac/bt.csv'), CsvF ),
   Ppts = [ vjust= -1, node_size(3), mode="fruchtermanreingold", format(svg), stem(bt)],
   Opts = [ exp_ev_cut_let(inf), exp_ev_cut_get(-inf),
            include_non_present(false), include_non_significant(false),
            minw(200), wgraph_plot_opts(Ppts) ],
   exp_gene_family_string_graph( CsvF, autophagy, G, Opts ).

Produces file: bt.svg

[[../doc/images/bt.svg]]

author: - nicos angelopoulos
version: - 0.1 2019/4/15
See also: - bio_diffex/4, bio_symbols/3, symbols_string_graph/3; - gene_family/3; - wgraph_plot/2

exp_go_over(+CsvF, -GoOver, +Opts)

Perform gene ontology over-representation analysis.

For experimental data in CsvF select de-regulated genes and on
those perform over representation analysis in gene ontology.
Results in GoOver are either as a values list or a csv file, the name of which is assumed to be the ground value of GoOVer.

Opts

go(GoSec=BP)

gene ontology section, in: [BP,MF,CC]

go_frame(GoFrame=bioc_ann_dbi)

which go frame of GoTerms, Evidence and Gid to use. Default, bioc_ann_dbi, uses r_lib(org.<Org>.eg.db) frames, alternative bio_db uses bio_db tables, before the intro of the option this was the way, so use this for backward compatibility

go_over_pv_cut(PvCut=0.05)

p value filter for the results

org(Org=hs)

one of bio_db_organism/2 first argument values

gid(Gid)

the type of the experimental gene ids for the organism. The default depends on Org, most take symb, apart for pig which takes ensg (Caution, this used to be org_exp_id(OrgExpId).)

gid_to(GidTo)

returns the db type for gene ids used in underlying call, mostly ncbi, apart for mouse where it is mgim

stem(Stem=false)

stem for output csv file. When false, use basename of CsvF.

to_file(ToF=false)

when GoOver is unbound, this controls whether the output goes to a file or a values list

universe(Univ=go_exp)

Univ in : [genome,go,go_exp,experiment]

genome(Gen): all gene identifiers in relevant map predicates
go(Go): all gene identifiers appearing in any ontology
go_exp(GoExp): gene ontology genes that appear in experiment
experiment(Exp): all identifiers in the experiment
ontology(Onto): not implemented yet. like Go above, but only include this Ontology branch

Options are also passed to bio_diffex/4.

?- absolute_file_name( pack('bio_analytics/data/silac/bt.csv'), CsvF ),
   exp_go_over( CsvF, GoOver, [] ).

CsvF = '.../swipl/pack/bio_analytics/data/silac/bt.csv',
GoOver = [row('GOBPID', 'Pvalue', adj.pvalue, ...), row('GO:0061387', 0.000187, ...)|...].

?- absolute_file_name( pack('bio_analytics/data/silac/bt.csv'), CsvF ),
   exp_go_over( CsvF, GoOver, [stem(here),to_file(true)] ).

CsvF = '.../swipl/pack/bio_analytics/data/silac/bt.csv',
GoOver = here_gontBP_p0.05_univExp.csv.

?- absolute_file_name( pack('bio_analytics/data/silac/bt.csv'), CsvF ),
   exp_go_over( CsvF, 'a_file.csv', [stem(here),to_file(true)] ).

CsvF = '.../swipl/pack/bio_analytics/data/silac/bt.csv',

?- lib(by_unix).
?- @wc(-l,'a_file.csv').
183 a_file.csv

?- @wc(-l,'here_gontBP_p0.05_univExp.csv').
183 here_gontBP_p0.05_univExp.csv
true.

?- absolute_file_name( pack('bio_analytics/data/silac/bt.csv'), CsvF ),
   exp_go_over( CsvF, OverF, [to_file(true)] ).

CsvF = '.../swipl/pack/bio_analytics/data/silac/bt.csv',
OverF = go_over_gontBP_p0.05_univExp.csv.

?- absolute_file_name( pack('bio_analytics/data/silac/bt.csv'), CsvF ),
   exp_go_over( CsvF, OverF, [to_file(true),stem(false)] ).

CsvF = '.../swipl/pack/bio_analytics/data/silac/bt.csv',
OverF = '.../swipl/pack/bio_analytics/data/silac/bt_gontBP_p0.05_univExp.csv'.

author: - nicos angelopoulos
version: - 0.1 2019/5/2; - 0.2 2022/12/20, Univ=go and Org=gallus; - 0.3 2023/6/5, option go_frame(GoFrame)
See also: - go_over_universe/6

exp_go_over_string_graphs(+Exp, ?GoOver, ?Dir, -Opts)

Create string graphs for all the over-represented terms in GoOver.

When GoOver is a variable exp_go_over/3 is called to generate it.

Opts

dir_postfix(Psfx=go_strings): postfix for outputs directory (when Dir is a variable)
go_id_clm(GoIdClm=1): column id for over represented GO terms
ov_max(MaxOvs=false): when a number, it is taken as the maximal integer of terms to plot graphs for
stem_type(Sty=go_pair_ord): similar to go_string_graph/3, but different default, others: go_name, go_id. Here the length of GO terms (Len) is added to form go_pair_ord(I,Len) for forwarding
viz_de_opts(VizOpts=[]): options for restricting genes to visualise via bio_diffex/4. Default does not restrict what genes are visualised. diffex_only = VizOpts restricts genes to significants only.
wgraph_plot_opts(WgOpts=WgOpts): defaults to [vjust = -1, node_size(3), format(svg)].

Options are passed to exp_gene_family_string_graph/4.

?- debug(real).
?- debug(exp_go_over_string_graphs).
?- absolute_file_name( pack('bio_analytics/data/silac/bt.csv'), Exp ),
   exp_go_over_string_graphs( Exp, Gov, Dir, [] ).

% Sending to R: pltv <- ggnet2(lp_adj,vjexp_go_over_mtxust = -1,size = 3,
...    label = pl_v_1,color = pl_v_2,edge.size = pl_v_3,edge.color = "#BEAED4")

author: - nicos angelopoulos
version: - 0.1 2019/5/5; - 0.2 2020/9/6, option ov_max()
See also: - exp_go_over/3; - exp_gene_family_string_graph/4

exp_go_over_string_graphs_multi(+Opts)

Run multiple exp_go_over_string_graphs/4 runs.

Can ran across a number of values for parameters and number of datasets. A directory is created for each combination of input parameters. Subdirectories are created for each input dataset.

Can be used with single parameter values as well, in order to add that parameter to the output directory's name.

Opts

data(Data=data): which datasets to use, can be a directory
data_ext(DtExt=csv): extension selector for when Data is a directory
dated(Dated=true): adds date (ate_two_digit_dotted/1) stamp on output directories
dfx_opts(DfxOpts): invariant options for exp_go_over_string_graphs/4
dfx_viz_opts(VizOpts): invariant options for viz_de_opts() of exp_go_over_string_graphs/4
keep_mtx_sel(KpSel=true): record, to file, the mtx of values significant values selected
keep_mtx_viz(KpViz=true): record, to file, the mtx of values added for vizualisation (non significants)
mtx_opts(MtxOpts=[]): options to pass when reading the input matrices
multi(Keys, Vals): parameters and associated possible values to pass to exp_go_over_string_graphs/4 (Vals, can be a single value). If multiple Keys are given, then they synchronise- always taking the
viz_diffex_only_constraint(DoCon=true): enforce that when de_max() and exp_pv_cut() are the same, use viz_de_opts(diffex_only)
viz_pv_constraint(PvCon=true): enforce that PvCut for non-significants must be equal or larger than that for significants

multi() options are processed in the order they are given, with later ones varied first in the runs.

?- Opts = [],
   exp_go_over_string_graphs_multi( Opts ).

author: - nicos angelopoulos
version: - 0:1 2020/09/17
To be done: - shorten to exp_multi/1 ?

exp_reac_over(+Etx, ?ReOver, +Opts)

Perform reactome pathway over-representation analysis.

Etx is the matrix of experimental values (should pass as 1st arg in mtx/2). Opts

debug(Dbg=false)

informational, progress messages

org(Org=hs)

organism id for experimental data; bio_db_organism/2 first argument values

gid(Gid)

the gene id db token identifier for the genes in Etx. Default depends on Org.

gid_to(Gto)

returns the gene id db token identifier for interrogating the reactome db (currently returns ncbi)

mtx_cutoff(Cnm=_487364,Val=_487370,Dir=false)

filter the output matrix (see mtx_column_threshold/3)

pways(Pways=univ)

pathways to consider, the value only affects the corrected p.values as the longer the list of pathways the stronger the correction. In order of tightness:

dx(Dx): pathways that contain at least one Gid picked by bio_diffex/4
exp(Exp): pathways that contain at least one Gid in Etx
univ(PwUniv): pathways that contain at least one Gid in Univ (below option universe())
reac(Reac): all reactome pathways for Org

universe(Univ=experiment)

the genes universe, or background for genes in the statistical test (also: reac(tome))

Examples

?- exp_reac_over([]).
?- exp_reac_over([mtx_cutoff(0.05,'adj.pvalue',<)]).

author: - nicos angelopoulos
version: - 0.1 2024/03/16
See also: - bio_diffex/4; - mtx/2; - bio_db_organism/2; - org_gid_map/3
To be done: - add oddsRatios to make the output equivalent to output from exp_go_over/3.

gene_family(+Family, -Symbols, +Opts)

If Family is a known family alias it is expanded to a list of its constituent gene symbols.

Family can be a gene ontology term (atom of the form GO:XXXXXX). If family is a list of numbers or atoms that map to numbers, then they are taken to be Entrez ids which are converted to gene symbols in Symbols.

If family is a list of symbols, it is passed on to Symbols.

Listens to debug(gene_family).

Known families:

autophagy/human

from: http://autophagy.lu/clustering/index.html

Opts (v0.2)

org(Org=hs): organism for the symbols

Opts are also passed to: go_symbols_reach/3.

 Located family as the bio_analytics gene collection: autophagy
 ?- gene_family(autophagy, Auto, org(human)), length(Auto, Len).
 Auto = ['AMBRA1', 'APOL1', 'ARNT', 'ARSA', 'ARSB', 'ATF4', 'ATF6', 'ATG10', 'ATG12'|...],
 Len = 232.

 ?- debug(gene_family).
 ?- gene_family(375, Symbs, []), length(Symbs, Len).
 % Located GO term as the family identifier.
 Symbs = ['BCAS2', 'DBR1', 'DDX23', 'GEMIN2', 'KHSRP', 'LSM1', 'MPHOSPH10', 'PRPF3', 'PRPF4'|...],
 Len = 26.
 
 ?- gene_family('GO:0000375', Symbs, []), length(Symbs, Len).
 % Located GO term as the family identifier.
 Symbs = ['BCAS2', 'DBR1', 'DDX23', 'GEMIN2', 'KHSRP', 'LSM1', 'MPHOSPH10', 'PRPF3', 'PRPF4'|...],
 Len = 26.

 ?- gene_family([55626, 8542, 405], Auto, []).
 Converted input from Entrezes to Symbols.
 Auto = ['AMBRA1', 'APOL1', 'ARNT'].

 ?- gene_family(unknown, Auto, org(hs)).
 ERROR: Unhandled exception: gene_family(cannot_find_input_family_in_the_known_ones(unknown,[autophagy]))

Family datasets are in pack(bio_analytics/data/families).

author: - nicos angelopoulos
version: - 0:1 2019/3/5, from old code; - 0:2 2023/6/7, added options
See also: - go_symbols_reach/4
To be done: - error reporting via print_message/2

go_org_symbol(+Org, +Go, -Symb)

Get symbols of Go id for Organism.

Go can be either a GO: atom or an integer.

[debug]  ?- go_org_symbol( mouse, 2, Symbs ), write(Symbs), nl, fail.
Akt3
Mef2a
Mgme1
Mpv17
Mrpl15
Mrpl17
Mrpl39
Msto1
Opa1
Slc25a33
Slc25a36
Tymp
false.

[debug]  ?- go_org_symbol( hs, 'GO:0000002', Symbs ), write(Symbs), nl, fail.
AKT3
LONP1
MEF2A
MGME1
MPV17
MSTO1
OPA1
PIF1
SLC25A33
SLC25A36
SLC25A4
TYMP
false.

author: - nicos angelopoulos
version: - 0:1 2019/04/07; - 0:2 2023/06/09, added pig, moved to bio_org.pl

go_org_symbols(+GoT, +Org, -Symbols)

Gets the symbols belonging to a GO term.

?- go_org_symbols( 'GO:0000375', human, Symbs ), length( Symbs, Len ).
Symbs = ['SCAF11', 'SLU7', 'SRSF10', 'WDR83', 'DBR1', 'MPHOSPH10', 'PRPF4', 'PRPF6', 'SF3B1'|...],
Len = 25.

?- go_org_symbols( 'GO:0000375', mouse, Symbs ), length( Symbs, Len ).
Symbs = ['Rnu4atac', 'Rnu6atac', 'Srrm1', 'Sf3a2', 'Srsf10', 'Dbr1', 'Scaf11', 'Slu7', 'Srsf10'|...],
Len = 11.

?- go_org_symbols( 375, human, Symbs ), length( Symbs, Len ).
Symbs = ['SCAF11', 'SLU7', 'SRSF10', 'WDR83', 'DBR1', 'MPHOSPH10', 'PRPF4', 'PRPF6', 'SF3B1'|...],
Len = 25.

?- go_org_symbols( 375, chicken, Symbs ), length( Symbs, Len ).
Symbs = ['DBR1'],
Len = 1.

?- go_org_symbols( 375, pig, Symbs ), length( Symbs, Len ).
Symbs = [],
Len = 0.

author: - nicos angelopoulos
version: - 0.1 2019/4/7
See also: - go_org_symb/3, just a findall of; - go_symbols_reach/3 for a version that travells the ontology

go_over_plot(+GovF, +Opts)

Plots a lollipop plot for the GO over represented (csv) file or matrix (mtx/2), GovF.

By default when GovF is a file, an .svg is generated at the same name but postfixed by top_TopN (see options for TopN value).

Opts

debug(Dbg=false): informational, progress messages
go(GoSec): gene ontology section, in: [BP,MF,CC], used for the label. Default is looked for in the filename.
pfx_len(PfxLen=50): how many initial characters of the GO term name should be included ?
pos_go_term(GoTPos=8): argument/column position for the GO term name
pos_p_val(PvalPos=2): argument/column position for the p value
stem(Stem)
top_n(TopN=30): how many of the top terms to include

?- go_over_plot( 'res_resist-24.06.05/genes_15kb_resistant_enh_bp_go_over.csv', [] ).

author: - nicos angelopoulos
version: - 0:1 2024/06/05
See also: - b_real: gg_lollipop/2
To be done: - allow prepending order such as 01- as titles have to be unique; - allow adding the Count/Size on tick label names

go_string_graph(+GoTerm, -Graph, +Opts)

Plot the STRING graph of all the Symbols in a GO term.

The Graph can be saved via options to wgraph_plot/2.

Opts are passed to symbols_string_graph/3 and wgraph_plot/2.

Opts

org(Org=hs): organism for the operation
plot(Plot=false): whether to plot the graph
save(Save=false): whether to save the graph (passed to wgraph_plot/2
stem_type(Sty=go_name): constructs stem for the output filenames: go_id, go_pair, go_pair_ord(I,Len).

?- go_string_graph( 'GO:0016601', G, true ).
?- go_string_graph( 'GO:0016601', G, plot(true) ).

?- go_string_graph( 'GO:0016601', G, [plot(true),save(true)] ).
G = ['ABI2', 'AIF1', 'CDH13', 'HACD3', 'NISCH', 'ALS2'-'RAC1':892, 'BRK1'-'CYFIP1':999, ... - ... : 904, ... : ...|...].
?- ls.
% Rac_protein_signal_transduction.csv                Rac_protein_signal_transduction_graph.csv          Rac_protein_signal_transduction_layout.csv
true.

?- go_string_graph( 'GO:0016601', G, [plot(true),save(true),stem_type(go_pair)] ).

author: - nicos angelopoulos
version: - 0.1 2016/4/12; - 0.2 2020/9/3, expanded go_string_graph_stem/3
To be done: - make sure underlying options are compatible.

go_symbols_reach(+GoT, -Symbols, +Opts)

Gets the symbols belonging to a GO term. Descents to GO child relations, which by default are includes (reverse of is_a) and consists_of (reverse of part_of) to pick up Symbols recursively.

Opts

org(Org=hs): should be recognised by 1st arg of bio_db_organism/2.
go_frame(GO=bioc_ann_dbi): which implementation of GO to follow (v0.3 this changes default behaviour). Some of the options below are not functional with GO=bioc_ann_dbi. Alterantive: =\bio_db\= (which used to be the default, before the option was introduced).
descent(Desc=true): whether to collect symbols from descendant GO terms
as_child_includes(Inc=true): collect from edge_gont_include/2
as_child_consists_of(Cns=true): collect from edge_gont_consists_of/2
as_child_regulates(Reg=false): collect from edge_gont_regulates/2
as_child_negatively_regulates(Reg=false): collect from edge_gont_negatively_regulates/2
as_child_positively_regulates(Reg=false): collect from edge_gont_positively_regulates/2
debug(Dbg=false): see options_append/3

Listens to debug(go_symbols_reach).

?- go_symbols_reach( 'GO:0000375', Symbs, [] ), length( Symbs, Len ).
Symbs = ['AAR2', 'ALYREF', 'AQR', 'ARC', 'BCAS2', 'BUD13', 'BUD31', 'CACTIN', 'CASC3'|...],
Len = 293.

?- go_symbols_reach( 'GO:0000375', Symbs, org(mouse) ), length( Symbs, Len ).
Symbs = ['4930595M18Rik', 'Aar2', 'Aqr', 'Bud13', 'Bud31', 'Casc3', 'Cdc40', 'Cdc5l', 'Cdk13'|...].
Len = 190.

?- go_symbols_reach( 375, Symbs, true ), length( Symbs, Len ).
Symbs = ['AAR2', 'ALYREF', 'AQR', 'ARC', 'BCAS2', 'BUD13', 'BUD31', 'CACTIN', 'CASC3'|...],
Len = 293.

author: - nicos angelopoulos
version: - 0.1 2015/7/26; - 0.2 2019/4/7, added org, moved to new pack; - 0.3 2023/6/7, added option go(GoImpl), which changes default behaviour in comparison to past
To be done: - re-establish the bio_db interface- it is not far off, and compare to bioc_ann_dbi, bio_db likely to be faster

symbols_string_graph(+Symbols, -Graph, +Opts)

Create the string database Graph between Symbols.

Opts

cohese(Coh=max): method for cohesing multiple edges between two nodes [false,max,min,umax,umin] the max and min versions are more efficient but do sort the edges, whereas umax and umin take much longer but leave edges in found order
include_orphans(Orph=true): set to false to exclude orphans from Graph
org(Org=human): which organism do the gene symbols come from, via bio_db_organism/2
minw(500): minimum weight (0 =< x =< 999) - not checked
sort_pairs(Spairs=true): set to false to leave order of edges dependant on order of Symbols
sort_graph(Sort=true): set to false for not sorting the results

?- Got = 'GO:0043552', gene_family( Got, Symbs, true ), length( Symbs, SymbsLen ),
   symbols_string_graph(Symbs, Graph, true ), length( Graph, GraphLen ).

Got = 'GO:0043552',
Symbs = ['AGAP2', 'AMBRA1', 'ATG14', 'CCL19', 'CCL21', 'CCR7', 'CD19', 'CDC42', 'EPHA8'|...],
SymbsLen = 32,
Graph = ['AGAP2', 'CDC42', 'EPHA8', 'NOD2', 'P2RY12', 'TNFAIP8L3', 'AMBRA1'-'ATG14':995, ... - ... : 994, ... : ...|...],
GraphLen = 94.


Got = 'GO:0043552',
Symbs = ['AMBRA1', 'ATG14', 'CCL19', 'CCL21', 'CCR7'|...],
SymbsLen = 31,
Graph = ['TNFAIP8L3', 'AMBRA1'-'ATG14':981, 'AMBRA1'-'PIK3R4':974|...],
GraphLen = 235.

% the following is flawed as it uses Got from mouse and tries to build the graph in human STRING...
?- Got = 'GO:0043552', gene_family( Got, mouse, Symbs ), length( Symbs, SymbsLen ),
   symbols_string_graph(Symbs, Graph, true ), length( Graph, GraphLen ).

Got = 'GO:0043552',
Symbs = Graph, Graph = ['Ambra1', 'Atg14', 'Ccl19', 'Cd19', 'Cdc42', 'Epha8', 'Fgf2', 'Fgfr3', 'Fgr'|...],
SymbsLen = GraphLen, GraphLen = 28.

% this the correct way for running the first query for mouse:
?- Got = 'GO:0043552', gene_family( Got, mouse, Symbs ), length( Symbs, SymbsLen ),
   symbols_string_graph(Symbs, Graph, org(mouse) ), length( Graph, GraphLen ).

Got = 'GO:0043552',
Symbs = ['Ambra1', 'Atg14', 'Ccl19', 'Cd19', 'Cdc42', 'Epha8'|...],
SymbsLen = 28,
Graph = ['Ambra1', 'Cdc42', 'Lyn', 'Nod2', 'Pdgfra', 'Sh3glb1', 'Tnfaip8l3', 'Vav3', ... : ...|...],
GraphLen = 117.

author: - nicos angelopoulos
version: - 0.1 2016/1/18; - 0.2 2019/4/8, added organism to incorporate mouse; - 0.3 2020/9/5, option cohese(), and debugs
To be done: - implement cohese() values: min, umax and umin