Frequently Asked Questions (FAQs)
  1. What are the general steps in network analysis from gene expression data?

  Data Processing
  2. What is the data format accepted by NetworkAnalyst?
  3. What if my microarray platform or organism is not supported?
  4. What does gene-level summarization mean?
  5. How should I choose a suitable normalization procedure?

  Differential Expression Analysis
  6. My data contains multiple metadata, how should I choose a proper method for differential analysis?
  7. I received the error message "no residual degrees of freedom", what should I do?
  8. What are the differences between pair-wise comparisons and time-series comparisons?
  9. What is a nested comparison?

  Network Construction
  10. How are the networks generated from my data?
  11. Which database is used for creating the network?
  12. How many nodes can be visualized (network size limit)?
  13. How do I construct a network composed only of nodes connecting the seed proteins?
  14. What if the network is very small (< 100)?
  15. Can I construct a network for a large number (> 1000) of significant genes?
  16. How does the Trim function work?

  Network Analysis
  17. How do I identify important nodes using degree and betweenness?
  18. What is a module and how are modules identified in NetworkAnalyst?
  19. How do I interpret the p value of a module?
  20. Can I perform GO or pathway analysis on my significant genes alone?
  21. Can I test enriched functions for highlighted selection?
  22. How is enrichment analysis performed?

  Network Visualization
  23. How do I interpret the node colors and sizes in the default Topo View?
  24. Can I view my queries in the network?
  25. How can I create a 300 dpi high-resolution network for publication?
  26. Can I change the background color of the network?
  27. Can I change node color, size or shape?
  28. Can I change the position of a node?
  29. Can I change the position of a node cluster?
  30. How can I label the nodes in the network?
  31. Can I delete nodes from the network?
  32. How do I manually highlight any arbitrary sections of the network?
  33. Can I extract a module or the highlighted section from the network?

  Functional Enrichment Analysis
  34. For the p-value associated with each enriched theme, are these FDR corrected P-values?
  35. Can I view the expression profile of a pathway or a GO category?

  1. What are the general steps in network analysis from gene expression data?

    Biological networks are very complex, with tens of thousands of nodes and millions of connections among them. To understand patterns of gene expression within the context of biological networks, it is impractical and often unnecessary to analyze and visualize the complete network directly. NetworkAnalyst is based on a widely used approach with three sequential steps: 1) identifying genes of interest (e.g. by differential expression analysis, yielding hundreds of significant genes or seed proteins); 2) mapping these seed genes to the complete network (an interaction database) and creating subnetworks composed only of these genes and their close neighbours, with thousands of nodes; 3) performing network analysis such as hub analysis and/or module analysis, followed by functional enrichment analysis on these modules or hubs. In summary, the general steps are: identify genes of interest => network construction => module/hub identification and functional assignment.

  2. What is the data format accepted by NetworkAnalyst?

    NetworkAnalyst supports three main data types: gene list data, a single gene expression data table, and multiple gene expression datasets. Gene list data is a list of gene IDs with optional fold change values. Gene expression data (microarray intensity values or read counts) is a table in tab-delimited text format with genes in rows and samples in columns. The first column contains gene or probe IDs. The following common ID types are supported:

    • Gene IDs: GenBank Accession, RefSeq ID, Entrez ID, Ensembl gene ID, and Ensembl transcript ID;
    • Probe IDs: popular microarray platforms from Affymetrix, Illumina and Agilent for human (21) and mouse (16).

    The sample names must be in the first row, followed by the sample labels (metadata). Each metadata entry occupies its own line beginning with "#CLASS:". A small example dataset with one metadata line (#CLASS:ER) is shown below:
    #NAME           Sample1	Sample2	Sample3	Sample4	Sample5	Sample6	Sample7	Sample8 Sample9
    #CLASS:ER	Y   	N	N	Y	N	Y	Y	N       N
    100_g_at        -3.06	-2.25	-1.15	-6.64	0.4	1.08	1.22	1.02    1.15
    1000_at         -1.36	-0.67	-0.17	-0.97	-2.32	-5.06	0.28	1.32    0.73
    1002_f_at       1.61	-0.27	0.71	-0.62	0.14		0.11	0.98    0.54
    1008_f_at       0.93	1.29	-0.23	-0.74	-2	-1.25	1.07	1.27    1.02
    You can use our example datasets for testing: example microarray gene expression data from eight Affymetrix Human Genome U95 chips (hgu95av2) can be downloaded here. In the Network meta-analysis module, multiple datasets can be uploaded, each in the same format.
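
    To make the layout concrete, here is a minimal Python sketch of reading such a file. The function name parse_expression_table is hypothetical and is not part of NetworkAnalyst. The sketch assumes tab-delimited text, a "#NAME" header row, zero or more "#CLASS:" rows, and that an empty cell denotes a missing value.

```python
import io

def parse_expression_table(handle):
    """Parse a tab-delimited table with a #NAME row and #CLASS: rows.

    Hypothetical helper for illustration only; not NetworkAnalyst code.
    """
    samples, metadata, values = [], {}, {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        key = fields[0].strip()
        cells = [c.strip() for c in fields[1:]]
        if not key:
            continue  # skip blank lines
        if key == "#NAME":
            samples = cells
        elif key.startswith("#CLASS:"):
            metadata[key[len("#CLASS:"):]] = cells
        else:
            # Assumption: an empty cell is a missing value, stored as None.
            values[key] = [float(c) if c else None for c in cells]
    return samples, metadata, values

# First three samples of the example above, written with explicit tabs.
example = (
    "#NAME\tSample1\tSample2\tSample3\n"
    "#CLASS:ER\tY\tN\tN\n"
    "100_g_at\t-3.06\t-2.25\t-1.15\n"
    "1000_at\t-1.36\t-0.67\t-0.17\n"
)
samples, metadata, values = parse_expression_table(io.StringIO(example))
print(samples)            # ['Sample1', 'Sample2', 'Sample3']
print(metadata["ER"])     # ['Y', 'N', 'N']
print(values["1000_at"])  # [-1.36, -0.67, -0.17]
```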
  3. What if my microarray platform or organism is not supported?

    For other microarray platforms for human, mouse, C. elegans and D. melanogaster, you can first annotate your probe IDs to gene IDs supported by NetworkAnalyst using the corresponding annotation file of the platform. It is possible to add support for other model organisms/platforms based on user requests. Feel free to send us your suggestions (jeff.xia [at]

  4. What does gene-level summarization mean?

    Microarray data provides probe-level expression measurements, and RNA-seq data provides exon-level or transcript-level measurements (e.g. for different isoforms of the same gene). However, current functional annotations are mainly assigned at the gene or protein level. Therefore, it is desirable to first map the probe-level or transcript-level measurements to the corresponding gene-level measurements.

    When multiple probes or transcripts are mapped to the same gene, they need to be summarized into a single value for the corresponding gene. At the Gene Annotation step, users can choose to use the averages or medians of multiple probe intensities (microarray), or sums of counts from multiple transcripts (RNA-seq) to perform gene-level summarization.
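
    The summarization step can be sketched in Python as follows. This is only an illustration of the idea: the function summarize_to_genes, the probe IDs and the gene names are hypothetical, and the sketch assumes complete data (no missing values).

```python
from collections import defaultdict
from statistics import mean, median

def summarize_to_genes(probe_values, probe_to_gene, method="mean"):
    """Collapse probe- or transcript-level rows into one row per gene.

    probe_values:  {probe_id: [value per sample]}
    probe_to_gene: {probe_id: gene_id}; unmapped probes are dropped.
    method: "mean" or "median" (microarray intensities), "sum" (RNA-seq counts).
    """
    combine = {"mean": mean, "median": median, "sum": sum}[method]
    grouped = defaultdict(list)
    for probe, row in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:
            grouped[gene].append(row)
    # zip(*rows) walks the samples column by column.
    return {gene: [combine(col) for col in zip(*rows)]
            for gene, rows in grouped.items()}

probes = {"p1": [2.0, 4.0], "p2": [4.0, 8.0], "p3": [1.0, 1.0]}
mapping = {"p1": "GENE_A", "p2": "GENE_A", "p3": "GENE_B"}
print(summarize_to_genes(probes, mapping, "mean"))
# {'GENE_A': [3.0, 6.0], 'GENE_B': [1.0, 1.0]}
```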

  5. How should I choose a suitable normalization procedure?

    Log transformation is recommended if the data is not already log transformed, mainly because the program uses a linear model (Limma) for differential expression analysis. Differences in expression are generally considered to exist on a multiplicative scale; log transformation brings them onto an additive scale, where a linear model such as Limma can be applied. Log transformation also usually makes the distribution more symmetric and Gaussian-like, allowing many additional statistical analyses to be applied.

    Log transformation requires that the data contain no zero or negative values. To deal with this issue, NetworkAnalyst provides three variants. Log_simple replaces only these values with a small positive value (i.e. the detection limit); Log_vsn_max adds large values to all data values, with some adjustment based on the actual values; Log_vsn_min is similar to Log_vsn_max, but the values added are very small so as to stay close to the original scale. The underlying R code is given below. Note: if you are performing meta-analysis, you should NOT use different log normalizations for different datasets, because the transformation has a very large impact on the data. The first one is simple and easy to understand (suitable if the data contains a small portion of negative values). The last one gives more desirable statistical properties (suitable for data with a large amount of negative values).

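    # Log_simple: replace zero/negative values with 1/10 of the smallest positive value
    min.val <- min(data[data > 0], na.rm = TRUE) / 10
    data[data <= 0] <- min.val
    data <- log2(data)

    # Log_vsn_min: glog-style transform with a small offset
    min.val <- min(data[data > 0], na.rm = TRUE) / 10
    data <- log2((data + sqrt(data^2 + min.val^2)) / 2)

    # Log_vsn_max: glog-style transform with a large offset
    max.val <- max(data[data > 0], na.rm = TRUE) * 10
    data <- log2(data + sqrt(data^2 + max.val))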

    If you are not sure whether the data is already log transformed, you can check by visualizing the data (e.g. with a boxplot). For microarray data, log transformed values are usually less than 16. For count data, even a count of 1 million gives log2(1,000,000) < 20. Therefore, if all data values are below 20, it is reasonable to assume that the data has already been log transformed.
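
    The rule of thumb above can be written as a quick check. This is a heuristic sketch only; the function name looks_log_transformed is hypothetical, and the threshold of 20 is simply the value suggested in the text.

```python
def looks_log_transformed(values, threshold=20.0):
    """Return True when every observed value is below the threshold.

    Missing values (None) are ignored; an empty input returns False.
    """
    observed = [v for v in values if v is not None]
    return bool(observed) and max(observed) < threshold

print(looks_log_transformed([-3.06, 1.08, None, 7.5]))  # True
print(looks_log_transformed([150.0, 2300.0, 41.0]))     # False
```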

  6. My data contains multiple metadata, how should I choose a proper method for differential analysis?

    The answer depends on your biological questions. Here are several suggestions:

    • Do a simple analysis first using the primary metadata of interest;
    • If you want to include secondary metadata, you need to decide whether this metadata is of interest by itself, or is included because it potentially affects the results of the primary metadata (e.g. studies where multiple samples are collected from the same subjects including paired samples, tissue types, or any potential batch effect). In the first case, a two-factor analysis is appropriate (i.e. you are interested in two independent metadata and their interactions). In the second case, the second metadata is a blocking factor. NetworkAnalyst will conduct comparisons within the block, which typically improves the accuracy of the result;

  7. I received the error message "no residual degrees of freedom", what should I do?

    This means you do not have enough samples to perform the analysis you specified. This usually happens when you combine two metadata for an independent two-factor analysis (i.e. the second metadata is not specified as a blocking factor). In this case, the total number of groups is the product of the group numbers in each metadata (e.g. if the primary metadata contains 3 groups and the secondary metadata contains 4 groups, the combined analysis has 3 * 4 = 12 groups). We recommend a minimum of 3 samples per group, so at least 12 * 3 = 36 samples would be required for this analysis.

    In this case, you should focus on a single primary metadata, leave the secondary metadata as "Not available", and perform differential analysis with regard to that metadata alone. You can then choose the other metadata as the primary metadata and perform the analysis again. If no or very few significant genes are identified, incorporating that metadata into the analysis will most likely not affect the result.

  8. What are the differences between pair-wise comparisons and time-series comparisons?

    The time-series comparison is only a subset of "all pairwise" comparisons. A time-series comparison only compares two groups that are directly neighbouring each other. For instance, take three groups A, B, and C. The "all pairwise" comparison will be A-B, A-C, and B-C; however, the time-series analysis will only compare A-B and B-C.
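
    The difference between the two schemes can be illustrated with a short Python sketch. The function names are hypothetical; the groups are assumed to be listed in time order.

```python
from itertools import combinations

def all_pairwise(groups):
    """Every unordered pair of groups."""
    return list(combinations(groups, 2))

def time_series(groups):
    """Only pairs of groups that are direct neighbours in the given order."""
    return list(zip(groups, groups[1:]))

groups = ["A", "B", "C"]
print(all_pairwise(groups))  # [('A', 'B'), ('A', 'C'), ('B', 'C')]
print(time_series(groups))   # [('A', 'B'), ('B', 'C')]
```

    With n groups, the all-pairwise scheme yields n * (n - 1) / 2 comparisons, while the time-series scheme yields n - 1.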

  9. What is a nested comparison?

    In nested comparisons, the results from two differential expression analyses are compared and combined. For example, assume there are four conditions: A, B, C, D. If you choose the nested comparison as (B-A) versus (D-C), then the final significant genes (full model) are from three different sources:

    1. Genes significant in analysis B-A;
    2. Genes significant in analysis D-C;
    3. Interactions: genes significant in the overall comparison (i.e. genes that respond differently in B-A vs D-C);
    Note, you can choose to return significant genes from the interaction only.

  10. How are the subnetworks generated from my data?

    The networks are generated by first mapping the significant genes/proteins to the underlying PPI database. A search is then run to identify the first-order neighbours (proteins that directly interact with a given protein) of each of these mapped proteins ("seeds"). The resulting nodes and their interaction partners are returned to build the subnetworks.

    This approach typically returns one giant subnetwork ("continent") along with multiple smaller ones ("islands"). Most subsequent analyses are performed on the continent. Note that networks with fewer than 3 nodes are excluded.
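
    The procedure can be sketched in Python on toy data. This is not the NetworkAnalyst implementation: the function first_order_subnetworks and the protein names are hypothetical, and the sketch assumes that only interactions touching at least one seed are kept.

```python
from collections import defaultdict

def first_order_subnetworks(interactions, seeds, min_size=3):
    """Return connected subnetworks built from seeds and their direct partners.

    interactions: iterable of (protein_a, protein_b) pairs (toy PPI data).
    """
    seeds = set(seeds)
    adjacency = defaultdict(set)
    for a, b in interactions:
        # Assumption: keep an interaction when at least one endpoint is a seed.
        if a != b and (a in seeds or b in seeds):
            adjacency[a].add(b)
            adjacency[b].add(a)

    components, seen = [], set()
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        if len(component) >= min_size:  # drop networks with fewer than 3 nodes
            components.append(component)
    # Largest first: the "continent", followed by the "islands".
    return sorted(components, key=len, reverse=True)

ppi = [("S1", "X"), ("S1", "Y"), ("S2", "Y"), ("S3", "Z"), ("S3", "Z2"),
       ("X", "Q"), ("S4", "W"), ("W", "V")]
nets = first_order_subnetworks(ppi, seeds=["S1", "S2", "S3", "S4"])
print([sorted(n) for n in nets])
# [['S1', 'S2', 'X', 'Y'], ['S3', 'Z', 'Z2']]
```

    Here S4 and W form a two-node network and are excluded, and the X-Q and W-V interactions are ignored because neither endpoint is a seed.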

  11. Which databases are used for creating the network?

    NetworkAnalyst uses a comprehensive high-quality protein-protein interaction (PPI) database based on InnateDB. The database contains manually curated protein interaction data from published literature as well as experimental data from several PPI databases including IntAct, MINT, DIP, BIND, and BioGRID. The database currently contains 14755 proteins and 145955 interactions for human, and 5657 proteins and 14491 interactions for mouse. For C. elegans and D. melanogaster, the PPI data is from iRefWeb.

    Unless otherwise specified, PPI data added recently for new organisms were downloaded from the STRING database (version 10). STRING contains information from numerous sources (including experimental data, computational prediction methods and public text collections), and is probably the only resource for less well-studied organisms.

  12. How many genes can be visualized (network size limit)?

    Visualization is limited by the performance of the user's computer and by screen resolution. Too many nodes make the network too dense to visualize and the computer slow to respond. We recommend keeping the total number of nodes between 200 and 2000 for the best experience. For very large networks, please make sure you have a capable computer with a modern browser (we recommend the latest Google Chrome).

  13. How do I construct a network composed only of nodes connecting the seed proteins (minimum interaction network)?

    You need to first create a large network connecting most of the seed proteins/genes. For instance, if the largest subnetwork from the default first-order interaction does not include most of the seed proteins, you can try to expand the network first. Note, some nodes may never connect to the main network due to the incomplete coverage of the PPI database. When you are satisfied with the result, trim the network to its minimum. Note, if there are many seed proteins, the procedure can take a while to compute.

  14. What if the network is very small (< 100)?

    NetworkAnalyst allows you to enlarge your network during the network construction step. You can either increase the number of input genes or expand your search of the PPI database to higher-order interactors (i.e. including both friends and friends of friends).

  15. Can I construct a network for a large number (> 1000) of significant genes?

    When there are a large number of significant genes or seed proteins, the resulting networks will be too large and complex to be visualized or interpreted. There are three possible solutions here:

    1. To reduce the networks using direct connections between seed proteins (zero-order interactors);
    2. To trim the networks to keep only seeds and their connecting nodes;
    3. To reduce the input genes by using larger fold change and/or smaller p value cutoffs;
    The above approaches aim to reduce the network size and complexity, and to retain the most relevant information for downstream functional analysis.

  16. How does the Trim function work?

    The "Trim" function is designed for cases when the first-order subnetwork is too large or too dense to be visualized effectively. The goal is to extract a minimally connected subgraph containing all the seed genes from this "big and dense" subnetwork. This is a well-known Steiner tree problem and the exact solution is far too slow to use on the public server. NetworkAnalyst implements an approximate approach based on shortest paths: we compute pair-wise shortest paths between all seed nodes, and remove the nodes that are not on the shortest paths. Some optimizations have also been applied to improve its performance when there are large numbers of nodes.
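
    The shortest-path idea can be sketched in Python as follows. This is an illustrative approximation, not the server code: the function names are hypothetical, the graph is assumed to be an unweighted adjacency dictionary of sets, and only one shortest path is kept per seed pair (found by breadth-first search), without the optimizations mentioned above.

```python
from collections import deque
from itertools import combinations

def shortest_path(adjacency, source, target):
    """Breadth-first search; returns one shortest path or None if unreachable."""
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in adjacency.get(node, ()):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None

def trim_to_seeds(adjacency, seeds):
    """Keep only nodes lying on a shortest path between some pair of seeds."""
    keep = {s for s in seeds if s in adjacency}
    for a, b in combinations(sorted(keep), 2):
        path = shortest_path(adjacency, a, b)
        if path:
            keep.update(path)
    return {node: sorted(adjacency[node] & keep) for node in sorted(keep)}

network = {
    "S1": {"A", "B"}, "A": {"S1", "S2"}, "B": {"S1", "C"},
    "C": {"B"}, "S2": {"A", "D"}, "D": {"S2"},
}
print(trim_to_seeds(network, ["S1", "S2"]))
# {'A': ['S1', 'S2'], 'S1': ['A'], 'S2': ['A']}
```

    Nodes B, C and D are removed because they do not lie on the path connecting the two seeds.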

  17. How do I identify important nodes using degree and betweenness?

    Important nodes can be identified based on their position within the network. The assumption is that changes at key positions of a network will have more impact than changes at marginal or relatively isolated positions. NetworkAnalyst provides two well-established node centrality measures to estimate node importance: degree centrality and betweenness centrality. The degree of a node is the number of connections it has to other nodes. Nodes with a high degree act as hubs in a network. Betweenness centrality measures the number of shortest paths going through a node, and so takes the global network structure into account. For example, nodes that lie between two dense clusters will have a high betweenness centrality even if their degree is not high. Note that you can sort the node table by either degree or betweenness by double clicking the corresponding column header.
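
    Both measures can be illustrated with a small Python sketch on a toy graph. The betweenness function below uses Brandes' algorithm for unweighted, undirected graphs and returns unnormalized values; it is an assumption that this matches the scaling used by NetworkAnalyst, and the function names are hypothetical.

```python
from collections import deque

def degree_centrality(adjacency):
    """Number of direct connections per node."""
    return {node: len(neighbours) for node, neighbours in adjacency.items()}

def betweenness_centrality(adjacency):
    """Brandes' algorithm for an unweighted, undirected graph (unnormalized)."""
    score = {node: 0.0 for node in adjacency}
    for source in adjacency:
        order = []                                   # nodes in BFS visiting order
        parents = {node: [] for node in adjacency}   # predecessors on shortest paths
        n_paths = {node: 0 for node in adjacency}    # number of shortest paths
        n_paths[source] = 1
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in distance:
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
                if distance[neighbour] == distance[node] + 1:
                    n_paths[neighbour] += n_paths[node]
                    parents[neighbour].append(node)
        dependency = {node: 0.0 for node in adjacency}
        for node in reversed(order):
            for parent in parents[node]:
                share = n_paths[parent] / n_paths[node]
                dependency[parent] += share * (1 + dependency[node])
            if node != source:
                score[node] += dependency[node]
    # Each undirected pair was counted once from each endpoint.
    return {node: value / 2 for node, value in score.items()}

# Two triangles (a, b, c) and (d, e, f) joined through the bridge node x.
graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "x"},
    "x": {"c", "d"},
    "d": {"x", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
degree = degree_centrality(graph)
betweenness = betweenness_centrality(graph)
print(degree["x"], betweenness["x"])  # 2 9.0
print(degree["c"], betweenness["c"])  # 3 8.0
```

    The bridge node x has the lowest degree together with a, b, e and f, yet the highest betweenness, because all nine shortest paths between the two triangles pass through it.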

  18. What is a module and how are modules identified in NetworkAnalyst?

    Modules are tightly clustered subnetworks with more internal connections than expected randomly in the whole network. They are considered to be relatively independent components in a graph. Members within a module are likely to work collectively to perform a biological function. The biological functions of a module can be revealed by functional enrichment analysis as described below.

    NetworkAnalyst currently uses a random walk based approach known as the Walktrap Algorithm for module detection. The general idea is that if you perform random walks on the graph, then the walks are more likely to stay within the same module because there are only a few edges that lead outside a given module. The Walktrap algorithm runs multiple short random walks and uses the results of these random walks to merge separate modules in a bottom-up manner.

    NetworkAnalyst also integrates the gene expression values as edge weights during module searches. Weights are calculated as the square of the mean absolute log fold changes of the two adjacent nodes. Larger weights mean closer connections during random walks. To avoid zero-weight errors for non-seed proteins during program run, pseudo-expression values are given to non-seed proteins of 1/10 of the minimal absolute log fold changes of the seed proteins. By giving larger weights to seed proteins, the program encourages detecting modules containing more seed proteins (shorter distances).
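
    The weighting scheme described above can be sketched in Python. The function edge_weights and the node names are hypothetical, and the sketch assumes that "mean absolute log fold change" means the average of the two absolute values.

```python
def edge_weights(edges, log_fold_changes):
    """Weight each edge by the squared mean absolute logFC of its two nodes.

    edges: iterable of (a, b) pairs.
    log_fold_changes: {seed: logFC} for seed proteins only; non-seed proteins
    receive a pseudo-value of 1/10 of the smallest absolute seed logFC.
    """
    pseudo = min(abs(v) for v in log_fold_changes.values()) / 10

    def magnitude(node):
        if node in log_fold_changes:
            return abs(log_fold_changes[node])
        return pseudo

    return {(a, b): ((magnitude(a) + magnitude(b)) / 2) ** 2 for a, b in edges}

weights = edge_weights(
    edges=[("S1", "S2"), ("S1", "X"), ("X", "Y")],
    log_fold_changes={"S1": 2.0, "S2": -1.0},
)
for edge, weight in weights.items():
    print(edge, round(weight, 4))
# ('S1', 'S2') 2.25
# ('S1', 'X') 1.1025
# ('X', 'Y') 0.01
```

    Edges between two seeds receive the largest weights, edges between two non-seed proteins the smallest.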

  19. How do I interpret the p value of a module?

    Let's call the edges within a module "internal" and the edges connecting the nodes of a module with the rest of the graph "external". The p value of a given module is then calculated using a Wilcoxon rank-sum test of the "internal" and "external" degrees. The null hypothesis of the test is that there is no difference between the number of "internal" and "external" connections of a given node in the module. A small p value therefore indicates that the nodes in the module have significantly more internal than external edges. Note that the p values are calculated solely from connectivity. Users should also consider whether the modules are 'active' under the experimental conditions by taking into account the number of seed proteins, their average fold changes, and the enriched functions, as displayed in the Module Explorer table.
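
    The two quantities being compared can be computed as in the Python sketch below. The function internal_external_degrees is hypothetical, and the rank-sum test itself is not implemented here; the two returned lists would be passed to a rank-sum test (for example scipy.stats.mannwhitneyu, if SciPy is available).

```python
def internal_external_degrees(adjacency, module):
    """Per-node counts of edges inside and leaving the module.

    adjacency: {node: set of neighbours}; module: iterable of node names.
    Nodes are processed in sorted order so the two lists line up.
    """
    module = set(module)
    internal, external = [], []
    for node in sorted(module):
        neighbours = adjacency[node]
        internal.append(len(neighbours & module))
        external.append(len(neighbours - module))
    return internal, external

# Two triangles (a, b, c) and (d, e, f) joined through the bridge node x.
graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "x"},
    "x": {"c", "d"},
    "d": {"x", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
print(internal_external_degrees(graph, ["a", "b", "c"]))
# ([2, 2, 2], [0, 0, 1])
```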

  20. Can I perform GO or pathway analysis on my significant genes alone?

    Yes, you can test enriched gene ontologies or pathways (KEGG/Reactome) for only your query genes. To do so, first select and highlight the query genes using the Highlight Color toolbar on the top left (you may have to highlight twice, once for upregulated and once for downregulated genes); alternatively, use the Hub Explorer and select queries from the node table. After that, select a functional category from the Function Explorer section and click the Submit button.

  21. Can I test the enriched functions of my highlighted selection?

    Yes. Users can perform enrichment tests on currently highlighted nodes in the network.

    • Module highlight. Automatic: first perform module detection, then click on a module. Manual: set Scope to "including dependents", then double click a node in the network to highlight it together with its direct neighbours; repeat the process to select more nodes.
    • Node highlight. Manual: select nodes from the node table on the left, or double click a node (Single Mode). Automatic: use Hub Highlighting or Data Highlighting to select nodes based on degree or betweenness values.
    After you have selected the nodes or modules, click the Perform Enrichment Analysis button. The result table will be displayed in the panel below. Note that enrichment analyses are performed on ALL currently highlighted nodes. To ensure only your current selections are used, first Reset the network, then make your highlights/selections before performing the enrichment analysis.

  22. How is enrichment analysis performed?

    Enrichment analysis tests whether any functional modules (gene sets) from the user-selected library are significantly enriched among the currently highlighted nodes within the network. NetworkAnalyst uses hypergeometric tests to compute the enrichment p values.
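
    A hypergeometric enrichment p value can be computed as in the Python sketch below. The function name and the numbers are illustrative only; in particular, which genes form the background "universe" in NetworkAnalyst is not stated here and is treated as an input.

```python
from math import comb

def hypergeom_enrichment_p(universe_size, set_size, selected_size, overlap):
    """Upper-tail hypergeometric p value, P(X >= overlap).

    universe_size: number of genes in the background
    set_size:      number of background genes annotated to the gene set
    selected_size: number of highlighted genes
    overlap:       highlighted genes that belong to the gene set
    """
    total = comb(universe_size, selected_size)
    upper = min(set_size, selected_size)
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, selected_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

# 20 background genes, 5 in the pathway, 4 highlighted, 2 of them in the pathway.
p = hypergeom_enrichment_p(20, 5, 4, 2)
print(round(p, 4))  # 0.2487
```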

  23. How do I interpret the differences in node colors and sizes in the default network?

    In the default network generated by NetworkAnalyst, node size is based on degree, with larger nodes for higher degree values. Node color reflects betweenness centrality. When you switch to Expression View, node color is instead based on expression values (if available).

  24. Can I view my queries in the network?

    Yes. To view your query genes or proteins, use the color palette in the top-left corner of the network viewer to set a highlight color. From the "Display Options" panel on the top right, click "Highlight", select "Upregulated nodes" or "Downregulated nodes", then click the Submit button. You may also want to increase their node sizes using the Size function under Node Options. Nodes are labeled automatically when their size increases above a certain level.

  25. How can I create a 300 dpi high-resolution network for publication?

    Use the Download option and choose "SVG Format" to save the current network view (tested using Chrome and Firefox; there is a known issue with Safari). SVG is a vector-based graphic format, so you can export it to a static image (e.g. PNG) at any resolution using a suitable graphics tool such as Adobe Illustrator or the free tool Inkscape. Note that it is best to save the SVG with a white background, as the default background color in Inkscape is white. If your SVG was saved with a black background, open it in Inkscape and set the background color to black (hex code: #222222) using the Document Properties menu.

  26. Can I change the background color of the network?

    Yes. NetworkAnalyst currently supports black (default) and white backgrounds. To switch the background color, click the pull-down menu next to Background on the toolbar at the top of the screen and select a color from the list. More background colors may be supported in the future.

  27. Can I change node color, size or shape?

    You can change the color and size of a node, but not its shape: in the current implementation all nodes are circles and other shapes are not supported. To change the node color, first choose a color for the next selection using the Color Palette, then click the node you want to change; the node will take the chosen color. To change node size, keep double-clicking the node to increase its size. You can also use the Node Size functions to increase or decrease the node size.

  28. Can I change the position of a node?

    Yes. You can simply put your mouse cursor over the node. When its label shows up, left click and drag the node to a position. Release the mouse.

  29. Can I change the position of a node cluster?

    Yes. First use the Scope option on the top menu bar to make sure that "Including dependents" is selected. Then drag the central node of the node cluster to a new position. Note that only dependent nodes (nodes that are connected only to the central node and not to any other nodes) will be affected. If you also want to adjust the position of the non-dependent nodes, switch the Scope to "Current node" and drag these nodes individually to the new position.

  30. How can I label the nodes in the network?

    Nodes will be automatically labeled when their sizes reach a certain threshold. Therefore, you can simply increase node size to label any node. To do so:

    • Label a single node: set to single node mode, and repeatedly click a node to increase its size until the label appears;
    • Label all highlighted nodes: use the Node tab in the Display Options panel on the top right, select "Highlighted nodes" and "Increase ++", then keep clicking the Submit button to increase the size until the labels show up;
    • Label all nodes in the current network: perform the same steps as above, but choose "All nodes" instead of "Highlighted nodes".

  31. Can I delete nodes from the network?

    Yes. You can delete nodes (with their associated edges) from the current network. First you need to select the nodes from the Node Table in the left pane. Then click the Delete button at the top of the node table. A confirmation dialog will appear asking if you really want to delete these nodes. Note, this action will trigger network re-arrangement, especially if hub nodes are removed. In addition, "orphan" nodes may be produced due to removal. These nodes will also be excluded during re-arrangement.

  32. How do I manually highlight any arbitrary sections of the network?

    There are two basic steps in the network highlighting - setting the highlight color and making selections. Use the Color Palette to set the color for the Next selection. You also need to choose among two different Scopes:

    • Current node: for highlighting the node being clicked only;
    • Including-dependents: for highlighting the node and its direct neighbours;
    Now, double click on nodes to make your selections. Note, you can repeat the steps above to change colors and scope to make different effects.

  33. Can I extract a module or a highlighted section from the network?

    Yes. To do this, first select or highlight a section of the network, then click the Extract icon on the left toolbar in the network view window. Note that the operation is computationally expensive, and you may have to wait about 20 seconds for the extracted network to return. The returned network will be named "moduleX" and is available in the "Network Explorer" panel at the top left of the page for future reference.

  34. For the p-values associated with each enriched theme, are these FDR corrected P-values?

    No. These are raw p values from hypergeometric tests and have not been adjusted for multiple testing. Because these pathways or biological processes are not independent and are correlated with each other, the false discovery rate (FDR) approach is not suitable, and a permutation-based approach is too time consuming for a public server. Therefore, the p values should be used as a rough guide to help identify interesting patterns and modules, or for hypothesis generation.

  35. Can I view all the gene members of a pathway or a GO category within the current graph?

    Yes, after you have performed functional enrichment analysis, the over-represented themes will be displayed in the table below. By double clicking on a pathway name, all gene members of the pathway will be displayed on the focus view (heatmap analysis), or as highlighted nodes within the current network (network analysis), or as highlighted chords (chord diagrams analysis).
