Chapter 4 Network Construction

4.1 Network Construction Process

Below are the steps for calculating the “gene score” of each location that quantifies the degree of association between a base position and the disease/trait of interest. A higher score means a stronger association.

For each position \(g\), we perform marginal regression and obtain a p-value, \(p_g\). Every position then is assigned a score \(\sigma(g) = -2log(p_g)\) (Vandin et al., 2012). Two positions share an edge if they occur together, and the edge weight is their co-occurrence number.

# record p values for each position
mutations_pVal = c()
disease_status_vector = c(rep(1,n/2),rep(0,n/2)) # Arbitrarily assign the first half with disease

for (i in 1:length(position)){
  outcome <- "disease_status_vector"
  variable <- position[i]
  f <- as.formula(paste(outcome, 
        paste(variable), 
        sep = " ~ "))
  model <- lm(f, data = sub_data)
  mutations_pVal = c(mutations_pVal,summary(model)$coefficient[,4][2])
}

# compute score for each mutated position
mutations_score <- -2*log(mutations_pVal)

edges <- melt(A) %>% 
  rename(Source = Var1, Target = Var2, Weight = value) %>%
  mutate(Type = "Undirected") %>% 
  select(Source, Target, Type, Weight) %>% 
  filter(Weight != 0)

nodes <- as.data.frame(colnames(sub_data)) %>%
  mutate(Label = colnames(sub_data))
colnames(nodes) <- c("Id","Label")

A.dat <- as.data.frame(A)

4.2 Analysis

After revising our simulation methodology, we simulated using 3000 base positions and 500 samples to obtain the following graphs. 12 of the 15 previously identified SNP locations are chosen to be part of at least one genetic pathway: X520, X521, X526, X1237, X1320, X1533, X1627, X2075, X2106, X2699, X2766, and X2881. These 12 effective SNP locations form 4 causal pathways that lead to the disease:

X2699, X1320, X520, X2881
X2766, X1237, X2075, X521, X526
X526, X2766, X2106, X2075, X1533
X2699, X521, X2881, X1533, X1627

After a round of random mutations, we obtain 1187 mutated positions in total. This gives us an undirected weighted graph with 1187 nodes and 2838 edges, which produces an average degree of4.78. There is a single major component containing roughly 65% of its nodes. Expectedly, all causal pathways identified beforehand exist within this major connected component. Figure 4.1 shows the full graph, while Figure 4.2 shows its single major component. Using a resolution of 1, we obtained 25 modularity classes within the major connected component. The 12 selected positions are divided between two largest modules.

\[\text{Figure} \; 4.1 :\; \text{Full Network of Mutated Gene Base Positions}\]

\[\text{Figure} \; 4.2 :\; \text {Modularity Class Decomposition within the Major Connected Component, with Nodes Sized According to Closeness Centrality}\]

4.3 Centrality Measures Report

Table 4.1 reports the top 12 nodes ranked by their degrees, eigenvector centrality, closeness centrality and betweeness centrality. The 12 effective disease-related positions selected are exactly the top 12 nodes with the greatest degrees and eigenvector centrality. This might be a result of the high degree of inter-connectivity among these positions.

As a comparison, the \(8^{th}\), \(10^{th}\), and \(12^{th}\) nodes with the highest closeness centrality are random variations, while 8 of the top 12 nodes with the greatest betweenness centrality are random variations.

Nodes colored blue below are random variables; nodes colored in red are nodes representing 12 SNP locations, where each is part of at least one combination.

Rank	Degree	Eigenvector Centrality	Closeness Centrality	Betweeness Centrality
1	X2699	X2881	X2881	X2881
2	X2881	X2699	X2699	X2699
3	X521	X521	X1533	X1533
4	X2075	X1533	X520	X807
5	X2766	X2075	X1320	X2083
6	X526	X2766	X521	X2438
7	X1533	X526	X2075	X2251
8	X520	X520	X2766	X1301
9	X1320	X1320	X526	X1806
10	X1237	X1627	X39	X1739
11	X1627	X1237	X1627	X521
12	X2106	X2106	X1889	X571

\[\text{Table 4.1: Top 12 nodes with the highest degree, eigenvector, closeness and betweenness centrality}\]

4.4 Preliminary Conclusions

In this chapter, we found the exact 12 SNP locations that make up variant combinations. Network Science is powerful!

Here comes the question: Why don’t we simply use degree centrality and eigenvector centrality to detect the SNP locations then? Why don’t we stop here?

There are three reasons:

Our goal is to not only find which mutations cause diseases, but also which combinations cause diseases. The centrality measures we have do not allow us to infer the structure of the subnetwork of 12 nodes.
Our simulation uses co-occurrence counts to build edges, while in reality there could be more intrinsic ways to determine edges between SNPs.
We choose the top 12 nodes for centrality measures because we know the underlying truth that there should be 12 SNP locations from our simulation. It is difficult to find a cutoff without knowing the exact number of SNP locations involved, provided that a distinguishable gap in centrality measures (in this case, between 12th and 13th) does not exist.