We use the module NetworkX in this tutorial. It is a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks.
If you work with Anaconda, you can install the package as follows:
conda install -c anaconda networkx
Import modules:
import networkx as nximport matplotlib.pyplot as plt%matplotlib inlineimport warnings; warnings.simplefilter('ignore')
Social Network Basics
Each network consists of:
Nodes: The individuals whose network we are building.
Edges: The connection between the nodes. It represents a relationship between the nodes of the network.
Symmetric Networks (undirected)
The first network that we create is a group of people who work together. This is called a symmetric network because the relationship “working together” is a symmetric relationship: If A is related to B, B is also related to A.
What if the relationship between nodes is ‘child of’, then the relationship is no longer symmetric. This is the case if someone follows someone else on Twitter. Or in the case of hyperlinks.
If A is the child of B, then B is not a child of A. Such a network where the relationship is asymmetric (A is related to B, does not necessarily means that B is associated with A) is called an Asymmetric network.
We can build the asymmetric network in NetworkX using DiGraph method, which is short of Directional Graph.
Till now we had networks without weights, but it is possible that networks are made with weights, for example, if in our initial network we consider the number of projects done together as a weight, we will get a weighted Network.
Let us make one again of the employees, but this time we add weight to the network, each edge has a weight signifying the number of projects they have done together.
elarge = [(u, v) for (u, v, d) in G_weighted.edges(data=True) if d['weight'] >8]esmall = [(u, v) for (u, v, d) in G_weighted.edges(data=True) if d['weight'] <=8]pos = nx.circular_layout(G_weighted) # positions for all nodes# nodesnx.draw_networkx_nodes(G_weighted, pos, node_size=700)# edgesnx.draw_networkx_edges(G_weighted, pos, edgelist=elarge,width=6)nx.draw_networkx_edges(G_weighted, pos, edgelist=esmall,width=6, alpha=0.5, edge_color='b', style='dashed')# labelsnx.draw_networkx_labels(G_weighted, pos, font_size=20, font_family='sans-serif')plt.axis('off')plt.show();
Clustering coefficient
It is observed that people who share connections in a social network tend to form associations. In other words, there is a tendency in a social network to form clusters.
We can determine the clusters of a node, local clustering coefficient, which is the fraction of pairs of the node’s friends (that is connections) that are connected with each other.
To determine the local clustering coefficient, we make use of nx.clustering(Graph, Node) function.
In the symmetric employee-network, you will find that Michelle has a local clustering coefficient of 0.67 and Laura has a local clustering coefficient of 1.
The average clustering coefficient (sum of all the local clustering coefficients divided by the number of nodes) for the symmetric employee-network is 0.867.
nx.clustering(G_symmetric,'Michelle')
0.6666666666666666
nx.clustering(G_symmetric,'Laura')
1.0
nx.average_clustering(G_symmetric)
0.8277777777777778
Network Distance Measures
Degree
Degree of a node defines the number of connections a node has. NetworkX has the function degree which we can use to determine the degree of a node in the network.
nx.degree(G_symmetric, 'Michelle')
3
This will return a value of 3, as Michelle has worked with three employees in the network.
Distance
We can also determine the shortest path between two nodes and its length in NetworkX using nx.shortest_path(Graph, Node1, Node2) and nx.shortest_path_length(Graph, Node1, Node2) functions respectively.
We can find the distance of a node from every other node in the network using breadth-first search algorithm, starting from that node. networkX provides the function bfs_tree to do it.
And so if you use M = nx.bfs_tree(G_symmetric, 'Michelle') and now draw this tree, we will get a network structure telling how we can reach other nodes of the network starting from Michelle .
S = nx.bfs_tree(G_symmetric, 'Steven')nx.draw_networkx(S)
M = nx.bfs_tree(G_symmetric, 'Michelle')nx.draw_networkx(M)
Eccentricity
Eccentricity of a node A is defined as the largest distance between A and all other nodes.
It can be found using nx.eccentricity() function. In the symmetric employee-network, Michelle has an eccentricity of 2, and Steven has an eccentricity of 1 (he is connected to every other node).
nx.eccentricity(G_symmetric,'Michelle')
2
nx.eccentricity(G_symmetric,'Steven')
1
Centrality measures
Above we learned some of the network distance measures and they are useful in knowing how the information will spread through the network.
In this section, we will learn how to find the most important nodes (individuals) in the network. These parameters are called as centrality measures. Centrality Measures can help us in identifying popularity, most liked, and biggest influencers within the network.
Degree Centrality
The people most popular or more liked usually are the ones who have more friends.
Degree centrality is a measure of the number of connections a particular node has in the network. It is based on the fact that important nodes have many connections. NetworkX has the function degree_centrality() to calculate the degree centrality of all the nodes of a network.
The Betweenness Centrality is the centrality of control.
It represents the frequency at which a point occurs on the shortest paths that connected pair of points. It quantifies how many times a particular node comes in the shortest chosen path between two other nodes.
The nodes with high betweenness centrality play a significant role in the communication/information flow within the network.
The nodes with high betweenness centrality can have a strategic control and influence on others. An individual at such a strategic position can influence the whole group, by either withholding or coloring the information in transmission.
Networkx has the function betweenness_centrality() to measure it for the network. It has options to select if we want betweenness values to be normalized or not, weights to be included in centrality calculation or not, and to include the endpoints in the shortest path counts or not.
pos = nx.spring_layout(G_symmetric)betCent = nx.betweenness_centrality(G_symmetric, normalized=True, endpoints=True)node_color = [20000.0* G_symmetric.degree(v) for v in G_symmetric]node_size = [v *10000for v in betCent.values()]plt.figure(figsize=(10,10))nx.draw_networkx(G_symmetric, pos=pos, with_labels=True, node_color=node_color, node_size=node_size )plt.axis('off');
This dataset consists of ‘circles’ (or ‘friends lists’) from Facebook. Facebook data was collected from survey participants using this Facebook app. The dataset includes node features (profiles), circles, and ego networks.
Facebook data has been anonymized by replacing the Facebook-internal ids for each user with a new value. Also, while feature vectors from this dataset have been provided, the interpretation of those features has been obscured. For instance, where the original dataset may have contained a feature “political=Democratic Party”, the new data would simply contain “political=anonymized feature 1”. Thus, using the anonymized data it is possible to determine whether two users have the same political affiliations, but not what their individual political affiliations represent.
Let us start with the Facebook data, for our analysis here we will use Facebook combined ego networks dataset, it contains the aggregated network of ten individuals’ Facebook friends list. You can download the required facebook_combined.txt file from the Stanford University site.
We can also visualize the network such that the node color varies with Degree and node size with Betweenness Centrality. The code to do this is:
pos = nx.spring_layout(G_fb)betCent = nx.betweenness_centrality(G_fb, normalized=True, endpoints=True)node_color = [20000.0* G_fb.degree(v) for v in G_fb]node_size = [v *10000for v in betCent.values()]plt.figure(figsize=(20,20))nx.draw_networkx(G_fb, pos=pos, with_labels=False, node_color=node_color, node_size=node_size )plt.axis('off');
You can also know the labels of the nodes with the highest betweenness centrality using:
We can see that some nodes are common between Degree Centrality, which is a measure of degree, and Betweenness Centrality which controls the information flow.
It is natural that nodes that are more connected also lie on shortest paths between other nodes. The node 1912 is an important node as it is crucial according to all three centrality measures that we had considered.
Social Network Analysis with NetworkX in Python
We use the module NetworkX in this tutorial. It is a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks.
If you work with Anaconda, you can install the package as follows:
Import modules:
Social Network Basics
Each network consists of:
Symmetric Networks (undirected)
The first network that we create is a group of people who work together. This is called a symmetric network because the relationship “working together” is a symmetric relationship: If A is related to B, B is also related to A.
Now we visualize the network with the
draw_networkx()
function.Asymmetric Networks (directed)
What if the relationship between nodes is ‘child of’, then the relationship is no longer symmetric. This is the case if someone follows someone else on Twitter. Or in the case of hyperlinks.
If A is the child of B, then B is not a child of A. Such a network where the relationship is asymmetric (A is related to B, does not necessarily means that B is associated with A) is called an Asymmetric network.
We can build the asymmetric network in NetworkX using
DiGraph
method, which is short of Directional Graph.To make sure that all nodes are distinctly visible in the network, use the
spring_layout()
function, followed by thedraw_networkx()
function.Weighted Networks
Till now we had networks without weights, but it is possible that networks are made with weights, for example, if in our initial network we consider the number of projects done together as a weight, we will get a weighted Network.
Let us make one again of the employees, but this time we add weight to the network, each edge has a weight signifying the number of projects they have done together.
Clustering coefficient
It is observed that people who share connections in a social network tend to form associations. In other words, there is a tendency in a social network to form clusters.
We can determine the clusters of a node, local clustering coefficient, which is the fraction of pairs of the node’s friends (that is connections) that are connected with each other.
To determine the local clustering coefficient, we make use of
nx.clustering(Graph, Node)
function.In the symmetric employee-network, you will find that Michelle has a local clustering coefficient of 0.67 and Laura has a local clustering coefficient of 1.
The average clustering coefficient (sum of all the local clustering coefficients divided by the number of nodes) for the symmetric employee-network is 0.867.
Network Distance Measures
Degree
Degree of a node defines the number of connections a node has. NetworkX has the function
degree
which we can use to determine the degree of a node in the network.This will return a value of 3, as Michelle has worked with three employees in the network.
Distance
We can also determine the shortest path between two nodes and its length in NetworkX using
nx.shortest_path(Graph, Node1, Node2)
andnx.shortest_path_length(Graph, Node1, Node2)
functions respectively.Breadth-first search
We can find the distance of a node from every other node in the network using breadth-first search algorithm, starting from that node. networkX provides the function bfs_tree to do it.
And so if you use
M = nx.bfs_tree(G_symmetric, 'Michelle')
and now draw this tree, we will get a network structure telling how we can reach other nodes of the network starting from Michelle .Eccentricity
Eccentricity of a node A is defined as the largest distance between A and all other nodes.
It can be found using
nx.eccentricity()
function. In the symmetric employee-network, Michelle has an eccentricity of 2, and Steven has an eccentricity of 1 (he is connected to every other node).Centrality measures
Above we learned some of the network distance measures and they are useful in knowing how the information will spread through the network.
In this section, we will learn how to find the most important nodes (individuals) in the network. These parameters are called as centrality measures. Centrality Measures can help us in identifying popularity, most liked, and biggest influencers within the network.
Degree Centrality
The people most popular or more liked usually are the ones who have more friends.
Degree centrality is a measure of the number of connections a particular node has in the network. It is based on the fact that important nodes have many connections. NetworkX has the function
degree_centrality()
to calculate the degree centrality of all the nodes of a network.Eigenvector Centrality
It is not just how many individuals one is connected too, but the type of people one is connected with that can decide the importance of a node.
Eigenvector centrality is a measure of how import a node is by accounting for the fact of how well it is connected to other important nodes.
We can use the
eigenvector_centrality()
function of NetworkX to calculate eigenvector centrality of all the nodes in a network.The Google’s Pagerank algorithm is a variant of Eigenvector centrality algorithm.
Closeness Centrality
Closeness Centrality is a measure where each node’s importance is determined by closeness to all other nodes.
Betweenness Centrality
The Betweenness Centrality is the centrality of control.
It represents the frequency at which a point occurs on the shortest paths that connected pair of points. It quantifies how many times a particular node comes in the shortest chosen path between two other nodes.
The nodes with high betweenness centrality play a significant role in the communication/information flow within the network.
The nodes with high betweenness centrality can have a strategic control and influence on others. An individual at such a strategic position can influence the whole group, by either withholding or coloring the information in transmission.
Networkx has the function
betweenness_centrality()
to measure it for the network. It has options to select if we want betweenness values to be normalized or not, weights to be included in centrality calculation or not, and to include the endpoints in the shortest path counts or not.Facebook Case Study
This dataset consists of ‘circles’ (or ‘friends lists’) from Facebook. Facebook data was collected from survey participants using this Facebook app. The dataset includes node features (profiles), circles, and ego networks.
Facebook data has been anonymized by replacing the Facebook-internal ids for each user with a new value. Also, while feature vectors from this dataset have been provided, the interpretation of those features has been obscured. For instance, where the original dataset may have contained a feature “political=Democratic Party”, the new data would simply contain “political=anonymized feature 1”. Thus, using the anonymized data it is possible to determine whether two users have the same political affiliations, but not what their individual political affiliations represent.
Source: J. McAuley and J. Leskovec. Learning to Discover Social Circles in Ego Networks. NIPS, 2012
Let us start with the Facebook data, for our analysis here we will use Facebook combined ego networks dataset, it contains the aggregated network of ten individuals’ Facebook friends list. You can download the required facebook_combined.txt file from the Stanford University site.
We read in the file and construct the Graph:
Download the file
The network consists of 4,039 nodes, connected via 88,234 edges.
We can also visualize the network such that the node color varies with Degree and node size with Betweenness Centrality. The code to do this is:
You can also know the labels of the nodes with the highest betweenness centrality using:
We can see that some nodes are common between Degree Centrality, which is a measure of degree, and Betweenness Centrality which controls the information flow.
It is natural that nodes that are more connected also lie on shortest paths between other nodes. The node 1912 is an important node as it is crucial according to all three centrality measures that we had considered.
Sources of examples: