Introduction to Network Analysis

or: Pretty Graphs Aren't Worth 16 Hours Of Debugging

Aug 17, 2023  │  m. Aug 27, 2023 by Zachary Plotkin  │  #graph-theory   #network-analysis  

twitter-x

Introduction

I used to spend a lot of time on Twitter, especially around the machine learning communities. I read a lot, and I was struck by just how often certain institutions kept coming up — Google, OpenAI, Facebook, etc. It seemed like these institutions had a disproportionately high impact on the communities they were a part of, and so I wanted to study what that actually looked like in the grand scheme of things.

I think that this is pretty important primarily because of how impactful ML is in the current political climate; a lot of public policy and attention, as well as funding, is dictated by public opinion. If certain corporations are unduly affecting the public discourse, then they could similarly be having a disproportionately high effect on matters of policy and funding. I think we have a responsibility to check the power of corporations, especially on matters pertaining to big data and ML.

Going in, I had a few primary hypotheses:

  1. Google would dominate the public narrative;
  2. Few members of the community without a corporate or institutional affiliation would have a loud voice or much impact;
  3. Even beyond Google, the primary discourse around ML Twitter would be fueled by corporations and corporate interests.

This is especially relevant to our Race After Technology readings, and I don’t think much else needs to be said beyond this compelling quote from an early section:

This is an industry with access to data and capital that exceeds that of sovereign nations, throwing even that sovereignty into question when such technologies draw upon the science of persuasion to track, addict, and manipulate the public. We are talking about a redefinition of human identity, autonomy, core constitutional rights, and democratic principles more broadly.

Network Analysis Introduction

graph-theory

Network analysis is, at its core, the study of mathematical structures called “graphs.” You can think of a graph like an ordinary road map: each city is a “node,” and each road between two cities is an “edge.” Just as a map shows the cities and the roads connecting them, a graph captures a set of things and the connections between them. Network analysis gives us a powerful way to visualize and reason about complex relationships, and it shows up in a huge range of problem spaces for a variety of purposes.
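To make the analogy concrete, here’s a tiny sketch using networkx (one of the libraries I use later in this post; the city names are obviously made up):

import networkx as nx

# each city is a node; each road is an edge between two cities
roads = nx.Graph()
roads.add_edge("Springfield", "Shelbyville")
roads.add_edge("Shelbyville", "Ogdenville")

print(roads.nodes())                         # the "cities"
print(list(roads.neighbors("Shelbyville")))  # the cities you can drive to from here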

graph-theory-2

In this case, I’m building our graph by creating “roads” (edges) connecting each tweet’s author to every account they @mention. In doing so, my hope is that we’ll find some rather interesting information about how these tweets connect. Who is talking to whom? Just how popular are they? Will I ever be a niche-internet-micro-celebrity on Twitter X? All of these are questions that network analysis has the power to answer.
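Here’s a minimal sketch of that idea with pyvis, the library the final graph is built with (the tweet and usernames are hypothetical):

from pyvis.network import Network

net = Network()

# one hypothetical tweet: @alice mentions @bob and @carol
author = "alice"
mentions = ["bob", "carol"]

net.add_node(author)
for mention in mentions:
    net.add_node(mention)
    net.add_edge(author, mention)   # a "road" from the author to each account they mention

net.save_graph("toy_graph.html")    # writes an interactive HTML visualization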

Writing the Code

In the interests of space, I’m mostly just going to be giving a high-level overview of what my code is doing and then detailing the specific annoyances I had with the project as a whole. With that said, let’s get started!

regex

The Libraries

### Install Libraries ###
!pip install twarc --upgrade
!pip install twarc-csv --upgrade
!pip install pyvis
!pip install networkx

While I used networkx at first, I wound up building the final graphs with pyvis instead. Both are Python libraries for working with graphs, and since I had more experience with networkx, that’s where I started. Unfortunately, its data-visualization side was a bit weak for what I wanted (I’m sure there were better ways to do it than what I wound up doing), so I pivoted to pyvis.
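As an aside, pyvis can also import an existing networkx graph via its from_nx helper, so you don’t necessarily have to rebuild everything from scratch. A minimal sketch (the graph itself is a throwaway example):

import networkx as nx
from pyvis.network import Network

nx_graph = nx.Graph()
nx_graph.add_edge("alice", "bob")
nx_graph.add_edge("alice", "carol")

net = Network(height="750px", width="100%", bgcolor="#222222", font_color="white")
net.from_nx(nx_graph)               # copies nodes and edges over from the networkx graph
net.save_graph("from_nx_demo.html")

Even with from_nx, node styling (colors, hover titles, and so on) still has to be set in a pyvis-friendly way, which is roughly why I ended up building the pyvis Network directly.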

Both twarc and twarc-csv are libraries helpful for handling Twitter X API calls and working with the data returned from them.

!twarc2 search "(from:_jasonwei OR from:quocleix OR from:ylecun OR from:HugoTouvron OR from:AndrewYNg) -is:retweet has:mentions -is:nullcast" --start-time "2023-07-01" --end-time "2023-08-01" --limit 500 --archive > original_userbank.jsonl

My plan for this project was to start with five users who are big in the machine learning Twitter space and who don’t overlap much. I would then crawl back through about a month of their tweets, see who they were talking to, and use that to generate a larger list of users to analyze further.

!twarc2 csv original_userbank.jsonl original_userbank.csv

We use twarc-csv to parse our data into something a bit more readable.

import csv
import re

# we're using a set because we don't care about repeats: a set is like a list,
# except it's unordered and adding a duplicate entry has no effect.

userset = set()


# This part just parses out all of the characters we don't want.
regex = re.compile('[^a-zA-Z,]')

with open("original_userbank.csv", "r") as f:
    reader = csv.reader(f, delimiter=",")
    for row in reader:
      parsed = regex.sub("", row[34])       # row[34] is our @mentions
      line = (parsed.split(","))            # splits our @mentions
      for name in line:                       
        userset.add(name)

# because we read straight through original_userbank.csv, the header of the
# mentions column (parsed down to "entitiesmentions") ends up in the set too,
# since the loop above can't tell it apart from a real username. Drop it.
userset.remove("entitiesmentions")

names = list(userset)

with open('queries.txt', 'w') as f:
    for name in names:
      f.write(f'from:{name} has:mentions -is:retweet -is:nullcast')
      f.write("\n")

Specific comments on parts of the code above are in the code block itself, but the quick overview: we pulled the @mentions out of each tweet, parsed them down to bare usernames, and added every name we found to a set. Then we built a search query for each name and dropped them all into a .txt file so that we could…

!twarc2 searches --start-time "2023-07-01" --end-time "2023-08-01" --archive --limit 5 queries.txt > total_tweets_attempt2.jsonl

COMBINE THEM ALL! MUAHAHAHAHAHA.

coughs

Anyways, we’re now doing the same thing we did before, but now with every single user we put into our userlist. That way we have… big graph. Then we gotta clean it up, and we’re done with the data collection part!

!twarc2 csv total_tweets_attempt2.jsonl parsed_tweets_2.csv

The Graph

import csv
import re
from pyvis.network import Network
# import networkx as nx
import random

# G = nx.Graph()
G = Network(height="750px", width="100%", bgcolor="#222222", font_color="white")

# This is our dictionary! If we wanna add hard-coded brand colors, we'd drop it in here. We can also put in identical brands, like metaai/meta/facebookai and stuff.
color_dict = {
    "meta":"#0668E1",
    "facebookai":"#0668E1",
    "metaai":"#0668E1",
    "openai":"#00A67E",
    "google": "#4285F4",
    "googleai": "#4285F4",
    "googledeepmind": "#4285F4",
    "nyugamelab": "#57068C",
    "nyudatascience": "#57068C",
    "nyuling": "#57068C",
    "uwaterloo": "#FFEA3D",
    "goldmansachs": "#6B96C3",
    "bloomberg": "#000000",
    "stanfordcrfm": "#B1040E",
    "stanfordnlp": "#B1040E",
    "stanfordailab": "#B1040E",
    "stanford": "#B1040E",
    "cornellcis": "#B31B1B",
    "princetoncitp": "#FF8F00",
    "unaffiliated": "#D3D3D3"
}

org_dict = {}


# This part just sets up regex (regular expressions) to parse out all of the characters we don't want.
regex = re.compile('[^a-zA-Z0-9,_-]')
regex2 = re.compile('[^a-zA-Z0-9 @_]')

def check_color(org): # returns a nice pretty color for our graph
  # if we already have a color for this org (brand color or previously generated), reuse it
  if org in color_dict:
    return color_dict[org]
  # if not, generate a random hex color, remember it for next time, and return it
  color = "#" + ''.join(random.choice('ABCDEF0123456789') for i in range(6))
  color_dict[org] = color
  return color

# first pass: count how many times each @handle shows up across every bio,
# so that we can later pick each user's "primary" org by popularity

with open("parsed_tweets_2.csv", "r") as f:
    reader = csv.reader(f, delimiter=",")
    for row in reader:
      bio = row[49].replace(r"\n", " ")                 # row[49] is the author bio
      parsed_bio = regex2.sub("", bio).replace("\n", " ").lower()
      orgs = parsed_bio.split(" ")
      for org in orgs:
        if org == "": continue
        if org[0] != "@": continue
        org = org[1:]                                   # strip the leading @
        if org == "": continue
        org_dict[org] = org_dict.get(org, 0) + 1        # increment (or start) the count

with open("parsed_tweets_2.csv", "r") as f:
    reader = csv.reader(f, delimiter=",")

    # first, we want to set up our nodes
    for row in reader:
      bio = row[49].replace(r"\n", " ")
      parsed_name = regex.sub("", row[47])
      parsed_bio = regex2.sub("", bio).lower()

      orgs = []

      # this is just parsing and making sure our orgs are actually @mentions
      words = parsed_bio.split(" ")
      for word in words:
        if word == "": continue
        if word[0] != "@": continue
        word = word[1:]                    # strip the leading @
        if word == "": continue

        orgs.append([word, int(org_dict.get(word))])

      # find the primary org: whichever @handle in the bio shows up most often
      # across the entire dataset
      try:
        prim_org = sorted(orgs, key=lambda x: x[1], reverse=True)[0][0]
      except IndexError:
        # no @handles in the bio at all, so tag the user as unaffiliated
        prim_org = "unaffiliated"

      G.add_node(parsed_name, title=prim_org, color=check_color(prim_org))


with open("parsed_tweets_2.csv", "r") as f:
    reader = csv.reader(f, delimiter=",")
    # next we set up our mentioned nodes
    for row in reader:
      parsed_mentions = regex.sub("", row[34])       # row[34] is our @mentions
      mentions = parsed_mentions.split(",")          # splits our @mentions

      for mention in mentions:
        if mention == '': continue
        if mention == "entitiesmentions": continue   # skip the header row
        G.add_node(mention)

with open("parsed_tweets_2.csv", "r") as f:
    reader = csv.reader(f, delimiter=",")
    # next we want to create our edge connections
    for row in reader:
      parsed_mentions = regex.sub("", row[34])       # row[34] is our @mentions
      parsed_name = regex.sub("", row[47])           # row[47] is our username

      mentions = parsed_mentions.split(",")          # splits our @mentions

      # we don't care about the header row
      if (parsed_name == "authorusername") or (parsed_mentions == "entitiesmentions"):
        continue
      for mention in mentions:
        if mention == '': continue
        # print(f"({parsed_name}, {mention})")
        try:
          G.add_edge(parsed_name, mention)
        except:
          print(f"failed to add an edge between {parsed_name} and {mention}")

# print(org_dict)

Coding Frustrations

Whoo boy. This one was annoying. So, the first issue I ran into was how I wanted to display the graph. I chose matplotlib since it was easy and it has networkx integration, but, like, it just looked super ugly.

# Old code, creates a really ugly graph visualization
# Note: will have to turn back into networkx graph for this to work

import matplotlib.pyplot as plt
import networkx as nx

pos = nx.spring_layout(G)   # specify layout for visual

f, ax = plt.subplots(figsize=(10, 10))
plt.style.use('ggplot')
nodes = nx.draw_networkx_nodes(G, pos, alpha=0.8)
nodes.set_edgecolor('k')
nx.draw_networkx_labels(G, pos, font_size=8)
nx.draw_networkx_edges(G, pos, width=1.0, alpha=0.2)

first-graph

So I played around with a few more libraries, eventually deciding on pyvis since it looked a whole lot better. Except now there were new issues: networkx graphs and pyvis Networks aren’t one-to-one, which meant I had to re-engineer how I wanted the pyvis network to look. I eventually figured it out, and yeah, it looked way nicer this time around:

second-graph

But while this was pretty, I didn’t feel that it visually represented what I wanted to show: the organizational affiliations of the biggest twitter users. Who is the loudest? What are they a part of? Do they talk amongst themselves? Etc.

And so, I decided to make it… colorful. And this began the most frustrating part of my network analysis journey. First, I needed to somehow detect institutional affiliation. Which… hm. Was kinda an issue? Like, how am I supposed to figure out if Joe Shmoe works for Google, Apple, Facebook, or the Boston Public Library? I decided to look for @s in the bio (which wound up causing a whole host of issues down the line; I get into that in a bit), then count how many times each @handle shows up across the entire dataset. That way, even if Joe Shmoe’s bio @s his girlfriend, his dog, and Barack Obama, he gets assigned to whichever handle in his bio is @’d the most across the whole dataset.
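To make that concrete, here’s a toy version of the idea (the bios, handles, and counts below are all made up): count every @handle across all bios, then resolve each user to whichever handle in their own bio has the highest dataset-wide count.

from collections import Counter

# hypothetical bios pulled from the dataset
bios = [
    "research scientist @google previously @stanfordnlp",
    "phd student @stanfordnlp dog dad to @goodboy",
    "engineer @google",
]

# count how often each @handle appears across every bio
handle_counts = Counter(
    word.lstrip("@") for bio in bios for word in bio.split() if word.startswith("@")
)

def primary_org(bio):
    # pick the handle from this bio with the highest dataset-wide count
    handles = [w.lstrip("@") for w in bio.split() if w.startswith("@")]
    return max(handles, key=lambda h: handle_counts[h]) if handles else "unaffiliated"

print(primary_org("phd student @stanfordnlp dog dad to @goodboy"))  # stanfordnlp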

coloooors

After that, we map the primary institution to a hex color (I tried to go by brand color for a few of the orgs, but they kinda all have the same vibes, tbh), and if one doesn’t exist we generate a random color. This was really frustrating and required an absurd amount of parsing to get done, but I learned how regex works, which was a pretty neat addition to my toolbox.

So, what was the issue? Well, I’m glad you asked. I COULDN’T REPLACE NEW LINES. This was such an annoying problem because no matter what I did, my institutions would have these absurdly strange additions tacked on afterward and just… yeah. It wasn’t nice. Turns out Python was looking for actual line breaks when I told it to replace “\n”, instead of the literal backslash-n characters sitting in the CSV (which is why the raw string r"\n" shows up in the code above). I spent 2 hours trying to fix this. Anyways, there were other annoying things too, but I’m trying to put them out of my head for now.
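A minimal repro of the gotcha: the CSV cell contains the two literal characters backslash and n, so the pattern has to be the escaped (or raw) string rather than an actual newline.

bio = r"ML researcher\n@google"   # the CSV stores a literal backslash followed by n

print(bio.replace("\n", " "))     # unchanged: there is no real newline in this string
print(bio.replace(r"\n", " "))    # ML researcher @google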

bug

Findings and VISUALIZATION!!

Smol Graph: https://twitter-graph.netlify.app/ (note: it is REALLY slow to load)

colored-graph

This is only with 2,000 tweets. I decided to hold out on a full-sized version until after everyone had already finished their projects so as to properly cannibalize every single last tweet we have left.

labeled-graph

Key: white nodes are unaffiliated users with no institutions, and colored nodes are users with institutional affiliations. I tried to keep colors on brand, but there were MANY organizations represented here.

With that said, there were still some rather interesting preliminary findings:

  1. The vast majority of ‘power users’ are actually unaffiliated. Perhaps they simply don’t publicize their affiliation in their Twitter bio; perhaps they really are one-hundred-percent unaffiliated with any corporation or organization. Either way, the degree of intermingling between unaffiliated and affiliated users was something I completely hadn’t expected.
  2. Those with research-institution affiliations were the power users. This was pretty unexpected: the accounts with the most engagement were, in fact, not the ones affiliated with corporations. This may be due to the low sample size; more data would be required to get an accurate read on this network.
  3. Goldman Sachs had an absurdly disproportionate effect on the conversation, finding itself at the intersection of both academia and industry.

Further Questions

  1. Why does Goldman Sachs have such a large impact? Is this just due to the small sample size? Or is there something more going on?
  2. Research institutions are having a huge impact on the public narrative, with schools like NYU showing up surprisingly often (alongside the more obvious Stanford, UWaterloo, Princeton, etc.). With more data, which institutions are actually getting the most attention? Combining these stats with likes/followers/retweets/etc. would allow for another interesting analysis.
  3. Are unaffiliated users really as unaffiliated as they appear to be? Do they really have such a tangible impact on public discourse? With more data, this question would be a lot easier to ask and answer.
  4. How does the conversation flow? While academia does tend to connect to academia and industry tends to connect with industry, there is also overlap. Which institutions are more prone to overlap?
  5. Interviews could provide a more holistic assessment of this data: directly asking users with institutional affiliations how much they overlap with other institutions, what they notice about engaging in public discourse, and who they see driving that discourse. Do citations overlap with Twitter ‘clout’?

Conclusion

Social networks are complex. This much I thought I knew going in, but the sheer scale of the complexity was remarkable. My initial hypothesis that major corporations completely steer the public narrative turned out to be wrong, or at least far less clear-cut than a first pass at the question would suggest.

Once more, we are struck with the unfortunate reality of “we need way more data before we can realistically do the things we want to do.” With that said, there were still some rather interesting preliminary findings that would allow for much more directed future research into how industry and academia interact.

Unaffiliated users being as popular as they were was a welcome surprise, and Goldman Sachs’ involvement in Machine Learning Twitter was an equally surprising one. Hopefully Professor Kulkarni will let me consume the leftover API calls, because I really, really want to see where this data leads.


