Introduction
I used to spend a lot of time on Twitter, especially around the machine learning communities. I read a lot, and I was struck by just how often certain institutions kept coming up — Google, OpenAI, Facebook, etc. It seemed like these institutions had a disproportionately high impact on the communities they were a part of, and so I wanted to study what that actually looked like in the grand scheme of things.
I think that this is pretty important primarily because of how impactful ML is in the current political climate; a lot of public policy and attention, as well as funding, is dictated by public opinion. If certain corporations are unduly affecting the public discourse, then they could similarly be having a disproportionately high effect on matters of policy and funding. I think we have a responsibility to check the power of corporations, especially on matters pertaining to big data and ML.
Going in, I had a few primary hypotheses:
- Google would dominate the public narrative;
- Few corporate/institutionally unaffiliated members of the community would have a loud impact or voice;
- Even beyond Google, the primary discourse around ML Twitter would be fueled by corporations and corporate interests.
This is especially relevant to our Race After Technology readings, and I don’t think much else needs to be said aside from this compelling quote in an early section of the readings:
This is an industry with access to data and capital that exceeds that of sovereign nations, throwing even that sovereignty into question when such technologies draw upon the science of persuasion to track, addict, and manipulate the public. We are talking about a redefinition of human identity, autonomy, core constitutional rights, and democratic principles more broadly.
Network Analysis Introduction
Network analysis is pretty much just the study of mathematical structures called "graphs." You can think of a graph like a normal land map. In a graph, each city would be something called a "node," and each road would be something called an "edge." Just like a map displays all the cities and the roads between them, studying a graph is effectively just studying the connections between nodes. Studying graphs through network analysis is a very powerful way to visualize complex relationships, and it's used across a bunch of different problem spaces for a variety of purposes.
In this case, I'm building our graph by creating "roads" connecting a tweet author to each account they @mention. In doing so, my hope is that we'll find some rather interesting information about the connections between these tweets. Who is talking to who? And just how popular are they? Will I ever be a niche-internet-micro-celebrity on ML Twitter?
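As a minimal sketch of that idea (the usernames here are made up, not taken from the data), here is what one author @mentioning one other account looks like as a graph:
from pyvis.network import Network

# a tiny toy graph: two "cities" (nodes) and one "road" (edge)
toy = Network()
toy.add_node("alice")           # a tweet author
toy.add_node("bob")             # an account she @mentioned
toy.add_edge("alice", "bob")    # the mention connects them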
Writing the Code
In the interests of space, I'm mostly just going to give a high-level overview of what my code is doing and then detail the specific annoyances I had with the project as a whole. With that said, let's get started!
The Libraries
### Install Libraries ###
!pip install twarc --upgrade
!pip install twarc-csv --upgrade
!pip install pyvis
!pip install networkx
While I used networkx at first, I wound up building the graphs with pyvis instead. They are both Python libraries that let you work with graphs, and I had more experience with networkx, so I wanted to stick with it. Unfortunately, its data visualization side was a bit lacking (I'm sure there were better ways to do it than what I wound up doing), so I pivoted to pyvis instead.
Both twarc and twarc-csv are libraries that help with making Twitter/X API calls and with working with the data returned from them.
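One quick setup note that isn't shown in the cells here: before any of the twarc2 commands below will run, twarc needs to be pointed at your Twitter/X API credentials (academic/archive access, in my case). That's a one-time step along the lines of:
!twarc2 configure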
The Initial Search
!twarc2 search "(from:_jasonwei OR from:quocleix OR from:ylecun OR from:HugoTouvron OR from:AndrewYNg) -is:retweet has:mentions -is:nullcast" --start-time "2023-07-01" --end-time "2023-08-01" --limit 500 --archive > original_userbank.jsonl
My plan for this project was to start with five users who are big in the machine learning Twitter space and who don't have much overlap with each other. I would then crawl back through about a month of their tweets, see who they were talking to, and use that to generate a larger list of users that I could analyze further.
!twarc2 csv original_userbank.jsonl original_userbank.csv
We use twarc-csv to parse our data into something a bit more readable.
import csv
import re

# we're using a set because we don't care about repeats. sets are mathematical
# constructions similar to lists but are unordered and duplicate entries don't
# matter and won't be counted.
userset = set()

# This part just parses out all of the characters we don't want.
regex = re.compile('[^a-zA-Z,]')

with open("original_userbank.csv", "r") as f:
    reader = csv.reader(f, delimiter=",")
    for row in reader:
        parsed = regex.sub("", row[34])  # row[34] is our @mentions
        line = parsed.split(",")         # splits our @mentions
        for name in line:
            if name == "": continue      # skip rows with nothing left after parsing
            userset.add(name)

# because we're prowling through the entire original_userbank.csv, our header
# is actually entities_mentions (parsed down to entitiesmentions), we want to
# get rid of it because our little script up there can't tell whether or not
# it's a real username
userset.discard("entitiesmentions")

names = list(userset)
with open('queries.txt', 'w') as f:
    for name in names:
        f.write(f'from:{name} has:mentions -is:retweet -is:nullcast')
        f.write("\n")
Specific comments on parts of the code above are in the actual code block itself, but a quick overview: we created a set of users, parsed the @mentions out of the data, and added every name we found to that set. Then we did a bit more parsing, created a query for each name, and dropped them all into a .txt file so that we could…
!twarc2 searches --start-time "2023-07-01" --end-time "2023-08-01" --archive --limit 5 queries.txt > total_tweets_attempt2.jsonl
COMBINE THEM ALL! MUAHAHAHAHAHA.
–coughs–
Anyways, we’re now doing the same thing we did before, but now with every single user we put into our userlist. That way we have… big graph. Then we gotta clean it up, and we’re done with the data collection part!
!twarc2 csv total_tweets_attempt2.jsonl parsed_tweets_2.csv
The Graph
import csv
import re
from pyvis.network import Network
# import networkx as nx
import random
# G = nx.Graph()
G = Network(height="750px", width="100%", bgcolor="#222222", font_color="white")
# This is our dictionary! If we wanna add hard-coded brand colors, we'd drop it in here. We can also put in identical brands, like metaai/meta/facebookai and stuff.
color_dict = {
    "meta": "#0668E1",
    "facebookai": "#0668E1",
    "metaai": "#0668E1",
    "openai": "#00A67E",
    "google": "#4285F4",
    "googleai": "#4285F4",
    "googledeepmind": "#4285F4",
    "nyugamelab": "#57068C",
    "nyudatascience": "#57068C",
    "nyuling": "#57068C",
    "uwaterloo": "#FFEA3D",
    "goldmansachs": "#6B96C3",
    "bloomberg": "#000000",
    "stanfordcrfm": "#B1040E",
    "stanfordnlp": "#B1040E",
    "stanfordailab": "#B1040E",
    "stanford": "#B1040E",
    "cornellcis": "#B31B1B",
    "princetoncitp": "#FF8F00",
    "unaffiliated": "#D3D3D3"
}
org_dict = {}
# This part just sets up regex (regular expressions) to parse out all of the characters we don't want.
regex = re.compile('[^a-zA-Z0-9,_-]')
regex2 = re.compile('[^a-zA-Z0-9 @_]')
def check_color(org): # generates some nice pretty colors for our graph
    # checks to see if the color already exists in our dict
    if org in color_dict:
        return color_dict[org]
    # if not, generate a new color, assign it to our org, and send it to the dict
    color = "#" + ''.join(random.choice('ABCDEF0123456789') for _ in range(6))
    color_dict[org] = color
    return color
# first pass: count how many times each @handle shows up across every bio in the dataset
with open("parsed_tweets_2.csv", "r") as f:
    reader = csv.reader(f, delimiter=",")
    for row in reader:
        bio = row[49].replace(r"\n", " ")  # row[49] is the author's bio
        parsed_bio = regex2.sub("", bio).lower()
        orgs = parsed_bio.split(" ")
        for org in orgs:
            if org == "": continue
            if org[0] != "@": continue
            org = org[1:]  # strip the leading @
            if org == "": continue
            org_dict[org] = org_dict.get(org, 0) + 1
with open("parsed_tweets_2.csv", "r") as f:
reader = csv.reader(f, delimiter=",")
# first, we want to set up our nodes
for row in reader:
bio = row[49].replace(r"\n", " ")
parsed_name = regex.sub("", row[47])
parsed_bio = regex2.sub("", bio).lower()
orgs = []
# this is just parsing and making sure our orgs are actually @mentions
words = parsed_bio.split(" ")
for word in words:
if word == "": continue
if word[0] != "@": continue
word = word.replace(word[0], "", 1)
if word == "": continue
orgs.append([word, int(org_dict.get(word))])
# sort the @mentions in the bio by how many times it has shown up
try:
prim_org = sorted(orgs, key = lambda x: x[1])[0][0]
except:
# if an org is not found, assign them the unafiliated tag
prim_org = "unaffiliated"
# find the primary org by seeing which org in the bio was the most popular
G.add_node(parsed_name, title=prim_org, color=check_color(prim_org))
with open("parsed_tweets_2.csv", "r") as f:
reader = csv.reader(f, delimiter=",")
# next we set up our mentioned nodes
for row in reader:
parsed_mentions = regex.sub("", row[34]) # row[34] is our @mentions
mentions = (parsed_mentions.split(",")) # splits our @mentions #47
for mention in mentions:
if mention == '': continue
G.add_node(mention)
with open("parsed_tweets_2.csv", "r") as f:
reader = csv.reader(f, delimiter=",")
# next we want to create our edge connections
for row in reader:
parsed_mentions = regex.sub("", row[34]) # row[34] is our @mentions
parsed_name = regex.sub("", row[47]) # row[47] is our username
mentions = parsed_mentions.split(",") # splits our @mentions #47
# we don't care about the headers
if (parsed_name == "authorusername") or (parsed_mentions == "entitiesmentions") or (parsed_bio == "authordescription"):
continue
for mention in mentions:
# print(f"({parsed_name}, {mention})")
try:
G.add_edge(parsed_name, mention)
except:
print(f"failed to add edges {parsed_name} and {mention}")
# print(org_dict)
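One thing the block above doesn't show is how the graph actually gets displayed. pyvis renders the network as an interactive HTML page, so depending on your pyvis version, something like this (the filename is arbitrary) is all that's left to do:
# write the interactive visualization out to an HTML file we can open in a browser
G.write_html("ml_twitter_graph.html")
# on older pyvis versions, G.show("ml_twitter_graph.html") does the same job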
Coding Frustrations
Whoo boy. This one was annoying. So, the first issue I ran into was how I wanted to display the graph. I chose matplotlib since it was easy and it has networkx integration, but, like, it just looked super ugly.
# Old code, creates a really ugly graph visualization
# Note: will have to turn G back into a networkx graph for this to work
import matplotlib.pyplot as plt
import networkx as nx

pos = nx.spring_layout(G)  # specify layout for visual
f, ax = plt.subplots(figsize=(10, 10))
plt.style.use('ggplot')
nodes = nx.draw_networkx_nodes(G, pos, alpha=0.8)
nodes.set_edgecolor('k')
nx.draw_networkx_labels(G, pos, font_size=8)
nx.draw_networkx_edges(G, pos, width=1.0, alpha=0.2)
So I played around with a few more libraries, eventually deciding on pyvis since it looked a whole lot better. Except now there were issues: networkx graphs and pyvis.network objects aren't one-to-one, which meant that I had to re-engineer how I wanted the pyvis network to look. I eventually figured it out, and yeah, it looked way nicer this time around.
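(As an aside, for anyone making the same migration: pyvis can also import an existing networkx graph directly via from_nx, though you may still have to redo the styling by hand. A rough sketch, with a throwaway networkx graph standing in for the real one:)
import networkx as nx
from pyvis.network import Network

nx_graph = nx.Graph()              # stand-in networkx graph
nx_graph.add_edge("alice", "bob")

net = Network(height="750px", width="100%")
net.from_nx(nx_graph)              # copies the nodes and edges into the pyvis network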
But while this was pretty, I didn't feel that it visually represented what I wanted to show: the organizational affiliations of the biggest Twitter users. Who is the loudest? What are they a part of? Do they talk amongst themselves? Etc.
And so, I decided to make it… colorful. And this began the most frustrating part of my network analysis journey. First, I needed to somehow detect institutional affiliation. Which… hm. Was kinda an issue? Like, how am I supposed to figure out if Joe Shmoe works for Google, Apple, Facebook, or the Boston Public Library? I decided to look for @s in the bio, which wound up causing a whole host of issues down the line (I get into that in a bit), then map the @s to the number of times they were found throughout the entire dataset. That way, if Joe Shmoe’s bio has @s for his girlfriend, dog, or Barack Obama, it will instead go by the organization that is most @’d by sorting through counts.
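To make that concrete, here's roughly how the counting idea plays out on a made-up bio (the handles and counts are invented for illustration, not pulled from the dataset):
# hypothetical counts of how often each @handle appears across every bio in the dataset
org_counts = {"googleai": 41, "barackobama": 3, "joesdog": 1}

# the @handles pulled out of Joe Shmoe's own bio
bio_handles = ["googleai", "barackobama", "joesdog"]

# the handle that shows up most often dataset-wide wins, so personal @s
# (partners, pets, Obama) lose out to the actual employer
primary_org = max(bio_handles, key=lambda h: org_counts.get(h, 0))
print(primary_org)  # -> googleai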
After that, we map the primary institution to a hex color (I tried to go by brand color for a few of the orgs but they kinda all have the same vibes, tbh), and if one doesn’t exist we generate a random color. This was really frustrating and required an absurd amount of parsing to get done, but I learned how regex works which was a pretty neat addition to my toolbox.
So, what was the issue? Well, I'm glad you asked. I COULDN'T REPLACE NEW LINES. This was such an annoying problem because no matter what I did, my institutions would have these absurdly strange additions tacked onto the end, and just, yeah. It wasn't nice. Turns out it was because Python was treating "\n" as an actual linebreak character when I told it to replace \n, rather than as the literal two characters "\" and "n" sitting in the CSV (hence the raw string r"\n" in the code above). I spent 2 hours trying to fix this. Anyways, there were other annoying things that occurred too, but I'm trying to put them out of my head for now.
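For posterity, here's a tiny demonstration of the gotcha (the bio string is made up):
bio = r"ML researcher\nviews my own"   # the CSV stores a literal backslash-n, not a real linebreak

print(bio.replace("\n", " "))    # "\n" is a newline character, so nothing matches
print(bio.replace(r"\n", " "))   # r"\n" is the literal two characters \ and n, so it works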
Findings and VISUALIZATION!!
Smol Graph: https://twitter-graph.netlify.app/ (note: it is REALLY slow to load)
This is only with 2,000 tweets. I decided to hold out on a full-sized version until after everyone had already finished their projects so as to properly cannibalize every single last tweet we have left.
Key: white nodes are unaffiliated users with no institutions, and colored nodes are users with institutional affiliations. I tried to keep colors on brand, but there were MANY organizations represented here.
With that said, there were still some rather interesting preliminary findings:
- The vast majority of 'power users' are actually unaffiliated. Perhaps they simply don't publicize their affiliation in their Twitter bio, or perhaps they really are one-hundred-percent unaffiliated with any corporation or organization. Either way, the intermingling of unaffiliated and affiliated users was something I completely hadn't expected.
- Those with research institution affiliations were the power users. This was pretty unexpected: the accounts with the most engagement were, in fact, not the ones affiliated with corporations. This is perhaps due to the low sample size; more data would be required to get an accurate read on this network.
- Goldman Sachs had an absurdly disproportionate effect on the conversation, finding itself at the intersection of both academia and industry.
Further Questions
- Why does Goldman Sachs have such a large impact? Is this just due to the small sample size? Or is there something more going on?
- Research institutions are having a huge impact on the public narrative, with schools like NYU making a surprisingly strong showing (alongside more obvious ones like Stanford, UWaterloo, and Princeton). With more data, which institutions are actually getting the most attention? Compiling these stats with likes/followers/retweets/etc. would allow for another interesting analysis.
- Are unaffiliated users really as unaffiliated as they appear to be? Do they really have such a tangible impact on public discourse? With more data, this question would be a lot easier to ask and answer.
- How does the conversation flow? While academia does tend to connect to academia and industry tends to connect with industry, there is also overlap. Which institutions are more prone to overlap?
- Interviews would allow for a more holistic assessment of this data: directly asking users with institutional affiliations how much they overlap, what they notice about engaging in public discourse, and who they think is actually doing the engaging. Do citations overlap with Twitter 'clout'?
Conclusion
Social networks are complex. This much I thought I knew going into this, but the sheer scale of that complexity was remarkable. My initial hypothesis that major corporations completely steered the public narrative turned out to be wrong, or if not outright wrong, then far less simple than my initial thinking would suggest.
Once more, we are struck with the unfortunate reality of "we need way more data before we can realistically implement the things we want to implement." With that said, there were still some rather interesting preliminary findings that would allow for much more directed future research into how industry and academia interact.
Unaffiliated users being as popular as they were was a surprisingly welcome piece of information, and Goldman Sachs's involvement in Machine Learning Twitter was equally surprising. Hopefully Professor Kulkarni will let me consume the leftover API calls, because I really, really want to see where this data leads.