Sampling Twitter


Update: I made a mistake in gathering the samples and took this post down temporarily to fetch new data. A typo in one of my scripts radically under-counted the number of friends each account had. Mea culpa, and the lesson I learned is that if a number seems counter-intuitive, then it is likely wrong. I apologize for the mistake!



...

Most surveys of Twitter usage work by scraping the public stream of tweets. This provides reasonable data about how people are publicly using Twitter, but it suffers from sampling bias: only active public accounts are counted, while private accounts and accounts that have gone dormant are overlooked.

After Dion posted that I was the first person he subscribed to on Twitter, a conversation started in which we speculated about the distribution of used vs. unused user IDs on Twitter. Specifically, I became curious as to how Dion, who signed up for Twitter less than three months after I did, could have a user id 3.5 million higher than mine. Was Twitter really signing up users at a rate of more than one million a month this time last year? Doubtful, but the only way to find out was to gather some hard data.

Instead of mining data from the public feed, I wrote a short script to query a sample across all the possible Twitter ids. I first created a test Twitter account, which was assigned a new user ID of 18496098. Using this id as an upper bound on the population of all Twitter IDs, I selected samples at random from the range (0, 18496098) (exclusive), and queried the Twitter API at a metered rate from several machines over the course of a day.
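As a rough illustration of the sampling step (not the original script), drawing the ids could look like the Python sketch below. The function name, the sample size of 5000, and the choice to sample without replacement are assumptions made for the example; only the upper bound comes from the text above.

```python
import random

# Upper bound from the post: the id assigned to a freshly created test account.
MAX_ID = 18496098

def draw_sample_ids(k, upper=MAX_ID, seed=None):
    """Return k distinct ids drawn uniformly from the open interval (0, upper)."""
    rng = random.Random(seed)
    # range(1, upper) covers 1 .. upper - 1, so both endpoints are excluded.
    return rng.sample(range(1, upper), k)

# Hypothetical sample size; a fixed seed makes the draw repeatable.
sample_ids = draw_sample_ids(5000, seed=42)
```

Sampling uniformly over the whole id range, rather than over ids seen in the public feed, is what lets dormant and unassigned ids show up in the counts.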

After rerunning the script on queries that returned transient server-side or client-side errors ("502 Bad Gateway", "503 Service Temporarily Unavailable", etc.), I arrived at a clean, unbiased sample pool of 4414 ids.
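One way to handle those transient errors is a small retry loop. The sketch below is an assumption about how that could be done, not the script that was actually used: `fetch_with_retry` and the `fetch_status` callable are hypothetical names, and the set of transient codes lists only the two statuses named above.

```python
import time

# Only the two transient statuses named in the post; other cases are not guessed at.
TRANSIENT = {502, 503}

def fetch_with_retry(user_id, fetch_status, max_attempts=5, delay_seconds=0.0):
    """Call fetch_status(user_id) until it returns a non-transient status code.

    Returns the final status code, or None if every attempt was transient.
    """
    for _ in range(max_attempts):
        status = fetch_status(user_id)
        if status not in TRANSIENT:
            return status
        time.sleep(delay_seconds)  # back off before trying again
    return None

# Stand-in for a real HTTP call: fails twice, then succeeds.
responses = iter([503, 502, 200])
status = fetch_with_retry(12345, lambda _id: next(responses))
```

Dropping ids that only ever return transient errors, instead of counting them as unassigned, keeps server hiccups from skewing the assigned/unassigned ratio.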

Of the 4414 sampled ids, 1270 have been assigned (returned "200 OK") and 3144 are not in use (returned "404 Not Found"). Of the 3144 unassigned ids, 3120 were "Not found" (and presumably were never used), and 24 were "User has been suspended".

By this ratio, we can infer that approximately 5,000,000-5,500,000 accounts have been created since Twitter's private launch in early 2006.
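The arithmetic behind that estimate is a straight scale-up of the sample ratio to the full id range. A minimal sketch, using only the counts quoted above (the variable names are mine):

```python
SAMPLED = 4414      # ids that returned a definitive answer
ASSIGNED = 1270     # ids that returned "200 OK"
MAX_ID = 18496098   # upper bound on the id space

assigned_fraction = ASSIGNED / SAMPLED
estimated_accounts = assigned_fraction * MAX_ID

print(f"assigned fraction:  {assigned_fraction:.4f}")
print(f"estimated accounts: {estimated_accounts:,.0f}")
```

The point estimate lands at roughly 5.3 million, inside the 5,000,000-5,500,000 range given above.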

Of the 1270 sampled ids that have been assigned to a user, 847 accounts have posted at least one update, 759 are being followed, and 735 are following another user.

Further breaking down the 1270 assigned ids, we find that 635 ids are both being followed and follow someone else, 574 ids have posted a status and are being followed, and another 541 have posted a status and follow someone else. And 501 have posted a status, follow someone else, and are being followed.
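These overlapping counts can be combined by inclusion-exclusion to see how many assigned ids show no activity at all. This is a derived figure rather than one measured directly, so treat it as a sanity check on the numbers above:

```python
assigned = 1270
posted, followed, following = 847, 759, 735
followed_and_following = 635
posted_and_followed = 574
posted_and_following = 541
all_three = 501

# Inclusion-exclusion for the union of three sets.
any_activity = (posted + followed + following
                - followed_and_following
                - posted_and_followed
                - posted_and_following
                + all_three)
no_activity = assigned - any_activity

print(any_activity, no_activity)  # 1092 178
```

In other words, about 178 of the 1270 sampled accounts have never posted, follow no one, and have no followers.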

Of the 1270 ids that have been assigned, 97 have protected their status updates, and 1173 have left their status updates public (the default).

Of the 1173 public ids, 1048 of the accounts were created more than 30 days ago.

Of the 1048 more-than-30-day-old public ids, 691 have posted a status message at least once.

And of those 691 sampled public ids that are over 30 days old and have posted at least one status message, 305 of those accounts have returned to post an update more than 30 days after their account was first created.

This last metric -- users that have returned to post again more than 30 days after creating their account -- is the best metric I can come up with for a return user on Twitter. Extrapolating this ratio of return vs non-return users back across the segment of users too new to test and the private accounts, we estimate that 29.1% of assigned ids, and thus roughly 8.3% of all possible ids, are assigned to someone that has returned at least once to post something more than 30 days after their account was initially created.

Given a potential maximum population of 18,496,098, this ratio implies that there are up to 1,500,000-1,600,000 users who have returned to Twitter to post again after first creating their account, which is a respectable number and is consistent with the estimates made by others observing the pattern of public status updates.
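The chain of ratios behind the return-user estimate can be written out explicitly. The sketch below uses only counts from this post and carries the same assumption as the text: private accounts and accounts younger than 30 days are taken to return at the same rate as the older public ones.

```python
SAMPLED, ASSIGNED, MAX_ID = 4414, 1270, 18496098
OLD_PUBLIC = 1048   # public accounts created more than 30 days ago
RETURNED = 305      # of those, posted again more than 30 days after signup

return_rate = RETURNED / OLD_PUBLIC                    # share of assigned ids
share_of_all_ids = return_rate * ASSIGNED / SAMPLED    # share of the whole id space
estimated_return_users = share_of_all_ids * MAX_ID

print(f"return rate among assigned ids: {return_rate:.1%}")
print(f"share of all possible ids:      {share_of_all_ids:.1%}")
print(f"estimated return users:         {estimated_return_users:,.0f}")
```

The resulting point estimate is a little over 1.5 million, which falls within the range quoted above.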

Of the 305 return accounts, 249 are both following at least one other account and have at least one follower.

Again extrapolating for accounts too new to test and private accounts, this suggests that 23% of all assigned ids, and thus 6.8% of all potential user ids, are assigned to someone who is posting regularly, is following other users, and is being followed by at least one other user. This implies that there are up to 1,200,000-1,300,000 active, connected users on Twitter.

Of the public users sampled, 470 are followed by no one, 524 are followed by between 1 and 10 people, 159 are followed by 11 to 100 people, and 20 are followed by more than 100.

Of the public users sampled, 499 are following no one, 514 are following between 1 and 10 people, 139 are following 11 to 100 people, 20 are following 101 to 1000 people, and 1 is following more than 1000.

Of the public users sampled, 403 have no status updates, 535 have posted between 1 and 10 status updates, 141 have posted between 11-100 status updates, 87 have posted between 101-1000 status updates, 6 have posted between 1001-10000 status updates, and 1 has posted more than 10,000 status updates. (The outlier with 10000+ updates is a bot.)
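Breakdowns like the three above come from sorting each account's count into power-of-ten buckets. A minimal sketch of that bucketing follows; the function names and the example counts are made up for illustration and are not taken from the data set.

```python
from collections import Counter

# Inclusive upper bounds and labels, mirroring the ranges used in the post.
BUCKETS = [
    (0, "0"),
    (10, "1-10"),
    (100, "11-100"),
    (1000, "101-1000"),
    (10000, "1001-10000"),
]

def bucket_label(count):
    """Return the label of the first bucket whose upper bound covers count."""
    for upper, label in BUCKETS:
        if count <= upper:
            return label
    return "10000+"

def histogram(counts):
    """Tally how many counts fall into each bucket."""
    return Counter(bucket_label(c) for c in counts)

# Illustrative input only: per-account status counts for six accounts.
example = histogram([0, 3, 10, 11, 250, 20000])
```

Because the buckets are exhaustive and non-overlapping, the tallies for any one breakdown should sum to the number of accounts examined (1173 public accounts here).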

And to return to the question that started the experiment, there is indeed both an upward trend in the number of users created per month, and a sharp transition in early 2008 when huge blocks of ids no longer went unassigned between the creation of each account.

These numbers, while not perfect, should be a reasonably accurate ballpark estimate -- ±2% within 2σ over the total population, and ±3% over the population of assigned ids -- and the numbers likely wouldn't change significantly with a larger sample. However, there's always the chance of mistakes, so please feel free to download the data set to confirm. Or if you work at Twitter and would like to verify these numbers, even privately, I'd love to hear how close the sampled numbers come to reality.
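For readers who want to check the error bars, the sketch below computes a 2σ margin for a sampled proportion. It assumes a simple binomial model with the worst-case proportion of 0.5, which is one plausible way to arrive at figures like those quoted; the function name is mine.

```python
import math

def margin_of_error(n, p=0.5, z=2.0):
    """Half-width of a z-sigma interval for a proportion p estimated from n samples."""
    return z * math.sqrt(p * (1 - p) / n)

whole_sample = margin_of_error(4414)    # all sampled ids
assigned_only = margin_of_error(1270)   # assigned ids only

print(f"all sampled ids: ±{whole_sample:.1%}")
print(f"assigned ids:    ±{assigned_only:.1%}")
```

Under these assumptions the margins come out at about 1.5% and 2.8%, which round up to the ±2% and ±3% stated above. Since the margin shrinks with the square root of the sample size, quadrupling the sample would only halve it.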