Mining the Yesternet

The Internet is an immense treasure house of data for social scientists. Digital records allow them to study social interactions on an unprecedented scale. They can address such issues as the diffusion of innovation and beliefs, the self-organization of online communities, and the collective behavior of individuals.

Figure: The probability of joining a LiveJournal community as a function of the number of friends already in the community. Credit: L. Backstrom et al.

LiveJournal, for instance, is an online community with more than 10 million members. Many of these members are highly active. On any given day, as many as 300,000 people update the content of their LiveJournal Web pages.

LiveJournal lets members maintain journal pages and individual and group blogs. It also allows people to declare which other members are their friends and to which communities they belong.

Recently, computer scientists at Cornell University turned to LiveJournal data to study the evolution of communities. They wanted to gain insights into the processes by which communities come together and attract new members.

The researchers viewed the act of joining a particular group as a kind of behavior that spreads through a network—a diffusion process. They initially focused on determining how the probability of joining a group depends on the friends that someone already has in the group.

They discovered that the probability of joining rises with the number of friends already in the group but, in the long run, with diminishing returns. In other words, each additional friend in a group has a successively smaller effect.
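To make the measurement concrete, here is a minimal sketch of how such a curve might be estimated. It assumes the data can be reduced to pairs (k, joined), where k is the number of friends a person already had in a community at one point in time and joined records whether the person later became a member. The function name, the data layout, and the toy numbers are all hypothetical illustrations, not the Cornell team's actual method or data.

```python
from collections import defaultdict

def join_probability_by_friend_count(observations):
    """Estimate P(join | k friends already in the community).

    observations: iterable of (k, joined) pairs. This layout is an
    assumption made for illustration only.
    """
    totals = defaultdict(int)  # number of people observed with k friends
    joins = defaultdict(int)   # how many of those went on to join
    for k, joined in observations:
        totals[k] += 1
        if joined:
            joins[k] += 1
    # Fraction of people at each friend count who joined
    return {k: joins[k] / totals[k] for k in sorted(totals)}

# Made-up toy data, not drawn from LiveJournal
toy = (
    [(1, True)] * 1 + [(1, False)] * 9
    + [(2, True)] * 2 + [(2, False)] * 8
    + [(3, True)] * 1 + [(3, False)] * 3
)
curve = join_probability_by_friend_count(toy)
```

Plotting the resulting fractions against k would give a curve of the kind described here, with the caveat that real data would need far more observations at each value of k to be meaningful.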

This result is somewhat surprising. Theoretical diffusion models typically display a behavior in which a process increases very slowly at the start, then accelerates after it reaches a critical point, before leveling off because the population of interest is bounded. The process tracks an S-shaped curve.

In the LiveJournal case, the startup effect is apparent only for the first one or two friends already in the group. Then the diminishing-returns behavior takes over.
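The contrast between the two shapes can be seen by comparing how much each extra friend adds. The sketch below uses two stand-in curves chosen purely for illustration: a logistic function for the S-shaped case and a logarithmic function for the diminishing-returns case. Neither functional form, nor any of the parameter values, comes from the LiveJournal study, and the logarithmic stand-in ignores the brief startup effect seen for the first one or two friends.

```python
import math

def s_curve(k, midpoint=5.0, steepness=1.0):
    """Logistic curve: slow start, rapid middle, then leveling off.
    The midpoint and steepness values are arbitrary assumptions."""
    return 1.0 / (1.0 + math.exp(-steepness * (k - midpoint)))

def diminishing_returns(k, scale=0.01):
    """Logarithmic curve: steep at first, then flattening.
    The scale value is an arbitrary assumption."""
    return scale * math.log(1 + k)

def marginal_gains(curve, max_k):
    """Increase in the curve's value from adding one more friend."""
    return [curve(k + 1) - curve(k) for k in range(max_k)]

s_gains = marginal_gains(s_curve, 10)
d_gains = marginal_gains(diminishing_returns, 10)
```

For the logistic stand-in, the gain per additional friend grows until the midpoint and shrinks afterward. For the logarithmic stand-in, the gain shrinks from the very first friend onward, which is the pattern the article calls diminishing returns.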

Intriguingly, curves with diminishing returns have also popped up in other contexts. They appear in recommendation data for online purchases, in the probability of friendship as a function of shared acquaintances and classes at a college, and in joint authorship of publications among computer scientists (where co-authorship is analogous to friendship and conferences to communities).

In each of these cases, there’s a steep rise, followed by an inexorable leveling off. The barrier to getting started, however, is remarkably small.

“It is an interesting question to look for common principles underlying the similar shapes of the curves in these disparate domains,” Lars Backstrom and his Cornell colleagues note.

Such results may be relevant to the development of improved search algorithms or to making predictions about whether a new technology will fizzle or soar.

In the meantime, to create new opportunities for social science research, a Cornell team of social, computer, and information scientists has initiated what they describe as the Yesternet project.

The project draws on the Internet Archive, which consists of snapshots of the Web collected and archived every 2 months for nearly 10 years. The collection totals more than 40 billion Web pages.

The Cornell team is now copying and transforming large portions of this massive collection into a relational database that, when completed, will be available for social science research.

Researchers will initially focus on tracking the diffusion of innovation.

“We suspect the greatest potential of the Yesternet will be to suggest new questions and modes of inquiry,” William Arms and his colleagues suggest. “For example, intelligent search tools could permit researchers to discover innovations that spread the fastest or the farthest over a given time period and those that failed.”

If you wish to comment on this article, see the MathTrek blog version.