Scripting Boffo Box Office

Making and marketing a movie can cost as much as erecting a substantial skyscraper. But investments in the movie business are typically far riskier than those in real estate—or even wildcat oil-drilling.

The average Hollywood movie now costs about $60 million to make and another $30 million to distribute and market. Most movies don’t recover their costs at the box office. Movie studios manage to survive because a handful of films strike it very rich, making up for all the losses.

Despite the risks, investors continue to be lured into the movie marketplace. But they’re also looking for tools that would help them decide which films to back, namely the ones that might deliver a box-office bonanza. In response, researchers are coming up with a variety of mathematical models for predicting a movie’s chances of success and making such investment bets a bit safer.

Josh Eliashberg, a marketing professor at the Wharton School, and his colleagues have recently focused on movie scripts.

Movie producers must sift through thousands of screenplays each year to select the few that will be filmed. A lot hinges on this rather hit-or-miss process.

“Not surprisingly, the result of this process is highly unpredictable,” Eliashberg, Sam K. Hui, and Z.J. Zhang write in a recent paper. “Even the scripts for highly successful movies, such as Star Wars and Raiders of the Lost Ark, were initially bounced around at several studios before Twentieth Century Fox and Paramount, respectively, agreed to green-light them.”

To help studios and movie investors screen scripts, Eliashberg and his coworkers developed a new computer tool for forecasting the potential return on investment based only on the proposed storyline. For a given script, their tool applies natural language processing to extract key textual information and make a prediction.

“The rationale for our approach is simple,” the researchers say. “A good storyline is the foundation for a successful movie production.”

Eliashberg and his coworkers didn’t have access to movie shooting scripts in electronic form. To develop their script prediction tool, they instead worked with detailed, blow-by-blow movie summaries known as spoilers, written by viewers after they watch a movie. Each spoiler is about 4 to 20 pages long.

The researchers implemented their approach using spoilers from 281 movies, all released in the period 2001–2004. A randomly selected sample of 200 movies was used to train their system and the resulting model was then applied to the remaining 81 movies.
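The split described above is a standard hold-out test: fit the model on one random subset and judge it only on movies it has never seen. Here is a minimal sketch of such a split. The integer movie identifiers and the random seed are placeholders invented for illustration, not details from the researchers' work.

```python
import random

# Hypothetical identifiers standing in for the 281 movies (2001-2004).
movie_ids = list(range(281))

# The seed is arbitrary; it only makes this illustration repeatable.
rng = random.Random(0)

# Draw 200 movies at random for training; the other 81 are held out.
train_ids = set(rng.sample(movie_ids, 200))
test_ids = [m for m in movie_ids if m not in train_ids]

print(len(train_ids), len(test_ids))
```

Because the 81 held-out movies play no part in fitting the model, its performance on them gives a fairer picture of how it might fare on new scripts.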

In the so-called bag-of-words model in natural language processing, a document is represented entirely by the words it contains and how many times each word appears, without regarding the order in which the words appear. Such a representation allowed Eliashberg and his coworkers to pick up the themes, scenes, and emotions in a script.

“For instance, the frequent appearance of words such as ‘guns,’ ‘blood,’ ‘fight,’ ‘car crashes,’ and ‘police’ may indicate that the script contains a crime story with action sequences,” the researchers comment. “When this information is coupled with known box office receipts for the movies already made in the recent past, we would know if the movies of this type tend to sell well or not in theatres.”

But because word order really does matter (there’s a huge difference in the emotional impact on the audience between “the villain kills Superman” and “Superman kills the villain”), the researchers also incorporated domain knowledge from screenwriting experts—ending up with a list of 22 script elements that increase the likelihood of success.
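To make the bag-of-words idea and its blind spot concrete, here is a minimal sketch in Python. The tokenizing rule and the sample sentence about police and guns are assumptions made up for illustration; the researchers' actual text processing is not described in this article.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count how often each lowercase word occurs, ignoring word order."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


# An invented snippet in the spirit of a crime story with action scenes.
snippet = "Police fight the gang. Guns are drawn, and the fight ends in blood."
counts = bag_of_words(snippet)
print(counts["fight"], counts["guns"], counts["blood"])

# Word order is discarded, so these two sentences look identical.
a = bag_of_words("The villain kills Superman")
b = bag_of_words("Superman kills the villain")
same_bag = (a == b)
print(same_bag)
```

The two Superman sentences produce exactly the same counts, which is why word counts alone cannot tell who does what to whom.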

Even that wasn’t enough. Eliashberg and his coworkers also applied statistical learning techniques to uncover complex interactions among the predictors extracted from movie scripts (or spoilers).

To test their model, the researchers used it to choose 30 movies from a set of 81, aiming for a superior investment package. The resulting portfolio had a total budget of $1,044.5 million and generated gross box office revenue of $1,996.8 million, giving the studios net revenue of $1,098.2 million (5.1 percent return on investment). In comparison, randomly selected portfolios of 30 movies had an average net loss of 18.6 percent. And portfolios that replicated a typical studio’s slate returned –24.4 percent.
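The portfolio figures can be checked with a few lines of arithmetic. The sketch below assumes that return on investment means net revenue minus budget, divided by budget; the article does not spell out the formula, but this reading reproduces the reported 5.1 percent. The studio share printed at the end is simply the ratio implied by the quoted figures, not a number stated in the article.

```python
# Figures quoted above, in millions of dollars.
budget = 1044.5
gross_box_office = 1996.8
net_revenue = 1098.2

# Assumed definition: ROI = (net revenue - budget) / budget.
roi = (net_revenue - budget) / budget

# Fraction of gross box office that reaches the studios, as implied
# by the quoted figures.
studio_share = net_revenue / gross_box_office

print(f"ROI: {roi:.1%}")
print(f"Implied studio share of gross: {studio_share:.1%}")
```

On these numbers the studios keep about 55 percent of gross box office, which is why a portfolio that nearly doubles its budget in ticket sales ends up only modestly ahead.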

“Our approach always produces a significant economic gain no matter how many movies are selected for the portfolio,” Eliashberg and his coworkers conclude, “suggesting that our model is able to capture determinants from the textual information in movie scripts and hence significantly improve the studio’s profitability.”

Would such a model, if employed widely in the movie industry, stifle creativity, leading to formulaic scripts and limiting choice?

“Rather than coming out with a set of rigid rules to follow, our approach will only suggest the structural regularities that a successful script generally possesses,” Eliashberg and his coworkers reply. “We believe that there is room for creativity within the structural regularities.”

Maybe it couldn’t get much worse anyway. With an emphasis lately on sequels and on recycling old TV shows and movies, Hollywood has already gone the formula route.

If you wish to comment on this article, see the MathTrek blog version.
