The world that research scientist Angela Fan sees around her is a lot more diverse than what she sees on Wikipedia. At Meta, Angela is working on solving the problem of representation on Wikipedia using artificial intelligence.
Wikipedia is often the first stop for people looking for information about historical figures and changemakers. But not everyone is equally represented on Wikipedia. Only about 20% of biographies on the English site are about women, and we imagine that percentage is even smaller for women from intersectional groups, such as women from Africa or Asia and women who work in science.
As a part of her PhD project as a computer science student at the Université de Lorraine, CNRS, in France, Angela worked with her advisor Claire Gardent to develop a new way to address this imbalance using artificial intelligence. Together, they built an artificial intelligence system that can research and write first drafts of Wikipedia-style biographical entries. There is more work to do, but we hope this new system will one day help Wikipedia editors create many thousands of accurate, compelling biography entries for important people who are currently not on the site.
Angela is open-sourcing an end-to-end artificial intelligence model that automatically creates high-quality biographical articles about important real-world public figures. The model searches websites for relevant information and drafts a Wikipedia-style entry about that person, complete with citations.
Along with the model, they’re releasing a novel data set that was created to evaluate model performance on 1,527 biographies of women from marginalised groups. This data set can be used to train models, evaluate performance, and push the model forward. Angela believes these artificial intelligence-generated entries can be used as a starting point for people writing Wikipedia content and fact-checkers to publish more biographies of underrepresented groups on the site.
How the Model Works
The model works in three steps. First, it retrieves relevant information from the internet to introduce the subject. Next, the generation module creates the text. Finally, the citation module builds the bibliography, linking back to the sources that were used. The process then repeats, with each section predicting the next, covering all the elements that make up a robust Wikipedia biography, including the subject’s early life, education and career.
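The loop described above can be sketched in a few lines of Python. This is only a toy illustration of the retrieve, generate, cite cycle, not the released model: the word-overlap retriever, the template-based `generate` function, the fixed `NEXT_HEADING` table (the real system predicts the next section itself), the fictional subject and the example.org URLs are all assumptions made for the sake of the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Section:
    heading: str
    text: str
    citations: List[str]


# Hypothetical stand-in for "each section predicting the next".
NEXT_HEADING: Dict[Optional[str], Optional[str]] = {
    None: "Early life",
    "Early life": "Education",
    "Education": "Career",
    "Career": None,
}

# Tiny in-memory "web" about a fictional person (illustrative data only).
CORPUS = [
    {"url": "https://example.org/a", "text": "Jane Example spent her early life in a small town"},
    {"url": "https://example.org/b", "text": "Jane Example completed her education in physics"},
    {"url": "https://example.org/c", "text": "Jane Example built a career in research"},
]


def retrieve(subject: str, heading: str, corpus: List[dict], k: int = 2) -> List[dict]:
    """Toy retrieval: rank documents by word overlap with subject + heading."""
    query = set(f"{subject} {heading}".lower().split())
    scored = [(len(query & set(doc["text"].lower().split())), doc) for doc in corpus]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable sort keeps corpus order on ties
    return [doc for _, doc in scored[:k]]


def generate(subject: str, heading: str, evidence: List[dict]) -> str:
    """Placeholder generator: stitches evidence together instead of calling a language model."""
    return f"{heading} of {subject}: " + " ".join(doc["text"] + "." for doc in evidence)


def cite(evidence: List[dict]) -> List[str]:
    """Build the bibliography from the sources that were actually retrieved."""
    return [doc["url"] for doc in evidence]


def write_biography(subject: str, corpus: List[dict]) -> List[Section]:
    sections = []
    heading = NEXT_HEADING[None]
    while heading is not None:
        evidence = retrieve(subject, heading, corpus)   # step 1: retrieval
        text = generate(subject, heading, evidence)     # step 2: generation
        sections.append(Section(heading, text, cite(evidence)))  # step 3: citation
        heading = NEXT_HEADING[heading]                 # move on to the next section
    return sections


if __name__ == "__main__":
    for section in write_biography("Jane Example", CORPUS):
        print(section.heading, section.citations)
```

Keeping the citation step tied to the retrieved evidence, rather than to the generated text, means every reference in the draft points at a source the system actually read, which is the property a human editor or fact-checker would want to verify first.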
Spotlighting More Underrepresented People on Wikipedia
Angela’s model addresses just one piece of a multifaceted problem. Some sources have a bias that must be considered. For example, when women are represented, their biographies are more likely to include extra details about their personal lives. A 2015 study found the word “divorced” appears four times as often in women’s biographies as it does in biographies of men. As a result, personal details end up being more likely to be mentioned in articles about women, distracting from accomplishments that should be in the spotlight and celebrated.
Wikipedia’s former chief executive explained how an algorithm also found an important mistake on the site. While Wikipedia health articles are vetted by medical editors, for years some articles on critical women’s health issues, such as breastfeeding, were labeled “low importance.”
There is even more work to be done for other marginalised and intersectional groups around the world and across languages. Our evaluation and data set focus on women, which excludes many other groups, including nonbinary people. According to a 2021 study on social biases on Wikipedia, articles about transgender and nonbinary people tend to be longer, but much of the additional space is devoted to their personal lives instead of expanding on their accomplishments. It’s important to recognise that bias exists in varying forms, especially in default online sources of information.
We hope that our techniques can eventually be used as a starting point for human Wikipedia writers and ultimately lead to more equitable availability of information online, for students writing biographies and for many others.