Paving the Way for Semi-Automated Privacy Policy Summarization-Institute for Software Research - Carnegie Mellon University

Thursday, March 10, 2016

Paving the Way for Semi-Automated Privacy Policy Summarization

Few people read privacy policies. Studies have projected that it would take an average user over 600 hours to read every privacy policy associated with every website they visited in one year. However, work conducted over the past two years by researchers at Carnegie Mellon University, Fordham University and Stanford University is paving the way to a day when technology might be able to provide users with short summaries of privacy policies.

As part of an effort to share early results, the Usable Privacy Policy project just released a website that enables visitors to navigate over 23,000 privacy policy annotations covering 193 websites. The project leverages crowdsourcing, machine learning and natural language processing to semi-automatically annotate privacy policies, extracting relevant statements from the often long and rather convoluted policies found on many websites and mobile apps today.

“This is the first site to provide analysis of privacy policies at this scale,” says School of Computer Science Prof. Norman Sadeh, Lead Principal Investigator of the study and researcher in CyLab, Carnegie Mellon’s security and privacy institute. “Our objective is to produce succinct yet informative summaries that can be included in browser plug-ins or interactively conveyed to users by privacy assistants that inform users about salient privacy practices.”

In its current form, the Usable Privacy Policy website features interactive functionality that allows lay users to explore the content of a number of privacy policies. Color codes help users select from a menu of privacy practices that might interest them. For instance, a user interested in learning more about the data collected by a given site can select “first party collection practices” and all statements identified in the policy about data collection will be highlighted. Similarly, users can click the “third party sharing practices” option and see a display of statements made by the site about the different entities with which it shares user data. The interactive tool covers a wide range of practices, including whether the site provides opt-out or opt-in choices to users, whether it discloses its retention policy, whether it includes statements about “Do Not Track,” as mandated by California law (CalOPPA), and much more.

“While navigating our site, people will notice how complex and fragmented many privacy policies are,” says Sadeh. “The vast majority of statements are about first-party collection and third-party sharing and contain significant levels of ambiguity when it comes to determining exactly what is being collected and with whom it is shared.”

The tool also assigns each privacy policy a reading grade level based on its language. Google’s privacy policy, for example, is written at a grade 13 (college) reading level. The privacy policy for Playstation.com, a site with a presumably large population of children and teen visitors, is written at a grade 17 (college graduate) level, according to the tool.
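The article does not say which readability formula the tool uses. As an illustration only, the sketch below computes the widely used Flesch-Kincaid grade level, which maps average sentence length and average syllables per word onto a U.S. school grade. The function names and the vowel-group syllable heuristic are assumptions made for this example, not part of the Usable Privacy Policy project's code.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): one syllable per run of vowels, minimum one."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(text: str) -> float:
    """Estimate the Flesch-Kincaid grade level of a passage of English text."""
    # Split on sentence-ending punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    # Standard Flesch-Kincaid grade-level formula.
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)


if __name__ == "__main__":
    plain = "We use your data. We share it."
    legalese = ("We may disclose personally identifiable information to "
                "unaffiliated third parties for promotional purposes.")
    print(round(flesch_kincaid_grade(plain), 1))
    print(round(flesch_kincaid_grade(legalese), 1))
```

Longer sentences and longer words both push the score up, which is why dense legal phrasing can land well above a high-school reading level. The syllable heuristic is crude, so scores from this sketch should be read as rough estimates.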

“Color codes also make it clear that privacy policies tend to mix a variety of different statements in the same paragraph, often requiring the reader to read large portions of the policy, if not the entire policy, before hoping to be able to answer simple questions,” added Prof. Joel Reidenberg, the Fordham Principal Investigator on the project and director of the Fordham Center on Law and Information Policy. “Many sites hardly provide users with any real choices. Most policies that mention ‘Do Not Track’ do so by simply indicating that they do not handle Do Not Track requests – the bare minimum required under CalOPPA.”

While the annotations on the website were crowdsourced from law students at Fordham University, the researchers say they’re working towards automation.

“We are now using machine learning and natural language processing to semi-automate, and hopefully one day fully automate, the analysis of privacy policies,” says Sadeh.

The Usable Privacy Policy project is supported by a grant from the National Science Foundation. The website design team also included Institute for Software Research post-doctoral fellows Dr. Mads Schaarup Andersen, Dr. Florian Schaub and Dr. Shomir Wilson; Language Technologies Institute graduate student Aswarth Dara; and computer science freshman Sushain Cherivirala.

Original article. Learn more about the Usable Privacy Policy project and CyLab.

By: Daniel Tkacik