How Big Data’s Inaccuracy Hurts People

Forget Edward Snowden and government spying. Law professor Frank Pasquale examines today’s widespread and unfettered data collection for profit in The Black Box Society: The Secret Algorithms That Control Money and Information, scheduled for release Nov. 17. He raises questions about what can happen to people, companies and economies when some of that data turns out to be incorrect. Should the subjects under the data microscope be allowed to face their accusers, so to speak? Pasquale talks to FleishmanHillard TRUE about a few reforms that might rein in some of the worst abuses.

TRUE: What do you mean when you talk about a black box society?

Pasquale: Engineers use the term black box to describe a system where we can watch inputs going in and see outputs coming out, but in between there is some opaque, intangible process that transforms the inputs into outputs under a veil of secrecy. Google sometimes refers to its black box algorithm as the secret sauce that gives it a competitive advantage. Businesses are increasingly relying on this secrecy strategy when it comes to data, using the rationalization that people would game the system if they knew how it worked. So that’s the most obvious explanation of the term black box.

But when I came up with the title, I was also thinking about the black box on airplanes that most often comes into play after a crash. That black box records everything going on in the plane, monitoring something like 30,000 variables. If you look at the Internet of Things, big data and the ubiquity of sensor networks, it’s almost as if we are all being monitored the way that plane is. We all have our own black boxes. We expect our Internet use to be tracked, sliced and diced, and monitored, but now with the Internet of Things, even our real space beyond the Internet is being monitored as well.

That’s why when there is a data breach — if you think about data collection in this way — you see the potential enormity of the damage. Think about the dark markets for data, where filenames, passwords, usernames, Social Security numbers, credit card numbers and even medical records are sold. The potential of data collection extends far beyond mere personalization of marketing messages.

TRUE: What kind of data do you consider the most potentially harmful because of the lack of transparency?

Pasquale: It’s hard to say what’s most potentially dangerous. But I worry about the dissemination of personal health data. People need to be aware that health profiles of them are being compiled that have no protection under HIPAA. When you see a doctor or go to a hospital, you have the right to check the data collected to make sure it’s accurate and to know how it’s being used. But for anything not covered by HIPAA — and that’s virtually anything not collected by a health provider — there’s no regulation. Detailed profiles can be put together based on the sites you visit and the searches you perform. Most of this is used for marketing purposes, but it is filtering into other uses. What if prospective employers have access to lists of possible diabetics or people who are depressed or alcoholics? There are laws that restrict that. But in the world of big data in which we live, it’s impossible for people to find out everything an employer is looking at. Even the employer may not know what they are looking at. Some data companies provide scores on people, and users don’t always know what’s included in those scores or what’s influencing them.

For instance, if an employer tells you he is not hiring you because you’re a diabetic, that’s clearly illegal. But what if a scoring entity puts out a number about your robustness as an employee, or your potential costs to the company, and secretly includes data about a possible diagnosis? That’s problematic and almost impossible to prove, because employees almost never know what goes into an employer’s decision not to interview someone or not to give them a job, and the employer doesn’t know what goes into the score. The Equal Employment Opportunity Commission is in the process of considering disputes stemming from employer personality tests with questions that seem to be looking for patterns of thought connected to mental illnesses and are unrelated to the job. It’s an emerging area.

Now go one step further and ask what happens if the information the scoring entity is using is wrong. For instance, let’s say they have concluded you are suffering from depression based on Google searches you have run, sites you have visited or information you have requested. Those may have been for a friend or a colleague, or maybe just research. They don’t have access to actual health data because of HIPAA; they are only inferring it from your online behavior.

TRUE: What made you begin to look into these potentially negative uses of data?

Pasquale: In 2010, I was writing about lawyers who were looking for legal ways for Google to more easily collect, archive, and rank and rate Internet sites. I shared their enthusiasm, but at the same time I was seeing stories about people who felt they were being treated unfairly in searches. Some complained that they were ranked too low, but oftentimes people complained because the most embarrassing thing about them, or even lies about them, was the top result. People who found themselves in that situation were told, “Well, you just have to get yourself out there a little bit more.” But the real problem was that it was very unclear to most people how Google was ranking and rating sites. It was a little like credit scoring: a way of sorting people that often has mysterious results. Back in the ’60s, credit bureaus were keeping very creepy information on people. They were paying people to watch households and report whether the yards were messy or the men of the households used feminine gestures. In the 1970s, the Fair Credit Reporting Act began to regulate those activities. Today, the difference between credit scoring and this kind of data collection is that with credit scoring, you can demand to see what goes into your score. The collection of very sensitive information about people that the law tried to put a stop to in the 1970s is coming back because of big data.

TRUE: In a recent New York Times op-ed, you made the point that a lot of the data behind these lists and that goes into scoring is incorrect. Why is that the case?

Pasquale: There are a couple of reasons for that. The most important is probably the fact that data collection doesn’t need to be perfect or entirely accurate to improve targeting.

If I get a list and am paying only a penny a name, then even if only 70 percent of the names and information match up, I’m still doing much better than I would have knowing nothing about the people on it. And that’s really the promise of big data — that even with so-called dirty data, with some inaccuracies, you’re still getting a basically accurate picture. So there’s no downside for the person using the list.
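To make that arithmetic concrete, here is a minimal sketch of the trade-off Pasquale is describing. Only the penny-a-name price and the 70 percent match rate come from the interview; the response rates and the revenue figure are hypothetical numbers chosen purely for illustration.

    # Rough sketch of the "dirty data still pays" arithmetic.
    # Only the penny-per-name price and the 70 percent match rate come from
    # the interview; every other number is an assumption for illustration.

    names_on_list = 100_000
    cost_per_name = 0.01       # from the interview: a penny a name
    match_rate = 0.70          # from the interview: 70 percent of records are accurate

    baseline_response = 0.01   # assumed response rate for untargeted outreach
    targeted_response = 0.05   # assumed response rate when the profile is actually right
    value_per_response = 20.0  # assumed revenue per response

    # Mismatched records are assumed to perform no better than untargeted outreach.
    effective_response = match_rate * targeted_response + (1 - match_rate) * baseline_response

    list_cost = names_on_list * cost_per_name                                      # $1,000
    revenue_with_list = names_on_list * effective_response * value_per_response    # $76,000
    revenue_without_list = names_on_list * baseline_response * value_per_response  # $20,000

    print(list_cost, revenue_with_list, revenue_without_list)

Under these assumed numbers, even with 30 percent of the records wrong, the list buyer comes out far ahead of untargeted outreach, which is the sense in which dirty data has no downside for its user.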

TRUE: Is there a way the government can force the list aggregators to be accurate?

Pasquale: I’m not sure the government is going to want to force 100 percent accuracy. I think it can make the aggregators register their lists and then alert people when they get put on one. As with credit reporting, people should have the ability to contest data associated with their name. So transparency is definitely the first step, but people must then also be given the right to dispute and correct the information and to know who sees it.

TRUE: You also have talked about how these lists are used by employers. Could you discuss briefly the problems you see with this practice?

Pasquale: There’s a list out there called the theft list or the fraud list or something like that. It tracks people who have been accused of stealing by their employer. It doesn’t indicate that they have been convicted in court, and many are unlikely to have gone through any kind of official process. It is purely an accusation. Maybe they had to sign something saying they stole in order to keep their job, but the point is they may have done nothing wrong. Suddenly, their names show up on one of these blacklists, and it will likely be very hard, if not impossible, for them to get another retail job. I’ve read that it’s harder to get a job at Wal-Mart than to get into Harvard. There are just so many people applying, and if Wal-Mart has 20 applicants for a cashier’s job and three of them are on one of these lists, well, it’s one way of narrowing down the selection, and you still have 17 people to choose from.

TRUE: It’s a little like McCarthyism, when people were being accused of being communists because someone didn’t like the way they looked or the books they read or the fact they were gay.

Pasquale: That’s a brilliant comparison. That’s exactly right.

TRUE: So where are we in the evolution of big data? Have we reached a tipping point, and people and government will start pushing back?

Pasquale: There’s a growing concern that, left unchecked, we’re headed for pervasive 24/7 surveillance of every moment of our lives. But I don’t think the government or even the experts have fully grasped the extent of the problem. They’re still stuck in the mindset that says people should defend themselves from surveillance and defend their own privacy. But that’s unrealistic. It’s unrealistic to expect most people to encrypt their own iPhones. That just isn’t the solution, and although we’re seeing a little more activity from the states trying to limit data collection here and there, we’re pretty far from a comprehensive response to what’s happening.

About the author

Frank Pasquale is a professor of law at the University of Maryland’s Francis King Carey School of Law, an Affiliate Fellow at Yale Law School’s Information Society Project, and a member of the Council for Big Data, Ethics, and Society. His research focuses on the challenges that rapidly changing technology poses for information law. In his latest book, Pasquale analyzes the threat posed by the lack of transparency as companies and governments seek to satisfy an ever-expanding desire to connect society’s data dots.