AI and the Deliberate Faking of Data


By Adrian Zidaritz

Original: 02/09/20
Revised: no

Data is the most important component of an AI system; if the data does not correctly reflect the domain of interest, the AI system will make useless or even dangerous inferences about that domain. The issues surrounding the veracity of data are therefore central. Some of these issues revolve around the origin of the data (its provenance, to use the technical term), others around the context of the data (how it is being used), and yet others involve the security and integrity of the data as it is stored and moved around. AI systems are software programs that are usually part of larger systems tasked with certain business objectives. Software vulnerabilities can be found and exploited in almost all software systems, but systems built around AI have an added vulnerability because of their dependency on data. The online social media systems (Facebook, Twitter, etc.) are particularly vulnerable, because the provenance, context, and integrity of the data on which they are based are often extremely difficult to assess. And these social media systems, as well as the more general AWI systems, are the systems we are mostly interested in.




Although we will look at various solutions to improve the veracity of data, one main theme we have already stressed throughout is that, as AI intrudes more into human lives, coexisting with it will place increased demands on critical thinking, in other words on the human critical evaluation of the data that these systems use and produce. You probably noticed in the clip above the reference to a gaming app on cell phones that would allow people to train their critical thinking. In the context of this website, and especially as it relates directly to the central X-Quartet of (critical thinking, character fortification, truth in data, formally specified morality), it is hard to think of more worthy apps to develop. When such apps are gamified (turned into games), people can have some fun while using them; and if they have fun, they will come back and play them again. These would be dream apps for young data scientists to work on, combining fun with a very useful goal, especially when used from within social media, where critical thinking is most needed.

Social media has experienced from the very beginning a conflict between the producers of fake data and the scientists and engineers employed by the social media companies, who try to detect and prevent this fake data. There is a clear distinction between detection/prevention after the data has been used or viewed and detection/prevention taking place before. For example, if a violent video is posted and viewed by many, the damage has already been done; one cannot erase the imprint such a video leaves in one's memory. The posting of fake data is quite similar to cyber attacks against institutions, especially financial institutions. Just as cybersecurity professionals continuously develop signatures of known attacks, so do the professionals working in social media companies. These signatures increase the defenses against future actions, but only when the signature of a future action matches one of the signatures stored in the database.
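To make the signature idea concrete, here is a minimal sketch in Python of how such a database might be consulted before content goes live. The digest shown is a placeholder, and real platforms typically use perceptual hashes that survive small edits, but the matching logic is the same: a posting is blocked only if its signature is already known.

```python
import hashlib

# Toy signature database: digests of content already flagged as abusive
# by human reviewers (the digest below is a placeholder, not real data).
KNOWN_BAD_SIGNATURES = {
    "3a7bd3e2360a3d29eea436fcfb7e44c735d117c42d1c1835420b6b9942dd4f1b",
}

def signature(content: bytes) -> str:
    """Compute an exact-match signature for a piece of content."""
    return hashlib.sha256(content).hexdigest()

def is_known_bad(content: bytes) -> bool:
    """Block only if the signature is already in the database; novel
    content (the 'zero-day' case discussed next) sails through."""
    return signature(content) in KNOWN_BAD_SIGNATURES

print(is_known_bad(b"some re-uploaded video bytes"))  # False unless seen before
```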

However, detecting a new type of cyber attack (or a new type of violence within a video) whose signature is not known in advance is very difficult. AI algorithms, especially the unsupervised type, are increasingly used for the detection of these so-called zero-day attacks. In the case of cyber attacks, banks often prefer to pay the attackers rather than report the attack; at other times hackers are paid to help plug those vulnerabilities. Of course, stronger login measures could be taken, but all these measures bump against the security-versus-convenience trade-off: if you increase one, you decrease the other. This trade-off also occurs in the next video in the context of social media, although it comes under the name rules-versus-usability trade-off. But apart from showcasing this trade-off, the video will establish a framework for our discussion and will introduce some terminology we need. Although the focus is on Twitter, the same ideas apply to all the social media platforms. It is half an hour long, but it is worth it, saving us countless justifications for what we have to say below.
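As a hedged illustration of that unsupervised approach (a sketch, not the method of any particular vendor), here is what zero-day detection can look like with scikit-learn's IsolationForest: the model is fit only on normal traffic and flags anything that does not resemble it. The features and numbers below are purely synthetic.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Toy feature matrix: one row per network session, with columns such as
# bytes sent, bytes received, duration, distinct ports touched.
rng = np.random.default_rng(0)
normal = rng.normal(loc=[500, 800, 30, 2], scale=[50, 80, 5, 1], size=(1000, 4))
zero_day = np.array([[50_000, 100, 2, 40]])  # nothing like the training data

# No signatures, no labels: the model learns what "normal" looks like
# and scores deviations from it.
model = IsolationForest(contamination=0.01, random_state=0).fit(normal)
print(model.predict(zero_day))  # -1 means "anomaly": flagged without a signature
```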




We have mentioned above the worth of developing gaming apps to improve our critical thinking. We have also mentioned the increasing use of AI algorithms to detect fake data. We give the following example for two reasons. First, it is simple enough to be understood by a large number of readers; we covered all the needed background in the section on NLP in the Main AI Concepts article. As you watch the video, notice that the components of the analyzed text become vectors, and transforming the problem into linear algebra is the method of solving it. Second, the presenter makes a good pitch on how to get a job as a starting data scientist, which should increase the reader's familiarity with our subject. The passion and the hands-on experience behind the presentation are more convincing than a dry resume.
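For readers who prefer to see the idea in code rather than in the video, here is a minimal sketch of the text-to-vectors pipeline, using scikit-learn. The four documents and their labels are toy placeholders, not real training data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus; a real pipeline would train on thousands of labeled articles.
texts = [
    "scientists publish peer-reviewed study on vaccine safety",
    "shocking secret cure they do not want you to know",
    "city council approves new budget after public hearing",
    "miracle pill melts fat overnight doctors hate this trick",
]
labels = [0, 1, 0, 1]  # 0 = credible, 1 = fake/clickbait

# Each document becomes a TF-IDF vector; classification then reduces to
# linear algebra (a dot product against the learned weight vector).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
clf = LogisticRegression().fit(X, labels)

test = vectorizer.transform(["secret trick the government is hiding"])
print(clf.predict(test))  # illustrative only: four documents is far too few
```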




Part of our critical thinking is the ability to figure out whether a message/text/posting/video comes from a real account or a fake one. Fake accounts are abundant, many of them created automatically. Much of the U.S. election interference, past, present, and future, is and will most likely continue to be carried out through fake accounts. The techniques used to create fake accounts are increasing in sophistication, and so are the methods used to detect them. Both sides, the attackers and the defenders, are using AI methods. Here is an example of the use of AI unsupervised learning, this time for the detection of fake profiles. The fact that social media allows the creation of these fake accounts is another aspect of that security-versus-convenience trade-off. Currently, social media companies want to make it as easy as possible to sign up, but that ease will certainly go through some revisions.
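Here is one plausible shape such an unsupervised detector could take, sketched with scikit-learn's DBSCAN. The account features and numbers are invented for illustration; the premise (hedged, but commonly reported) is that automated accounts registered in batches have suspiciously similar statistics.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Toy per-account features: account age (days), posts per day,
# followers/following ratio. Synthetic, for illustration only.
rng = np.random.default_rng(1)
humans = rng.normal(loc=[900, 1.5, 1.2], scale=[400, 1.0, 0.8], size=(300, 3))
bots = rng.normal(loc=[5, 80, 0.01], scale=[1, 2, 0.005], size=(40, 3))
X = StandardScaler().fit_transform(np.vstack([humans, bots]))

# Bot farms create accounts in batches, so their feature vectors tend to
# form an unusually tight clump that density-based clustering isolates
# without any labels.
clustering = DBSCAN(eps=0.5, min_samples=10).fit(X)
print(set(clustering.labels_[-40:]))  # the bot block typically shares one label
```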




Numbers do not lie, so goes the saying. In general, information presented to us as numbers, as opposed to opinion statements, usually garners a higher degree of trust. But it turns out that numerical data sets can also be presented in ways that deceive. The numbers do not have to be modified to lie; the presentation and the intent behind them are sufficient to completely change the meaning of the data set. You can take a spreadsheet and present its statistics in ways that do not convey the true content of the data. Even graphs can be used to mislead, again without modifying their underlying data:
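A minimal sketch of the trick, using matplotlib: the two charts below plot exactly the same two numbers, and only the y-axis range changes.

```python
import matplotlib.pyplot as plt

# Identical data, two presentations. Truncating the y-axis makes a
# 1.5-point difference look like a landslide without touching the numbers.
categories = ["Option A", "Option B"]
values = [50.5, 49.0]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.bar(categories, values)
ax1.set_ylim(48.5, 51)   # misleading: the axis starts just below the minimum
ax1.set_title("Looks decisive")
ax2.bar(categories, values)
ax2.set_ylim(0, 100)     # honest: the axis starts at zero
ax2.set_title("Looks like a toss-up")
plt.tight_layout()
plt.show()
```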






Truth and the Politics Around Social Media


Politicians wish for and expect a certain degree of responsible curation of the data shown on social media, while the technical staff behind the AI algorithms have the unenviable task of explaining why this curation is so hard to accomplish without descending into censorship. The problem is that both sides are correct, and that is the problem we all have to work on. Two well-articulated positions in the clip below show the problem social media has with truthful data. Congresswoman Alexandria Ocasio-Cortez asks how far a politician or political group could push the truth on Facebook before Facebook's fact-checking algorithms sniff it out, a very pointed question going right to the crux of the matter. Facebook is most likely doing the best it can to root out certain types of violent content, incitement to physical harm, voter suppression, and so on. But it is a challenge to explain how hard Facebook's task is. If they take down too much, they get into even hotter water. It is one of the most vexing problems with social media, the magnitude of which we could not have guessed when it all started. Contrary to what AOC wants, these are far from yes-no situations.




It is not just fake data that is the problem. The data could be truthful, with the intention behind posting it being the problem. We will see that truth in many social situations has many layers attached to it, going all the way from the basic fact underlying it to the moral intention behind showing it. If the intention is to manipulate, deceive, or hurt, that intention is very difficult to detect. The problem is that facts presented on social media have immediate and wide visibility; using our analogy with cyber attacks against banks again: banks can reduce the visibility of attacks and not mention them to the public.




Finally, are the social media companies politically biased? In other words, are they favoring liberal or conservative causes? We are not talking here about bias in the data that they use; we are talking about bias designed into the AI algorithms, which is the public perception. It is a bit puzzling why these companies cannot provide an unequivocal NO, because that is indeed the case. And it is the case not because of some moral high ground that they are obliged to occupy, but first because there would be no business gain in doing such a thing. Second, it would be expensive and difficult to accomplish that bias through AI algorithms without being obvious. Third, it would be humanly impossible to muzzle all the scientists and engineers who would be working on such bias. Even if bias were attempted, it would be quickly discovered, and the stock would suffer; if you have ever had lunch at a Silicon Valley restaurant, you will understand what a sin that would be.

The popular perception that these high-tech giants could sacrifice good business practices on the altar of political agendas, and could do a lot more than they are actually able to do, serves no good purpose: not for them, not for us, and not for any potential regulatory legislation. In the case of social media, the success has been far more spectacular than the original plans called for; no one dreamed of the many-headed monster that would be created when many human (and increasingly non-human) heads are allowed to build communities together, from scratch, without many laws binding them.

The President attacks the tech industry, accusing it of bias against conservatives and specifically of filtering out information supportive of conservative causes. That is not the case, and it cannot be the case, for strictly business reasons. It is however true that the accelerated polarization that started in 2016 is also happening in Silicon Valley, and that many tech people who used to profess political apathy are now actively political. While this new activism is not necessarily bad, it exacerbates the disconnect between the high-tech industry and the Administration, at exactly the wrong time, when we need the two to come together and set our future direction in the competition for AI. The attacks on high tech, as exemplified in the following video, show an extraordinary mismatch of perception between the two sides, which must somehow be addressed. How are we supposed to be guided to a destination by Google Maps if the app does not know our location? How are we supposed to send emails through Gmail without having those emails stored somewhere for replies and future reference? How can a personal assistant spring into action if it does not listen to your voice, waiting for commands? The regulatory oversight needs to focus on the real problem, which is not the fact that our private data can be used, but rather what the laws governing that use should be. There is no such thing as privacy when you connect to the Internet.






The Negatives


A 2014 study showed that half of the American public believes in at least one conspiracy theory. It was also thought that conservatives were more receptive to conspiracy theories. Indeed, the most spectacular conspiracies have been spread in conservative circles. But the merit of generalizing that tendency into a theory is not clear, as the more recent liberal belief that there was coordination between the Trump campaign and Russia during the 2016 election cycle shows. The magnitude of the shock of the election result demanded a large (however unlikely) explanation, although to many of us such a possibility looked quite small from the beginning, for simple human reasons: first, it would have been a risk hard to contemplate, and second, campaigns are messy affairs, and keeping a serious coordination secret, not just isolated meetings, would have been close to impossible. It is also unlikely that the President's taxes contain unlawful maneuvers. Creative accounting? Probably. But unlawful accounting, probably not. There are good explanations for why people are drawn to conspiracy theories. First, the proportionality bias principle from psychology states that humans tend to believe that large events must have large causes. So JFK could not have been murdered by a lone gunman, and 9/11 could not have been the work of 19 ordinary terrorists, despite what all the evidence pointed to. Second, when faced with extraordinary events, the limbic system kicks into a frantic search for patterns, and we will see below that patterns are easy to find in large data sets.

Many seemingly nefarious aspects of our government appear to be the result of careful designs by groups of people with secret agendas. The existence of such secret agendas would be very relevant to our discussion of AI, because there could not be a more useful technology to such groups than AI. The "deep state", does it exist? The famous Area 51 in Nevada, does it hide something? Is the Bilderberg Group a shadow world government? As we will see later, extraordinary beliefs based on extraordinary conspiracies would crush our AI systems if fed to them. An AI system would detect the "extraordinary-ness" of that data and assign it undeserved importance. If half of the digital twins in the national graph believe in statement A, and A is a weighty proposition, how are the graph algorithms or the AI algorithms working on that graph supposed to treat A? One can see why the critical thinking of the X-Quartet is so important.

Let's look closer at that most conspiratorial term of all, the deep state, because understanding the need for this concept will help with the debunking of all the other theories. In the broadest interpretation of the term, the deep state is the massive bureaucracy formed by the entire federal government, about 2 million people: all the career civil servants of the federal government would be part of the deep state (notice that "career" means non-political). They must follow the law, not the particular political wishes of an essentially political Administration. These bureaucrats usually stay in the same positions for many years, even as Administrations change every 4 or 8 years. This was done on purpose, and for very good reasons, by the Pendleton Civil Service Act of 1883. The Pendleton Act set into law the mechanism of permanent federal employment based on merit rather than on political party affiliation. It has been a very successful mechanism. But naturally, such a massive bureaucracy moves at a speed far lower than the political wishes of any new Administration, be it Democrat or Republican. So there has always been and always will be some tension between the political leaders of various sections of the government, appointed by a new Administration, and the longer-term bureaucrats. But to assign darker motives and a grand scheme by which this bureaucracy moves slowly, or even purposely against any Administration, is a big stretch. The inertia of such a large mass of people is a much more probable explanation.

OK, but what about the intelligence community, which is part of this bureaucracy? The intelligence community is made up of no fewer than 16 organizations (17 if one adds the Office of the Director of National Intelligence, which oversees the other 16), but we will focus on the agencies only, and of these agencies only on the CIA, the NSA, and the FBI:




With these agencies the situation is touchier. There are many reasons for this, but the main one is that by definition the operation of these organizations is secret; even the breakdown of their budgets is not publicly known. You saw in the video that the so-called "black budget" is not at present broken down into its various assignments. Their operation must of course be secret, but this secrecy is the source of much speculation, and it is also used to justify various conspiracies. The fact is that the history of these agencies has been murky and controversial, and there is no doubt that, for good reasons, the American public (and this includes the President) is often suspicious of their operations, both at home and abroad. Are they working against the President at this point in time?




After watching the video, it is tempting to answer yes to that question about these agencies working against the President, and also to heighten the worry that their surveillance of all of us has reached unacceptable levels. Both of these issues, their would-be unlawful opposition to the current Administration and their "rampant surveillance" of our citizens, need evaluation. We looked at surveillance in the previous article, AI and Liberal Democracy. We will analyze it again in the article AI: Most Powerful Weapon Ever, because that article offers a better context. Here we tackle the issue of deception, and we'll look at some general laws to guide us while we sift through worries, lies, and conspiracies. But maybe it's time for a deep breath:




Total disorder is impossible.
- Theodore Motzkin, as quoted in the book "Ramsey Theory" by Ronald Graham, Bruce Rothschild, and Joel Spencer

Let's look at a mathematical theory which helps explain the existence of beliefs in conspiracies. Living in a seemingly random world, humans like to find some order, and will go to great lengths to find that order, including finding patterns in text or shapes in the night sky. The mathematical theory explaining why that search for order is so often successful is called Ramsey theory. Frank Ramsey proved a theorem in combinatorics, a theorem about graphs actually, and that theorem started the theory.
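The classic instance of Ramsey's theorem says that among any six people, some three are mutual acquaintances or some three are mutual strangers: in graph terms, every 2-coloring of the edges of the complete graph K6 contains a monochromatic triangle, while K5 admits a coloring that avoids one. The brute-force check below verifies both claims; it is a toy demonstration of why patterns are unavoidable, not an efficient algorithm.

```python
from itertools import combinations, product

def has_mono_triangle(n, coloring):
    """coloring maps each edge (i, j), i < j, of K_n to color 0 or 1."""
    return any(
        coloring[(a, b)] == coloring[(a, c)] == coloring[(b, c)]
        for a, b, c in combinations(range(n), 3)
    )

def ramsey_check(n):
    """True if EVERY 2-coloring of K_n's edges has a one-color triangle."""
    edges = list(combinations(range(n), 2))
    return all(
        has_mono_triangle(n, dict(zip(edges, colors)))
        for colors in product([0, 1], repeat=len(edges))
    )

print(ramsey_check(5))  # False: 5 points can avoid a one-color triangle
print(ramsey_check(6))  # True: with 6 points the pattern is unavoidable
```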




The rise of these conspiracy theories and of the accompanying "alternative realities" is a dangerous development in the presence of AI systems. These systems optimize on the data given to them, and they assign higher priority to more recent data. Falsifying this more recent data has negative outcomes, because AI systems are not yet able to distinguish between actual reality and the other "realities". According to a September 2019 poll titled "Americans' Trust in Mass Media Edges Down to 41%", that trust now stands at 41%, whereas in 1976 it was at 72%. Repeating a lie, or even criticizing it, or even trying to prove that it is a lie, will actually reinforce, among those who are predisposed to accept the lie, the belief that the lie is the truth; the more it is heard, regardless of positive or negative context, the closer to being accepted it becomes.
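To see why recency matters so much, here is a hedged sketch of one common weighting scheme, exponential time decay; the half-life value is an arbitrary assumption for illustration, not a quote from any production system.

```python
import numpy as np

def recency_weights(ages_in_days, half_life=30.0):
    """Exponential decay: with a 30-day half-life (our assumption), a
    month-old item counts half as much as a fresh one. Weighting training
    samples this way is one reason freshly planted falsehoods punch far
    above their weight."""
    return 0.5 ** (np.asarray(ages_in_days) / half_life)

print(recency_weights([0, 30, 90, 365]).round(3))
# [1.    0.5   0.125 0.   ] -- yesterday's lie outweighs last year's truth
```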

Repeatedly referring to journalists as "the enemy of the people" is not new in politics; President Nixon did it too. But the reality is more mundane: no large group of people in this country can be "the enemy of the people", not the media, not the CIA, not the FBI, and so on. These groups are made of people (still!), and the size and diversity of these groups work against any would-be desire to deceive in an organized manner. Naturally, individuals wishing to deceive exist, but not large groups. If the President is concerned that the information coming to our senses is false (which it often is, as we saw above in the case of social media), then it is his responsibility as an elected leader to work with Congress on fixing that problem with appropriate legislation, because our senses are all we have (at this time!) to gather information from our environment. If, on the other hand, he is urging us to believe only his interpretations of the truth, then it is our responsibility to politely decline:




But maybe, to use Hannah Arendt's wording, we are getting used to the banality of fake data. Given the increased danger fake data brings to AI, a strong political backbone is needed to conquer this new "banality". And so the current weakening of the Republican party should be of concern not just to Republicans, but to Democrats and Independents as well. To face the stress that AI will bring, we need the balance of ideas coming from two strong parties. The Republican party is hard to recognize today as the party of Lincoln, the party of reform, the party of immigration, and the party against slavery. It was not long ago that the Republicans were the progressives and the Democrats were the conservatives. Witness the momentous shift: the Civil Rights Act of 1964, apart from being one of Lyndon Johnson's biggest accomplishments, passed the House with more Republican support than Democratic support. That revolving lazy Susan of American politics is valuable, continuously refreshing the political discourse with unorthodox points of view from both sides; this refresh will be much needed when working on AI regulation.

When John McCain passed away, part of the fabric that made the Republican party unraveled. Acceptance of demagoguery, of a habitual bending of the truth, and of a tilt towards authoritarian rule were not usually associated with the party ... they are now. This acceptance threatens not just the integrity of the party, but our national ability to sustain our liberal democracy under the challenges posed by AI. We have seen in the AI and Liberal Democracy article that AI does not favor democracy. We have seen how central control of AI can be exercised over people, and we are already seeing what that control does in the case of the Social Credit System in China. Two well-matched and strongly principled parties would strengthen the search for AI solutions in ways that would not undercut our democracy.

Artificial intelligence under the control of "strongmen" in all three of the world's top military powers would truly be the most fearful AI doomsday scenario, for tomorrow, not years from now. Hopefully we will survive our current predicament somehow, and then we will be left with an urgency to legislate and make precise the limits of the Executive, so that we do not reach this point again in the future. No one could have foreseen this, not even the Founding Fathers. Many of us shrug off the present chaos; the nonsense seems so big that it just does not seem possible that it can lead to anything dramatic. But it did not work like that in the past: the bigger the nonsense, the more dangerous it was; history is littered with examples.

The use of AI techniques to deceive our most trusted senses is best exemplified by the creation of deepfakes, i.e., altered videos. We trust videos more than any other form of information. "Seeing is believing," goes the saying; we saw in the Free Energy Principle article that we can view the brain as a Bayesian inference engine processing sensory information, and videos carry a lot of such sensory information, both visual and auditory. They overwhelm our senses and often conclusively make up our minds. A race is now developing between AI systems that create deepfakes and AI systems that detect them. Deepfakes falsify politicians making statements, and they can stoke up fear. It is easier to create deepfakes than it is to detect them:
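One common detection pattern works frame by frame: score each sampled frame for signs of synthesis and aggregate the scores. The sketch below, using OpenCV, shows only the plumbing under that assumption; score_frame is a hypothetical stand-in for a trained classifier, which is where all the real difficulty lives.

```python
import cv2  # OpenCV, used here only to extract frames
import numpy as np

def score_frame(frame) -> float:
    """Hypothetical stand-in for a trained detector (e.g., a CNN trained
    on real vs. synthetic faces). Returns a dummy constant here."""
    return 0.5

def video_fake_probability(path: str, sample_every: int = 30) -> float:
    """Score every Nth frame and average: deepfake artifacts (blending
    seams, inconsistent blinking) rarely stay hidden in every frame."""
    cap = cv2.VideoCapture(path)
    scores, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % sample_every == 0:
            scores.append(score_frame(frame))
        index += 1
    cap.release()
    return float(np.mean(scores)) if scores else 0.0
```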




We can tie a few ideas together before we leave the negatives. First, we have seen in the Foundational Questions article that we do not yet understand how AI learning takes place. Second, we have seen in the same article that we cannot yet prove that AI systems indeed do what they are programmed to do. Third, we have seen in the paragraphs above that the data we use to train these AI systems may contain falsities or biases. If you put these three things together, you get a more accurate picture of the dangers that we face now, without having to speculate about the future. When these AI systems fail, they tend to fail really badly. You can push this realization just a bit further and imagine what AI failures would mean on the national graphs.







The Positives


The first component of the X-Quartet, critical thinking, seems like the best approach in our search for the third component, truth in data. We have seen in the sections above that the pressure to accept the information presented to us comes in many forms. We have also seen that methods to defend against misinformation are being developed, by the social media companies for example. But no matter how powerful these defensive methods are, they are no substitute for our doubt and our critical thinking. We start with Murray Gell-Mann's advice to have the courage to question even data that may seem obviously true. (There is a secondary reason to become acquainted with Gell-Mann: he is one of the founders of the modern theory of elementary particles and their interactions, and we touch on a few concepts from quantum mechanics on this website; you may appreciate those concepts better after listening to Gell-Mann or, even better, reading his books; among many other things, he discovered quarks.) Although it may seem self-evident, it is worth repeating ad nauseam that curated data is needed if we are to produce valuable AI models based on it. In fact, data scientists spend most of their time analyzing the quality of data, long before that data is passed on to AI systems.
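As a small illustration of that quality-analysis work, here is a sketch of a first-pass data audit in pandas; the particular checks and the toy table are our own choices, not a standard recipe.

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """A first-pass audit of the kind run long before model training:
    missing values, constant columns, and duplicates all distort models."""
    return pd.DataFrame({
        "missing_fraction": df.isna().mean(),
        "n_unique": df.nunique(),
        "is_constant": df.nunique() <= 1,
    })

# Toy example with deliberately dirty data.
df = pd.DataFrame({
    "age": [34, None, 29, 29, 280],             # a gap and an impossible value
    "country": ["US", "US", "US", "US", "US"],  # constant, carries no signal
})
print(quality_report(df))
print("duplicate rows:", df.duplicated().sum())
```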




Murray Gell-Mann refers to data from the natural world, and he is interested in exercising courage and doubt when trying to explain that data through a physics theory. In this article we are preoccupied more with data originating with people than with data from the natural world. The data from the natural world is usually produced by trusted instruments. With data produced by people, the truthfulness of that data is more questionable, and truth itself is harder to judge, since truth in social situations has more layers to it. Interestingly enough, producing untruthful data carries a significant penalty, and it brings more suffering into an individual's life:




So the nature of truth in humanly produced data requires more care. Our foray into politics had a decidedly negative tone in the paragraphs above. If we combine this difficulty of truth with that negative tone, are we doomed? Will the current partisan politics prevent us from finding the truthful data on which AI needs to run? Not necessarily; it is more likely that the current malaise is a blip, as there are sufficiently many people of integrity on both sides of the political divide, people who, although they may strongly disagree with each other, nevertheless have the capacity to work together in that search for truth, and for better outcomes to the problems we may face. It is a pity that someone like Senator Jeff Flake of Arizona cannot find a place in the political life of this country; the moment he tapped Senator Chris Coons on the shoulder stands as one of the most dignified acts we have seen recently. It is worth watching more than once, as an antidote to the prevailing cynicism and resignation:




Continuing on this more optimistic path, data on social media coming from people we know and trust can be informative and useful, provided we continue to exercise critical thinking. With proper regulatory oversight, social media will most likely find a healthy place in our lives. We have seen that data on social media sites can be mined by companies like Cambridge Analytica to promote political agendas, with distorted messaging targeted very precisely. But there are also groups who mine that data with AI techniques for worthy causes:




Because deepfakes represent such an imminent danger, DARPA (the Defense Advanced Research Projects Agency) is financing AI research into deepfake detection, and some of this work is done at the Artificial Intelligence Center at SRI in Menlo Park, California. Through the years, the center has produced many breakthroughs in AI development, and many other breakthroughs besides, for example the protocols that make the Internet possible. BusinessWeek magazine called SRI "Silicon Valley's soul".




W. Edwards Deming is credited with saying "In God we trust; all others must bring data". UK mathematician Clive Humby quipped in 2006 that "data is the new oil", and the quip became ubiquitous after The Economist published a report titled "The world's most valuable resource is no longer oil, but data". Businesses have been very quick to adopt this grand view of data, and they do now bring data: massive amounts of it, not just about their products but also about their customers. This relentless gauging of our wants, our desires, and our actions may seem overwhelming and intrusive, but it is probably here to stay. If it all looks depressing, let's end our evaluation of attitudes towards truth in data on a lighter note:





My head knocks against the stars.
My feet are on the hilltops.
My finger-tips are in the valleys and shores of
universal life.
Down in the sounding foam of primal things I
reach my hands and play with pebbles of
destiny.
I have been to hell and back many times.
I know all about heaven, for I have talked with God.
I dabble in the blood and guts of the terrible.
I know the passionate seizure of beauty
And the marvelous rebellion of man at all signs
reading "Keep Off."
My name is Truth and I am the most elusive captive
in the universe.
Who Am I?

- Carl Sandburg


(... in memoriam ...)