Friday, October 17, 2008

More Terror Watch List Glitches?!

Shortly after I wrote Terror Watch List Glitches, more disturbing reports about problems with "terror-watch-list"-related systems started popping up. Most were saying that the expected upgrades (worth half a billion dollars!) might not correct the problems and might even create new ones. All the fuss seemed to be about "project RAILHEAD", which is supposed to replace the Terrorist Identities Datamart Environment (TIDE).

I refrained from publishing this post right away. There were tons of articles that amounted to little more than snippets from a congressional memo slapped together, so I decided not to add my voice to the general noise. Instead, I took some time to learn more about TIDE and RAILHEAD and to analyze those claims from a purely technical perspective, leaving out the politics.

CNN's piece Terrorist Watch List Woes[1] (UPDATE: the video is not available anymore), Siobhan Gorman's article Flaws Found In Watch List For Terrorists[2] in the Wall Street Journal, as well as some other sources[3][4], all seem to be alluding - with varying degrees of accuracy and technical competence - to one document: the Staff Memo of August 21, 2008[5] (UPDATE: the document is not available anymore) of the House Science and Technology Committee's Subcommittee on Investigations and Oversight. If you don't feel like reading all 11 pages of the memo (UPDATE: especially now that it is not on their web site; aren't they supposed to keep that kind of stuff?!) and digging for more information, I am going to give you the scoop. Again, I am interested only in the IT aspects, not the politics.

First, some historical background

Before 9/11, according to Ronald Kessler, the government maintained four different terrorist identity databases and 13 independent watch lists. It is believed that it was in part because of this database incompatibility that two of the 9/11 hijackers managed to slip into this country.[6] The Intelligence Reform and Terrorism Prevention Act of 2004 gave birth to the National Counterterrorism Center (NCTC), which - among other things - was supposed to bridge those gaps and to:
  • "establish common information technology standards, protocols, and interfaces";
  • "ensure development of information technology systems that include... intelligence integration capabilities".[7]

NCTC's IT efforts led to the emergence of the Terrorist Identities Datamart Environment (TIDE), which, according to NCTC, is the US Government's central repository of information on international terrorist identities. TIDE supports various "watch lists", TSA's "no-fly" list being one of them, and the US Intelligence Community's overall counterterrorism mission.[8] As the former NCTC Director, John Redd, once described it, it is "the mother of all databases"[6]... or, at least, that is what it was supposed to be...

In reality, however, it is not as seamless as one would imagine. According to Ronald Kessler, as of August 2006, John Redd, NCTC Director at the time, needed seven separate computers to access the separate networks of the FBI, CIA, NSA, and DoD. All told, the NCTC tapped into 28 (according to other sources, over 30) separate computer networks.[6] In his interview on C-SPAN in January 2007, Major General (Ret.) Dale Meyerrose, Chief Information Officer for the Director of National Intelligence, called it "wrestling the information technology..." He then added that his "job is, in essence, to make the information technology invisible."[9]

Some time after Meyerrose's C-SPAN interview, the codename "project RAILHEAD" started floating around in open sources. RAILHEAD was, and probably still is, supposed to replace TIDE and bring all the "goodness" of General Meyerrose's "invisible" IT to its end users. I think the most recent deployment date I heard was January 2009.

The memo and my comments

And here comes the 8/21/08 memo of the House Science and Technology Committee's Subcommittee on Investigations and Oversight. It claims that the planned upgrades to TIDE, TIDE Online (TOL), and NCTC Online (NOL) would actually "diminish, not improve" their capabilities, limiting the ability to share terrorist intelligence data among federal agencies and crippling the ability of counterterrorism analysts to conduct searches of these databases.[5] Like I said, I am not interested in the politics of this controversy, so I will spare you the where-did-the-money-go and who-is-to-blame type of stuff. Here is the "technical scoop" with my in-line comments.

According to the memo,

In TIDE (i.e. the system currently in use) "only about 60% of the data, including names and addresses, mentioned in CIA cables provided to NCTC are actually extracted from these messages and placed into the TIDE database"[5]

    I was not able to get hold of the original document cited in the memo ("Internal Users Focus Group", Meeting Minutes, RAILHEAD, May 12, 2008). I would imagine that it contains more details. From the memo, however, it is not clear what exactly the problem is. What is the format of the "CIA cables"? If they are in plain text (just an assumption), it is kind of hard for the application to "guess" which of the words are the names of people (as opposed to, say, names of villages) unless they are specifically tagged as such. It is also not clear how come 60% do get extracted while the other 40% don't. Probably the easiest way to make the CIA cables (assuming they are in plain text format) "parseable" would be for all the participating agencies to adopt some minimalistic set of tags and use them to mark up those "cables"; a parser could then extract the tagged elements and put them into the appropriate columns of the appropriate tables of the database, as in the sketch below.
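
    To illustrate (purely hypothetically - I have no idea what the real cable format looks like, and the tag names below are my own invention, not any real agency standard), extracting elements from a tagged cable is trivial:

        import re

        # A purely hypothetical "cable" marked up with a minimal set of tags.
        cable = """
        On 12 May a source reported that <person>Jon Doe</person> left
        <place>Peshawar</place> for <place>Quetta</place> carrying
        documents for <org>Example Front</org>.
        """

        # Extract every tagged element into (tag, value) pairs that could then
        # be placed into the appropriate columns of the appropriate tables.
        TAG_PATTERN = re.compile(r"<(person|place|org)>(.*?)</\1>")

        for tag, value in TAG_PATTERN.findall(cable):
            print(tag + ":", value)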

"The TIDE database has evolved overtime as both contractors and government employees have attempted to expand and enhance the database to improve their own use of the system. But none of them appear to have taken into account the overall design or engineering architecture of the entire system. As a result, there are now dozens of tables or categories for identical fields of information making the ability to search or locate key data inefficient, ineffective and more time consuming and difficult than necessary."[5]

    This statement is also rather unclear, but it looks like end users needed more fields to store data. Let me give you a very crude and oversimplified example. Say, there is a table "Person" with the following columns: PersonID, FirstName, LastName, BirthDate. At some point users realize that they also need to store a person's birth place. This can be done in multiple ways: (a) add one more column to the table "Person", call the column BirthPlace, and use it to store the country of birth and the city of birth together; (b) add two more columns to the table "Person", call them BirthCountry and BirthCity, and store the data in those two separate columns; (c) add two more columns to the table "Person", call them BirthCountryID and BirthCityID, create two more tables, "Country" (with columns CountryID and CountryName) and "City" (with columns CountryID, CityID, and CityName), and store the data in a more normalized fashion (see the sketch below). I am not going to explain here which way is better and why, because that does not seem to be the problem here. What seems to be the problem is that Agency A added their own additional tables, Agency B - their own, and Agency C - their own. Instead of hacking the database on their own, those agencies should have informed the developers (actually, the business systems analysts) about their needs. Those needs should have been translated into requirements, and those requirements should have been implemented in the next iteration of the software, thus preserving the overall architecture of the system. This is precisely how the spiral model (which, by the way, General Meyerrose advocated on C-SPAN[9]) is supposed to work... ideally.
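
    Here, for illustration, is what option (c) might look like (a minimal sketch using the hypothetical table and column names from my example above, nothing more):

        import sqlite3

        conn = sqlite3.connect(":memory:")
        conn.executescript("""
        CREATE TABLE Country (
            CountryID   INTEGER PRIMARY KEY,
            CountryName TEXT NOT NULL
        );
        CREATE TABLE City (
            CityID    INTEGER PRIMARY KEY,
            CountryID INTEGER NOT NULL REFERENCES Country(CountryID),
            CityName  TEXT NOT NULL
        );
        CREATE TABLE Person (
            PersonID       INTEGER PRIMARY KEY,
            FirstName      TEXT,
            LastName       TEXT,
            BirthDate      TEXT,
            BirthCountryID INTEGER REFERENCES Country(CountryID),
            BirthCityID    INTEGER REFERENCES City(CityID)
        );
        """)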

    At the same time, one should take into consideration that there may be extended time gaps between software product iterations. There may be situations when users need to store data NOW, not NEXT YEAR, and the database schema of the current software product iteration is unable to accommodate such needs. One can blame the business systems analysts for the inability to elicit the actual clients' needs, but I would rather blame them and the application designers for the inability to build extensibility into the system (even without explicit requirements to do so). Look at any off-the-shelf CRM, and you will see that most of them (at least the good ones) allow users to create different kinds of fields in addition to the default ones. How fundamentally different can a terrorist database be from a customer relationship (or constituent/employee/student/etc.) management system? The concept is pretty much the same: they store information about people (and allow users to add more attributes as needed), describe various relationships between people, people and documents, people and events, people and places, etc. Make the "terrorist management system" extensible by design in a similar fashion, and nobody will need to hack into the database itself and create "undocumented" tables.
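
    Just to show the idea (a very simplified sketch of the way many CRMs implement user-defined fields; I am not claiming TIDE works, or should work, exactly this way):

        import sqlite3

        conn = sqlite3.connect(":memory:")
        conn.executescript("""
        -- Users define new attribute types without touching the core schema.
        CREATE TABLE CustomField (
            FieldID   INTEGER PRIMARY KEY,
            FieldName TEXT NOT NULL UNIQUE,
            FieldType TEXT NOT NULL          -- 'text', 'date', 'number', ...
        );
        -- One row per (person, custom field) value.
        CREATE TABLE PersonFieldValue (
            PersonID INTEGER NOT NULL,
            FieldID  INTEGER NOT NULL REFERENCES CustomField(FieldID),
            Value    TEXT,
            PRIMARY KEY (PersonID, FieldID)
        );
        """)

        # An end user adds a "PhysicalMarks" field and fills it in:
        # no new tables, no hacking of the database schema.
        conn.execute("INSERT INTO CustomField VALUES (1, 'PhysicalMarks', 'text')")
        conn.execute("INSERT INTO PersonFieldValue VALUES (42, 1, 'left thumb is missing')")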

    End users should not think in terms of tables. Instead, they should think in terms of objects similar to real-life objects. A person is an object. A person's most basic attributes are: FirstName, LastName, BirthDate, SSN, etc.; more may be added as needed, for example PhysicalMarks (like "left thumb is missing" and/or "has a tattoo of a dragon on the right shoulder blade"; the latter can be linked to an image or images of the tattoo, if available). An organization is an object. Its most basic attributes might be: OrganizationName and OrganizationType (political party, business, school, etc.); again, more may be added as necessary. An event is an object. Its basic attributes most probably would be: EventLocation and EventDate, but one can add EventType (convention, rally, demonstration, meeting). A document (a memo, a letter, a photograph) is also an object. Objects are related to one another. Two people can be friends, siblings, or parent and child. A person can be a member of an organization, a student of a school, etc. A person can be an attendee of an event. An organization can be the organizer of, or a participant in, an event (which may indirectly link members of the organization to the event even if some of them did not actually attend it). People may belong to the same organization (which may indirectly link them even though they might not know one another). People may be related to a document as its authors, recipients, or subjects. And so on and so forth.

    This is a very crude explanation, but you probably got the general idea. How all of these objects are actually mapped to a database and in how many different tables the data "lives" is a totally different question, and it should not concern end users. All they need is available objects with universally applicable attributes (every person has a name and a birth date, every event has a location and a date, etc.) that can be extended as needed (including the creation of new object types). They should also be able to connect objects, i.e. define relationships between them. Relationships may also need to have some attributes, and users should be able to create new types of relationships as well. All of this they should be able to do through whatever user interface they have. In most cases, this will save them the trouble of writing up new requirements for every single extra data field they might need and then waiting for those to be implemented, or hacking the database. A sketch of such a schema follows.
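
    A minimal sketch of what such a generic object/relationship schema might look like (my own illustration; I am not claiming this is RAILHEAD's or TIDE's actual design):

        import sqlite3

        conn = sqlite3.connect(":memory:")
        conn.executescript("""
        -- Every real-life "thing" is an object of some user-extensible type.
        CREATE TABLE ObjectType (
            TypeID   INTEGER PRIMARY KEY,
            TypeName TEXT NOT NULL UNIQUE    -- 'Person', 'Organization', 'Event', 'Document'
        );
        CREATE TABLE Object (
            ObjectID INTEGER PRIMARY KEY,
            TypeID   INTEGER NOT NULL REFERENCES ObjectType(TypeID)
        );
        -- Attributes hang off objects; new ones can be added at any time.
        CREATE TABLE Attribute (
            ObjectID  INTEGER NOT NULL REFERENCES Object(ObjectID),
            AttrName  TEXT NOT NULL,         -- 'FirstName', 'EventDate', ...
            AttrValue TEXT
        );
        -- A relationship connects two objects and has its own extensible type.
        CREATE TABLE Relationship (
            FromObjectID INTEGER NOT NULL REFERENCES Object(ObjectID),
            ToObjectID   INTEGER NOT NULL REFERENCES Object(ObjectID),
            RelType      TEXT NOT NULL       -- 'member of', 'attended', 'author of', ...
        );
        """)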

"TIDE database relies on Structured Query Language (SQL), a cumbersome computer code that must utilize complicated sentence structures to query the tables, rows and columns that encompass the TIDE database. Without proper documentation on whether a table contains information on names, addresses, vehicles, license plates or an individual’s nationality, for instance, analysts have no valid mechanism to conduct a search of these “undocumented” tables."[5]

    Any normal human being can learn enough SQL to write SELECT statements with basic joins in about two weeks or less, so the complaint about the complexity of SQL is pretty much a bogus one. The question, however, is whether end users should be allowed to execute "raw" SQL queries at all. In most cases, most users should NOT. Most users should only be allowed to interact with a database through forms where they enter search criteria and apply filters (when they view data), enter new data, and edit or delete existing data. Behind it all there is still SQL, but it is hidden from the users.
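
    For the record, here is the kind of "complicated sentence structure" we are talking about (a self-contained toy example reusing the hypothetical Person/Country tables from my sketch above):

        import sqlite3

        conn = sqlite3.connect(":memory:")
        conn.executescript("""
        CREATE TABLE Country (CountryID INTEGER PRIMARY KEY, CountryName TEXT);
        CREATE TABLE Person  (PersonID INTEGER PRIMARY KEY, FirstName TEXT,
                              LastName TEXT, BirthCountryID INTEGER);
        INSERT INTO Country VALUES (1, 'Ruritania');
        INSERT INTO Person  VALUES (1, 'Jon', 'Doe', 1);
        """)

        # A typical analyst-style query with one join: hardly rocket science.
        for first, last, country in conn.execute("""
            SELECT p.FirstName, p.LastName, c.CountryName
            FROM Person AS p
            JOIN Country AS c ON c.CountryID = p.BirthCountryID
            WHERE p.LastName LIKE 'Doe%'
        """):
            print(first, last, country)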

    At the same time, one should take into consideration that it is very difficult to design a query form or a graphical query builder that is as flexible as a "raw" SQL statement. So, I believe that a very limited number of trained users should be allowed to enter and execute "raw" SELECT SQL queries. However, it does not really matter how good you are at writing SQL statements if you don't know which table contains what data.

“Existing TIDE data model is complex, undocumented, and brittle... which poses significant risk to RLSI [Railhead Lead System Integrator] data migration and modeling.”[5]

    The memo cites another document - "Technical Exchange Meeting", Bahama IDP, RAILHEAD, Monday, 16 July 2007 - which I was not able to obtain, but, I am sure, anyone would agree that data migration from a database that "has acquired a life of its own" is a sure recipe for disaster.

"'Pocket litter', for instance, the scraps of information obtained when law enforcement, military or other officials empty a suspect's pockets, including phone numbers, addresses or credit card information, is contained in 23 separate tables in TIDE, rather than one single uniform table."[5]

    This is a totally bogus complaint. Here is why. Let's imagine you want to make a database for your CD collection. Your first instinctive move would be to create "one uniform" table "CompactDisk" with the following columns: DiskID, DiskTitle, Artist, Genre. Seems logical, doesn't it? Now, if you start entering your CDs into the database, you will notice that you have CDs recorded by the same artist (e.g. "The Dream of the Blue Turtles" by Sting and "Soul Cages" by Sting). It means you have to enter "Sting" twice, but if you are a real fan of Sting and have most of his albums, you would have to do it some 20 or so times. Not only is it inefficient from the point of view of database design, but it also makes you, the user, do extra work and increases input errors (try typing "Creedence Clearwater Revival" or "Rage Against the Machine" 20 times and see how many typos you make). Wouldn't it be better to enter an artist's name once and then just pick it from a list? The answer is two tables: one for albums, the other for artists. And, yes, you will need one more for genres. And then, if you decide to store track titles, you will need one more. If you then decide to record when your friends borrow your CDs and return them, you will need one more table for friends and one more for check-in/check-out records. And so on. It is not as bad as it sounds, because relational databases provide the mechanism to link related tables together. There is also such a thing as "views" to make multiple tables look like one. If the application is correctly designed, end users should not even know how many tables they retrieve data from.
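
    For the curious, here is roughly what that looks like (a toy sketch, obviously; the view at the end makes the three tables look like the "one single uniform table" users instinctively expect):

        import sqlite3

        conn = sqlite3.connect(":memory:")
        conn.executescript("""
        CREATE TABLE Artist (ArtistID INTEGER PRIMARY KEY, ArtistName TEXT NOT NULL);
        CREATE TABLE Genre  (GenreID  INTEGER PRIMARY KEY, GenreName  TEXT NOT NULL);
        CREATE TABLE CompactDisk (
            DiskID    INTEGER PRIMARY KEY,
            DiskTitle TEXT NOT NULL,
            ArtistID  INTEGER REFERENCES Artist(ArtistID),
            GenreID   INTEGER REFERENCES Genre(GenreID)
        );
        INSERT INTO Artist VALUES (1, 'Sting');
        INSERT INTO Genre  VALUES (1, 'Rock');
        INSERT INTO CompactDisk VALUES (1, 'The Dream of the Blue Turtles', 1, 1);
        INSERT INTO CompactDisk VALUES (2, 'Soul Cages', 1, 1);

        -- "Sting" is stored once; the view joins everything back together.
        CREATE VIEW DiskCatalog AS
        SELECT d.DiskTitle, a.ArtistName, g.GenreName
        FROM CompactDisk d
        JOIN Artist a ON a.ArtistID = d.ArtistID
        JOIN Genre  g ON g.GenreID  = d.GenreID;
        """)

        for row in conn.execute("SELECT * FROM DiskCatalog"):
            print(row)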

In RAILHEAD (i.e. the new system), "users will no longer have access to data that will not be migrated [to the new system], such as pocket litter and border summaries."[5]

    The memo cites another document - "RAILHEAD: System Concept Definition (SCD), SCD NOL-J Gap Analysis," Final Version 1.0, Submitted: 18 June 2008," SRI International - that I have not been able to obtain. It is not clear why the decision not to migrate some of the data to the new system has been made: maybe, because the database has been extended in an uncontrolled way (see above) or, maybe, for some other reason.

The next several pages of the memo deal with the choice between an RDBMS-based and an XML-based solution. The memo claims, citing multiple documents, some of which I have been able to familiarize myself with, that the decision to build RAILHEAD on an XML platform from MarkLogic Corporation was a very bad one.[5] I am not going to reproduce the text of the memo here. Please go to the original document if necessary.

    First of all, I have to admit that I have been around relational databases for almost 20 years, so I am not impartial, but I am trying to be as objective as possible.

    An RDBMS is an engine that stores data and allows users (or rather, the application front end) to manipulate (retrieve, insert, update, and delete) data that satisfies certain conditions in a more or less standardized way (through SQL). It also takes care of data integrity (for example, if you try to delete an artist in the CD database I described above, and there are albums associated with him/her, the database engine will - depending on the configuration - either not delete anything or delete the artist and all of his/her albums). These are probably the most important, but not the only, features most modern RDBMS's provide. Others worth mentioning might be: the ability to compare old and new values before or after an update, strict datatypes (you will not be able to enter the name of the friend who is borrowing your CDs into the field for the check-out date), timestamps, etc.
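
    The data integrity example above, spelled out (a sketch using SQLite, which enforces foreign keys only if you explicitly switch the enforcement on):

        import sqlite3

        conn = sqlite3.connect(":memory:")
        conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only on request
        conn.executescript("""
        CREATE TABLE Artist (ArtistID INTEGER PRIMARY KEY, ArtistName TEXT NOT NULL);
        CREATE TABLE CompactDisk (
            DiskID    INTEGER PRIMARY KEY,
            DiskTitle TEXT NOT NULL,
            -- RESTRICT: refuse to delete an artist who still has albums.
            -- (ON DELETE CASCADE would delete the albums along with the artist.)
            ArtistID  INTEGER REFERENCES Artist(ArtistID) ON DELETE RESTRICT
        );
        INSERT INTO Artist VALUES (1, 'Sting');
        INSERT INTO CompactDisk VALUES (1, 'Soul Cages', 1);
        """)

        try:
            conn.execute("DELETE FROM Artist WHERE ArtistID = 1")
        except sqlite3.IntegrityError as e:
            print("The engine protected data integrity:", e)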

    Is it possible to make XML do all those things? XML does not do anything by itself. It is just a format, but it is possible to create an engine capable of doing anything a database does while storing data in XML format. The question, however, is why anyone would want to reinvent the wheel when there is a wide array of excellent, well-tested RDBMS's that already do all of those things. It is my understanding (based on the information provided on the company web site) that MarkLogic Corporation has built a database-like engine that handles XML. The engine is called MarkLogic Server. I have not used it, so I cannot say anything as to how good it is. Their web site lists a lot of features that sound quite impressive. In addition to the database-like features, they claim that, because they use XML, you don't have to struggle trying to squeeze your data into a pre-defined schema, because an XML document is self-describing.[13] But, as I have explained (or at least tried to explain) above, if designed intelligently, an RDBMS-based system can also be made pretty much infinitely extensible without hacking the database schema. Honestly, of the features listed on the web site of Mark Logic Corporation (at least the ones that are clearly described and do not sound like promises of miracles, and some of them do sound that way), I have not seen any that cannot be implemented on top of an RDBMS.
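
    To show what "self-describing" means in practice (a made-up record, of course): the field names travel with the data, so a record can carry fields no schema designer anticipated.

        import xml.etree.ElementTree as ET

        # A "self-describing" record: the tags themselves name the fields.
        record = ET.fromstring("""
        <person>
            <firstName>Jon</firstName>
            <lastName>Doe</lastName>
            <physicalMark>tattoo of a dragon on the right shoulder blade</physicalMark>
        </person>
        """)

        for field in record:
            print(field.tag, "=", field.text)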

    One could build an excellent solution, as well as a very bad one, on either platform. Considering that any data migration is a major undertaking that requires a lot of planning, time, and resources and can potentially lead to loss of data, it probably makes more sense to incrementally extend and improve the existing system (based on the feature requests and information about the shortcomings that should have been collected and translated into requirements since TIDE was deployed) than to make a leap to a totally new platform that does not seem to offer any tangible benefits.

    Another aspect one should take into consideration is how large the installation base of a product is. The reason is very simple: the larger the installation base, the more defects have been identified and - one would hope - fixed. Another reason is that there are more consultants available. The web site of Mark Logic Corporation has a fairly long list of companies that use their product. However, it does not come even close to the installation base of any popular RDBMS.

    I have not been able to find any objective third-party reviews of MarkLogic Server. Maybe I was not looking hard enough, but the only article I found reads like a typical paid promotional piece.[15]

RAILHEAD "failed to find non-exact matches for key entities, such as suspected terrorist's names."[5]

    It is not totally clear what kind of "non-exact matches" they are talking about. There may be several possible scenarios here: (a) searches with LIKE (e.g., "John" also returns "Johnson", "StJohn", "StJohnson", and the like), but in this case users have to know exactly what sub-string they are looking for; (b) searches using a phonetic algorithm (using SOUNDEX as an example, "John" will return "John", "Jon", "Jun", "Jane", "Joan", and even "J. On" or "Jo N."); unfortunately, phonetic algorithms based on English phonetics are pretty much useless when it comes to foreign names (let alone typos and cases like "Bob Smith" vs "Robert Smith" vs "R. J. Smith"); (c) a "fuzzy search" algorithm that better handles typos, name variants, and non-English names. There are fuzzy search algorithms available... even for free (see the sketch below). $.5 billion should be more than enough for some customization. One should understand, however, that any fuzzy search will inevitably return irrelevant matches, or "noise". The "fuzzier" the search, the more "noise" it returns.
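
    As a taste of option (c), here is the classic edit-distance approach (one of those freely available algorithms; a toy sketch, not production code):

        def levenshtein(a: str, b: str) -> int:
            """Minimum number of single-character edits turning a into b."""
            prev = list(range(len(b) + 1))
            for i, ca in enumerate(a, 1):
                curr = [i]
                for j, cb in enumerate(b, 1):
                    curr.append(min(prev[j] + 1,                 # deletion
                                    curr[j - 1] + 1,             # insertion
                                    prev[j - 1] + (ca != cb)))   # substitution
                prev = curr
            return prev[-1]

        # Names within edit distance 2 of the query count as "matches";
        # raising the threshold makes the search "fuzzier" (and noisier).
        names = ["Mohammed", "Muhammad", "Mohamed", "Mahmoud", "Smith"]
        query = "Mohamad"
        print([n for n in names if levenshtein(query.lower(), n.lower()) <= 2])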

    According to Mark Logic, their platform supports searches with wildcards. Also, "MarkLogic Server uses an advanced double meta-phone algorithm for expansion of query terms using similar sounding words".[14] Not another phonetic algorithm (see my previous post Terror Watch List Glitches)!

"Incredibly, it also failed to demonstrate the ability to use basic Boolean search terms, such as AND, OR and NOT."[5]

    According to Mark Logic, "MarkLogic Server supports Boolean constructs of “and”, “or”, “and not”, “not” and “or” that can be arbitrarily nested to develop advanced, ordered Boolean constructs."[14]

    The title of the document cited in the memo, "RAILHEAD: NOL-J Gap Analysis", seems to imply that this defect applies only to NCTC Online (NOL). Still, if this is true (and, honestly, it is hard even to believe), considering that the whole point of this type of software is searching, it does sound like a pretty bad bug.

The memo also mentions a fairly high number of other tests that failed, without going into the specifics of what the actual defects were.

    Although the number of defects identified is quite high, it does not necessarily mean that the software is bad. It would actually be more suspicious if no defects had been identified. One has to understand that there is no such thing as bug-free software and that finding and fixing bugs is a normal part of any software development process.

Conclusion... sort of...

I spent quite some time digging for information about TIDE and RAILHEAD. Still, I do not believe I know enough to draw any conclusions as to whether RAILHEAD's overall design is flawed or not and whether it will negatively affect counterterrorism data collection and analysis. I guess we will just have to wait and see.

Sources:


1. Jeanne Meserve, "Terrorist Watch List Woes", August 23, 2008, CNN (the video is not available anymore)

2. Siobhan Gorman, "Flaws Found In Watch List For Terrorists", August 22, 2008, WSJ.com

3. Stephanie Condon, "Fatal flaws found in terrorism database", August 22, 2008, C|NET News

4. Michael Krigsman, "Failed government IT: ‘The mother of all databases’", August 26, 2008, ZDNet

5. RE: Questions regarding technical flaws, poor government oversight and potential contractor mismanagement on the National Counterterrorism Center's RAILHEAD information technology development program, Staff Memo, August 21, 2008, Subcommittee on Investigations and Oversight, U.S. House of Representatives Committee on Science and Technology (the memo cannot be found on the web site, or anywhere else for that matter)

6. Ronald Kessler, "NCTC: Up to 70 Terrorist Plots Each Day", August 15, 2006, NewsMax.Com (UPDATE: the article appears to be unavailable)

7. Intelligence Reform and Terrorism Prevention Act of 2004 available from the website of U.S. Government Publishing Office

8. Terrorist Identities Datamart Environment (TIDE), Fact Sheet, National Counterterrorism Center (NCTC) (UPDATE: the document appears to be unavailable at its original location, but still can be downloaded from the website of the Office of the Director of National Intelligence)

9. C-Span Roundtable with Dale W. Meyerrose on Intelligence Information Sharing, January 3, 2007, C-SPAN

10. Introduction to Defense Acquisition Management, 7th edition, September 2005, Defense Acquisition University Press

11. Spiral model, Wikipedia.org, retrieved on 2008-10-16

12. Barry W. Boehm, "A Spiral Model of Software Development and Enhancement", May 1988 (not available at this URL anymore)

13. MarkLogic Server, Mark Logic Corporation (there is no "Marklogic Server" product anymore; the page is unavailable)

14. MarkLogic Server Search Features, MarkLogic Server 3.2, Mark Logic Corporation (there is no "Marklogic Server" product anymore; the document is unavailable)

15. Josh Cable, "Mark Logic technology key to success of Army’s Warrior Knowledge Base", June 23, 2008, GovPro.com
