Friday, October 17, 2008

More Terror Watch List Glitches?!

Shortly after I wrote Terror Watch List Glitches, more disturbing reports about problems with "terror-watch-list"-related systems started popping up. Most were saying that the expected upgrades (worth half a billion dollars!) might not correct the problems and might even create new ones. All the fuss seemed to be about "project RAILHEAD", which is supposed to replace the Terrorist Identities Datamart Environment (TIDE).

I refrained from publishing this post right away. There were tons of articles that amounted to little more than snippets from a congressional memo slapped together, so I decided not to add my voice to the general noise. Instead, I took some time to learn more about TIDE and RAILHEAD and to analyze those claims from a purely technical perspective, leaving out the politics.

CNN's piece Terrorist Watch List Woes[1] (UPDATE: the video is not available anymore), Siobhan Gorman's article Flaws Found In Watch List For Terrorists[2] in the Wall Street Journal, as well as some other sources[3][4], all seem to be alluding - with varying degrees of accuracy and technical competence - to one document: the Staff Memo of August 21, 2008[5] (UPDATE: the document is not available anymore) of the House Science and Technology Committee's Subcommittee on Investigations and Oversight. If you don't feel like reading all 11 pages of the memo (UPDATE: especially now that it is not on their web site; aren't they supposed to keep that kind of stuff?!) and digging for more information, I am going to give you the scoop. Again, I am interested only in the IT aspects, not the politics.

First, some historical background

Before 9/11, according to Ronald Kessler, the government maintained four different terrorist identity databases and 13 independent watch lists. It is believed that it was in part because of this database incompatibility that two of the 9/11 hijackers managed to slip into this country.[6] The Intelligence Reform and Terrorism Prevention Act of 2004 gave birth to the National Counterterrorism Center (NCTC), which - among other things - was supposed to bridge those gaps and to:
  • "establish common information technology standards, protocols, and interfaces";
  • "ensure development of information technology systems that include... intelligence integration capabilities".[7]

NCTC's IT efforts led to the emergence of the Terrorist Identities Datamart Environment (TIDE), which, according to NCTC, is the US Government's central repository of information on international terrorist identities. TIDE supports various "watch lists", TSA's "no-fly" list being one of them, and the US Intelligence Community's overall counterterrorism mission.[8] As the former NCTC Director, John Redd, once described it, it is "the mother of all databases"[6]... or, at least, that is what it was supposed to be...

In reality, however, it is not as seamless as one would imagine. According to Ronald Kessler, as of August 2006, John Redd, NCTC Director at the time, needed seven separate computers to access the separate networks of the FBI, CIA, NSA, and DoD. All told, the NCTC tapped into 28 (according to other sources, over 30) separate computer networks.[6] In his interview on C-SPAN in January 2007, Major General (Ret.) Dale Meyerrose, Chief Information Officer for the Director of National Intelligence, called it "wrestling the information technology..." He then added that his "job is, in essence, to make the information technology invisible."[9]

Some time after Meyerrose's C-SPAN interview, the codename "project RAILHEAD" started floating around in open sources. RAILHEAD was, and probably still is, supposed to replace TIDE and bring all the "goodness" of General Meyerrose's "invisible" IT to its end users. I think the most recent deployment date I heard was January 2009.

The memo and my comments

And here comes the 8/21/08 memo of the House Science and Technology Committee's Subcommittee on Investigations and Oversight. It claims that the planned upgrades to TIDE, TIDE Online (TOL), and NCTC Online (NOL) would actually "diminish, not improve" their capabilities, limiting the ability to share terrorist intelligence data among federal agencies and crippling the ability of counterterrorism analysts to conduct searches of these databases.[5] Like I said, I am not interested in the politics of this controversy, so I will spare you the where-did-the-money-go and who-is-to-blame type of stuff. Here is the "technical scoop" with my in-line comments.

According to the memo,

In TIDE (i.e. the system currently in use) "only about 60% of the data, including names and addresses, mentioned in CIA cables provided to NCTC are actually extracted from these messages and placed into the TIDE database"[5]

    I was not able to get hold of the original document cited in the memo ("Internal Users Focus Group", Meeting Minutes, RAILHEAD, May 12, 2008). I would imagine that it contains more details. From the memo, however, it is not clear what exactly the problem is. What is the format of the "CIA cables"? If they are in plain text (just an assumption), it is kind of hard for the application to "guess" which of the words are the names of people (as opposed to, say, names of villages) unless they are specifically tagged as such. It is also not clear how come 60% do get extracted while the other 40% don't. Probably the easiest way to make the CIA cables (assuming they are in plain text format) "parseable" would be for all the participating agencies to adopt some minimalistic set of tags and use them to mark up those "cables"; a parser could then extract the tagged elements and put them into the appropriate columns of the appropriate tables of the database, as in the sketch below.
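
    To illustrate (purely hypothetically - I have no idea what the real cable format looks like, and the tag names below are my own invention, not any real agency standard), extracting elements from a tagged cable is trivial:

        import re

        # A purely hypothetical "cable" marked up with a minimal set of tags.
        cable = """
        On 12 May a source reported that <person>Jon Doe</person> left
        <place>Peshawar</place> for <place>Quetta</place> carrying
        documents for <org>Example Front</org>.
        """

        # Extract every tagged element into (tag, value) pairs that could then
        # be placed into the appropriate columns of the appropriate tables.
        TAG_PATTERN = re.compile(r"<(person|place|org)>(.*?)</\1>")

        for tag, value in TAG_PATTERN.findall(cable):
            print(tag + ":", value)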

"The TIDE database has evolved overtime as both contractors and government employees have attempted to expand and enhance the database to improve their own use of the system. But none of them appear to have taken into account the overall design or engineering architecture of the entire system. As a result, there are now dozens of tables or categories for identical fields of information making the ability to search or locate key data inefficient, ineffective and more time consuming and difficult than necessary."[5]

    This statement is also rather unclear, but it looks like end users needed more fields to store data. Let me give you a very crude and oversimplified example. Say, there is a table "Person" with the following columns: PersonID, FirstName, LastName, BirthDate. At some point users realize that they also need to store a person's birth place. This can be done in multiple ways: (a) add one more column to the table "Person", call the column BirthPlace, and use it to store the country of birth and the city of birth together; (b) add two more columns to the table "Person", call them BirthCountry and BirthCity, and store the data in those two separate columns; (c) add two more columns to the table "Person", call them BirthCountryID and BirthCityID, create two more tables, "Country" (with columns CountryID and CountryName) and "City" (with columns CountryID, CityID, and CityName), and store the data in a more normalized fashion (see the sketch below). I am not going to explain here which way is better and why, because that does not seem to be the problem here. What seems to be the problem is that Agency A added their own additional tables, Agency B - their own, and Agency C - their own. Instead of hacking the database on their own, those agencies should have informed the developers (actually, the business systems analysts) about their needs. Those needs should have been translated into requirements, and those requirements should have been implemented in the next iteration of the software, thus preserving the overall architecture of the system. This is precisely how the spiral model (which, by the way, General Meyerrose advocated on C-SPAN[9]) is supposed to work... ideally.
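
    Here, for illustration, is what option (c) might look like (a minimal sketch using the hypothetical table and column names from my example above, nothing more):

        import sqlite3

        conn = sqlite3.connect(":memory:")
        conn.executescript("""
        CREATE TABLE Country (
            CountryID   INTEGER PRIMARY KEY,
            CountryName TEXT NOT NULL
        );
        CREATE TABLE City (
            CityID    INTEGER PRIMARY KEY,
            CountryID INTEGER NOT NULL REFERENCES Country(CountryID),
            CityName  TEXT NOT NULL
        );
        CREATE TABLE Person (
            PersonID       INTEGER PRIMARY KEY,
            FirstName      TEXT,
            LastName       TEXT,
            BirthDate      TEXT,
            BirthCountryID INTEGER REFERENCES Country(CountryID),
            BirthCityID    INTEGER REFERENCES City(CityID)
        );
        """)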

    At the same time, one should take into consideration that there may be extended time gaps between software product iterations. There may be situations when users need to store data NOW, not NEXT YEAR, and the database schema of the current software product iteration is unable to accommodate such needs. One can blame the business systems analysts for the inability to elicit the actual clients' needs, but I would rather blame them and the application designers for the inability to build extensibility into the system (even without explicit requirements to do so). Look at any off-the-shelf CRM, and you will see that most of them (at least the good ones) allow users to create different kinds of fields in addition to the default ones. How fundamentally different can a terrorist database be from a customer relationship (or constituent/employee/student/etc.) management system? The concept is pretty much the same: they store information about people (and allow users to add more attributes as needed), describe various relationships between people, people and documents, people and events, people and places, etc. Make the "terrorist management system" extensible by design in a similar fashion, and nobody will need to hack into the database itself and create "undocumented" tables.
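
    Just to show the idea (a very simplified sketch of the way many CRMs implement user-defined fields; I am not claiming TIDE works, or should work, exactly this way):

        import sqlite3

        conn = sqlite3.connect(":memory:")
        conn.executescript("""
        -- Users define new attribute types without touching the core schema.
        CREATE TABLE CustomField (
            FieldID   INTEGER PRIMARY KEY,
            FieldName TEXT NOT NULL UNIQUE,
            FieldType TEXT NOT NULL          -- 'text', 'date', 'number', ...
        );
        -- One row per (person, custom field) value.
        CREATE TABLE PersonFieldValue (
            PersonID INTEGER NOT NULL,
            FieldID  INTEGER NOT NULL REFERENCES CustomField(FieldID),
            Value    TEXT,
            PRIMARY KEY (PersonID, FieldID)
        );
        """)

        # An end user adds a "PhysicalMarks" field and fills it in:
        # no new tables, no hacking of the database schema.
        conn.execute("INSERT INTO CustomField VALUES (1, 'PhysicalMarks', 'text')")
        conn.execute("INSERT INTO PersonFieldValue VALUES (42, 1, 'left thumb is missing')")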

    End users should not think in terms of tables. Instead, they should think in terms of objects similar to real-life objects. A person is an object. A person's most basic attributes are: FirstName, LastName, BirthDate, SSN, etc.; more may be added as needed, for example PhysicalMarks (like "left thumb is missing" and/or "has a tattoo of a dragon on the right shoulder blade"; the latter can be linked to an image or images of the tattoo, if available). An organization is an object. Its most basic attributes might be: OrganizationName and OrganizationType (political party, business, school, etc.); again, more may be added as necessary. An event is an object. Its basic attributes most probably would be: EventLocation and EventDate, but one can add EventType (convention, rally, demonstration, meeting). A document (a memo, a letter, a photograph) is also an object. Objects are related to one another. Two people can be friends, siblings, or parent and child. A person can be a member of an organization, a student of a school, etc. A person can be an attendee of an event. An organization can be the organizer of, or a participant in, an event (which may indirectly link members of the organization to the event even if some of them did not actually attend it). People may belong to the same organization (which may indirectly link them even though they might not know one another). People may be related to a document as its authors, recipients, or subjects. And so on and so forth.

    This is a very crude explanation, but you probably got the general idea. How all of these objects are actually mapped to a database and in how many different tables the data "lives" is a totally different question, and it should not concern end users. All they need is available objects with universally applicable attributes (every person has a name and a birth date, every event has a location and a date, etc.) that can be extended as needed (including the creation of new object types). They should also be able to connect objects, i.e. define relationships between them. Relationships may also need to have some attributes, and users should be able to create new types of relationships as well. All of this they should be able to do through whatever user interface they have. In most cases, this will save them the trouble of writing up new requirements for every single extra data field they might need and then waiting for those to be implemented, or hacking the database. A sketch of such a schema follows.
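
    A minimal sketch of what such a generic object/relationship schema might look like (my own illustration; I am not claiming this is RAILHEAD's or TIDE's actual design):

        import sqlite3

        conn = sqlite3.connect(":memory:")
        conn.executescript("""
        -- Every real-life "thing" is an object of some user-extensible type.
        CREATE TABLE ObjectType (
            TypeID   INTEGER PRIMARY KEY,
            TypeName TEXT NOT NULL UNIQUE    -- 'Person', 'Organization', 'Event', 'Document'
        );
        CREATE TABLE Object (
            ObjectID INTEGER PRIMARY KEY,
            TypeID   INTEGER NOT NULL REFERENCES ObjectType(TypeID)
        );
        -- Attributes hang off objects; new ones can be added at any time.
        CREATE TABLE Attribute (
            ObjectID  INTEGER NOT NULL REFERENCES Object(ObjectID),
            AttrName  TEXT NOT NULL,         -- 'FirstName', 'EventDate', ...
            AttrValue TEXT
        );
        -- A relationship connects two objects and has its own extensible type.
        CREATE TABLE Relationship (
            FromObjectID INTEGER NOT NULL REFERENCES Object(ObjectID),
            ToObjectID   INTEGER NOT NULL REFERENCES Object(ObjectID),
            RelType      TEXT NOT NULL       -- 'member of', 'attended', 'author of', ...
        );
        """)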

"TIDE database relies on Structured Query Language (SQL), a cumbersome computer code that must utilize complicated sentence structures to query the tables, rows and columns that encompass the TIDE database. Without proper documentation on whether a table contains information on names, addresses, vehicles, license plates or an individual’s nationality, for instance, analysts have no valid mechanism to conduct a search of these “undocumented” tables."[5]

    Any normal human being can learn enough SQL to write SELECT statements with basic joins in about two weeks or less, so the complaint about the complexity of SQL is pretty much a bogus one. The question, however, is whether end users should be allowed to execute "raw" SQL queries at all. In most cases, most users should NOT. Most users should only be allowed to interact with a database through forms where they enter search criteria and apply filters (when they view data), enter new data, and edit or delete existing data. Behind it all there is still SQL, but it is hidden from the users.
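
    For the record, here is the kind of "complicated sentence structure" we are talking about (a self-contained toy example reusing the hypothetical Person/Country tables from my sketch above):

        import sqlite3

        conn = sqlite3.connect(":memory:")
        conn.executescript("""
        CREATE TABLE Country (CountryID INTEGER PRIMARY KEY, CountryName TEXT);
        CREATE TABLE Person  (PersonID INTEGER PRIMARY KEY, FirstName TEXT,
                              LastName TEXT, BirthCountryID INTEGER);
        INSERT INTO Country VALUES (1, 'Ruritania');
        INSERT INTO Person  VALUES (1, 'Jon', 'Doe', 1);
        """)

        # A typical analyst-style query with one join: hardly rocket science.
        for first, last, country in conn.execute("""
            SELECT p.FirstName, p.LastName, c.CountryName
            FROM Person AS p
            JOIN Country AS c ON c.CountryID = p.BirthCountryID
            WHERE p.LastName LIKE 'Doe%'
        """):
            print(first, last, country)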

    At the same time, one should take into consideration that it is very difficult to design a query form or a graphical query builder that is as flexible as a "raw" SQL statement. So, I believe that a very limited number of trained users should be allowed to enter and execute "raw" SELECT SQL queries. However, it does not really matter how good you are at writing SQL statements if you don't know which table contains what data.

“Existing TIDE data model is complex, undocumented, and brittle... which poses significant risk to RLSI [Railhead Lead System Integrator] data migration and modeling.”[5]

    The memo cites another document - "Technical Exchange Meeting", Bahama IDP, RAILHEAD, Monday, 16 July 2007 - which I was not able to obtain, but, I am sure, anyone would agree that data migration from a database that "has acquired a life of its own" is a sure recipe for disaster.

"'Pocket litter', for instance, the scraps of information obtained when law enforcement, military or other officials empty a suspect's pockets, including phone numbers, addresses or credit card information, is contained in 23 separate tables in TIDE, rather than one single uniform table."[5]

    This is a totally bogus complaint. Here is why. Let's imagine you want to make a database for your CD collection. Your first instinctive move would be to create "one uniform" table "CompactDisk" with the following columns: DiskID, DiskTitle, Artist, Genre. Seems logical, doesn't it? Now, if you start entering your CDs into the database, you will notice that you have CDs recorded by the same artist (e.g. "The Dream of the Blue Turtles" by Sting and "Soul Cages" by Sting). It means you have to enter "Sting" twice, but if you are a real fan of Sting and have most of his albums, you would have to do it some 20 or so times. Not only is it inefficient from the point of view of database design, but it also makes you, the user, do extra work and increases input errors (try typing "Creedence Clearwater Revival" or "Rage Against the Machine" 20 times and see how many typos you make). Wouldn't it be better to enter an artist's name once and then just pick it from a list? The answer is two tables: one for albums, the other for artists. And, yes, you will need one more for genres. And then, if you decide to store track titles, you will need one more. If you then decide to record when your friends borrow your CDs and return them, you will need one more table for friends and one more for check-in/check-out records. And so on. It is not as bad as it sounds, because relational databases provide the mechanism to link related tables together. There is also such a thing as "views" to make multiple tables look like one. If the application is correctly designed, end users should not even know how many tables they retrieve data from.
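
    For the curious, here is roughly what that looks like (a toy sketch, obviously; the view at the end makes the three tables look like the "one single uniform table" users instinctively expect):

        import sqlite3

        conn = sqlite3.connect(":memory:")
        conn.executescript("""
        CREATE TABLE Artist (ArtistID INTEGER PRIMARY KEY, ArtistName TEXT NOT NULL);
        CREATE TABLE Genre  (GenreID  INTEGER PRIMARY KEY, GenreName  TEXT NOT NULL);
        CREATE TABLE CompactDisk (
            DiskID    INTEGER PRIMARY KEY,
            DiskTitle TEXT NOT NULL,
            ArtistID  INTEGER REFERENCES Artist(ArtistID),
            GenreID   INTEGER REFERENCES Genre(GenreID)
        );
        INSERT INTO Artist VALUES (1, 'Sting');
        INSERT INTO Genre  VALUES (1, 'Rock');
        INSERT INTO CompactDisk VALUES (1, 'The Dream of the Blue Turtles', 1, 1);
        INSERT INTO CompactDisk VALUES (2, 'Soul Cages', 1, 1);

        -- "Sting" is stored once; the view joins everything back together.
        CREATE VIEW DiskCatalog AS
        SELECT d.DiskTitle, a.ArtistName, g.GenreName
        FROM CompactDisk d
        JOIN Artist a ON a.ArtistID = d.ArtistID
        JOIN Genre  g ON g.GenreID  = d.GenreID;
        """)

        for row in conn.execute("SELECT * FROM DiskCatalog"):
            print(row)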

In RAILHEAD (i.e. the new system), "users will no longer have access to data that will not be migrated [to the new system], such as pocket litter and border summaries."[5]

    The memo cites another document - "RAILHEAD: System Concept Definition (SCD), SCD NOL-J Gap Analysis," Final Version 1.0, Submitted: 18 June 2008," SRI International - that I have not been able to obtain. It is not clear why the decision not to migrate some of the data to the new system has been made: maybe, because the database has been extended in an uncontrolled way (see above) or, maybe, for some other reason.

The next several pages of the memo deal with the choice between an RDBMS-based and an XML-based solution. The memo claims, citing multiple documents, some of which I have been able to familiarize myself with, that the decision to build RAILHEAD on an XML platform from MarkLogic Corporation was a very bad one.[5] I am not going to reproduce the text of the memo here. Please go to the original document if necessary.

    First of all, I have to admit that I have been around relational databases for almost 20 years, so I am not impartial, but I am trying to be as objective as possible.

    An RDBMS is an engine that stores data and allows users (or rather, the application front end) to manipulate (retrieve, insert, update, and delete) data that satisfies certain conditions in a more or less standardized way (through SQL). It also takes care of data integrity (for example, if you try to delete an artist in the CD database I described above, and there are albums associated with him/her, the database engine will - depending on the configuration - either not delete anything or delete the artist and all of his/her albums). These are probably the most important, but not the only, features most modern RDBMS's provide. Others worth mentioning might be: the ability to compare old and new values before or after an update, strict datatypes (you will not be able to enter the name of the friend who is borrowing your CDs into the field for the check-out date), timestamps, etc.
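
    The data integrity example above, spelled out (a sketch using SQLite, which enforces foreign keys only if you explicitly switch the enforcement on):

        import sqlite3

        conn = sqlite3.connect(":memory:")
        conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only on request
        conn.executescript("""
        CREATE TABLE Artist (ArtistID INTEGER PRIMARY KEY, ArtistName TEXT NOT NULL);
        CREATE TABLE CompactDisk (
            DiskID    INTEGER PRIMARY KEY,
            DiskTitle TEXT NOT NULL,
            -- RESTRICT: refuse to delete an artist who still has albums.
            -- (ON DELETE CASCADE would delete the albums along with the artist.)
            ArtistID  INTEGER REFERENCES Artist(ArtistID) ON DELETE RESTRICT
        );
        INSERT INTO Artist VALUES (1, 'Sting');
        INSERT INTO CompactDisk VALUES (1, 'Soul Cages', 1);
        """)

        try:
            conn.execute("DELETE FROM Artist WHERE ArtistID = 1")
        except sqlite3.IntegrityError as e:
            print("The engine protected data integrity:", e)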

    Is it possible to make XML do all those things? XML does not do anything by itself. It is just a format, but it is possible to create an engine capable of doing anything a database does while storing data in XML format. The question, however, is why anyone would want to reinvent the wheel when there is a wide array of excellent, well-tested RDBMS's that already do all of those things. It is my understanding (based on the information provided on the company web site) that MarkLogic Corporation has built a database-like engine that handles XML. The engine is called MarkLogic Server. I have not used it, so I cannot say anything as to how good it is. Their web site lists a lot of features that sound quite impressive. In addition to the database-like features, they claim that, because they use XML, you don't have to struggle trying to squeeze your data into a pre-defined schema, because an XML document is self-describing.[13] But, as I have explained (or at least tried to explain) above, if designed intelligently, an RDBMS-based system can also be made pretty much infinitely extensible without hacking the database schema. Honestly, of the features listed on the web site of Mark Logic Corporation (at least the ones that are clearly described and do not sound like promises of miracles, and some of them do sound that way), I have not seen any that cannot be implemented on top of an RDBMS.
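
    To show what "self-describing" means in practice (a made-up record, of course): the field names travel with the data, so a record can carry fields no schema designer anticipated.

        import xml.etree.ElementTree as ET

        # A "self-describing" record: the tags themselves name the fields.
        record = ET.fromstring("""
        <person>
            <firstName>Jon</firstName>
            <lastName>Doe</lastName>
            <physicalMark>tattoo of a dragon on the right shoulder blade</physicalMark>
        </person>
        """)

        for field in record:
            print(field.tag, "=", field.text)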

    One could build an excellent solution, as well as a very bad one, on either platform. Considering that any data migration is a major undertaking that requires a lot of planning, time, and resources and can potentially lead to loss of data, it probably makes more sense to incrementally extend and improve the existing system (based on the feature requests and information about the shortcomings that should have been collected and translated into requirements since TIDE was deployed) than to make a leap to a totally new platform that does not seem to offer any tangible benefits.

    Another aspect one should take into consideration is how large the installation base of a product is. The reason is very simple: the larger the installation base, the more defects have been identified and - one would hope - fixed. Another reason is that there are more consultants available. The web site of Mark Logic Corporation has a fairly long list of companies that use their product. However, it does not come even close to the installation base of any popular RDBMS.

    I have not been able to find any objective third-party reviews of MarkLogic Server. Maybe I was not looking hard enough, but the only article I found reads like a typical paid promotional piece.[15]

RAILHEAD "failed to find non-exact matches for key entities, such as suspected terrorist's names."[5]

    It is not totally clear what kind of "non-exact matches" they are talking about. There may be several possible scenarios here: (a) searches with LIKE (e.g., "John" also returns "Johnson", "StJohn", "StJohnson", and the like), but in this case users have to know exactly what sub-string they are looking for; (b) searches using a phonetic algorithm (using SOUNDEX as an example, "John" will return "John", "Jon", "Jun", "Jane", "Joan", and even "J. On" or "Jo N."); unfortunately, phonetic algorithms based on English phonetics are pretty much useless when it comes to foreign names (let alone typos and cases like "Bob Smith" vs "Robert Smith" vs "R. J. Smith"); (c) a "fuzzy search" algorithm that better handles typos, name variants, and non-English names. There are fuzzy search algorithms available... even for free (see the sketch below). $.5 billion should be more than enough for some customization. One should understand, however, that any fuzzy search will inevitably return irrelevant matches, or "noise". The "fuzzier" the search, the more "noise" it returns.
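
    As a taste of option (c), here is the classic edit-distance approach (one of those freely available algorithms; a toy sketch, not production code):

        def levenshtein(a: str, b: str) -> int:
            """Minimum number of single-character edits turning a into b."""
            prev = list(range(len(b) + 1))
            for i, ca in enumerate(a, 1):
                curr = [i]
                for j, cb in enumerate(b, 1):
                    curr.append(min(prev[j] + 1,                 # deletion
                                    curr[j - 1] + 1,             # insertion
                                    prev[j - 1] + (ca != cb)))   # substitution
                prev = curr
            return prev[-1]

        # Names within edit distance 2 of the query count as "matches";
        # raising the threshold makes the search "fuzzier" (and noisier).
        names = ["Mohammed", "Muhammad", "Mohamed", "Mahmoud", "Smith"]
        query = "Mohamad"
        print([n for n in names if levenshtein(query.lower(), n.lower()) <= 2])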

    According to Mark Logic, their platform supports searches with wildcards. Also, "MarkLogic Server uses an advanced double meta-phone algorithm for expansion of query terms using similar sounding words".[14] Not another phonetic algorithm (see my previous post Terror Watch List Glitches)!

"Incredibly, it also failed to demonstrate the ability to use basic Boolean search terms, such as AND, OR and NOT."[5]

    According to Mark Logic, "MarkLogic Server supports Boolean constructs of “and”, “or”, “and not”, “not” and “or” that can be arbitrarily nested to develop advanced, ordered Boolean constructs."[14]

    The title of the document cited in the memo, "RAILHEAD: NOL-J Gap Analysis", seems to imply that this defect applies only to NCTC Online (NOL). Still, if this is true (and, honestly, it is hard even to believe), considering that the whole point of this type of software is searching, it does sound like a pretty bad bug.

The memo also mentions a fairly high number of other tests that failed, without going into the specifics of what the actual defects were.

    Although the number of defects identified is quite high, it does not necessarily mean that the software is bad. It would actually be more suspicious if no defects had been identified. One has to understand that there is no such thing as bug-free software and that finding and fixing bugs is a normal part of any software development process.

Conclusion... sort of...

I spent quite some time digging for information about TIDE and RAILHEAD. Still, I do not believe I know enough to draw any conclusions as to whether RAILHEAD's overall design is flawed or not and whether it will negatively affect counterterrorism data collection and analysis. I guess we will just have to wait and see.

Sources:


1. Jeanne Meserve, "Terrorist Watch List Woes", August 23, 2008, CNN (the video is not available anymore)

2. Siobhan Gorman, "Flaws Found In Watch List For Terrorists", August 22, 2008, WSJ.com

3. Stephanie Condon, "Fatal flaws found in terrorism database", August 22, 2008, C|NET News

4. Michael Krigsman, "Failed government IT: ‘The mother of all databases’", August 26, 2008, ZDNet

5. RE: Questions regarding technical flaws, poor government oversight and potential contractor mismanagement on the National Counterterrorism Center's RAILHEAD information technology development program, Staff Memo, August 21, 2008, Subcommittee on Investigations and Oversight, U.S. House of Representatives Committee on Science and Technology (the memo cannot be found on the web site, or anywhere else for that matter)

6. Ronald Kessler, "NCTC: Up to 70 Terrorist Plots Each Day", August 15, 2006, NewsMax.Com (UPDATE: the article appears to be unavailable)

7. Intelligence Reform and Terrorism Prevention Act of 2004 available from the website of U.S. Government Publishing Office

8. Terrorist Identities Datamart Environment (TIDE), Fact Sheet, National Counterterrorism Center (NCTC) (UPDATE: the document appears to be unavailable at its original location, but still can be downloaded from the website of the Office of the Director of National Intelligence)

9. C-Span Roundtable with Dale W. Meyerrose on Intelligence Information Sharing, January 3, 2007, C-SPAN

10. Introduction to Defense Acquisition Management, 7th edition, September 2005, Defense Acquisition University Press

11. Spiral model, Wikipedia.org, retrieved on 2008-10-16

12. Barry W. Boehm, "A Spiral Model of Software Development and Enhancement", May 1988 (not available at this URL anymore)

13. MarkLogic Server, Mark Logic Corporation (there is no "Marklogic Server" product anymore; the page is unavailable)

14. MarkLogic Server Search Features, MarkLogic Server 3.2, Mark Logic Corporation (there is no "Marklogic Server" product anymore; the document is unavailable)

15. Josh Cable, "Mark Logic technology key to success of Army’s Warrior Knowledge Base", June 23, 2008, GovPro.com
