Data.gov Should Be a Library and a Factory

On August 8, Aneesh Chopra and Cass Sunstein called upon the public to share their ideas for the United States Open Government National Plan. In their call to action, they asked three key questions:

  • How can regulations.gov, one of the primary mechanisms for government transparency and public participation, be made more useful to the public rulemaking process?
  • OMB is beginning the process of reviewing and potentially updating its Federal Web Policy. What policy updates should be included in this revision to make Federal websites more user-friendly and pertinent to the needs of the public?
  • How can we build on the success of Data.Gov and encourage the use of democratized data to build new consumer-oriented products and services?

In this post, I will discuss the Data.gov question. At its core, my post is less about encouraging the use of democratized data and more about improving the utility of that data. The vision is to correct upstream problems with data collection in order to improve the downstream quality and utility of the data. I’ll also discuss specific areas where Data.gov should lead the way in democratizing particular data sets more effectively, which could spur innovation in areas like participatory budgeting and public involvement with policy. So far, we have focused on the results of government policy (spending) without the context of the policy and performance factors that drive those actions, and this must change.

Data.gov Isn’t Just for Citizens

In the National Plan, the White House should address how Data.gov can serve government better, thereby serving citizens and businesses better. Remember when Recovery.gov had all those problems with self-reported data, which showed money flowing to supposedly non-existent Congressional districts? This should never happen in the 21st century.

How could Data.gov have helped here? Well, Data.gov could serve as the Master Data Manager for the Federal government. If Data.gov had an authoritative data source that converted addresses into Congressional Districts, the software being used by reporting entities and the software being used by Recovery.gov could have implemented on-demand data validation services – kicking out “bad” data as exceptions and correcting the information upstream.
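To make the validation idea concrete, here is a minimal sketch of how such a check might work. Everything in it is an assumption for illustration: the lookup table, the placeholder ZIP codes and district labels, and the names AwardReport and validate are hypothetical, and a real service would resolve full addresses against an authoritative source rather than a hard-coded dictionary.

```python
from dataclasses import dataclass

# Hypothetical authoritative lookup: ZIP code -> set of valid districts.
# The values are made-up placeholders, not real ZIP codes or districts.
AUTHORITATIVE_DISTRICTS = {
    "00001": {"XX-01"},
    "00002": {"XX-01", "XX-02"},
}


@dataclass
class AwardReport:
    recipient: str
    zip_code: str
    district: str  # self-reported by the recipient


def validate(reports):
    """Split reports into accepted records and exceptions with a reason."""
    accepted, exceptions = [], []
    for report in reports:
        valid = AUTHORITATIVE_DISTRICTS.get(report.zip_code)
        if valid is None:
            exceptions.append((report, "unknown ZIP code"))
        elif report.district not in valid:
            exceptions.append((report, "district does not match ZIP code"))
        else:
            accepted.append(report)
    return accepted, exceptions
```

The exceptions list is what would be sent back upstream to the reporting entity for correction, instead of being published as-is.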

What would have been even better? Rather than an on-demand validation service, Data.gov’s Master Data Management function could provide an on-demand “fill-in-the-blank” service. This would proactively eliminate the need for self-reporting information that should be available in some lookup table.
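A “fill-in-the-blank” service could look something like the sketch below. The table, its placeholder values, and the function name enrich are all hypothetical. The sketch also assumes, for simplicity, that each ZIP code maps to exactly one district, which a real service could not rely on.

```python
# Hypothetical lookup table: ZIP code -> congressional district.
# Placeholder values only; a real service would need the full address
# wherever a ZIP code spans more than one district.
DISTRICT_BY_ZIP = {
    "00001": "XX-01",
    "00002": "XX-02",
}


def enrich(submission):
    """Return a copy of the submission with the district filled in.

    The submitter never supplies the district; it is looked up instead.
    Unknown ZIP codes yield None so they can be reviewed later.
    """
    enriched = dict(submission)
    enriched["district"] = DISTRICT_BY_ZIP.get(submission["zip_code"])
    return enriched
```

The design point is that the form asks only for what the submitter actually knows, and the derived field comes from one shared source.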

The end result? Government data quality would improve, and developers would be happier.

Data.gov Can Address the Information Collection Burden

A good portion, but not all, of the Federal government’s data is collected in accordance with statutory requirements. Sometimes, when agencies issue rules, they have to collect information that will help them enforce that rule – so they have to estimate the burden of collecting that information (OMB has guidelines on how to do that).

However, agencies frequently don’t review the full breadth of data that the government already holds, so duplicate collection abounds. Duplication undermines the authoritativeness of any single source and introduces the kind of variability that causes downstream quality and consumption problems. Data.gov can change all of that. Imagine if an agency were able to search – truly search – the fields and database holdings of the entire government. Imagine if agencies were able to leverage master data management services to cut down on the burden their forms create – collecting the minimum necessary information and augmenting submissions with additional information later.
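As a rough illustration of what searching fields across holdings could mean, the sketch below checks a proposed form against a field-level catalog. The catalog, the agency and dataset names, and the function find_existing_fields are hypothetical, and matching on exact field names is a simplifying assumption; real holdings would need shared definitions to match fields that are named differently.

```python
# Hypothetical field-level catalog: (agency, dataset) -> field names.
CATALOG = {
    ("Agency A", "grant_awards"): ["recipient_name", "recipient_address", "award_amount"],
    ("Agency B", "contractor_registry"): ["recipient_name", "tax_id"],
    ("Agency C", "facility_inspections"): ["facility_id", "inspection_date"],
}


def find_existing_fields(proposed_fields):
    """Map each proposed field to the holdings that already collect it.

    Fields that no holding collects are left out of the result.
    """
    matches = {}
    for field in proposed_fields:
        holders = [key for key, fields in CATALOG.items() if field in fields]
        if holders:
            matches[field] = holders
    return matches
```

An agency drafting a new form could run its proposed fields through a check like this before estimating the collection burden, and drop or pre-fill anything the government already has.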

In short, Data.gov is an untapped resource in relating information collection, information dissemination, and the government’s data architecture. Efforts around these activities must be redoubled to demonstrate the value of this service to the business of government.

Data.gov Must Help the Statistical Community

Once upon a time, the government worked very hard to develop an authoritative source of its data holdings. The initiative was called FedStats. For reasons that are not quite clear, the major statistical agencies of the Federal government are not fully participating in Data.gov, and FedStats remains an anomaly on the Web. Some of the government’s most powerful and influential data sets are maintained by the federal statistical community, and it’s time to get them fully on board with Data.gov.

Data.gov Should Power Transparency Sites

Visit an agency Web site. Look for their budget books. Look for their Performance and Accountability Report. If you can find them, you’ll likely be greeted by a PDF. These days, at least the PDF will be searchable, but this still isn’t going to be very helpful.

Data.gov should be the primary source of underlying data in both the budget and performance reporting areas. Much effort was expended around initiatives like StratML, a markup language for agency strategic plans. As OMB develops performance.gov, which is still not live, it’s imperative that it embrace the principles of the Open Government Directive and build the infrastructure to release the underlying data behind performance and budget information in an open format. Simply collecting information and presenting sparklines, or embedding Excel graphs into 100+ PDF reports, doesn’t meet the standard of open government, and OMB should lead the way when it comes to opening these data sets.

Data.gov Is Not Just for Developers

I have seen some calls for Data.gov to treat developers as its primary audience. Data.gov should be for anyone who wants to use information to inform: developer, journalist, researcher, statistician, economist, policy analyst, social worker, local government, state government, non-profit action organization. There is no single infomediary to rule them all; that's the whole point of democratizing data. All of those people should be served by the most powerful consumption-oriented platform available.

Data.gov should be a library AND a factory. Significant investments have been made to upgrade the capabilities of this important resource, ensuring that government can provide data in a manner that suits an on-the-go, dynamic consumption model. Yet the more data that is registered in the Data.gov catalog, the harder any of it becomes to find. Librarians must be engaged early on to develop the right taxonomies that will support a sustainable, searchable catalog filled with findable, important data.