
http://health2challenge.org/code-a-thon/washington-dc/

Someone sent a link about this to the DC Python Meetup Group (http://meetup.zpugdc.org/) a few weeks ago. It looked like fun and a way to learn about a new domain, so I signed up. I don't know whether any other Python folks were there; I didn't bump into any.

I didn't really know what to expect. I knew pretty close to nothing about the field. I wondered what technology would be used. It wasn't clear how teams would be assembled.

A major motivation of this event was to leverage a growing collection of health-related databases:

http://health2challenge.org/code-a-thon/data-resources/

The event was fun, if a bit chaotic. It was hard to find an appropriate team and contribute. I gather some teams had formed ahead of time, but as an outsider, there didn't seem to be any way to get hooked up ahead of time.

I spent some time brainstorming with one loose team that was interested in raising community-level awareness of the economic impact that health issues have on a community. Some of the ideas thrown around didn't seem very realistic. The "public" aren't likely to visit dedicated health policy sites or even play health policy games.

I suggested that a good way to reach people in communities might be through their community newspapers and web sites. The idea was to develop database-based content in the form of mini applications, possibly augmented by prose written by health professionals, that community newspapers could use. Making the content database-based meant that it could be relevant to the local community.

This idea was well received. This was a pleasant surprise, since it's actually kinda close to my day job.

I worked for a while on a prototype application that would provide a small bit of content of the form:

where obviously MYCOMMUNITY and MYSTATE are community specific and X, Y and Z are provided by a health database. We used data from http://services.healthindicators.gov. The idea is that this blurb would be published as an app that community newspapers could use to create content. The specific blurb was just a proof of concept.

The database provides SOAP and REST interfaces. I ended up using suds (http://pypi.python.org/pypi/suds) to access the SOAP interface. This was really easy:

from suds.client import Client
url = 'http://services.healthindicators.gov/v1/SOAP.svc?wsdl'
client = Client(url)

To get a list of all of the methods:

print(client)

To call a method:

client.service.SomeMethod()

(All of the methods in this API have camel-case names with initial upper case letters.)

Of course, since this is Python, I could do all of this interactively! (I say this for the benefit of any Health 2.0 folks reading this.) I was exploring the API within a few minutes. Nice!

For some reason, the API breaks most requests into pages. Each kind of request comes as three methods:

foo(some_args, page)

Get some data.

For example: GetLocales, GetIndicatorsByLocaleID, GetGenders.

fooCount(some_args)

Get the result count

For example: GetLocalesCount, GetIndicatorsByLocaleIDCount, GetGendersCount. (In case you're wondering, client.service.GetGendersCount() returns 2.)

fooPageCount(some_args)

Get the page count

For example: GetLocalesPageCount, GetIndicatorsByLocaleIDPageCount, GetGendersPageCount.

I ended up creating a helper function:

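def paged(client, name, *args):
    # Collect every page of results for the method called `name`.
    r = []
    service = client.service
    # Pages are numbered from 1; <name>PageCount says how many there are.
    for page in range(1, getattr(service, name+'PageCount')(*args)+1):
        # Each page arrives wrapped in a one-item sequence, hence the [0].
        r.extend(getattr(service, name)(*(args+(page, )))[0])
    return r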

(If you're paying close attention, you might be wondering about the [0] in the code above. For some reason, each "page" of data was returned by suds as a sequence object with one item containing a list of the actual data. I don't know if this is a quirk of the API or of suds.)

This allowed me, for example, to get all locales with:

locales = paged(client, 'GetLocales')

without having to deal with the paged data myself.
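Since the real service is slow, it can be handy to check the paging logic offline. Here is a minimal sketch that runs the same helper against a stand-in client. FakeService, FakeClient and the sample locale names are hypothetical, made up for illustration. Only the foo / fooPageCount naming, the 1-based page numbers and the one-item wrapper around each page come from what's described above.

```python
class FakeService(object):
    """Hypothetical stand-in for client.service; not part of suds."""

    def __init__(self, items, page_size=2):
        self._items = list(items)
        self._page_size = page_size

    def GetLocalesPageCount(self):
        # Ceiling division: number of pages needed to hold all items.
        return (len(self._items) + self._page_size - 1) // self._page_size

    def GetLocales(self, page):
        # Pages are 1-based. Wrap the page in a one-item sequence to
        # mimic the shape that suds handed back.
        start = (page - 1) * self._page_size
        return [self._items[start:start + self._page_size]]


class FakeClient(object):
    """Hypothetical stand-in for suds.client.Client."""

    def __init__(self, service):
        self.service = service


def paged(client, name, *args):
    # Same helper as above, repeated so this sketch runs on its own.
    r = []
    service = client.service
    for page in range(1, getattr(service, name + 'PageCount')(*args) + 1):
        r.extend(getattr(service, name)(*(args + (page,)))[0])
    return r


fake = FakeClient(FakeService(['DC', 'Arlington', 'Prince William']))
locales = paged(fake, 'GetLocales')
```

With a page size of 2, the three sample locales span two pages, so this exercises both the loop and the [0] unwrapping.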

As is to be expected, the database is challenging. Data are not uniformly available. Some data are available down to the county level, but other data aren't. For example, hospital readmission rates are available at the level of "Health Referral Region", which is typically (always?) much larger than a county. Different localities have different amounts of data. Prince William County has on the order of 300 health indicators available, while DC has around 10,000.

Speaking of "indicators", as with any domain, this one has confusing jargon. There were "indicator descriptions", like "Acute Hospital Readmission Rate" and "indicators", like "the value in Arlington is 17%". As it was explained to me, the indicator descriptions are the questions and the indicators are the answers. The answers are qualified and adjusted in various ways, probably based on whatever studies they came out of. I suspect that there will be lots of naive and misleading uses of this data. I hope these automated applications get some careful review by domain experts.

Using the database effectively requires either familiarity with the data or the ability to browse it quickly. The SOAP interface to the database is pretty slow and doesn't provide very targeted queries. For example, there's no way to request one type of indicator for a locale. You can pick an indicator and get data for all locales, or pick a locale and get all indicators for it. Getting all of the indicators for DC took several minutes. They're working on their search capabilities, so I'm sure this will improve over time.
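Until the service supports targeted queries, the narrowing has to happen on the client after fetching everything for a locale. Here is a rough sketch of that step. The Indicator record shape, its field names and the second sample record are hypothetical; I haven't checked them against what the API actually returns, so the describe argument lets you plug in whatever field holds the description.

```python
from collections import namedtuple

# Hypothetical record shape; the real API's field names may differ.
Indicator = namedtuple('Indicator', ['description', 'value'])


def indicators_matching(indicators, text, describe=lambda i: i.description):
    """Return the indicators whose description contains text.

    The match is a case-insensitive substring test. `describe` pulls
    the description out of a record, so it can be swapped to suit the
    real record type.
    """
    needle = text.lower()
    return [i for i in indicators if needle in describe(i).lower()]


# Made-up sample standing in for "all indicators for one locale".
sample = [
    Indicator('Acute Hospital Readmission Rate', 17),
    Indicator('Some Other Indicator', 3),
]
matches = indicators_matching(sample, 'readmission')
```

This only saves work after the slow fetch, of course. It does nothing about the several minutes spent getting the data in the first place.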

These sorts of databases will be used for a variety of applications, and run-time use of the databases will likely prove to be impractical. Taking snapshots is unattractive, as the data will be out of date. Probably, a download model with update subscriptions would be a better way to go. In other words, applications might be well served by downloading a database and either polling for updates or getting updates sent to them.
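The polling half of that idea can be sketched with a small local cache: keep a downloaded copy on disk and refetch only when it's older than some limit. This is just an illustration of the pattern, not anything the service provides. The load_cached function, the JSON file format and the age limit are all my own assumptions, and fetch must return plain JSON-serializable data (lists, dicts, strings, numbers).

```python
import json
import os
import time


def load_cached(path, fetch, max_age_seconds):
    """Return data from a local JSON file, refetching when it is stale.

    `fetch` is any zero-argument callable that returns
    JSON-serializable data. It is only called when there is no local
    copy or the copy is older than max_age_seconds.
    """
    try:
        age = time.time() - os.path.getmtime(path)
    except OSError:
        age = None  # No local copy yet.

    if age is None or age > max_age_seconds:
        data = fetch()
        with open(path, 'w') as f:
            json.dump(data, f)
        return data

    with open(path) as f:
        return json.load(f)
```

In practice, fetch would wrap one of the slow SOAP calls and convert the results to plain lists and dicts first. A push-style subscription would need support on the server side, so it isn't shown here.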

We decided to bail on our prototype because we didn't feel the data was local enough. This was a mistake! We should have finished the prototype. The actual data didn't matter. The presentation of the prototype would have been a good time to discuss the issues. Dang.

I wandered over to another team that was working with the same database. They were working on a system for looking at local policy decisions based on county government databases and connecting these to outcomes via the health indicators database. I think this is a cool idea and they were led by a domain expert who had a pretty definite idea of what he was trying to accomplish. I'm pretty sure that this will lead to success.

I was hoping to provide some help because I had gained some familiarity with the database. Unfortunately, they were bogged down accessing the database using some Java-based SOAP interface. Gaaaa. Their Java programmer was obviously good, but he was still using Java. Most of the developers were just sitting around waiting for the Java programmer. I tried to explain some of the issues with the data, but the Java programmer was just too busy hacking Java. I ended up learning the Google chart API so I could help them eventually display the data.

I eventually got bored and left early. I wish, in hindsight, I'd finished the prototype I was working on. Hopefully, this blog will be useful and make up for this a little bit. :)

I wouldn't mind doing this again, especially if I could hook up with a team ahead of time. I'd even be willing to finish that prototype if there was interest. I can't spend too much time on this though, as I have too many other interesting projects.