|
People have been asking for ages: "Why use flat text files and Fortran/Python/IDL to process climate data?" Why not put the station files into a database instead: a searchable, relational (preferably third normal form) database? As E.M. Smith put it, "Flat files are so 1970...". There are now several relational database systems that could be used, including MS Access, SQLite, MySQL and PostgreSQL.
Well, here's the score on one of them.
Background
Back in October 2009, Verity Jones of the Digging in the Clay blog was getting frustrated by her limited ability to process the GISS dataset. She'd downloaded the GHCN v2.mean.z file, which is an input for GISTemp, and had the idea of putting it into MS Access so that she could at least search for files quickly, but she didn't get much past importing the first few large tables, partly due to other priorities. She'd been given a lot of encouragement from others, and about a month later an email from TonyB introduced her to computer professional KevinUK (that's yours truly!), who also had an interest in creating a database.
After a few exchanges a plan was formed. By her own admission, Verity's lack of knowledge and ability was limiting her vision, and the database we've now created between us, together with its associated utilities, goes way beyond what she initially thought was possible. Verity is a typically modest person who underestimates her (IMO) very considerable abilities. I can tell you that but for Verity's input the TEKTemp database would not exist.
The Technical Stuff
The source for the GISS (and, it seems, the majority of the Hadley CRUtemp dataset) climate analysis is the set of NCDC Global Historical Climatology Network (GHCN) files: aggregated files of temperature data from stations worldwide. I initially downloaded the GHCN v2.mean.z and v2.mean_adj.z files from here, then imported and normalised this raw mean and adjusted mean temperature data, along with all the country code and station inventory data, into an MS Access database.
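For readers who want a feel for the import step, here is a minimal sketch in Python. It is not the TEKTemp code: it uses SQLite (via the standard `sqlite3` module) as a stand-in for MS Access, and it assumes the v2.mean fixed-width layout of an 11-character station ID (country code, WMO number, modifier), a 1-digit duplicate number, a 4-digit year, then twelve 5-character monthly values in tenths of a degree C with -9999 for missing. The table name `monthly_mean` and the function names are hypothetical.

```python
import sqlite3

MISSING = -9999  # assumed sentinel for a missing month in v2.mean


def parse_v2_mean_line(line):
    """Split one fixed-width record into (station_id, duplicate, year, months).

    months is a list of 12 values in degrees C, with None for missing months.
    The column positions are an assumption about the v2.mean layout.
    """
    station_id = line[0:11]      # country code (3) + WMO number (5) + modifier (3)
    duplicate = int(line[11])    # duplicate series number
    year = int(line[12:16])
    months = []
    for m in range(12):
        raw = int(line[16 + 5 * m: 21 + 5 * m])
        months.append(None if raw == MISSING else raw / 10.0)
    return station_id, duplicate, year, months


def load_records(lines, conn):
    """Store one row per station/duplicate/year/month; missing months are skipped."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS monthly_mean ("
        " station_id TEXT, duplicate INTEGER, year INTEGER,"
        " month INTEGER, temp_c REAL,"
        " PRIMARY KEY (station_id, duplicate, year, month))"
    )
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        sid, dup, year, months = parse_v2_mean_line(line)
        rows = [(sid, dup, year, m + 1, t)
                for m, t in enumerate(months) if t is not None]
        conn.executemany("INSERT INTO monthly_mean VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
```

Storing one row per month, rather than one row per year with twelve columns, is what makes the table searchable by month and keeps missing values out of the data altogether.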
I then emulated the NCDC/GISS methods for combining overlapping duplicate station series to form a single station series for each WMO station code/imod combination. GISS combined/unadjusted and combined/homogenised data have now been added in their own database (this is the updated set after GISS started using USHCN data in mid-November). Eventually other data sets can be added. The database ("TEKTemp") is freely available: anyone who is interested in becoming a user may register here*. I've since adapted the software to chart and tabulate these raw/adjusted temperatures; it can plot graphs and create tables rapidly from the data stored in the database.
Having done self checks to ensure that all the GHCN data had been imported correctly, I then did some 'unit testing' to demonstrate that there are no serious errors in the code I'd written to combine the duplicate series. After that I could be confident of any results that TEKTemp produces for the nearly 4,500 WMO station code/imod combinations. Not a bad achievement in the space of a couple of weeks.
Data Quality
I should say a little about data quality. In trying to emulate the seasonal and annual means, Verity and I realised there was something odd about the way both NCDC GHCN and GISS calculate their annual mean temperatures when data is missing for one or more individual months within the year. I'm quite fastidious (some would say overly so) about using unmanipulated data as far as possible, and my pragmatic response to this issue has been to exclude from the analysis any year in which one or more months of data are missing. This has undoubtedly altered trends slightly, but it does ensure that only the actual data available for the station, and not some 'filled in product', is used when fitting trends to the raw and adjusted station temperature data.
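The exclusion rule is simple enough to sketch in a few lines of Python. This is an illustration rather than the TEKTemp code; the function name and the input shape (a dict mapping each year to a list of twelve monthly values, with None for a missing month) are assumptions made for the example.

```python
def annual_means_complete_years(monthly):
    """Return {year: annual mean} using only years with all 12 months present.

    monthly maps year -> list of 12 monthly means in degrees C, None if missing.
    Years with any missing month are dropped rather than filled in.
    """
    result = {}
    for year, values in monthly.items():
        if len(values) == 12 and all(v is not None for v in values):
            result[year] = sum(values) / 12.0
    return result
```

For example, a station with a full 1990 and a 1991 that lacks December would contribute an annual mean for 1990 only.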
In fitting temperature trends for each WMO station code/imod combination, I've fitted a first order polynomial (i.e. linear regression) trend to the combined series for all available station data post-1880. For both raw and adjusted data I've fitted a trend line only where there are at least 20 years (not necessarily consecutive) of data available from 1880 to 2010 for a given station. Maps for NCDC GHCN V2 data are here. Maps for GISS V2 data are here.
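As a rough illustration of the trend fit, the sketch below computes an ordinary least-squares line through the annual means and returns nothing when fewer than 20 years are available. It is not the TEKTemp code: the function name, the dict input and the choice to treat 1880 and 2010 as inclusive bounds are all assumptions for the example.

```python
MIN_YEARS = 20  # minimum number of usable years, per the rule described above


def linear_trend(annual, start=1880, end=2010, min_years=MIN_YEARS):
    """Fit temp = slope * year + intercept by ordinary least squares.

    annual maps year -> annual mean in degrees C. Years need not be consecutive.
    Returns (slope, intercept), or None if too few years fall in [start, end].
    """
    points = [(y, t) for y, t in annual.items() if start <= y <= end]
    n = len(points)
    if n < max(min_years, 2):
        return None  # not enough data for a meaningful fit
    mean_x = sum(y for y, _ in points) / n
    mean_y = sum(t for _, t in points) / n
    sxx = sum((y - mean_x) ** 2 for y, _ in points)
    sxy = sum((y - mean_x) * (t - mean_y) for y, t in points)
    slope = sxy / sxx  # degrees C per year
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

Multiplying the slope by 100 gives the trend in degrees C per century, which is often an easier figure to compare between stations.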
|