Python script for downloading and organizing historical API data

Hey all,

I just finished this Python script for accessing historical data through the API. It pulls data for all monitors in a registered group, creates a file structure to hold the downloads, and compiles the data into a single CSV for each monitor. It’s currently set up to download the files in a format very similar to what the ThingSpeak download page used to produce (primary and secondary A and B channels). It also makes files similar to the format that comes off the SD cards. This way, if you’re like me and had a bunch of analysis scripts set up to work with the old ThingSpeak files, you don’t need to change them much.

This script also makes a folder that collects all of the compiled secondary A and B files so you can run checks on the data. I’ll post an R script soon that performs the A and B channel comparisons and provides metrics for sensor performance.

Big thanks to Zuber Farooqui, who provided the base code that I adapted into this script. I’m very new to Python, so I don’t think I would have figured it out otherwise.

Feel free to tweak and use as needed.

Also, in case anyone isn’t aware, you need your own API read key to make this work (line 28), and it needs to have approval to access historical endpoints.

There are also 2 locations in the code where you’ll need to set your own directory:

Lines 67 and 239

You also need to set the start and end dates:

Lines 34 and 35

and the group number:

Line 38
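For anyone who would rather keep those edits in one place, here is a minimal sketch of how the settings could be grouped. This is not how the posted script is laid out; the DownloadSettings class, its attribute names, and the example values are all hypothetical. It also assumes the dates are meant as UTC and converts them to UNIX timestamps.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path


@dataclass
class DownloadSettings:
    """Hypothetical container for the values you need to edit before running."""
    api_read_key: str   # your own read key, approved for historical endpoints
    group_id: int       # the registered group number
    start_date: str     # "YYYY-MM-DD", interpreted as UTC in this sketch
    end_date: str       # "YYYY-MM-DD", interpreted as UTC in this sketch
    output_dir: Path    # directory where downloaded files should go

    def timestamps(self) -> tuple[int, int]:
        """Return the start and end dates as UNIX timestamps (seconds, UTC)."""
        fmt = "%Y-%m-%d"
        start = datetime.strptime(self.start_date, fmt).replace(tzinfo=timezone.utc)
        end = datetime.strptime(self.end_date, fmt).replace(tzinfo=timezone.utc)
        if end <= start:
            raise ValueError("end_date must be after start_date")
        return int(start.timestamp()), int(end.timestamp())


# Placeholder values only; substitute your own key, group, dates, and folder.
settings = DownloadSettings(
    api_read_key="YOUR-READ-KEY",
    group_id=1234,
    start_date="2023-01-01",
    end_date="2023-02-01",
    output_dir=Path("purpleair_downloads"),
)
start_ts, end_ts = settings.timestamps()
```

Keeping everything in one object means a mistyped date fails immediately with a clear error instead of partway through a long download.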

I am a hardware designer by vocation, not a programmer, so I find writing code to be “difficult” work. Please excuse me if the answers to my questions would be obvious to a programmer.
Does your Python script output PM2.5 values for each sensor and provide all the same corrections that result from a download from the PurpleAir web page?
What version of Python (on a Windows 10 PC) does your script need to run properly? In the past I have run into incompatibilities between code and specific versions of Python.
Thanks for your help.

Hey Bob, this script downloads the PurpleAir CF1 data, which is the standard data that PA monitors report. If you want to change that, you can put different field names into the field lists on lines 86, 95, 104, 113, and 122. The full list of fields you can query is available on the API website:

https://api.purpleair.com/#api-groups-get-member-history
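To give a rough idea of where a field list ends up, here is a small sketch that only builds the URL and header for a member history request and sends nothing. The build_history_request function is a hypothetical helper, not part of the posted script. The base URL, the /history/csv path, the X-API-Key header, and the parameter names are my understanding of the API and should be checked against the page linked above, as should any field names other than pm2.5_alt.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.purpleair.com/v1"  # assumed base URL


def build_history_request(group_id, member_id, read_key, fields,
                          start_timestamp, end_timestamp, average=0):
    """Return (url, headers) for a group member history request.

    Nothing is sent here; the caller decides how to make the request.
    """
    query = urlencode(
        {
            "start_timestamp": start_timestamp,  # UNIX seconds, UTC
            "end_timestamp": end_timestamp,      # UNIX seconds, UTC
            "average": average,                  # minutes; 0 means real-time
            "fields": ",".join(fields),          # comma-separated field names
        },
        safe=",",  # keep the commas readable in the query string
    )
    url = f"{API_ROOT}/groups/{group_id}/members/{member_id}/history/csv?{query}"
    headers = {"X-API-Key": read_key}
    return url, headers


# Placeholder IDs and key; swap the field list to change what gets downloaded.
url, headers = build_history_request(
    group_id=1234,
    member_id=5678,
    read_key="YOUR-READ-KEY",
    fields=["pm2.5_alt"],
    start_timestamp=1672531200,
    end_timestamp=1672617600,
    average=60,
)
```

With the requests module installed, the pair can be passed to requests.get(url, headers=headers). Changing what is downloaded is then just a matter of editing the fields list.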

If you are looking for corrected PM2.5 data, I believe that the pm2.5_alt field is Lance’s corrected PM2.5 data, which has the most documentation associated with it. I correct my own data from the particle number fields, so I primarily utilize the data in the “secondary” data files.

I’m running this code on Windows 10 with Python 3.11. I believe you’ll need to install a couple of modules to run this script if you don’t already have them (specifically the requests module). I just had my undergraduate assistant do it on his computer, and it seemed to work easily enough. There could be issues if you’re running an older version of Python that isn’t compatible with some of these modules.

Bob and Aaron–

Aaron is right about the pm2.5_alt variable being my corrected PM2.5 data. This algorithm avoids both of the proprietary Plantower algorithms, CF_1 and CF_ATM. Instead, as Aaron states, it uses only the secondary output, specifically the smallest four of the six number fields: >0.3 um, >0.5 um, >1 um, and >2.5 um. Of course, one needs to subtract each value from the one before it to get N1, N2, and N3 for the 0.3-0.5 um, 0.5-1 um, and 1-2.5 um size categories contributing to PM2.5. Then there is a simple equation to determine PM2.5, which PurpleAir has implemented to give the PM2.5 estimate directly.
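The subtraction step can be sketched as follows. The function and argument names are made up for illustration, and the inputs are assumed to be the cumulative particle counts for the four size thresholds. The final mass equation is not reproduced here since the pm2.5_alt field already provides that result.

```python
def size_bin_counts(n_gt_0_3, n_gt_0_5, n_gt_1_0, n_gt_2_5):
    """Convert cumulative counts (particles larger than each size) into
    per-bin counts N1, N2, N3 for the sizes contributing to PM2.5."""
    n1 = n_gt_0_3 - n_gt_0_5  # 0.3-0.5 um
    n2 = n_gt_0_5 - n_gt_1_0  # 0.5-1 um
    n3 = n_gt_1_0 - n_gt_2_5  # 1-2.5 um
    return n1, n2, n3


# Made-up cumulative counts, purely to show the arithmetic.
n1, n2, n3 = size_bin_counts(1000.0, 300.0, 50.0, 5.0)
print(n1, n2, n3)  # 700.0 250.0 45.0
```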

Hi Aaron,

What is the maximum average_time limit supported, please?

That goes back to Zuber’s original script. He has the following options for the average time:

0 (real-time), 10 (default if not specified), 30, 60

So I guess the maximum would be hourly (60 minutes). Of course, you can always average it any way you want after the download. That’s why I always pull the real-time data.

The maximum averaging time goes up to 24 hours (1440 minutes). I did not include that in the script because I don’t recommend using it: the data are in UTC, so it would generate 24-hour averages based on UTC days, which may not be useful for your location. I suggest getting hourly averages, converting them to your time zone, and then calculating the 24-hour average.
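A minimal sketch of that approach, using only the standard library. The local_daily_means function is a hypothetical helper, the hourly values are made up, and a fixed UTC-8 offset stands in for a real time zone. The input is assumed to be (UNIX timestamp, value) pairs from an hourly download.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def local_daily_means(hourly, local_tz):
    """Average hourly (utc_timestamp, value) pairs by local calendar day."""
    by_day = defaultdict(list)
    for ts, value in hourly:
        # Interpret the UNIX timestamp in the local zone, then keep the date.
        local_day = datetime.fromtimestamp(ts, tz=local_tz).date()
        by_day[local_day].append(value)
    return {day: sum(vals) / len(vals) for day, vals in sorted(by_day.items())}


# Fixed offset used for illustration only; it ignores daylight saving time.
utc_minus_8 = timezone(timedelta(hours=-8))
start = 1672560000  # 2023-01-01 08:00 UTC, which is local midnight at UTC-8
# 48 made-up hourly values: 10.0 for the first local day, 20.0 for the second.
hourly = [(start + 3600 * i, 10.0 if i < 24 else 20.0) for i in range(48)]
daily = local_daily_means(hourly, utc_minus_8)
```

Grouping the same values by UTC date would mix the two local days together, which is the problem described above. For a real location, zoneinfo.ZoneInfo with a region name such as "America/Los_Angeles" can be passed as local_tz to handle daylight saving time; on Windows that may require installing the tzdata package.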

Thanks!

Hi Aaron,

what do you mean when you write “it needs to have approval to access historical endpoints”?

How does one get approval?

Thanks,
Marcello

Marcello, you need an API read key to access any data through the PA API, but in order to access historical data you need additional permissions. You can gain these by emailing PA support and requesting them. Until you have these permissions, you won’t be able to run this script.
