Easiest way to download historical data for research

Hi PA Community,

I am a policy researcher at RAND, and we deployed sensors in Santa Monica (n=25) and Pittsburgh (n=18) back in 2018. I am currently conducting research evaluating the performance of PA sensors and have a couple of queries. I would appreciate your responses:

a) With the recent API changes, what is the best way to retrieve data from our previously deployed sensors (Santa Monica: Jan 2018 - Dec 2020; Pittsburgh: March 2019 - Dec 2020)? I prefer CSV over JSON.

b) Is there a way to get data from all PA sensors within the Santa Monica and Pittsburgh city areas? I have used Python scripts for historical data before, but usage now seems limited. That would be very helpful for correlating the density of low-cost monitors with their accuracy relative to reference monitors (e.g., 10-minute data for 3 days).

I’d appreciate any leads, or help from someone at PA with retrieving historical data from our deployed sensors for analysis.

Thanks,
Jalal

Hi Jalal, there are a few different ways to collect sensor data.

  1. Data can be downloaded online. The following link will give you more information: Download Sensor Data.

  2. You can use the historical API endpoints (see the Python sketch after this list). Currently, only one sensor can be queried at a time for historical data. However, we ask that you keep requests low. This API is not intended for collecting large amounts of data from a high volume of sensors. EDIT: Larger amounts of data can now be collected through the historical endpoints without issue.

  3. We may be able to provide the data. If you need a very large data set (several thousand sensors), it may be preferable to contact us so that we can provide the data directly. This helps keep API usage low.
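For item 2, here is a minimal sketch of a single-sensor historical request in Python. The /history/csv endpoint and the start_timestamp, end_timestamp, average, and fields parameters follow the documentation at api.purpleair.com, but verify them there before use; the API key and sensor index are placeholders, and each request covers a limited time span, so a multi-year range needs to be split into chunks.

```python
import requests

API_KEY = "YOUR-READ-API-KEY"   # placeholder: your PurpleAir read key
SENSOR_INDEX = 12345            # placeholder: one of your sensor indexes

# One sensor at a time; /history/csv returns CSV directly (use /history for JSON).
url = f"https://api.purpleair.com/v1/sensors/{SENSOR_INDEX}/history/csv"
params = {
    "start_timestamp": 1514764800,  # 2018-01-01 00:00 UTC (Unix seconds)
    "end_timestamp": 1609459200,    # 2021-01-01 00:00 UTC
    "average": 60,                  # averaging period in minutes
    "fields": "pm2.5_atm,humidity,temperature",
}
resp = requests.get(url, headers={"X-API-Key": API_KEY}, params=params, timeout=60)
resp.raise_for_status()

with open(f"sensor_{SENSOR_INDEX}_history.csv", "w") as f:
    f.write(resp.text)
```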

You can collect data from several sensors at once using the real-time API. Individual sensors can be added to a group, and you can then query that group to collect the data. You can also use a latitude/longitude bounding box to get the data for all sensors within that region.
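As a sketch of the bounding-box approach: the nwlat/nwlng/selat/selng parameters on the /v1/sensors endpoint are documented at api.purpleair.com, while the Santa Monica corner coordinates below are rough placeholders.

```python
import requests

API_KEY = "YOUR-READ-API-KEY"  # placeholder

# Bounding box roughly covering Santa Monica (approximate corner coordinates).
params = {
    "fields": "name,latitude,longitude,pm2.5_atm",
    "nwlat": 34.06, "nwlng": -118.55,  # north-west corner
    "selat": 33.98, "selng": -118.43,  # south-east corner
}
resp = requests.get("https://api.purpleair.com/v1/sensors",
                    headers={"X-API-Key": API_KEY}, params=params, timeout=30)
resp.raise_for_status()
payload = resp.json()

# The response is columnar: "fields" lists the column names and "data"
# holds one row per sensor (sensor_index is returned as the first column).
for row in payload["data"]:
    print(dict(zip(payload["fields"], row)))
```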

You can query the real-time API every minute or so, then store that data in your own database. Using the real-time API is more cost-effective and less server-intensive for PurpleAir than the historical API. We recommend this strategy.
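A minimal polling sketch along those lines, reusing the bounding box from the previous example and appending each snapshot to a local SQLite file. data_time_stamp is the server-side timestamp the API returns with each response; the interval and schema here are illustrative choices, not PurpleAir recommendations.

```python
import sqlite3
import time

import requests

API_KEY = "YOUR-READ-API-KEY"  # placeholder
POLL_SECONDS = 120             # keep the polling interval modest

db = sqlite3.connect("purpleair.db")
db.execute("""CREATE TABLE IF NOT EXISTS readings (
    ts INTEGER, sensor_index INTEGER, pm25 REAL,
    PRIMARY KEY (ts, sensor_index))""")

params = {
    "fields": "pm2.5_atm",
    "nwlat": 34.06, "nwlng": -118.55,  # same Santa Monica box as above
    "selat": 33.98, "selng": -118.43,
}

while True:
    resp = requests.get("https://api.purpleair.com/v1/sensors",
                        headers={"X-API-Key": API_KEY}, params=params, timeout=30)
    resp.raise_for_status()
    payload = resp.json()
    ts = payload["data_time_stamp"]    # timestamp of this snapshot
    cols = payload["fields"]           # e.g. ["sensor_index", "pm2.5_atm"]
    i_idx = cols.index("sensor_index")
    i_pm = cols.index("pm2.5_atm")
    db.executemany(
        "INSERT OR IGNORE INTO readings VALUES (?, ?, ?)",
        [(ts, row[i_idx], row[i_pm]) for row in payload["data"]])
    db.commit()
    time.sleep(POLL_SECONDS)
```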


Thanks for the prompt response, Ethan. Since we’ll need more granular (hourly or sub-hourly) data covering more than a year, I’d appreciate it if you could put me in touch with someone from PA for access. That seems like the most efficient approach, and it won’t overwhelm the API.

The way the EPA has handled the research community downloading larger, gigabyte-size model datasets is to provide them as files on Google Drive. It would be great to see PurpleAir provide this kind of API alternative, with larger historical files available by year/region or some other useful temporal/spatial division. Making these historical files available on Google Drive (or Amazon, etc.) would put the network load on those services rather than on PurpleAir’s servers or infrastructure.


I wonder if some PA data would qualify to be hosted on AWS Open Data: https://registry.opendata.aws/
