Data and Daycare
Finding daycare, especially for an infant, is increasingly difficult in my county. One driving factor is that 16 in-home daycares closed between 2019 and 2021.
Childcare Aware of Kansas provides a yearly snapshot of child care options for each county in Kansas. That’s where I found the drop of 16 daycares in Ellis County. What they don’t cover is the statewide trend, so I had no way of knowing whether that drop was common throughout the state or whether we were an outlier.
I wasn’t about to open all 105 county pdfs on the website and write down their numbers to compare, so it was time to bust out some Python.
My first thought was to scrape the page holding all of the county pdfs. I ran a find_all with BeautifulSoup, but none of the pdf links came back. The page had the pdfs buried in a gnarly table laid out like Kansas, so that must have been throwing the search off.
Instead of trying to find the right combination of tags and elements to scrape with, I looked at the url for a single pdf page and saw that it consistently ended with ?county=‘insert_county’. So I quickly created a .txt file with all of the Kansas county names, looped through the file while using f-strings to insert each county at the appropriate spot, and downloaded all 105 pdfs in a matter of seconds.
PDFs are a hard format to work with. The strategy you used for your last project might not translate to your current one. That’s what happened to me here. I had used PyPDF2 previously to read text from fillable pdfs, which worked wonderfully. When I attempted to use that module to extract the text this time, nothing was returned.
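For anyone following along, a minimal sketch of that loop could look like the following. It is not the original script: the base URL is a placeholder (the real address is not given here), the function names are made up for illustration, and the call to quote() is an extra safety step for names containing spaces.

```python
from pathlib import Path
from urllib.parse import quote
from urllib.request import urlretrieve

# Hypothetical base URL: the real address is not given in this post.
BASE_URL = "https://example.org/county-snapshot"


def build_url(county: str, base_url: str = BASE_URL) -> str:
    """Return the pdf URL for one county, ending in ?county=<name>."""
    # quote() escapes spaces and other unsafe characters in the county name.
    return f"{base_url}?county={quote(county)}"


def download_all(county_file: str, out_dir: str) -> None:
    """Download one pdf per county listed (one name per line) in county_file."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    for line in Path(county_file).read_text().splitlines():
        county = line.strip()
        if not county:
            continue  # skip blank lines in the .txt file
        urlretrieve(build_url(county), str(target / f"{county}.pdf"))
```

Calling download_all("counties.txt", "pdfs") would then fetch every county in the list, assuming the placeholder URL is swapped for the real one.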
Cue a lot of Googling and landing on pdfminer.six.
This nifty tool was able to extract the pdf text... in one long continuous string. Not ideal.
I had to take a deep breath and add ‘import re’ to the beginning of my file.
After a quick refresher on regex from Automate the Boring Stuff, I was able to extract the portion of the pdf I needed to see the change of in-home daycare providers. Of course, the three years of data were also being grabbed as one long string.
Cue head scratching over how to break up that string of digits.
The answer I came up with was if/elif statements based on the length of the string. The length of the string indicated how it should be split up to correspond to the correct year.
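As a rough illustration, pulling the text out with pdfminer.six can be as short as the sketch below. The extract_text function comes from pdfminer.high_level; the wrapper names and the whitespace cleanup are assumptions added for this example, not part of the original script.

```python
def pdf_to_text(path: str) -> str:
    """Return all text in the pdf at `path` as a single string."""
    # Imported here so the helper below still works without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    return extract_text(path)


def collapse_whitespace(text: str) -> str:
    """Squash runs of spaces and newlines into single spaces."""
    return " ".join(text.split())
```

Something like collapse_whitespace(pdf_to_text("Ellis.pdf")) would give one tidy string to search, where "Ellis.pdf" is a hypothetical file name.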
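A hedged sketch of that regex step is below. The sample text and the label in the pattern are invented for illustration, since the real pdf wording is not shown here; the idea is only to capture the run of digits that follows a known label.

```python
import re

# Made-up stand-in for the extracted pdf text; the real wording is not shown in this post.
sample_text = "Licensed Centers 12 13 12 Family Child Care Homes 252321 Total 37"

# Hypothetical label: swap in whatever phrase precedes the numbers in the real pdf.
pattern = re.compile(r"Family Child Care Homes\s*(\d+)")


def extract_numbers(text):
    """Return the run of digits following the label, or None if it is absent."""
    match = pattern.search(text)
    return match.group(1) if match else None


numbers = extract_numbers(sample_text)
```

Returning None when the label is missing makes it easy to spot a county whose pdf did not match the pattern.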
Here are two examples:
if len(numbers) == 6:
    # Six digits: two digits for each year.
    county_dict = {'County': county, '2019': numbers[:2], '2020': numbers[2:4], '2021': numbers[4:]}
    dict_copy = county_dict.copy()
    county_listDict.append(dict_copy)
elif len(numbers) == 9:
    # Nine digits: three digits for each year.
    county_dict = {'County': county, '2019': numbers[:3], '2020': numbers[3:6], '2021': numbers[6:]}
    dict_copy = county_dict.copy()
    county_listDict.append(dict_copy)
You’ll also notice that I was putting the data into a dictionary, copying the dictionary, and then appending that to a list. This was so I could eventually write it all to a csv file.
Back to the splitting of a string.
Those were the first two cases I wrote, based on how I knew the numbers from my county and the largest county in the state would need to be handled. When I ran the program, I realized that I was missing about 40% of the county data because not all of the strings were either 6 or 9 characters long.
I had to write if statements for every string length from 3 to 9. This doesn’t actually seem like the cleanest way to handle the data. Consider the strings 8910 and 9109. The first shows continued growth from 8 to 9 to 10. The latter shows a gain then a loss from 9 to 10 to 9. Handling 9109 in the same way as 8910 would leave me with 9, 1, 09, instead of the correct 9, 10, 9.
Having looked through the csv that I ultimately generated, it doesn’t seem like that happened, but it certainly could in another data set.
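One possible way around that ambiguity, sketched here as an alternative and not what the original script did, is to list every way of cutting the string into three numbers and keep the split with the smallest year-to-year change. This is only a heuristic: it assumes counts move gradually, and it can still guess wrong for a county with a big swing. The function names are made up for the example.

```python
from itertools import combinations


def candidate_splits(digits: str):
    """Yield every way to cut `digits` into three numbers without leading zeros."""
    for i, j in combinations(range(1, len(digits)), 2):
        parts = (digits[:i], digits[i:j], digits[j:])
        # Reject pieces such as "09"; a lone "0" is still allowed.
        if all(p == "0" or not p.startswith("0") for p in parts):
            yield tuple(int(p) for p in parts)


def smoothest_split(digits: str):
    """Pick the split with the smallest total year-to-year change (a heuristic)."""
    return min(
        candidate_splits(digits),
        key=lambda s: abs(s[1] - s[0]) + abs(s[2] - s[1]),
    )
```

With this approach, smoothest_split("8910") gives (8, 9, 10) and smoothest_split("9109") gives (9, 10, 9). A string shorter than three digits has no valid split, so min() would raise a ValueError in that case.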
After that hurdle was cleared, all that was left was to create a csv file with my list of dictionaries then use pandas to run a tiny bit of analysis to get a better overview of the in-home daycare situation across Kansas.
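A small sketch of that last step is below. The rows are made-up examples in the same shape as the list of dictionaries, and the sketch sticks to the standard library csv module for both writing and the quick total, where the real analysis used pandas.

```python
import csv
import io

# Made-up rows in the shape of the list of dictionaries described above.
rows = [
    {"County": "Example A", "2019": "25", "2020": "23", "2021": "21"},
    {"County": "Example B", "2019": "9", "2020": "10", "2021": "9"},
]


def write_csv(records, fileobj):
    """Write the list of dictionaries to an open file object as csv."""
    writer = csv.DictWriter(fileobj, fieldnames=["County", "2019", "2020", "2021"])
    writer.writeheader()
    writer.writerows(records)


def net_change(records):
    """Total 2021 count minus total 2019 count across all records."""
    return sum(int(r["2021"]) - int(r["2019"]) for r in records)


buffer = io.StringIO()  # stands in for open("daycare.csv", "w", newline="")
write_csv(rows, buffer)
```

Writing to a real file only needs the StringIO buffer swapped for an opened file; "daycare.csv" is a hypothetical name.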
Every step along the way to getting the data that I wanted was enjoyable. Seeing a problem, thinking through possible solutions, trying, failing, revising, and then watching the code run correctly is a thrill that Python continually gives me.
FYI: Kansas has lost 390 in-home daycares since 2019. Apparently it's not just our county losing childcare options. View all the data by clicking the link below.