Along with our developer, I have spent a lot of my time this week testing several book API providers to see which one might be a good fit for us.
What are my goals with this data?
- Know which genres/subgenres a book belongs to
- Know what age children’s books are aimed at (including middle/YA)
- Know the first date published for the book (I don’t care about editions, I just want to know when it was first published)
- Possibly improve our topic detection engine (using Library of Congress data)
- Determine the readership level for each book
What have I learned so far?
The data is super messy… which makes sense as it is entered by humans and isn’t really verified by anyone. That is ok, and eventually, we will put together a librarian system to crowdsource improvements. I also aim to give authors control over their books as we get further down the road.
Genres and Subgenres
This is a big one and everything looks good.
Now we are doing big data pulls to figure out whether we are going to use BISAC or THEMA (or, if we are feeling really crazy, some combination of the two). I really like THEMA’s qualifier setup, and I think it is superior to BISAC, but I am not sure whether American books use it correctly.
I love categorization structures, so I have been nerding out a bit as I dig into this.
Ages For Children’s Books
This is a really important one and I am disappointed with the data so far. My hope was I could use the data to map out books per age, but I am not sure if this is possible yet.
I am going through another data dump today to test. I am hoping most children’s books have THEMA data for age range as that appears to be the only place to get this information. I am also going to test a second provider to see if they have better data sources.
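Here is a rough sketch of how a data dump could be checked for age coverage. It assumes that THEMA interest-age qualifiers are the codes starting with "5A", and that each book's subject codes arrive as a plain list of strings. The function names, the book IDs, and the sample codes are all made up for illustration; a real provider's response will look different.

```python
def age_qualifiers(subject_codes):
    """Return the codes that look like THEMA interest-age qualifiers.

    Assumption: interest-age qualifiers are the codes beginning with "5A".
    """
    return [code for code in subject_codes if code.upper().startswith("5A")]


def age_coverage(books):
    """Return the fraction of books that carry at least one age qualifier.

    `books` maps a book ID to its list of subject codes (hypothetical shape).
    """
    if not books:
        return 0.0
    with_age = sum(1 for codes in books.values() if age_qualifiers(codes))
    return with_age / len(books)


# Illustrative sample only, not real provider data.
sample_books = {
    "book-1": ["YFB", "5AG"],
    "book-2": ["YFB"],
    "book-3": ["5AK", "YNM"],
    "book-4": [],
}

coverage = age_coverage(sample_books)
print(f"{coverage:.0%} of sampled books have an age qualifier")
```

Running the same check against each provider's dump would give a simple number to compare them by.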
Date First Published
I was hoping it would be easy to get the first date published for a book… or at least that we could do something fancy where we pulled the related ISBNs for a given ISBN and then determined which version was the oldest. So far this is not looking good.
For example, Dune was first published in 1965, but if you look up a new edition of the book, the data will say it was published in 2015. So, that means if we really want this data we will have to add more ISBNs manually under a book so that we have a better chance of determining the first published date…
We are still testing but I think this is a bust.
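For reference, the "oldest related edition" idea above can be sketched like this. It is only a sketch: the `editions` structure, the `published` field, and the placeholder ISBN strings are hypothetical, not any provider's actual response format.

```python
from datetime import date
from typing import Optional


def earliest_publication(editions: list[dict]) -> Optional[date]:
    """Return the oldest known publication date across a book's editions.

    Editions with no date are skipped. Returns None if no edition has one.
    """
    dates = [e["published"] for e in editions if e.get("published") is not None]
    return min(dates) if dates else None


# Illustrative sample only: placeholder ISBNs and made-up dates.
editions = [
    {"isbn": "isbn-a", "published": date(2015, 1, 1)},
    {"isbn": "isbn-b", "published": date(1990, 1, 1)},
    {"isbn": "isbn-c", "published": None},
]

print(earliest_publication(editions))
```

The weakness is the one described above: the result is only as old as the oldest edition we happen to have on file, so a book whose first edition is missing will still get the wrong date.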
Improve Topic Detection
This looks like a huge win, as we should be able to map the Library of Congress data to Wikidata topics in a 1:1 relationship. This will drastically improve our topic detection and I am super excited about this :)!
Determine Readership Level
This also looks good, and the data should allow us to one day filter books and book lists by general readers versus academic readers and so on.