As you can imagine, it is near impossible to do this as a human.
Not only would it be time-consuming to analyze each book manually, but it would also be hard to keep the quality consistent. We already have 6,000 books in our database, and I hope to have 30,000+ by the end of 2022.
I was hoping we could use something called NLP (natural language processing) to analyze the data we have for each book and assign topics automatically. NLP relies on a kind of computer magic called machine learning, and it has only recently become an option. Not only would this be efficient, but it should also be consistent (as only a machine can be).
How does NLP work?
An NLP model is trained to read text and identify what the text is about. That might sound simple, but the technology behind it is incredible. It can analyze a book’s title, recommendations, and description and tell you that it thinks the book is about Abraham Lincoln and the Battle of Gettysburg.
I also want to connect these topics to Wikipedia data. This would allow us to use the Wikipedia hierarchy and pull down data about the topic.
How is testing going?
Our amazing developer had a breakthrough this week and we are feeling confident that this approach is feasible (especially for our small budget).
- Sally Hemings (Q257464) 0.24686605
- Philosophers (Q6997215) 0.20841492
- United States (Q1410960) 0.20841492
- American people (Q3919762) 0.20841492
- Thomas Jefferson (Q7036420) 0.20841492
- Presidencies (Q11708087) 0.20841492
- American politicians (Q7029555) 0.20841492
- Activists (Q7062576) 0.20841492
- American political philosophy (Q8247581) 0.20841492
- American philosophers (Q7040585) 0.20841492
As you can see, this is still pretty rough and we have a lot to improve, but it is still amazing.
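To give a feel for what we can do with a list like the one above, here is a minimal Python sketch that parses each line into a label, a Wikidata ID, and a score, then keeps only the topics above a cutoff. The `Topic` type, the function names, and the 0.22 cutoff are all made up for illustration; this is not our developer's actual code.

```python
import re
from typing import List, NamedTuple


class Topic(NamedTuple):
    label: str    # human-readable topic name
    qid: str      # Wikidata identifier, e.g. "Q257464"
    score: float  # score reported by the model


# Matches lines shaped like "- Label (Q123) 0.456".
LINE = re.compile(
    r"^-?\s*(?P<label>.+?)\s+\((?P<qid>Q\d+)\)\s+(?P<score>\d*\.?\d+)\s*$"
)


def parse_topics(text: str) -> List[Topic]:
    """Turn the raw list into Topic records, skipping lines that do not match."""
    topics = []
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            topics.append(
                Topic(match["label"], match["qid"], float(match["score"]))
            )
    return topics


def top_topics(topics: List[Topic], min_score: float) -> List[Topic]:
    """Keep topics at or above min_score, highest score first."""
    kept = [t for t in topics if t.score >= min_score]
    return sorted(kept, key=lambda t: t.score, reverse=True)


# Three lines copied from the results above.
SAMPLE = """\
- Sally Hemings (Q257464) 0.24686605
- Philosophers (Q6997215) 0.20841492
- Thomas Jefferson (Q7036420) 0.20841492
"""

topics = parse_topics(SAMPLE)
# 0.22 is a hypothetical cutoff chosen only for this example.
best = top_topics(topics, min_score=0.22)
for topic in best:
    print(topic.label, topic.qid, topic.score)
```

With a cutoff like this, only the Sally Hemings line survives, since the other topics all share a lower score. Picking a good cutoff is exactly the kind of tuning we still have to do.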
Why is this amazing?
If we know that this book is about Sally Hemings, we can pull up the Wikidata entry for her and learn, in an automated fashion, that she is human, along with her gender, her place of birth, the dates she lived, her family details, her ethnic group, her social classification (enslaved), and more.
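Here is a rough sketch of what reading those facts could look like in Python. It assumes the entity has already been loaded as a dictionary in Wikidata's JSON layout (claims keyed by property ID, each with a `mainsnak`). The property IDs (P31, P21, and so on) and Q5 for "human" are real Wikidata identifiers, but the helper names and the tiny hand-written sample record are only illustrative, and this is not necessarily how our pipeline does it.

```python
from typing import Any, Dict, List

# Wikidata property IDs mapped to our own shorthand labels.
PROPERTIES = {
    "P31": "instance of",
    "P21": "sex or gender",
    "P19": "place of birth",
    "P569": "date of birth",
    "P570": "date of death",
    "P172": "ethnic group",
}

HUMAN = "Q5"  # the Wikidata item for "human"


def claim_values(entity: Dict[str, Any], pid: str) -> List[str]:
    """Return the plain values stored under one property of an entity."""
    values = []
    for statement in entity.get("claims", {}).get(pid, []):
        snak = statement.get("mainsnak", {})
        if snak.get("snaktype") != "value":
            continue  # "somevalue" / "novalue" snaks carry no datavalue
        value = snak["datavalue"]["value"]
        if isinstance(value, dict) and "id" in value:
            values.append(value["id"])    # link to another item, e.g. "Q5"
        elif isinstance(value, dict) and "time" in value:
            values.append(value["time"])  # timestamp string
        else:
            values.append(str(value))
    return values


def summarize(entity: Dict[str, Any]) -> Dict[str, List[str]]:
    """Collect the properties we care about that are present on the entity."""
    summary = {}
    for pid, label in PROPERTIES.items():
        values = claim_values(entity, pid)
        if values:
            summary[label] = values
    return summary


def is_human(entity: Dict[str, Any]) -> bool:
    return HUMAN in claim_values(entity, "P31")


def item_statement(pid: str, qid: str) -> Dict[str, Any]:
    """Build a minimal item-valued statement for the sample below."""
    return {
        "mainsnak": {
            "snaktype": "value",
            "property": pid,
            "datavalue": {
                "type": "wikibase-entityid",
                "value": {"entity-type": "item", "id": qid},
            },
        }
    }


# Hand-written sample in the Wikidata JSON shape, NOT fetched from Wikidata.
SAMPLE_ENTITY = {
    "id": "Q257464",
    "claims": {
        "P31": [item_statement("P31", "Q5")],       # instance of: human
        "P21": [item_statement("P21", "Q6581072")], # sex or gender: female
    },
}

print(is_human(SAMPLE_ENTITY))
print(summarize(SAMPLE_ENTITY))
```

The values that come back are themselves Wikidata IDs, so each one can be looked up the same way to get its label and its own facts.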
Why is this cool?
Long term, I want to use this data to help people discover books in a myriad of new and innovative ways. For example, we could build a timeline of notable Black American women and an experience where people can browse along that timeline and meet those women. Not only could we show when they were born, a picture, and a short bio, but we could also show connected books and book lists.
There are a lot of possibilities here and I am very excited :). This is what I am hoping to play with next year as we get further along; I can start creating more unique approaches after we ship basic topic pages, search, and a new front page.
PS. I use the word magic around NLP, but to be clear there is a TON of ongoing work to create, maintain, and train a model like this.