Building Shepherd - Topic Pages, NLP, and computer magic.

Jul 23, 2021

1,453 authors have turned in a book list for Shepherd or committed to writing one (amazing 😃).

July 12th - July 23rd.

Topic Pages, NLP, and computer magic...

Topic pages focus on big topics like World War 2, American Presidents, or Depression. They are one of the most crucial features on the website, as they give readers a starting point for wandering through our virtual bookshelves.

One of the things I have dreamed about doing is to break down the topics into smaller slices to help people discover things they might not otherwise know about. I want to find new and unique ways to help people explore knowledge and books.

For example, I would love to show key groups in WW2 and help people find books about them, whether that is women in WW2, the Dutch resistance, spies, or the French army.

Here is a mockup showing how I imagine that might work for the World War 2 topic page and showing key people, key events, key groups, and key locations of that war.

Imagined feature showing topics within a World War 2 topic page.

As you can imagine, it is near impossible to do this as a human.

Not only would it be time-consuming to manually analyze each book, but it is hard to be consistent in quality. We already have 6,000 books in our database, and by the end of 2022 I hope to have 30,000+.

I was hoping we could use something called NLP (Natural Language Processing) to analyze the data we have around each individual book and assign topics automatically. NLP uses a kind of computer magic called machine learning and only recently has this become an option. Not only would this be efficient but it should be consistent (as only a machine can be).

How does NLP work?

NLP is trained to read text and identify what the text is about. That might sound simple but the technology behind it is incredible. It can analyze the book's title, recommendations, and description and tell you that it thinks the book is about Abraham Lincoln and the Battle of Gettysburg.

I also want to connect these topics to Wikipedia data. This would allow us to use the Wikipedia hierarchy and pull down data about the topic.

How is testing going?

Our amazing developer had a breakthrough this week and we are feeling confident that this approach is feasible (especially for our small budget).

Here is an example of a very early test... we ran some data about a book entitled Jefferson's Daughters: Three Sisters, White and Black, in a Young America. Here are the results of what the machine thought this book was about (each line shows the topic, its Wikidata ID in parentheses, and then the strength of the connection):

Sally Hemings (Q257464) 0.24686605
Philosophers (Q6997215) 0.20841492
United States (Q1410960) 0.20841492
American people (Q3919762) 0.20841492
Thomas Jefferson (Q7036420) 0.20841492
Presidencies (Q11708087) 0.20841492
American politicians (Q7029555) 0.20841492
Activists (Q7062576) 0.20841492
American political philosophy (Q8247581) 0.20841492
American philosophers (Q7040585) 0.20841492

Now you can see this is still pretty rough and we have a lot to improve, but this is still amazing.
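To make the result format above concrete, here is a minimal Python sketch that reads those lines into a small structure and keeps only the stronger matches. The names `TopicMatch`, `parse_matches`, and `top_matches`, and the 0.21 cutoff, are my own assumptions for illustration; this is not how our developer's pipeline actually works.

```python
import re
from typing import NamedTuple


class TopicMatch(NamedTuple):
    """One topic guess: label, Wikidata ID, and connection strength."""
    label: str
    wikidata_id: str
    score: float


# Matches lines shaped like: "Sally Hemings (Q257464) 0.24686605"
LINE_PATTERN = re.compile(r"^(?P<label>.+?) \((?P<qid>Q\d+)\) (?P<score>\d*\.\d+)$")

# The ten result lines from the early test, copied as-is.
RAW_RESULTS = """
Sally Hemings (Q257464) 0.24686605
Philosophers (Q6997215) 0.20841492
United States (Q1410960) 0.20841492
American people (Q3919762) 0.20841492
Thomas Jefferson (Q7036420) 0.20841492
Presidencies (Q11708087) 0.20841492
American politicians (Q7029555) 0.20841492
Activists (Q7062576) 0.20841492
American political philosophy (Q8247581) 0.20841492
American philosophers (Q7040585) 0.20841492
"""


def parse_matches(raw: str) -> list[TopicMatch]:
    """Turn the raw text listing into TopicMatch records, skipping odd lines."""
    matches = []
    for line in raw.strip().splitlines():
        found = LINE_PATTERN.match(line.strip())
        if found is None:
            continue  # ignore anything that does not fit the expected shape
        matches.append(
            TopicMatch(found["label"], found["qid"], float(found["score"]))
        )
    return matches


def top_matches(matches: list[TopicMatch], min_score: float) -> list[TopicMatch]:
    """Keep matches at or above min_score, strongest first."""
    kept = [m for m in matches if m.score >= min_score]
    return sorted(kept, key=lambda m: m.score, reverse=True)


if __name__ == "__main__":
    # 0.21 is an arbitrary illustrative cutoff, not a tuned value.
    for match in top_matches(parse_matches(RAW_RESULTS), min_score=0.21):
        print(match.label, match.wikidata_id, match.score)
```

With that cutoff only the Sally Hemings line survives, since the other nine share the same lower score. That is a simple way to see how rough the early results still are.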

Why is this amazing?

If we know that this book is about Sally Hemings we can pull up the Wikidata entry for her and know in an automated fashion that she is human, her gender, her place of birth, the dates she lived, her family details, her ethnic group, her social classification (enslaved), and more.
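As a rough illustration of that lookup, here is a hedged Python sketch that fetches a Wikidata entry as JSON and pulls out a few of those facts. It assumes Wikidata's public `Special:EntityData` JSON endpoint and its standard property IDs (P31, P21, P19, P569, P570, P172). The function names and the User-Agent string are hypothetical, and this is not our actual implementation.

```python
import json
from urllib.request import Request, urlopen

# Public per-entity JSON endpoint on Wikidata.
WIKIDATA_ENTITY_URL = "https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"

# A handful of Wikidata properties that cover the facts mentioned above.
PROPERTIES = {
    "P31": "instance of",
    "P21": "sex or gender",
    "P19": "place of birth",
    "P569": "date of birth",
    "P570": "date of death",
    "P172": "ethnic group",
}


def fetch_entity(qid: str) -> dict:
    """Download one Wikidata entity (needs network access)."""
    request = Request(
        WIKIDATA_ENTITY_URL.format(qid=qid),
        # Hypothetical User-Agent; Wikimedia asks clients to identify themselves.
        headers={"User-Agent": "topic-page-sketch/0.1 (illustrative example)"},
    )
    with urlopen(request) as response:
        # Assumes the ID is not a redirect, so the response is keyed by qid.
        return json.load(response)["entities"][qid]


def summarize_claims(entity: dict) -> dict[str, list[str]]:
    """Collect raw values for the properties listed in PROPERTIES."""
    summary: dict[str, list[str]] = {}
    for property_id, name in PROPERTIES.items():
        values = []
        for claim in entity.get("claims", {}).get(property_id, []):
            datavalue = claim.get("mainsnak", {}).get("datavalue")
            if datavalue is None:
                continue  # "unknown value" / "no value" statements
            if datavalue["type"] == "wikibase-entityid":
                values.append(datavalue["value"]["id"])  # another Q-ID
            elif datavalue["type"] == "time":
                values.append(datavalue["value"]["time"])  # timestamp string
        if values:
            summary[name] = values
    return summary


if __name__ == "__main__":
    # Q257464 is the ID shown for Sally Hemings in the test results above.
    print(summarize_claims(fetch_entity("Q257464")))
```

Most values come back as further Wikidata IDs (for example, "instance of" pointing at the entry for "human"), so a real version would need a second lookup to turn those into readable labels.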

Why is this cool?

Long term, I want to use this data to help people discover books in a myriad of new and innovative ways. For example, we could build a timeline of notable Black American women, and build an experience where people can browse along that timeline and meet those women. Not only could we show when they were born, a picture, and a short bio of who they were but we can also show connected books and book lists.

There are a lot of possibilities here and I am very excited :). This is what I am hoping to play with next year as we get further along, and I can start creating more unique approaches after we get basic topic pages, search, and a new frontpage shipped.

PS. I use the word magic around NLP, but to be clear there is a TON of ongoing work to create, maintain, and train a model like this.

Quick updates for the last 2 weeks...

Shepherd now has 944 published pages on the website, and we should hit 1,000 very soon. I am behind on publishing this week, but working to catch up today.

Jessy is doing awesome and caught up on the backlog of authors who had responded but we hadn't followed up with, as well as all the authors who asked us to check back in the future.

I am working with the designer to finalize designs for the topic pages, search feature, and the new front page. As soon as I have those I will share those mockups.

We are starting to receive more referrals from authors and publicists. That is amazing and helps augment the outreach we do. Thank you!

What else is going on?

I turned 40 years old! Very cool to be entering a new age decade, and I had an amazing birthday lunch and celebration with my wife and son :)

Have a great weekend,

Thanks, Ben

Shepherd & Book DNA Build Diary
