NLP-based book recommendation

Francisco ESPIGA
10 min read · Dec 16, 2021

Building AI-based products from PoC to production-ready

Overview

I love reading books, but turning the last page of one can be distressing, not only because we leave the wonderful world the author has created, but also because it means looking for the next story to immerse ourselves in.

Wouldn’t it be great if we could outsource this decision-making to someone… or something? In this series of articles, I will explain how to create our own recommendation engine for books, using Deep Learning models, and following a product-oriented mindset.

The goal is to walk the path of AI-based product creation, from the first baby steps to, progressively, building a more complex solution. Each stage of the journey will build on the previous one:

  • In the first article of the series, we will create a PoC to validate our approach, using NLP. You can find all the necessary code in my GitHub repository.
  • In the second article of the series, we will build an app using Streamlit as the front end and FastAPI as the back end.
  • In the third article, we will migrate the data to an Elasticsearch database.
  • In the last article, we will create a Telegram bot to ask for suggestions.

So, let’s get started!

What is a PoC?

A PoC, or proof of concept, is, in non-product jargon, something that has just enough functionality to show that an idea works.

It is important not to confuse PoCs with MVPs. MVP stands for minimum viable product, and it goes beyond the idea itself, providing a simplified version of the end-to-end product.

Of course, there are MVPs in many flavors and it can get more complex than that, but we will not dive deeper into that.

Our PoC

Our PoC will consist of three building blocks: the database, the query translator, and the candidate retrieval backend.

  • The database: we will transform our raw data, using an NLP model to add the embedding of each book description. If you want to know more about embeddings, and their importance, you can check this article. The processed data will be stored as a jsonlines file.
  • The query translator: using the same NLP model we relied on to build the database, we will encode the text query of a user to create the query embedding.
  • The candidate retrieval backend will take the query embedding and compute the distance against the database entries, to retrieve the top k nearest neighbors.

The database

The data that we have used for this project is the Book Repository dataset, which is publicly available on Kaggle. It consists of several .csv files. The ones we are interested in are dataset.csv, authors.csv, and categories.csv.

We will process the data to replace the authors and categories arrays, which hold ids, with the corresponding text from the other two CSV files. Moreover, we will compute two embeddings per book: the embedding of its description and the average embedding of its categories. We will not use the category embedding in our PoC, but it might be useful for future iterations.
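The id replacement and the category averaging described above can be sketched in a few lines. This is only an illustration under assumptions: the lookup tables, the toy row, and the helper names `resolve_ids` and `mean_vector` are hypothetical, and the sketch assumes the id arrays arrive as strings such as "[1, 2]", which may not match the real CSV layout.

```python
import ast

def resolve_ids(raw_ids, id_to_name):
    """Turn a stringified id list such as "[1, 2]" into a list of names."""
    ids = ast.literal_eval(raw_ids) if raw_ids else []
    # Skip ids that have no entry in the lookup table
    return [id_to_name[i] for i in ids if i in id_to_name]

def mean_vector(vectors):
    """Element-wise average of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Toy lookup tables standing in for authors.csv and categories.csv
authors = {1: "H. G. Wells", 2: "John Green"}
categories = {10: "Science fiction", 11: "Classics"}

# Toy row standing in for one line of dataset.csv (assumed layout)
row = {"title": "War of the Worlds", "authors": "[1, 2]", "categories": "[10, 11]"}

record = {
    "title": row["title"],
    "authors": resolve_ids(row["authors"], authors),
    "categories": resolve_ids(row["categories"], categories),
}

# Made-up 2-d category vectors; real ones would come from the NLP model
category_vectors = [[1.0, 2.0], [3.0, 4.0]]
category_embedding = mean_vector(category_vectors)
```

In a real run, the category vectors would be produced by encoding each category name with the same model used for the descriptions.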

Finally, we will store the data in a jsonlines file and the embeddings in a cloudpickle file. The original dataset has more than 1 million samples, so for a PoC we will work with only 10% of the data (100,000 examples) to keep time and resource usage manageable.

We draw this subset at random, so that it stays diverse even if the original data is sorted alphabetically.
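As a rough sketch of the sampling and storage step, the snippet below draws a reproducible 10% subset and writes it out. The file names, the helper names, and the toy data are assumptions for illustration. The standard-library `pickle` module stands in for cloudpickle here, whose `dump` function is called in the same way.

```python
import json
import pickle
import random
import tempfile
from pathlib import Path

def sample_rows(rows, fraction=0.1, seed=0):
    """Return a reproducible random subset holding `fraction` of the rows."""
    rng = random.Random(seed)
    return rng.sample(rows, round(len(rows) * fraction))

def save_database(rows, embeddings, folder):
    """Write one JSON object per line, plus a pickle with the embeddings."""
    folder = Path(folder)
    with open(folder / "books.jsonl", "w", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps(row, ensure_ascii=False) + "\n")
    with open(folder / "embeddings.pkl", "wb") as fh:
        pickle.dump(embeddings, fh)

# Toy data standing in for the real dataset
rows = [{"title": f"Book {i}"} for i in range(1000)]
subset = sample_rows(rows)
placeholder_embeddings = [[0.0, 0.0] for _ in subset]  # not real vectors

with tempfile.TemporaryDirectory() as tmp:
    save_database(subset, placeholder_embeddings, tmp)
    with open(Path(tmp) / "books.jsonl", encoding="utf-8") as fh:
        n_lines = sum(1 for _ in fh)
```

Keeping the rows and the embeddings in the same order matters: position i in the embeddings must describe line i of the jsonlines file, since retrieval later relies on those indices.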

The query translator

To translate the query, we rely on the same NLP model used to build the database. It comes from the sentence-transformers library, which builds on Hugging Face Transformers and provides sentence embeddings from several BERT-based model variants.

Our choice is paraphrase-multilingual-MiniLM-L12-v2, for two reasons. First, the multilingual part lets us encode descriptions written in different source languages, which gives us flexibility later on. Second, the MiniLM variant produces 384-dimensional embeddings instead of the original 768, which halves the storage requirements.
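A minimal sketch of the query translator could look like the following. The helper names `load_model` and `encode_query` are hypothetical, and the import is done lazily so the encoding logic can be exercised without downloading the model. `SentenceTransformer(...)` and its `encode` method are the library's standard entry points.

```python
def load_model(name="paraphrase-multilingual-MiniLM-L12-v2"):
    """Load the sentence-transformers model (downloads it on first use)."""
    from sentence_transformers import SentenceTransformer  # lazy import
    return SentenceTransformer(name)

def encode_query(model, query):
    """Encode a single text query and return its embedding vector."""
    # encode() takes a list of sentences and returns one vector per sentence
    return model.encode([query.strip()])[0]
```

Calling `encode_query(load_model(), "quiero un libro de piratas y una isla")` should return a 384-dimensional vector for this model. Any object exposing a compatible `encode` method can be passed in, which makes it easy to swap models later.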

The candidate retrieval backend

The retrieval backend should receive a query, process it using the query translator, find the best candidates in the database, and output the results to the user.

As this is just a PoC, we will simplify the process. Our backend will be a simple script that takes the query, encodes it and compares it against our database.

The best candidates will be ranked by their cosine similarity to the query, and to simplify the process we will leverage the tensor that we stored in the previous step in the cloudpickle file. Once the top k candidates by similarity are computed, we store their indices. Those are the indices of the lines in the jsonlines file to be retrieved and shown to the user.
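The retrieval step can be illustrated with a small plain-Python sketch. It is a stand-in for the tensor operations rather than the implementation used in the repository: in practice, helpers such as `sentence_transformers.util.cos_sim` together with `torch.topk` could do the same work on the stored tensor. The vectors, records, and function names below are made up for the example.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity between two equally sized vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query_vec, db_vecs, k=5):
    """Return (index, score) pairs for the k most similar database vectors."""
    scored = ((cosine(query_vec, vec), i) for i, vec in enumerate(db_vecs))
    return [(i, score) for score, i in heapq.nlargest(k, scored)]

# Toy 2-d embeddings; position i matches line i of the jsonlines file
db_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
records = [{"title": "A"}, {"title": "B"}, {"title": "C"}]

hits = top_k([1.0, 0.1], db_vecs, k=2)
results = [records[i] for i, _ in hits]
```

Because the indices returned by `top_k` are positions in the embedding list, they can be used directly to pick the matching lines of the jsonlines file.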

Wrapping up

Results

We have run 3 sample queries to test the approach of our PoC:

  • Query 1: mildly specific and in Spanish. “quiero un libro de piratas y una isla” (I want a book about pirates and an island).
{'title': 'The Ghost Pirates', 'description': '"The Ghost Pirates . . . is a powerful account of a doomed and haunted ship on its last voyage, and of the terrible sea-devils (of quasi-human aspect, and perhaps the spirits of bygone buccaneers) that besiege it and finally drag it down to an unknown fate. With its command of maritime knowledge, and its clever selection of hints and incidents suggestive of latent horrors in nature, this book at times reaches enviable peaks of power." -- H.P. Lovecraft', 'authors': ['Darrell Schweitzer', 'William Hope Hodgson']}
*************
{'title': 'Wind-up Pirate Ship', 'description': 'Kids can really get involved in the timber- shivering tale on the high seas, with this novelty board book which comes with a wind-up pirate ship. There are 3 sturdy tracks embedded in the pages; the wind-up ship runs round them, crossing the seven seas and avoiding ghost ships, whirlpools and sea monsters on the way. A great gift which will engage and entertain. Ages: 3+', 'authors': ['Christyan Fox', 'Louie Stowell']}
*************
{'title': 'The Grotlyn', 'description': "A stunningly illustrated picture book full of mystery and suspense, from the bestselling author of THE STORM WHALE and GRANDAD'S ISLAND.", 'authors': ['Benji Davies']}
*************
{'title': 'Under the Sea', 'description': 'This picture book takes the reader on a journey all the way through the sea from one shore to another far across the world. From a bustling bright coral reef (by day and by night), out into the open sea to swim alongside giant whales, and diving down and down to discover what lives in the deepest darkest part of the ocean. This book introduces a child to the wonders of the sea and all kinds of sealife. The stunning images and lyrical text will leave a lasting impression, and can be treasured again and again.', 'authors': ['Anna Milbourne']}
*************
{'title': 'Port Side Pirates', 'description': 'Travel the high seas with a lively band of buccaneers as they enjoy a melodic adventure aboard their galleon. Includes fun information about historical pirates, pirates around the world, and even a helpful chart naming the parts of a ship. Book with CD editions include song sung by Mark Collins.', 'authors': ['Debbie Harter', 'Oscar Seaworthy']}
*************
  • Query 2: broad and in English. “I want a book about Latin-American magic realism”.
{'title': 'One Hundred Years Of Solitude', 'description': 'In the book which put South America on the literary map, Marquez tells the haunting story of a community lost in the depths of that almighty continent where time passes slowly. A poetic masterpiece whose rich and powerful language easily survives the translation from Spanish, this is the most celebrated text of magic realism, the literary movement which has dominated world fiction for the last thirty years.', 'authors': ['Gabriel Garcia Marquez']}
*************
{'title': "Now That's What I Call Chaos Magick: v. 1 & 2", 'description': "This book gives the beginner and experienced practitioner alike a modern, 21st century view into the powerful and often misunderstood magical current called 'Chaos Magick'. Written in a clear and easily accessible style it examines the theory behind many techniques used in magical, artistic, religious and scientific systems of thought; then links and applies them towards desired goals. Separated into two volumes the book can be used by the reader as a workbook with rituals, techniques and exercises to be followed, as a window into contemporary magical thought at the turn of the century or simply as a rollercoaster of a good read! However you choose to use it, this book will leave you feeling positive, inspired and ready to apply any of the methods presented to your own life.", 'authors': ['Dave Lee', 'Greg Humphries', 'Julian Vayne']}
*************
{'title': 'Queen of Dreams : The Story of a Yaqui Dreaming Woman', 'description': "For readers of Carlos Castaneda and Lynn Andrews, this book presents the fascinating true story of a woman's dramatic spiritual odyssey as the wife of a Yaqui Indian chief and sorcerer. Drawing readers into an intriguing world, Valencia describes her shamanistic experiences among the Native American people and their rich spiritual tradition. Lightning Print On Demand Title", 'authors': ['Heather Valencia', 'Rolly Kent']}
*************
{'title': 'Magick in Theory and Practice', 'description': "2018 Facsimile of the 1929 Edition. Illustrated. Many consider this work by Crowley to be the foremost book on ceremonial magic written in the twentieth century. It was written especially for beginners and is considered one of Crowley's better books. Illustrated with graphs and charts. The original was privately printed in 1929 after Crowley failed to find a publisher in London and has been considered a scarce work since that time.", 'authors': ['Aleister Crowley']}
*************
{'title': 'Three Books of Occult Philosophy', 'description': "The first book of Agrippa's famous treatise on magic and Alchemy. Vital for Ceremonial Magicians of all forms.", 'authors': ['Henry Cornelius Agrippa', 'Kevadrin Dolluson']}
*************
  • Query 3: very specific, in English. “I want to read a story of robots, spacecrafts and aliens attacking earth”.
{'title': 'History News: Explorers News', 'description': "Read about the astonishing discoveries and remarkable adventures of explorers -- from the voyages of the ancient Polynesians and the Vikings to satellites in space. The History News: Explorers, a popular book in the award-winning News series, is available in paperback and in a space-saving reduced trim size. Covering the astonishing discoveries and bold adventures of world explorers, this acclaimed book presents history in a unique, kid-friendly format that's as accessible as the morning newspaper. Back matter includes a time line, an index, and sources.", 'authors': ['Michael Johnstone', 'Various']}
*************
{'title': 'Dreams of Other Worlds : The Amazing Story of Unmanned Space Exploration - Revised and Updated Edition', 'description': "Dreams of Other Worlds describes the unmanned space missions that have opened new windows on distant worlds. Spanning four decades of dramatic advances in astronomy and planetary science, this book tells the story of eleven iconic exploratory missions and how they have fundamentally transformed our scientific and cultural perspectives on the universe and our place in it. The journey begins with the Viking and Mars Exploration Rover missions to Mars, which paint a startling picture of a planet at the cusp of habitability. It then moves into the realm of the gas giants with the Voyager probes and Cassini's ongoing exploration of the moons of Saturn. The Stardust probe's dramatic round-trip encounter with a comet is brought vividly to life, as are the SOHO and Hipparcos missions to study the Sun and Milky Way. This stunningly illustrated book also explores how our view of the universe has been brought into sharp focus by NASA's great observatories--Spitzer, Chandra, and Hubble--and how the WMAP mission has provided rare glimpses of the dawn of creation.", 'authors': ['Chris Impey', 'Holly Henry']}
*************
{'title': 'Pluto 2', 'description': 'Una fascinante historia sobre emociones, relaciones y pensamientos. . . de los robots. Y de cómo ellos interactúan con los humanos, qué les mueven y, al final, qué los hace más humanos que nosotros mismos.', 'authors': ['Daruma Serveis Lingüístics', 'Naoki Urasawa']}
*************
{'title': '"Life": Nature\'s Fury : When the Weather Spins out of Control', 'description': 'This title explores in explanations and startling photographs the immense power of the forces acting on our planet through the eyes of those who have been most affected by it - including scientists, researchers and innocent bystanders. It covers earthquakes, tornedoes, volcanoes and hurricanes.', 'authors': ['Editors Of Life Magazine']}
*************
{'title': 'War of the Worlds', 'description': 'When four Martian spaceships land in England, masses of people flee the cities, driven by an overwhelming fear of the alien creatures with their devastating weapons of death and destruction. Excellently adapted by Bob Blaisdell for youngsters, this easy-to-read version of the 19th-century, science-fiction classic is enhanced with 6 original illustrations by John Green. Abridged.', 'authors': ['H. G. Wells', 'John Green', 'Robert Blaisdell']}

Key takeaways

Although the results are generally not terrible, they are far from optimal. We retrieve One Hundred Years of Solitude by García Márquez as our first result for Latin-American magic realism, War of the Worlds for our very specific sci-fi query, and pirate- and island-related results for the first query, but there is clear room for improvement.

One might argue that, since we loaded only part of the database and applied no similarity threshold, these may well be the best results available. But that is a debate for another article on model and ranking improvements.
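To make the idea of a threshold cutoff concrete, here is a tiny sketch that drops candidates below a minimum similarity. The function name, the 0.5 default, and the scores are purely illustrative assumptions, not values measured in this PoC.

```python
def apply_cutoff(candidates, min_score=0.5):
    """Keep only (index, score) pairs whose score reaches min_score."""
    return [(i, score) for i, score in candidates if score >= min_score]

# Made-up (index, similarity) pairs, sorted from best to worst
candidates = [(12, 0.81), (7, 0.55), (3, 0.32)]
kept = apply_cutoff(candidates)
```

With a cutoff in place, a query with no good match would return fewer than k results, or none, instead of padding the list with weak candidates.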

The key takeaways:

  • The longer the description, the higher the similarity, provided it contains the query keywords.
  • The model works well regardless of the original language of the query or the description.
  • Some keywords strengthen the connection between query and result (Magic) whereas others are less impactful (book / libro).

Conclusions and next steps

In this article, we have walked through the different steps necessary to create a working PoC of a book recommender, with minimal implementation.

During testing, we have shown that very specific queries provide more meaningful results, regardless of the original language of the query.

However, this PoC requires re-running the script and loading the database every time we want to run an example. For that reason, in my next article, as a step towards building our MVP, we will migrate the back end to FastAPI and create a minimal Streamlit app to be used as the front end.

Francisco ESPIGA

Data Science & AI Tech Lead@SANDOZ and Teacher@ESIC. Reinforcement learning, optimization and AI enthusiast building a more interesting world 1 epoch at a time.