Development of a multi-user book search engine platform written in Java, anchored in an inverted index, encompassing crawling, cleaning, indexing, and efficient querying for heightened precision and user experience.


LiBook: Book Search Engine 🔍

This repository contains the source code for an inverted-index-based search engine for books obtained both from Project Gutenberg and directly from registered users' accounts. We also implemented both relational and non-relational datamarts to query the available books. This is a micro-service-oriented application consisting of the following modules:

  • Crawler: Obtains books directly from the Project Gutenberg platform and stores them in our datalake.
  • Cleaner: Processes the books and prepares them to be indexed.
  • Indexer: Indexes the books into our inverted index structure in Hazelcast.
  • MetadataDatamartBuilder: Creates a metadata datamart for queries.
  • QueryEngine: Offers an API for users to be able to query our inverted index.
  • UserService: Handles users' accounts in MongoDB, and session tokens through a distributed Hazelcast datamart.
  • UserBookProcessor: Processes the books uploaded by users and sends them to the cleaner.
  • ApiGateway: Serves an API merging all the public APIs of the final application, improving the security of incoming requests.

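The heart of the pipeline is the inverted index built by the Indexer and searched by the QueryEngine. As a rough illustration of the idea (the real project keeps this structure in a distributed Hazelcast map, not an in-memory dict, so this is only a conceptual sketch):

```python
from collections import defaultdict

def build_inverted_index(books):
    """Map each term to the set of book ids whose text contains it."""
    index = defaultdict(set)
    for book_id, text in books.items():
        for term in text.lower().split():
            index[term].add(book_id)
    return index

def query(index, terms):
    """Return ids of books containing every query term (AND semantics)."""
    result = None
    for term in terms:
        postings = index.get(term.lower(), set())
        result = postings if result is None else result & postings
    return result or set()

books = {
    1: "The adventures of Tom Sawyer",
    2: "Adventures of Huckleberry Finn",
    3: "Moby Dick or The Whale",
}
index = build_inverted_index(books)
print(sorted(query(index, ["adventures"])))  # [1, 2]
```

Querying then costs one lookup per term plus a set intersection, instead of a scan over every book's full text.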
Crucially, this project employs three distinct datamart technologies—Hazelcast, MongoDB, and Rqlite. Rqlite, based on SQLite and adapted for clustered usage, is particularly notable for its role in distributed relational database management within the application. The integration of these datamarts enhances the overall scalability, efficiency, and versatility of the search engine, accommodating both centralized and distributed data processing needs.





1) How to run (Docker and Docker Compose)

Generate the corresponding Docker image for each module. In our case, we deploy the Crawler, Cleaner, API Gateway, User Service and User Book Processor services on a Google Cloud virtual machine instance. To do so, after connecting to the Google Cloud server, run docker-compose up with the compose file provided in this repository. Then, run the remaining micro-services on-premises.
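The compose file in the repository wires the cloud-side services together; a simplified sketch of its shape (the service names, image tags, ports and variables below are illustrative assumptions, not the repository's actual file) could look like:

```yaml
version: "3"
services:
  crawler:
    image: ricardocardn/crawler        # illustrative image name
    environment:
      - SERVER_MQ_PORT=443
  cleaner:
    image: ricardocardn/cleaner        # illustrative image name
    ports:
      - "80:80"
  user-service:
    image: ricardocardn/user-service
    environment:
      - MONGO_ATLAS_PASSWORD=...
  api-gateway:
    image: ricardocardn/api-gateway    # illustrative image name
    ports:
      - "443:443"
```

With a file of this shape in place, a single docker-compose up on the server starts all the cloud-side services at once.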

Indexer

To execute the indexer, we should run the docker image as follows:

docker run -p 8082:8082 \
           -e "SERVER_API_URL=http://34.16.163.134" \
           -e "SERVER_MQ_PORT=443" \
           -e "index=1" \
           --network host \
           ricardocardn/indexer

Make sure to specify the --network host option, or problems related to Hazelcast networking may arise.

Metadata Datamart Builder

To run this service, we need both the rqlite and Metadata Datamart Builder images. First, start the rqlite image on a single machine of your cluster:

docker run -p 4001:4001 -p 4002:4002 rqlite/rqlite

And, for the Metadata Datamart builder, execute:

docker run -e "SERVER_MQ_PORT=443" \
           -e "SERVER_API_URL=http://34.16.163.134" \
           -e "SERVER_CLEANER_PORT=80" \
           -e "LOCAL_MDB_API=http://34.16.163.134" \
           ricardocardn/metadata-datamart-builder
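Rqlite speaks standard SQLite SQL, so the kind of metadata query this datamart serves can be sketched locally with Python's built-in sqlite3 module (the table name and columns here are illustrative assumptions, not the project's actual schema):

```python
import sqlite3

# In-memory SQLite as a stand-in for the Rqlite metadata datamart.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metadata ("
    "book_id INTEGER PRIMARY KEY, title TEXT, author TEXT, language TEXT)"
)
conn.executemany(
    "INSERT INTO metadata VALUES (?, ?, ?, ?)",
    [
        (1, "Moby Dick", "Herman Melville", "en"),
        (2, "Don Quijote", "Miguel de Cervantes", "es"),
    ],
)

# A typical metadata lookup: all English-language books.
rows = conn.execute(
    "SELECT title, author FROM metadata WHERE language = ?", ("en",)
).fetchall()
print(rows)  # [('Moby Dick', 'Herman Melville')]
```

In the deployed system the same kind of statement would be sent to rqlite's HTTP API rather than to a local database file.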

Query Engine

The Query Engine makes use of both Hazelcast and Rqlite, so run its image as follows:

docker run -p 8080:8080 \
           --network host \
           susanasrez/queryengine2

User Service

The User Service connects to both the Hazelcast and MongoDB datamarts, so make sure to use --network host and to have a MongoDB Atlas account. Then run:

docker run -p 8082:8082 \
           -e "MONGO_ATLAS_PASSWORD=..." \
           -e "SERVER_API_URL=http://34.16.163.134" \
           -e "SERVER_BOOKS_PORT=80" \
           ricardocardn/user-service

Local API Gateway

The Local API Gateway lets each machine in the cluster expose the services it runs behind a single entry point, so that the load balancer can route each user request to any machine.

docker run -p 8080:8080 \
           -e "USER_SERVICE_API=http://localhost:{local user service's port}" \
           -e "QUERY_ENGINE_SERVICE_API=http://localhost:{query engine service's port}" \
           -e "CLEANER_SERVICE_API=http://{server's ip}:{cleaner's port}" \
           ricardocardn/local-api-gateway
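The routing idea behind the gateway can be sketched as a prefix table mapping request paths to backend base URLs (the path prefixes and addresses below are illustrative assumptions, not the gateway's actual routes):

```python
# Illustrative routing table: path prefix -> backend service base URL.
ROUTES = {
    "/users": "http://localhost:8082",    # User Service (assumed prefix)
    "/query": "http://localhost:8080",    # Query Engine (assumed prefix)
    "/clean": "http://34.16.163.134:80",  # Cleaner on the server (assumed prefix)
}

def resolve(path):
    """Pick the backend whose prefix matches the request path."""
    for prefix, backend in ROUTES.items():
        if path.startswith(prefix):
            return backend + path
    raise ValueError(f"no route for {path}")

print(resolve("/query/books?term=whale"))
```

Because every machine runs the same gateway with its own local ports in the table, the load balancer can hand any request to any machine and have it forwarded to the right service.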

Credits
