Sustainable use of Brazilian Biodiversity

Using Linked Data for
Natural Product Discovery

publications

Improving Natural Product Knowledge Extraction from Academic Literature with Enhanced PDF Text Extraction and Large Language Models

Paulo Ricardo Viviurka do Carmo, Marcos Paulo Silva Gôlo, Jonas Gwozdz, Edgard Marx, Ricardo Marcacini

The biodiversity of tropical environments offers a rich variety of species for discovering new drugs based on natural products. Databases such as the Brazilian Biodiversity Natural Products Database (NUBBE$_{DB}$), which hold compounds and their characteristics, are important for computational assistance. However, these databases are difficult to keep up to date because data about compounds is mostly published in academic papers. Automatic knowledge extraction, as in the state-of-the-art Benchmark for Natural Product Knowledge Extraction from Academic Literature (NatUKE), is therefore an important task for the field. The benchmark uses a knowledge graph version of NUBBE$_{DB}$ and evaluates different knowledge graph embedding models for the task. The best performer in NatUKE is an embedding propagation model that uses pre-trained language models as the initial embeddings for the nodes that contain text data. This work investigates two avenues for improving performance on NatUKE: better text extraction from PDFs and using Large Language Models as the initial embeddings. Our results surpassed the state of the art in 3 out of 5 extracted features while maintaining competitive performance on the remaining features.


@inproceedings{pdfandllm2025docarmo,
author = {Viviurka do Carmo, Paulo and Silva G\^{o}lo, Marcos Paulo and Gwozdz, Jonas and Marx, Edgard and Marcondes Marcacini, Ricardo},
title = {Improving Natural Product Knowledge Extraction from Academic Literature with Enhanced PDF Text Extraction and Large Language Models},
year = {2025},
isbn = {9798400706295},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3672608.3707858},
booktitle = {Proceedings of the 40th ACM/SIGAPP Symposium on Applied Computing},
pages = {980--987},
numpages = {8}
}

Development of a novel chemoinformatic tool for natural product databases

Paulo Ricardo Viviurka do Carmo, Ricardo Marcacini, Marilia Valli, João Victor Silva-Silva, Leonardo Luiz Gomes Ferreira, Alan Cesar Pilon, Vanderlan da Silva Bolzani, Adriano D Andricopulo, Edgard Marx

Aim: This study aimed to develop a chemoinformatic tool for extracting natural product information from academic literature. Materials & methods: Machine learning graph embeddings were used to extract knowledge from a knowledge graph, connecting properties, molecular data and BERTopic topics. Results: Metapath2Vec performed best in extracting compound names and showed improvement over evaluation stages. Embedding Propagation on Heterogeneous Networks achieved the best performance in extracting bioactivity information. Metapath2Vec excelled in extracting species information, while DeepWalk and Node2Vec performed well in one stage for species location extraction. Embedding Propagation on Heterogeneous Networks consistently improved performance and achieved the best overall scores. Unsupervised embeddings effectively extracted knowledge, with different methods excelling in different scenarios. Conclusion: This research establishes a foundation for frameworks in knowledge extraction, benefiting sustainable resource use.


@article{development2023docarmo,
author={Viviurka do Carmo, Paulo Ricardo and Marcacini, Ricardo and Valli, Marilia and Silva-Silva, João Victor and Ferreira, Leonardo Luiz Gomes and Pilon, Alan Cesar and Bolzani, Vanderlan da Silva and Andricopulo, Adriano D. and Marx, Edgard},
journal={Future Drug Discovery},
volume={5},
number={2},
title={Development of a novel chemoinformatic tool for natural product databases},
year={2023},
}

Preface of the First International Biochemical Knowledge Extraction Challenge (BiKE)

Edgard Marx, Marilia Valli, Joao da Silva e Silva, Sanju Tiwari, Paulo do Carmo

The knowledge from over 50 years of studies on biodiversity available in scientific articles can become more easily accessible when organized and shared through knowledge graphs. It can assist in the development of different fields of science and bio-friendly products with high added value, as well as guide public policies that benefit science and strengthen the bio-economy. However, to date, most of the structured biochemical information available on the Web is manually curated, and it is practically impossible to keep pace with the research constantly being published in scientific articles. The First International Biochemical Knowledge Extraction Challenge (BiKE) aims at accelerating and promoting research on automatic biochemical knowledge extraction mechanisms by the Semantic Web scientific community, to increase the information available on natural products and contribute to the development of environmentally friendly products while increasing community awareness of the value of biodiversity. The following papers were accepted for publication and presented at the workshop:
• BiKE Challenge: Result of ChemiScope by using ChatGPT
• Improving Natural Product Automatic Extraction with Named Entity Recognition
• Enhancing Biochemical Extraction with BFS-driven Knowledge Graph Embedding approach


@inproceedings{bike2023marx,
  author={Marx, Edgard and Valli, Marilia and da Silva e Silva, Joao and Tiwari, Sanju and do Carmo, Paulo},
  booktitle={Joint Proceedings of the Second International Workshop on Knowledge Graph Generation From Text and the First International BiKE Challenge co-located with 20th Extended Semantic Web Conference (ESWC 2023)},
  title={Preface of the First International Biochemical Knowledge Extraction Challenge (BiKE)},
  year={2023},
}

Improving Natural Product Automatic Extraction With Named Entity Recognition

Stefan Schmidt-Dichte, István Mócsy

Knowledge graphs (KGs) play a vital role in providing structured data for various applications, but their creation is time-consuming and prone to errors. To address these challenges, automatic knowledge extraction methods using machine learning (ML) have gained attention. ML algorithms have shown promise in capturing subtle nuances in language data, offering comprehensive and robust solutions. In the field of biochemistry, knowledge extraction is crucial for advancing scientific research, product development, and policy-making. The First International Biochemical Knowledge Extraction Challenge focuses on extracting biochemical knowledge from scientific articles. This paper presents an updated approach that incorporates named entity recognition (NER) using scispaCy models to improve the accuracy and relevance of extracted entities. The evaluation of the approach utilizes the NatUKE benchmark and demonstrates improved performance in extracting bioactivity and isolation type. However, challenges remain in identifying compound names and species. Future research may explore hybrid approaches combining different techniques to address these specific challenges.


@inproceedings{ner2023schmidtdichte,
  author={Schmidt-Dichte, Stefan and M{\'o}csy, Istv{\'a}n J},
  booktitle={Joint Proceedings of the Second International Workshop on Knowledge Graph Generation From Text and the First International BiKE Challenge co-located with 20th Extended Semantic Web Conference (ESWC 2023)},
  title={Improving Natural Product Automatic Extraction With Named Entity Recognition},
  year={2023},
}

Leveraging ChatGPT API for Enhanced Data Preprocessing in NatUKE

Pit Fröhlich, Jonas Gwozdz, Matthias Jooß

This scientific paper presents an approach for enhancing the performance of machine learning models by utilizing ChatGPT, a state-of-the-art language model developed by OpenAI, for data preprocessing. The study focuses on the existing Project NatUKE (A Benchmark for Natural Product Knowledge Extraction from Academic Literature) and investigates the impact of incorporating ChatGPT in the preprocessing pipeline. By leveraging the natural language processing capabilities of ChatGPT, we aim to improve the quality and relevance of the data used as input for the knowledge graph embedding algorithms. This paper provides a detailed description of the methodology employed, the experimental setup, and the results obtained, highlighting the benefits and limitations of this approach.


@inproceedings{gpt2023frohlich,
  author={Fr{\"o}hlich, Pit and Gwozdz, Jonas and Joo{\ss}, Matthias},
  booktitle={Joint Proceedings of the Second International Workshop on Knowledge Graph Generation From Text and the First International BiKE Challenge co-located with 20th Extended Semantic Web Conference (ESWC 2023)},
  title={Leveraging ChatGPT API for Enhanced Data Preprocessing in NatUKE},
  year={2023},
}

Assessing Bias on Entity Retrieval Models through Conjunctive Fallacies

Edgard Marx
International Conference on Semantic Computing, 2023

Information retrieval methods, machine learning models, and humans can suffer from a failure in judging information representativeness. We refer to this problem as information bias. In this work, we propose a method to evaluate information bias through conjunctive fallacies. An experimental evaluation of different state-of-the-art entity retrieval models and human-curated benchmarks shows that both methods perform poorly on judging query-entity representativeness while statistically based methods perform considerably better than humans.

@inproceedings{icsc2023informationBias,
  author={Marx, Edgard},
  booktitle={2023 IEEE 17th International Conference on Semantic Computing (ICSC)}, 
  title={Assessing Bias on Entity Retrieval Models through Conjunctive Fallacies}, 
  year={2023},
  pages={260--261},
  doi={10.1109/ICSC56153.2023.00050}
}

NatUKE: A Benchmark for Natural Product Knowledge Extraction from Academic Literature

Paulo Viviurka do Carmo, Edgard Marx, Ricardo Marcacini, Marilia Valli, João Victor Silva e Silva, Alan Pilon
International Conference on Semantic Computing, 2023

This work introduces a benchmark for natural product knowledge extraction from academic literature and evaluates different state-of-the-art unsupervised embedding generation methods for this task. We show that chemical compound characteristics can be extracted automatically from academic literature with an unsupervised pipeline based on graph embedding methods. We evaluated four methods (DeepWalk, Node2Vec, Metapath2Vec, and EPHEN) in a similarity-based graph completion evaluation scenario. EPHEN achieves reasonable hits@k performance at bioactivity and isolation type extraction with 0.64 when k = 5 and 0.75 when k = 1, respectively. Meanwhile, Metapath2Vec was the best performer, but with underwhelming results, when extracting compound name and species with 0.20 and 0.44 when k = 50, respectively. These results show that using text data and previously extracted knowledge from the knowledge graph provides the most stable performance. They also show that some characteristics from these papers are more challenging to extract than others, and that using the knowledge graph topology as context data helps in these scenarios.
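
For readers unfamiliar with the hits@k metric mentioned in the abstract, the following is a minimal sketch of a similarity-based hits@k computation. It is not the NatUKE implementation: the function names, the toy two-dimensional embeddings, and the choice of cosine similarity are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def hits_at_k(queries, candidates, truth, k):
    """Fraction of queries whose true candidate is among the k most similar."""
    hits = 0
    for name, query_vec in queries.items():
        # Rank every candidate by similarity to the query, best first.
        ranked = sorted(
            candidates,
            key=lambda c: cosine(query_vec, candidates[c]),
            reverse=True,
        )
        if truth[name] in ranked[:k]:
            hits += 1
    return hits / len(queries)


# Hypothetical toy data: paper embeddings, candidate bioactivity embeddings,
# and the bioactivity each paper should be linked to.
papers = {"paper_a": [1.0, 0.0], "paper_b": [0.0, 1.0]}
bioactivities = {
    "antifungal": [0.9, 0.1],
    "antioxidant": [0.6, 0.8],
    "cytotoxic": [0.1, 0.9],
}
truth = {"paper_a": "antifungal", "paper_b": "antioxidant"}

score_k1 = hits_at_k(papers, bioactivities, truth, k=1)
score_k2 = hits_at_k(papers, bioactivities, truth, k=2)
print(score_k1, score_k2)
```

In this toy setup the correct bioactivity for `paper_b` is only the second most similar candidate, so the score rises when k goes from 1 to 2. This is why hits@k values should be compared only at the same k.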

@inproceedings{icsc2023natuke,
  author={Do Carmo, Paulo Viviurka and Marx, Edgard and Marcacini, Ricardo and Valli, Marilia and Silva e Silva, João Victor and Pilon, Alan},
  booktitle={2023 IEEE 17th International Conference on Semantic Computing (ICSC)}, 
  title={NatUKE: A Benchmark for Natural Product Knowledge Extraction from Academic Literature}, 
  year={2023},
  pages={199--203},
  doi={10.1109/ICSC56153.2023.00039}
}
