How to Use the Hugging Face Tokenizers Library to Preprocess Text Data - KDnuggets (2024)

Image by Author

If you have studied NLP, you might have heard about the term "tokenization." It is an important step in text preprocessing, where we transform our textual data into something that machines can understand. It does so by breaking down the sentence into smaller chunks, known as tokens. These tokens can be words, subwords, or even characters, depending on the tokenization algorithm being used. In this article, we will see how to use the Hugging Face Tokenizers Library to preprocess our textual data.

Setting Up Hugging Face Tokenizers Library

To start using the Hugging Face Tokenizers library, you'll need to install it first. The examples in this article also rely on the Transformers library, so install both using pip:

pip install tokenizers transformers

The Hugging Face library supports various tokenization algorithms, but the three main types are:

  • Byte-Pair Encoding (BPE): Merges the most frequent pairs of characters or subwords iteratively, creating a compact vocabulary. It is used by models like GPT-2.
  • WordPiece: Similar to BPE but focuses on probabilistic merges (doesn't choose the pair that is the most frequent but the one that will maximize the likelihood of the corpus once merged), commonly used by models like BERT.
  • SentencePiece: A more flexible tokenizer that can handle different languages and scripts, often used with models like ALBERT, XLNet, or the Marian framework. It treats spaces as characters rather than word separators.

The Hugging Face Transformers library provides an AutoTokenizer class that can automatically select the best tokenizer for a given pre-trained model. This is a convenient way to use the correct tokenizer for a specific model, and it can be imported from the transformers library. However, for the sake of our discussion regarding the Tokenizers library, we will not follow this approach.
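For reference, here is a minimal sketch of that convenience route, which also shows how the algorithms listed above differ in practice (gpt2 is a BPE checkpoint and bert-base-uncased a WordPiece checkpoint; both are downloaded from the Hugging Face Hub):

# Minimal sketch: AutoTokenizer picks the right tokenizer class for each checkpoint
from transformers import AutoTokenizer

bpe_tokenizer = AutoTokenizer.from_pretrained("gpt2")                     # BPE (GPT-2)
wordpiece_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # WordPiece (BERT)

text = "Tokenization is a crucial step in NLP."
print(wordpiece_tokenizer.tokenize(text))
# ['token', '##ization', 'is', 'a', 'crucial', 'step', 'in', 'nl', '##p', '.']
print(bpe_tokenizer.tokenize(text))
# GPT-2's BPE splits the same text differently and marks word-initial spaces with 'Ġ'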

We will use the pre-trained BERT-base-uncased tokenizer. This tokenizer was trained on the same data and using the same techniques as the BERT-base-uncased model, which means it can be used to preprocess text data compatible with BERT models:

# Import the necessary components
from tokenizers import Tokenizer
from transformers import BertTokenizer

# Load the pre-trained BERT-base-uncased tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
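If you prefer to stay entirely within the Tokenizers library (the Tokenizer import above), the same pretrained vocabulary can also be loaded directly from the Hub; a minimal sketch:

# Minimal sketch using the standalone tokenizers library
from tokenizers import Tokenizer

# Loads the tokenizer.json published alongside the bert-base-uncased checkpoint
raw_tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

encoding = raw_tokenizer.encode("This is sample text to test tokenization.")
print(encoding.tokens)  # token strings, including [CLS] and [SEP]
print(encoding.ids)     # the corresponding integer IDs

The rest of this article uses the BertTokenizer loaded above, since its output format matches what BERT models expect.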

Single Sentence Tokenization

Now, let's encode a simple sentence using this tokenizer:

# Tokenize a single sentence
encoded_input = tokenizer.encode_plus("This is sample text to test tokenization.")
print(encoded_input)

Output:

{'input_ids': [101, 2023, 2003, 7099, 3793, 2000, 3231, 19204, 3989, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

To ensure correctness, let's decode the tokenized input:

tokenizer.decode(encoded_input["input_ids"])

Output:

[CLS] this is sample text to test tokenization. [SEP]

In this output, you can see two special tokens. [CLS] marks the start of the input sequence, and [SEP] marks the end, indicating a single sequence of text.
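If you want to inspect the individual token strings rather than the decoded sentence, convert_ids_to_tokens maps each ID back to its token:

# Map each ID back to its token string to see how the sentence was split
print(tokenizer.convert_ids_to_tokens(encoded_input["input_ids"]))
# ['[CLS]', 'this', 'is', 'sample', 'text', 'to', 'test', 'token', '##ization', '.', '[SEP]']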

Batch Tokenization

Now, let's tokenize a corpus of text instead of a single sentence using batch_encode_plus:

corpus = [
    "Hello, how are you?",
    "I am learning how to use the Hugging Face Tokenizers library.",
    "Tokenization is a crucial step in NLP."
]
encoded_corpus = tokenizer.batch_encode_plus(corpus)
print(encoded_corpus)

Output:

{'input_ids': [[101, 7592, 1010, 2129, 2024, 2017, 1029, 102], [101, 1045, 2572, 4083, 2129, 2000, 2224, 1996, 17662, 2227, 19204, 17629, 2015, 3075, 1012, 102], [101, 19204, 3989, 2003, 1037, 10232, 3357, 1999, 17953, 2361, 1012, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

For better understanding, let's decode the batch-encoded corpus as we did for the single sentence. This gives back the original sentences, with the special tokens added.

tokenizer.batch_decode(encoded_corpus["input_ids"])

Output:

['[CLS] hello, how are you? [SEP]', '[CLS] i am learning how to use the hugging face tokenizers library. [SEP]', '[CLS] tokenization is a crucial step in nlp. [SEP]']
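If you only want the plain text back, you can pass skip_special_tokens=True to drop [CLS] and [SEP] during decoding:

# Decode without the special tokens
print(tokenizer.batch_decode(encoded_corpus["input_ids"], skip_special_tokens=True))
# ['hello, how are you?', 'i am learning how to use the hugging face tokenizers library.', 'tokenization is a crucial step in nlp.']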

Padding and Truncation

When preparing data for machine learning models, ensuring all input sequences have the same length is often necessary. Two methods to accomplish this are:

1. Padding

Padding works by adding the special [PAD] token at the end of the shorter sequences to match the length of the longest sequence in the batch, or a fixed length if max_length is defined together with padding="max_length". You can do this by:

encoded_corpus_padded = tokenizer.batch_encode_plus(corpus, padding=True)
print(encoded_corpus_padded)

Output:

{'input_ids': [[101, 7592, 1010, 2129, 2024, 2017, 1029, 102, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1045, 2572, 4083, 2129, 2000, 2224, 1996, 17662, 2227, 19204, 17629, 2015, 3075, 1012, 102], [101, 19204, 3989, 2003, 1037, 10232, 3357, 1999, 17953, 2361, 1012, 102, 0, 0, 0, 0]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]]}

Now you can see that extra 0s are appended to the shorter sequences. For better understanding, let's decode the padded corpus to see where the tokenizer has placed the [PAD] tokens:

tokenizer.batch_decode(encoded_corpus_padded["input_ids"], skip_special_tokens=False)

Output:

['[CLS] hello, how are you? [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]', '[CLS] i am learning how to use the hugging face tokenizers library. [SEP]', '[CLS] tokenization is a crucial step in nlp. [SEP] [PAD] [PAD] [PAD] [PAD]']
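With padding=True, sequences are only padded to the longest sequence in the current batch. If every batch needs the same fixed length, you can pad to a chosen max_length instead; a short sketch, where 20 is an arbitrary illustrative value:

# Pad every sequence to a fixed length of 20 tokens (20 is arbitrary here)
encoded_fixed = tokenizer.batch_encode_plus(corpus, padding="max_length", max_length=20)
print([len(ids) for ids in encoded_fixed["input_ids"]])  # [20, 20, 20]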

2. Truncation

Many NLP models have a maximum input sequence length, and truncation works by cutting off the end of longer sequences to meet this limit. It reduces memory usage and prevents the model from being overwhelmed by very long input sequences.

encoded_corpus_truncated = tokenizer.batch_encode_plus(corpus, truncation=True, max_length=5)
print(encoded_corpus_truncated)

Output:

{'input_ids': [[101, 7592, 1010, 2129, 102], [101, 1045, 2572, 4083, 102], [101, 19204, 3989, 2003, 102]], 'token_type_ids': [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1], [1, 1, 1, 1, 1]]}

Now, you could again use the batch_decode method, but for better understanding, let's print this information in a different way:

for i, sentence in enumerate(corpus):
    print(f"Original sentence: {sentence}")
    print(f"Token IDs: {encoded_corpus_truncated['input_ids'][i]}")
    print(f"Tokens: {tokenizer.convert_ids_to_tokens(encoded_corpus_truncated['input_ids'][i])}")
    print()

Output:

Original sentence: Hello, how are you?
Token IDs: [101, 7592, 1010, 2129, 102]
Tokens: ['[CLS]', 'hello', ',', 'how', '[SEP]']

Original sentence: I am learning how to use the Hugging Face Tokenizers library.
Token IDs: [101, 1045, 2572, 4083, 102]
Tokens: ['[CLS]', 'i', 'am', 'learning', '[SEP]']

Original sentence: Tokenization is a crucial step in NLP.
Token IDs: [101, 19204, 3989, 2003, 102]
Tokens: ['[CLS]', 'token', '##ization', 'is', '[SEP]']
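In practice, padding and truncation are usually combined in a single call, often together with return_tensors so the output can be fed directly to a model. A sketch, assuming PyTorch is installed:

# Combine padding and truncation and return PyTorch tensors
batch = tokenizer.batch_encode_plus(
    corpus,
    padding=True,
    truncation=True,
    max_length=16,
    return_tensors="pt",
)
print(batch["input_ids"].shape)  # torch.Size([3, 16])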

This article is part of our amazing series on Hugging Face. If you want to explore more about this topic, check out the links under More On This Topic below.

Kanwal Mehreen Kanwal is a machine learning engineer and a technical writer with a profound passion for data science and the intersection of AI with medicine. She co-authored the ebook "Maximizing Productivity with ChatGPT". As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She's also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.


More On This Topic

  • How to Use Hugging Face’s Datasets Library for Efficient Data Loading
  • How to Use Hugging Face AutoTrain to Fine-tune LLMs
  • How to Use GPT for Generating Creative Content with Hugging Face…
  • A community developing a Hugging Face for customer data modeling
  • Top 10 Machine Learning Demos: Hugging Face Spaces Edition
  • Build AI Chatbot in 5 Minutes with Hugging Face and Gradio

FAQs

How do you preprocess a text dataset?

Some of the common text preprocessing / cleaning steps are:
  1. Lower casing.
  2. Removal of Punctuations.
  3. Removal of Stopwords.
  4. Removal of Frequent words.
  5. Removal of Rare words.
  6. Stemming.
  7. Lemmatization.
  8. Removal of emojis.

What does tokenizer do in huggingface?

Tokenizers are one of the core components of the NLP pipeline. They serve one purpose: to translate text into data that can be processed by the model. Models can only process numbers, so tokenizers need to convert our text inputs to numerical data.

What is tokenization in text preprocessing?

Tokenization, in the realm of Natural Language Processing (NLP) and machine learning, refers to the process of converting a sequence of text into smaller parts, known as tokens. These tokens can be as small as characters or as long as words.

What is the purpose of tokenization in the processing of text by LLMs?

Tokenization is a crucial text-processing step for large language models. It splits text into smaller units called tokens that can be fed into the language model. Many tokenization algorithms are available; choosing the correct algorithm leads to better model performance and lower memory requirements.

What are the 5 major steps of data preprocessing?

Steps in Data Preprocessing
  • Step 1: Import the Libraries. ...
  • Step 2: Import the Loaded Data. ...
  • Step 3: Check for Missing Values. ...
  • Step 4: Arrange the Data. ...
  • Step 5: Do Scaling. ...
  • Step 6: Distribute Data into Training, Evaluation and Validation Sets.

What are the steps of text preprocessing?

It includes steps like removing punctuation, tokenization (splitting text into words or phrases), converting text to lowercase, removing stop words (common words that add little value), and stemming or lemmatization (reducing words to their base forms).

What is huggingface and how do you use it?

Hugging Face lets users create interactive, in-browser demos of machine learning models, which makes it easier to showcase and test them. Hugging Face has also been involved in collaborative research projects, such as the BigScience research workshop, aiming to advance the field of NLP.

What is the difference between the Hugging Face Tokenizer and TokenizerFast?

They are used in exactly the same way, but the fast tokenizer is considerably faster, by up to roughly 20 times, which makes it an important choice if you need to tokenize text in a production environment.
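For example, a minimal sketch of swapping in the fast BERT tokenizer, which exposes the same interface but is backed by the Rust-based Tokenizers library:

# The fast tokenizer is a drop-in replacement for the slow one
from transformers import BertTokenizerFast

fast_tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
print(fast_tokenizer("Tokenization is a crucial step in NLP.")["input_ids"])
# [101, 19204, 3989, 2003, 1037, 10232, 3357, 1999, 17953, 2361, 1012, 102]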

What is an example of a tokenizer?

Character-based. This type of tokenizer splits a text into individual characters. This is often used for tasks such as text classification or sentiment analysis. For example, the sentence "I love ice cream" would be tokenized into 16 individual characters, spaces included.
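In Python, this amounts to splitting the string into its characters:

# Character-level tokenization of a sentence
print(list("I love ice cream"))
# ['I', ' ', 'l', 'o', 'v', 'e', ' ', 'i', 'c', 'e', ' ', 'c', 'r', 'e', 'a', 'm']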

What is an example of data tokenization?

When a merchant processes the credit card of a customer, the PAN is substituted with a token. 1234-4321-8765-5678 is replaced with, for example, 6f7%gf38hfUa. The merchant can apply the token ID to retain records of the customer, for example, 6f7%gf38hfUa is connected to John Smith.

What is tokenization for dummies?

In general, tokenization is the process of issuing a digital, unique, and anonymous representation of a real thing. In Web3 applications, the token is used on a (typically private) blockchain, which allows the token to be utilized within specific protocols.

How do you do tokenization?

The simplest way to tokenize text is to use whitespace within a string as the “delimiter” of words. This can be accomplished with Python's split function, which is available on all string object instances as well as on the string built-in class itself. You can change the separator any way you need.
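For example:

# Whitespace tokenization with Python's built-in split
print("Tokenization is a crucial step in NLP.".split())
# ['Tokenization', 'is', 'a', 'crucial', 'step', 'in', 'NLP.']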

Why do we need tokenizer?

Information retrieval: Tokenization is important in search engines and information retrieval systems. These systems break down information into tokens to better index and analyze information. Text preparation: Tokenization helps classify and prepare text by categorizing text into predefined tokens.

What is the main reason for tokenization?

What is the Purpose of Tokenization? The purpose of tokenization is to protect sensitive data while preserving its business utility. This differs from encryption, where sensitive data is modified and stored with methods that do not allow its continued use for business purposes.

How will tokenization work?

Tokenization replaces sensitive data with unique tokens that have no intrinsic value, while encryption transforms data into an unreadable format that can be reversed with a decryption key.

What is pre-processing of a text document?

The text preprocessing process involves unitization and tokenization, standardization and cleaning, stop word removal, and lemmatization or stemming. A custom stop word dictionary can be created to eliminate noise in the text.

Why do we preprocess text data?

The goal of text preprocessing is to enhance the quality and usability of the text data for subsequent analysis or modeling. Text preprocessing typically involves the following steps: Lowercasing. Removing Punctuation & Special Characters.

How to do text processing?

The process of text preprocessing removes all the noise from our text data to make it ready for text representation and to be trained for the machine learning model. The key steps in data preprocessing are tokenization, stopword removal, punctuation removal, lemmatization, and stemming.
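A minimal sketch of such a pipeline using only the Python standard library (the stopword set here is a tiny illustrative subset, and stemming/lemmatization is omitted):

import string

# Tiny illustrative stopword set; real pipelines use a full list (e.g. from NLTK or spaCy)
STOPWORDS = {"is", "a", "the", "in", "to", "and"}

def preprocess(text):
    text = text.lower()                                                # lowercasing
    text = text.translate(str.maketrans("", "", string.punctuation))   # punctuation removal
    tokens = text.split()                                              # whitespace tokenization
    return [tok for tok in tokens if tok not in STOPWORDS]             # stopword removal

print(preprocess("Tokenization is a crucial step in NLP."))
# ['tokenization', 'crucial', 'step', 'nlp']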

How do I clean a text dataset?

Top 20 Essential Text Cleaning Techniques
  1. Removing HTML Tags and Special Characters. HTML tags and special characters are common in web-based text data. ...
  2. Tokenization. ...
  3. Lowercasing. ...
  4. Stopword Removal. ...
  5. Stemming and Lemmatization. ...
  6. Handling Missing Data. ...
  7. Removing Duplicate Text. ...
  8. Dealing with Noisy Text.
