1. Introduction
In recent years, cryptocurrencies have emerged as a significant asset class in the global financial landscape, attracting substantial interest from investors seeking diversification and alternative investment opportunities. The decentralized nature and technological innovations underpinning cryptocurrencies, such as Bitcoin and Ethereum, have reshaped traditional financial paradigms, introducing new dynamics driven by market sentiment and information dissemination.
Understanding public sentiment toward a specific cryptocurrency in advance and monitoring market tension can help predict price movements. Negative sentiment in the crypto market often leads to price drops, while positive sentiment tends to drive price increases. For instance, recent fears of a potential conflict between Israel and Iran caused a drop of roughly 10% in the value of cryptocurrencies across the market.
In this context, our study focuses on the comparative analysis of large language models (LLMs) and NLP techniques in the domain of cryptocurrency sentiment analysis. Specifically, we investigate the efficacy of fine-tuned models such as GPT-4, BERT, and FinBERT for discerning sentiment from cryptocurrency news articles. By leveraging the capabilities of these sophisticated language models, we aim to explore how effectively they can capture nuanced sentiment patterns, thus contributing to a deeper understanding of sentiment dynamics within the cryptocurrency market.
The primary objectives of this study are twofold: firstly, to evaluate the performance of LLMs and NLP models in cryptocurrency sentiment analysis through a comparative classification study, providing insights that can inform investment strategies and risk management practices within cryptocurrency markets; and secondly, to address specific research inquiries that previous studies have not adequately covered.
- Q1: Among the fine-tuned models of GPT-4, BERT, and FinBERT, which one demonstrates superior predictive capabilities in cryptocurrency news sentiment analysis and classification?
- Q2: What is the impact of fine-tuning LLMs and NLP models for specific tasks?
2. Literature Review
Cryptocurrency sentiment analysis involves employing sophisticated NLP models to analyze textual data sources like news articles, social media posts, forums, and blogs related to cryptocurrencies. The primary objective is to extract and quantify the sentiment (whether positive, negative, or neutral) expressed in these texts to comprehend the prevailing market sentiment towards specific cryptocurrencies or the broader market as a whole.
The significance of cryptocurrency sentiment analysis in financial markets is underscored by several compelling reasons:
- Market sentiment understanding: Cryptocurrency markets are highly sensitive to sentiment influenced by news events, social media trends, regulatory developments, and investor perceptions [4]. Sentiment analysis aids in assessing the prevailing sentiment and mood of market participants, offering insights into potential market movements.
- Decision-making support: Sentiment analysis plays a pivotal role in guiding investment decisions. Traders and investors leverage sentiment-driven insights to make informed decisions; positive sentiment may signal buying opportunities, while negative sentiment could prompt caution or sell signals [7,8].
- Risk management: Assessing sentiment contributes to managing risks linked with cryptocurrency investments [9]. Sudden shifts in sentiment towards specific cryptocurrencies could indicate potential price volatility or market downturns.
In the cryptocurrency ecosystem, sentiment analysis offers unique advantages owing to the decentralized and rapidly evolving nature of cryptocurrencies. The wealth of online data sources, including social media platforms and cryptocurrency news websites, renders sentiment analysis particularly relevant and valuable for comprehending market behavior and investor sentiment in this domain.
By harnessing sentiment analysis techniques, researchers and market participants attain deeper insights into the factors driving cryptocurrency markets, empowering them to make more informed decisions based on sentiment-driven intelligence. This underscores the pivotal role of sentiment analysis in enriching market intelligence and facilitating effective decision-making within the dynamic and volatile cryptocurrency landscape.
2.1. NLP in Sentiment Analysis
NLP is integral to the field of sentiment analysis, offering tailored techniques to effectively extract sentiment from textual data. In sentiment analysis, NLP methods are essential for discerning the emotions and opinions conveyed through language. Some of the most widely used techniques and applications within financial contexts include:
- Sentiment lexicons: Sentiment lexicons are curated dictionaries featuring words or phrases annotated with sentiment scores (e.g., positive, negative, neutral) [7], aiding sentiment analysis by quantifying sentiment from specific word usage. Widely used lexicons like SentiWordNet and Vader are foundational resources. In financial sentiment analysis, specialized lexicons capturing financial terminology enhance accuracy by accounting for domain-specific expressions and market conditions.
- Deep learning approaches: Deep learning architectures, such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), and bidirectional encoder representations from transformers (BERT) are extensively applied in sentiment analysis [10]. CNNs are effective for feature extraction, while RNNs excel in sequence modeling [11], and BERT specializes in capturing contextual information from language. Their application demonstrates remarkable accuracy in deciphering intricate patterns within text. These models undergo fine-tuning with financial datasets to better understand the nuances of financial language, thereby improving sentiment analysis in financial markets. Furthermore, integrating domain-specific features through NLP techniques enhances their effectiveness, enabling the extraction of meaningful signals from financial text data [12].
2.2. Previous Studies on Sentiment Analysis in Cryptocurrencies
Our literature review presents a comprehensive overview of 49 papers on the evolving role of sentiment analysis in understanding and predicting cryptocurrency market dynamics. The studies showcase a diverse array of methodologies, ranging from traditional sentiment analysis to cutting-edge deep learning models, that highlight the significant impact of sentiment-driven factors on cryptocurrency price movements and investor behavior. By leveraging data sources such as social media platforms, news articles, and market indicators, researchers demonstrate the potential of sentiment analysis to provide valuable insights for navigating the volatile cryptocurrency landscape and making informed investment decisions. Despite the progress made, challenges such as bias mitigation in sentiment data and the nuanced interplay between sentiment and fundamental market factors underscore the ongoing need for further exploration and refinement in this burgeoning field of research. Through innovative approaches and practical applications, the reviewed studies collectively contribute to advancing sentiment analysis as a powerful tool for cryptocurrency market analysis and decision-making.
2.2.1. Traditional Supervised Learning
2.2.2. Deep Learning
2.2.3. Lexicon-Based Sentiment Analysis
2.2.4. BERT and Transformer Models
2.2.5. Time-Series Analysis
2.2.6. Computational Text Analysis
2.2.7. Hybrid Models
2.2.8. Other Techniques
3. Materials and Methods
This study initially focuses on utilizing the innovative GPT-4 LLM, alongside a parallel comparison with BERT and FinBERT NLP models. All the models have demonstrated proficiency in comprehending human text. In this proposed approach, the models will undergo fine-tuning on a dataset using few-shot learning, and their sentiment analysis capabilities will be evaluated through direct comparisons before and after fine-tuning.
In the early stages of our research, we rigorously assess GPT-4's ability to accurately identify sentiments within a crypto news article. This involves conducting an exhaustive performance evaluation by comparing the base GPT-4 model with its fine-tuned counterpart. Subsequently, we proceed to fine-tune the BERT and FinBERT models for cryptocurrency sentiment analysis. We compare the fine-tuned LLM and NLP models, demonstrating how the fine-tuning process enhances their capability to effectively address the complex challenges of sentiment analysis with improved precision.
To meet this paperโs goals, a tailored research methodology was crucial, chosen to facilitate data cleaning, feature engineering, model implementation, and fine-tuning. Subsequent subsections elaborate on this framework, offering a comprehensive strategy for the study.
3.1. Dataset Splitting, Cleaning, and Preprocessing
To ensure the quality and effectiveness of our predictive modeling and fine-tuning procedures, we meticulously prepared the dataset through a systematic approach comprising several essential steps designed to enhance its suitability for classification and sentiment analysis tasks.
Initially, we parsed the 'sentiment' column, extracting solely the class label from dictionaries like {'class': 'negative', 'polarity': -0.01, 'subjectivity': 0.38}. Following this, we focused on data preprocessing to enhance dataset quality, specifically removing special characters and unnecessary white spaces to maintain data integrity and coherence. This step was crucial to ensure compatibility for modeling purposes and prevent unwanted noise that could affect NLP model performance.
Additionally, text normalization played a crucial role in ensuring uniformity and standardization of text data. Operations included converting accented characters to their base forms, ensuring consistent treatment of words with accent variations. Furthermore, to achieve case-insensitivity, we systematically converted all text to lowercase.
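These parsing and normalization steps can be sketched in Python; the helper names and the exact character whitelist are our own illustrative choices, not the study's code:

```python
import ast
import re
import unicodedata

def extract_label(sentiment_cell: str) -> str:
    """Pull only the class label out of a stringified sentiment dict,
    e.g. "{'class': 'negative', 'polarity': -0.01, 'subjectivity': 0.38}"."""
    return ast.literal_eval(sentiment_cell)["class"]

def normalize_text(text: str) -> str:
    """Fold accented characters to their base forms, drop special characters
    and redundant whitespace, and lowercase for case-insensitivity."""
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")   # e.g. café -> cafe
    text = re.sub(r"[^a-zA-Z0-9\s.,!?$%-]", " ", text)      # strip special chars
    return re.sub(r"\s+", " ", text).strip().lower()
```

Applied row by row, these two helpers yield a clean label column and normalized article text ready for tokenization.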
For our few-shot learning study, we randomly selected 5000 rows from the dataset of 31,037, ensuring equal distribution of labels across the three sentiment categories.
In NLP tasks, dataset division is pivotal for effective model development, refinement, and evaluation. Our methodology involved splitting the dataset into training, validation, and test sets, facilitating model learning from input data during fine-tuning, tuning hyperparameters for enhanced generalization, and ultimately evaluating model performance through predictions on the test set.
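A stratified split along these lines can be sketched as follows; the 64/16/20 proportions are an assumption consistent with the 3200-article training set and 1000-article test set reported later, and the column name is illustrative:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_dataset(df: pd.DataFrame, seed: int = 42):
    """Stratified train/validation/test split (64/16/20 of the rows).
    Stratifying on the label column preserves the balanced class
    distribution of the sampled dataset in every partition."""
    train_val, test = train_test_split(
        df, test_size=0.20, stratify=df["label"], random_state=seed)
    train, val = train_test_split(
        train_val, test_size=0.20, stratify=train_val["label"], random_state=seed)
    return train, val, test
```

On a balanced 5000-row sample this would produce 3200 training, 800 validation, and 1000 test rows.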
Fine-tuning BERT or its variants (like FinBERT) typically requires label encoding due to the nature of the underlying frameworks (like PyTorch and TensorFlow), which expect numerical labels for classification tasks. Although higher-level libraries (such as Hugging Face's Transformers) allow working with string labels at a more abstract level, these labels are converted to numerical values internally. For example, when using Hugging Face's Transformers library, you can specify labels as strings in the configuration or during dataset preparation, but the framework will encode these labels into integers for processing by the model. For our classification task, we transformed the string sentiment labels into integers using nominal encoding: positive as 1, negative as 0, and neutral as 2, without implying any order. Additionally, we employed cross-entropy loss, the default loss function in BertForSequenceClassification, which is well suited for classification tasks with nominally encoded labels.
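A minimal sketch of this nominal encoding and the cross-entropy objective (the logits are made-up numbers for illustration only):

```python
import torch
import torch.nn.functional as F

# Nominal encoding used in the study: the integer values carry no order.
label2id = {"negative": 0, "positive": 1, "neutral": 2}
id2label = {v: k for k, v in label2id.items()}

labels = torch.tensor([label2id[s] for s in ["positive", "neutral", "negative"]])
logits = torch.tensor([[0.1, 2.0, 0.3],   # model favors class 1 for row 0
                       [0.2, 0.1, 1.5],   # class 2 for row 1
                       [1.8, 0.2, 0.1]])  # class 0 for row 2
# BertForSequenceClassification applies this same cross-entropy loss internally.
loss = F.cross_entropy(logits, labels)
```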
3.2. GPT Prompt Engineering
Our objective was to develop a prompt that seamlessly integrates with various LLMs, enhancing the accessibility of their results through our code. We focused not only on the prompt's content but also on its output formatting. To provide further clarity on the process of deriving the prompt design, we elaborate on several key aspects:
- Initial research and understanding: The process began with a thorough examination of various language model architectures, including but not limited to GPT-3, GPT-4, and LLaMA-2. This involved understanding their capabilities, limitations, and unique features. This foundational research ensured that our prompt design would be compatible and effective across a diverse range of models.
- Identification of challenges: We identified two primary challenges in crafting the prompt: (a) creating content that would be model-agnostic and (b) ensuring the accessibility of the output format [55]. The former required developing a prompt that could be understood and responded to by any LLM, regardless of its architecture. The latter involved designing an output format that would facilitate ease of use and integration into code, particularly emphasizing JSON formatting for versatility and compatibility.
- Finalization and validation: After rigorous testing and refinement, we arrived at a finalized prompt that effectively elicited responses in the desired output format. Validation through Source Code 1 demonstrates the prompt's ability to generate meaningful responses comprehensible to all models, thus fulfilling our objectives of accessibility and compatibility.
conversation.append({"role": "system", "content": "You are a crypto expert."})
conversation.append({"role": "user",
    "content": "Evaluate the sentiment of the news article. "
               "Return your response in JSON format {\"sentiment\": \"negative\"} "
               "or {\"sentiment\": \"neutral\"} or {\"sentiment\": \"positive\"}. "
               "Article:\n" + input["text"]})
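Because a model may occasionally wrap the requested JSON in extra text, a defensive parser helps when integrating replies into code. A minimal sketch, with a helper name of our own choosing rather than the study's code:

```python
import json
import re
from typing import Optional

VALID_LABELS = {"negative", "neutral", "positive"}

def parse_sentiment(reply: str) -> Optional[str]:
    """Extract the sentiment label from a reply expected to contain a JSON
    object like {"sentiment": "negative"}; returns None on malformed output."""
    match = re.search(r'\{[^{}]*"sentiment"[^{}]*\}', reply)
    if match is None:
        return None
    try:
        label = json.loads(match.group(0)).get("sentiment")
    except json.JSONDecodeError:
        return None
    return label if label in VALID_LABELS else None
```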
3.3. Model Deployment, Fine-Tuning, and Predictive Evaluation
As discussed earlier in this study, we employ two distinct categories of models, LLMs and NLP models, to address sentiment analysis and classification tasks within the cryptocurrency sector. While previous research has explored cryptocurrency sentiment analysis using various algorithms and techniques, there is a significant gap in utilizing LLMs for this purpose. In this study, we have decided to extensively test the GPT-4, BERT, and FinBERT models both before and after fine-tuning. Below, we present an overview of how each model is deployed, showcasing our innovative approach to utilizing them for this specific task.
3.3.1. GPT Model Deployment and Fine-Tuning
Conceptual Background of LLM Fine-Tuning
Fine-tuning an LLM involves taking a pre-trained language model and training it further on a specific dataset to adapt it to a particular task. This process allows the model to learn task-specific patterns and nuances that its general pre-training phase may not cover. Several techniques make fine-tuning more efficient at scale:
- Low-rank adaptation (LoRA): this technique reduces the number of trainable parameters by decomposing the weight updates into lower-rank matrices, making the training process more efficient.
- Parameter-efficient fine-tuning (PEFT): PEFT methods fine-tune only a small subset of the model's parameters, reducing the computational load and memory usage.
- DeepSpeed: an optimization library that facilitates efficient large-scale model training by improving GPU utilization and memory management.
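To make the LoRA decomposition concrete, a toy sketch follows (the dimensions are arbitrary; B is zero-initialized, so training starts from the pre-trained behavior):

```python
import torch

d, k, r = 768, 768, 8          # layer dimensions and low rank, r << min(d, k)
W = torch.randn(d, k)          # frozen pre-trained weight
B = torch.zeros(d, r)          # LoRA factors: only B and A are trained
A = torch.randn(r, k) * 0.01

x = torch.randn(4, k)
h = x @ (W + B @ A).T          # adapted forward pass: W' = W + BA

full_params = d * k            # 589,824 trainable values without LoRA
lora_params = r * (d + k)      # 12,288 with it, roughly 2% of the original
```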
Fine-Tuning Phase
We fine-tuned the gpt-4-0125-preview base model using the official OpenAI API, incorporating the following steps:
- Data preparation: We generated two JSONL files for training and validation, containing prompt-completion pairs (Source Code 2). As described in Section 3.1, these pairs were derived from our training and validation CSV files, which contain the textual content of news articles and their corresponding sentiment labels.
- Training configuration: the fine-tuning was conducted with the following hyperparameters:
  - Job ID: ft:gpt-4-turbo-0125:personal::9Aa7kYOh.
  - Total tokens: 765,738.
  - Epochs: 3.
  - Batch size: 6.
  - Learning rate (LR) multiplier: 8.
  - Seed: 65426760.
- Training process: The model underwent a multi-epoch training strategy, iteratively refining its understanding and capabilities. The initial training loss was 0.9778, which gradually decreased to nearly 0.1, indicating a significant improvement in model performance over time.
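The prompt-completion JSONL records described above can be generated with a short script; the helper names are our own, and the prompt text mirrors Source Code 2:

```python
import json

PROMPT = (
    "Evaluate the sentiment of the news article. "
    'Return your response in JSON format {"sentiment": "negative"} '
    'or {"sentiment": "neutral"} or {"sentiment": "positive"}. '
    "Article:\n"
)

def make_record(text: str, label: str) -> dict:
    """One chat-format training example for the OpenAI fine-tuning JSONL files."""
    return {"messages": [
        {"role": "system", "content": "You are a crypto expert."},
        {"role": "user", "content": PROMPT + text},
        {"role": "assistant", "content": json.dumps({"sentiment": label})},
    ]}

def write_jsonl(rows, path: str) -> None:
    """rows: iterable of (article_text, sentiment_label) pairs."""
    with open(path, "w", encoding="utf-8") as f:
        for text, label in rows:
            f.write(json.dumps(make_record(text, label)) + "\n")
```

The resulting training and validation files would then be uploaded and a fine-tuning job launched through the OpenAI API with the hyperparameters listed above.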
Sentiment Analysis and Evaluation Phase
Following fine-tuning, the model was deployed to perform sentiment analysis on a test set of news articles. The process included the following steps:
- Prediction generation: The fine-tuned model was tasked with predicting the sentiment of each article in the test set by making calls to the OpenAI API, specifying the ID of the fine-tuned model. Analyzing the textual content, the model examined the articles and generated sentiment labels based on the patterns learned during fine-tuning, presenting the results in JSON format.
- Comparison with original labels: The predicted sentiment labels were rigorously compared against the original labels in the dataset. This comparison facilitated a comprehensive analysis to assess the model's effectiveness in capturing article sentiments and its alignment with human judgments.
- Integration of results: the outcomes of the sentiment analysis were integrated into the test_set.csv file, providing a consolidated view for subsequent comparative analyses.
- Performance metrics: To evaluate the model's performance, standard metrics such as accuracy, precision, recall, and F1-score were computed. These metrics provided insights into the model's ability to generalize and perform accurately on unseen data.
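The metric computation can be sketched with scikit-learn; macro averaging over the three classes is our assumption for aggregation, and the helper name is illustrative:

```python
from sklearn.metrics import (accuracy_score, mean_absolute_error,
                             precision_recall_fscore_support)

def evaluate(y_true, y_pred):
    """Compute the standard metrics reported in the study from nominally
    encoded labels; macro averaging weighs the three classes equally."""
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    mae = mean_absolute_error(y_true, y_pred)
    return {"accuracy": acc, "precision": prec, "recall": rec,
            "f1": f1, "mae": mae}
```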
The same prediction and evaluation procedure was applied to the GPT-4 base model in a zero-shot setting, without any fine-tuning. This allowed for a direct comparison between the base model and the fine-tuned model, underscoring the necessity of fine-tuning for specific tasks.
{"messages": [
  {"role": "system", "content": "You are a crypto expert."},
  {"role": "user", "content": "Evaluate the sentiment of the news article. Return your response in JSON format {\"sentiment\": \"negative\"} or {\"sentiment\": \"neutral\"} or {\"sentiment\": \"positive\"}. Article:\n ..."},
  {"role": "assistant", "content": "{\"sentiment\":\"negative\"}"}
]}
3.3.2. BERT and FinBERT Model Deployment and Fine-Tuning
Training Phase
- Data preparation: the dataset, as described in Section 3.1, was prepared for our task, resulting in the creation of three files: test, train, and validation CSV files.
- Training configuration: both the bert-base-uncased and FinBERT models were trained with the following hyperparameters:
  - Optimizers: Adam and AdamW.
  - Epochs: 3.
  - Batch size: 6.
  - Learning rate: 6.
  - Maximum sequence length: 512.
- Training process: The training procedure took place in the Google Colab environment. Each model underwent three epochs of training, with progress monitored using the tqdm library. The training involved backpropagation, optimization, and validation, leveraging an A100 GPU for accelerated computations. Post-training, the fine-tuned models and tokenizers were saved to a directory in Google Drive for future use.
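A runnable sketch of this training loop follows; a tiny randomly initialized configuration stands in for the pretrained bert-base-uncased checkpoint so the example runs without downloading weights, and the batch size and epoch count mirror the configuration above (the study used sequences of up to 512 tokens):

```python
import torch
from torch.optim import AdamW
from transformers import BertConfig, BertForSequenceClassification

# Tiny random config as a stand-in; the study loads the pretrained checkpoint
# via BertForSequenceClassification.from_pretrained("bert-base-uncased", ...).
config = BertConfig(vocab_size=1000, hidden_size=64, num_hidden_layers=2,
                    num_attention_heads=2, intermediate_size=128, num_labels=3)
model = BertForSequenceClassification(config)
optimizer = AdamW(model.parameters(), lr=2e-5)

# Dummy tokenized batch: 6 articles, short sequences for the sketch.
input_ids = torch.randint(0, 1000, (6, 32))
attention_mask = torch.ones_like(input_ids)
labels = torch.randint(0, 3, (6,))

model.train()
for epoch in range(3):
    optimizer.zero_grad()
    out = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
    out.loss.backward()   # cross-entropy loss is built in for num_labels > 1
    optimizer.step()
```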
Sentiment Analysis and Evaluation Phase
4. Results
4.1. GPT Base Model Evaluation Phase
In the initial phase of our study, we used the GPT-4 LLM base model, specifically gpt-4-0125-preview, to conduct sentiment analysis and classification on cryptocurrency news articles. We employed the test set consisting of 1000 articles with the goal of assessing the accuracy of the GPT base model in categorizing each article's sentiment (positive, neutral, or negative).
Upon thorough analysis of the model's outputs and comparison with the original user-provided labels, we made several notable observations. The gpt-4-0125-preview model exhibited a commendable level of accuracy, correctly predicting the sentiment class for 82.9% of the articles. This equates to successful predictions for 829 out of the 1000 articles in our test dataset. These results demonstrate that even the base model, without specific fine-tuning, shows a significant ability to infer sentiment from content based on contextual clues.
It is important to highlight that achieving an 82.9% accuracy rate in predictive modeling is quite high, indicating that the model can be confidently applied to sentiment analysis and classification tasks. Nonetheless, further improvements can be achieved by fine-tuning the models to better suit the specific task at hand. This refinement process can enhance their ability to capture and understand complex patterns and relationships within the data.
4.2. Fine-Tuned Models Evaluation Phase
In the subsequent phase of our research, we focused on fine-tuning the GPT-4, BERT, and FinBERT models using two different optimizers. This fine-tuning process was conducted using a training set of 3200 cryptocurrency articles, with the primary objective of enhancing the modelsโ performance and their ability to analyze cryptocurrency sentiment.
Precision values, indicating the models' ability to accurately classify instances within each class, vary across models and classes. Interestingly, the GPT base and fine-tuned models demonstrate relatively higher precision in predicting positive labels (class 1), while both the BERT and FinBERT models excel in predicting neutral labels (class 2), suggesting a higher tendency toward false positives for classes 0 and 1.
The recall values reflect the models' proficiency in correctly capturing instances of each class. For instance, the ft:finbert-adamw model exhibits high recall for class 2, highlighting its effectiveness in identifying instances belonging to that class.
The F1-score, striking a balance between precision and recall for each class, showcases the ft:gpt-4 model's superior performance with the highest F1-score for class 1, indicating its balanced precision and recall for that class.
In summary, while each model displays strengths in specific metrics, the fine-tuned GPT-4 model emerges as the standout performer with high accuracy, balanced precision and recall, and low mean absolute error, suggesting its superior performance across multiple evaluation criteria. Nevertheless, the choice of model may vary depending on specific use cases or priorities, where other models may offer advantages in certain aspects such as recall for specific classes or precision in particular scenarios.
4.3. Assessing Modelsโ Performance and Proximity with Original Labels
Armed with these insights, further optimizations can be pursued through adjustments to enhance overall model performance.
5. Discussion
In earlier sections, we delved into the methodology and outcomes related to the predictive performance of the GPT-4 LLM, BERT, and FinBERT NLP models before and after fine-tuning for cryptocurrency news sentiment analysis and classification. This section unveils the research discoveries and insights gained by the authors concerning the effectiveness of LLMs and NLP models as valuable tools for sentiment analysis in the cryptocurrency domain, addressing pertinent research inquiries.
5.1. Research Findings
Moreover, both the BERT and FinBERT models with the Adam optimizer nearly matched the predictive capability of a robust LLM, differing by only 3.4% and 2.4%, respectively. This small variation could be attributed to differences in their pre-training phases, where the GPT-4 model had the advantage of exposure to a wider and more varied set of public datasets. This extensive training likely facilitated more comprehensive and targeted fine-tuning, granting the GPT model a deeper comprehension of nuanced cryptocurrency themes. Consequently, the GPT model might be better positioned to produce precise and contextually relevant responses.
It is important to note that each model exhibits higher accuracy for specific labels. For instance, the fine-tuned GPT-4 model is more accurate in predicting positive labels, while BERT and FinBERT excel in predicting neutral labels. This highlights that each model has its own strengths and weaknesses, suggesting that a hybrid approach would be more effective in maximizing results in a production environment.
Finally, it is imperative to highlight that even without fine-tuning, the GPT-4 base model achieved 82.9% accuracy. This demonstrates that LLMs, due to their extensive pre-training with billions of parameters, can perform tasks with high accuracy whether fine-tuned on a specific dataset or not. Such capabilities could lead to the development of zero-shot, simple yet accurate tools for non-specialized teams and organizations.
5.2. The Impact of Fine-Tuning
Customizing LLMs and NLP models via fine-tuning for specialized tasks within domains like finance and cryptocurrency is essential to optimize their effectiveness in practical applications. Models such as OpenAI's GPT-4 or BERT are initially trained on vast text datasets from diverse sources, which equips them to understand and generate human-like text across various domains. However, fine-tuning these models for specific tasks like sentiment analysis significantly boosts their performance and relevance in focused contexts. For instance, the FinBERT model, a variant of BERT, is fine-tuned on specialized labeled and unlabeled financial data. Moreover, even after such domain adaptation, these models can undergo further fine-tuning for specific downstream tasks.
Based on the results derived from this study, fine-tuning an LLM or an NLP model on large and representative datasets allows it to gain a deeper comprehension of the nuances and patterns unique to sentiment analysis, such as in the crypto domain, leading to more accurate predictions. This adaptation involves learning to recognize crypto-related terminology, grasp context, and extract relevant features from textual inputs. Through iterative adjustments in the fine-tuning process, using both training and validation data, the model becomes increasingly proficient at capturing these domain-specific intricacies, resulting in improved accuracy and other performance metrics.
5.3. Exploring the Cost and Usability of LLMs and NLP Models
Our research revealed that both the GPT and NLP models achieved high accuracy in the sentiment analysis of cryptocurrency news articles. While the fine-tuned GPT model showed slightly higher accuracy compared to BERT and FinBERT, it is important to note that BERT is an open-source model available for free use in a self-hosted environment. On the other hand, GPT-4 requires access through the official OpenAI API, which incurs a cost.
For stakeholders or investors seeking accurate and immediate predictions, the GPT API, despite its cost, offers a more convenient and sustainable solution due to its user-friendly interface. Conversely, for companies equipped with an AI team, deploying a self-hosted solution with BERT or similar NLP models may be a more cost-effective choice.
Specifically, within the cryptocurrency sector, where prices can fluctuate within minutes, software equipped with NLP technologies can detect market trends through news blogs, Telegram channels, or social media, providing crypto investors with information to adjust their strategies and potentially maximize profits. For instance, if a global event like a conflict between two countries is imminent, an investor could rely on the model to analyze user comments on platforms like Telegram and preemptively sell certain cryptocurrencies to avoid a significant drop in their value. Conversely, investors could also take advantage of low-cost cryptocurrencies that, based on crypto signals, might rapidly increase in value, leading to substantial gains.
Nevertheless, it is prudent to caution investors about spam comments or articles deliberately created to stimulate investment. For example, many articles are written daily speculating that Ethereum could reach $10,000 by 2025. Such articles serve two purposes: generating clicks and enticing inexperienced investors to invest in the cryptocurrency with the expectation of substantial value multiplication.
6. Conclusions
In summary, the integration of LLMs and NLP models for cryptocurrency sentiment analysis represents a powerful toolset that enhances investment decision-making in the dynamic cryptocurrency market. This study showcases the efficacy of state-of-the-art models like GPT-4 and BERT in accurately interpreting and categorizing sentiments extracted from cryptocurrency news articles. The key strength of these models lies in their ability to capture subtle shifts and complexities in market sentiments, enabling investors to navigate the volatile cryptocurrency landscape with greater insight and confidence.
By leveraging advanced NLP capabilities, including few-shot fine-tuning processes, this study highlights the adaptability and robustness of LLMs and NLP models in analyzing sentiment data within the cryptocurrency domain. These findings underscore the transformative potential of advanced NLP techniques in empowering investors with actionable insights, enabling them to proactively manage risks, identify emerging trends, and optimize investment strategies to maximize returns.
Specifically, the application field of the research results extends to cryptocurrency investment and risk management. By providing a deeper understanding of sentiment dynamics in cryptocurrency markets, this study facilitates a more informed and data-driven approach to cryptocurrency investment, enabling investors to make well-informed decisions based on the real-time sentiment analysis of news articles and other relevant sources.