Introduction to ChatGPT for R Programming

In this lab, we explore how to use ChatGPT effectively and responsibly for programming in R.

While ChatGPT can be a powerful tool for solving specific programming challenges, it can also easily be misused if not properly understood.

ChatGPT often confidently offers incorrect solutions. Always double-check its work, and remember that critical thinking is more crucial than ever in the era of LLMs.

What is ChatGPT?

GPT-3.5 and GPT-4 are examples of Large Language Models (LLMs), a type of deep learning model designed to interpret and generate human language. These are the underlying models of the free and pro versions of ChatGPT as of January 2024. LLMs are trained on vast amounts of text data. The training process uses machine learning frameworks such as PyTorch or TensorFlow to train neural networks whose layers simulate a form of artificial “neurons” and “synapses,” with weights that are adjusted during training. The outcome of this training is a file containing a complex array of weights that capture patterns in the data. To understand the scale, the GPT-3 model contains 175 billion parameters but was trained on only an estimated 570GB of filtered text data, because training these types of models is extremely computationally expensive.

These massive, non-human-readable files containing the weights of the neural network are what we refer to as “models,” and they may look something like this if opened in an ordinary text editor:


�i�]��g���a�<�f�+U�k�:�ڷ�2���A��ܹ������n��7�b�>�O��W�
��N�]��ɻ��Z��Ś�M8�ټ�X��v�b��^�=�r��x�d�K�G�ۏ�0�|�����
��ؽ��U�z�����q�֔�:�r�:�%��ˬ�E�,�����+�'�����E��j�i�

LLMs use these weights to predict the next item in a sequence. Given a text that begins with “Paris is the capital of,” the model will predict that the next word is likely to be “France” based on the training it has received. It does this by calculating the probability of each candidate next word and selecting the most likely option. This is a simplification, but it captures the essence of how these models operate: they predict text based on patterns learned from a dataset. The LLM does not “understand” the text it is generating. It is simply predicting the most likely next piece of text based on its training, which is crucial to keep in mind when using a model like ChatGPT.
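The prediction step described above can be sketched in a few lines of base R. This is a toy illustration only: the candidate words and their scores are invented for this example, and a real model scores tens of thousands of tokens using billions of weights.

```r
# Hypothetical raw scores for candidate next words after
# "Paris is the capital of". The numbers are made up for illustration.
scores <- c(France = 6.0, Europe = 2.5, fashion = 2.0, Texas = 1.0)

# Softmax turns raw scores into probabilities that sum to 1.
softmax <- function(x) {
  z <- exp(x - max(x))  # subtracting the max keeps exp() numerically stable
  z / sum(z)
}

probs <- softmax(scores)
print(round(probs, 3))

# Greedy choice: take the single most likely candidate.
next_word <- names(probs)[which.max(probs)]
print(next_word)
```

Nothing in this calculation checks whether the chosen word is true. The model only ranks candidates by how well they fit the patterns it has seen, which is why a fluent answer can still be wrong.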

ChatGPT for Programming: Responsible and Irresponsible Uses

Responsible Uses
- Solve discrete, well-defined programming issues
- Generate conceptual examples for guidance
- Explain coding concepts and functions
- Offer potential debugging strategies

Irresponsible Uses
- Writing large amounts of code without oversight
- Expecting error-free or optimal code as the free version does not run any code
- Substituting comprehensive understanding of the language which is necessary to critically analyze ChatGPT outputs
- Trusting ChatGPT for up-to-date information without considering the age of the training data
- Making assumptions about what ChatGPT knows; LLMs lack “common sense” and explicit context must be provided

Tips for Optimal Use of ChatGPT

Context Is Key: Supply detailed context to enhance the accuracy of ChatGPT’s responses. If you don’t fully understand what you need and cannot explain it in great detail, ChatGPT will not understand either but will make incorrect assumptions to fill in any blanks.
Avoid Negatives: AI in general struggles with negatives. For example, “No wheat in the recipe” can often be misinterpreted as “… wheat in the recipe.” A better option is to stick to positives, “Gluten-free recipe.”
Divide and Conquer: Break down complex issues into smaller, more manageable tasks. Pose sequential questions to guide ChatGPT through the problem-solving process.
Verify Solutions: Cross-reference ChatGPT’s solutions with trusted sources or work with ChatGPT to generate code you can use to verify solutions make sense.
Incorporate Debugging Into Your Code: Use ChatGPT to incorporate debugging strategies into your code, such as printing intermediate results.
Incremental Testing: Test changes incrementally to confirm their correctness.
Feedback Loop: Remember, the free version of ChatGPT cannot run code, so you must provide specific feedback to steer ChatGPT in the right direction.
Provide Examples: Even if you think it might be obvious, always provide examples of what you are looking for. A fantastic approach to using ChatGPT for programming is to provide dummy results that look like what you would like to produce.
Consistent Threads: Threads are an amazing and often overlooked tool to enhance ChatGPT. Maintaining a thread for a given topic keeps your earlier context available, so the responses in that thread become better tailored to your needs.
Avoid Leading Questions: ChatGPT can be easily coerced by leading questions. To get less biased results, avoid leading questions.
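As one way to put the “Context Is Key,” “Provide Examples,” and “Verify Solutions” tips into practice, the sketch below builds a small input, prints it in a form that can be pasted into a prompt, writes the desired result by hand, and then checks a candidate solution against it. The object names (`desired`, `candidate`) are hypothetical, and the `transform()` call only stands in for whatever code ChatGPT might suggest.

```r
# Small example input, kept tiny so it fits comfortably in a prompt.
data <- data.frame(
  Category = c("A", "B", "C"),
  Value = c(5, 3, 15)
)

# dput() prints R code that recreates the object exactly.
# Paste its output into the prompt instead of describing the data in words.
dput(data)

# Dummy result written by hand: this is what the answer should look like.
desired <- data.frame(
  Category = c("A", "B", "C"),
  Value = c(5, 3, 15),
  RelativeValue = c(5, 3, 15) / 23
)

# Stand-in for code suggested by ChatGPT.
candidate <- transform(data, RelativeValue = Value / sum(Value))

# Print the intermediate result, then verify it against the target.
print(candidate)
stopifnot(isTRUE(all.equal(candidate, desired)))
```

If the final check fails, the mismatch itself is useful feedback to paste back into the thread.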

My biggest tip: Imagine you are a manager and ChatGPT is your perpetually brand-new employee. They are an enthusiastic and very fast worker, but are often wrong, don’t have any background knowledge in your field yet, and are extremely forgetful. Let them do the grunt work, but you need to be the one to always double-check their work. If ChatGPT is your brand-new employee, think of yourself as the editor-in-chief.

Ethical Considerations

Acknowledgment of AI Assistance: Credit ChatGPT’s contributions where appropriate. Do not misrepresent AI-generated content as solely human-created.

Activity

Practice using ChatGPT to troubleshoot the following code. What do you find? What does this reveal about how ChatGPT should be used and why you need to be very cautious when using an LLM?

rm(list=ls()) 

library(ggplot2)

Warning: package 'ggplot2' was built under R version 4.3.2

library(dplyr)

Warning: package 'dplyr' was built under R version 4.3.2


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(MASS)

Warning: package 'MASS' was built under R version 4.3.2


Attaching package: 'MASS'

The following object is masked from 'package:dplyr':

    select

# Part 1: Create a data frame
data <- data.frame(
  Category = c("A", "B", "C", "D", "E"),
  Value = c(5, 3, 15, 9, 12),
  Exclude = c(TRUE, FALSE, TRUE, FALSE, TRUE)
)

# Part 2: Perform data transformation using dplyr's select function (Error introduced here)
transformed_data <- data %>%
  select(-Exclude) %>%
  mutate(RelativeValue = Value / sum(Value))

Error in select(., -Exclude): unused argument (-Exclude)

# Part 3: Create a ggplot
plot <- ggplot(transformed_data, aes(x = Category, y = RelativeValue)) +
  geom_col() +
  labs(title = "Relative Values by Category")

Error in eval(expr, envir, enclos): object 'transformed_data' not found

# Part 4: Print the plot
print(plot)

function (x, y, ...) 
UseMethod("plot")
<bytecode: 0x0000020ea162dc08>
<environment: namespace:base>

Answer:

The select() function from dplyr is being masked by the MASS package, which was loaded afterwards and exports its own select(). The solution is to write dplyr::select() to specify the select() function from the dplyr package. This is a common fix when working with select(), since many packages have functions of the same name.
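One way to confirm this kind of masking yourself is to ask R where a bare `select` comes from. The sketch below assumes dplyr and MASS are both installed and loads them in the same order as the activity; the small data frame and the name `kept` are made up for illustration.

```r
suppressPackageStartupMessages({
  library(dplyr)
  library(MASS)  # attached after dplyr, so its select() is found first
})

# Every attached package that defines `select`, in search-path order.
# The first entry is the one a bare select() call will use.
print(find("select"))

# The namespace that owns the function a bare `select` resolves to.
print(environmentName(environment(select)))

# The qualified call does not depend on load order.
kept <- data.frame(a = 1:2, b = 3:4) %>% dplyr::select(-b)
print(kept)
```

With this load order, both checks should point to MASS rather than dplyr, which matches the "unused argument" error in the activity.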

ChatGPT will often not catch the cause of this error at first and may actually attempt to confidently teach you completely incorrect information about the dplyr::select() function.

A tip is to have ChatGPT walk you through each line of the code. This is a great way to use ChatGPT to help you understand a script you do not understand. In doing so, ChatGPT will warn you about potential conflicts between packages.

Another tip in general is to use ChatGPT to add debugging or print statements to your code. However, in this particular case that is unlikely to help.

In summary, this is a great example of the limits of ChatGPT. Because the free version cannot run the code, it relies on you for feedback and does not readily catch this kind of error. In many cases it will also be confidently incorrect, and if you do not have a strong understanding of programming, it could teach you incorrect information. A more responsible approach is to first use ChatGPT to better understand the code itself, then use it to help you troubleshoot bit by bit instead of giving it large chunks of code.
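Troubleshooting bit by bit also means reading the first error first. In the activity output, the later messages follow from the failed select() step: `transformed_data` was never created, so the ggplot() assignment failed too, and `print(plot)` fell back to R's built-in plot function. The base-R sketch below reproduces that pattern; `my_plot` and the `stop()` call are hypothetical stand-ins for the failing step.

```r
# When the right-hand side fails, the assignment never happens.
outcome <- tryCatch(
  {
    my_plot <- stop("upstream step failed")  # stands in for the failing call
    "assigned"
  },
  error = function(e) "not assigned"
)
print(outcome)

# The name was never bound to anything.
print(exists("my_plot"))

# `plot`, by contrast, always exists because R ships a function by that name.
# After a failed `plot <- ...`, print(plot) shows that built-in function.
print(is.function(plot))
```

Choosing an object name that does not collide with an existing function, for example `category_plot`, would turn the silent fallback into a clear "object not found" error.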

rm(list=ls()) 

library(dplyr)
library(MASS)
library(ggplot2)

# Part 1: Create a data frame
data <- data.frame(
  Category = c("A", "B", "C", "D", "E"),
  Value = c(5, 3, 15, 9, 12),
  Exclude = c(TRUE, FALSE, TRUE, FALSE, TRUE)
)

# Part 2: Perform data transformation, calling dplyr::select() explicitly to avoid MASS::select()
transformed_data <- data %>%
  dplyr::select(-Exclude) %>%
  mutate(RelativeValue = Value / sum(Value))

# Part 3: Create a ggplot
plot <- ggplot(transformed_data, aes(x = Category, y = RelativeValue)) +
  geom_col() +
  labs(title = "Relative Values by Category")

# Part 4: Print the plot
print(plot)
