Deeper Look Into ChatGPT Generated Data

3 min read · Apr 27, 2023

Yesterday, I wrote about the very impressive inferential capabilities of ChatGPT as a data generator. This, however, only scratched the surface of those capabilities.

I wanted to see if that generated data was truly random (each column drawn independently of the others, say from a uniform distribution) or if ChatGPT was “smarter” than I was already giving it credit for.

To do this, I decided to take Noteable for a spin. For those of us who have been using IPython/Jupyter for years, Noteable is (as far as my limited testing is concerned) a “souped-up”, “batteries-included”, cloud-based augmentation of Jupyter Notebooks, highly optimized for data science.

So, I grabbed the data that ChatGPT generated and saved it to a file. I then uploaded the file to Noteable and created a “Data Asset” from it. Next, I used their Eldritch magiks “sql” cell to load and query the data in the notebook with a super simple query:

SELECT * 
FROM '../datasets/personal_data/personal_data.csv'

This resulted in a neat and useful table view of the data, along with basic sparkline analytics for each column.
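Outside of Noteable, the same "select everything" step can be approximated with Python's standard library. This is only a minimal sketch, not what the sql cell does internally; the column names (`name`, `sex`, `age`, `height`, `weight`) and the two sample rows are assumptions made up for illustration, not the actual generated data.

```python
import csv
import io

def load_rows(source):
    """Read every row of a CSV file object, roughly like SELECT * over the file."""
    reader = csv.DictReader(source)
    return list(reader)

# Tiny inline stand-in for personal_data.csv; column names and values are assumed.
sample = io.StringIO(
    "name,sex,age,height,weight\n"
    "Alex,M,34,178,82\n"
    "Sam,F,29,165,61\n"
)
rows = load_rows(sample)
print(len(rows), "rows; columns:", list(rows[0].keys()))
```

With a real file, you would pass `open(path, newline="")` instead of the `StringIO` object. Note that `csv.DictReader` returns every value as a string, so numeric columns need converting before any analysis.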

Okay, with the data loaded, I wanted to see whether the columns were correlated, so in the configure menu I selected scatter plot. It automatically assigned two columns, in this case weight vs. age. Had the data been truly random, with each column generated independently, this scatter plot would be expected to be all over the place, with an R-squared close to zero. Instead, it showed these two columns had an R-squared of 0.78!
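To make the "close to zero" expectation concrete, here is a small sketch that computes R-squared as the square of the Pearson correlation (which matches the R-squared of a straight-line fit; I am assuming that is what the chart reports). The synthetic ages and weights below are invented purely for the comparison and are not the ChatGPT-generated values.

```python
import random

def r_squared(xs, ys):
    """Square of the Pearson correlation coefficient between xs and ys."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant column has no defined correlation")
    return (sxy * sxy) / (sxx * syy)

rng = random.Random(0)  # fixed seed so the comparison is repeatable
ages = [rng.uniform(18, 80) for _ in range(1000)]

# Case 1: weight drawn independently of age (the "truly random" scenario).
independent_weights = [rng.uniform(50, 110) for _ in range(1000)]

# Case 2: weight depends on age plus some noise (made-up relationship).
dependent_weights = [50 + 0.5 * a + rng.gauss(0, 3) for a in ages]

r2_independent = r_squared(ages, independent_weights)
r2_dependent = r_squared(ages, dependent_weights)
print(f"independent columns: R^2 = {r2_independent:.3f}")
print(f"dependent columns:   R^2 = {r2_dependent:.3f}")
```

The independent case should land very near zero, while the dependent case lands much closer to one, which is the contrast the scatter plot makes visible.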

Now this had my interest piqued. Let’s see what else we can find in this data. In the menus, I chose height as the x-axis. Again, if the data were random, we should expect to see a messy scatter plot. But, to my surprise, it wasn’t. Height was correlated with weight even more strongly than age was, with an R-squared of ~0.97!

This is when it began to dawn on me that ChatGPT generated this data with some basis in reality. It must be aware that the older a person is, the heavier they get, and that the taller they are, the heavier they are as well.

But we must go deeper!

Let’s try a ridgeline plot (you can find them in the configuration menu under “Summary Charts”) of height vs. sex. Much like what we’ve seen so far, the data shows a lower height distribution for females than for males.

Ridgeline plot of male vs. female height
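A ridgeline plot compares one distribution per group, and the same comparison can be checked numerically by summarizing height within each sex. The sketch below assumes rows shaped like CSV records with `sex` and `height` columns; the six rows are toy values I made up, not the generated dataset.

```python
from collections import defaultdict
from statistics import mean, median

def summarize_by_group(rows, group_key, value_key):
    """Return count, mean and median of value_key for each distinct group_key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(float(row[value_key]))
    return {
        group: {"n": len(values), "mean": mean(values), "median": median(values)}
        for group, values in groups.items()
    }

# Toy rows standing in for the real data (values are illustrative only).
toy_rows = [
    {"sex": "F", "height": "160"},
    {"sex": "F", "height": "165"},
    {"sex": "F", "height": "170"},
    {"sex": "M", "height": "172"},
    {"sex": "M", "height": "178"},
    {"sex": "M", "height": "184"},
]
summary = summarize_by_group(toy_rows, "sex", "height")
for group, stats in sorted(summary.items()):
    print(group, stats)
```

If the per-group means and medians differ in the same direction as the ridgelines, the chart is not just a visual artifact of binning or smoothing.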

At this point, I’m kind of astounded by GPT’s capabilities when generating data. But surely we can find more interesting things in this dataset. Let’s try a word cloud chart of the names to gauge their relative prevalence.

Alright, cool… WAIT A FRIGGIN MINUTE! Let’s check out the popularity of baby names in 2020 (approx. when the model behind ChatGPT was last trained):

Social Security List Of Popular Baby Names in 2020
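A word cloud is essentially a picture of name frequencies, so the eyeball comparison against a popularity list can be turned into a rough number. This is a hedged sketch: `top_names` and `overlap_fraction` are hypothetical helpers, and both lists below are placeholders for illustration, not the generated names and not the actual Social Security ranking.

```python
from collections import Counter

def top_names(names, k):
    """Return the k most frequent names, normalizing whitespace and case."""
    counts = Counter(n.strip().title() for n in names if n.strip())
    return [name for name, _ in counts.most_common(k)]

def overlap_fraction(generated_top, reference_top):
    """Fraction of generated_top that also appears in reference_top."""
    if not generated_top:
        return 0.0
    reference = set(reference_top)
    hits = sum(1 for name in generated_top if name in reference)
    return hits / len(generated_top)

# Placeholder inputs: swap in the name column and a real reference list.
generated_names = [
    "Liam", "Liam", "Liam", "Liam",
    "Olivia", "Olivia", "Olivia",
    "Noah", "Noah",
    "Zed",
]
reference_list = ["Liam", "Noah", "Emma"]

top3 = top_names(generated_names, 3)
overlap = overlap_fraction(top3, reference_list)
print("top generated names:", top3)
print(f"share also in reference list: {overlap:.2f}")
```

A high overlap would support the impression from the word cloud, though a sample this small can only hint at the pattern.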

I’m both elated and humbled at the same time. ChatGPT has the potential to replace and displace so many tools, people and businesses while at the same time being one of the most empowering tools ever created.

Written by Ruben Orduz
