Generating Fake CSV Data With Python

I write content for AWS, Kubernetes, Python, JavaScript and more. To view all the latest content, be sure to visit my blog and subscribe to my newsletter. Follow me on Twitter.

This is Day 24 of the #100DaysOfPython challenge.

This post will use the Faker library to generate fake data and export it to a CSV file.
We will be emulating one of the free datasets from Kaggle, the Netflix original films IMDB scores dataset, and generating something similar.
The final code can be found here.
Prerequisites
  • Familiarity with Pipenv. See here for my post on Pipenv.
  • Familiarity with JupyterLab. See here for my post on JupyterLab.
Getting started
    Let's create the generating-fake-csv-data-with-python directory and install Faker.
    # Make the `generating-fake-csv-data-with-python` directory
    $ mkdir generating-fake-csv-data-with-python
    $ cd generating-fake-csv-data-with-python
    # Create a folder to place your notebook
    $ mkdir docs
    
    # Init the virtual environment
    $ pipenv --three
    $ pipenv install faker
    $ pipenv install --dev jupyterlab
    At this stage, we have the packages that we need.
    Now we can start up the notebook server.
    # Startup the notebook server
    $ pipenv run jupyter-lab
    # ... Server is now running on http://localhost:8888/lab
    The server will now be up and running.
Creating the notebook
    Once on http://localhost:8888/lab, select to create a new Python 3 notebook from the launcher.
    Ensure that this notebook is saved in generating-fake-csv-data-with-python/docs/generating-fake-data.ipynb.
    We will create cells to handle two parts of this mini project:
  • Importing Faker and generating data.
  • Importing the CSV module and exporting the data to a CSV file.
    Before generating our data, we need to look at what we are trying to emulate.
Emulating The Netflix Original Movies IMDB Scores Dataset
    Looking at the preview for our dataset, we can see that it contains the following columns and example rows:
    Title            Genre        Premiere         Runtime  IMDB Score  Language
    Enter the Anime  Documentary  August 5, 2019   58       2.5         English/Japanese
    Dark Forces      Thriller     August 21, 2020  81       2.6         Spanish
    We only have two example rows, but from here we can make a few assumptions about how we want to emulate the dataset.
  • For languages, we will stick to a single language per movie (unlike the example English/Japanese).
  • IMDB scores will be between 1 and 5. We won't be too harsh on any movie, so we won't go as low as 0.
  • Runtimes should emulate a real movie - we can set it to be between 50 and 150 minutes.
  • Genres may be something we need to write our own Faker provider for.
  • We are going to be okay with nonsense data, so we can just use a string generator for the names.
    With this said, let's look at how we can fake this.
Emulating a value for each column
    We will create seven cells - one to import Faker and one for each column.
    For the first cell, we will import Faker.
    from faker import Faker
    
    fake = Faker()
    Second, we will fake a movie name with words:
    def capitalize(word):
        # Upper-case the first letter and lower-case the rest
        return word.capitalize()
    
    words = fake.words()  # returns three random words by default
    capitalized_words = list(map(capitalize, words))
    movie_name = ' '.join(capitalized_words)
    print(movie_name) # Serve Fear Consider
    Third, we will generate a date from this decade in a format close to the example:
    from datetime import datetime
    
    date = datetime.strftime(fake.date_time_this_decade(), "%B %d, %Y")
    print(date) # April 30, 2020
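One small mismatch with the Kaggle preview: %d zero-pads the day ("August 05, 2019"), while the example rows use "August 5, 2019". If an exact match matters, a minimal sketch is to build the string from the date's own fields. The helper name format_premiere is hypothetical and not part of the original notebook.

```python
from datetime import date

def format_premiere(d):
    # %B gives the full month name; d.day is an int, so it is never zero-padded
    return f"{d:%B} {d.day}, {d.year}"

print(format_premiere(date(2019, 8, 5)))   # August 5, 2019
print(format_premiere(date(2020, 8, 21)))  # August 21, 2020
```

Since datetime is a subclass of date, the value returned by fake.date_time_this_decade() can be passed straight in.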
    Fourth, we will create our own fake data generator for the genre:
    # creating a provider for genre
    from faker.providers import BaseProvider
    import random
    
    # create new provider class
    class GenereProvider(BaseProvider):
        def movie_genre(self):
            return random.choice(['Documentary', 'Thriller', 'Mystery', 'Horror', 'Action', 'Comedy', 'Drama', 'Romance'])
    
    # then add new provider to faker instance
    fake.add_provider(GenereProvider)
    
    # now you can use:
    movie_genre = fake.movie_genre()
    print(movie_genre) # Horror
    Fifth, we will do the same for a language:
    # creating a provider for language
    from faker.providers import BaseProvider
    import random
    
    # create new provider class
    class LanguageProvider(BaseProvider):
        def language(self):
            return random.choice(['English', 'Chinese', 'Italian', 'Spanish', 'Hindi', 'Japanese'])
    
    # then add new provider to faker instance
    fake.add_provider(LanguageProvider)
    
    # now you can use:
    language = fake.language()
    print(language) # Spanish
    Sixth, we need to generate a runtime:
    # Getting random movie length
    movie_len = random.randrange(50, 150)
    print(movie_len) # 143
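Note that random.randrange(50, 150) excludes its upper bound, so the longest runtime it can return is 149. If 150 should be possible, random.randint includes both endpoints. A small sketch (the seed is a hypothetical value, only there to make the run repeatable):

```python
import random

random.seed(0)  # hypothetical seed so the run is repeatable

# randrange(50, 150) draws from 50..149; randint(50, 150) draws from 50..150
exclusive = [random.randrange(50, 150) for _ in range(10_000)]
inclusive = [random.randint(50, 150) for _ in range(10_000)]

print(max(exclusive) <= 149)  # True: 150 can never appear
print(max(inclusive) <= 150)  # True: 150 is allowed
```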
    Lastly, we need a rating with one decimal place between 1.0 and 5.0:
    # Movie rating
    random_rating = round(random.uniform(1.0, 5.0), 1)
    print(random_rating) # 2.2
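One subtlety with rounding a uniform float: the end values 1.0 and 5.0 each cover only half as wide an interval as the values in between (1.0 only catches roughly 1.00 to 1.05), so they come up about half as often. If every tenth should be equally likely, one alternative is to pick an integer number of tenths. The function name even_rating is hypothetical.

```python
import random

def even_rating():
    # Pick one of the 41 values 10, 11, ..., 50 and scale down to 1.0 .. 5.0
    return random.randint(10, 50) / 10

ratings = [even_rating() for _ in range(1000)]
print(min(ratings) >= 1.0 and max(ratings) <= 5.0)  # True
```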
    Now that we have all our information together, it is time to generate a CSV with 100 entries.
Generating the CSV
    We can place everything we know into one last cell to generate some data:
    from datetime import datetime
    from faker import Faker
    from faker.providers import BaseProvider
    import random
    import csv
    
    class GenereProvider(BaseProvider):
        def movie_genre(self):
            return random.choice(['Documentary', 'Thriller', 'Mystery', 'Horror', 'Action', 'Comedy', 'Drama', 'Romance'])
    
    class LanguageProvider(BaseProvider):
        def language(self):
            return random.choice(['English', 'Chinese', 'Italian', 'Spanish', 'Hindi', 'Japanese'])
    
    fake = Faker()
    
    fake.add_provider(GenereProvider)
    fake.add_provider(LanguageProvider)
    
    # Some of this is a bit verbose now, but doing so for the sake of completion
    
    def get_movie_name():
        words = fake.words()
        # Capitalize inline so this cell does not depend on the earlier capitalize() helper
        capitalized_words = [word.capitalize() for word in words]
        return ' '.join(capitalized_words)
    
    def get_movie_date():
        return datetime.strftime(fake.date_time_this_decade(), "%B %d, %Y")
    
    def get_movie_len():
        return random.randrange(50, 150)
    
    def get_movie_rating():
        return round(random.uniform(1.0, 5.0), 1)
    
    def generate_movie():
        return [get_movie_name(), fake.movie_genre(), get_movie_date(), get_movie_len(), get_movie_rating(), fake.language()]
    
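    # newline='' stops the csv module from writing blank lines between rows on Windows
    with open('movie_data.csv', 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['Title', 'Genre', 'Premiere', 'Runtime', 'IMDB Score', 'Language'])
        # Write 100 data rows
        for _ in range(100):
            writer.writerow(generate_movie())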
    Running the cell will write movie_data.csv to the notebook's working directory. The file looks like this:
    Title,Genre,Premiere,Runtime,IMDB Score,Language
    Discuss According Model,Horror,"February 09, 2020",107,2.6,Japanese
    People Conference Be,Comedy,"April 25, 2020",84,1.8,Chinese
    Forget Great Kind,Drama,"May 22, 2021",128,3.3,Chinese
    Trial Employee Cover,Drama,"February 24, 2020",90,3.6,Spanish
    Choose System We,Drama,"June 29, 2020",102,3.3,Spanish
    Range Laugh Reach,Comedy,"August 09, 2021",92,3.9,Spanish
    Increase Fire Popular,Romance,"May 03, 2020",107,4.1,Japanese
    Show Job Believe,Thriller,"March 13, 2021",62,1.6,English
    Or Power Century,Comedy,"February 29, 2020",146,2.3,Spanish
    Ago Ability Within,Drama,"July 23, 2020",120,4.8,Italian
    Foreign Always Sing,Mystery,"May 16, 2021",112,1.9,English
    Once Movie Artist,Documentary,"February 09, 2020",79,4.1,Hindi
    Near Explain Process,Action,"July 17, 2021",134,2.0,Spanish
    Big Information Grow,Romance,"February 25, 2020",64,4.4,Spanish
    Wind Project Heavy,Drama,"February 20, 2021",128,4.8,English
    Child Form Theory,Mystery,"January 12, 2021",91,3.0,Spanish
    Bring Sport Present,Drama,"March 02, 2021",87,2.7,Hindi
    Themselves That Activity,Action,"August 20, 2020",148,3.0,Spanish
    City Threat Almost,Thriller,"February 16, 2020",107,3.9,Spanish
    See Main Student,Drama,"January 17, 2020",125,1.4,Chinese
    Population Impact Season,Action,"March 19, 2020",109,2.3,Italian
    Manager Thank Truth,Documentary,"February 12, 2021",124,4.1,Hindi
    Child South Believe,Thriller,"April 18, 2020",65,3.9,Italian
    Present Main Themselves,Romance,"September 08, 2020",89,3.8,Hindi
    Maintain Order Old,Drama,"December 14, 2020",110,1.8,Hindi
    Difficult Town Hair,Documentary,"October 12, 2020",51,4.9,Japanese
    Page Hold Discussion,Drama,"November 01, 2020",139,1.9,Chinese
    Style True Car,Comedy,"July 03, 2021",84,5.0,Japanese
    Care Item Sing,Comedy,"November 16, 2020",100,4.9,Japanese
    Do Car Organization,Romance,"February 28, 2021",129,1.1,Japanese
    Learn Service Figure,Documentary,"March 04, 2020",50,2.0,Italian
    Forget Situation Fact,Comedy,"January 22, 2020",52,3.9,English
    Order International Report,Documentary,"December 17, 2020",101,2.2,Chinese
    Another Black Teach,Mystery,"December 08, 2020",96,4.2,Italian
    Professor Watch Throughout,Action,"September 15, 2020",111,4.0,English
    Which Quickly Son,Documentary,"July 02, 2021",98,2.4,Chinese
    Change East Article,Comedy,"March 28, 2020",61,2.4,English
    Partner Individual Local,Romance,"May 07, 2020",149,5.0,English
    Instead Watch Particular,Horror,"May 04, 2020",115,2.3,Hindi
    Democratic Someone Available,Romance,"July 26, 2021",98,1.4,Italian
    Place Would Mind,Drama,"May 09, 2021",141,2.4,Italian
    Likely Economy Weight,Mystery,"February 03, 2021",106,3.1,Hindi
    Could Certain More,Drama,"January 31, 2021",137,4.9,Hindi
    Source Operation Sure,Action,"March 03, 2020",81,3.3,Hindi
    Really Share Treat,Documentary,"August 05, 2020",99,2.2,English
    Edge When Data,Drama,"July 27, 2020",115,1.6,Italian
    Huge Imagine Federal,Romance,"August 08, 2021",141,3.0,Chinese
    Tend Often Collection,Documentary,"June 25, 2020",73,3.2,Chinese
    Wait Major Move,Action,"June 17, 2021",120,2.5,Spanish
    Firm Reason With,Thriller,"July 16, 2021",67,2.6,Spanish
    Significant Fall Travel,Romance,"March 14, 2021",123,2.0,Hindi
    Send Size Eye,Comedy,"June 18, 2021",74,3.5,Spanish
    Describe Hospital She,Drama,"March 14, 2021",90,1.4,Spanish
    Give Drive Better,Mystery,"March 15, 2020",106,1.2,Spanish
    Their Measure Choose,Action,"April 28, 2021",86,2.8,Italian
    Resource Sell Agent,Thriller,"February 08, 2020",50,3.1,Hindi
    Next Plan Soon,Action,"May 16, 2021",93,3.7,Hindi
    Land Allow Simply,Mystery,"May 23, 2021",144,1.0,Hindi
    Friend Total Few,Mystery,"June 12, 2021",93,4.1,Italian
    Role Might Bad,Drama,"December 08, 2020",100,3.5,Japanese
    Opportunity Public Certainly,Horror,"August 07, 2020",76,2.0,Italian
    Else Play Politics,Drama,"August 01, 2021",145,2.5,Italian
    Staff Main West,Documentary,"May 09, 2021",76,2.5,Japanese
    Ready Treat Everything,Drama,"July 24, 2021",121,1.6,Hindi
    Ahead Yourself Crime,Horror,"February 09, 2021",80,4.9,Italian
    Next These Night,Comedy,"February 20, 2020",65,3.4,Hindi
    Line Else Along,Comedy,"February 05, 2020",83,1.8,Hindi
    Degree Continue Green,Documentary,"March 10, 2020",73,3.8,Hindi
    Marriage Until Cover,Thriller,"November 26, 2020",147,4.8,English
    Republican Way Mission,Drama,"April 04, 2021",57,2.9,Chinese
    Prepare Rich Street,Romance,"February 26, 2021",94,2.6,Japanese
    Term Five On,Horror,"September 06, 2020",62,2.7,English
    Sister Manage Relate,Documentary,"August 17, 2020",76,4.4,Hindi
    Scientist Beat Wonder,Horror,"June 23, 2021",137,1.5,Chinese
    Fast Staff If,Romance,"February 05, 2021",148,2.7,Hindi
    Ready Campaign Field,Comedy,"October 25, 2020",147,2.7,Chinese
    Worker State Every,Mystery,"May 17, 2021",104,1.7,English
    Bar Wind Story,Action,"January 28, 2021",108,3.2,Hindi
    At Total Half,Thriller,"December 03, 2020",79,4.4,Spanish
    One Something Focus,Thriller,"June 29, 2020",59,1.2,Japanese
    Play We Impact,Comedy,"March 19, 2020",88,1.3,Hindi
    Message After Again,Comedy,"May 28, 2021",75,4.1,Chinese
    Such Something Information,Comedy,"June 01, 2021",145,2.2,Spanish
    Power Organization Myself,Action,"January 29, 2021",119,1.4,Hindi
    Apply Boy Success,Documentary,"August 06, 2020",93,1.4,Italian
    Evening Production Bar,Romance,"April 13, 2020",102,2.5,Chinese
    Work For Form,Drama,"September 19, 2020",80,4.4,Hindi
    Occur Billion Cover,Documentary,"December 03, 2020",56,3.7,Chinese
    Budget Wall Tv,Horror,"January 02, 2021",135,1.0,English
    Share Beyond Loss,Action,"January 23, 2021",55,1.5,Italian
    Professional Source Make,Horror,"December 08, 2020",107,4.1,Japanese
    To Protect Improve,Mystery,"July 30, 2020",100,3.6,Japanese
    Democratic Hundred Appear,Horror,"August 18, 2020",84,4.3,Hindi
    Face Central Summer,Documentary,"November 25, 2020",63,1.8,Spanish
    Involve Clearly At,Documentary,"November 25, 2020",56,1.5,Italian
    Fall Term Drug,Horror,"April 05, 2020",52,2.2,Chinese
    Fly Language Where,Romance,"May 18, 2021",102,4.4,Chinese
    Service Local Door,Drama,"August 04, 2020",63,1.9,Italian
    Son Avoid Himself,Drama,"July 30, 2020",53,1.8,Hindi
    Success!
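To confirm the file matches the assumptions made earlier, it can be read back with csv.DictReader. The sketch below runs the check on three rows copied from the output above, held in an in-memory string so it works without the file. The function name check_rows is hypothetical.

```python
import csv
import io

SAMPLE = '''Title,Genre,Premiere,Runtime,IMDB Score,Language
Discuss According Model,Horror,"February 09, 2020",107,2.6,Japanese
People Conference Be,Comedy,"April 25, 2020",84,1.8,Chinese
Forget Great Kind,Drama,"May 22, 2021",128,3.3,Chinese
'''

def check_rows(csvfile):
    # Return the number of data rows, raising AssertionError on a bad value
    count = 0
    for row in csv.DictReader(csvfile):
        assert 50 <= int(row['Runtime']) < 150
        assert 1.0 <= float(row['IMDB Score']) <= 5.0
        count += 1
    return count

print(check_rows(io.StringIO(SAMPLE)))  # 3
```

To check the real file, pass an open file handle instead, for example one obtained from open('movie_data.csv', newline='').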
Summary
    Today's post demonstrated how to use the Faker package to generate fake data and the csv module to export that data to a file.
    In future, we may use this approach to make our own data sets to work with and do some data science around.
    Kaggle and Open Data are great resources for datasets and data visualization when you are not generating your own data.
    This "100 Days in Python" series will move towards data science and machine learning from here on out.
    Photo credit: pawel_czerwinski