Crafting a Data Science Workflow

Kaila Stone · Published in Analytics Vidhya · Jul 16, 2020


The walls between art and engineering exist only in our minds. — Theo Jansen

Theo Jansen is a Dutch artist who builds kinetic sculptures by mixing engineering, biological theory, and artificial intelligence (known together as evolutionary computation) with artistic methods. These sculptures are collectively called “Strandbeesten,” or Beach Animals, and exhibitions can be found by checking this website.

I look for inspiration from many sources; one of my recent endeavors has been to re-align my thought patterns surrounding Data Science with how I chose to view my Culinary career: striking a balance between art, movement, and precision.

How does any of this pertain to crafting a workflow? I would say that the single most important thing about the work we put into the world as individuals is that it is a direct reflection of our internal thought processes and morals. I also believe that with this freedom comes a great responsibility to be intentional about our discernment and productivity!

Image caption: Louisiana harvested honey semifreddo, Ponchatoula strawberry sorbet, Madagascar vanilla shortcake, Sherry & Bay Leaf roasted strawberry compote

Regardless of the chosen form of self-expression, I would argue that no great works come from slapping a bunch of random thoughts together and calling it a day. Crafting a workflow is one of the most intimate parts of creativity, and it is paramount to keeping a clean and efficient work environment.

If there is anything that I have learned about Data Science over these past few months, it would have to be that it truly is an art form! Opening my mind to that idea has given me a perspective on the data that I couldn’t quite reach before. This post is in reference to my recent project for Flatiron School: modeling and predicting data on Farmers Markets in the continental United States. (The link to my GitHub repository follows at the end of this post.)

Now that I’ve convinced you of the importance of deliberation while working, let’s dig into a process methodology that is structured while managing to leave room for individual expression: CRISP-DM.

Cross Industry Standard Process for Data Mining

CRISP-DM is a standardized workflow built around six key phases that encourage the practitioner to develop an intimate understanding of not only the work that needs to be performed, but also the importance behind it and how the outcome of this work will affect the strategy at large.

Business Understanding

I once had a manager tell me, “When delegating, if you couple the ‘what’ with the ‘why’, a person will be more invested in the task at hand. They will see the importance and value of their work, and therefore themselves.” That is leadership; not just efficiently running a business — but encouraging growth, teamwork, and inspiration.

I have carried that message with me throughout my life, and I was reminded of the importance of business understanding when I first started my Module 3 project. With the Farmers Market data, the first thing that I did was create a text file in my Jupyter workspace entitled “CRISP-DM,” where I kept a log of notes, findings, and questions.

While the analysis of farmers markets is not really a business-oriented topic, it is one of socioeconomic importance and can follow the same guidelines:

  • Research: What is the importance of this topic, both to me personally and to the community at large?
  • Preliminary Questions: What am I aiming to discover or answer in my initial research? (During this stage, it is important to have a loose structure while maintaining an open mind in order to allow the data to determine the trajectory.)
  • Project Plan: Gathering the materials and resources, establishing a list of to-dos, and implementing a routine for organization.
  • Metric of Success: How will you define ‘success’ for this particular project? (For example, a minimum model performance threshold or a set of research questions answered.)

Data Understanding & Preparation

According to the CRISP-DM method, data understanding and data preparation are actually two separate phases. However, I learn best by getting my hands dirty, so to speak, and personally decided to combine the two.

  • Exploratory Data Analysis: Importing and getting familiar with the data frame(s) and various features; cleaning the data as necessary; gaining insight into the broader story (a brief sketch follows this list)
  • Annotation of Work: Keeping track of thought processes during production through the use of notes and markdown cells; getting rid of (or relocating) unnecessary code
  • Engineering Features: Utilizing data to create new features that help to tell the story without compromising significance or creating false insights
  • Visualization: Performing analysis through the use of charts and graphs that can aid in perception and support hypotheses
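
To make the first three bullets concrete, here is a minimal sketch in pandas. The file name and the column names (‘State’ and the engineered ‘markets_in_state’) are illustrative assumptions for this post, not the actual fields from the dataset:

```python
import pandas as pd

# Import and get familiar with the data frame and its features
df = pd.read_csv('farmers_markets.csv')  # hypothetical file name
print(df.shape)
df.info()
print(df.isna().sum().sort_values(ascending=False).head(10))

# Clean as necessary: drop columns that are more than half null
df = df.dropna(axis=1, thresh=len(df) // 2)

# Standardize a messy categorical column (column name is an assumption)
df['State'] = df['State'].str.strip().str.title()

# Engineer a feature that helps tell the story: markets per state
df['markets_in_state'] = df.groupby('State')['State'].transform('size')
```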

A note on visualizations: I found it incredibly efficient to choose a style and theme with my first visual, and then to integrate that into my workflow. Including axis labels, titles, and other necessary customization in my workflow allowed me to create visualizations that were deployable as needed, keeping me from going back to re-work my charts during the final stages of the project.
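
As a sketch of that approach: set the style and theme once up front, then wrap the titles and labels into a small helper so every chart comes out deployable. The specific style, palette, and figure size here are illustrative choices, not necessarily the ones used in the project:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Choose a style and theme with the first visual, then reuse it everywhere
sns.set(style='whitegrid', palette='muted')
plt.rcParams['figure.figsize'] = (10, 6)

def labeled_plot(ax, title, xlabel, ylabel):
    """Apply the title and axis labels so no chart needs re-work later."""
    ax.set_title(title)
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    plt.tight_layout()
    return ax

# Usage (assumes the 'State' column from the preparation sketch):
# labeled_plot(sns.countplot(x='State', data=df),
#              'Markets per State', 'State', 'Count')
```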

Modeling

Having a deeper understanding of the types of data, and having developed (and written down!) new questions, it’s now time to determine which types of modeling techniques to use. While this post is not specifically about modeling, there are a few key concepts to be followed (a brief sketch follows the list):

  • Establish a plan of action for each type of model chosen, and what determines the success of that model
  • Define the parameters to be utilized
  • Train the models on the data set
  • Test the models
  • Assess the models and tune as necessary
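
A minimal sketch of those steps with scikit-learn, assuming a prepared feature matrix X and target y from the data preparation phase; the classifier and parameter grid are illustrative stand-ins, not necessarily the models used in the project:

```python
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Hold out a test set so the models are assessed on unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Define the parameters to be utilized
param_grid = {'n_estimators': [100, 300], 'max_depth': [5, 10, None]}

# Train, then tune as necessary via cross-validated grid search
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5)
search.fit(X_train, y_train)

# Test and assess the tuned model
print('Best parameters:', search.best_params_)
print('Test accuracy:', search.score(X_test, y_test))
```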

Evaluation

The evaluation step is where you start to see the light at the end of the tunnel, and then realize just how much work is left to do. However, during this project I can say that the use of an organized and intentional workflow drastically lowered the amount of time spent on wrapping up and integrating my project into one cohesive unit.

  • Assessment: Gauging the results of data mining and modeling
  • Models: Choosing final models to be deployed
  • Reports: Developing reports for various aspects of work such as data cleaning, modeling, visualization, etc.
  • Future Work: Creating a list of actionable ideas that can be done in addition to the finished product. No idea is too big or too small here!
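
For the assessment and model-choice bullets, here is a sketch of gauging candidate models side by side before choosing the final ones. The candidates are illustrative, and the X_train/y_train variables are assumed to carry over from the modeling step:

```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Candidate models are illustrative; swap in whatever was actually trained
candidates = {
    'logistic regression': LogisticRegression(max_iter=1000),
    'random forest': RandomForestClassifier(random_state=42),
}

# Gauge each candidate on the same folds before choosing a final model
for name, model in candidates.items():
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print(f'{name}: {scores.mean():.3f} +/- {scores.std():.3f}')
```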

Deployment

Wrapping up the project with a well-earned virtual bow, complete with documentation on how to utilize the model. The deliverables:

  • README: A summary of the project and findings: complete with visualizations, links to various reports, and future work list
  • Workspace: A clean and organized Jupyter Notebook that is well documented
  • Presentation: A visual presentation and walk-through recording of the project and findings
  • Blog: An article written about something pertaining to the project; a tutorial of a method used or deeper explanation of the findings
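
On documenting how to utilize the model: one common way to make the final model reusable from the repo is to persist it to disk. A minimal sketch with joblib, assuming the tuned `search` object from the modeling step:

```python
import joblib

# Persist the final, tuned model alongside the notebook
joblib.dump(search.best_estimator_, 'final_model.joblib')

# Anyone cloning the repo can then reload it and predict
model = joblib.load('final_model.joblib')
predictions = model.predict(X_test)
```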

While crafting your personal workflow, look to past experiences, artists such as Theo Jansen, or other individuals such as peers and mentors for inspiration on how to add unique touches to your production environment. Remember that art and creativity can be interwoven into every step of the process; it’s up to you how to achieve that!

For more in-depth information on CRISP-DM, follow this link.

To see my GitHub repo, click here.
