Concurrency Data and Analytics Internship

Authored by Chance Alexander

Hi! My name is Chance Alexander, and this blog post covers my experiences and responsibilities as part of Concurrency's internship program on the data and analytics team.
 
The first three days of the internship program were days to remember.  These three days were packed with information – we learned what roles we would play on our teams, got to know the other teams within Concurrency, and even learned fine dining etiquette!  Most importantly, Concurrency paid for our meals for those three days.  That was the best part.
 
After the three days of introductions were completed, each of the interns moved forward with their individual project champions on their respective teams.  In my case, my project champions were Seth Bauer and Michael Carlo, two Microsoft Power BI MVPs.  After meeting them, we outlined what my summer would look like here at Concurrency, and that is when I learned my training was not yet over.  My first three weeks were filled with training materials – I was tasked with taking five courses through edX to broaden my knowledge of the applications I would be using.  The courses I took were Introduction to Data Modeling, Delivering a Relational Data Warehouse, Implementing ETL with SQL Server Integration Services, Developing a SQL Server Analysis Services Tabular Data Model, and Analyzing and Visualizing Data with Power BI.  I was given the option to audit these courses and take them for free, since the estimated completion time averaged two to six weeks per course.  The other option was to pursue a Microsoft certificate in each of the courses, which required a score of 70% or higher.  I decided to challenge myself and pursue the certificates, as it would give my resume some more “swag.”  In those three weeks, I completed all five of the courses, but only earned a certificate in four of them.  It was disappointing to fail one of the certificate exams, but I luckily had the option to retake the course a few weeks later, and I passed it the second time around, bringing my total to all five certificates.
 
After training, I moved into my first and only project during my time at Concurrency, which we coined “The Marketo Project.”  Marketo is an account-based marketing analytics service that provides valuable information on email activities within Concurrency.  I was tasked with moving this data from Marketo into an on-premises SQL database, and then moving that data into Power BI for advanced analytics.  Moving the data with SQL Server Integration Services proved a much more difficult challenge than anticipated.  To start, the first day was filled with Visual Studio crashes caused by insufficient permissions.  Once I got past these problems, I was finally able to get to the data.  I went through the Marketo data source, moving data from each of the tables into flat files to determine what we would most likely need in a Power BI report, and eventually set up data flows from Marketo to our on-premises SQL database.  Having spent so much time wrestling with Marketo inside SSIS, I do have some pieces of advice for anyone who plans to use SSIS:
  • Check the output buffers in SSIS to be sure they fit your source data sizes
    • If the output buffer in SSIS is smaller than the size of your data values, you will get an error.
  • Run your data flow tasks concurrently
    • I was able to do this successfully with up to eight concurrent data flow tasks.  As long as you stay at or below that number, running the flows in parallel should make the process more efficient.
  • Ensure that your data does not exceed the max buffer size in SSIS
    • Unfortunately, since I was pulling from Marketo, I could not see the source data I was getting and could not pinpoint the rows that were bigger than the max buffer size.  Because of this, I was unable to pull the data for two dates in 2017, and we are still looking for a solution.
  • If you plan to automate your SSIS package, keep Marketo happy
    • What I mean by this is that Marketo is very particular about its date parameters when using bulk extract.  You must input your dates in YYYY-MM-DD format – no other format will be accepted – so to automate the package, create variables that fit this requirement and increment them when necessary (see the sketch after this list).
  • Keep Marketo’s data pull limit in mind
    • Marketo has a data pull limit of 500 MB per day, or 10,000 API calls per day.  Be sure that you do not exceed these limits, or you will get an error back.
  • Check Marketo’s default data types within SSIS
    • SSIS tries to recognize the data types coming in from Marketo, and for long strings the default data type is set to either DT_TEXT or DT_NTEXT, which are text streams and Unicode text streams, respectively.  These two data types will cause an error when trying to complete a data flow in SSIS, so change them to DT_STR or DT_WSTR (string or Unicode string, respectively) with output buffers large enough to hold your values.  Just keep in mind that you will lose efficiency if your output buffer is unnecessarily large.
  • Use stored procedures in the execute SQL tasks
    • This will make your package ever so slightly faster, and it will make the package contents easier to navigate.
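To illustrate the date-format point above, here is a minimal T-SQL sketch of how start and end dates for a Marketo bulk extract call could be generated in YYYY-MM-DD format and incremented between runs.  The variable names and the one-day window are assumptions for illustration – in my package this was handled with SSIS variables, but the idea is the same.

-- Hypothetical example: build YYYY-MM-DD strings for Marketo's bulk extract date parameters.
-- @LastRunDate would come from wherever the previous successful run is tracked.
Declare @LastRunDate date = '2017-06-01';

-- Pull a one-day window starting the day after the last run.
Declare @StartDate date = DATEADD(DAY, 1, @LastRunDate);
Declare @EndDate date = DATEADD(DAY, 1, @StartDate);

-- CONVERT with style 23 produces the YYYY-MM-DD format Marketo expects.
Select CONVERT(varchar(10), @StartDate, 23) As StartAt,
       CONVERT(varchar(10), @EndDate, 23) As EndAt;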
At the end of this process, I had created a pipeline that truncates the staging tables in an execute SQL task, uses six data flow tasks to move data from Marketo into our destination tables, and then uses a final execute SQL task to wrap things up and put only new data into our processing tables.
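For the curious, that first execute SQL task is nothing fancy – it just empties the staging tables so every run starts clean.  A rough sketch is below; Stage_Leads appears in a procedure later in this post, while the other staging table names are my guesses based on the load procedures, so treat them as placeholders:

-- Empty each staging table before the data flows run so no old rows linger.
Truncate Table Stage_Leads;
Truncate Table Stage_Activity;
Truncate Table Stage_Campaign;
Truncate Table Stage_Email;
Truncate Table Stage_Program;
Truncate Table Stage_List;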
 
After the SSIS process, I got into SQL Server Management Studio and started putting the data into its respective homes.  As mentioned in the SSIS process, I created staging tables before completing the pipeline.  These staging tables were empty, low-requirement tables that ensured all the data would transfer successfully.  It’s important to always keep these tables empty for each new data flow – if they were never emptied, the process would spend much more time checking which data can and cannot be inserted into the processing tables, which is why I used an execute SQL task to empty them before each new process.  After creating the staging tables, I created the processing tables.  In the processing tables, I made sure my data types matched those in the SSIS process, and I also created primary key constraints to ensure I had no repeating data in these tables.  The tricky part was figuring out how to enter only new data into these tables, which is where I had to do some research.  Using a modified insert query, I was able to create stored procedures containing something that looked like this:

Create Procedure [dbo].[Chance_Load_Leads] AS
-- Insert only rows whose id is not already in the processing table.
Insert Into Leads (accountId, campaignid, company, createdAt, email, fullname, id, leadid, updatedAt)
Select accountId, campaignid, company, createdAt, email, fullname, id, leadid, updatedAt
From Stage_Leads
Where Not Exists (Select id From Leads Where Leads.id = Stage_Leads.id);
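(As a side note, here is a rough sketch of what the Stage_Leads and Leads tables behind that procedure might look like.  The data types are my own guesses for illustration – the important parts are that the staging table carries no constraints and the processing table enforces a primary key on id.)

-- Hypothetical staging table: no constraints, just wide columns to land the raw Marketo data.
Create Table Stage_Leads (
    accountId nvarchar(255),
    campaignid nvarchar(255),
    company nvarchar(255),
    createdAt nvarchar(255),
    email nvarchar(255),
    fullname nvarchar(255),
    id nvarchar(255),
    leadid nvarchar(255),
    updatedAt nvarchar(255)
);

-- Hypothetical processing table: same columns, but with a primary key to block repeating data.
Create Table Leads (
    accountId nvarchar(255),
    campaignid nvarchar(255),
    company nvarchar(255),
    createdAt nvarchar(255),
    email nvarchar(255),
    fullname nvarchar(255),
    id nvarchar(255) Not Null Primary Key,
    leadid nvarchar(255),
    updatedAt nvarchar(255)
);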
 
The insert statement in that procedure basically says to enter data into my processing table from my staging table whenever the data in the staging table does not already exist in the processing table.  I wrote a procedure like this for each table in my SSIS process, and then wrapped all of those procedures in one stored procedure, which looked like this:

Create Procedure [dbo].[Chance_Load_Process_Tables]
AS
Begin
    -- Load each processing table from its staging table in turn.
    EXEC Chance_Load_Activity;
    EXEC Chance_Load_Campaign;
    EXEC Chance_Load_Email;
    EXEC Chance_Load_Program;
    EXEC Chance_Load_Leads;
    EXEC Chance_Load_List;
End;

Thus, in my final execute SQL task in SSIS, I simply wrote “EXEC Chance_Load_Process_Tables” and each of those procedures would run.  There was an error that I encountered a few times that I was not expecting, and I quickly found a way around it.  The error said that I could not enter data into the processing table because the unique identifier already existed within that table.  At first glance, you may think, “Oh, that’s just the procedure saying it’s not entering that value because it’s already there.”  That’s wrong, and it’s wrong because the procedure is already written to work around that problem.  It turns out the error was actually caused by duplicate rows within the staging table from a single data flow – since I empty the staging tables between runs, leftover data could not have been the culprit.  This may not be the most efficient way to solve the problem, but I created a clone of my staging table, inserted all distinct values into that clone, and then ran the same insert procedure to successfully load my processing table (a sketch of that workaround is below).  After the data was in, I decided not to map table relationships in SSMS, since I felt that doing this in Power BI was much easier.
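Roughly, the de-duplication workaround looked like this – the clone table name here is just for illustration:

-- Hypothetical workaround for duplicate rows arriving within a single staging load.
-- 1. Build a clone of the staging table that keeps only distinct rows.
Select Distinct accountId, campaignid, company, createdAt, email, fullname, id, leadid, updatedAt
Into Stage_Leads_Distinct
From Stage_Leads;

-- 2. Run the same "insert only new rows" logic against the de-duplicated clone.
Insert Into Leads (accountId, campaignid, company, createdAt, email, fullname, id, leadid, updatedAt)
Select accountId, campaignid, company, createdAt, email, fullname, id, leadid, updatedAt
From Stage_Leads_Distinct
Where Not Exists (Select id From Leads Where Leads.id = Stage_Leads_Distinct.id);

-- 3. Drop the clone once the load succeeds.
Drop Table Stage_Leads_Distinct;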
 
Power BI was the third and final application that I used during this internship, and it was by far the most enjoyable part of the work.  In Power BI, I assigned foreign keys and created relationships between tables with a simple drag-and-drop feature in the data modeling tab.  Once I completed that, I cleaned up some of the tables by altering the table and column names to make them more user-friendly, and also deleted some junk data that I found (my favorite being the e-mail address “iamadork.com”).  Once these things were done, I moved on to creating the visuals for my Power BI report.  My first visual was a funnel showing the breakdown of our e-mail activity data.  The funnel started with the number of e-mails sent, followed by how many of those e-mails were received, and then how many were opened.  The fourth part of the funnel showed the number of e-mails that were clicked, and the fifth showed the number of links that were clicked.  I also created cards for some important activities that didn’t fit in the funnel: visited webpage, emails hard bounced, unsubscribed from email, and forms filled out.  The next visual listed the primary attribute values – which, as far as I can tell, is the most granular notion of an “event” that Marketo has – along with the number of activities performed per primary attribute value, laid out as a matrix.  Lastly, I created a column and line chart showing the number of activities performed per domain with the columns and the number of unique primary attribute values interacted with per domain with the line.  To round things out, I created a domain slicer and a date slicer so all of this data can be filtered by domain or by time frame.
 
All in all, this internship was a great experience.  It was my first internship, and I got to work under some genuinely intelligent people who had my best interests as a young professional in mind.  I was also offered an extension on my internship, so this may not be the last blog post you read from me!