Here's how to configure data extractions in process mining to pull data from an on-premise database into the EMS cloud for analysis.
In this second of four "Path of the Data" short tutorials, I demonstrate working from the MySQL data connection set up in the previous episode, creating an extraction job, and adding tables for extraction.
Key steps covered include selecting columns, data types, batch size, duplicate removal, date filters, and executing delta versus full loads. Extraction jobs can be scheduled and are logged for monitoring. The extracted data lands in a Vertica SQL data warehouse within the EMS environment. This raw data can then be transformed and loaded into a process mining data log for analysis in the next stages.
I provide a clear walkthrough of configuring basic extractions as part of the overall data integration process.
Stay tuned for more tips and process intelligence how-tos!
Watch the tutorial here, and you can read the transcript below.
Hi, my name is James Newman. This is the second episode of the Path of the Data course in Process Mining IQ. In this episode we're talking about extractions - this is how you get the data out of the source system and into EMS data storage, which is a Vertica SQL database in the cloud. Once the data is there, you'll be able to run Transformations on it and then load it into a data log.
We're in our AppDev environment in our Aardvark data pool, and in our last episode we set up the MySQL connection. Now we're going to move into the next portion of the data processing - in ETL terms, we're doing the E, the extraction.
As you can see, it's laid out here so you see the Extractions, then Transformations and Loads, all coming off of the data storage, which is the Vertica SQL database sitting in the cloud. Right now we don't have any of them set up.
First, you want to create a Data Job. Give it a name and a Data Connection - a Global Data Connection would refer to flat file uploads, event logs, things like that. That's not what we're going to do here - we want the MySQL connection, which tells the system that these Extractions and Transformations are for that SQL connection. Save that.
The Data Job is created here, and it shows you the scope, which is the MySQL connection data scope. Then the tasks are the sub-units of the job: you have jobs, and within jobs you have tasks - a task can be an extraction, a transformation, or a load.
In this first video we're just going to add an extraction. Give it a name, and it pops you into the extraction screen. Right away it doesn't have any tables, but if you click Add Tables it goes out and queries the source - essentially asking, "Okay, what tables are there in the database you specified?" Right now we've set up this test table, so click that and click Save.
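As a rough illustration (not necessarily the exact query the extractor runs), the Add Tables step amounts to asking the MySQL source something like the following, where the schema name is a placeholder for this example:

```sql
-- Minimal sketch: list the tables available in the source schema.
-- 'aardvark_source' is a hypothetical schema name used for illustration.
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'aardvark_source';
```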
Now all the tables pop up on the left-hand side, and you have some information in the center of the screen where you can configure the columns. Right now there's just one column, a case key, and this is the primary key. That's important for joining tables - you need to have the primary key selected. If we had multiple tables, this would be the primary key for this table.
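To see why the primary key matters, here's a hypothetical sketch of the kind of join a later transformation might do, assuming a second extracted table called test_events that doesn't actually exist in this demo:

```sql
-- Hypothetical: join two extracted tables on the shared case key.
-- Without a reliable primary key, this join would produce unreliable results.
SELECT t.case_key, e.event_name
FROM test AS t
JOIN test_events AS e
  ON e.case_key = t.case_key;
```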
Next is the data type - it will try to detect it automatically. For string versus text: string in Vertica is shorter, more of a typical string, and text is a long format. So if you're pulling JSON, you might want to pull it in as text. Then Pseudonymize will use the algorithm you specified to hash the column - if this were a PII column, you would probably want to check that.
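As a rough Vertica-side sketch of those type choices, imagine the test table had a few more columns than it actually does - everything except case_key is made up purely for illustration:

```sql
-- Hypothetical target table illustrating the type options described above.
CREATE TABLE test_example (
    case_key      INT,                  -- the real primary key column
    status        VARCHAR(200),         -- "string": short, typical text
    raw_payload   LONG VARCHAR(100000), -- "text": long-format content such as JSON
    customer_name VARCHAR(64)           -- a pseudonymized column stores a hash
                                        -- (e.g. a SHA-256 hex) instead of raw PII
);
```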
Each time you edit anything it's going to pop up this blue Save button - it's really important to save that. I try to do it each time I make a change just so I don't lose it if I go back.
Rename Target Table - with this option, instead of the data coming in as just the table "test", you can rename it - say to video_test - and it will be saved as video_test in the SQL database.
Batch size is on the server side, for the extractor - how many rows should it pull at a time? Max string length is self-explanatory. Limit total records is there if you only want to pull a certain number of records.
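Purely as a conceptual sketch (the extractor's actual paging mechanism may differ), a batch size of, say, 10,000 and a record limit behave roughly like this on the source side:

```sql
-- Conceptual only: paging through the source in batches of 10,000 rows.
SELECT * FROM test ORDER BY case_key LIMIT 10000 OFFSET 0;      -- batch 1
SELECT * FROM test ORDER BY case_key LIMIT 10000 OFFSET 10000;  -- batch 2

-- Limit total records simply caps the overall pull, e.g. the first 500 rows:
SELECT * FROM test LIMIT 500;
```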
Then there's Remove Duplicates - this is really nice - if there are duplicates in there that you don't want and you just want the unique values, select that.
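Conceptually, removing duplicates is just a DISTINCT over the selected columns - an illustrative sketch, not the literal query the extractor issues:

```sql
-- Illustrative: keep only the unique values of the selected column(s).
SELECT DISTINCT case_key FROM test;
```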
It also has some Joins and Filters, which we'll just go over at a high level. Join is joining to another table - if you want to say, "Okay, this table is dependent on another table." We're not going to do that right now.
Time Filter: if there is a date column, it will try to identify it, and you can select either a Change Date filter or a Creation Date filter. That puts a simple WHERE clause in the query - only pull rows where this date is greater than or equal to this day.
It is important to note that when you do this with the Creation Date, it has a NOW() function - and "now" literally means that moment of the extraction.
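Here's the rough shape of the WHERE clauses these filters add - the column names and cutoff date are placeholders, not what the product generates verbatim:

```sql
-- Change Date filter: only rows changed on or after a given day (hypothetical column).
SELECT * FROM test
WHERE changed_at >= '2024-01-01';

-- Creation Date filter: NOW() is evaluated at the moment the extraction runs.
SELECT * FROM test
WHERE created_at <= NOW();
```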
Then there are Additional Filters you can type in as text - so if the case key were an integer that incremented upward, you could put in "case key greater than five" or something like that and skip the first five test cases that are in the system.
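Written out as the WHERE clause it effectively adds, that example filter would look something like this:

```sql
-- Skip the first five test cases by filtering on the incrementing key.
SELECT * FROM test
WHERE case_key > 5;
```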
Delta Filter Statement - on each Delta load, only the new rows that satisfy this filter query get pulled in.
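A minimal sketch of what a delta filter narrows things down to, assuming a made-up last_changed column and timestamp:

```sql
-- Hypothetical delta filter: a Delta Load pulls only rows passing this condition,
-- while a Full Load ignores it and pulls everything.
SELECT * FROM test
WHERE last_changed > '2024-01-01 00:00:00';
```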
Extraction Preview will do a pull and show the logs. Right now there's not much data in there.
One last thing - an extraction is enabled by default whenever you create one. So if you're creating a test one and you don't want it enabled, always make sure you go over to the three-dot button and hit Disable, and that will change it to No.
Execute Data Job - it defaults to a Delta Load here. If you want to do a Full Load, the difference is that a Delta Load goes off that delta filter, while a Full Load pulls everything, not just the rows that are new.
You're going to execute this - it runs and brings you to the log page right away. We can click in and see that it's running - here's the configuration for the database it's pulling from, the version information, things like that. And then for the test table it shows the query that's being used. Right now I don't have any records, but it will run, and then we can check that.
So now you see we have one extraction, and in the next video we'll talk about adding a transformation. In the meantime I will have added some rows to the test table, so pay attention at the beginning of the next video and we'll see those in there.
Hopefully this was helpful, and join us for the next Transformation video in Path of the Data. Have a great day!
Don't forget to sign up for our Process Mining Tip of the Week! Click on the image, share your email, and you'll get a helpful tip in your inbox every week!