Everything you need to know to get started with workflows for collecting bulk data.
Workflows allow you to access endpoints automatically and collect the combined data in a usable format, such as CSV or flattened JSON.
Every workflow is tied to a single API endpoint, but unlike API endpoints, all workflows are private to your account. To get started with workflows, you can either create a new one from scratch from an existing API endpoint, or import a formula to create pre-made workflows you can use right away.
To create a workflow from scratch, first navigate to the API endpoint you want the workflow to access. Look at the top right of the page for a drop-down menu, where you'll see the option to create a new workflow from the endpoint.
Depending on what you're doing, odds are that someone else has already done it and created a workflow "formula" you can import instead of making a new workflow from scratch.
To do this, simply navigate to the formula you're interested in and click the "Import" button. This will create one or more new workflows in your account, copying the author's definitions. Once a workflow is imported, you can make as many changes to it as you want to experiment - you can always re-import the formula to start over.
Since each workflow is assigned to a single API endpoint, it has the same "inputs" as the underlying API endpoint. This means that whatever values you can provide as inputs to single endpoint executions, you can also configure in workflows, but in more advanced ways.
If you create a new workflow from an endpoint from scratch, the default behavior is for the workflow to ask for a single value for each input. This is fine for inputs (such as authentication tokens) that should be the same for every request you're making.
Workflows are meant for batch processing, so you may have a list of inputs you want to process. For example, you may have a list of social media users you want to look up individually.
For these cases, you'll want to use "input collections," which allow you to create a list of values you want to look up and then associate that list with the API endpoint's input. This tells the workflow to iterate through every item in your list and pass each item as the input value to the underlying endpoint.
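As a rough mental model, an input collection behaves like the loop in this Python sketch. It is only an illustration of the concept, not the platform's actual implementation, and the endpoint URL, parameter name, and token are hypothetical:

```python
import requests

# Hypothetical endpoint and auth token -- substitute your own values.
ENDPOINT = "https://api.example.com/users/lookup"
API_TOKEN = "YOUR_TOKEN"

# The "input collection": one value per line, just like in the workflow UI.
usernames = ["alice", "bob", "carol"]

results = []
for username in usernames:
    # The workflow passes each collection item as the endpoint's input value,
    # while single-value inputs (like the token) stay the same for every request.
    response = requests.get(
        ENDPOINT,
        params={"username": username},
        headers={"Authorization": f"Bearer {API_TOKEN}"},
    )
    results.append(response.json())

print(f"Collected {len(results)} responses")
```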
Workflows are also useful for paginating through a large result set and combining the results. Many APIs let you specify a pagination offset (or page number) in the request - e.g. you're supposed to make a request for page 1, then page 2, and so on.
You can set an input to work as an auto-increment as well: simply select this option on the workflow screen and specify the auto-increment amount.
You may want to set a "pagination limit" if the API doesn't return an error response when you reach the end of the list, to ensure you don't keep crawling for data indefinitely.
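Conceptually, an auto-increment input combined with a pagination limit behaves like this sketch (again just an illustration; the endpoint, query parameters, and limit are made up):

```python
import requests

ENDPOINT = "https://api.example.com/search"   # hypothetical endpoint
PAGINATION_LIMIT = 50                         # stop after this many pages, no matter what
INCREMENT = 1                                 # the auto-increment amount

all_items = []
page = 1
while page <= PAGINATION_LIMIT:
    response = requests.get(ENDPOINT, params={"q": "coffee", "page": page})
    items = response.json().get("items", [])
    if not items:
        # Some APIs return an empty page (not an error) at the end of the list;
        # the pagination limit protects you when even this signal is missing.
        break
    all_items.extend(items)
    page += INCREMENT

print(f"Collected {len(all_items)} items")
```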
Some APIs use a "pagination cursor" for moving through results. This is where the API responds to your first request with a string value that you must send back in your second request to get the second page of results.
To achieve this, you first need to declare an extractor for the underlying endpoint with a column whitelist containing the column used to populate the subsequent value. Once this extractor is defined in your account, you can simply select it when linking up the source input.
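Cursor pagination amounts to a "self-loop": a value extracted from one response feeds the next request. A rough Python equivalent, with a hypothetical endpoint and field names:

```python
import requests

ENDPOINT = "https://api.example.com/feed"   # hypothetical endpoint

all_items = []
cursor = None
while True:
    params = {"limit": 100}
    if cursor:
        # Send back the cursor value the previous response gave us.
        params["cursor"] = cursor
    data = requests.get(ENDPOINT, params=params).json()
    all_items.extend(data.get("items", []))

    # This is the value an extractor with a single whitelisted column would pull out.
    cursor = data.get("next_cursor")
    if not cursor:
        break   # no cursor returned means we've reached the last page

print(f"Collected {len(all_items)} items")
```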
Workflows make repeated requests to the same API endpoint and combine the results into a single file for each extractor you have defined in the workflow. This means that if you make 1,000 requests to the same API, the results from all 1,000 responses will be combined into a single file for a specific extractor collection.
To get extraction files from workflow runs, you need to link one or more extractors to a workflow. If you import workflows using a formula, this will be done for you automatically. Otherwise, you may need to first create extractors for the endpoint you're interested in and then link them to the workflow, to tell the workflow to produce output files.
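To make the combining behavior concrete, here's a minimal sketch of what "one file per extractor" means: every response contributes rows to the same CSV. The responses and field names below are invented for illustration:

```python
import csv

# Pretend these are the JSON bodies from three separate workflow requests.
responses = [
    {"user": {"id": 1, "name": "alice"}},
    {"user": {"id": 2, "name": "bob"}},
    {"user": {"id": 3, "name": "carol"}},
]

# An "extractor" pulls specific fields out of each response; the workflow
# appends every extracted row to a single output file.
with open("users.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "name"])
    writer.writeheader()
    for body in responses:
        writer.writerow({"id": body["user"]["id"], "name": body["user"]["name"]})
```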
Once the inputs and extractors are configured for a workflow, you're ready to make executions. Each workflow can have multiple executions (with different inputs and configurations for each execution). You should only create separate workflows if each workflow is doing something fundamentally different.
To run a workflow, just provide the inputs you need for a specific run and hit the execute button - then sit back and monitor its progress.
Let's walk through a few common use cases for running workflows.
Say you want to look up details about a list of social media users. You'll want to put all of the usernames into an input collection (one per line) and then link that input collection to the appropriate input in the workflow.
The workflow will then make one request per username provided and combine all the results together into a single output file.
Maybe you want to use the Spotify API's search feature (which only returns a small portion of the results at a time). You need to "paginate" through to get the full results. You would declare an auto-increment (or a self-loop if using a pagination cursor) and the workflow will make as many requests as needed (changing the pagination input each time) and combine all the results together at the end.
What if you want to combine two types of workflows? E.g. run a search and paginate through the results in one workflow, then perform detail lookups on those results in a second workflow?
You'll want to first run the initial workflow (step 1) on its own and inspect the output to find the exact column you want to process in a second step. Once you find this column in the CSV output, go to the second workflow (the one you want to use as step 2), select "Workflow Chaining," and then paste in the column value to complete the chain.
Once a workflow has been chained, you need to execute the root workflow to begin the process (i.e. you can't initiate the second step of the chain directly).
Each workflow will pass all of its parameter settings (e.g. proxy settings, execution name) on to subsequent runs.
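If it helps to see the chaining idea outside the platform, it's equivalent to reading a column from step 1's output CSV and using each value as an input to step 2's requests. A hand-rolled sketch with hypothetical file, column, and endpoint names:

```python
import csv
import requests

DETAIL_ENDPOINT = "https://api.example.com/users/detail"   # hypothetical step-2 endpoint

# Step 1 already produced this CSV (e.g. a paginated search).
with open("search_results.csv", newline="") as f:
    # This is the column you would paste into "Workflow Chaining" in the UI.
    user_ids = [row["user_id"] for row in csv.DictReader(f)]

# Step 2: one detail lookup per value found in step 1's output column.
details = []
for user_id in user_ids:
    details.append(requests.get(DETAIL_ENDPOINT, params={"id": user_id}).json())

print(f"Looked up {len(details)} users from step 1's results")
```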
Just like executing individual API endpoints, workflow executions will run through proxies to protect your anonymity. You'll typically want to use a dedicated proxy when running more than 75 requests for cost reasons, as dedicated proxies are priced by the hour and better suited for batch processing.
The default behavior when you run a workflow is to automatically launch a new proxy, use that proxy for all requests (giving all requests a consistent IP address), and then terminate the proxy when it's no longer needed.
This is typically good practice because it gives you an exclusive, consistent IP address that will be used to make the requests and is also more cost effective than shared proxies.
If you're only running a small number of requests (under 75 in one hour) for testing, it's cheaper to use shared proxies. However, you may have a different IP address each time and these proxies are shared with other Stevesie users.
Some websites require an advanced 4G or residential proxy that we do not provide. You can use any proxy provider you like - just select the custom option and enter the URL to access the proxy.
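If you haven't routed requests through a proxy URL before, it looks roughly like this in Python, where the proxy address stands in for whatever your provider gives you:

```python
import requests

# The proxy URL from your provider, typically including credentials.
PROXY_URL = "http://username:password@proxy.example.com:8080"

response = requests.get(
    "https://api.example.com/resource",   # hypothetical target endpoint
    proxies={"http": PROXY_URL, "https": PROXY_URL},
)
print(response.status_code)
```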
Each API has limits on how frequently you can access it - either based on the IP address you're coming from or the authentication information you provide on each request.
The Stevesie platform is built with safe defaults to keep you out of trouble, and you can override these defaults at a number of levels to suit your needs. The main settings are:
Unless otherwise specified, API endpoints that are official API wrappers will wait 1 second between requests whereas unofficial APIs will wait 2 seconds between requests. If a rate limiting response is received, we will wait 5 minutes for both types of APIs.
Each API endpoint can define its own rate limiting settings (set by the endpoint maintainer). If declared, these will override the global defaults explained above.
If you're not happy with the global defaults or the settings that the endpoint maintainer has declared, you can override these settings in any workflow in your account. Just edit the workflow and the new settings will apply for all future executions.
If you're experiencing trouble with rate limiting, you may want to test different settings on a per-execution basis to debug. In this case, you can define the rate limiting parameters on each execution, before hitting the submit button.
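Taken together, the default behavior described above is roughly equivalent to the following loop. This is only a sketch of the stated defaults (a 1-2 second wait between requests, 5 minutes after a rate-limiting response), not the platform's actual code, and the endpoint is hypothetical:

```python
import time
import requests

ENDPOINT = "https://api.example.com/items"   # hypothetical endpoint
REQUEST_DELAY = 1         # seconds between requests (2 for unofficial APIs)
RATE_LIMIT_WAIT = 5 * 60  # seconds to wait after a rate-limiting response

def fetch(params):
    while True:
        response = requests.get(ENDPOINT, params=params)
        if response.status_code == 429:
            # Rate limited (assuming the API signals this with HTTP 429):
            # back off for 5 minutes, then retry the same request.
            time.sleep(RATE_LIMIT_WAIT)
            continue
        return response.json()

results = []
for page in range(1, 4):
    results.append(fetch({"page": page}))
    time.sleep(REQUEST_DELAY)   # the per-request wait between calls
```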
If you want the execution to run at a specific time in the future, you can select the hour of the day to run it (in your local timezone). This will submit the workflow, but it will appear as "Queued" until the time comes for it to run.
Each workflow can (and should) have multiple executions and you don't need to wait for one to finish to launch another one. You can submit as many executions as you want and they will process up to your account's parallel proxy limit.
Plus accounts can have up to 3 workflow executions running at once and premium accounts can have up to 20 workflow executions running at once. If you submit more than your limit, they will remain "queued" until you have enough room in your account to perform the executions.
The default (and recommended) output format is CSV, meaning you'll get a CSV file for each extractor linked to your workflow. If you need to use JSON for whatever reason, you can change the output format and you'll get (very bloated) JSON files instead.
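The difference between the two formats comes down to flattening: nested JSON keys become columns, so each response maps onto one row. A small illustration with made-up data:

```python
import csv

# A nested JSON response body (invented for illustration).
record = {"user": {"id": 42, "profile": {"name": "alice", "followers": 1200}}}

# Flattened, the nested keys collapse into dotted column names...
flat = {
    "user.id": record["user"]["id"],
    "user.profile.name": record["user"]["profile"]["name"],
    "user.profile.followers": record["user"]["profile"]["followers"],
}

# ...which map directly onto one CSV row per response.
with open("output.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=flat.keys())
    writer.writeheader()
    writer.writerow(flat)
```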
Next: Scaling Workflows »