Scaling Workflows

Scale up your data collection by running large crawls that gather gigabytes of data at a time.

We can configure and run workflows in even more advanced ways for complex and custom requirements, especially for dealing with very large data sets.

Combining Inputs

To collect "big data," you'll typically need to make a lot of requests to your API of choice to get back the raw data you need.

Cross Products

Let's say you need to look up the product prices and inventories for a list of products across a list of local stores. You effectively need to make a request for each combination of product and store on your list.

You can easily do this with workflows by using two input collections. Simply declare one input collection for the product IDs you want to look up and another for the store IDs (both should be simple collections, with one value per line).

Now when you run the workflow, it will automatically look up each combination and return the results in a single file. Be careful with this approach: the number of requests is the size of one list multiplied by the size of the other, so if you add a lot of items to each list you can find yourself making a lot of requests very quickly!
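To get a feel for how quickly a cross product grows, here is a minimal Python sketch of the same idea. The product and store IDs are made-up sample values, and the code only illustrates how the combinations are formed; it is not how the workflow runner is implemented.

```python
from itertools import product

# Hypothetical sample values; your own collections would hold real IDs.
product_ids = ["p-100", "p-200", "p-300"]
store_ids = ["s-1", "s-2"]

# One request is needed for each (product, store) combination.
combinations = list(product(product_ids, store_ids))

# The request count is the product of the two list lengths.
request_count = len(product_ids) * len(store_ids)

for product_id, store_id in combinations:
    print(f"product={product_id} store={store_id}")

print(f"total requests: {request_count}")
```

With 3 products and 2 stores this is only 6 requests, but 1,000 products across 500 stores would already be 500,000.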

JSON Input Collections

If you need finer-grained control over the inputs you send to the API and want to process a batch, consider using a more advanced "JSON Input Collection." These are very similar to regular input collections, where you enter one value per line, but they let you enter a JSON list instead.

Let's say we want to look up real estate listings for different minimum and maximum price ranges. We can declare the searches we want to run as a JSON list:

  [
    {
      "min_price": 0,
      "max_price": 99
    },
    {
      "min_price": 100,
      "max_price": 199
    },
    {
      "min_price": 200,
      "max_price": 299
    }
  ]

You declare a single JSON input collection like this and then link it to the two workflow inputs it feeds, min_price and max_price.

You'll be prompted to enter the "JSON Key" when linking to each workflow input. Just type in whichever JSON key you used in your JSON objects; for the example above, you'd enter min_price and max_price.
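The linking step can be pictured with a short Python sketch. It assumes each JSON object becomes one request and that each workflow input reads the value stored under its linked JSON key; the `input_links` mapping and `build_requests` helper are hypothetical names used only for illustration.

```python
import json

# The JSON input collection from the example above.
json_collection = """
[
  {"min_price": 0, "max_price": 99},
  {"min_price": 100, "max_price": 199},
  {"min_price": 200, "max_price": 299}
]
"""

# Hypothetical mapping of workflow input name -> JSON key entered when linking.
input_links = {"min_price": "min_price", "max_price": "max_price"}


def build_requests(raw_json, links):
    """Turn each JSON object into one set of workflow inputs."""
    rows = json.loads(raw_json)
    return [{name: row[key] for name, key in links.items()} for row in rows]


requests_to_run = build_requests(json_collection, input_links)

for inputs in requests_to_run:
    print(inputs)
```

Under that assumption, the three objects yield three requests with paired values, rather than the nine combinations a cross product of two separate collections would produce.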

Parallel Executions

As explained earlier, you can run parallel workflow executions to speed up your scraping. We'll talk about a few advanced scenarios you can consider for your work.

Splitting by Inputs

Sometimes the amount of data you can scrape is limited by the API key you're using. For example, you may only be allowed to perform 10,000 lookups per day per API key. So how can you make this faster if you happen to have multiple API keys?

If you're processing a list of items (say, items to look up), declare that list as an input collection. Then, when you're prompted to enter an API key, link to a source collection instead of entering a single key: create another input collection for API keys and provide your list of keys there.

By default, the system will make duplicate API calls for each API key provided (using the cross product method described above).

You don't want to do this! Instead, you want to "split" the executions by the API key input collection. You'll see this option presented on the workflow runner interface once 2 input collections are provided.

Select this option to split by the API key, and the system will divide the other input collections evenly across your API keys. Now when you submit the workflow execution, you'll end up with multiple executions (one per API key), each assigned a different part of the other list(s) to process.
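As a rough illustration of what splitting does, the Python sketch below divides a list of items across API keys so that no item is looked up twice. The key names, item names, and `split_by_key` helper are hypothetical, and the round-robin assignment is an assumption; the workflow runner may divide the list differently (for example, in contiguous chunks).

```python
def split_by_key(api_keys, items):
    """Divide items across API keys as evenly as possible (round-robin)."""
    assignments = {key: [] for key in api_keys}
    for index, item in enumerate(items):
        key = api_keys[index % len(api_keys)]
        assignments[key].append(item)
    return assignments


# Hypothetical sample values.
api_keys = ["key-A", "key-B", "key-C"]
items = [f"item-{n}" for n in range(10)]

assignments = split_by_key(api_keys, items)

for key, assigned in assignments.items():
    print(f"{key}: {len(assigned)} items")
```

Each key ends up with its own share of the list, so ten items across three keys means ten requests in total instead of the thirty a cross product would make.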

Reusing Proxies

Sometimes you may want to repeatedly run quick workflows (under one hour each). Paying for proxies by the hour only to use a few minutes at a time is a waste of money!

Alternatively, you can launch your proxies first. They will stay on until you explicitly terminate them, so don't forget to do so! Once your proxies are running, you can assign workflow executions to individual proxies in your account.

This process usually requires some coordination on your end and you'll most likely want to use the Stevesie API to accomplish this, which we'll cover later.

Execution Batches

Each workflow execution can only run up to 10,000 requests at a time. So what happens if you need to perform lookups on more than 10,000 items in an input collection?

No worries! Simply run the workflow as you normally would. The initial execution will run up to 10,000 requests and then save its results. It will then initiate a follow-up execution to process the remaining items (10,000 at a time) until all the inputs are exhausted.
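The batching behavior can be sketched in a few lines of Python. The 10,000 limit comes from the text above; the `batches` helper and the 25,000-item sample list are assumptions used only to show how the items would be divided among executions.

```python
BATCH_LIMIT = 10_000  # maximum requests per workflow execution


def batches(items, limit=BATCH_LIMIT):
    """Yield consecutive slices of at most `limit` items."""
    for start in range(0, len(items), limit):
        yield items[start:start + limit]


# Hypothetical input collection with 25,000 items.
items = list(range(25_000))

batch_sizes = [len(batch) for batch in batches(items)]
print(batch_sizes)  # [10000, 10000, 5000]
```

In this example the inputs would be handled by three executions, each producing its own output file.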

While this will result in multiple output files, you can easily combine them into a single CSV file after all the executions are done, which we'll cover in the next section.
