The release of import.io saves headache and heartache for many data lovers. Using a simple interface and a little bit of magic, it cuts the manual entry of data from websites down to just a few minutes. I can say from experience how pleasant and easy the interface is to use, even for beginners. As with all great data projects, however, there will be problems.
Having Trouble with import.io?
Let’s look at an example from Kickstarter. While the summary page of projects is informative, individual project pages in Kickstarter provide additional columns. These columns include things like project start/end date, number of backers and category. Once I became familiar with import.io, I knew that a crawler was the right route to take for this scenario, because I had multiple pages that look basically identical. Although the interface is extremely user friendly, import.io was not picking up the fields I had selected on the original page when it trained on additional pages with the same layout. I knew that I could get the results I was looking for if I could provide the exact location of the object.
Enter XPath
After a little Googling, I came across this post from import.io regarding XPath. With very little web dev experience, this initially seemed like a daunting task. However, Chrome’s development tools provided a simple way to extract exact objects from the page.
1. Create a new column in import.io and select the Show Advanced Settings checkbox.
2. Switch over to Chrome and right-click the object you’re trying to extract. Select Inspect Element.
3. Chrome will highlight the associated HTML. Right-click the selection and choose Copy XPath.
4. Finally, paste the path into the import.io Manual XPath override field and select Done.
That’s all there is to it! If you run into a situation where the exact object is not being returned correctly, chances are the element on the line directly above or below the one Chrome selected is the right one. For instance, when trying to return the number of backers for a project, Chrome gives:
//*[@id="backers_count"]/data
However, when used in import.io, the field remains blank. Selecting the element on the line right above it in the HTML (its parent) produces the number 117, which was the expected value.
//*[@id="backers_count"]
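If you would like to see what each of the two paths above points at before pasting it into import.io, here is a minimal Python sketch using only the standard library. The markup is a simplified, hypothetical stand-in for the Kickstarter page (the real HTML differs), and text_at is a helper name I made up for illustration. Note that xml.etree.ElementTree supports only a subset of XPath and wants a leading dot, so Chrome's //* becomes .//* here. This does not reproduce how import.io evaluates a path; it only shows that the two paths select a child element and its parent.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real Kickstarter page differs.
SNIPPET = """
<div id="stats">
  <div id="backers_count"><data value="117">117</data></div>
</div>
"""


def text_at(root, xpath):
    """Return the combined text of the first node matching xpath, or None."""
    node = root.find(xpath)
    if node is None:
        return None
    # itertext() walks the node and its descendants, so a parent
    # element still yields the text held by its children.
    return "".join(node.itertext()).strip()


root = ET.fromstring(SNIPPET)

# The path Chrome copied (child element) and the one that worked (parent).
child_path = ".//*[@id='backers_count']/data"
parent_path = ".//*[@id='backers_count']"

print("child :", text_at(root, child_path))
print("parent:", text_at(root, parent_path))
```

In this toy snippet both paths resolve to the same text, which is why stepping up one level to the parent is a safe thing to try when the more specific path comes back blank in import.io.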
Though these may be tasks web developers could do in their sleep, I felt like a rock star being able to accomplish them all on my own. With a little patience and some crafty searching, import.io delivers the foundation for data collection from almost any site. Should you ever run into an issue you cannot resolve, feel free to email support@import.io. Their User Support Analysts are up for the challenge and provide excellent assistance.