Website Data Collection Made Easy Using XPath in import.io

by Katie Wagner

The release of import.io saves many data lovers both headache and heartache. With a simple interface and a little bit of magic, it cuts the manual entry of website data down to just a few minutes. I can say from experience that the interface is pleasant and easy to use, even for beginners. As with all great data projects, however, there will be problems.

Having Trouble with import.io?

Let’s look at an example from Kickstarter. While the summary page of projects is informative, individual project pages provide additional fields, such as project start and end dates, number of backers and category. Once I became familiar with import.io, I knew that a crawler was the right route for this scenario because the project pages all look basically identical. Although extremely user friendly, import.io’s interface was not carrying my original selection over to the additional pages I trained it on, even though they shared the same layout. I knew that I could get the results I was looking for if I could provide the exact location of the object.

Enter XPath

After a little Googling, I came across this post from import.io regarding XPath. With very little web dev experience, this initially seemed like a daunting task. However, Chrome’s developer tools provide a simple way to get the exact path to any object on the page.

1. Create a new column in import.io and select the Show Advanced Settings checkbox.

2. Switch over to Chrome and right-click the object you’re trying to extract. Select Inspect Element.

3. Chrome will highlight the associated HTML. Right-click the selection and choose Copy XPath.

4. Finally, paste the path into the import.io Manual XPath override field and select Done.

That’s all there is to it! If the exact object is not being returned correctly, chances are the HTML element on the line just before or after the one Chrome selected is the right one. For instance, when trying to return the number of backers for a project, Chrome copies this path:

//*[@id="backers_count"]/data

However, when this path is used in import.io, the field remains blank. Selecting the element on the line right above it, which is its parent, produces the number 117, as expected.

//*[@id="backers_count"]
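
If you want to check candidate paths outside of import.io before pasting them in, something like the following Python sketch can help. It is not part of import.io or of the workflow above, just one way to try the copied path first and fall back to the element one level up. The markup is a made-up, simplified stand-in for the Kickstarter page (it assumes the crawler sees the count directly inside the backers_count element), and first_non_empty is a hypothetical helper name. It uses Python's built-in ElementTree, which understands only a subset of XPath and needs well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup. The real Kickstarter page is not
# reproduced here; this only assumes the count sits directly inside
# the element whose id is "backers_count".
PAGE = """
<html>
  <body>
    <div id="backers_count">117</div>
  </body>
</html>
"""


def first_non_empty(root, xpaths):
    """Return (xpath, text) for the first candidate that yields non-empty text."""
    for xpath in xpaths:
        # ElementTree only accepts relative paths, so "//..." becomes ".//...".
        relative = "." + xpath if xpath.startswith("/") else xpath
        for element in root.findall(relative):
            text = "".join(element.itertext()).strip()
            if text:
                return xpath, text
    return None, ""


root = ET.fromstring(PAGE)
candidates = [
    '//*[@id="backers_count"]/data',  # the path Chrome copied
    '//*[@id="backers_count"]',       # the element one level up
]
chosen_xpath, value = first_non_empty(root, candidates)
print(chosen_xpath, value)
```

With this stand-in markup, the first path matches nothing and the second one supplies the value, which mirrors the blank field and the fix described above. Note that the quotes inside the path must be plain straight quotes; curly quotes picked up from a web page or word processor will not work.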

Though these may be tasks web developers could do in their sleep, I felt like a rock star being able to accomplish them all on my own. With a little patience and some crafty searching, import.io delivers the foundation for data collection from almost any site. Should you ever run into an issue you cannot resolve, feel free to email support@import.io. Their User Support Analysts are up for the challenge and provide excellent assistance.   

More About the Author

Katie Wagner

Analytics Consultant | Training Lead