10 Questions is an ongoing blog series in which Tableau Zen Master Dan Murray interviews some of the brightest folks in the world of data.
A couple of years ago, Josh Varner was a consultant working on our BI team at InterWorks. The last time I saw Josh (when he was still on our team) was at the Vertica Conference in Boston. I was leaving early for a gig in New Jersey. I stopped by to say goodbye to the team. There was a line of people 20 deep waiting to talk to Josh about some clever hack he was showing off. I remember thinking, “Josh is on the verge of becoming a Vertica/Tableau superstar.”
Two weeks later, he was offered a job at Twitter that he couldn’t refuse. Today, Josh is a Senior DBA and the HP Vertica Tech Leader at Twitter. We thought it would be interesting to get the perspective of someone living in the world of big data. For those of you who aren’t ninja hackers, I’ve provided relevant links so you can do some background reading if you want to investigate a topic.
Q: What are the key ingredients today for a capable data solution in an enterprise with lots of data?
Varner: Scale, velocity, interoperability and flexibility. Even if an enterprise doesn’t think they have a lot of data, once they start digging around and thinking of the possibilities, it’s easy to realize that there’s quite a bit more they can do than they ever thought was possible. This is especially true once you break free of the mindset that you have to throw data away or aggressively aggregate it. Storage is cheap, and systems that can scale to use all of that storage and the data on it are critical.
The system needs velocity to scale on the read side but also the write side. A system isn’t usable if it can’t ingest large volumes of data quickly. Ingesting massive data is very difficult if the system can’t play well with other sources. For example, if you’re scribing logs to Hadoop in LZO-compressed JSON, then your analytics system should be able to connect to Hadoop, decompress LZO streams and parse JSON. If it can’t do these things out of the box, you should have the flexibility to build that capability yourself. It should be flexible both in the sense of how it can be extended but also in the ways it can be used.
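As a rough illustration of the ingest step described above, the sketch below decompresses a stream and parses one JSON object per line. It is only a stand-in: Python's standard library has no LZO codec, so gzip is used here instead, and the function name and sample log are hypothetical.

```python
import gzip
import io
import json


def parse_compressed_json_lines(raw_bytes):
    """Decompress a gzip stream and parse one JSON object per line.

    Malformed lines are counted instead of raising, so a single bad
    record does not stop the whole ingest.
    """
    records = []
    bad_lines = 0
    # gzip stands in for LZO here (assumption: no LZO codec in the stdlib).
    with gzip.open(io.BytesIO(raw_bytes), mode="rt", encoding="utf-8") as stream:
        for line in stream:
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                bad_lines += 1
    return records, bad_lines


# Hypothetical sample log: two valid events and one truncated line.
sample_log = (
    b'{"user": "a", "event": "click"}\n'
    b'{"user": "b", "event": "view"}\n'
    b'{"user": \n'
)
compressed = gzip.compress(sample_log)
records, bad_lines = parse_compressed_json_lines(compressed)
```

Counting bad lines rather than failing on them is a design choice for this sketch; a real pipeline might route them to a separate location for inspection.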
As businesses gain more understanding of the value of their data and the value of democratizing and opening up their data, systems can’t just be walled off – made only accessible to engineers. Analysts, researchers, data scientists and others should be able to access the data and do so in a way that’s safe. The systems they use should be friendly to that.
Q: How do you see Hadoop, Vertica, traditional relational data models and Tableau working together today? In five years?
Varner: The focus is Hadoop integration. How can Vertica and other analytical databases integrate or run on top of Hadoop? How can Tableau work with Hadoop? How can we get data from our traditional RDBMSs into Hadoop? And, I think we’ll continue to see improvements in this area.
I would personally like to see more standardization and efficiency in how we move data between these systems. Right now, it’s such a mix between various systems, and many of them (traditional ODBC/JDBC-based applications) were built with single-node, single-stream systems in mind. Every enterprise seems to reinvent the wheel in this area. My hope is that we’ll see this improve in the next five years.
Q: How would you contrast Hadoop and NoSQL databases like MongoDB? Similarities and differences?
Varner: They serve completely different purposes. MongoDB, Cassandra and other NoSQL stores are meant for operational data (it’s where your data lives and breathes), interacting with the applications at the front end and generating new data. Hadoop is your archive. It’s where your data lands, ready to be researched and analyzed.
The similarity between Hadoop and systems like MongoDB is that they have flexible data schemas. You’re not tied to a certain set of columns for a table. In MongoDB, a collection can contain documents that vary widely in terms of what they contain. One could have a first and last name attribute, another could omit those entirely and just store a company name. Hadoop is similar in this way, but even more powerfully so. Want to store some Thrift or Protobuf, maybe some JSON, and tack on some CSV files as well? No problem. This flexibility is what drives a lot of people to these systems. Relational databases are starting to catch on to this, like PostgreSQL supporting array and JSON types and Vertica supporting a key/value API.
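To make the flexible-schema point concrete, here is a minimal sketch using plain Python dictionaries as stand-ins for documents in one collection. It does not use MongoDB's API; the sample documents and the field_counts helper are hypothetical.

```python
from collections import Counter

# Hypothetical documents in one collection; no two share the same shape.
documents = [
    {"first_name": "Ada", "last_name": "Lovelace"},
    {"company": "Example Corp"},
    {"first_name": "Grace", "last_name": "Hopper", "company": "Example Corp"},
]


def field_counts(docs):
    """Count how many documents carry each field name."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.keys())
    return counts


counts = field_counts(documents)
```

A tally like this is one simple way to see which attributes are common and which are rare when no fixed set of columns is enforced.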
Q: What are the biggest challenges in rolling out and managing a Hadoop deployment?
Varner: There is so much to the Hadoop ecosystem. It is really just the umbrella for many different components to an evolving tool set. Understanding all the components takes time. Keeping things organized and consistent is challenging. HDFS, one of the main components of Hadoop, is just a file system. You can store whatever you want in it. Open up the documents folder in your home directory. Is it clean, easy to navigate and well kept? Well, kudos to you, because mine isn’t. Imagine this on a much larger scale and with petabytes of data.
Q: Do you see the solution in Question #2 spreading to the middle market over the next five years?
Varner: Small and medium-sized businesses are adopting and will continue to adopt tools like Hadoop, Vertica and Tableau in addition to the more traditional operational data stores. This may surprise many people, but if you think about it, businesses don’t need a lot of employees to generate a lot of data. Because of this, these tools and their ability to integrate well will continue to grow in importance.
Q: What skill sets should people in school today be acquiring to be attractive candidates in the data marketplace five years from now?
Varner: I’ll steal and paraphrase a thought from Elon Musk’s recent Reddit AMA. You have to see knowledge as a tree and understand the basic concepts, the trunk and bigger branches, first. Once you master these, you can then focus on the limbs and leaves.
For the database field: What is a database? How does it work? If you had to write your own, how would you do it? What is an index, a B-tree? Once you understand these foundational concepts, then you can move on to the shiny things like writing time series queries in Vertica with interpolated values.
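One way to approach the "what is an index?" question is to build a toy one. The sketch below keeps a sorted list of (value, row position) pairs and binary-searches it instead of scanning every row. It uses a sorted list rather than a real B-tree, and the table and function names are hypothetical.

```python
import bisect

# Hypothetical table: rows stored in arrival order (row id = list position).
rows = [
    {"id": 42, "name": "dora"},
    {"id": 7, "name": "alex"},
    {"id": 19, "name": "chen"},
    {"id": 3, "name": "bea"},
]


def build_index(table, column):
    """Return a sorted list of (value, row_position) pairs for one column."""
    return sorted((row[column], pos) for pos, row in enumerate(table))


def lookup(table, index, value):
    """Binary-search the index, then fetch the matching row, if any."""
    # (value,) sorts before every (value, position) pair with the same value.
    i = bisect.bisect_left(index, (value,))
    if i < len(index) and index[i][0] == value:
        return table[index[i][1]]
    return None


id_index = build_index(rows, "id")
found = lookup(rows, id_index, 19)
missing = lookup(rows, id_index, 5)
```

A B-tree serves the same purpose but stays efficient under inserts and deletes and when the index lives on disk, which a sorted in-memory list does not.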
Q: Do you think that Hadoop and other open source solutions running on grid/cache systems will replace the incumbent commercial data stack vendors over the next decade? Why? Why not?
Varner: The open and closed-source worlds will continue to push each other to the next level. I think we’ll see more and more mixed-model schemes where companies support the open-source development of their base product but sell closed-source add-ons (or the inverse where you sell a closed-source base but build an open-source add-on community), but I don’t think it will replace traditional closed-source solutions. As long as money can be made, companies will oblige.
Q: What specific coding skills do you feel are critical now and over the next five years for people interested in working with data?
Varner: Five years is a long time, and things could be very different by then. I’d say start with SQL. It’s nearly universal and easy to pick up. From there, I’d suggest learning Python. Write some simple ETL jobs that connect to databases, crunch some numbers and write to either a database or Hadoop. Then, maybe write some raw MapReduce code. It goes back to the foundations concept – start simple, basic, foundational, then move up to the more advanced.
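A simple ETL job of the kind suggested above might look like the following sketch. It assumes an in-memory SQLite database from Python's standard library as both source and target; the table names, columns and sample rows are hypothetical.

```python
import sqlite3


def run_etl(conn):
    """Extract raw orders, total them per region, load a summary table."""
    # Extract: pull the raw rows out of the source table.
    rows = conn.execute("SELECT region, amount FROM orders").fetchall()

    # Transform: sum amounts per region, skipping NULL amounts.
    totals = {}
    for region, amount in rows:
        if amount is None:
            continue
        totals[region] = totals.get(region, 0.0) + amount

    # Load: write the aggregated result back to a summary table.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS region_totals "
        "(region TEXT PRIMARY KEY, total REAL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO region_totals VALUES (?, ?)",
        sorted(totals.items()),
    )
    conn.commit()
    return totals


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("east", 10.0), ("west", 5.5), ("east", 2.5), ("west", None)],
)
totals = run_etl(conn)
loaded = conn.execute(
    "SELECT region, total FROM region_totals ORDER BY region"
).fetchall()
```

The aggregation could be pushed into SQL with GROUP BY; it is done in Python here only to keep the three ETL stages visibly separate.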
Q: What specific skills do you use every day, every month, only occasionally? What new skills are you working to acquire?
Varner: This was a surprisingly difficult question to answer. I had to think back about recent projects and remember what I’ve done most recently. I use Bash, Python and SQL nearly every day. I try to spend most of my time in the command line, so Bash is a big part of my day. If you’re not automating what you’re doing, you’re doing it wrong. So, I’m constantly writing little scripts and tools in Bash and Python to make my life easier. And, I’m always looking at SQL in my line of work.
Every month? The two examples that come to mind first are analytics SQL and Puppet. Vertica supports analytics SQL features that are quite handy for particularly challenging ways of slicing/dicing data. I don’t quite get to use them every day, but when I do, they’re very handy. I also really enjoy the opportunities I get to use Puppet to help manage fleets of machines, and it’s something I don’t necessarily get to use every day, but I’m working with it at least once a month.
Working to acquire? Earlier in my career, I worked with lower-level compiled languages. Then, I moved on to scripting languages. My most recent interest has been to get closer to the metal again and work with more modern compiled languages. An upcoming project I’m planning is writing a SQL parser using either Java or Scala.
Q: What is the biggest challenge facing companies today that have a wide variety of data being ingested with increasing velocity?
Varner: I believe it’s ETL (the processes that move and transform data between systems), data quality and data discovery. Old school ETL is too cumbersome, expensive and slow, and the slapdash (totally custom) in-house systems companies build are equally expensive yet brittle and unreliable. Because of today’s work velocity, we forget to look at the quality of our data. This becomes even more problematic when more users are added to the database environment and those users are writing queries that may be complicated and difficult for other people to understand.
Every enterprise needs to think about the quality of the insights they’re getting from the data. Did the analyst that wrote this query really mean to sum the sales figures over a partition of region code? Or was it area code? Did we properly filter out NULLs and bad values? We need a way to trust the data we’re using, otherwise it’s not business intelligence; it’s business ignorance.
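As one small example of checking for NULLs and bad values before trusting an aggregate, the sketch below profiles a single column. The profile_column helper, the sample sales rows and the rule that negative amounts are invalid are all hypothetical assumptions.

```python
def profile_column(rows, column, is_valid):
    """Count missing and invalid values in one column before aggregating it."""
    report = {"total": 0, "missing": 0, "invalid": 0}
    for row in rows:
        report["total"] += 1
        value = row.get(column)
        if value is None:
            report["missing"] += 1
        elif not is_valid(value):
            report["invalid"] += 1
    return report


# Hypothetical sales rows: one NULL amount and one negative amount.
sales = [
    {"region_code": "NE", "amount": 120.0},
    {"region_code": "NE", "amount": None},
    {"region_code": "SW", "amount": -5.0},
    {"region_code": "SW", "amount": 80.0},
]

# Assumed rule for this sketch: a sales amount must not be negative.
report = profile_column(sales, "amount", lambda v: v >= 0)
```

A report like this does not say whether region code was the right partition to sum over, but it does flag how much of the column would silently distort the result.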
If you have a wide variety of data, discovery skills are critical. When you begin building a data environment you might think it’s relatively straightforward to keep a table of contents in your head or even in just a simple text file or wiki page, cataloging the data you have and where it lives. But, it’s almost certainly immediately out of date, and it definitely won’t scale.
You need a database of databases – somewhere you can look to find that needle in the haystack. Not only is this critical for helping people find the data they’re seeking, it’s also critical in avoiding duplication of data. If your audience can’t find the regional sales figures they need, they might create a new job that writes to a new table, not knowing that a table already exists.
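A "database of databases" can start very small. The sketch below is a minimal in-memory catalog that records where each dataset lives and refuses duplicate names; the class names, fields and sample entries are hypothetical, and a real catalog would need persistent storage.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    name: str
    location: str
    owner: str
    tags: set = field(default_factory=set)


class Catalog:
    """In-memory registry of datasets, keyed by name."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        # Refusing duplicates nudges people to reuse the existing table.
        if entry.name in self._entries:
            raise ValueError(f"dataset already registered: {entry.name}")
        self._entries[entry.name] = entry

    def search(self, keyword):
        """Return sorted names whose name contains, or tags include, the keyword."""
        keyword = keyword.lower()
        return sorted(
            entry.name
            for entry in self._entries.values()
            if keyword in entry.name.lower()
            or keyword in {tag.lower() for tag in entry.tags}
        )


catalog = Catalog()
catalog.register(
    DatasetEntry("regional_sales", "warehouse.sales.regional", "analytics", {"sales"})
)
catalog.register(
    DatasetEntry("web_logs", "hdfs:/logs/web", "platform", {"logs", "clickstream"})
)
matches = catalog.search("sales")
```

Searching before creating a new table is the habit this sketch is meant to encourage.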
Discover More Interviews
Want to read more insightful interviews like this one? Then you’ll love our 10 Questions blog series. Check out the full list of interviews here, and stay tuned for new additions.
Need Help? Let Us Know!
There you have it. If you need help with your data infrastructure, we have the experience, skill, and knowledge to ensure your success deploying all of the tools Josh mentioned. Contact us today to learn more.
If you’re in college and think you might want to get into this game, head to our Careers page and apply for one of our open jobs. We’d love to hear from you.