
The floods that devastated the hard disk industry in Thailand are now half a year old, and the prices per terabyte are finally dropping once again. That means data will start piling up and people around the office will wonder what can be done with it. Perhaps there are some insights in those log files? Perhaps a bit of statistical analysis will find some nuggets of gold buried in all of that noise? Maybe we can find enough change buried in the couch cushions of these files to give us all a raise?

Can you handle Big Data?

The industry now has a buzzword, "big data," for how we're going to do something with the huge amount of information piling up. "Big data" is replacing "business intelligence," which subsumed "reporting," which put a nicer gloss on "spreadsheets," which beat out the old-fashioned "printouts." Managers who long ago studied printouts are now hiring mathematicians who claim to be big data specialists to help them solve the same old problem: What's selling and why?

It's not fair to suggest that these buzzwords are simple replacements for each other. Big data is a more complicated world because the scale is much larger. The information is usually spread out over a number of servers, and the work of compiling the data must be coordinated among them. In the past, the work was largely delegated to the database software, which would use its magical JOIN mechanism to compile tables, then add up the columns before handing off the rectangle of data to the reporting software that would paginate it. This was often harder than it sounds. Database programmers can tell you stories about complicated JOIN commands that would lock up their databases for hours while trying to produce a report for a boss who wanted his columns just so.
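
To picture what that delegation looked like, here is a minimal sketch of the old JOIN-then-aggregate pattern as a plain JDBC program. The schema, driver, and connection string are hypothetical placeholders, not anything from the products below:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ClassicReport {
    public static void main(String[] args) throws Exception {
        // Hypothetical schema: orders(customer_id, amount), customers(id, region).
        String query =
            "SELECT c.region, SUM(o.amount) AS total " +
            "FROM orders o JOIN customers c ON o.customer_id = c.id " +
            "GROUP BY c.region ORDER BY total DESC";

        // The connection string and credentials are placeholders.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:postgresql://localhost/sales", "user", "secret");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(query)) {
            while (rs.next()) {
                System.out.printf("%s\t%.2f%n",
                    rs.getString("region"), rs.getDouble("total"));
            }
        }
    }
}
```

One query, one machine, one rectangle of data handed to the report writer -- precisely the workflow the distributed tools must now reassemble.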

The game is much different now. Hadoop is a popular tool for organizing the racks and racks of servers, and NoSQL databases are popular tools for storing data on these racks. These mechanisms can be much more powerful than the old single machine, but they are far from being as polished as the old database servers. Although SQL may be complicated, writing the JOIN query for SQL databases was often much simpler than gathering information from dozens of machines and compiling it into one coherent answer. Hadoop jobs are written in Java, and that requires another level of sophistication. The tools for tackling big data are just beginning to package this distributed computing power in a way that's a bit easier to use.
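
For a sense of what that extra sophistication means, here is a minimal sketch of a classic MapReduce job that counts hits per URL, assuming Hadoop 2.x's mapreduce API and an Apache-style log where the URL is the seventh whitespace-separated field. It illustrates the programming model, not code from any product reviewed here:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HitCount {
    // Mapper: emit (url, 1) for each log line; assumes the URL is the
    // seventh whitespace-separated field, as in common Apache log formats.
    public static class HitMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text url = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\\s+");
            if (fields.length > 6) {
                url.set(fields[6]);
                context.write(url, ONE);
            }
        }
    }

    // Reducer: sum the counts for each URL.
    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) sum += v.get();
            context.write(key, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "hit count");
        job.setJarByClass(HitCount.class);
        job.setMapperClass(HitMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged as a JAR, a job like this runs with something like hadoop jar hitcount.jar HitCount /logs/in /logs/out -- a long way from typing a SELECT statement.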

Many of the big data tools are also working with NoSQL data stores. These are more flexible than traditional relational databases, but the flexibility isn't as much of a departure from the past as Hadoop. NoSQL queries can be simpler because the database design discourages the complicated tabular structure that drives the complexity of working with SQL. The main worry is that software needs to anticipate the possibility that not every row will have some data for every column.
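
A small, hedged illustration of that worry: a NoSQL "row" modeled as a plain Java map, with the code forced to decide what a missing column means. The field names are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

public class SparseRows {
    // A NoSQL "row" modeled as a map; not every document carries every field.
    static double revenueFor(Map<String, Object> doc) {
        // Treat a missing "revenue" column as zero instead of failing.
        Object value = doc.get("revenue");
        return (value instanceof Number) ? ((Number) value).doubleValue() : 0.0;
    }

    public static void main(String[] args) {
        Map<String, Object> withRevenue = new HashMap<>();
        withRevenue.put("title", "Free for All");
        withRevenue.put("revenue", 12.50);

        Map<String, Object> withoutRevenue = new HashMap<>();
        withoutRevenue.put("title", "Translucent Databases");

        System.out.println(revenueFor(withRevenue));    // 12.5
        System.out.println(revenueFor(withoutRevenue)); // 0.0
    }
}
```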

The biggest challenge may be dealing with the expectations built up by the major motion picture "Moneyball." All the bosses have seen it and absorbed the message that some clever statistics can turn a small-budget team into a World Series winner. Never mind that the Oakland Athletics never won the World Series during the "Moneyball" era. That's the magic of Michael Lewis' prose. The bosses are all thinking, "Perhaps if I can get some good stats, Hollywood will hire Brad Pitt to play me in the movie version."

None of the software in this collection will come close to luring Brad Pitt to ask his agent for a copy of the script for the movie version of your Hadoop job. That has to come from within you or the other humans working on the project. Understanding the data and finding the right question to ask is often much more complicated than getting your Hadoop job to run quickly. That's really saying something because these tools are only half of the job.

To get a handle on the promise of the field, I downloaded some big data tools, mixed in data, then stared at the answers for Einstein-grade insight. The information came from the log files for the website that sells some of my books (wayner.org), and I was looking for some idea of what was selling and why. So I unpacked the software and asked the questions.

Big data tools: Jaspersoft BI Suite

The Jaspersoft package is one of the open source leaders for producing reports from database columns. The software is well-polished and already installed in many businesses turning SQL tables into PDFs that everyone can scrutinize at meetings.

The company is jumping on the big data train, and this means adding a software layer to connect its report generating software to the places where big data gets stored. The JasperReports Server now offers software to suck up data from many of the major storage platforms, including MongoDB, Cassandra, Redis, Riak, CouchDB, and Neo4j. Hadoop is also well-represented, with JasperReports providing a Hive connector to reach inside of HBase.

This effort feels like it is still starting up -- many pages of the documentation wiki are blank, and the tools are not fully integrated. The visual query designer, for instance, doesn't work yet with Cassandra's CQL. You get to type these queries out by hand.
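
Typing the queries isn't hard, for what it's worth. A minimal sketch, assuming the DataStax Java driver (2.x or later) and a hypothetical weblogs keyspace:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class HandWrittenCql {
    public static void main(String[] args) {
        // Contact point, keyspace, and table are all placeholders.
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("weblogs")) {
            // The sort of query the visual designer won't yet build for you.
            ResultSet rows = session.execute(
                "SELECT ip, uri FROM hits WHERE day = '2012-05-01' LIMIT 20");
            for (Row row : rows) {
                System.out.println(row.getString("ip") + " " + row.getString("uri"));
            }
        }
    }
}
```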

Once you get the data from these sources, Jaspersoft's server will boil it down to interactive tables and graphs. The reports can be quite sophisticated interactive tools that let you drill down into various corners. You can ask for more and more details if you need them.

This is a well-developed corner of the software world, and Jaspersoft is expanding by making it easier to use these sophisticated reports with newer sources of data. Jaspersoft isn't offering particularly new ways to look at the data, just more sophisticated ways to access data stored in new locations. I found this surprisingly useful. The aggregation of my data was enough to make basic sense of who was going to the website and when they were going there.

Big data tools: Pentaho Business Analytics

Pentaho is another software platform that began as a report-generating engine; like Jaspersoft, it is branching into big data by making it easier to absorb information from the new sources. You can hook up Pentaho's tool to many of the most popular NoSQL databases such as MongoDB and Cassandra. Once the databases are connected, you can drag and drop the columns into views and reports as if the information came from SQL databases.

I found the classic sorting and sifting tables to be extremely useful for understanding just who was spending the most time at my website. Simply sorting by IP address in the log files revealed what the heavy users were doing.
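
There is nothing exotic underneath such a report. A plain-Java sketch of the same count-by-IP aggregation, assuming an Apache-style access.log with the client address as the first whitespace-separated field:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class TopTalkers {
    public static void main(String[] args) throws IOException {
        // Count hits per IP address, then print the ten heaviest users.
        Map<String, Integer> hits = new HashMap<>();
        for (String line : Files.readAllLines(Paths.get("access.log"))) {
            if (line.isEmpty()) continue;
            String ip = line.split("\\s+", 2)[0];
            hits.merge(ip, 1, Integer::sum);
        }
        hits.entrySet().stream()
            .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
            .limit(10)
            .forEach(e -> System.out.println(e.getValue() + "\t" + e.getKey()));
    }
}
```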

Pentaho also provides software for drawing HDFS file data and HBase data from Hadoop clusters. One of the more intriguing tools is the graphical programming interface known as either Kettle or Pentaho Data Integration. It has a bunch of built-in modules that you can drag and drop onto a picture, then connect them. Pentaho has thoroughly integrated Hadoop and the other sources into this, so you can write your code and send it out to execute on the cluster.

Big data tools: Karmasphere Studio and Analyst

Many of the big data tools did not begin life as reporting tools. Karmasphere Studio, for instance, is a set of plug-ins built on top of Eclipse. It's a specialized IDE that makes it easier to create and run Hadoop jobs.

I had a rare feeling of joy when I started configuring a Hadoop job with this developer tool. There are a number of stages in the life of a Hadoop job, and Karmasphere's tools walk you through each step, showing the partial results along the way. I guess debuggers have always made it possible for us to peer into the mechanism as it does its work, but Karmasphere Studio does something a bit better: As you set up the workflow, the tools display the state of the test data at each step. You see what the temporary data will look like as it is cut apart, analyzed, then reduced.

Karmasphere also distributes a tool called Karmasphere Analyst, which is designed to simplify the process of plowing through all of the data in a Hadoop cluster. It comes with many useful building blocks for programming a good Hadoop job, like subroutines for decompressing zipped log files. Then it strings them together and parameterizes the Hive calls to produce a table of output for perusing.
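
The decompression step, at least, is easy to picture. This is not Karmasphere's actual subroutine, just a plain-Java sketch of the same idea using the JDK's built-in gzip support:

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.zip.GZIPInputStream;

public class UnzipLogs {
    public static void main(String[] args) throws IOException {
        // Stream a gzip-compressed log file line by line without
        // decompressing it to disk first.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream("access.log.gz")), "UTF-8"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```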

Big data tools: Talend Open Studio

Talend also offers an Eclipse-based IDE for stringing together data processing jobs with Hadoop. Its tools are designed to help with data integration, data quality, and data management, all with subroutines tuned to these jobs.

Talend Studio allows you to build up your jobs by dragging and dropping little icons onto a canvas. If you want to get an RSS feed, Talend's component will fetch the RSS and add proxying if necessary. There are dozens of components for gathering information and dozens more for doing things like a "fuzzy match." Then you can output the results.
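
A "fuzzy match" typically boils down to an edit-distance calculation. Here is a minimal Java sketch of Levenshtein distance, the textbook algorithm behind many such components -- I have not inspected what Talend's component actually uses:

```java
public class FuzzyMatch {
    // Levenshtein edit distance: the number of single-character insertions,
    // deletions, or substitutions needed to turn one string into another.
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        // Two near-duplicate records that exact matching would miss.
        System.out.println(distance("Jon Smith", "John Smith")); // 1
    }
}
```

Records whose distance falls under a small threshold can be treated as the same entity, which is the essence of the technique.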

Stringing together blocks visually can be simple after you get a feel for what the components actually do and don't do. This was easier for me to figure out when I started looking at the source code being assembled behind the canvas. Talend lets you see this, and I think it's an ideal compromise. Visual programming may seem like a lofty goal, but I've found that the icons can never represent the mechanisms with enough detail to make it possible to understand what's going on. I need the source code.

Talend also maintains TalendForge, a collection of open source extensions that make it easier to work with the company's products. Most of the tools seem to be filters or libraries that link Talend's software to other major products such as Salesforce.com and SugarCRM. You can suck down information from these systems into your own projects, simplifying the integration.

Big data tools: Skytree Server

Not all of the tools are designed to make it easier to string together code with visual mechanisms. Skytree offers a bundle that performs many of the more sophisticated machine-learning algorithms. All it takes is typing the right command into a command line.

Skytree is more focused on the guts than the shiny GUI. Skytree Server is optimized to run a number of classic machine-learning algorithms on your data using an implementation the company claims can be 10,000 times faster than other packages. It can search through your data looking for clusters of mathematically similar items, then invert this to identify outliers that may be problems, opportunities, or both. The algorithms can be more precise than humans, and they can search through vast quantities of data looking for the entries that are a bit out of the ordinary. This may be fraud -- or a particularly good customer who will spend and spend.
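
Skytree's implementations are far more sophisticated, but the core idea of outlier detection fits in a few lines. A toy z-score test in plain Java, with made-up data:

```java
public class Outliers {
    public static void main(String[] args) {
        // Daily order counts; the spike at the end is the outlier to flag.
        double[] orders = {12, 15, 11, 14, 13, 12, 16, 13, 14, 95};

        double mean = 0;
        for (double x : orders) mean += x;
        mean /= orders.length;

        double variance = 0;
        for (double x : orders) variance += (x - mean) * (x - mean);
        double stdDev = Math.sqrt(variance / orders.length);

        // Flag anything more than two standard deviations from the mean.
        for (double x : orders) {
            if (Math.abs(x - mean) > 2 * stdDev) {
                System.out.println("Outlier: " + x);  // prints: Outlier: 95.0
            }
        }
    }
}
```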

The free version of the software offers the same algorithms as the proprietary version, but it's limited to data sets of 100,000 rows. This should be sufficient to establish whether the software is a good match.

Big data tools: Tableau Desktop and Server

Tableau Desktop is a visualization tool that makes it easy to look at your data in new ways, then slice it up and look at it in a different way. You can even mix the data with other data and examine it in yet another light. The tool is optimized to give you all the columns for the data and let you mix them before stuffing it into one of the dozens of graphical templates provided.

Tableau Software started embracing Hadoop several versions ago, and now you can treat Hadoop "just like you would with any data connection." Tableau relies upon Hive to structure the queries, then caches as much information in memory as it can to keep the tool interactive. While many of the other reporting tools are built on a tradition of generating reports offline, Tableau wants to offer an interactive mechanism so that you can slice and dice your data again and again. Caching helps deal with some of the latency of a Hadoop cluster.
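
Hive exposes the cluster through an ordinary JDBC interface, which is roughly the kind of plumbing such connectors sit on. A hedged sketch of querying HiveServer2 directly, with host, table, and credentials as placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // HiveServer2's JDBC endpoint; host, port, and table are placeholders.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hadoop-master:10000/default", "user", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT uri, COUNT(*) AS hits FROM weblogs " +
                 "GROUP BY uri ORDER BY hits DESC LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```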

The software is well-polished and aesthetically pleasing. I often found myself reslicing the data just to see it in yet another graph, even though there wasn't much new to be learned by switching from a pie chart to a bar graph and beyond. The software team clearly includes a number of people with some artistic talent.

Big data tools: Splunk

Splunk is a bit different from the other options. It's not exactly a report-generating tool or a collection of AI routines, although it accomplishes much of that along the way. It creates an index of your data as if your data were a book or a block of text. Yes, databases also build indices, but Splunk's approach is much closer to a text search process.

This indexing is surprisingly flexible. Splunk comes already tuned to my particular application, making sense of log files, and it sucked them right up. It's also sold in a number of different solution packages, including one for monitoring a Microsoft Exchange server and another for detecting Web attacks. The index helps correlate the data in these and several other common server-side scenarios.

Splunk will take text strings and search around in the index. You might type in the URL of an important article or an IP address. Splunk finds the matches and packages them into a timeline built around the time stamps it discovers in the data. All other fields are correlated, and you can click around to drill deeper and deeper into the data set. While this is a simple process, it's quite powerful if you're looking for the right kind of needle in your data feed. If you know the right text string, Splunk will help you track it. Log files are a great application for it.
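
Conceptually, that timeline is just the matching lines bucketed by the time stamps embedded in them. A crude plain-Java version of the idea, assuming Apache-style timestamps and a hypothetical search string:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Timeline {
    // Pull the "dd/MMM/yyyy:HH" prefix out of an Apache-style timestamp,
    // e.g. [01/May/2012:13:55:36 -0700] -> 01/May/2012:13
    private static final Pattern STAMP = Pattern.compile("\\[(\\d{2}/\\w{3}/\\d{4}:\\d{2})");

    public static void main(String[] args) throws IOException {
        String needle = "/big-data-article";  // the text string you know to look for
        Map<String, Integer> buckets = new TreeMap<>();
        for (String line : Files.readAllLines(Paths.get("access.log"))) {
            if (!line.contains(needle)) continue;
            Matcher m = STAMP.matcher(line);
            if (m.find()) buckets.merge(m.group(1), 1, Integer::sum);
        }
        // One row per hour -- a crude version of Splunk's timeline view.
        // TreeMap sorts keys lexically, which is chronological within one month of logs.
        buckets.forEach((hour, count) -> System.out.println(hour + "\t" + count));
    }
}
```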

A new Splunk tool called Shep, currently in private beta, promises bidirectional integration between Hadoop and Splunk, allowing you to exchange data between the systems and query Splunk data from Hadoop.

Bigger than big data

After wading through these products, it became clear that "big data" was much bigger than any single buzzword. It's not really fair to lump together products that largely build tables with those that attempt complicated mathematical operations. Nor is it fair to compare simpler tools that work with generic databases with those that attempt to manage larger stacks spread out over multiple machines in frameworks like Hadoop.

To make matters worse, the targets are moving. Some of the more tantalizing new companies still aren't sharing their software yet. Mysterious Platfora has a button you can click to stay informed, while another enigmatic startup, Continuity, just says, "We're still in stealth, heads down and coding hard." They're surely not going to be the last new entrants in this area.

Despite the speed and sophistication of the new algorithms, I found myself liking the old classic reports the best. The Pentaho and Jaspersoft tools simply produce nice lists of the top entries, but this was all I needed. Knowing the top domains in my log file was enough.

The other algorithms are intellectually interesting, but they're harder to apply with any consistency. They can flag clusters or do fuzzy matching, but my data set didn't seem to lend itself to these analyses. Try as I might, I couldn't figure out any applications for my data that didn't seem contrived.

Others will probably feel differently. The clustering algorithms are used heavily in diverse applications such as helping people find similar products in online stores. Others use outlier detection algorithms to identify potential security threats. These all bear investigation, but the software is the least of the challenges.

Perhaps it is my lack of vision that left me clutching the old sortable reports. In time, I may come to understand just how I might use the advanced algorithms to do more. This may be why most of these companies list consulting among their products. They will rent you one of their engineers, who is familiar with the software and the math, so you have a guide when you're digging around in the data. This is a good option for many businesses because the needs and demands are often rather abstract and filled with wishful hand-waving.

At a recent O'Reilly Strata conference on big data, one of the best panels debated whether it was better to hire an expert on the subject being measured or an expert on using algorithms to find outliers. I'm not sure I can choose, but I think it's important to hire a person with a mandate to think deeply about the data. It's not enough to just buy some software and push a button.

This article, "7 top tools for taming big data," was originally published at InfoWorld.com.


